Probability function (pdf, pmf) versus cumulative distribution function (CDF):
                Probability function (pdf, pmf)          Cumulative distribution function (CDF)
Discrete        Pr(X = 3)                                Pr(X ≤ 3)
Continuous      Pr(c1 ≤ Z ≤ c2) = Φ(c2) − Φ(c1)          Pr(Z ≤ c) = Φ(c)
A continuous random variable (X) has an infinite number of values within an interval:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
A discrete random variable (X) assumes a value among a finite set including x1, x2, x3 and so
on. The probability function is expressed by:
P(X = xk) = f(xk)
Examples of a discrete random variable include: a coin toss (heads or tails, nothing in between); a roll of the dice (1, 2, 3, 4, 5, 6); and "did the fund beat the benchmark?" (yes, no). In risk, common discrete random variables are default/no default (0/1) and loss frequency.
Note the similarity between the summation (∑ ) under the discrete variable and the
integral (∫) under the continuous variable. The summation (∑) of all discrete outcomes
must equal one. Similarly, the integral (∫) captures the area under the continuous
distribution function. The total area “under this curve,” from (-∞) to (∞), must equal one.
Examples in Finance
Continuous (e.g.): distance, time; severity of loss; asset returns
Discrete (e.g.): default (1, 0); frequency of loss
For example
For example, consider a craps roll of two six-sided dice. What is the probability of rolling a
seven; i.e., P[X=7]? There are six outcomes that generate a roll of seven: 1+6, 2+5, 3+4, 4+3, 5+2,
and 6+1. Further, there are 36 total outcomes. Therefore, the probability is 6/36.
In this case, the outcomes need to be mutually exclusive, equally likely, and
“cumulatively exhaustive” (i.e., all possible outcomes included in total). A key property
of a probability is that the sum of the probabilities for all (discrete) outcomes is 1.0.
Relative frequency is based on an actual number of historical observations (or Monte Carlo
simulations). For example, here is a simulation (produced in Excel) of one hundred (100) rolls of
a single six-sided die:
Empirical Distribution
Roll Freq. %
1 11 11%
2 17 17%
3 18 18%
4 21 21%
5 18 18%
6 15 15%
Total 100 100%
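For illustration only (this sketch is mine, not part of the original notes), a few lines of Python reproduce this kind of empirical (relative-frequency) distribution; the 100-roll sample size matches the table above, while the seed and resulting frequencies are arbitrary:

    import random
    from collections import Counter

    random.seed(42)                                        # for reproducibility
    rolls = [random.randint(1, 6) for _ in range(100)]     # 100 rolls of a single six-sided die

    counts = Counter(rolls)
    for face in range(1, 7):
        freq = counts[face]
        print(f"Roll {face}: {freq:3d}  ({freq / 100:.0%})")   # empirical (relative) frequency
    print("Total:", sum(counts.values()))

Each new seed produces a different empirical distribution, which is exactly the sampling variation discussed below.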
The a priori probability of rolling a three is 1/6 (about 16.7%), but the empirical frequency, based on this sample, is 18%. If we generate another sample, we will produce a different empirical frequency.
This relates also to sampling variation. The a priori probability is based on population
properties; in this case, the a priori probability of rolling any number is clearly 1/6th.
However, a sample of 100 trials will exhibit sampling variation: the number of threes (3s)
rolled above varies from the parametric probability of 1/6th. We do not expect the
sample to produce 1/6th perfectly for each outcome.
If we can characterize a random variable (e.g., if we know all outcomes and that each outcome is equally likely, as is the case when you roll a single die), then we can compute its expectation. The expectation of the random variable is often called the mean or arithmetic mean.
For a discrete variable: E(X) = y1·p1 + y2·p2 + … + yk·pk = Σ_{i=1}^{k} yi·pi
For a continuous variable: E(X) = ∫ x·f(x) dx
Variance
Variance and standard deviation are the second moment measures of dispersion. The variance of
a discrete random variable Y is given by:
σ²_Y = variance(Y) = E[(Y − μY)²] = Σ_{i=1}^{k} (yi − μY)²·pi
Variance is also expressed as the difference between the expected value of X² and the square of the expected value of X: variance(X) = E(X²) − [E(X)]². This is the more useful variance formula.
Please memorize this variance formula above: it comes in handy! For example, if the probability of loan default (PD) is a Bernoulli trial, what is the variance of PD?
We can solve with E[PD²] − (E[PD])². Since PD only takes the values 0 or 1, PD² takes the same values, so E[PD²] = p and the variance is p − p² = p(1 − p).
For example, what is the variance of a single six-sided die? First, we need to solve for the
expected value of X-squared, E[X2]. This is given by:
E[X²] = (1/6)(1²) + (1/6)(2²) + (1/6)(3²) + (1/6)(4²) + (1/6)(5²) + (1/6)(6²) = 91/6
Then, we need to square the expected value of X, [E(X)]2. The expected value of a single six-sided
die is 3.5 (the average outcome). So, the variance of a single six-sided die is given by:
Variance(X) = E(X²) − [E(X)]² = 91/6 − (3.5)² ≈ 2.92
Here is the same derivation of the variance of a single six-sided die (which has a uniform distribution) in tabular format:
What is the variance of the total of two six-sided dice cast together? It is simply the Variance(X) plus the Variance(Y), or about 5.83. The reason we can simply add them together is that they are independent random variables.
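A short sketch (mine, not from the notes) confirms these figures using the E(X²) − [E(X)]² identity:

    faces = range(1, 7)
    p = 1 / 6

    e_x  = sum(x * p for x in faces)        # E[X] = 3.5
    e_x2 = sum(x**2 * p for x in faces)     # E[X^2] = 91/6
    var_one_die = e_x2 - e_x**2             # about 2.92

    var_two_dice = 2 * var_one_die          # independence lets us add the variances (~5.83)
    print(round(var_one_die, 2), round(var_two_dice, 2))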
Sample variance:
s²_x = [1/(k − 1)] · Σ_{i=1}^{k} (yi − ȳ)²
Properties of variance:
1. σ²_constant = 0 (the variance of a constant is zero)
2a. σ²_(X+Y) = σ²_X + σ²_Y only if independent
2b. σ²_(X−Y) = σ²_X + σ²_Y only if independent
3. σ²_(X+b) = σ²_X
4. σ²_(aX) = a²·σ²_X
5. σ²_(aX+b) = a²·σ²_X
6. σ²_(aX+bY) = a²·σ²_X + b²·σ²_Y only if independent
7. σ²_X = E(X²) − [E(X)]²
Standard deviation:
σ_Y = √var(Y) = √E[(Y − μY)²] = √[ Σ (yi − μY)²·pi ]
Sample standard deviation:
s_X = √{ [1/(k − 1)] · Σ_{i=1}^{k} (yi − ȳ)² }
This is merely the square root of the sample variance. This formula is important because
this is the technically precise way to calculate volatility.
Skewness (asymmetry)
Skewness = E[(X − μ)³] / σ³
For example, the gamma distribution has positive skew (skew > 0):
[Figure: Gamma distribution, positive (right) skew, plotted for (alpha=1, beta=1), (alpha=2, beta=0.5), and (alpha=4, beta=0.25)]
Kurtosis
Kurtosis = E[(X − μ)⁴] / σ⁴
Note that technically skew and kurtosis are not, respectively, equal to the third and fourth
moments; rather they are functions of the third and fourth moments.
Financial asset returns are typically considered leptokurtic (i.e., heavy- or fat-tailed).
For example, the logistic distribution exhibits leptokurtosis (heavy-tails; kurtosis > 3.0):
[Figure: Logistic distribution, heavy tails (excess kurtosis > 0), plotted for (alpha=0, beta=1), (alpha=2, beta=1), and (alpha=0, beta=3) against N(0,1)]
A single variable (univariate) probability distribution is concerned with only a single random variable; e.g., roll of a die, default of a single obligor. A multivariate probability density function concerns the outcome of an experiment with more than one random variable. This includes, in the simplest case, two variables (i.e., a bivariate distribution).
                Density                        Cumulative
Univariate      f(x) = P(X = x)                F(x) = P(X ≤ x)
Bivariate       f(x, y) = P(X = x, Y = y)      F(x, y) = P(X ≤ x, Y ≤ y)
For example, consider the age of a computer (A), a Bernoulli random variable such that the computer is old (0) or new (1).
A marginal (or unconditional) probability is the simple case: it is the probability that does
not depend on a prior event or prior information. The marginal probability is also called the
unconditional probability. It is “just another name for the probability distribution” (Stock).
Pr(Y = y) = Σ_{i=1}^{l} Pr(X = xi, Y = y); for example, Pr(A = 1) = 0.5
The joint probability is the probability that the random variables (in this case, both random
variables) take on certain values simultaneously.
The conditional probability is the probability of an outcome given (i.e., conditional on) another
outcome.
Pr(Y = y | X = x) = Pr(X = x, Y = y) / Pr(X = x)
P(B | A) = P(A ∩ B) / P(A), so that P(A)·P(B | A) = P(A ∩ B)
An unconditional expectation is the expected value of the variable without any restrictions (or
lacking any prior information).
A conditional expectation is the expected value of the variable conditional on another variable taking a specific value, E(Y | X = x); similarly, the conditional variance is var(Y | X = x).
For example, consider two stocks. Assume that both Stock (S) and Stock (T) can each only reach
three price levels. Stock (S) can achieve: $10, $15, or $20. Stock (T) can achieve: $15, $20, or $30.
Historically, assume we witnessed 26 outcomes and they were distributed as follows.
Note that S ∈ {$10, $15, $20} and T ∈ {$15, $20, $30}:
The unconditional probability of the outcome where S=$20 = 8/26 because there are eight
events out of 26 total events that produce S=$20. The unconditional probability P(S=20) = 8/26
A joint probability is the probability that both random variables will have a certain outcome.
Here the joint probability P(S=$20, T=$30) = 3/26.
Instead we can ask a conditional probability question: “What is the probability that S=$20 given
that T=$20?” The probability that S=$20 conditional on the knowledge that T=$20 is 3/10
because among the 10 events that produce T=$20, three are S=$20.
P(S = $20 | T = $20) = P(S = $20, T = $20) / P(T = $20) = 3/10
In summary:
X and Y are independent if the conditional distribution of Y given X equals the marginal distribution of Y. Since independence implies Pr(Y = y | X = x) = Pr(Y = y):
Pr(X = x, Y = y) = Pr(X = x) · Pr(Y = y)
X and Y are independent if their joint distribution is equal to the product of their
marginal distributions.
Statistical independence is when the value taken by one variable has no effect on the value
taken by the other variable. If the variables are independent, their joint probability will equal
the product of their marginal probabilities. If they are not independent, they are dependent.
For example, when rolling two dice, the second roll is independent of the first. This independence implies that the probability of rolling double-sixes is equal to the product of P(rolling one six) and P(rolling one six). If the two dice are independent, then P(first roll = 6, second roll = 6) = P(rolling a six) × P(rolling a six). And, indeed: 1/36 = (1/6)×(1/6).
Mean
E(a + bX + cY) = a + b·μX + c·μY
Variance
In regard to the sum of correlated variables, the variance of correlated variables is given by the following (note the two expressions; the second merely substitutes the covariance with the product of correlation and volatilities. Please make sure you are comfortable with this substitution):
variance(X + Y) = σ²_X + σ²_Y + 2·cov(X,Y) = σ²_X + σ²_Y + 2·ρ_XY·σ_X·σ_Y
If X and Y are independent, the covariance term drops out and this simplifies to:
variance(X + Y) = σ²_X + σ²_Y
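As a quick numerical sketch of this substitution (the two volatilities and the correlation below are assumed inputs for illustration, not values from the notes):

    import math

    sigma_x, sigma_y, rho = 0.20, 0.10, 0.30      # assumed volatilities and correlation

    var_sum_corr  = sigma_x**2 + sigma_y**2 + 2 * rho * sigma_x * sigma_y
    var_sum_indep = sigma_x**2 + sigma_y**2       # covariance term drops out if independent

    print(round(math.sqrt(var_sum_corr), 4), round(math.sqrt(var_sum_indep), 4))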
Normal distribution
f(x) = [1 / (σ·√(2π))] · e^(−(x−μ)² / (2σ²))
[Figure: normal density plotted from −4.0 to 4.0]
Key properties of the normal:
Parsimony: Only requires (is fully described by) two parameters: mean and variance
The central limit theorem (CLT) says that sampling distribution of sample means tends
to be normal (i.e., converges toward a normally shaped distributed) regardless of the
shape of the underlying distribution; this explains much of the “popularity” of the normal
distribution.
The normal is economical (elegant) because it only requires two parameters (mean
and variance). The standard normal is even more economical: it requires no
parameters.
A normal distribution is fully specified by two parameters, mean and variance (or standard deviation). We can transform a normal into a unit or standardized variable: z = (X − μ)/σ. No parameters required!
This unit or standardized variable is normally distributed with zero mean and variance of
one (1.0). Its standard deviation is also one (variance = 1.0 and standard deviation = 1.0). This is
written as: Variable Z is approximately (“asymptotically”) normally distributed: Z ~ N(0,1)
Key locations on the normal distribution are noted below. In the FRM curriculum, the choice of
one-tailed 5% significance and 1% significance (i.e., 95% and 99% confidence) is common, so
please pay particular attention to the yellow highlights:
Memorize two common critical values: 1.65 and 2.33. These correspond to confidence
levels, respectively, of 95% and 99% for a one-tailed test. For VAR, the one-tailed test is
relevant because we are concerned only about losses (left-tail) not gains (right-tail).
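If scipy is available, a two-line sketch (not from the notes) recovers these one-tailed critical values:

    from scipy.stats import norm

    print(round(norm.ppf(0.95), 3))   # ~1.645, usually memorized as 1.65 (one-tailed 95%)
    print(round(norm.ppf(0.99), 3))   # ~2.326, usually memorized as 2.33 (one-tailed 99%)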
3. If variables with a multivariate normal distribution have covariances that equal zero,
then the variables are independent
Chi-square distribution
[Figure: chi-square densities for k = 2, k = 5, and k = 29 degrees of freedom, plotted from 0 to 30]
For the chi-square distribution, we observe a sample variance and compare to hypothetical
population variance. This variable has a chi-square distribution with (n-1) d.f.:
χ² = (n − 1)·s² / σ² ~ χ²(n − 1)
Nonnegative (>0)
Google’s sample variance over 30 days is 0.0263%. We can test the hypothesis that the
population variance (Google’s “true” variance) is 0.02%. The chi-square variable = 38.14:
[Figure: chi-square distribution compared to the normal distribution]
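A minimal sketch of this variance test, assuming scipy is available (the inputs are the figures quoted in the example above):

    from scipy.stats import chi2

    n = 30
    sample_var = 0.000263      # 0.0263%, the 30-day sample variance from the example
    hypo_var   = 0.0002        # 0.02%, the hypothesized population variance

    chi_sq = (n - 1) * sample_var / hypo_var          # ~38.14
    p_value = 1 - chi2.cdf(chi_sq, df=n - 1)          # one-tailed p-value
    print(round(chi_sq, 2), round(p_value, 4))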
The student’s t distribution (t distribution) is among the most commonly used distributions. As
the degrees of freedom (d.f.) increases, the t-distribution converges with the normal
distribution. It is similar to the normal, except it exhibits slightly heavier tails (the lower the d.f..,
the heavier the tails). The student’s t variable is given by:
t = (X̄ − μ) / (SX / √n)
Its variance = k/(k-2) where k = degrees of freedom. Note, as k increases, the variance
approaches 1.0. Therefore, as k increases, the t-distribution approximates the
standard normal distribution.
The student's t always has slightly heavy tails (kurtosis > 3.0) but converges to the normal; it is not considered a truly heavy-tailed distribution.
In practice, the student's t is the most commonly used distribution. When we test the significance of regression coefficients, the central limit theorem (CLT) justifies the normal distribution (because the coefficients are effectively sample means). But we rarely know the population variance, so the student's t is the appropriate distribution.
When the d.f. is large (e.g., sample over ~30), as the student’s t approximates the
normal, we can use the normal as a proxy. In the assigned Stock & Watson, the sample
sizes are large (e.g., 420 students), so they tend to use the normal.
For example, Google's average periodic return over a ten-day sample period was +0.02% with a sample standard deviation of 1.54%. With 9 degrees of freedom, the two-tailed 95% critical t is 2.262.
The sample mean is a random variable. If we know the population variance, we assume the
sample mean is normally distributed. But if we do not know the population variance (typically
the case!), the sample mean is a random variable following a student’s t distribution.
In the Google example above, we can use this to construct a confidence (random) interval:
X̄ ± t·(s/√n)
We need the critical (lookup) t value. The critical t value is a function of the degrees of freedom (here, n − 1 = 9) and the desired confidence level (here, 95% two-tailed).
The 95% confidence interval can be computed. The upper limit is given by:
X̄ + (2.262)(1.54%/√10) = +1.12%
The lower limit is given by:
X̄ − (2.262)(1.54%/√10) = −1.08%
Please make sure you can take a sample standard deviation, compute the critical t value
and construct the confidence interval.
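As a sketch of that calculation (assuming scipy; the inputs are the Google figures quoted above):

    from math import sqrt
    from scipy.stats import t

    n, mean, s = 10, 0.0002, 0.0154          # 10 days, +0.02% mean, 1.54% sample std dev
    crit = t.ppf(0.975, df=n - 1)            # ~2.262, two-tailed 95% critical t

    half_width = crit * s / sqrt(n)
    print(round(mean - half_width, 4), round(mean + half_width, 4))   # ~(-1.08%, +1.12%)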
Z = (X̄ − μ) / (σX / √n)      versus      t = (X̄ − μ) / (SX / √n)
F-Distribution
[Figure: F distributions with (19, 19) and (9, 9) degrees of freedom]
The F distribution is also called the variance ratio distribution (it may be helpful to think of it as
the variance ratio!). The F ratio is the ratio of sample variances, with the greater sample variance
in the numerator:
F = s²x / s²y
Properties of F distribution:
Nonnegative (>0)
Skewed right
The square of a t-distributed r.v. with k d.f. has an F distribution with (1, k) d.f.
As n grows large, m × F(m, n) approaches a chi-square variable with m d.f.
For example, based on two 10-day samples, we calculated the sample variance of Google and
Yahoo. Google’s variance was 0.0237% and Yahoo’s was 0.0084%. The F ratio, therefore, is 2.82
(divide higher variance by lower variance; the F ratio must be greater than, or equal to, 1.0).
GOOG YHOO
=VAR() 0.0237% 0.0084%
=COUNT() 10 10
F ratio 2.82
Confidence 90%
Significance 10%
=FINV() 2.44
At 10% significance, with (10-1) and (10-1) degrees of freedom, the critical F value is 2.44.
Because our F ratio of 2.82 is greater than (>) 2.44, we reject the null (i.e., that the population
variances are the same). We conclude the population variances are different.
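A sketch of the same variance-ratio test, assuming scipy (the sample variances and sample size are taken from the example above):

    from scipy.stats import f

    var_goog, var_yhoo, n = 0.000237, 0.000084, 10     # sample variances (0.0237%, 0.0084%) and n

    f_ratio = max(var_goog, var_yhoo) / min(var_goog, var_yhoo)   # larger variance on top: ~2.82
    crit = f.ppf(0.90, dfn=n - 1, dfd=n - 1)                      # 10% significance critical F: ~2.44
    print(round(f_ratio, 2), round(crit, 2), f_ratio > crit)      # True -> reject equal variances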
Moments of a distribution
k-th moment = Σ_{i=1}^{n} (xi − μ)^k / n
In this way, the difference of each data point from the mean is raised to a power (k=1, k=2, k=3, and k=4). These are the four moments of the distribution:
If k=1, refers to the first moment: the mean.
If k=2, refers to the second moment about the mean: the variance.
If k=3, refers to the third moment about the mean: skew (asymmetry).
If k=4, refers to the fourth moment about the mean: tail density and peakedness (kurtosis).
A random sample is a sample of random variables that are independent and identically
distributed (i.i.d.)
Independent: not (auto)correlated with one another.
Identical: same mean and same variance (homoskedastic); each random variable has the same (identical) probability distribution (PDF/PMF, CDF).
Define, calculate, and interpret the mean and variance of the sample
average.
E(Ȳ) = (1/n)·Σ_{i=1}^{n} E(Yi) = μY
variance(Ȳ) = σ²Y / n        Std Dev(Ȳ) = σY / √n
E(Ȳ) = μȲ = μY
This formula says, "we expect the average of our sample will equal the average of the population" (the over-bar signifies the sample mean; the Greek mu signifies the population mean).
If either: (i) the population is infinite and random sampling, or (ii) finite population and
sampling with replacement, the variance of the sampling distribution of means is:
σ²Ȳ = E[(Ȳ − μY)²] = σ²Y / n
This says, "The variance of the sample mean is equal to the population variance divided by the sample size." For example, the (population) variance of a single six-sided die is 2.92. If we roll three dice (i.e., sampling "with replacement"), then the variance of the sampling distribution = 2.92 ÷ 3 ≈ 0.97.
If the population is of size (N), the sample size is n ≤ N, and sampling is conducted "without replacement," then the variance of the sampling distribution of means is given by:
σ²Ȳ = (σ²Y / n) · (N − n)/(N − 1)
The standard error is the standard deviation of the sampling distribution of the estimator,
and the sampling distribution of an estimator is a probability (frequency distribution) of the
estimator (i.e., a distribution of the set of values of the estimator obtained from all possible
same-size samples from a given population). For a sample mean (per the central limit theorem!),
the variance of the estimator is the population variance divided by sample size. The
standard error is the square root of this variance; the standard error is a standard deviation:
se = √(σ²Y / n) = σY / √n
Z = (Ȳ − μY) / (σY / √n) ~ N(0,1), where the denominator σY / √n = se(Ȳ)
The denominator is the standard error, which is simply the name for the standard deviation of the sampling distribution.
Describe, interpret, and apply the Law of Large Numbers and the Central
Limit Theorem.
In brief:
Law of large numbers: under general conditions, the sample mean (Ӯ) will be near the
population mean.
Central limit theorem (CLT): As the sample size increases, regardless of the underlying
distribution, the sampling distributions approximates (tends toward) normal
We assume a population with a known mean and finite variance, but not necessarily a normal distribution (we may not know the distribution!). Random samples of size (n) are then drawn from the population. The expected value of each sample mean is the population's mean. Further, the variance of each sample mean is equal to the population's variance divided by n (note: this is equivalent to saying the standard deviation of the sample mean is equal to the population's standard deviation divided by the square root of n).
The central limit theorem says that this random variable (i.e., of sample size n, drawn from the
population) is itself normally distributed, regardless of the shape of the underlying
population. Given a population described by any probability distribution having mean (μ) and finite variance (σ²), the distribution of the sample mean computed from samples (where each sample equals size n) will be approximately normal. Generally, if the size of the sample is at least 30 (n ≥ 30), then we can assume the sample mean is approximately normal!
Each sample has a sample mean. There are many sample means. The sample means have
variation: a sampling distribution. The central limit theorem (CLT) says the sampling
distribution of sample means is asymptotically normal.
We assume a population with a known mean and finite variance, but not necessarily a
normal distribution.
The distribution of the sample mean computed from samples (where each sample equals
size n) will be approximately (asymptotically) normal.
X̄ = (X1 + X2 + … + Xn) / n
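A small simulation sketch (mine, using an assumed die-rolling population) illustrates the CLT and the standard error of the sample mean:

    import random
    import statistics

    random.seed(1)
    n, trials = 30, 5000

    # draw many samples of size n from a decidedly non-normal population (a six-sided die)
    sample_means = [statistics.mean(random.randint(1, 6) for _ in range(n)) for _ in range(trials)]

    print(round(statistics.mean(sample_means), 2))    # close to the population mean 3.5
    print(round(statistics.stdev(sample_means), 3))   # close to sigma/sqrt(n) = 1.708/sqrt(30) ~ 0.31

The histogram of these sample means is approximately normal even though the underlying die is uniform.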
Statistical inference is the process of generalizing from the sample value to the population value.
An estimate is calculated from the sample (a.k.a., a sample statistic). For example, sample
mean, sample variance, sample skew, sample kurtosis.
In addition to the estimate itself (e.g., sample mean), we estimate the sampling error or
sampling variation.
Next, we conduct hypothesis testing by either: (i) confidence interval, (ii) test of
significance, or (iii) p value.
Population parameters
The population is the entire group under study. The population is often unknowable.
The population size is denoted by a capital “N.”
The population (of which there is typically one) has parameters; e.g., the population mean or the population variance. A parameter is a quantity in the f(x) distribution—such as the mean, or standard deviation, or (p) in the case of the binomial distribution—that helps describe the distribution. Quantities that appear in f(x), such as the mean (μ) and the standard deviation (σ), are called population parameters.
The sample is a subset of the population. For practical purposes, we draw a sample
(from the population) in order to make inferences about the population. The sample size
is denoted with small “n”
From the sample (of which there are many) we calculate estimates from estimators or
statistics; e.g., the sample mean or the sample variance. Estimators (statistics) are the
recipes for the “best guesses” about the “true” population parameters. Estimators
(statistics) versus parameters
The sample mean, Ӯ, is the best linear unbiased estimator (BLUE). In the Stock & Watson
example, the average (mean) wage among 200 people is $22.64:
The standard error of the sample mean is $1.28 because $18.14/SQRT(200) = $1.28
The estimator (m) that minimizes the sum of squared gaps (Yi – m) is called the least squares
estimator:
Σ_{i=1}^{n} (Yi − m)²
t = (Ȳ − μY,0) / SE(Ȳ)
The critical t-value or “lookup” t-value is the t-value for which the test just rejects the null
hypothesis at a given significance level. For example:
The critical t-values bound a region within the student’s distribution that is a specific
percentage (90%? 95%? 99%?) of the total area under the student’s t distribution curve. The
student’s t distribution with (n-1) degrees of freedom (d.f.) has a confidence interval given by:
Ȳ − t·(sY/√n) ≤ μY ≤ Ȳ + t·(sY/√n)
Please note, further because the distribution is symmetrical (skew=0), 5% among both tails
implies 2.5% in the left-tail.
The green shaded area represents values less than three (< 3.0). Think of it as the “sweet
spot.” For confidences less than 99% and d.f. > 13, the critical t is always less than 3.0. So, for
example, a computed t of 7 or 13 will generally be significant. Keep this in mind because in
many cases, you do not need to refer to the lookup table if the computed t is large; you can
simply reject the null.
The confidence interval uses the product of [standard error х critical “lookup” t]. In the Stock
& Watson example, the confidence interval is given by 22.64 +/- (1.28)(1.96) because 1.28 is the
standard error and 1.96 is the critical t (critical Z) value associated with 95% two-tailed
confidence:
Mean 23.25
Variance 90.13
Std Dev 9.49
Count 28
d.f. 27
Confidence (1-α) 95%
Significance (α) 5%
Critical t 2.052
Standard error 1.794
Hypothesis 18.5
t value 2.65 = (23.25 - 18.5) / (1.794)
p value 1.3%
Reject null with 98.7%
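A sketch of the test statistic and p-value in that table, assuming scipy (all inputs are the figures shown above):

    from math import sqrt
    from scipy.stats import t

    n, mean, sd, hypo = 28, 23.25, 9.49, 18.5

    se = sd / sqrt(n)                                    # ~1.794
    t_stat = (mean - hypo) / se                          # ~2.65
    p_value = 2 * (1 - t.cdf(abs(t_stat), df=n - 1))     # two-tailed p-value, ~1.3%
    print(round(t_stat, 2), round(p_value, 3))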
The confidence coefficient is selected by the user; e.g., 95% (0.95) or 99% (0.99).
The significance = 1 – confidence coefficient.
Determine degrees of freedom (d.f.). d.f. = sample size – 1. In this case, 28 – 1 = 27 d.f.
We are constructing an interval, so we need the critical t value for 5% significance with
two-tails.
The critical t value is equal to 2.052. That’s the value with 27 d.f. and either 2.5% one-
tailed significance or 5% two-tailed significance (see how they are the same provided the
distribution is symmetrical?)
The standard error is equal to the sample standard deviation divided by the square root of the sample size (not d.f.!). In this case, 9.49/SQRT(28) ≈ 1.794.
The lower limit of the confidence interval is given by: the sample mean minus the
critical t (2.052) multiplied by the standard error (9.49/SQRT[28]).
The upper limit of the confidence interval is given by: the sample mean plus the
critical t (2.052) multiplied by the standard error (9.49/SQRT[28]).
X̄ − t·(sx/√n) ≤ μ ≤ X̄ + t·(sx/√n)
23.25 − 2.052·(9.49/√28) ≤ μ ≤ 23.25 + 2.052·(9.49/√28), i.e., approximately 19.57 ≤ μ ≤ 26.93
This confidence interval is a random interval. Why? Because it will vary randomly with
each sample, whereas we assume the population mean is static.
We don’t say the probability is 95% that the “true” population mean lies within
this interval. That implies the true mean is variable. Instead, we say the
probability is 95% that the random interval contains the true mean. See how the
population mean is trusted to be static and the interval varies?
An estimate is the numerical value of the estimator when it is actually computed using
data from a specific sample.
Linearity: estimator is a linear function of sample observations. For example, the sample
mean is a linear function of the observations.
Unbiasedness: the average or expected value of the estimator is equal to the true value
of the parameter.
Minimum variance: the variance of the estimator is smaller than any “competing”
estimator. Note: an estimator can have minimum variance yet be biased.
Efficiency: Among the set of unbiased estimators, the estimator with the minimum
variance is the efficient estimator (i.e., it has the smallest variance among unbiased
estimators)
Best linear unbiased estimator (BLUE): the estimator that combines three properties: (i) linear, (ii) unbiased, and (iii) minimum variance.
E(Ȳ) = μY
If the expected value of the estimator is the population parameter, the estimator is
unbiased. If, in repeated applications of a method the mean value of the estimators
coincides with the true parameter value, that estimator is called an unbiased estimator.
Unbiasedness is a repeated sampling property: if we draw several samples of size (n) from a population and compute the unbiased sample statistic for each sample, the average of these sample statistics will tend to approach (converge on) the population parameter.
An efficient estimate is both unbiased (i.e., the mean or expectation of the statistic is equal to
the parameter) and its variance is smaller than the alternatives (i.e., all other things being equal,
we would prefer a smaller variance). A statement of the error or precision of an estimate is
often called its reliability
Efficient: variance(Ȳ) is smaller than the variance of competing unbiased estimators.
Consistent: Ȳ →p μY (the sample mean converges in probability to the population mean).
Please note the null must contain the equal sign ("="):
H0: E(Y) = μY,0
H1: E(Y) ≠ μY,0
The null hypothesis, denoted by H0, is tested against the alternative hypothesis, which is denoted by H1 or sometimes HA.
Often, we test for the significance of the intercept or a partial slope coefficient in a linear regression. Typically, in this case, our null hypothesis is: "the slope is zero" or "there is no correlation between X and Y" or "the regression coefficients jointly are not significant." In which case, if we reject the null, we are finding the statistic to be significant which, in this case, means "significantly different than zero."
Statistical significance implies our null hypothesis (i.e., the parameter equals zero) was
rejected. We concluded the parameter is nonzero. For example, a “significant” slope
estimate means we rejected the null hypothesis that the true slope is zero.
The null hypothesis always includes the equal sign (=), regardless! The null cannot include
only less than (<) or greater than (>).
Then we simply ascertain if the null hypothesized value is within the interval (within the "acceptance region").
90% CI for μY = Ȳ ± 1.64·SE(Ȳ)
95% CI for μY = Ȳ ± 1.96·SE(Ȳ)
99% CI for μY = Ȳ ± 2.58·SE(Ȳ)
In the significance approach, instead of defining the confidence interval, we compute the standardized distance in standard deviations from the observed mean to the null hypothesis: this is the test statistic (or computed t value). We compare it to the critical (or lookup) value. If the test statistic is greater than the critical (lookup) value, then we reject the null.
Under the circumstances, a Type I error is the following: we decide that the excess is significant and the manager adds value, but actually the out-performance was random (he did not add skill). In technical terms, we mistakenly rejected a true null. Under the circumstances, a Type II error is the following: we decide the excess is random (luck rather than skill), but actually it was not random and he did add value. In technical terms, we falsely accepted (failed to reject) a false null.
The p-value is the "exact significance level": the lowest significance level at which the null can be rejected. Equivalently, we can reject the null with (1 − p)% confidence.
The p-value is an abbreviation that stands for "probability-value." Suppose our hypothesis is that a population mean is 10; another way of saying this is "our null hypothesis is H0: mean = 10 and our alternative hypothesis is H1: mean ≠ 10." Suppose we conduct a two-tailed test, given the results of a sample drawn from the population, and the test produces a p-value of .03. This means that we can reject the null hypothesis with 97% confidence; in other words, we can be fairly confident that the true population mean is not 10.
Our example was a two-tailed test, but recall we have three possible tests:
The parameter is greater than (>) the stated value (right-tailed test), or
The parameter is less than (<) the stated value (left-tailed test), or
The parameter is either greater than or less than (≠) the stated value (two-tailed test).
Small p-values provide evidence for rejecting the null hypothesis in favor of the alternative
hypothesis, and large p values provide evidence for not rejecting the null hypothesis in favor of
the alternative hypothesis.
Keep in mind a subtle point about the p-value and "rejecting the null." The decision is soft: rather than "accept" the null, we fail to reject the null. Further, if we do reject the null, we are merely rejecting the null in favor of the alternative.
The analogy is to a jury verdict. The jury does not return a verdict of "innocent;" rather, it returns a verdict of "not guilty."
p-value = Pr_H0[ |Z| > |t_act| ] = 2·[1 − Φ(|t_act|)] = 2·Φ(−|t_act|)
SE(Ȳ) = σ̂Ȳ = sY / √n
90% CI for μY = Ȳ ± 1.64·SE(Ȳ)
95% CI for μY = Ȳ ± 1.96·SE(Ȳ)
Perform and interpret hypothesis tests for the difference between two
means
t = (Ȳm − Ȳw − d0) / SE(Ȳm − Ȳw)
Define, describe, apply, and interpret the t-statistic when the sample size
is small.
If the sample size is small, t-statistic has a Student’s t distribution with (n-1) degrees of freedom
t = (Ȳ − μY,0) / √(s²Y / n)
The scattergram is a plot of the dependent variable (on the Y axis) against the independent (explanatory) variable (on the X axis). In Stock and Watson, the explanatory variable is the student-teacher ratio (STR). The dependent variable is the test score:
[Scatterplot: Test Scores (y-axis, 600 to 700) versus Student-teacher ratio (x-axis, 10 to 30)]
Covariance is the average cross-product. Sample covariance multiplies the sum of cross-
products by 1/(n-1) rather than 1/n:
sXY = [1/(n − 1)] · Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ)
Sample correlation is sample covariance divided by the product of sample standard deviations:
rXY = sXY / (SX · SY)
For a very simple example, consider three (X,Y) pairs: {(3,5), (2,4), (4,6)}:
X            Y            (X − X̄)(Y − Ȳ)
3            5            0.0
2            4            1.0
4            6            1.0
Avg = 3      Avg = 5      Avg = σXY = 0.67
StdDev = SQRT(0.67)       StdDev = SQRT(0.67)       Correl. = 1.0
Please note:
Properties of covariance
Note that a variable’s covariance with itself is its variance. Keeping this in mind, we
realize that the diagonal in a covariance matrix is populated with variances.
The correlation coefficient is the covariance (X,Y) divided by the product of the each variable’s
standard deviation. The correlation coefficient translates covariance into a unitless metric
that runs from -1.0 to +1.0:
ρXY = cov(X,Y) / [StandardDev(X) · StandardDev(Y)] = σXY / (σX·σY), so that σXY = ρXY·σX·σY
Memorize this relationship between the covariance, the correlation coefficient, and the
standard deviations. It has high testability.
On the next page we illustrate the application of the variance theorems and the correlation
coefficient.
The example refers to two products, Coke (X) and Pepsi (Y).
We (somehow) can generate growth projections for both products. For both Coke (X) and Pepsi
(Y), we have three scenarios (bad, medium, and good). Probabilities are assigned to each
growth scenario.
In regard to Coke (X), the three growth outcomes are 3, 9, and 12, with probabilities of 20%, 60%, and 20%, respectively.
In regard to Pepsi (Y), the three growth outcomes are 5, 7, and 9, with the same probabilities.
Finally, we know these outcomes are not independent. We want to calculate the correlation
coefficient.
                Bad        Medium     Good
XY              15         63         108
p·XY            3          37.8       21.6       → E(XY) = 62.4
X²              9          81         144        → E(X²) = 79.2
Y²              25         49         81         → E(Y²) = 50.6
STDEVP(X) = 2.939          STDEVP(Y) = 1.265
COV / [(STD)(STD)] = 0.9682
The calculation of expected values is required: E(X), E(Y), E(XY), E(X2) and E(Y2). Make sure you
can replicate the following two steps:
The correlation coefficient (ρ) is equal to the Cov(X,Y) divided by the product of the standard deviations: ρXY ≈ 0.97 = 3.6 / (2.94 × 1.26)
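A sketch of the same probability-weighted calculation (the scenario values and weights are read off the table above):

    from math import sqrt

    probs = [0.2, 0.6, 0.2]          # bad / medium / good scenario weights
    x = [3, 9, 12]                   # Coke growth outcomes
    y = [5, 7, 9]                    # Pepsi growth outcomes

    def expect(vals):
        return sum(p * v for p, v in zip(probs, vals))

    e_x, e_y = expect(x), expect(y)
    cov = expect([a * b for a, b in zip(x, y)]) - e_x * e_y        # 62.4 - 8.4*7 = 3.6
    sx  = sqrt(expect([a * a for a in x]) - e_x**2)                # ~2.939
    sy  = sqrt(expect([b * b for b in y]) - e_y**2)                # ~1.265
    print(round(cov / (sx * sy), 4))                               # ~0.9682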
Independence implies zero covariance and zero correlation, but the converse is not necessarily true: zero correlation does not imply independence. For example, Y = X² is completely dependent on X, yet the dependence is nonlinear and (for a symmetric X) the correlation is zero.
Correlation (or dependence) is not causation. For example, in a basket credit default swap, the correlation (dependence) between the obligors is a key input. But we do not assume there is mutual causation (e.g., that one default causes another). Rather, more likely, different obligors are similarly sensitive to economic conditions. So, economic deterioration may be the external cause that all obligors have in common. Consequently, their defaults exhibit dependence. But the causation is not internal.
Sample mean
Sample mean is the sum of observations divided by the number of observations:
X̄ = (Σ_{i=1}^{n} Xi) / n
Variance
A population variance is given by: The sample variance is divided by (n-1):
σ²x = (1/n)·Σ_{i=1}^{n} (Xi − μ)²            s²x = [1/(n − 1)]·Σ_{i=1}^{n} (Xi − X̄)²
Assume the following series of four numbers: 10, 12, 14, and 16. The average of the series is (10+12+14+16) ÷ 4 = 13. For the population variance, in the numerator we want to sum the squared differences. The population variance is given by [(10−13)² + (12−13)² + (14−13)² + (16−13)²] ÷ 4 = 20 ÷ 4 = 5. The sample variance has the same numerator and (n − 1) = 3 for the denominator: 20 ÷ 3 ≈ 6.7. The standard deviation is the square root of the variance. The population standard deviation is the square root of 5 ≈ 2.24, and the sample standard deviation is the square root of 6.7 ≈ 2.6.
Covariance
Population covariance is given by:          Sample covariance is given by:
σXY = (1/n)·Σ (Xi − X̄)(Yi − Ȳ)              sample sXY = [1/(n − 1)]·Σ (Xi − X̄)(Yi − Ȳ)
Correlation coefficient
Correlation coefficient is given by: Sample correlation coefficient is given by:
ρXY = cov(X,Y) / [StdDev(X)·StdDev(Y)]       sample rXY = sample sXY / (SX·SY)
Skewness
Skewness is given by: Sample skewness is given by:
Skewness = E[(X − μ)³] / σ³                  Sample skewness = Σ(X − X̄)³ / [(N − 1)·S³]
Kurtosis
Kurtosis is given by: Sample kurtosis is given by:
Kurtosis = E[(X − μ)⁴] / σ⁴                  Sample kurtosis = Σ(X − X̄)⁴ / [(N − 1)·S⁴]
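A short sketch (mine) computes these sample moments for the small series used earlier in this section, following the sample formulas above with their (N − 1) denominators:

    from math import sqrt

    data = [10, 12, 14, 16]
    n = len(data)
    mean = sum(data) / n

    s2 = sum((x - mean) ** 2 for x in data) / (n - 1)                  # sample variance ~6.7
    s = sqrt(s2)
    skew = sum((x - mean) ** 3 for x in data) / ((n - 1) * s ** 3)
    kurt = sum((x - mean) ** 4 for x in data) / ((n - 1) * s ** 4)
    print(round(s2, 2), round(skew, 3), round(kurt, 3))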
What is Econometrics?
Econometrics is a social science that applies tools (economic theory, mathematics and statistical
inference) to the analysis of economic phenomena. Econometrics consists of “the application of
mathematical statistics to economic data to lend empirical support to the models constructed
by mathematical economics.”
Methodology of econometrics
Specify the (pure) mathematical model: a linear function with parameters (but without an error term)
Specify the statistical (econometric) model: add the random error term
Collect data
Estimate the parameters of the chosen econometric model: we are likely to use the ordinary least squares (OLS) approach to estimate parameters
Test the model specification
Test the hypothesis
Note:
The difference between the mathematical and statistical model is the random error
term (u in the econometric equation below). The statistical (or empirical) econometric
model adds the random error term (u):
Yi = B0 + B1·Xi + ui
Pooled (combination of time series and cross-sectional) - returns over time for a
combination of assets; and
Panel data (a.k.a., longitudinal or micropanel) data is a special type of pooled data in
which the cross-sectional unit (e.g., family, company) is surveyed over time.
For example, we often characterize a portfolio with a matrix. In such a matrix, the assets are
given in the rows and the period returns (e.g., days/months/years) are given in the columns:
For such a “matrix portfolio,” we can examine the data in at least three ways:
In Stock and Watson, the authors regress student test scores (dependent variable) against class
size (independent variable):
Yi = β0 + β1·Xi + ui, where Y is the dependent variable (regressand) and X is the independent variable (regressor)
The error term contains all the other factors aside from (X) that determine the value of the
dependent variable (Y) for a specific observation.
Yi = β0 + β1·Xi + ui
The stochastic error term is a random variable. Its value cannot be a priori determined.
In theory, there is one unknowable population and one set of unknowable parameters (B1, B2).
But there are many samples, each sample → SRF → Estimator (statistic) → Estimate
Each sample produces its own scatterplot. Through this sample scatterplot, we can plot a sample
regression line (SRL). The sample regression function (SRF) characterizes this line; the SRF is
analogous to the PRF, but for each sample.
Note the correspondence between error term and the residual. As we specify the model,
we ex ante anticipate an error; after we analyze the observations, we ex post observe
residuals.
E(Y) = B0 + B1·Xi²   (nonlinear in the variable, linear in the parameters)
The process of ordinary least squares estimation seeks to achieve the minimum value for the
residual sum of squares (squared residuals = e^2).
Define and interpret the explained sum of squares, the total sum of
squares, and the residual sum of squares
The explained sum of squares (ESS) is the squared distance between the predicted Y and the
mean of Y:
ESS = Σ_{i=1}^{n} (Ŷi − Ȳ)²
SSR = Σ_{i=1}^{n} (Yi − Ŷi)²
The sum of squared residuals (SSR) is the sum of the squared residuals (the estimated error terms). It is directly related to the standard error of the regression (SER):
SSR = Σ_{i=1}^{n} (Yi − Ŷi)² = Σ_{i=1}^{n} ûi² = SER²·(n − 2)
Equivalently:
σ̂² = Σei² / (n − 2)        σ̂ = √[ Σei² / (n − 2) ]
The SSR and the standard error of regression (SER) are directly related; the SER is the
standard deviation of the Y values around the regression line.
The standard error of the regression (SER) is a function of the sum of squared residual (SSR):
SER = √[ SSR / (n − 2) ] = √[ Σei² / (n − 2) ]
Note the use of (n − 2) instead of (n) in the denominator. Division by this smaller number—(n − 2) instead of (n)—produces an unbiased estimate. (n − 2) is used because the two-variable regression has (n − 2) degrees of freedom (d.f.): two degrees of freedom are consumed in estimating the intercept and the slope.
If k = the number of explanatory variables plus the intercept (e.g., 2 if one explanatory
variable; 3 if two explanatory variables), then SER = SQRT[SSR/(n-k)].
If k = the number of slope coefficients (excluding the intercept), then similarly, SER =
SQRT[SSR/(n-k -1)]
In the Stock & Watson example, the authors regress TestScore against the Student-teacher ratio
(STR):
B(1) B(0)
Regression coefficients -2.28 698.93
Standard errors, SE() 0.48 9.47
R^2, SER 0.05 18.58
F, d.f. 22.58 418.00
ESS, RSS 7,794 144,315
Please note:
Both the slope and intercept are significant at 95%, at least. The test statistics are 73.8 for the intercept (698.93/9.47) and 4.75 for the slope (2.28/0.48). For example, given the high test statistic for the slope, its p-value is approximately zero.
In the example from Stock and Watson, lower limit = 680.4 = 698.9 – 9.47 × 1.96
Confidence Interval
Coefficient SE Lower Upper
Intercept 698.9 9.47 680.4 717.5
Slope (B1) -2.28 0.48 -3.2 -1.3
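A minimal OLS sketch (mine, using simulated stand-in data, not the Stock & Watson dataset; the standard error shown is the homoskedasticity-only formula) illustrates how the coefficient, its standard error, and the confidence interval are produced:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = rng.uniform(14, 26, n)                       # hypothetical student-teacher ratios
    y = 700 - 2.3 * x + rng.normal(0, 18, n)         # hypothetical test scores

    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # OLS slope
    b0 = y_bar - b1 * x_bar                                             # OLS intercept

    resid = y - (b0 + b1 * x)
    ser = np.sqrt(np.sum(resid ** 2) / (n - 2))                         # standard error of regression
    se_b1 = ser / np.sqrt(np.sum((x - x_bar) ** 2))                     # SE of the slope
    print(round(b0, 2), round(b1, 3), round(se_b1, 3))
    print(round(b1 - 1.96 * se_b1, 2), round(b1 + 1.96 * se_b1, 2))     # ~95% confidence interval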
The key idea here is that the regression coefficient (the estimator or sample statistic) is a
random variable that follows a student’s t distribution (because we do not know the population
variance, or it would be the normal):
To test the hypothesis that the regression coefficient (b1) is equal to some specified value (β), we use the fact that the statistic
t = (b1 − β) / se(b1)
follows a student's t distribution. In the Stock & Watson example, the two-tailed p-value is approximately 0%.
The error term u(i) is homoskedastic if the variance of the conditional distribution of u(i) given
X(i) is constant for i = 1,…,n and in particular does not depend on X(i).
Yi = β0 + β1·X1i + β2·X2i + … + βk·Xki + ui
The β1 slope coefficient, for example, is the effect on Y of a unit change in X1 if we hold the other independent variables, X2, …, Xk, constant.
Standard error of regression (SER) estimates the standard deviation of the error term u(i). In
this way, the SER is a measure of spread of the distribution of Y around the regression line. In a
multiple regression, the SER is given by:
SER = √[ SSR / (n − k − 1) ]
Where (k) is the number of slope coefficients; e.g., in the case of a two-variable regression, k = 1. For the standard error of the regression (SER), the denominator is n minus the number of estimated coefficients (the slopes plus the intercept).
The coefficient of determination is the fraction of the sample variance of Y(i) explained by (or
predicted by) the independent variables”
R² = ESS/TSS = 1 − SSR/TSS
“Adjusted R^2”
The unadjusted R^2 will tend to increase as additional independent variables are added.
However, this does not necessarily reflect a better fitted model.
The adjusted R^2 is a modified version of the R^2 that does not necessarily increase when a new independent variable is added. Adjusted R^2 is given by:
R̄² = 1 − [(n − 1)/(n − k − 1)]·(SSR/TSS) = 1 − s²û / s²Y
The least squares assumptions are:
1. The conditional distribution of u(i) given the regressors has a mean of zero
2. X(1i), X(2i), … X(ki), Y(i) are independent and identically distributed (i.i.d.)
3. Large outliers are unlikely (finite fourth moments)
4. No perfect collinearity
Imperfect multicollinearity does not prevent estimation of the regression, nor does it imply a logical problem with the choice of independent variables (i.e., regressors). However, imperfect multicollinearity does mean that one or more of the regression coefficients could be estimated imprecisely.
The Stock & Watson example adds an additional independent variable (regressor). Under this
three variable regression, Test Scores (dependent) are a function of the Student/Teacher ratio
(STR) and the Percentage of English learners in district (PctEL).
The “overall” regression F-statistic tests the joint hypothesis that all slope coefficients are zero
If the error term is homoskedastic, the F-statistic can be written in terms of the improvement in
the fit of the regression as measured either by the sum of squared residuals or by the regression
R^2.
F = [ (SSR_restricted − SSR_unrestricted) / q ] / [ SSR_unrestricted / (n − k_unrestricted − 1) ]
Confidence ellipse characterizes a confidence set for two coefficients; this is the two-dimension
analog to the confidence interval:
[Figure: confidence ellipse for the coefficient on STR (B1, x-axis) and the coefficient on Expn (B2, y-axis)]
Omitted variable bias: an omitted determinant of Y (the dependent variable) is correlated with
at least one of the regressor (independent variables).
There are four pitfalls to watch in using the R^2 or adjusted R^2:
1. An increase in the R^2 or adjusted R^2 does not necessarily imply that an added variable
is statistically significant
2. A high R^2 or adjusted R^2 does not mean the regressors are a true cause of the
dependent variable
3. A high R^2 or adjusted R^2 does not mean there is no omitted variable bias
4. A high R^2 or adjusted R^2 does not necessarily mean you have the most appropriate set
of regressors, nor does a low R^2 or adjusted R^2 necessarily mean you have an
inappropriate set of regressors
Bernoulli
A random variable X is called Bernoulli distributed with parameter (p) if it has only two
possible outcomes, often encoded as 1 (“success” or “survival”) or 0 (“failure” or
“default”), and if the probability for realizing “1” equals p and the probability for “0”
equals 1 – p. The classic example for a Bernoulli-distributed random variable is the default
event of a company.
X = 1 if company C defaults in period I; X = 0 otherwise
A binomial distributed random variable is the sum of (n) independent and identically distributed
(i.i.d.) Bernoulli-distributed random variables. The probability of observing (k) successes is
given by:
P(Y = k) = (n choose k)·p^k·(1 − p)^(n−k),   where (n choose k) = n! / [(n − k)!·k!]
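As a sketch of this formula applied to a credit portfolio (the portfolio size and default probability below are assumed for illustration):

    from math import comb

    n, p = 50, 0.02          # assumed: 50 independent credits, each with a 2% default probability

    def prob_k_defaults(k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(round(prob_k_defaults(0), 4))                          # probability of zero defaults
    print(round(sum(prob_k_defaults(k) for k in range(3)), 4))   # probability of two or fewer defaults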
Poisson
The Poisson distribution depends upon only one parameter, lambda λ, and can be interpreted as
an approximation to the binomial distribution. A Poisson-distributed random variable is usually
used to describe the random number of events occurring over a certain time interval. The
lambda parameter (λ) indicates the rate of occurrence of the random events; i.e., it tells us
how many events occur on average per unit of time.
In the Poisson distribution, the random number of events that occur during an interval of time,
(e.g., losses/ year, failures/ day) is given by:
P(N = k) = (λ^k / k!) · e^(−λ)
The Bernoulli is used to characterize default; consequently, the binomial is used to characterize a portfolio of credits. In finance, the Poisson distribution is often used, as a generic stochastic process, to model the time of default in some credit risk models.
[Figure: Poisson distributions with lambda = 5, lambda = 10, and lambda = 20]
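A sketch of the Poisson pmf in use (the loss rate below is an assumed illustration, e.g., an average of five losses per year):

    from math import exp, factorial

    lam = 5.0                                      # assumed average number of losses per year

    def poisson_pmf(k):
        return lam**k * exp(-lam) / factorial(k)

    print(round(poisson_pmf(0), 4))                              # probability of no losses
    print(round(1 - sum(poisson_pmf(k) for k in range(11)), 4))  # probability of more than ten losses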
Normal
The middle of the distribution, mu (µ), is the mean (and median). This first moment is
also called the “location”
Standard deviation and variance are measures of dispersion (a.k.a., shape). Variance is
the second-moment; typically, variance is denoted by sigma-squared such that standard
deviation is sigma.
The distribution is symmetric around µ. In other words, the normal has skew = 0
Summation stability: If you take the sum of several independent random variables,
which are all normally distributed with mean (µi) and standard deviation (σi), then the
sum will be normally distributed again.
The normal distribution possesses a domain of attraction. The central limit theorem
(CLT) states that—under certain technical conditions—the distribution of a large sum of
random variables behaves necessarily like a normal distribution.
The normal distribution is not the only class of probability distributions having a domain
of attraction. Actually three classes of distributions have this property: they are called
stable distributions.
Exponential
The exponential distribution is popular in queuing theory. It is used to model the time we have
to wait until a certain event takes place. According to the text, examples include “the time
until the next client enters the store, the time until a certain company defaults or the time until
some machine has a defect.”
f(x) = λ·e^(−λx),   λ = 1/β,   x ≥ 0
The exponential density is nonzero only for nonnegative values (x ≥ 0):
[Figure: Exponential densities for parameter values 0.5, 1, and 2]
Weibull is a generalized exponential distribution; i.e., the exponential is a special case of the
Weibull where the alpha parameter equals 1.0.
F(x) = 1 − e^(−(x/β)^α),   x ≥ 0
[Figure: Weibull densities for (alpha=0.5, beta=1), (alpha=2, beta=1), and (alpha=2, beta=2)]
The main difference between the exponential distribution and the Weibull is that, under the
Weibull, the default intensity depends upon the point in time t under consideration. This allows
us to model the aging effect or teething troubles:
For α > 1—also called the “light-tailed” case—the default intensity is monotonically increasing
with increasing time, which is useful for modeling the “aging effect” as it happens for machines:
The default intensity of a 20-year old machine is higher than the one of a 2-year old machine.
For α < 1—the “heavy-tailed” case—the default intensity decreases with increasing time. That
means we have the effect of “teething troubles,” a figurative explanation for the effect that after
some trouble at the beginning things work well, as it is known from new cars. The credit spread
on noninvestment-grade corporate bonds provides a good example: Credit spreads usually
decline with maturity. The credit spread reflects the default intensity and, thus, we have the
effect of “teething troubles.” If the company survives the next two years, it will survive for a
longer time as well, which explains the decreasing credit spread.
The family of Gamma distributions forms a two parameter probability distribution family with
the density function (pdf) given by:
f(x) = [1 / (Γ(α)·β^α)] · x^(α−1) · e^(−x/β),   x ≥ 0
[Figure: Gamma densities for (alpha=1, beta=1), (alpha=2, beta=0.5), and (alpha=4, beta=0.25)]
For alpha = k/2 and beta = 2, the Gamma distribution becomes the chi-square distribution with k degrees of freedom.
Beta distribution
The beta distribution has two parameters: alpha (“center”) and beta (“shape”). The beta
distribution is very flexible, and popular for modeling recovery rates.
[Figure: Beta densities (popular for recovery rates) for (alpha=0.6, beta=0.6), (alpha=1, beta=5), (alpha=2, beta=4), and (alpha=2, beta=1.5)]
The beta distribution is often used to model recovery rates. Here are two examples: one beta
distribution to model a junior class of debt (i.e., lower mean recovery) and another for a senior
class of debt (i.e., lower loss given default):
Junior Senior
alpha (center) 2.0 4.0
beta (shape) 6.0 3.3
Mean recovery 25% 55%
[Figure: Beta densities for the Junior and Senior recovery-rate assumptions; x-axis: Recovery (Residual Value), 0% to 96%]
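A quick sketch, assuming scipy, using the Junior and Senior parameters from the table above (the 20% threshold in the last line is an arbitrary illustration):

    from scipy.stats import beta

    for label, a, b in [("Junior", 2.0, 6.0), ("Senior", 4.0, 3.3)]:
        dist = beta(a, b)
        print(label, "mean recovery:", round(dist.mean(), 3))      # alpha/(alpha+beta): 0.25 and ~0.55
        print(label, "P(recovery < 20%):", round(dist.cdf(0.20), 3))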
Lognormal
The lognormal is common in finance: if an asset return (r) is normally distributed, the continuously compounded future asset price level (or ratio of prices; i.e., the wealth ratio) is lognormal. Expressed in reverse, if a variable is lognormal, its natural log is normal.
[Figure: Lognormal density; nonzero (positive) support, positive skew, heavy right tail]
[Figure: Logistic densities for (alpha=0, beta=1), (alpha=2, beta=1), and (alpha=0, beta=3), compared to N(0,1)]
Measures of central tendency and dispersion (variance, volatility) are impacted more by observations near the mean than by outliers. The problem is that, typically, we are concerned with outliers; we want to size the likelihood and magnitude of low frequency, high severity (LFHS) events. Extreme value theory (EVT) solves this problem by fitting a separate distribution to the extreme loss tail. EVT uses only the tail of the distribution, not the entire dataset.
There are two broad approaches: block maxima, the traditional approach, and peaks over threshold (POT), the modern approach that is often preferred.
Under block maxima, the dataset is parsed into (m) identical, consecutive and non-overlapping periods called blocks. The length of the block should be greater than the periodicity; e.g., if the returns are daily, blocks should be weekly or more. Block maxima partitions the set into time-based intervals. It requires that observations be independently and identically (i.i.d.) distributed.
H_ξ(y) = exp[ −(1 + ξy)^(−1/ξ) ]   if ξ ≠ 0
H_ξ(y) = exp( −e^(−y) )            if ξ = 0
The xi (ξ) parameter is the shape parameter, or "tail index;" it represents the fatness of the tails: a higher ξ corresponds to fatter tails.
Per the (unassigned) Jorion reading on EVT, the key thing to know here is that (1) among
the three classes of GEV distributions (Gumbel, Frechet, and Weibull), we only care
about the Frechet because it fits to fat-tailed distributions, and (2) the shape parameter
determines the fatness of the tails (higher shape → fatter tails)
Peaks over threshold (POT) collects the dataset of losses above (or in excess of) some threshold.
The cumulative distribution function here refers to the probability that the “excess loss” (i.e., the
loss, X, in excess of the threshold, u, is less than some value, y, conditional on the loss exceeding
the threshold):
F_u(y) = P(X − u ≤ y | X > u)
[Figure: loss distribution with the threshold u marked on the x-axis (−4 to 4)]
Peaks over threshold (POT):
G_ξ,β(x) = 1 − (1 + ξx/β)^(−1/ξ)   if ξ ≠ 0
G_ξ,β(x) = 1 − exp(−x/β)           if ξ = 0
[Figure: generalized Pareto CDF, plotted from 0 to 4]
Block maxima is: time-based (i.e., blocks of time), traditional, less sophisticated, more
restrictive in its assumptions (i.i.d.)
Peaks over threshold (POT) is: more modern, has at least three variations (semi-
parametric; unconditional parametric; and conditional parametric), is more flexible
EVT Highlights:
Both GEV and GPD are parametric distributions used to model heavy-tails.
The sum of independent normally distributed random variables is also normally distributed
In credit risk modeling, the parameter λ = 1/β is interpreted as a hazard rate or default intensity.
f(x) = (1/β)·e^(−x/β) = λ·e^(−λx),   x ≥ 0
F(x) = 1 − e^(−x/β) = 1 − e^(−λx),   x ≥ 0
The Poisson distribution counts the number of discrete events in a fixed time period; it is related
to the exponential distribution, which measures the time between arrivals of the events. If
events occur in time as a Poisson process with parameter λ, the time between events are
distributed as an exponential random variable with parameter λ. For example (from the learning
XLS):
The generalized Pareto distribution (GPD) models the distribution of so-called "peaks over threshold." The GPD is the limiting distribution of the excesses over a threshold (the "peaks-over-threshold" model). Possible applications are in the field of operational risk, where we are concerned about losses above a certain threshold.
For severity tails, empirical distributions are rarely sufficient (there is rarely enough data!).
If two normal distributions have the same mean (but different variances), they combine (mix) to produce a mixture distribution with leptokurtosis (heavy tails). More generally, if the means differ, mixtures are almost infinitely flexible in shape.
[Figure: two normal densities (Normal 1, Normal 2) and their mixture, plotted from −10 to 10]
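A simulation sketch (mine; the two component volatilities of 1.0 and 3.0 are assumed) shows the heavy tails produced by mixing two zero-mean normals:

    import random
    import statistics

    random.seed(7)
    n = 100_000

    # equal-weight mixture of two zero-mean normals with different volatilities
    draws = [random.gauss(0, 1.0) if random.random() < 0.5 else random.gauss(0, 3.0) for _ in range(n)]

    m = statistics.fmean(draws)
    s = statistics.pstdev(draws)
    kurtosis = sum((x - m) ** 4 for x in draws) / (n * s ** 4)
    print(round(kurtosis, 2))        # noticeably above 3.0, i.e., leptokurtic (theoretical ~4.92)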
Geometric Brownian Motion (GBM) is a continuous process in which the randomly varying quantity (in our example, the asset value) evolves over time with both a deterministic component (the drift) and a random component (the shock). The GBM can be represented as drift + shock:
ΔS = S·(μ·Δt + σ·ε·√Δt)
The asset drifts upward with the expected return μ over the time interval Δt. But the drift is also impacted by shocks from the random standard normal variable ε (the random shock), scaled by the volatility σ. Because the variance scales with time Δt, volatility is scaled with the square root of time, √Δt.
Expected drift is the deterministic component, while the shock is the random component in this stock price process simulation. The simulated outcomes form an empirical (rather than parametric) distribution of future values, and this Monte Carlo distribution can be used to calculate the VaR.
GBM assumes constant volatility (generally a weakness) unlike GARCH(1,1) which models
time-varying volatility.
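A minimal GBM Monte Carlo sketch (mine; the initial price, drift, volatility, horizon, and number of paths are all assumed inputs) produces exactly this kind of empirical distribution and a VaR relative to the mean:

    import random
    import statistics
    from math import sqrt

    random.seed(11)
    s0, mu, sigma = 100.0, 0.10, 0.30      # assumed initial price, annual drift, annual volatility
    steps, dt, paths = 250, 1 / 250, 10_000

    terminal = []
    for _ in range(paths):
        s = s0
        for _ in range(steps):
            eps = random.gauss(0, 1)
            s += s * (mu * dt + sigma * eps * sqrt(dt))      # discretized GBM: drift + shock
        terminal.append(s)

    terminal.sort()
    var_95 = statistics.fmean(terminal) - terminal[int(0.05 * paths)]   # 95% VaR relative to the mean
    print(round(var_95, 2))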
The inverse transform method translates a random number (drawn from a uniform distribution) into a normally distributed variable by inverting the cumulative standard normal distribution:
Random uniform → NORMSINV() (inverse of the normal CDF) → standard normal deviate (whose density is given by NORMDIST())
The bootstrap method is a subclass of (type of) historical simulation (like HS). In regular historical simulation, current portfolio weights are applied to the historical sample (e.g., 250 trading days). The bootstrap differs because it is "with replacement:" a historical period (i.e., a vector of daily returns on a given day in the historical window) is selected at random. This becomes the "simulated" vector of returns for day T+1. Then, for the day T+2 simulation, a daily vector is again selected from the window; it is "with replacement" because each simulated day can select from the entire historical sample. Unlike historical simulation—which replays each day in the window exactly once—the bootstrap can re-use a given historical day (or omit it entirely).
The advantages of the bootstrap include: can model fat-tails (like HS); by generating
repeated samples, we can ascertain estimate precision. Limitations, according to Jorion,
include: for small sample sizes, the bootstrapped distribution may be a poor
approximation of the actual one.
Randomize the historical date, but keep the same indexed returns within each date (this preserves cross-sectional correlations).
Monte Carlo simulation is a generation of a distribution of returns and/or asset prices paths by
the use of random numbers. Bootstrapping randomizes the selection of actual historical returns.
Able to account for a range of risks (e.g., price risk, volatility risk, and nonlinear
exposures)
Simple to implement
Once a price path has been generated, we can build a portfolio distribution at the end of the
selected horizon:
3. Calculate the value of the asset (or portfolio) under this particular sequence of prices at
the target horizon
This process creates a distribution of values. We can sort the observations and tabulate the expected value and the desired quantile; with 10,000 replications, the c-quantile is located at approximately the (c × 10,000)th sorted observation. Value at risk (VaR) relative to the mean is then the expected terminal value minus that quantile.
Pricing options
Options can be priced under the risk-neutral valuation method by using Monte Carlo simulation.
The current value of the derivative is obtained by discounting each simulated payoff at the risk-free rate and
averaging across all experiments:

$f = e^{-rT} \cdot \dfrac{1}{N}\sum_{i=1}^{N} F\!\left(S_T^{(i)}\right)$
This formula means that each simulated future payoff, F(S_T), is discounted at the risk-free rate
(i.e., taken to present value), and the average of those discounted values is the expected value, or
value of the option. The Monte Carlo method has several advantages. It can be applied in
many situations, including options with so-called price-dependent paths (i.e., where the value
depends on the particular path taken) and options with atypical payoff patterns. It is also powerful
and flexible enough to handle a wide variety of options, with one notable exception: it cannot
accurately price options where the holder can exercise early (e.g., American-style options).
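A sketch of risk-neutral Monte Carlo pricing for a plain-vanilla European call (all inputs are illustrative; for these inputs the Black-Scholes value is roughly 10.45, which the simulated average should approach as replications grow):

```python
import math
import random

random.seed(1)

def mc_european_call(s0=100.0, k=100.0, r=0.05, sigma=0.20, t=1.0, n=200_000):
    """Simulate S_T under the risk-neutral measure, then discount the average payoff."""
    disc = math.exp(-r * t)
    total = 0.0
    for _ in range(n):
        z = random.gauss(0, 1)
        s_t = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)
        total += max(s_t - k, 0.0)  # call payoff F(S_T)
    return disc * total / n         # average of discounted payoffs

print(f"Monte Carlo call value: {mc_european_call():.2f}")  # ~10.45 for these inputs
```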
The relationship between the number of replications and precision (i.e., the standard error of
estimated values) is not linear: to increase the precision by 10X requires approximately 100X
more replications. The standard error of the sample standard deviation:
$SE(\hat{\sigma}) = \sigma\,\sqrt{\dfrac{1}{2T}}$
Therefore, to increase VaR precision by a factor of k requires approximately k² times the number of
replications; e.g., 10× the precision needs 100× the replications.
Antithetic variable technique: changes the sign of the random samples. Appropriate
when the original distribution is symmetric. Creates twice as many replications at little
additional cost.
Importance sampling technique (Jorion calls this the most effective acceleration
technique): attempts to sample along more important paths
Stratified sampling technique: partitions the simulation region into two zones.
Cholesky factorization
By virtue of the inverse transform method, we can use =NORMSINV(RAND()) to create standard normal
random variables. The RAND() function returns a uniform random number on [0,1]. NORMSINV() translates
that random number into the z-value corresponding to the given cumulative probability. For example,
=NORMSINV(5%) returns -1.645 because 5% of the area under the normal curve lies to the left of -1.645
standard deviations.
But no realistic asset or portfolio contains only one risk factor. To model several risk factors, we
could simply generate multiple random variables. Put more technically, the realistic modeling
scenario is a multivariate distribution function that models multiple random variables. But the
problem with this approach, if we just stop there, is that correlations are not included. What we
really want to do is simulate random variables but in such a way that we capture or reflect the
correlations between the variables. In short, we want random but correlated variables.
The typical way to incorporate the correlation structure is by way of a Cholesky factorization (or
decomposition) . There are four steps:
1. The covariance matrix. This contains the implied correlation structure; in fact, a
covariance matrix can itself be decomposed into a correlation matrix and a volatility
vector.
2. The covariance matrix (R) is decomposed into a lower-triangle matrix (L) and an
upper-triangle matrix (U). Note they are mirrors (transposes) of each other: both have identical
diagonals, and their zero and nonzero elements are merely "flipped."
3. Given that R = LU, we can solve for all of the matrix elements: a, b, c (the diagonal) and x,
y, z. That is "by definition": a Cholesky decomposition is the solution that produces two
triangular matrices whose product is the original (covariance) matrix.
4. Given the solution for the matrix elements, we can multiply the triangular matrices to verify
that the product does equal the original covariance matrix (i.e., does LU = R?). Note that in
Excel a single array formula can be used with =MMULT().
The lower-triangle matrix (L) is the result of the Cholesky decomposition. It is the object we use to
simulate random variables that are "informed" by our covariance matrix.
The following transforms two independent random variables into correlated random variables:
$\epsilon_1 = \eta_1, \qquad \epsilon_2 = \rho\,\eta_1 + \sqrt{1-\rho^2}\;\eta_2$
Inputs: correlation = 0.75; mean = 1% and volatility = 10% for each series; starting value $10.00 for each series.

Correlated N(0,1) #1   Correlated N(0,1) #2   Series #1   Series #2
        2.06                   1.26            $10.00      $10.00
        0.52                  (0.73)           $10.62       $9.37
        1.51                   0.99            $12.34      $10.39
       (1.44)                  0.48            $10.68      $11.00
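A Python sketch of the two-variable case (reusing the illustrative inputs above: correlation 0.75, 1% mean, 10% volatility, $10 starting values). The transform ε₁ = η₁, ε₂ = ρη₁ + √(1−ρ²)η₂ is exactly the 2×2 Cholesky factor of the correlation matrix applied to independent draws:

```python
import math
import random

random.seed(11)

rho, mean, vol, steps = 0.75, 0.01, 0.10, 4
p1, p2 = 10.0, 10.0

for _ in range(steps):
    eta1, eta2 = random.gauss(0, 1), random.gauss(0, 1)  # independent N(0,1) draws
    eps1 = eta1                                          # first variable is retained
    eps2 = rho * eta1 + math.sqrt(1 - rho ** 2) * eta2   # second is recast to be correlated
    p1 *= 1 + mean + vol * eps1                          # price update: mean + vol * shock
    p2 *= 1 + mean + vol * eps2
    print(f"${p1:,.2f}   ${p2:,.2f}")
```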
If the variables are uncorrelated, randomization can be performed independently for each
variable. Generally, however, variables are correlated. To account for this correlation, we start
with a set of independent variables η, which are then transformed into correlated variables ε. In a
two-variable setting, we construct the transform shown above.
Prior to the transformation, η₁ and η₂ are independent; after it, ε₁ and ε₂ have the required correlation.
The first random variable is retained (ε₁ = η₁) and the second is transformed (recast) into a
random variable that is correlated with the first.
Instead of drawing independent samples, the deterministic scheme systematically fills the space
left by previous numbers in the series.
Monte Carlo simulations methods generate independent, pseudorandom points that attempt to
“fill” an N-dimensional space, where N is the number of variables driving the price of securities.
Researchers now realize that the sequence of points does not have to be chosen randomly. In a
deterministic scheme, the draws (or trials) are not entirely random. Instead of random trials,
this scheme fills space left by previous numbers in the series.
Scenario Simulation
The first step consists of using principal-component analysis to reduce the dimensionality of the
problem; i.e., to use the handful of factors, among many, that are most important.
The second step consists of building scenarios for each of these factors, approximating a normal
distribution by a binomial distribution with a small number of states.
However, a key drawback of the Monte Carlo method is its computational requirements: a large number
of replications is typically required (e.g., thousands of trials are not unusual).
Discuss how historical data and various weighting schemes can be used in
estimating volatility.
Assume that one period equals one day. You can either compute the “continuously compounded
daily return” or the “simple percentage change.” If Si-1 is yesterday’s price and Si is today’s price,
the continuously compounded return (ui) is given by:
$u_i = \ln\!\left(\dfrac{S_i}{S_{i-1}}\right)$
The simple percentage change is given by:
$u_i = \dfrac{S_i - S_{i-1}}{S_{i-1}}$
John Hull uses the simple percentage change, but Linda Allen uses the log return (continuously
compounded) because log returns are “time consistent.” Hull’s method is not incorrect;
rather, it is an acceptable approximation for short (daily) periods.
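A quick sketch of both return definitions (the price series is hypothetical); note how close the two measures are for small daily moves, which is why Hull's approximation is acceptable:

```python
import math

prices = [100.0, 101.0, 99.5, 102.0]  # hypothetical daily closes

for s_prev, s in zip(prices, prices[1:]):
    log_ret = math.log(s / s_prev)       # continuously compounded (Allen)
    simple_ret = (s - s_prev) / s_prev   # simple percentage change (Hull)
    print(f"log: {log_ret:+.4%}   simple: {simple_ret:+.4%}")
```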
The series can be either un-weighted (each return is equally weighted) or weighted. A weighted
scheme puts more weight on recent returns because they tend to be more relevant.
$\sigma_n^2 = \dfrac{1}{m-1}\sum_{i=1}^{m}\left(u_{n-i} - \bar{u}\right)^2$

where $\sigma_n^2$ is the variance rate per day, $m$ is the number of most recent observations, and $\bar{u}$ is the mean (average) of the daily returns $u_i$.
Hull, for practical purposes, makes two simplifying assumptions: the mean daily return ($\bar{u}$) is assumed to be zero, and the divisor (m − 1) is replaced by m.
How can there be two different formulas for sample variance? Recall (from Gujarati) these
are estimators: recipes intended to produce estimates of the true population variance.
There can be different recipes; although many will have undesirable properties.
The standard approach gives equal weight to each return. But to forecast, it is better to give
greater weight to more recent data. A generic model for this approach is given by a weighted
moving average:
$\sigma_n^2 = \sum_{i=1}^{m}\alpha_i\, u_{n-i}^2$
The alpha () parameters are simply weights; the sum of the alpha () parameters must equal
one because they are weights.
We can now add another factor to the model: the long-run average variance rate. The idea is
that the variance is “mean regressing:” think of the variance as having a “gravitational pull”
toward its long-run average. We add another term to the equation above in order to capture the
long-run average variance. The added term is the weighted long-run variance:
$\sigma_n^2 = \gamma V_L + \sum_{i=1}^{m}\alpha_i\, u_{n-i}^2$
The added term is gamma (the weight) multiplied by the long-run variance ($V_L$); i.e., the
long-run variance enters the model as a weighted factor.
This is known as an ARCH(m) model. Often omega ($\omega = \gamma V_L$) replaces the first term, giving a
re-formatted ARCH(m) model:

$\sigma_n^2 = \omega + \sum_{i=1}^{m}\alpha_i\, u_{n-i}^2$
Un-Weighted Scheme
$\sigma_n^2 = \dfrac{1}{m}\sum_{i=1}^{m} u_{n-i}^2$
Weighted Scheme
$\sigma_n^2 = \sum_{i=1}^{m}\alpha_i\, u_{n-i}^2$ (the weights must sum to one)
In exponentially weighted moving average (EWMA), the weights decline (in constant proportion,
given by lambda).
$\sigma_n^2 = (1-\lambda)\lambda^0 u_{n-1}^2 + (1-\lambda)\lambda^1 u_{n-2}^2 + (1-\lambda)\lambda^2 u_{n-3}^2 + \cdots$

The ratio between any two consecutive weights is the constant lambda ($\lambda$).
This infinite series reduces to the recursive exponentially weighted moving average (EWMA) form:
$\sigma_n^2 = \lambda\,\sigma_{n-1}^2 + (1-\lambda)\,u_{n-1}^2$
RiskMetricsTM Approach
RiskMetrics is a branded form of the exponentially weighted moving average (EWMA) approach:
$h_t = \lambda\, h_{t-1} + (1-\lambda)\, r_{t-1}^2$
The optimal (theoretical) lambda varies by asset class, but the overall optimal parameter used
by RiskMetrics has been 0.94. In practice, RiskMetrics only uses one decay factor for all series:
Technically, the daily and monthly models are inconsistent. However, they are both easy to use,
they approximate the behavior of actual data quite well, and they are robust to misspecification.
GARCH(1,1), EWMA, and RiskMetrics are each parametric and recursive.

$\sigma_n^2 = \lambda\,\sigma_{n-1}^2 + (1-\lambda)\, u_{n-1}^2$
GARCH(1,1) is the weighted sum of the long-run variance (weight = gamma), the most recent
squared return (weight = alpha), and the most recent variance (weight = beta):

$\sigma_n^2 = \gamma V_L + \alpha\, u_{n-1}^2 + \beta\,\sigma_{n-1}^2$
The mean reversion term is the product of a weight (gamma) and the long-run
(unconditional) variance. If gamma = 0, GARCH(1,1) “reduces” to EWMA
                                 Case 1        Case 2       Note
beta (b) or lambda               0.860         0.898        In both, most weight goes to the lagged variance
If EWMA (lambda only):
  1 - lambda                     0.140         0.102        In EWMA, there are only two weights
  sum of weights                 1.00          1.00
If GARCH(1,1) (alpha, beta, & gamma):
  omega (w)                      0.00000200    0.00000176   omega = gamma x long-run variance
  alpha (a)                      0.130         0.063        Weight on the lagged squared return
  alpha + beta (a+b)             0.9900        0.9602       “Persistence” of GARCH
  gamma                          0.010         0.040        Weight on long-run variance = 1 - alpha - beta
  sum of weights                 1.000         1.000
  Long-run variance              0.00020       0.00004      omega / (1 - alpha - beta)
  Long-run volatility            1.4142%       0.6650%      = SQRT(long-run variance)
GARCH(1,1):
  Updated variance               0.000236      0.000060     omega + beta x lagged variance + alpha x lagged return^2
  Updated volatility             1.53%         0.77%
Compared to EWMA, GARCH(1,1) has an extra term (omega). This extra term is the weighted
long-run average variance; i.e., the product of the long-run average variance and the weight
(gamma):
$\sigma_n^2 = \omega + \alpha\, u_{n-1}^2 + \beta\,\sigma_{n-1}^2$
We can solve for the long-run average variance as a function of omega and the weights (alpha,
beta):
$V_L = \dfrac{\omega}{1 - \alpha - \beta}$
Discuss how the parameters of the GARCH(1,1) and the EWMA models are
estimated using maximum likelihood methods.
In maximum likelihood methods we choose parameters that maximize the likelihood of the
observations occurring.
The expected future variance rate, t periods forward, is given by:

$E\!\left[\sigma_{n+t}^2\right] = V_L + (\alpha + \beta)^t\left(\sigma_n^2 - V_L\right)$
First, solve for the long-run variance. It is not 0.00008; this term is the product of the variance
and its weight. Since the weight must be 0.2 (= 1 - 0.1 -0.7), the long run variance = 0.0004.
$V_L = \dfrac{0.00008}{1 - 0.7 - 0.1} = 0.0004$
Second, we need the current variance (period n). That is almost given to us above:
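A sketch of the forward forecast using the figures just given (ω = 0.00008, α = 0.1, β = 0.7, so V_L = 0.0004); the current variance is an assumed input here because the original exhibit is not reproduced:

```python
omega, alpha, beta = 0.00008, 0.10, 0.70
v_long = omega / (1 - alpha - beta)  # long-run variance = 0.0004

def forecast_variance(current_var, t):
    """E[sigma^2_{n+t}] = V_L + (alpha + beta)^t * (sigma^2_n - V_L)."""
    return v_long + (alpha + beta) ** t * (current_var - v_long)

sigma2_n = 0.0003  # assumed current daily variance (illustrative)
for t in (1, 5, 20):
    print(f"t = {t:>2}: expected variance = {forecast_variance(sigma2_n, t):.6f}")
```

As t grows, the forecast decays toward the long-run variance of 0.0004 at the rate (α + β) = 0.8 per period.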
Discuss how correlations and covariances are calculated, and explain the
consistency condition for covariances.
Correlations play a key role in the calculation of value at risk (VaR). We can use methods similar to
the EWMA approach for volatility: an updated covariance estimate is a weighted sum of the prior
covariance estimate and the most recent cross-product of returns.
Risk varies over time. Models often assume a normal (Gaussian) distribution (“normality”) with
constant volatility from period to period. But actual returns are non-normal and volatility varies
over time (volatility is “time-varying” or “non-constant”). Therefore, it is hard to use parametric
approaches to random returns; in technical terms, it is hard to find robust “distributional
assumptions for stochastic asset returns”
Persistence: in EWMA, the lambda parameter (λ); in GARCH(1,1), the sum of the
alpha (α) and beta (β) parameters. High persistence implies slow decay toward
the long-run average variance.
Leptokurtosis: a fat-tailed distribution where relatively more observations are near the
middle and in the “fat tails” (kurtosis > 3).
Assume that one period equals one day. You can either compute the “continuously compounded
daily return” or the “simple percentage change.” If Si-1 is yesterday’s price and Si is today’s price,
$u_i = \ln\!\left(\dfrac{S_i}{S_{i-1}}\right) \qquad \text{or} \qquad u_i = \dfrac{S_i - S_{i-1}}{S_{i-1}}$
Linda Allen contrasts three periodic returns (i.e., continuously compounded, simple
percentage change, and absolute level change). She argues that continuously compounded returns
must be used when computing VaR because they are “time consistent” (except for interest-rate-related
variables, which use the absolute level change).
The series can be either un-weighted (each return is equally weighted) or weighted. A weighted
scheme puts more weight on recent returns because they tend to be more relevant.
For practical purposes, the above equation is often simplified with the following assumptions:
$\sigma_n^2 = \dfrac{1}{m}\sum_{i=1}^{m} u_{n-i}^2$
This simplified version replaces (m-1) with (m) in the denominator. (m-1) produces an
“unbiased” estimator and (m) produces a “maximum likelihood” estimator.
The standard approach gives no weight (or equal weight) to each return. But for forecasting
purposes, it is better to give greater weight to more recent data. A generic model for this
approach is given by a weighted moving average:
$\sigma_n^2 = \sum_{i=1}^{m}\alpha_i\, u_{n-i}^2$
The alpha (α) parameters are simply weights; the sum of the alpha (α) parameters must equal
one because they are weights. We can now add another factor to the model: the long-run average
variance rate. The idea here is that the variance is “mean regressing:” think of the variance as
having a “gravitational pull” toward its long-run average. We add another term to the equation
above in order to capture the long-run average variance. The added term is the weighted long-run variance:

$\sigma_n^2 = \gamma V_L + \sum_{i=1}^{m}\alpha_i\, u_{n-i}^2$

The added term is gamma (the weight) multiplied by the long-run variance ($V_L$); i.e., the
long-run variance enters the model as a weighted factor.
This is known as an ARCH(m) model. Often omega ($\omega = \gamma V_L$) replaces the first term, giving a
re-formatted ARCH(m) model:

$\sigma_n^2 = \omega + \sum_{i=1}^{m}\alpha_i\, u_{n-i}^2$
Risk measurement (VaR) concerns the tail of a distribution, where losses occur. We want to
impose a mathematical curve (a “distributional assumption”) on asset returns so we can
estimate losses. The parametric approach uses parameters (i.e., a formula with parameters) to
make a distributional assumption but actual returns rarely conform to the distribution curve. A
parametric distribution plots a curve (e.g., the normal bell-shaped curve) that approximates a
range of outcomes but actual returns are not so well-behaved: they rarely “cooperate.”
Know how to compute two-asset portfolio variance & scale portfolio volatility to derive VaR:
Inputs
Trading days per year: 252
Initial portfolio value (W): $100
VaR time horizon (h, days): 10
VaR confidence interval: 95%
Asset A: volatility 10.0% per year; expected return 12.0% per year; portfolio weight (w) 50%
Asset B: volatility 20.0% per year; expected return 25.0% per year; portfolio weight (1-w) 50%
Correlation (A,B): 0.30
Autocorrelation (h-1, h): 0.25 (if returns are independent, = 0; mean-reverting returns = negative)

Outputs (annual)
Covariance (A,B):                      0.0060
Portfolio variance:                    0.0155
Expected portfolio return (per year):  18.5%
Portfolio volatility (per year):       12.4%

Outputs (period of h days)
Expected periodic return (u):          0.73%
Std deviation (h), i.i.d.:             2.48%
Scaling factor:                        15.78    (not needed to memorize; used for the AR(1) adjustment)
Std deviation (h), autocorrelated:     3.12%    (standard deviation if autocorrelation is incorporated)
Normal deviate (critical z value):     1.64
Expected future value:                 $100.73
Relative VaR, i.i.d.:                  $4.08    (does not include the mean return)
Absolute VaR, i.i.d.:                  $3.35    (includes the return; i.e., loss from zero)
Relative VaR, AR(1):                   $5.12    (the corresponding VaRs if autocorrelation is
Absolute VaR, AR(1):                   $4.39     incorporated; note VaR is higher!)
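A sketch that reproduces the i.i.d. figures above (the AR(1) scaling factor is left out for brevity):

```python
import math

days, w0, h, z = 252, 100.0, 10, 1.645
vol_a, vol_b, ret_a, ret_b = 0.10, 0.20, 0.12, 0.25
w_a, w_b, rho = 0.5, 0.5, 0.30

cov_ab = vol_a * vol_b * rho                                                  # 0.0060
port_var = (w_a * vol_a) ** 2 + (w_b * vol_b) ** 2 + 2 * w_a * w_b * cov_ab  # 0.0155
port_vol = math.sqrt(port_var)                                                # ~12.4% per year
port_ret = w_a * ret_a + w_b * ret_b                                          # 18.5% per year

u = port_ret * h / days                 # expected 10-day return, ~0.73%
sd_h = port_vol * math.sqrt(h / days)   # 10-day volatility, ~2.48% (i.i.d. square-root rule)

relative_var = z * sd_h * w0            # ~$4.08, loss relative to the expected future value
absolute_var = (z * sd_h - u) * w0      # ~$3.35, loss relative to zero
print(f"Relative VaR: ${relative_var:.2f}   Absolute VaR: ${absolute_var:.2f}")
```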
Unstable: the parameters (e.g., mean, volatility) vary over time due to variability in
market conditions.
10 years of interest rate data are collected (1982 – 1993). The distribution plots the daily change
in the three-month treasury rate. The average change is approximately zero, but the “probability
mass” is greater at both tails. It is also greater at the mean; i.e., the actual mean occurs more
frequently than predicted by the normal distribution.
[Chart: histogram of daily rate changes versus the normal curve. Actual returns are: 1. skewed (3rd moment = skew); 2. fat-tailed (4th moment = kurtosis > 3); 3. unstable. The 1st moment (mean) gives the location of the distribution and the 2nd moment (variance) gives its scale.]
Conditional mean is time-varying; but this is unlikely given the assumption that
markets are efficient
Conditional volatility is time-varying; Allen says this is the more likely explanation!
Explain how outliers can really be indications that the volatility varies with time.
We observe that actual financial returns tend to exhibit fat-tails. Jorion (like Allen et al) offers
two possible explanations:
The true distribution is stationary. Therefore, fat-tails reflect the true distribution but the
normal distribution is not appropriate
The true distribution changes over time (it is “time-varying”). In this case, outliers can in
reality reflect a time-varying volatility.
A conditional distribution is not always the same: it differs depending on (is conditional on) some
economic, market, or other state. It is measured by parameters such as its conditional mean,
conditional standard deviation (conditional volatility), conditional skew, and conditional kurtosis.
A typical example is a regime-switching volatility model: the regime (state) switches from
low to high volatility, but is never in between. A distribution is “regime-switching” if it changes
from high to low volatility (and back) depending on the regime.
The problem: a risk manager may assume (and measure) an unconditional volatility but the
distribution is actually regime switching. In this case, the distribution is conditional (i.e., it
depends on conditions) and might be normal but regime-switching; e.g., volatility is 10% during a
low-volatility regime and 20% during a high-volatility regime but during both regimes, the
distribution may be normal. However, the risk manager may incorrectly assume a single 15%
unconditional volatility. But in this case, the unconditional volatility is likely to exhibit fat
tails because it does not account for the regime switching.
$\text{VaR}_{\$} = W_{\$}\, z\, \sigma \qquad\qquad \text{VaR}_{\%} = z\, \sigma$
The common attribute to all the approaches within this class is their use of historical time series
data in order to determine the shape of the conditional distribution.
This approach uses derivative pricing models and current derivative prices in order to impute
an implied volatility without having to resort to historical data. The use of implied volatility is
discussed further below.
Please note that Jorion’s taxonomy approaches from the perspective of local versus full
valuation. In that approach, local valuation tends to associate with parametric approaches:
[Diagram: Jorion’s risk-measurement taxonomy — linear models (local valuation: full covariance matrix, diagonal models; e.g., delta-normal) versus nonlinear models (gamma; full valuation via historical simulation or Monte Carlo simulation).]

[Diagram: taxonomy of volatility/VaR approaches — Parametric (GARCH(1,1), EWMA); Non-parametric (historical simulation, bootstrap, Monte Carlo, hybrid/semi-parametric HS + EWMA); Extreme value theory, EVT (peaks-over-threshold/GPD, block maxima/GEV); Implied volatility.]
Historical simulation is easy: we only need to determine the “lookback window.” The problem is
that, for small samples, the extreme percentiles (e.g., the worst one percent) are less precise.
Historical simulation effectively throws out useful information.
Historical standard deviation is the simplest and most common way to estimate or predict
future volatility. Given a history of an asset’s continuously compounded rate of returns we take
a specific window of the K most recent returns.
This standard deviation is called a moving average (MA) by Jorion. The estimate requires a
window of fixed length; e.g., 30 or 60 trading days. If we observe returns (rt) over M days, the
volatility estimate is constructed from a moving average (MA):
$\sigma_t^2 = \dfrac{1}{M}\sum_{i=1}^{M} r_{t-i}^2$
Each day, the forecast is updated by adding the most recent day and dropping the furthest day.
In a simple moving average, all weights on past returns are equal and set to (1/M). Note raw
returns are used instead of returns around the mean (i.e., the expected mean is assumed zero).
This is common in short time intervals, where it makes little difference on the volatility estimate.
For example, assume the previous four daily returns for a stock are 6% (n-1), 5% (n-2), 4% (n-3),
and 3% (n-4). What is the current volatility estimate, applying the moving average, given that
our short trailing window is only four days (m = 4)? If we square each return, the series is
0.0036, 0.0025, 0.0016, and 0.0009. If we sum this series of squared returns, we get 0.0086.
Divide by 4 (since m = 4) and we get 0.00215. That is the moving-average variance, so the
moving-average volatility is about 4.64%.
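The same four-return example as a short sketch:

```python
import math

returns = [0.06, 0.05, 0.04, 0.03]  # the four most recent daily returns
m = len(returns)

# Equal 1/m weights on raw squared returns (mean assumed zero).
ma_variance = sum(r ** 2 for r in returns) / m
ma_volatility = math.sqrt(ma_variance)
print(f"MA variance: {ma_variance:.5f}   MA volatility: {ma_volatility:.2%}")  # 0.00215, ~4.64%
```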
The moving average (MA) series is simple but has two drawbacks
The MA series ignores the order of the observations. Older observations may no
longer be relevant, but they receive the same weight.
The MA series has a so-called ghosting feature: data points are dropped arbitrarily due
to length of the window.
Modern methods place more weight on recent information. Both EWMA and GARCH place
more weight on recent information. Further, as EWMA is a special case of GARCH, both EWMA
and GARCH employ exponential smoothing.
Heteroskedastic (H): variances are not constant; they fluctuate over time.
GARCH regresses on “lagged” or historical terms. The lagged terms are either variance or
squared returns. The generic GARCH (p, q) model regresses on (p) squared returns and (q)
variances. Therefore, GARCH (1, 1) “lags” or regresses on last period’s squared return (i.e., just 1
return) and last period’s variance (i.e., just 1 variance).
$\sigma_t^2 = a + b\, r_{t-1,t}^2 + c\,\sigma_{t-1}^2$
$V_L$ is the long-run average variance; the first term is therefore a product, the weighted long-run
average variance ($\gamma V_L$, or equivalently $a$ above). The GARCH(1,1) model solves for the
conditional variance as a function of three variables (the previous variance, the previous squared
return, and the long-run variance).
A persistence of 1.0 implies no mean reversion. A persistence of less than 1.0 implies “reversion
to the mean,” where a lower persistence implies greater reversion to the mean.
As above, the sum of the weights assigned to the lagged variance and lagged squared
return is persistence (b+c = persistence). A high persistence (greater than zero but less
than one) implies slow reversion to the mean.
But if the sum of the weights assigned to the lagged variance and lagged squared return is greater
than one (if b + c > 1), the model is non-stationary and, according to Hull, unstable. In that case,
EWMA is preferred.
GARCH is both “compact” (i.e., relatively simple) and remarkably accurate. GARCH models
predominate in scholarly research. Many variations of the GARCH model have been attempted,
but few have improved on the original.
Note that omega is 0.2, but don’t mistake omega (0.2) for the long-run variance! Omega is the
product of gamma and the long-run variance. So, if alpha + beta = 0.9, then gamma must be
0.1. Given that omega is 0.2, we know that the long-run variance must be 2.0 (0.2 / 0.1 = 2.0).
EWMA
EWMA is a special case of GARCH (1,1) and GARCH(1,1) is a generalized case of EWMA. The
salient difference is that GARCH includes the additional term for mean reversion and EWMA
lacks a mean reversion. Here is how we get from GARCH (1,1) to EWMA:
Then we let a = 0 and (b + c) = 1, such that the above equation simplifies to
$\sigma_t^2 = b\, r_{t-1,t}^2 + c\,\sigma_{t-1}^2$ with $c = 1 - b$. This is now equivalent to the formula for the
exponentially weighted moving average (EWMA): $\sigma_t^2 = (1-\lambda)\, r_{t-1,t}^2 + \lambda\,\sigma_{t-1}^2$, where $\lambda = c$.
In EWMA, the lambda parameter now determines the “decay:” a lambda that is close to one
(high lambda) exhibits slow decay.
RiskMetrics is a branded form of the exponentially weighted moving average (EWMA) approach:
$h_t = \lambda\, h_{t-1} + (1-\lambda)\, r_{t-1}^2$
The optimal (theoretical) lambda varies by asset class, but the overall optimal parameter used
by RiskMetrics has been 0.94. In practice, RiskMetrics only uses one decay factor for all series:
Technically, the daily and monthly models are inconsistent. However, they are both easy to use,
they approximate the behavior of actual data quite well, and they are robust to misspecification.
Note: GARCH (1, 1), EWMA and RiskMetrics are each parametric and recursive.
$\sigma_n^2 = \lambda\,\sigma_{n-1}^2 + (1-\lambda)\, u_{n-1}^2$
Recursive EWMA
EWMA is (technically) an infinite series but the infinite series elegantly reduces to a recursive
form:
$\sigma_n^2 = (1-\lambda)\lambda^0 u_{n-1}^2 + (1-\lambda)\lambda^1 u_{n-2}^2 + (1-\lambda)\lambda^2 u_{n-3}^2 + \cdots$

which reduces to the recursive form:

$\sigma_n^2 = \lambda\,\sigma_{n-1}^2 + (1-\lambda)\, u_{n-1}^2$

With the RiskMetrics decay factor of 0.94:

$\sigma_n^2 = 0.94\,\sigma_{n-1}^2 + 0.06\, u_{n-1}^2$
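A sketch of one recursive RiskMetrics update with λ = 0.94 (the prior volatility and the latest return are assumed purely for illustration):

```python
import math

lam = 0.94             # RiskMetrics decay factor
prior_vol = 0.01       # assumed prior daily volatility estimate (1%)
latest_return = 0.02   # assumed most recent daily return (2%)

new_var = lam * prior_vol ** 2 + (1 - lam) * latest_return ** 2
print(f"updated daily volatility: {math.sqrt(new_var):.3%}")  # ~1.09%
```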
GARCH estimates can be more accurate than MA (moving average) estimates.
However, Linda Allen warns that GARCH(1,1) needs more parameters and may pose greater
MODEL RISK (it “chases a moving target”) when forecasting out-of-sample.
Graphical summary of the parametric methods that assign more weight to recent
returns (GARCH & EWMA)
$\sigma_n^2 = \gamma V_L + \alpha\, u_{n-1}^2 + \beta\,\sigma_{n-1}^2$
The three parameters are weights and therefore must sum to one:

$\gamma + \alpha + \beta = 1$
Be careful about the first term in the GARCH(1,1) equation: omega (ω) = gamma (γ) ×
(average long-run variance). If you are asked for the long-run variance, you may need to divide
omega by the weight in order to recover the average variance.
Determine when and whether a GARCH or EWMA model should be used in volatility
estimation
In practice, variance rates tend to be mean reverting; therefore, the GARCH (1, 1) model is
theoretically superior (“more appealing than”) to the EWMA model. Remember, that’s the big
difference: GARCH adds the parameter that weights the long-run average and therefore it
incorporates mean reversion.
GARCH (1, 1) is preferred unless the first parameter is negative (which is implied if alpha
+ beta > 1). In this case, GARCH (1,1) is unstable and EWMA is preferred.
Explain how the GARCH estimations can provide forecasts that are more accurate.
The moving average computes variance based on a trailing window of observations; e.g., the
previous ten days, the previous 100 days.
Ghosting feature: volatility shocks (sudden increases) are abruptly incorporated into the
MA metric and then, when the trailing window passes, they are abruptly dropped from
the calculation. Due to this the MA metric will shift in relation to the chosen window
length
More recent observations are assigned greater weights. This overcomes ghosting
because a volatility shock will immediately impact the estimate but its influence will fade
gradually as time passes
GARCH (1, 1) is unstable if the persistence > 1. A persistence of 1.0 indicates no mean reversion.
A low persistence (e.g., 0.6) indicates rapid decay and high reversion to the mean.
GARCH (1, 1) has three weights assigned to three factors. Persistence is the sum of the
weights assigned to both the lagged variance and lagged squared return. The other
weight is assigned to the long-run variance.
Therefore, if P (persistence) is high, then G (mean reversion) is low: the persistent series
is not strongly mean reverting; it exhibits “slow decay” toward the mean.
If P is low, then G must be high: the impersistent series does strongly mean revert; it
exhibits “rapid decay” toward the mean.
The average, unconditional variance in the GARCH (1, 1) model is given by:
$LV = \dfrac{a}{1 - b - c}$
Approach                          Advantages                                       Disadvantages
Historical simulation             Easiest to implement (simple, convenient)        Uses data inefficiently (much data is not used)
Multivariate density estimation   Very flexible: weights are a function of the     Onerous model: weighting scheme; conditioning
                                  state (e.g., economic context such as            variables; number of observations is not
                                  interest rates)                                  constant; data intensive
Hybrid approach                   Unlike the HS approach, better incorporates      Requires model assumptions (e.g., number of
                                  more recent information                          observations)
The key feature of multivariate density estimation is that the weights (assigned to historical
square returns) are not a constant function of time. Rather, the current state—as
parameterized by a state vector—is compared to the historical state: the more similar the states
(current versus historical period), the greater the assigned weight. The relative weighting is
determined by the kernel function:
$\sigma_t^2 = \sum_{i=1}^{K}\omega(x_{t-i})\, u_{t-i}^2$

where $\omega(\cdot)$ is the kernel weight assigned on the basis of the state vector $x_{t-i}$.
Where EWMA assigns the weight as an exponentially declining function of time (i.e., the
nearer to today, the greater the weight), MDE assigns the weight based on the nature of
the historical period (i.e., the more similar to the historical state, the greater the weight)
The hybrid approach is a variation on historical simulation (HS). Consider the ten (10)
illustrative returns below. In simple HS, the returns are sorted from best to worst (or worst to
best) and the quantile determines the VaR. Simple HS amounts to giving equal weight to each
return (last column). Given 10 returns, the worst return (-31.8%) earns a 10% weight under
simple HS.
Sorted     Periods   Hybrid    Cum’l Hybrid   Compare
Return     Ago       Weight    Weight         to HS
-31.8%     7         8.16%     8.16%          10%
-28.8%     9         6.61%     14.77%         20%
-25.5%     6         9.07%     23.83%         30%
-22.3%     10        5.95%     29.78%         40%
5.7%       1         15.35%    45.14%         50%
6.1%       2         13.82%    58.95%         60%
6.5%       3         12.44%    71.39%         70%
6.9%       4         11.19%    82.58%         80%
12.1%      5         10.07%    92.66%         90%
60.6%      8         7.34%     100.00%        100%
However, under the hybrid approach, the EWMA weighting scheme is instead applied. Since the
worst return happened seven (7) periods ago, its weight, assuming a lambda of 0.9 (90%), is given by
(1 − 0.9) × 0.9^(7−1) ÷ (1 − 0.9^10) ≈ 8.16%.
Note that because the return happened further in the past, this weight is below the 10% that would be
assigned under simple HS.
[Chart: cumulative hybrid weights versus equal HS weights across the ten sorted returns.]
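A sketch of the hybrid weight calculation behind the table above (λ = 0.9 over a 10-observation window; the weight for a return observed i periods ago is (1 − λ)λ^(i−1), rescaled so the ten weights sum to one):

```python
LAM, WINDOW = 0.90, 10

def hybrid_weight(periods_ago, lam=LAM, window=WINDOW):
    """EWMA-style weight for a return observed `periods_ago` days back, normalized over the window."""
    raw = (1 - lam) * lam ** (periods_ago - 1)
    total = sum((1 - lam) * lam ** (i - 1) for i in range(1, window + 1))
    return raw / total

print(f"weight on the worst return (7 periods ago): {hybrid_weight(7):.2%}")  # ~8.16%
print(f"weight on the most recent return:           {hybrid_weight(1):.2%}")  # ~15.35%
```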
In this case:
We are solving for the 95th percentile (95%) value at risk (VaR)
The HS 95% VaR = ~ 4.25% because it is the fifth-worst return (actually, the quantile can
be determined in more than one way)
However, the hybrid approach returns a 95% VaR of 3.08% because the “worst returns”
that inform the dataset tend to be further in the past (i.e., days ago = 76, 94, 86, 90…).
Due to this, the individual weights are generally less than 1%.
The question is: how do we compute VaR for a portfolio that consists of several positions?
The second approach is to extend the historical simulation (HS) approach to the portfolio:
apply today’s weights to yesterday’s returns. In other words, “what would have happened if we
held this portfolio in the past?”
The first approach (variance-covariance) requires the dubious assumption of normality—for the
positions “inside” the portfolio. The text says the third approach is gaining in popularity and is
justified by the law of large numbers: even if the components (positions) in the portfolio are not
normally distributed, the aggregated portfolio will converge toward normality.
To impute volatility is to derive it (to reverse-engineer it, really) from the observed
market price of the asset. A typical example uses the Black-Scholes option pricing model to
compute the implied volatility of a stock option; i.e., option traders will average the at-the-money
implied volatility from traded puts and calls.
This requires that a market mechanism (e.g., an exchange) can provide a market price for the
option. If a market price can be observed, then instead of solving for the price of an option, we
use an option pricing model (OPM) to reveal the implied (implicit) volatility. We solve (“goal
seek”) for the volatility that produces a model price equal to the market price:
$c_{\text{market}} = f(\sigma_{\text{ISD}})$
Where the implied standard deviation (ISD) is the volatility input into an option pricing model
(OPM). Similarly, implied correlations can also be “recovered” (reverse-engineered) from
options on multiple assets. According to Jorion, ISD is a superior approach to volatility
estimation. He says, “Whenever possible, VAR should use implied parameters” [i.e., ISD or
market implied volatility].
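A sketch of “goal-seeking” an implied volatility from a market price, using the Black-Scholes formula for a European call and simple bisection (all inputs are illustrative):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bs_call(s, k, r, t, sigma):
    """Black-Scholes price of a European call."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

def implied_vol(c_market, s, k, r, t, lo=1e-4, hi=3.0):
    """Bisection: find the sigma whose model price matches the observed market price."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if bs_call(s, k, r, t, mid) < c_market:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example: a call quoted at $10.45 with S = K = 100, r = 5%, T = 1 implies roughly 20% volatility.
print(f"implied volatility: {implied_vol(10.45, 100, 100, 0.05, 1.0):.2%}")
```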
The key idea refers to the application of the square root rule (S.R.R. says that variance scales
directly with time such that the volatility scales directly with the square root of time). The
square root rule, while mathematically convenient, doesn’t really work in practice because it
requires that normally distributed returns are independent and identically distributed (i.i.d.).
What I mean is, we use it on the exam, but in practice, when applying the square root rule to
scaling delta normal VaR/volatility, we should be sensitive to the likely error introduced.
Allen gives two scenarios that each illustrate “violations” in the use of the square root rule to
scale volatility over time:
Mean reversion in the asset dynamics. The price/return tends towards a long-run
level; e.g., interest rate reverts to 5%, equity log return reverts to +8%
Mean reversion in variance. Variance reverts toward a long-run level; e.g., volatility
reverts to a long-run average of 20%. We can also refer to this as negative
autocorrelation, but it's a little trickier. Negative autocorrelation refers to the fact that a
high variance is likely to be followed in time by a low variance. The reason it's tricky is
due to short/long timeframes: the current volatility may be high relative to the long run
mean, but it may be "sticky" or cluster in the short-term (positive autocorrelation) yet, in
the longer term it may revert to the long run mean. So, there can be a mix of (short-term)
positive and negative autocorrelation on the way being pulled toward the long run mean.
The simplest approach to extending the horizon is to use the “square root rule”
For example, if the 1-period VAR is $10, then the 2-period VAR is $14.14 ($10 x square root of 2)
and the 5-period VAR is $22.36 ($10 x square root of 5).
The square-root-rule: under the two assumptions below, VaR scales with the square root
of time. Extend one-period VaR to J-period VAR by multiplying by the square root of J.
The square root rule (i.e., variance is linear with time) only applies under restrictive i.i.d. assumptions.
The square-root rule for extending the time horizon requires two key assumptions:
Random-walk (acceptable)