Nassim Nicholas Taleb

Fat Tails and (Anti)fragility
Lectures on Probability, Risk, and Decisions in The Real World
Volume 1
DRAFT VERSION, MAY 2013
!!
2 | Fat Tails and (Anti)fragility - N N Taleb
Fat Tails and (Anti)fragility - N N Taleb| 3
Volume 1- MODEL ERROR & METAPROBABILITY
Volume 1 corresponds mostlly to topics
covered in The Black Swan, 2
nd
Ed.
Volume 2 will address those of
Antifragile and The Bed of Procrustes.
Note that the topics in this book are
Hwith rare exceptionsL discussed in AF, TBS ,
and BoP in verbal or philosophical form
Small vs Large World. The problem of formal probability theory covering very, very narrow situations in which formal proofs can be produced, particularly when facing 1)
preasymptotics, and 2) inverse problems. Yet we cannot afford to blow up. The Procrustean bed error is to impose the “small world” on the large, that is, reduce the large,
rather than the opposite.
Fat Tails and (Anti)fragility - N N Taleb| 5
I am currently teaching a class with the absurd title “risk management and decision-making in the real world”, a title I have selected myself;
this is a total absurdity since risk management and decision-making should never have to justify being about the real world, and what’s worse,
one should never be apologetic about it. In “real” disciplines, titles like “Safety in the Real World”, “Biology and Medicine in the Real World”
would be lunacies. But in social science all is possible as there is no exit from the gene pool for blunders, nothing to check the system. You
cannot blame the pilot of the plane or the brain surgeon for being “too practical”, not philosophical enough; those who have done so have exited
the gene pool. The same applies to decision making under uncertainty and incomplete information. The other absurdity in is the common
separation of risk and decision-making, as the latter cannot be treated in any way except under the constraint: in the real world.
And the real world is about incompleteness: incompleteness of understanding, representation, information, etc., what when does when one does
not know what's going on, or when there is a non-zero chance of not knowing what's going on. It is based on focus on the unknown, not the
production of mathematical certainties based on weak assumptions; rather measure the robustness of the exposure to the unknown, which can be
done mathematically through metamodel (a model that examines the effectiness and reliability of the model), what I call metaprobability, even
if the meta-approach to the model is not strictly probabilistic.
This first section presents a mathematical approach for dealing with errors in conventional risk models, taking the bulls***t out of some, add
robustness, rigor and realism to others. For instance, if a "rigorously" derived model (say Markowitz mean variance) gives a precise risk
measure, but ignores the central fact that the parameters of the model don't fall from the sky, but need to be discovered with some error rate,
then the model is not rigorous for risk management, decision making in the real world, or, for that matter, for anything. We need to add another
layer of uncertainty, which invalidates some models (but not others). The mathematical rigor is shifted from focus on asymptotic (but rather
irrelevant) properties to making do with a certain set of incompleteness. Indeed there is a mathematical way to deal with incompletness.
The focus is squarely on "fat tails", since risks and harm lie principally in the high-impact events, The Black Swan and some statistical methods
fail us there. The section ends with an identification of classes of exposures to these risks, the Fourth Quadrant idea, the class of decisions that
do not lend themselves to modelization and need to be avoided. Modify your decisions. The reason decision-making and risk management are
insparable is that there are some exposure people should never take if the risk assessment is not reliable, something people understand in real
life but not when modeling. About every rational person facing an plane ride with an unreliable risk model or a high degree of uncertainty about
the safety of the aircraft would take a train instead; but the same person, in the absence of skin-in-the game, when working as "risk expert"
would say: "well, I am using the best model we have" and use something not reliable, rather than be consistent with real-life decisions and
subscribe to the straightforward principle: "let's only take those risks for which we have a reliable model".
Finally, someone recently asked me to give a talk at unorthodox statistics session of the American Statistical Association. I refused: the
approach presented here is about as orthodox as possible, much of the bone of this author come precisely from enforcing rigorous standards of
statistical inference on process. Risk (and decisions) require more rigor than other applications of statistical inference.

General Problems
The Black Swan Problem- Incomputability of Small Probalility: It is is not merely that events in the tails of the distributions matter,
happen, play a large role, etc. The point is that these events play the major role and their probabilities are not computable, not reliable for
any effective use. And the smaller the probability, the larger the error, affecting events of high impact. The idea is to work with measures
that are less sensitive to the issue (statistics), or conceive exposures less affected by it (decision theory). Mathematically, the problem
arises from the use of degenerate metaprobability.
Problem Description Chaptersê
Sections
P1 Preasymptotics ê
Incomplete Convergence
The real world is before the asymptote. This affects
the applications Hunder fat tailsL of the Lawof Large
Numbers and the Central Limit Theorem.
2
P2 Inverse Problems aL The direction Model fl Reality produces larger biases
than Reality fl Model
bL Some models can be " arbitraged " in one direction,
not the other
*
.
7, 8
P3 Conflation
aL The statistical properties of an exposure,
f HxL are different fromthose of a r.v. x,
with very significant effects under nonlinearities
Hwhen f HxL convexL.
bL Exposures and decisions can be modified, not
probabilities.
1, 9
P4 Degenerate
Metaprobability
Uncertainty about the probability distributions can
be expressed as additional layer of uncertainty, or,
simpler, errors, hence nested series of errors on errors.
The Black Swan problemcan be summarized as
degenerate metaprobability.
Note : Fat Tails are a special case of degenerate
metaprobability Hsmall probabilities are convex to
perturbationsL.
4, 5, 6, 7,
8, 9
6 | Fat Tails and (Anti)fragility - N N Taleb
Problem Description Chaptersê
Sections
P1 Preasymptotics ê
Incomplete Convergence
The real world is before the asymptote. This affects
the applications Hunder fat tailsL of the Lawof Large
Numbers and the Central Limit Theorem.
2
P2 Inverse Problems aL The direction Model fl Reality produces larger biases
than Reality fl Model
bL Some models can be " arbitraged " in one direction,
not the other
*
.
7, 8
P3 Conflation
aL The statistical properties of an exposure,
f HxL are different fromthose of a r.v. x,
with very significant effects under nonlinearities
Hwhen f HxL convexL.
bL Exposures and decisions can be modified, not
probabilities.
1, 9
P4 Degenerate
Metaprobability
Uncertainty about the probability distributions can
be expressed as additional layer of uncertainty, or,
simpler, errors, hence nested series of errors on errors.
The Black Swan problemcan be summarized as
degenerate metaprobability.
Note : Fat Tails are a special case of degenerate
metaprobability Hsmall probabilities are convex to
perturbationsL.
4, 5, 6, 7,
8, 9
*Arbitrage of Probability Measure. A probability measure m
A
can be arbitraged if one can produce data from another probability measure and
systematically fool the observer into thinking that it is another one m
B
based on his metrics in assessing the validity of m
A
.
Associated Specific “Black Swan Blindness” Errors (Applying Thin-Tailed Metrics to Fat Tailed Domains)
These are shockingly common, arising from mechanistic reliance on software or textbook items (or a culture of bad statistical insight). I skip
the elementary “Pinker” error of mistaking journalistic fact-checking for scientific statistical “evidence” and focus on less obvious but equally
dangerous ones.
1. Overinference: Making an inference from fat-tailed data assuming sample size allows claims (very common in social science). Chapter 3.
2. Underinference: Assuming N=1 is insufficient under large deviations. Chapters 1 and 3.
(In other words both these errors lead to refusing true inference and accepting anecdote as "evidence")
3. Asymmetry: Fat-tailed probability distributions can masquerade as thin tailed (“great moderation”, “long peace”), not the opposite.
4. The econometric ( very severe) violation in using standard deviations and variances as a measure of dispersion without ascertaining the
stability of the fourth moment (1.7) . This error alone allows us to discard everything in economics/econometrics using s as irresponsible
nonsense (with a narrow set of exceptions).
5. Making claims about “robust” statistics in the tails. Chapter 1.
6. Assuming that the errors in the estimation of x apply to f(x) ( very severe).
7. Mistaking the properties of "Bets" and "digital predictions" for those of Vanilla exposures, with such things as "prediction markets".
Chapter 9.
8. Fitting tail exponents power laws in interpolative manner. Chapters 2, 6
9. Misuse of Kolmogorov-Smirnov and other methods for fitness of probability distribution. Chapter 1.
10. Calibration of small probabilities relying on sample size and not augmenting the total sample by a function of 1/p , where p is the
probability to estimate.
11. Considering ArrowDebreu State Space as exhaustive rather than sum of known probabilities § 1
General Principles of Decision Theory in The Real World (Developed in Part II, on Antifragility)
Principles Description
Pr 1 Dutch Book Probabilities need to add up to 1
*
.
Pr2 Asymmetry Some errors are largely one sided.
Pr3 Nonlinear
Response
Fragility is more measurable than
probability
**
.
Pr4 Conditional
Precautionary
Principle
Domain specific precautionary, based
on fat tailedness of errors and asymmetry
of payoff.
Pr5 Decisions Exposures and decisions can be modified,
not probabilities.
Fat Tails and (Anti)fragility - N N Taleb| 7
Principles Description
Pr 1 Dutch Book Probabilities need to add up to 1
*
.
Pr2 Asymmetry Some errors are largely one sided.
Pr3 Nonlinear
Response
Fragility is more measurable than
probability
**
.
Pr4 Conditional
Precautionary
Principle
Domain specific precautionary, based
on fat tailedness of errors and asymmetry
of payoff.
Pr5 Decisions Exposures and decisions can be modified,
not probabilities.
* This and the corrollary that there is a non-zero probability of visible and known states spanned by the probability distribution adding up to <1
confers to probability theory, when used properly, a certain analytical robustness.
**The errors in measuring nonlinearity of responses are more robust and smaller than those in measuring responses. (Transfer theorems)
Metaprobability: the two statements 1) “the probability of Rand Paul winning the election is 15.2%” and 2) the probability of getting n
odds numbers in N throws of a fair die is x%” are different in the sense that the first statement has higher undertainty about its probability,
and you know it may change under a alternative analysis or over time.
Metaprobability (Part I): adds another dimension to the probability distributions, as we consider the effect of a layer of uncertainty over the
probabilities. It results in large effects in the tails, but, visually, these are identified through changes in the “peak” at the center of the
distribution.

8 | Fat Tails and (Anti)fragility - N N Taleb
1
Risk is Not in The Past (the Turkey Problem)
This is an introductory chapter outlining the turkey problem, showing its presence in data, and explaining why an assessment of fragility
is more potent than data-based methods of risk detection.
1.1 Introduction: Fragility, not Statistics
Fragility (Chapter x) can be defined as an accelerating sensitivity to a harmful stressor: this response plots as a concave curve and mathemati-
cally culminates in more harm than benefit from the disorder cluster [(i) uncertainty, (ii) variability, (iii) imperfect, incomplete knowledge, (iv)
chance, (v) chaos, (vi) volatility, (vii) disorder, (viii) entropy, (ix) time, (x) the unknown, (xi) randomness, (xii) turmoil, (xiii) stressor, (xiv)
error, (xv) dispersion of outcomes, (xvi) unknowledge.
Antifragility is the opposite, producing a convex response that leads to more benefit than harm.
We do not need to know the history and statistics of an item to measure its fragility or antifragility, or to be able to predict rare and random
('black swan') events. All we need is to be able to assess whether the item is accelerating towards harm or benefit.
The relation of fragility, convexity and sensitivity to disorder is thus mathematical and not derived from empirical data.
Figure 1.1. The risk of breaking of the coffee cup is not necessarily in the past time series of the variable; if fact surviving objects have to have had a “rosy” past.
The problem with risk management is that “past” time series can be (and actually are) unreliable. Some finance journalist (Bloomberg) was
commenting on my statement in Antifragile about our chronic inability to get the risk of a variable from the past with economic time series.
“Where is he going to get the risk from since we cannot get it from the past? from the future?”, he wrote. Not really, think about it: from the
present, the present state of the system. This explains in a way why the detection of fragility is vastly more potent than that of risk -and much
easier to do.
Fat Tails and (Anti)fragility - N N Taleb| 9
Assymmetry and Insufficiency of Past Data.
Our focus on fragility does not mean you can ignore the past history of an object for risk management, it is just accepting that the past is
highly insufficient.
The past is also highly asymmetric. There are instances (large deviations) for which the past reveals extremely valuable information about
the risk of a process. Something that broke once before is breakable, but we cannot ascertain that what did not break is unbreakable. This
asymmetry is extremely valuable with fat tails, as we can reject some theories, and get to the truth by means of via negativa.
† This confusion about the nature of empiricism, or the difference between empiricism (rejection) and naive empiricism (anecdotal
acceptance) is not just a problem with journalism. Naive inference from time series is incompatible with rigorous statistical inference; yet
many workers with time series believe that it is statistical inference. One has to think of history as a sample path, just as one looks at a
sample from a large population, and continuously keep in mind how representative the sample is of the large population. While
analytically equivalent, it is psychologically hard to take the “outside view”, given that we are all part of history, part of the sample so to
speak.
General Principle To Avoid Imitative, Cosmetic Science:
From Antifragile (2012):
There is such a thing as nonnerdy applied mathematics: find a problem first, and figure out the math that works for it (just as one
acquires language), rather than study in a vacuum through theorems and artificial examples, then change reality to make it look like these
examples.
1.2 Turkey Problems
Turkey and Inverse Turkey (from the Glossary for Antifragile): The turkey is fed by the butcher for a thousand days, and every day the
turkey pronounces with increased statistical confidence that the butcher "will never hurt it"—until Thanksgiving, which brings a Black
Swan revision of belief for the turkey. Indeed not a good day to be a turkey. The inverse turkey error is the mirror confusion, not seeing
opportunities— pronouncing that one has evidence that someone digging for gold or searching for cures will "never find" anything
because he didn’t find anything in the past.
What we have just formulated is the philosophical problem of induction (more precisely of enumerative induction.)
1.3 Simple Risk Estimator
Let us define a risk estimator that we will work with throughout the book.
Definition 1.3.1: Let X be, as of time T, a standard sequence of n+1 observations, X= 8x
t0+iDt
<
0§i§n
(x
t
œ !, i œ "), as the discretely
monitored history of a stochastic process X
t
over the interval [t
0
, TD (with realizations at fixed interval Dt thus T=t
0
+n Dt). The
empirical estimator M
T
X
HA, f L is defined as
(1.1) M
T
X
HA, f L ª

i=0
n
1
A
f Hx
t0+i Dt
L

i=0
n
1
where 1
A
is an indicator function taking values 1 if x
t
œ A and 0 otherwise, and f is a function of x. For instance f(x)=1, f(x)=x, and f(x) = x
N
correspond to the probability, the first moment, and N
th
moment, respectively. A is the subset of the support of the distribution that is of
concern for the estimation.
Let us stay in dimension 1 for the next few chapters not to muddle things.
Standard Estimators tend to be variations about M
t
X
(A,f) where f(x) =x and A is defined as the domain of the distribution of X, standard
measures from x, such as moments of order z, etc., are calculated “as of period” T. Such measures might be useful for the knowledge of some
properties, but remain insufficient for decision making as the decision-maker may be concerned for risk management purposes with the left tail
(for distributions that are not entirely skewed, such as purely loss functions such as damage from earthquakes, terrorism, etc. ), or any arbitrarily
defined part of the distribution.
Standard Risk Estimators
Definition 1.3.2: The empirical estimator S for the unconditional shortfall S below K is defined as
S ª M
T
X
HA, f L, with A = H-¶, KE, f HxL = x
10 | Fat Tails and (Anti)fragility - N N Taleb
(1.2) S ª

i=0
n
1
A
X
t0+i Dt

i=0
n
1
an alternative method is to compute the conditional shortfall:
S’ ª E@M X < KD = M
T
X
HA, f L

i=0
n
1

i=0
n
1
A
(1.3) S’ =

i=0
n
1
A
X
t0+i Dt

i=0
n
1
A
One of the uses of the indicator function for 1
A
, for observations falling into a subsection A of the distribution, is that we can actually derive the
past actuarial value of an option with X as an underlying struck as K as M
T
X
HA, xL, with A=(-¶,K] for a put and A=[K, ¶) for a call, with f(x) =x
Criterion:
The measure M is considered to be an estimator over interval [t- N Dt, T] if and only if it holds in expectation over the period X
T+i Dt
for a given i>0, that is across counterfactuals of the process, with a threshold x (a tolerated divergence that can be a bias) so
|E[M
T+i Dt
X
(A,f)]- M
T
X
(A, f)]|< x . In other words, the estimator as of some future time, should have some stability around the "true" value of
the variable and stay below an upper bound on the tolerated bias.
We skip the notion of “variance” for an estimator and rely on absolute mean deviation so x can be the absolute value for the tolerated bias. And
note that we use mean deviation as the equivalent of a “loss function”; except that with matters related to risk, the loss function is embedded in
the subt A of the estimator.
This criterion is compatible with standard sampling theory. Actually, it is at the core of statistics. Let us rephrase:
Standard statistical theory doesn’t allow claims on estimators made in a given set unless these are made on the basis that they can
“generalize”, that is, reproduce out of sample, into the part of the series that has not taken place (or not seen), i.e., for time series, for t>t.
This should also apply in full force to the risk estimator. In fact we need more, much more vigilance with risks.
For convenience, we are taking some liberties with the notations, pending on context: M
T
X
HA, f L is held to be the estimator, or a conditional
summation on data but for convenience, given that such estimator is sometimes called “empirical expectation”, we will be also using the
same symbol, namely with M
>T
X
HA, f L for the estimated variable in cases M is the M-derived expectation operator ! or !
P
under real world
probability measure P, that is, given a probability space (W, !, P), and a continuously increasing filtration !
t
, !
s
Õ !
t
if s < t. the expecta-
tion operator adapted to the filtration !
T
.
1.4 Fat Tails, the Finite Moment Case
Fat tails are not about the incidence of low probability events, but the contributions of events away from the “center” of the distribution
to the total properties.
As a useful heuristic, take the ratio
E@x
2
D
E@ x D
(or more generally
M
T
X
HA,x
n
L
M
T
X
HA, x L
, n>1); the ratio increases with the fat tailedness of the distribu-
tion, under the condition that the distribution has finite moments up to n).
Simply, x
n
is a weighting operator that assigns a weight, x
n-1
large for large values of x, and small for smaller values.
Norms !
p
: More generally, the !
p
Normof a vector x = 8x
i
<
i=1
n
is defined as
(1.4) »» x »»
p
=
1
n

i=1
n
†x
i
§
p
1êp
p ¥ 1. For the Euclidian norm, p = 2. The normrises with higher values of p , as, for a > 0,
(1.5)
1
n

i=1
n
†x
i
§
p+a
1êHp+aL
r
1
n

i=1
n
†x
i
§
p
1êp
One property quite useful with power laws with infinite moments :
(1.6) !x¥

= MaxH8 x
i
<
i=1
n
L
Fat Tails and (Anti)fragility - N N Taleb| 11
Gaussian Case: For a Gaussian, where x ~ N(0,s), as we assume the mean is 0 without loss of generality,
(1.7)
HM
T
X
HA, x
N
LL
1êN
M
T
X
HA, x L
=
p
N-1
2 N K2
N
2
-1
HH-1L
N
+ 1L GI
N+1
2
MO
1
N
2
or, alternatively
(1.8)
M
T
X
HA, x
N
L
M
T
X
HA, x L
= 2
1
2
HN-3L
I1 + H-1L
N
M
1
s
2
1
2
-
N
2
G
N + 1
2
Where G(z) is the Euler gamma function; GHzL =
Ÿ
0

t
z-1
e
-t
dt.
For odd moments, the ratio is 0. For even moments:
(1.9)
M
T
X
HA, x
2
L
M
T
X
HA, x L
=
p
2
s
hence
(1.10)
M
T
X
HA, x
2
L = Standard Deviation
Mean Absolute Deviation
=
p
2
For a Gaussian the ratio ~ 1.25, and it rises from there with fat tails.
Example: Take an extremely fat tailed distribution with n=10
6
, observations are all -1 except for a single one of 10
6
,
X = 9-1, -1, ... - 1, 10
6
=
The mean absolute deviation, MAD (X) = 2. The standard deviation STD (X)=1000. The ratio standard deviation over mean deviation is
500.
As to the fourth moment:
M
T
X
HA, x
4
L
M
T
X
HA, x L
=3
p
2
s
3
For a power law distribution with tail exponent a=3, say a Student T
(1.11)
M
T
X
HA, x
2
L
M
T
X
HA, x L
=
Standard Deviation
Mean Absolute Deviation
=
p
2
Time
1.1
1.2
1.3
1.4
1.5
1.6
1.7
STDêMAD
Figure 1.2. Standard Deviation/Mean Deviation for the daily returns of the SP500 over the past 47 years, with a monthly window.
We will return to other metrics and definitions of fat tails with power law distributions when the moments are said to be “infinite”, that is, do
not exist. Our heuristic of using the ratio of moments to mean deviation only works in sample, not outside.
12 | Fat Tails and (Anti)fragility - N N Taleb
Infinite moments, say infinite variance, always manifest themselves as computable numbers in observed sample, yielding an estimator M,
simply because the sample is finite. A distribution, say, Cauchy, with infinite means will always deliver a measurable mean in finite
samples; but different samples will deliver completely different means.
The next two figures illustrate the “drifting” effect of M a with increasing information.
2000 4000 6000 8000 10000
T
-2
-1
1
2
3
4
M
T
X
HA, xL
2000 4000 6000 8000 10000
T
3.0
3.5
4.0
M
T
X
IA, x
2
O
Figure 1.3 The mean (left) and standard deviation (right) of two series with Infinite mean (Cauchy) and infinite variance (St(2)), respectively.
A Simple Heuristic to Create Mildly Fat Tails
Since higher moments increase under fat tails, as compared to lower ones, it should be possible so simply increase fat tails without increasing
lower moments.
Variance-preserving heuristic. Keep ![x
2
] constant and increase ![x
4
], by "stochasticizing" the variance of the distribution, since <x
4
> is
itself analog to the variance of < x
2
> measured across samples ( ![x
4
] is the noncentral equivalent of !AHx
2
- !@x
2
DL
2
E . Chapter x will do the
"stochasticizing" in a more involved way.
An effective heuristic to watch the effect of the fattening of tails is to simulate a random variable we set to be at mean 0, but with the following
variance-preserving : it follows a distribution N(0, s 1 - a ) with probability p =
1
2
and N(0, s 1 + a ) with the remaining probability
1
2
,
with 0 b a < 1 .
The characteristic function is
(1.12) fHt, aL =
1
2

-
1
2
H1+aL t
2
s
2
I1 + ‰
a t
2
s
2
M
Odd moments are nil. The second moment is preserved since
(1.13) MH2L = H-ÂL
2
!
t,2
fHtL
0
= s
2
and the fourth moment
(1.14) MH4L = H-ÂL
4
!
t,4
fHtL
0
= 3 Ia
2
+ 1M s
4
which puts the traditional kurtosis at 3 Ha
2
+ 1L. This means we can get an "implied a" from kurtosis. a is roughly the mean deviation of the
stochastic volatility parameter "volatility of volatility" or Vvol in a more fully parametrized form.
This heuristic is of weak powers as it can only raise kurtosis to twice that of a Gaussian, so it should be limited to getting some intuition about
its effects. Section 1.10 will present a more involved technique.
Fat Tails and (Anti)fragility - N N Taleb| 13
1.5 Scalable and Nonscalable, A Deeper View of Fat Tails
So far for the discussion on fat tails we stayed in the finite moments case. For a certain class of distributions, those with finite moments,
P>nK
P>K
depends on n and K. For a scale-free distribution, with K “in the tails”, that is, large enough,
P>n K
P>K
depends on n not K. These latter
distributions lack in characteristic scale and will end up having a Paretan tail, i.e., for X large enough, P
>X
= C X
-a
where a is the tail and C is
a scaling constant.
Table 1.1. The Gaussian Case, By comparison a Student T Distribution with 3 degrees of freedom,
reaching power laws in the tail. And a "pure" power law, the Pareto distribution, with an exponent
a=2.
K
1
P>K
Gaussian
P>K
P>
2 K
1
PK
St H3L
P>K
P>
2 K
St H3L
1
PK
Pareto
Ha = 2L
P>K
P>
2 K
Pareto
2 44.0 7.2µ10
2
14.4 4.97443 8.00 4.
4 3.16µ10
4
3.21µ10
4
71.4 6.87058 64.0 4.
6 1.01µ10
9
1.59µ10
6
216. 7.44787 216. 4.
8 1.61µ10
15
8.2µ10
7
491. 7.67819 512. 4.
10 1.31µ10
23
4.29µ10
9
940. 7.79053 1.00µ10
3
4.
12 5.63µ10
32
2.28µ10
11
1.61µ10
3
7.85318 1.73µ10
3
4.
14 1.28µ10
44
1.22µ10
13
2.53µ10
3
7.89152 2.74µ10
3
4.
16 1.57µ10
57
6.6µ10
14
3.77µ10
3
7.91664 4.10µ10
3
4.
18 1.03µ10
72
3.54µ10
16
5.35µ10
3
7.93397 5.83µ10
3
4.
20 3.63µ10
88
1.91µ10
18
7.32µ10
3
7.94642 8.00µ10
3
4.
Note: We can see from the scaling difference between the Student and the Pareto the conventional definition of a power law tailed distribution
is expressed more formally as P
>x
= L HxL X
-a
where L Hx) is a “slow varying function”, which satisfies lim
xض
L Ht xL
L x
=1 for all constants
t > 0.
For X large enough,
log HP>xL
log HxL
converges to a constant, the tail exponent -a. A scalable should show the slope a in the tails, as x Ø ¶
Gaussian
LogNormal-2
Student (3)
2 5 10 20
Log X
10
-13
10
-10
10
-7
10
-4
0.1
Log P
>X
Figure 1.5. Three Distributions. As we hit the tails, the Student remains scalable while the Standard Lognormal shows an intermediate position.
So far this gives us the intuition of the difference between classes of distributions. Only scalable have “true” fat tails, as others turn into a
Gaussian under summation. And the tail exponent is asymptotic; we may never get there and what we may see is an intermediate version of it.
The figure above drew from Platonic off-the-shelf distributions; in reality processes are vastly more messy, with switches between exponents.
Estimation issues: Note that there are many methods to estimate the tail exponent a from data, what is called a “calibration”. However, we will
see, the tail exponent is rather hard to guess, and its calibration marred with errors, owing to the insufficiency of data in the tails. In general, the
data will show thinner tail than it should.
We will return to the issue in Chapter 3.
Subexponential as definition of fat tailed distribution:
14 | Fat Tails and (Anti)fragility - N N Taleb
1. We introduced the category "true fat tails" as scalable power laws to differenciate it from the weaker one of fat tails as having higher
kurtosis than a Gaussian.
2. Some use as a cut point infinite variance, but Chapter 2 will show it not useful.
3. Another useful distinction: distributions are called sub exponential when they decline more slowly in the tails than the exponential. Let X
i

be i.i.d. random variables in !
+
, taking the exceedance probability P
>X
as short for P[X>x],
a) lim
xض
P >⁄
i§2§n
X
i
P> X
= n ,
which is equivalent to
b) lim
xض
P@>⁄
i=1
n
XiD
P@>Max 8Xi<
i§2§n
D
= 1.
The sum is of the same order as the maximum (positive) value, another way of saying that the tails play a large role. Now the standard
Lognormal belongs to the subexponential category, but just barely so (we used in the graph above Log Normal-2 as a designator as the tail
distribution function ~ K ‰
-b HlogHxL-mL
g
where g=2)
The Black Swan Problem: It is is not merely that events in the tails of the distributions matter, happen, play a large role, etc. The point is
that these events play the major role and their probabilities are not computable, not reliable for any effective use.
Why do we use Student T to simulate symmetric power laws? It is not that we believe that the generating process is Student T. Simply, the
center of the distribution does not matter much for the properties. The lower the exponent, the less the center plays a role. The higher the
exponent, the more the student T resembles the Gaussian, and the more justified its use will be accordingly. More advanced methods involving
the use of Levy laws may help in the event of asymmetry, but the use of two different Pareto distributions with two different exponents, one for
the left tail and the other for the right one would help.
Why power laws? There are a lot of theories on why things should be power laws, as sort of exceptions to the way things work probabilisti-
cally. But it seems that the opposite idea is never presented: power should can be the norm, and the Gaussian a special case as we will see in
Chapt x, of concave-convex responses (sort of dampening of fragility and antifragility, bringing robustness, hence thinning tails).
1.6 Different Approaches For Statistically Derived Estimators
There are broadly two separate ways to go about estimators: nonparametric and parametric.
The nonparametric approach: it is based on observed raw frequencies derived from sample-size n. Roughly, it sets a subset of events A and
M
T
X
HA, 1L (i.e., f(x) =1), so we are dealing with the frequencies jHAL =
1
n

i=0
n
1
A
. Thus these estimates don’t allow discussions on frequencies
j <
1
n
, at least not directly. Further the volatility of the estimator increases with lower frequencies. The error is a function of the frequency itself
(or rather, the smaller of the frequency j and 1-j). So if ⁄
i=0
n
1
A
=30 and n=1000, only 3 out of 100 observations are expected to fall into the
subset A, restricting the claims to too narrow a set of observations for us to be able to make a claim, even if the total sample n=1000 is deemed
satisfactory for other purposes. Some people introduce smoothing kernels between the various buckets corresponding to the various frequen-
cies, but in essence the technique remains frequency-based. So if we nest subsets A
1
ΠA
2
...Œ A, the expected “volatility” (as we will see later in
the chapter, we mean MAD, mean absolute deviation, not STD) of M
T
X
HA
z
, f L, E@ M
T
X
HA
z
, f L - M
>T
X
HA
z
, f L D §
E@ M
T
X
HA
<z
, f L - M
>T
X
HA
<z
, f L D for all functions f. (Proof via twinking of law of large numbers).
The parametric approach: it allows extrapolation but imprison the representation into a specific off-the-shelf probability distribution (which
can itself be composed of more probability distributions); so M
T
X
is an estimated parameter for use input into a distribution or model.
Both methods make is difficult to deal with small frequencies. The nonparametric for obvious reasons of sample insufficiency in the tails, the
parametric because small probabilities are very sensitive to parameter errors.
The problem of payoff and the Convex Payoff Sampling Error Inequality
This is the central problem of model error seen in consequences not in probability. The literature is used to discussing errors on probability
which should not matter much for small probabilities. But it matters for payoffs, as f can depend on x. Let us see how the problem becomes
very bad when we consider f and in the presence of fat tails. Simply, you are multiplying the error in probability by a large number, since fat
tails imply that the probabilities p(x) do not decline fast enough for large values of x. Now the literature seem to have examined errors in
probability, not errors in payoff.
Let M
T
X
HA
z
, f L be the estimator of a function of x in the subset A
z
= (d
1
, d
2
) of the support of the variable. Take x the mean absolute error in the
estimation of the probability in the small subset A
z
= (d
1
, d
2
), i.e., x= E M
T
X
HA
z
, 1L - M
>T
X
HA
z
, 1L . Assume f(x) is either linear or convex (but
not concave) in the form C+ L x
b
, with both L > 0 and b ¥ 1.
Fat Tails and (Anti)fragility - N N Taleb| 15
Then the estimation error of M
T
X
HA
z
, f L compounds the error in probability, thus giving us the lower bound in relation to x. This leads us to the
central inequality from convexity of payoff , which we shorten as Convex Payoff Sampling Error Inequality, CPSEI:
(1.15) EA M
T
X
HA
z
, f L - M
>T
X
HA
z
, f L E ¥ ILH d
2 Ô
d
1
L
b
+ b H d
2 Ô
d
1
L
b-1
d
1
- d
2
M EA M
T
X
HA
z
, 1L - M
>T
X
HA
z
, 1L E
Since
E@M
>T
X
HA
z
, f LD
E@M
>T
X
HA
z
, 1LD
=
Ÿ
d1
d2
f HxL pHxL „x
Ÿ
d1
d2
pHxL „x
The error on p(x) can be in the form of parameter mistake that inputs into p, say s the standard deviation (Chapter x and discussion of metaprob-
ability), or in the frequency estimation. Note now that if d
1
Ø-¶, we may have an infinite error on M
T
X
HA
z
, f L, the left-tail shortfall while, by
definition, the error on probability is necessarily bounded.
If you assume in addition that the distribution p(x) is expected to have fat tails (of any of the kinds seen in XXX.XXX and 0.0) , then the
problem becomes more acute.
Now the mistake of estimating the properties of x, then making a decisions for a nonlinear function of f(x), not realizing that the errors for f(x)
are different from those of x is extremely common. Naively, one needs a lot larger sample for f(x) when f(x) is convex than when f(x)=x. We
will re-examine it along with the “conflation problem” in PART II.
Example illustrating the, CPSEI, the Inequality of 1.15
A comparison of the "true" theoretical value compared to random samples drawn from the Student T with 3 degrees of freedom, for
M
T
X
IA, x
b
M, A =(-¶, -3], n=200, across m simulations (>10
5
), by estimating E M
T
X
IA, x
b
M - M
>T
X
IA, x
b
M using
x = ⁄
j=1
m

i=1
n
1A Ix
i
j
M
b
1A
- M
>T
X
IA, x
b
M . It produces the following table showing an explosive error x. We compare the effect to a Gausian with
matching standard deviation, namely 3 .
b x
StH3L
x
GJ0, 3 N
1 0.86 0.18
3
2
3.69 0.57
2 16.35 1.54
5
2
122.49 3.9
3 ¶ 9.64
(*I thank Albert Wenger for corrections of mathematical typos)
‹ Warning. Severe mistake! One should never make a decision involving M
T
X
HA
z
, f L and basing it on calculations for M
T
X
HA
z
, 1L ,
especially when f is convex, as it violates CPSEI. Yet many papers make such a mistake. The author is disputing, in Taleb (2013), the
results of Ilmanen (2013): naively Ilmanen (2013) took the observed probabilities of large deviations, f(x)=1 then made in inference for f(x)
an option payoff based on x, which can be extremely explosive (a error that can cause losses of several orders of magnitude the initial gain).
Chapter x revisits the problem in the context of nonlinear transformations of random variables.
1.7 The Mother of All Turkey Problems: How Economics Time Series Econometrics and
Statistics Don’t Replicate
(Debunking a Nasty Type of PseudoScience)
Something Wrong With Econometrics, as Almost All Papers Don’t Replicate. The next two reliability tests, one about parametric
methods the other about robust statistics, show that there is something wrong in econometric methods, fundamentally wrong, and that the
methods are not dependable enough to be of use in anything remotely related to risky decisions.
Performance of Standard Parametric Risk Estimators, f(x)= x
n
(Norm !2 )
With economic variables one single observation in 10,000, that is, one single day in 40 years, can explain the bulk of the "kurtosis", a measure
of "fat tails", that is, both a measure how much the distribution under consideration departs from the standard Gaussian, or the role of remote
events in determining the total properties. For the U.S. stock market, a single day, the crash of 1987, determined 80% of the kurtosis. The same
problem is found with interest and exchange rates, commodities, and other variables. The problem is not just that the data had "fat tails",
something people knew but sort of wanted to forget; it was that we would never be able to determine "how fat" the tails were within standard
methods. Never.
16 | Fat Tails and (Anti)fragility - N N Taleb
With economic variables one single observation in 10,000, that is, one single day in 40 years, can explain the bulk of the "kurtosis", a measure
of "fat tails", that is, both a measure how much the distribution under consideration departs from the standard Gaussian, or the role of remote
events in determining the total properties. For the U.S. stock market, a single day, the crash of 1987, determined 80% of the kurtosis. The same
problem is found with interest and exchange rates, commodities, and other variables. The problem is not just that the data had "fat tails",
something people knew but sort of wanted to forget; it was that we would never be able to determine "how fat" the tails were within standard
methods. Never.
The implication is that those tools used in economics that are based on squaring variables (more technically, the Euclidian, or !
2
norm),
such as standard deviation, variance, correlation, regression, the kind of stuff you find in textbooks, are not valid!scientifically!(except in some
rare cases where the variable is bounded). The so-called "p values" you find in studies have no meaning with economic and financial variables.
Even the more sophisticated techniques of stochastic calculus used in mathematical finance do not work in economics except in selected
pockets.
The results of most papers in economics based on these standard statistical methods are thus not expected to replicate, and they effectively
don't. Further, these tools invite foolish risk taking. Neither do alternative techniques yield reliable measures of rare events, except that we can
tell if a remote event is underpriced, without assigning an exact value.
From Taleb (2009), using Log returns,
X
t
ª log
P HtL
P Ht - i DtL
Take the measure M
t
X
HH-¶, ¶L, X
4
L of the fourth noncentral moment
M
t
X
HH-¶, ¶L, X
4
L ª
1
N

i=0
N
HX
t-i Dt
L
4
and the N-sample maximum quartic observation Max{X
t-i Dt
4
<
i=0
N
. Q(N) is the contribution of the maximum quartic variations over N samples.
(1.16)
QHNL ª
Max 8HX
t-i Dt
L
4
<
i=0
N

i=0
N
HX
t-i Dt
L
4
For a Gaussian (i.e., the distribution of the square of a Chi-square distributed variable) show Q(10
4
) the maximum contribution should be
around .008 ± .0028
Recall that, naively, the fourth moment expresses the stability of the second moment. And the second moment expresses the stability of the
measure mean across samples.
VARIABLE Q HMax Quartic Contr.L N HyearsL
Silver 0.94 46.
SP500 0.79 56.
CrudeOil 0.79 26.
Short Sterling 0.75 17.
Heating Oil 0.74 31.
Nikkei 0.72 23.
FTSE 0.54 25.
JGB 0.48 24.
Eurodollar Depo 1M 0.31 19.
Sugar Ò11 0.3 48.
Yen 0.27 38.
Bovespa 0.27 16.
Eurodollar Depo 3M 0.25 28.
CT 0.25 48.
DAX 0.2 18.
Description of the dataset:
All tradable macro markets data available as of August 2008, with “tradable” meaning actual closing prices corresponding to transactions
(stemming from markets not bureaucratic evaluations, includes interest rates, currencies, equity indices).
Fat Tails and (Anti)fragility - N N Taleb| 17
Figure 1.6. The entire dataset
Figure 1.7. Time Stability ... We have no idea how fat the tails are in a given period. All we know is that Kurtosis is 1) high, 2) unstable, hence we don’t know how high it is.
0.2
0.4
0.6
0.8
Monthly Vol
Figure 1.8. Montly delivered volatility in the SP500 (as measured by standard deviations). The only structure it seems to have comes from the fact that it is bounded at 0.
This is standard.
Figure 1.9. Montly volatility of volatility from the same dataset, predictably unstable.
The significance of nonexistent (that is, “infinite”) Kurtosis can be seen by generating data with finite variance and a tail exponent <4. Obvi-
ously, the standard deviation is meaningless a measure is is never expected to predict itself except in subsamples that are devoid of large
deviations.
T
M
T
X
IA, x
2
O
Figure 1.10. Monthly Delivered Volatility for Fake data, Randomly Generated data with infinite Kurtosis. Not too different from Figure x (SP500).
Performance of Standard NonParametric Risk Estimators, f(x)= x or |x| (Norm !1), A =(-•, K]
Does the past resemble the future in the tails? The following tests are nonparametric, that is entirely based on empirical probability distributions.
18 | Fat Tails and (Anti)fragility - N N Taleb
0.001 0.002 0.003 0.004
M@tD
0.001
0.002
0.003
0.004
M@t+1D
0.005 0.010 0.015 0.020 0.025 0.030
M@tD
0.005
0.010
0.015
0.020
0.025
0.030
M@t+1D
Figure 1.11. Comparing M[t-1, t] and M[t,t+1], where t= 1year, 252 days, for macroeconomic data using extreme deviations, A= (-¶ ,-2 standard deviations (equivalent)], f(x)
= x (replication of data from The Fourth Quadrant, Taleb, 2009)
Figure 1.12. The “regular” is predictive of the regular, that is mean deviation. Comparing M[t] and M[t+1 year] for macroeconomic data using regular deviations, A= (-¶ ,¶) ,
f(x)= |x|
Concentration of tail events
without predecessors
Concentration of tail events
without successors
0.0001 0.0002 0.0003 0.0004 0.0005
M@tD
0.0001
0.0002
0.0003
0.0004
M@t+1D
Figure 1.13 This are a lot worse for large deviations A= (-¶ ,-4 standard deviations (equivalent)], f(x) = x
So far we stayed in dimension 1. When we look at higher dimensional properties, such as covariance matrices, things get worse. We will return
to the point with the treatment of model error in mean-variance optimization.
When x
t
are now in "
N
, the problems of sensitivity to changes in the covariance matrix makes the estimator M extremely unstable. Tail events
for a vector are vastly more difficult to calibrate, and increase in dimensions.
Figure 1.14 Correlations are also problematic, which flows from the instability of single variances and the effect of multiplication of the values of random variables.
Fat Tails and (Anti)fragility - N N Taleb| 19
The Responses so far by members of the economics/econometrics establishment: “his books are too popular to merit attention”,
“nothing new” (sic), “egomaniac” (but I was told at the National Science Foundation that “egomaniac” does not apper to have a clear
econometric significance). No answer as to why they still use STD, regressions, GARCH, value-at-risk and similar methods.
Note that economists invoke “outliers” or “peso problem” as acknolwedging fat tails, yet ignore them analytically (outside of Poisson models
that we will see are not possible to calibrate except after the fact). Our approach here is exactly the opposite: do not pull outliers under the rug,
rather build everything around them. In other words, just like the FAA and the FDA who deal with safety by focusingon catastrophe avoidance,
we will throw away the ordinary under the rug and retain extremes as the sole sound approach to risk management. And this extends beyond
safety since much of the analytics and policies that can be destroyed by tail events are unusable.
Lack of Skin in the Game. Indeed one wonders why econometric methods can be used while being wrong, so shockingly wrong, how
“University” researchers (adults) can partake of such a scam. Basically they capture the ordinary and mask higher order effects. Since
blowups are not frequent, these events do not show in data and the researcher looks smart most of the time while being fundamentally
wrong. At the source, researchers, “quant” risk manager, and academic economist do not have skin in the game so they are not hurt by
wrong risk measures: other people are hurt by them. And the scam should continue perpetually so long as people are allowed to harm
others with impunity. More in Appendix X.
1.8 Metrics Outside !
2

We can see from the data that the predictability of the Gaussian-style cumulants is low, the mean deviation of mean deviation is ~70% of the
mean deviation of the standard deviation (in sample, but the effect is much worse in practice); working with squares is not a good estimator.
Many have the illusion that we need variance: we don’t , even in finance and economics (especially in finance and economics).
We propose different cumulants, that should exist whenever the mean exists. So we are not in the dark when we refuse standard deviation. It is
just that these cumulants require more computer involvement and do not lend themselves easily to existing Platonic distributions. And, unlike in
the conventional Brownian Motion universe, they don’t scale neatly.
C
0
=
1
T

i=1
T
x
i
C
1
=
1
T - 1

i=1
T
x
i
- C
0
produces the Mean Deviation Hbut centered by the mean, the first momentL.
C
2
=
1
T - 2

i=1
T
x
i
- Co -C
1
produces the mean deviation of the mean deviation.
.
.
C
N
=
1
T - N - 1

i=1
T
... x
i
- Co -C
1
-C
2
... - C
N
The next table shows the theoretical first two cumulants for two symmetric distributions: a Gaussian, N (0,s) and a symmetric Student T
StH0, s, aL with mean 0, a scale parameter s, the PDF for x is p(x)=
1
a s BJ
a
2
,
1
2
N
J
a
a+HxêsL
2
N
Ha+1Lê2
As to the PDF of the Pareto distribution, p(x) =
s
a
x
-a-1
a for x ¥ s (and the mean will be necessarily positive).
20 | Fat Tails and (Anti)fragility - N N Taleb
Distr Mean C
1
C
2
Gaussian 0
2
p
s 2 ‰
-1êp
2
p
K1 - ‰
1
p erfcK
1
p
OO s
Pareto a
a s
a-1
2 Ha - 1L
a-2
a
1-a
s
ST a=3ê2 0
2
6
p
s GJ
5
4
N
GJ
3
4
N
8 3 GJ
5
4
N
2
p
3ê2
ST Square a=2 0 2 s s -
s
2
ST Cubic a=3 0
2 3 s
p
8 3 s tan
-1
J
2
p
N
p
2
where erfc is the complimentary error function erfcHzL = 1 -
2
p
Ÿ
0
z
e
-t
2
dt.
These cumulants will be useful in areas for which we do not have a good grasp of convergence of the mean of observations.
1.9 Typical Manifestations of The Turkey Surprise
200 400 600 800 1000
-50
-40
-30
-20
-10
10
Figure 1.15. The Turkey Problem (The Black Swan, 2007/2010), where nothing in the past properties seems to indicate the possibility of the jump.
1.10 Fattening of Tails Through the Approximation of a Skewed Distribution for the
Variance
We can improve on the fat-tail heuristic in 1.4 (which limited the kurtosis to twice the Gaussian) as follows.
Switch between Gaussians with :
s
2
H1 + aL
s
2
H1 + bL
with probability p
with probability H1 - pL
with both a and b œ (-1,1) and b= -a
p
1-p
, giving a characteristic function fHt, aL = p ‰
-
1
2
Ha+1L s
2
t
2
- Hp - 1L ‰
-
s
2
t
2
Ha p+p-1L
2 Hp-1L
, with Kurtosis =
3Ha
2
- 1L
H1+pL
1-p
s
4
, thus allowing polarized states and high kurtosis, all variance preserving, conditioned on, when a > (<) 0, a < (>)
1-p
p
.
Thus with p= 1/1000, and the maximum possible a = 999, kurt can reach as high a level as 3000 .
This heuristic approximates quite effectively the probability effect of a lognormal weighting for the characteristic function f(t, V) =
Ÿ
0
¶ ‰
-
t
2
v
2
-
-v0+
Vv
2
2
+Log@vD
2
2 Vv
2
2 p v Vv
„v , where v is the variance and Vv is the second order variance, often called volatility of volatility.
Fat Tails and (Anti)fragility - N N Taleb| 21
This heuristic approximates quite effectively the probability effect of a lognormal weighting for the characteristic function f(t, V) =
Ÿ
0
¶ ‰
-
t
2
v
2
-
-v0+
Vv
2
2
+Log@vD
2
2 Vv
2
2 p v Vv
„v , where v is the variance and Vv is the second order variance, often called volatility of volatility.
1.11 How To Arbitrage Kolmogorov-Smirnov
The Problem of Self-Reference
Counterintuitively, when one raises the kurtosis, the time series looks “quieter”. Because the storms are rare but deep. This leads to mistaken
illusion of low volatility when in fact it is just high kurtosis, something that fooled people big-time with the story of the “great moderation” as
risks were accumulating and nobody was realizing that fragility was increasing, like dynamite accumulating under the structure.
Let us use a switching process producing Kurtosis= 10
7
(from p= 1/2000, a sligtly below the upper bound a=
1-p
p
-1) and compare it to the
regular case p=0, a=1, the kurtosis of 3.
200 400 600 800 1000
time
S
Fat tails
Thin tails
200 400 600 800 1000
time
S
Fat tails
Thin tails
Figure 1.17 N=1000. Sample simulation. Both series have the exact same means and variances at the level of the generating process.
Figure 1.18 N=1000. Another simulation. there is 1/2 chance of seeing the real properties.
Kolmogorov - Smirnov, Shkmolgorov-Smirnoff
Remarkably, the fat tailed series passes the Kolomogorov-Smirnov test of normality with better marks than the thin-tailed one, since it displays
a lower variance. The problem discussed with with Avital Pilpel (Taleb and Pilpel, 2001, 2004, 2007) is that Kolmogorov-Smirnov and similar
tests of normality are inherently self-referential.
These probability distributions are not directly observable, which makes any risk calculation suspicious since it hinges on knowledge about
these distributions. Do we have enough data? If the distribution is, say, the traditional bell-shaped Gaussian, then yes, we may say that we
have sufficient data. But if the distribution is not from such well-bred family, then we do not have enough data. But how do we know which
distribution we have on our hands? Well, from the data itself.
If one needs a probability distribution to gauge knowledge about the future behavior of the distribution from its past results, and if, at the
same time, one needs the past to derive a probability distribution in the first place, then we are facing a severe regress loop–a problem of
self reference akin to that of Epimenides the Cretan saying whether the Cretans are liars or not liars. And this self-reference problem is
only the beginning.
(Taleb and Pilpel, 2001, 2004)
Also,
From the Glossary in The Black Swan. Statistical regress argument (or the problem of the circularity of statistics): We need data to
discover a probability distribution. How do we know if we have enough? From the probability distribution. If it is a Gaussian, then a few
points of data will suffice. How do we know it is a Gaussian? From the data. So we need the data to tell us what probability distribution to
assume, and we need a probability distribution to tell us how much data we need. This causes a severe regress argument, which is
somewhat shamelessly circumvented by resorting to the Gaussian and its kin.
A comment on the Kolmogorov Statistic
Note that the Kolmogorov-Smirnov test doesn’t affect payoffs and higher moments, as it only focuses on probabilities. But that’s not the only
problem. It is, as we mentioned, conditioned on sample size while claiming to be nonparametric.
Let us see how it works. Take the historical series and find the maximum point of divergence with F(.) the cumulative of the proposed distribu-
tion to test against:
(1.17) D = Max :
1
j

i=1
J
X
t0+i Dt
- F IX
t0+j Dt
M >
j=1
n
where n =
T - t
0
Dt
22 | Fat Tails and (Anti)fragility - N N Taleb
Figure 1.19 The Kolmorov-Smirnov Gap. D is the measure of the largest absolute divergence between the candidate and the target distribution.
We will get more technical in the discussion of convergence, take for now that the Kolmogorov statistic, that is, the distribution of D, is
expressive of convergence, and should collapse with n. The idea is that, by a Brownian Bridge argument (that is a process pinned on both sides,
with intermediate steps subjected to double conditioning), D
j
= K

i=1
J
XDt i+t
0
j
- FIX
Dt j+t0
MO follows a Uniform distribution
1
.
The probability of Exceeding D, P
>D
ª HJ n DN where H is the cumulative distribution function of the Kolmogorov - Smirnov distribution,
HHtL = 1 - 2 ‚
i=1

H-1L
i-1
e
-2 i
2
t
2
We can see that the main idea reposes on a decay of n D with large values of n. So we can fool the testing by proposing distributions with a
small probability of very large jump, where the probability of switch d
1
n
.
J. L. Doob, Heuristic approach to the Kolmogorov–Smirnov theorems, Ann. Math. Statistics, 20 (1949), pp. 393–403.
Table of the "fake" Gaussian when not busted
Let us run a more involved battery of statistical tests (but consider that it is a single run, one historical simulation).
Comparing the Fake and genuine Gaussians (Figure 1.17) and subjecting them to a battery of tests:
Fake
Statistic P-Value
Anderson-Darling 0.406988 0.354835
Cramér-von Mises 0.0624829 0.357839
Jarque-Bera ALM 1.46412 0.472029
Kolmogorov-Smirnov 0.0242912 0.167368
Kuiper 0.0424013 0.110324
Mardia Combined 1.46412 0.472029
Mardia Kurtosis -0.876786 0.380603
Mardia Skewness 0.7466 0.387555
Pearson c
2
43.4276 0.041549
Shapiro-Wilk 0.998193 0.372054
Watson U
2
0.0607437 0.326458
Genuine
Statistic P-Value
Anderson-Darling 0.656362 0.0854403
Cramér-von Mises 0.0931212 0.138087
Jarque-Bera ALM 3.90387 0.136656
Kolmogorov-Smirnov 0.023499 0.204809
Kuiper 0.0410144 0.144466
Mardia Combined 3.90387 0.136656
Mardia Kurtosis -1.83609 0.066344
Mardia Skewness 0.620678 0.430795
Pearson c
2
33.7093 0.250061
Shapiro-Wilk 0.997386 0.107481
Watson U
2
0.0914161 0.116241
Table of the "fake" Gaussian when busted
And of course the fake Gaussian when caught (Figure 1.18 ). But recall that we have a small chance of observing the true distribution.
Busted Fake
Statistic P-Value
Anderson-Darling 376.059 0.
Cramér-von Mises 80.7342 0.
Jarque-Bera ALM 4.21447µ10
7
0.
Kolmogorov-Smirnov 0.494547 0.
Kuiper 0.967477 0.
Mardia Combined 4.21447µ10
7
0.
Mardia Kurtosis 6430.62 1.565108283892229µ10
-8979680
Mardia Skewness 166432. 1.074929530052337µ10
-36143
Pearson c
2
30585.7 3.284407132415535µ10
-6596
Shapiro-Wilk 0.0145449 1.91137µ10
-57
Watson U
2
80.5877 0.
Fat Tails and (Anti)fragility - N N Taleb| 23
Busted Fake
Statistic P-Value
Anderson-Darling 376.059 0.
Cramér-von Mises 80.7342 0.
Jarque-Bera ALM 4.21447µ10
7
0.
Kolmogorov-Smirnov 0.494547 0.
Kuiper 0.967477 0.
Mardia Combined 4.21447µ10
7
0.
Mardia Kurtosis 6430.62 1.565108283892229µ10
-8979680
Mardia Skewness 166432. 1.074929530052337µ10
-36143
Pearson c
2
30585.7 3.284407132415535µ10
-6596
Shapiro-Wilk 0.0145449 1.91137µ10
-57
Watson U
2
80.5877 0.
Remark 1.11 (via negativa, arbitrage of probability measures). Note that these statistics are effective (and robust) at rejecting, but
not at accepting, as a single large deviation allowed the rejection with extremely satisfactory margins (a near-infinitesimal P-Value). This
shows how absence of evidence " evidence of absence.
As we saw these test statistics can be arbitraged, or "fooled" one way, not the other.
We will revisit the problem upon discussing the "N=1" fallacy (that is the fallacy of thinking that N=1 is systematically insufficient sample).
Some social “scientists” (Herb Gintis, a representative imbecile) wrote about my approach to this problem, stating among other equally ignorant
comments, something to the effect“the plural of anecdotes is not data”. This elementary violation of the logic of inference from data is very
common with social scientists as we will see in Chapter 3, as their life is based on mechanistic approaches that miss the asymmetry.
24 | Fat Tails and (Anti)fragility - N N Taleb
A
Appendix: Payoff Skewness and Lack of Skin-in-the-
Game
Anyone not responsible for dowside payoffs has an incentive to hide the risks in the tails. This shows the necessity of the Hammurabi rule.
In addition to risk takers, this affects predictors who should never be judged on how frequently they get things right, or, alternatively,
should never be judged at all. The discussion is simplified mathematically but is general enough as it is; we will revisit throughout this
book under more complex structures.
Section 3.3 presents the effect of the statistical properties not showing with the left tail. Let us limit the discussion here to the asymmetry in
incentives.
Remark A-1 (Asymmetry; Transfer of Harm): If an agent has the upside of the payoff of the random variable, with no downside, and is
judjed solely on the basis of past performance, then the incentive is to hide risks in the left tail. This can be generalized to any payoff for
which one does not bear the full risks and negative consequences of one's actions.
The upside does not have to be in the form of a direct monetary compensation. For a bureaucrat it could be promotion. For a politician it could
be in the form of re-election.
If an operator can only “stay in the game” by being profitable the previous period, and is not penalized by losses, then his objective is to
maximize the stream of payoffs by shooting for a high probability of payoff, not high expectation.
Let PHK, ML be the payoff for the operator over M incentive periods
(1.18) PHK, ML ª ‚
i=1
M
Ix
i+t
j
Dt
- KM
+
1
xDt Hi-1L+t >K
with X
j
= {x
i+t
j
Dt
}
i=1
M
, the distribution of profits i, j œ #, x
t
j
œ ", and K is a “hurdle”, 1
xDt Hi-1L+t >K
is a stopping time, the condition of having
performed above K in the previous year, otherwise the number of positive incentives stops.
Figure E.1. The most efficient payoff to maximize P(N)
The fact is that how negative x
Dt i+t
is does not matter at all for P, so the more negative skewness the better.
Let 9 f
j
= be the family of probability measures f
j
of X, each corresponding to certain mean/skewness characteristics, and split in half on both
sides of K, f
j
+
=
Ÿ
K

f
j
HxL „x and f
j
-
=
Ÿ

K
f
j
HxL „x , the "upper" and "lower" distributions, each corresponding to certain expectation
!
j
+
=
Ÿ
K

x f
j
HxL „x and !
j
-
=
Ÿ

K
x f
j
HxL „x.
Fat Tails and (Anti)fragility - N N Taleb| 25
Let 9 f
j
= be the family of probability measures f
j
of X, each corresponding to certain mean/skewness characteristics, and split in half on both
sides of K, f
j
+
=
Ÿ
K

f
j
HxL „x and f
j
-
=
Ÿ

K
f
j
HxL „x , the "upper" and "lower" distributions, each corresponding to certain expectation
!
j
+
=
Ÿ
K

x f
j
HxL „x and !
j
-
=
Ÿ

K
x f
j
HxL „x.
Now define n as a nonparametric measure of skewness, n ª
!
j
ê
!
j
+
and let us assume a “fair game”, that is
(1.19) !
j
+
f
j
+
= !
j
-
f
j
-
(1.20) n
j
=
f
j
+
f
j
-
=
f
j
+
1 - f
j
+
Intuitively, skewness has probabilities and expectations moving in opposite directions: the larger the negative payoff, the smaller the probability
to compensate.
Taking expectations in 1.18 , since !
j,i
+
= !
j
+
, E@PHK, MLD = !
j
+

i=1
M
f
j
+
1
xDt Hi-1L+t >K
(1.21) !@PHK, MLD = !
j
+
f
+
HH f
+
L
M
- 1L
f
+
- 1
lim
Mض
!HPHK, MLL = !
j
+
f
+
1 - f
+
= !
j
+
n
j
which maximizes under the distribution j with the most negative skewness.
Hence the total expectation of the positive-incentive without-skin-in-the-game depends on negative skewness. The operator has the incentive to
maximize negative left-tailedness, nothing else.
We can prove that it is insensitive to the total expectation by lifting the constraint !
j
+
f
j
+
= !
j
-
f
j
-
.
Forecasters: We can see how forecasters who do not have skin in the game have the incentive of betting on the high probability event, and
ignoring the lower probability ones. The confusion between “digital payoffs”
Ÿ
f
j
HxL „x and full distribution, called “vanilla payoffs”,
Ÿ
x f
j
HxL „x is examined in more depth in Chapter 9.
26 | Fat Tails and (Anti)fragility - N N Taleb
B
Appendix: Special Cases of Fat Tails
Multimodality and Fat Tails, or the War and Peace Model
We noted in 1.x that the distribution gains in fat tailedness (as expressed by kurtosis) by stochasticizing, ever so mildly, variances. But we
maintained the same mean.
But should we stochasticize the mean as well, and separate the potential outcomes wide enough, so that we get many modes, the kurtosis would
drop. And if we associate different vairances with different means, we get a variety of “regimes”, each with its set of probabilities.
Either the very meaning of "fat tails" loses its significance, or takes on a new one where the "middle", around the expectation ceases to exist.
Now there are plenty of distributions in real life composed of many possible regimes, or states, and, assuming finite moments to all states, s
1
a
calm regime, with expected mean m
1
and standard deviation s
1
, s
2
a violent regime, with expected mean m
2
and standard deviation s
2
, and
more. Each state has its probability p
i
.
Assume, to simplify a one-period model, as if one was standing in front of a discrete slice of history, looking forward at outcomes. (Adding
complications (transition matrices between different regimes) doesn’t change the main result.)
The Characteristic Function f(t) for the mixed distribution:
(B.1) fHtL = ‚
i=1
N
p
i

-
1
2
t
2
s
i
2
+Â t mi
For N=2, the moments simplify to
M1 p
1
m
1
+ H1 - p
1
L m
2
M2 p
1
Hm
1
2
+ s
1
2
L + H1 - p
1
L Hm
2
2
+ s
2
2
L
M3 p
1
m
1
3
+ H1 - p
1
L m
2
Hm
2
2
+ 3 s
2
2
L + 3 m
1
p
1
s
1
2
M4 p
1
H6 m
1
2
s
1
2
+ m
1
4
+ 3 s
1
4
L + H1 - p
1
L H6 m
2
2
s
2
2
+ m
2
4
+ 3 s
2
4
L
Let us consider the different varieties, all charaterized by the condition p
1
< (1-p
1
), m
1
< m
2
, preferably m
1
< 0 and m
2
> 0 , and, at the core, the
central property: s
1
> s
2
.
Variety 1: War and Peace. Calm period with positive mean and very low volatility, turmoil with negative mean and extremely low volatility.
Fat Tails and (Anti)fragility - N N Taleb| 27
S
1
S
2
Pr
Figure 1.1 The War and peace model. Kurtosis K=1.7, much lower than the Gaussian
Variety 2: Conditional determistic state. Take a bond B, paying interest r at the end of a single period. At termination, there is a high
probability of getting B (1+r), a possibility of defaut. Getting exactly B is very unlikely. Think that there are no intermediary steps between war
and peace: these are separable and discrete states. Bonds don’t just default “a little bit”. Note the divergence, the probability of the realization
being at or close to the mean is about nil. Typically, p(E(x)) the probabilitity densities of the expectation are smaller than at the different means
of regimes, so pHEHxLL < pHm
1
L and < pHm
2
L , but in the extreme case (bonds), p(E(x)) becomes increasingly small. The tail event is the
realization around the mean.
S
2
S
1
Pr
Figure 1.2 The Bond payoff model. Absence of volatility, deterministic payoff in regime 2, mayhem in regime 1. Here the kurtosis K=2.5. Note that the coffee cup is a special
case of both regimes 1 and 2 being degenerate.
In option payoffs, this bimodality has the effect of raising the value of at-the-money options and lowering that of the out-of-the-money ones,
causing the exact opposite of the so-called “volatility smile”.
Note the coffee cup has no state between broken and healthy. And the state of being broken can be considered to be an absorbing state (using
Markov chains for transition probabilities), since broken cups do not end up fixing themselves.
Nor are coffee cups likely to be “slightly broken”, as we will see in the next figure.
28 | Fat Tails and (Anti)fragility - N N Taleb
Figure 1.3 The coffee cup cannot incur “small” harm; it is exposed to everything or nothing.
A brief list of other situations where bimodality is encountered:
1. Mergers
2. Professional choices and outcomes
3. Conflicts: interpersonal, general, martial, any situation in which there is no intermediary between harmonious relations and hostility.
4. Conditional cascades
Transition probabilites, or what can break will eventually break
So far we looked at a single period model, which is the realistic way since new information may change the bimodality going into the future:
we have clarity over one-step but not more. But let us go through an exercise that will give us an idea about fragility. Assuming the structure of
the model stays the same, we can look at the longer term behavior under transition of states. Let P be the matrix of transition probabilitites,
where p
i, j
is the transition from state i to state j over Dt, (that is, where S(t) is the regime prevailing over period t, PIS Ht + Dt L = s
j
SHtL = s
j
MM
(1.22) P =
p
1,1
p
2,1
p
1,2
p
2,2
After n periods, that is, n steps,
(1.23) P
n
=
Hp1,1-1L Hp1,1+p2,2-1L
n
+p2,2-1
p1,1+p2,2-2
H1-p1,1L HHp1,1+p2,2-1L
n
-1L
p1,1+p2,2-2
H1-p2,2L HHp1,1+p2,2-1L
n
-1L
p1,1+p2,2-2
Hp2,2-1L Hp1,1+p2,2-1L
n
+p1,1-1
p1,1+p2,2-2
The extreme case to consider is the one with the absorbing state, where p
1,1
=1, hence ( replacing p
i,!i i=1,2
=1-p
i,i
).
(1.24) P
n
=
1 0
1 - p
2,2
N
p
2,2
N
and the "ergodic" probabilities:
(1.25) lim
nض
P
n
= K
1 0
1 0
O
The implication is that the absorbing state regime 1 S(1) will end up dominating with probability 1: what can break and is irreversible will
eventually break.
Fat Tails and (Anti)fragility - N N Taleb| 29
2
Preasymptotics and Central Limit in the Real World
The Central Limit Theorem, at the foundation of modern statistics: The behavior of the sum of random variables allows us to get to
the asymptote and use handy asymptotic properties, that is, Platonic distributions. But the problem is that in the real world we never get to
the asymptote, we just get “close”. Some distributions get close quickly, others very slowly (even if they have finite variance). The same
applies to the law of large numbers, with severe problems in convergence. Recall the idea of estimator is based on replicability outside the
set in which it was derived: this is the weak law of large numbers.
2.1 An Excursion Into The Law of Large Numbers Under Fat Tails
How do you reach the limit? A common flaw in the interpretation of the law of large numbers. More technically, by the weak law of large
numbers, a sum of random variables X
1
, X
2
,..., X
N
independent and identically distributed with finite mean m, that is E[X
i
] < ¶, then
1
N

1§i§N
X
i
converges to m in probability, as N Ø ¶. But the problem of convergence in probability, as we will see later, is that it does not
take place in the tails of the distribution (different parts of the distribution have different speeds). This point is quite central and will be
examined later with a deeper mathematical discussions on limits in Chapter x. We limit it here to intuitive presentations of turkey surprises.
{Hint: we will need to look at the limit without the common route of Chebychev's inequality which requires E[X
i
2
] < ¶ . Chebychev’s
inequality and similar ones eliminate the probabilities of some tail events).
Let us examine the speed of convergence of the average
1
N

1§i§N
X
i
. For a Gaussian distribution Hm, s), the characteristic function for the
convolution is jHt ê NL
N
= ‰
-
s
2
t
2
2 N
2
+
 mt
N
N
, which, derived twice at 0 yields H-ÂL
2
!
2
c
!t
2

!c
!t
ê. t Ø0 which produces the standard deviation sHnL =
sH1L
N
so one can say that sum “converges” at a speed N .
Another way to view it is by expanding j and letting N ض gets the characteristic function of the degenerate distribution at m (the latter
methods illustrates the strong law of large numbers as converge takes place almost everywhere except for a set of probability 0).
But things are far more complicated with power laws. Let us repeat the exercise for a Pareto distribution with density L
a
x
-1-a
a , x> L, a >1,
(2.1) jHt ê NL
N
= a
N
E
a+1
-
 L t
N
N
where E is the exponential integral E; E
n
HzL =
Ÿ
1

e
-z t
ë t
n
„t. Setting L=1 to scale, the standard deviation s
a
HNL for the N-average becomes
(2.2) s
a
HNL =
1
N
a
N
E
a+1
H0L
N-2
IE
a-1
H0L E
a+1
H0L + E
a
H0L
2
I-N a
N
E
a+1
H0L
N
+ N - 1MM for a > 2
Sucker Trap: After some tinkering, we get s
a
HNL =
saH1L
N
as with the Gaussian, which is a sucker’s trap. For we should be careful in interpret-
ing s
a
HNL, which will be very volatile since s
a
H1L is already very volatile and does not reveal itself easily in realizations of the process. In fact,
let p(.) be the PDF of a Pareto distribution with mean m, standard deviation s, minimum value L and exponent a, D
a
the expected mean
deviation of the variance for a given a will be D
a
=
1
s
2
Ÿ
L

H†Hx - mL
2
- s
2
§ L pHxL „x
30 | Fat Tails and (Anti)fragility - N N Taleb
Figure 2.1 The standard deviation of the sum of N=100 a=13/6. The right graph shows the entire span of realizations.
Absence of Useful Theory: As to situations, central situations, where 1<a<2, we are left hanging analytically (but we can do something about
it in the next section). We will return to the problem in our treatment of the preasymptotics of the central limit theorem.
But we saw in 1.8 that the volatility of the mean is
a
a-1
s and the mean deviation of the mean deviation, that is, the volatility of the volatility of
mean is 2 Ha - 1L
a-2
a
1-a
s , where s is the scale of the distribution. As we get close to a = 1 the mean becomes more and more volatile in
realizations for a given scale. This is not trivial since we are not interested in the speed of convergence per se given a variance, rather the ability
of a sample to deliver a meaningful estimate of some total properties.
The law of large numbers, if it ever works, operates at >20 times slower rate for an a of 1.15 than for an exponent of 3. As a rough
heuristic, just assume that one needs > 400 times the observations. Indeed, 400 times! (The point of what we mean by “rate” will be
revisited with the discussion of the Large Deviation Principle and the Cramer rate function in X.x; we need a bit more refinement of the
idea of tail exposure for the sum of random variables).
Figure 2.2 How thin tails (Gaussian) and fat tails (1<a§2 ) converge.
Comparing N = 1 to N = 2 for a symmetric power law with 1<a£2 .
Let fHtL be the characteristic function of the symmetric Student T with a degrees of freedom. After two-fold convolution of the average we get:
fHt ê 2L
2
=
4
1-a
a
aê2
†t§
a
Ka
2
a †t§
2
2
GI
a
2
M
2
. We can get an explicit density by inverse Fourier transformation of f, p
2,a
HxL =
1
2 p
Ÿ


fHt ê 2L
2
-it x
dt
which yields the following
(2.3) p
2,a
HxL =
p 2
-4 a
a
5ê2
GH2 aL
2
F
1
Ja +
1
2
,
a+1
2
;
a+2
2
; -
x
2
a
N
GI
a
2
+ 1M
4
Fat Tails and (Anti)fragility - N N Taleb| 31
where
2
F
1
is the hypergeometric function,
2
F
1
Ha, b; c; zL = ⁄
k=0

HaL
k
HbL
k
ê HcL
k
z
k
ëk!.
we can compare the twice-summed density to the initial one (with some abuse of notation: p
n
(x)= P(⁄
i=1
N
x
i
=x))
(2.4)
p
1,a
HxL =
I
a
a+x
2
M
a+1
2
a BI
a
2
,
1
2
M
We can automatically see that in the Cauchy case (a=1) the sum conserves the density, so p
1,1
HxL= p
2,1
HxL=
1
p I1+x
2
M
We can use the ratio of mean deviations; since the mean is 0, m(a)ª
Ÿ x p2,a HxL „x
Ÿ x p1,a HxL „x
m(a)=
p 2
1-a
GJa-
1
2
N
GI
a
2
M
2
lim
aض
mHaL =
1
2
1 ë
,
2
1.5 2.0 2.5 3.0
a
0.7
0.8
0.9
1.0
mHaL
Figure 2.3 Preasymptotics of the ratio of mean deviations. But one should note that mean deviations themselves are extremely high in the neighborhood of !1
This part
will be
expanded
2.2 Preasymptotics and Central Limit in the Real World
The common mistake is to think that if we satisfy the criteria of convergence, that is, independence and finite variance, that central limit is a
given. Take the conventional formulation of the Central Limit Theorem (Grimmet & Stirzaker, 1982; Feller 1971, Vol. II):
Let X
1
,X
2
,... be a sequence of independent identically distributed random variables with mean m & variance s
2
satisfying m< ¶ and 0
<s
2
<¶, then
(2.5)

i=1
N
X
i
- N m
s n
Ø
D
N H0, 1L as n ض
Where Ø
D
is converges "in distribution".
Granted convergence "in distribution" is about the weakest form of convergence. Effectively we are dealing with a double problem.
The first, as uncovered by Jaynes, corresponds to the abuses of measure theory: Some properties that hold at infinity might not hold in all
limiting processes .
There is a large difference between convergence a.s. (almost surely) and the weaker forms.
32 | Fat Tails and (Anti)fragility - N N Taleb
‡ Jaynes 2003 (p.44):"The danger is that the present measure theory notation presupposes the infinite limit already accomplished, but contains
no symbol indicating which limiting process was used (...) Any attempt to go directly to the limit can result in nonsense".
We accord with him on this point --along with his definition of probability as information incompleteness, about which later.
The second problem is that we do not have a "clean" limiting process --the process is itself idealized. I will ignore it in this chapter.
Now how should we look at the Central Limit Theorem? Let us see how we arrive to it assuming "independence".
Convergence in the Body
The CLT works does not fill-in uniformily, but in a Gaussian way --indeed, disturbingly so. Simply, whatever your distribution (assuming one
mode), your sample is going to be skewed to deliver more central observations, and fewer tail events. The consequence is that, under aggrega-
tion, the sum of these variables will converge "much" faster in thep body of the distribution than in the tails. As N, the number of observations
increases, the Gaussian zone should cover more grounds... but not in the "tails".
This quick note shows the intuition of the convergence and presents the difference between distributions.
Take the sum of of random independent variables X
i
with finite variance under distribution j(X). Assume 0 mean for simplicity (and symme-
try, absence of skewness to simplify).
A more useful formulation of the Central Limit Theorem (Kolmogorov et al, x)
(2.6)
P -u § Z =

i=0
n
X
i
n s
§ u =
Ÿ
-u
u
e
-
Z
2
2 „Z
2 p
So the distribution is going to be:
1 -

-u
u
e
-
Z
2
2 „Z , for - u § z § u
inside the "tunnel" [-u,u] --the odds of falling inside the tunnel itself
and


u
Z j
£
HNL „z +

u

Z j
£
HNL „z
outside the tunnel (-u,u)
Where j'[N] is the n-summed distribution of j.
How j'[N] behaves is a bit interesting here --it is distribution dependent.
Before continuing, let us check the speed of convergence per distribution. It is quite interesting that we get the ratio observations in a given sub-
segment of the distribution, in proportion to the expected frequency
N
-u
u
N


where N
-u
u
, is the numbers of observations falling between -u and u.
So the speed of convergence to the Gaussian will depend on
N
-u
u
N


as can be seen in the next two simulations.
Figure 2.4. Q-Q Plot of N Sums of variables distributed according to the Student T with 3 degrees of freedom, N=50, compared to the Gaussian, rescaled into standard
deviations. We see on both sides a higher incidence of tail events. 10
6
simulations
Fat Tails and (Anti)fragility - N N Taleb| 33
Figure 2.5. The Widening Center. Q-Q Plot of variables distributed according to the Student T with 3 degrees of freedom compared to the Gaussian, rescaled into standard
deviation, N=500. We see on both sides a higher incidence of tail events. 10
7
simulations
To realize the speed of the widening of the tunnel (-u,u) under summation, consider the symmetric (0-centered) Student T with tail exponent a=
3, with density
2 a
3
p Ia
2
+x
2
M
2
, and variance a
2
. For large “tail values” of x, P(x) Ø
2 a
3
p x
4
. Under summation of N variables, the tail P(Sx) will be
2 N a
3
p x
4
.
Now the center, by the Kolmogorov version of the central limit theorem, will have a variance of N a
2
in the center as well, hence P(Sx) =

-
x
2
2 a N
2 p a N
. Setting the point u where the crossover takes place,

-
x
2
2 a N
2 p a N
>
2 a
3
p x
4
, hence u
4

-
u
2
2 a N >
2 2 a
3
a N
p
produces the solution # u = # 2 a N W
a
3ê2
2 Ha NL
3ê4
H2 pL
1ê4
,
where W is the Lambert W function or product log, which climbs very slowly.
2000 4000 6000 8000 10000
N
x
Note about the crossover: See Nagaev(1969). For a regularly varying tail, the threshold beyond which the crossover of the two
distribution needs to take place should be to the right of n logHnL (normalizing for unit variance) for the right tail.
Generalizing for all exponents > 2 [THE PREVIOUS SECTION WILL BE REMOVED AS THIS ONE GENERALIZES]
More generally, using the reasoning for a broader set and getting the crossover for powelaws of all exponents:
Ha - 2L a
4

-
a-2
a
x
2
2 a N
2 p a a N
>
a
a
I
1
x
2
M
1+a
2
a
aê2
BetaA
a
2
,
1
2
E
, since the standard deviation is a
a
-2 + a
34 | Fat Tails and (Anti)fragility - N N Taleb
x Ø# #
a a Ha+1L N WHlL
Ha-2L a
Where l= -
H2 pL
1
a+1
a-2
a
a-2
4
a
-
a
2
-
1
4 a
-a-
1
2 B
a
2
,
1
2
N
-
2
a+1
a Ha+1L N
200000 400000 600000 800000 1 µ10
6
N
u
a =4
3.5
3
2.5
Figure 2.6 Scaling to a=1, the speed of the expansion of the zone of convergence for different degrees of fatness of tails.
This part
will be
expanded
Integral Limit Theorems Taking Large Deviations into Account when Cramér’s Condition Does Not Hold. I
Nagaev, A. Theory of Probability & Its Applications, 1969, Vol. 14, No. 1 : pp. 51-64
See also Petrov (1975, 1995), Mikosch and Nagaev (1998), Bouchaud and Potters (2002), Sornette (2004).
2.3 Using Log Cumulants to Observe Convergence to the Gaussian
The normalized cumulant of order n, C(n) is the derivative of the log of the characteristic function f which we convolute N times divided by
the second cumulant (i,e., second moment).
(2.7) CHn, NL =
H-ÂL
n
!
n
logHf
N
L
H-!
2
logHfL
N
L
n-1
ê.z Ø0
Fat Tails and (Anti)fragility - N N Taleb| 35
This exercise show us how fast an aggregate of N-summed variables become Gaussian, looking at how quickly the 4th cumulant
approaches 0 . For instance the Poisson get there at a speed that depends inversely on l, that is,
1
N
3
l
3
, while by contrast an exponential
distribution reaches it faster at higher values of l since the cumulant is
3! l
2
N
2
.
Table of Normalized Cumulants -Speed of Convergence (Dividing by s
n
where n is the order of the
cumulant).
Distribution Normal@m, sD PoissonHlL ExponentialHlL GHa, bL
PDF

-
Hx-mL
2
2 s
2
2 p s

-l
l
x
x!

-x l
l
b
-a

-
x
b x
a-1
GHaL
N - convoluted
Log Characteristic
N log ‰
 z m-
z
2
s
2
2 N logI‰
I-1+‰
 z
M l
M N logJ
l
l-Â z
N N logHH1 - Â b zL
-a
L
2 nd Cum 1 1 1 1
3 rd 0
1
N l
2 l
N
2
a b N
4 th 0
1
N
2
l
2
3! l
2
N
2
3!
a
2
b
2
N
2
5 th 0
1
N
3
l
3
4! l
3
N
3
4!
a
3
b
3
N
3
6 th 0
1
N
4
l
4
5! l
4
N
4
5!
a
4
b
4
N
4
7 th 0
1
N
5
l
5
6! l
5
N
5
6!
a
5
b
5
N
5
8 th 0
1
N
6
l
6
7! l
6
N
6
7!
a
6
b
6
N
6
9 th 0
1
N
7
l
7
8! l
7
N
7
8!
a
7
b
7
N
7
10 th 0
1
N
8
l
8
9! l
8
N
8
9!
a
8
b
8
N
8
Table of Normalized Cumulants -Speed of Convergence (Dividing by s
n
where n is the order of the
cumulant).
Distribution Mixed Gaussians
HStoch VolL
StudentTH3L StudentTH4L
PDF p

-
x
2
2 s
1
2
2 p s1
+ H1 - pL

-
x
2
2 s
2
2
2 p s2
6 3
p Ix
2
+3M
2
12 I
1
x
2
+4
M
5ê2
N - convoluted
Log Characteristic
N log p ‰
-
z
2
s
1
2
2 + H1 - pL ‰
-
z
2
s
2
2
2 N JlogJ 3 †z§ + 1N - 3 †z§N N logH2 †z§
2
K
2
H2 †z§LL
2 nd Cum 1 1 1
3 rd 0 Ind TK
4 th -I3 H-1 + pL p Hs
1
2
- s
2
2
L
2
M
ëIN
2
Hp s
1
2
- H-1 + pL s
2
2
L
3
M
Ind Ind
5 th 0 Ind Ind
6 th I15 H-1 + pL p H-1 + 2 pL Hs
1
2
- s
2
2
L
3
M
ëIN
4
Hp s
1
2
- H-1 + pL s
2
2
L
5
M
Ind Ind
Notes on Levy Stability and the Generalized Cental Limit Theorem
Take for now that the distribution that concerves under summation (that is, stays the same) is said to be “stable”. You add Gaussians and get
Gaussians. But if you add binomials, you end up with a Gaussian, or, more accurately, “converge to the Gaussian basin of attraction”. These
distributions are not called “unstable” but they are.
36 | Fat Tails and (Anti)fragility - N N Taleb
There is a more general class of convergence. Just consider that the Cauchy variables converges to Cauchy, so the “stability’ has to apply to an
entire class of distributions.
Although these lectures are not about mathematical techniques, but about the real world, it is worth developing some results converning
stable distribution in order to prove some results relative to the effect of skewness and tails on the stability.
Let n be a positive integer, n¥2 and X
1
, X
2
,....X
n
satisfy some measure of independence and are drawn from the same distribution,
i) there exist c
n
œ $+ and d
n
œ R+ such that

i=1
n
X
i
=
D
c
n
X + d
n
where =
D
means “equality” in distribution.
ii) or, equivalently, there exist sequence of i.i.d random variables {Y
i
}, a real positive sequence {d
i
} and a real sequence {a
i
} such that
1
dn

i=1
n
Y
i
+ a
n
Ø
D
X
where Ø
D
means convergence in distribution.
iii) or, equivalently,
The distribution of X has for characteristic function
fHtL =
expHi m t - s†t§ H1 + 2 i bê p sgnHtL logH†t§LLL a = 1
expIi m t - †t s§
a
I1 - i b tanI
p a
2
M sgnHtLMM a " 1
.
aœ(0,2] sœ !
+
, b œ[-1,1], m œ $
Then if either of i), ii), iii) holds, then X has the "alpha stable" distribution S(a, b, m, s), with b designating the symmetry, m the centrality,
and s the scale.
Warning: perturbating the skewness of the Levy stable distribution by changing b without affecting the tail exponent is mean preserving, which
we will see is unnatural: the transformation of random variables leads to effects on more than one characteristic of the distribution.
-20 -10 10 20
0.02
0.04
0.06
0.08
0.10
0.12
0.14
-30 -25 -20 -15 -10 -5
0.05
0.10
0.15
Figure 2.7 Disturbing the scale of the alpha stable and that of a more natural distribution, the gamma distribution. The alpha stable does not increase in risks! (risks for us in
Chapter x is defined in thickening of the tails of the distribution). We will see later with “convexification” how it is rare to have an effect without an increase in risks.
SHa, b, m, sL
represents the stable distribution S
type
with index of stability a, skewness parameter b, location parameter m, and scale parameter s.
Generalized central limit theorem gives sequences a
n
and b
n
such that the distribution of the shifted and rescaled sum Z
n
"H⁄
i
n
X
i
- a
n
L ê b
n
of n
i.i.d. random variates X
i
whose distribution function F
X
HxL has asymptotes 1 - c x
-m
as x -> +¶ and d H-xL
-m
as x -> -¶ weakly converges to
the stable distribution S
1
Ha, Hc - dL ê Hc + dL, 0, 1L:
Note: Chebyshev’s Inequality and upper bound on deviations under finite variance.
Even when variance is finite, the bound is rather far. Consider Chebyshev's inequality:
Fat Tails and (Anti)fragility - N N Taleb| 37
PHX > aL §
s
2
a
2
PHX > n sL §
1
n
2
Which effectively accommodate power laws but puts a bound on the probability distribution of large deviations --but still significant.
The Effect of Finiteness of Variance
This table shows the inverse of the probability of exceeding a certain s for the Gaussian and the lower on probability limit for any distribution
with finite variance.
Deviation
3
Gaussian
7.µ10
2
Chebyshev Upper Bound
9
4 3.µ10
4
16
5 3.µ10
6
25
6 1.µ10
9
36
7 8.µ10
11
49
8 2.µ10
15
64
9 9.µ10
18
81
10 1.µ10
23
100
2.4. Illustration: Convergence of the Maximum of a Finite Variance Power Law
The behavior of the maximum value as a percentage of a sum is much slower than we think, and doesn’t make much difference on whether it is
a finite variance, that is a>2 or not. (See comments in Mandelbrot & Taleb, 2011)
(2.8) tHNL ª E
Max 8X
t-i Dt
<
i=0
N

i=0
N
HX
t-i Dt
L
38 | Fat Tails and (Anti)fragility - N N Taleb
a=1.8
a=2.4
2000 4000 6000 8000 10000
N
0.01
0.02
0.03
0.04
0.05
MaxêSum
Figure 2.8 Pareto Distributions with different tail exponents a=1.8 and 2.5. Although the difference between the two ( finite and infinite variance) is quantitative, not qualitative.
impressive.
Fat Tails and (Anti)fragility - N N Taleb| 39
C
A Tutorial: How We Converge Mostly in the Center of
the Distribution
We start with the Uniform Distribution
f HxL =
1
H-L
L § x § H
0 elsewhere
where L = 0 and H =1
A random run from a Uniform Distribution
0.2 0.4 0.6 0.8 1
50
100
150
200
250
300
The sum of two random draws: N=2
40 | Fat Tails and (Anti)fragility - N N Taleb
0.5 1 1.5 2
100
200
300
400
500
Three draws N=3
0.5 1 1.5 2 2.5
200
400
600
800
As we can see, we get more observations where the peak is higher.
Now some math. By Convoluting 2, 3, 4 times iteratively:
f2 Hz2L =



H f Hz - xLL H f xL „x = #
2 - z2 1 < z2 < 2
z2 0 < z2 § 1
f3 z3 =

0
3
Hf2 Hz3 - x2LL H f x2L „x2 =
z3
2
2
0 < z3 § 1
-Hz3 - 3L z3 -
3
2
1 < z3 < 2
-
1
2
Hz3 - 3L Hz3 - 1L z3 "2
1
2
Hz3 - 3L
2
2 < z3 < 3
f4 x =

0
4
Hf3 Hz4 - xLL H f x3L „x3 =
1
4
z4 "3
1
2
z4 "2
z4
2
4
0 < z4 § 1
1
4
I-z4
2
+ 4 z4 - 2M 1 < z4 < 2 Í2 < z4 < 3
1
4
Hz4 - 4L
2
3 < z4 < 4

A simple Uniform Distribution
Fat Tails and (Anti)fragility - N N Taleb| 41
-0.5 0.5 1 1.5
0.5
1
1.5
2
2.5
3
3.5
4
We can see how quickly, after one single addition, the net probabilistic "weight" is going to be skewed to the center of the distribution, and the
vector will weight future densities..
-0.5 0.5 1 1.5 2 2.5
0.5
1
1.5
2
2.5
1 2 3
0.1
0.2
0.3
0.4
0.5
0.6
0.7
1 2 3 4 5
0.1
0.2
0.3
0.4
0.5
42 | Fat Tails and (Anti)fragility - N N Taleb
D
Appendix:The Large Deviation Principle and Fat Tails
There are a lot of studies of large deviations in an idealized world; unfortunately the existing methods do not really extend to true fat tails
outside special situations like stable distributions. The methods we will resort to is to start with a thin-tailed distribution, then fatten the
tails progressively and probe the effect, particularly as the distribution approaches a scalable power law.
Let us start with the tail behavior of the sum of Gaussian random variables.
S
n
=
⁄x
i
n
, with x
i
~NHm, sL
(2.9)
PHm, s, n, S
n
= sL =

-
n Hs-mL
2
2 s
2
n
2 p s
Hence, with n ض, the variance of S
n
=
s
2
n
Ø 0, the dominant part PHS
n
= sL º ‰
-n RHsL
where RHsL is the rate function R Hm, s, n, sL =
Hs-mL
2
2 s
2
.
Roughly the rate function is a form of “discounter” of the tail events, the headwind so to speak.
Stochastic volatility effect
-2 -1 1 2
0.5
1.0
1.5
2.0
2.5
pH0, 1, 10, sL
pH0, 1, 50, sL
RH0, 1, sL
RI0,
3
2
, sM
1
2
HRH0, 1, sL + RH0, 2, sLL
Different rate functions. We can see the effect of fatter tails and how it weakens the headwind of “large deviations”.
This part
will be
expanded
Fat Tails and (Anti)fragility - N N Taleb| 43
E
Appendix: Fat Tails and Random Matrices
This part
will be
expanded
{Discussion of a Random Matrix}
By progressively fattening the tail according to 1.10 and subsequently by lowering the exponent of the power law, we see the effect on
Wigner matrix, comparing the distribution of the Eigenvalues to the Wigner semi-circle distribution. Fattening the tails broadens the spectrum
of the random matrix.
Fattening the tails of a Gaussian according to 1.10 with p=10
4
and aº10
3
.
A pure power law distribution
This part
will be
expanded
44 | Fat Tails and (Anti)fragility - N N Taleb
Fat Tails and (Anti)fragility - N N Taleb| 45
3
On The Misuse of Statistics in Social Science
There have been many papers (Kahneman and Tversky 1974, Hoggarth and Soyer, 2012) showing how statistical researchers overinterpret their
own findings. (Goldstein and this author, Goldstein and Taleb 2007, showed how professional researchers and practitioners substitute »» x »»
1
for
»» x »»
2
). The common result is underestimating the randomness of the estimator M, in other words read too much into it.
There is worse. Mindless application of statistical techniques, without knowledge of the conditional nature of the claims are widespread. But
mistakes are often elementary, like lectures by parrots repeating “N of 1” or “p”, or “do you have evidence of?”, etc. Social scientists need to
have a clear idea of the difference between science and journalism, or the one between rigorous empiricism and anecdotal statements. Science is
not about making claims about a sample, but using a sample to make general claims and discuss properties that apply outside the sample.
Take M’ (short for M
T
X
HA, f L) the estimator we saw above from the realizations (a sample path) for some process, and M* the "true" mean that
would emanate from knowledge of the generating process for such variable. When someone says: "Crime rate in NYC dropped between 2000
and 2010", the claim is limited M’ the observed mean, not M* the true mean, hence the claim can be deemed merely journalistic, not scientific,
and journalists are there to report "facts" not theories. No scientific and causal statement should be made from M’ on "why violence has
dropped" unless one establishes a link to M* the true mean. M cannot be deemed "evidence" by itself. Working with M’ alone cannot be called
"empiricism".
What we just saw is at the foundation of statistics (and, it looks like, science). Bayesians disagree on how M’ converges to M*, etc., never on
this point. From his statements in a dispute with this author concerning his claims about the stability of modern times based on the mean
casualy in the past (Pinker, 2011), Pinker seems to be aware that M’ may have dropped over time (which is a straight equality) and sort of
perhaps we might not be able to make claims on M* which might not have really been dropping.
In some areas not involving time series, the differnce between M’ and M* is negligible. So I rapidly jot down a few rules before showing
proofs and derivations (limiting M’ to the arithmetic mean, that is, M’= M
T
X
HH-¶, ¶L, xL). Where E is the expectation operator under "real-
world" probability measure P:
3.1 The Tails Sampling Property
E[|M’-M*|] increases in with fat-tailedness (the mean deviation of M* seen from the realizations in different samples of the same process).
In other words, fat tails tend to mask the distributional properties.
This is the immediate result of the problem of large numbers.
On the difference between the initial and the “recovered” distribution
{Explanation of the method of generating data from a known distribution and comparing realized outcomes to expected ones}
46 | Fat Tails and (Anti)fragility - N N Taleb
Figure 3.1. Q-Q plot. Fitting extreme value theory to data generated by its own process , the rest of course owing to sample insuficiency for extremely large values, a bias
that typically causes the underestimation of tails, as the points tend to fall to the right.
Case Study: Pinker’s Claims On The Stability of the Future Based on Past Data
When the generating process is power law with low exponent, plenty of confusion can take place.
For instance, Pinker(2011) claims that the generating process has a tail exponent ~1.16 but made the mistake of drawing quantitative conclu-
sions from it about the mean from M' and built theories about drop in the risk of violence that is contradicted by the data he was showing,
since fat tails (plus skewness)= hidden risks of blowup. His study is also missing the Casanova problem (next point) but let us focus on the
error of being fooled by the mean of fat-tailed data.
The next two figures show the realizations of two subsamples, one before, and the other after the turkey problem, illustrating the inability of a
set to naively deliver true probabilities through calm periods.
Figure 3.2. First 100 years (Sample Path): A Monte Carlo generated realization of a process for casualties from violent conflict of the "80/20 or 80/02 style", that is tail
exponent a= 1.15
Fat Tails and (Anti)fragility - N N Taleb| 47
Figure 3.3. The Turkey Surprise: Now 200 years, the second 100 years dwarf the first; these are realizations of the exact same process, seen with a longer window and at
a different scale.
The next simulation shows M1, the mean of casualties over the first 100 years across 10
4
sample paths, and M2 the mean of casualties over the
next 100 years.
200 400 600 800
M1
200
400
600
800
M2
5000 10000 15000 20000 25000 30000
M1
2000
4000
6000
8000
10000
12000
M2
Figure 3.4. Does the past mean predict the future mean? Not so. M1 for 100 years, M2 for the next century. Seen at different scales, the second scale show how extremes
can be severe.
1.0 1.5 2.0
M1
0.8
1.0
1.2
1.4
1.6
1.8
2.0
2.2
M2
Figure 3.5. The same seen with a thin-tailed distribution.
So clearly it is a lunacy to try to read much into the mean of a power law with 1.15 exponent (and this is the mild case, where we know the
exponent is 1.15).
48 | Fat Tails and (Anti)fragility - N N Taleb
NOTE: Where does this 80/20 business come from?
Assume a the power law tail exponent, and an exceedant probability P
>x
= x
min
x
-a
, x œ[x
min
, ¶). Very simply, the top p of the population
gets S= p
a-1
a of the share of the total pie.
a =
logHpL
logHpL-logHSL
which means that the exponent needs to be 1.161 for the 80/20 distribution.
Note that as a gets close to 1 the contribution explodes as it becomes close to infinite mean.
Derivation:
The density f HxL = x
min
a
a x
-a-1
, x ¥ x
min
1) The Share attributed above K, K > 1, will be
Ÿ
K

x f HxL „x
Ÿ
1

x f HxL „x
= K
1-a
2) The probability of exceeding K is
Ÿ
K

f HxL „x = K
-a
3) K
-a
of the population contributes K
1-a
=p
a-1
a of the result
Fat Tails and (Anti)fragility - N N Taleb| 49
The next graph shows exactly how not to work with power laws.
Figure 3.6. Cederman 2003, used by Pinker. I wonder if I am dreaming or if the exponent a is really = .41. Chapters x and x show why such inference is centrally flawed,
since low exponents do not allow claims on mean of the variable except to say that it is very, very high and not observable in finite samples. Also, in addition to wrong
conclusions from the data, take for now that the regression fits the small deviations, not the large ones, and that the author overestimates our ability to figure out the
asymptotic slope.
‹ Warning. Severe mistake! One should never, never fit a powerlaw using frequencies (such as 80/20), as these are interpolative. Other estimators based on Log-Log
plotting or regressions are more informational, though with a few caveats.
Why? Because data with infinite mean, a §1, will masquerade as finite variance in sample and show about 80% contribution to the top 20% quantile. In fact you are
expected to witness in finite samples a lower contribution of the top 20%/
Let us see. Generate m samples of a =1 data X
j
={x
i ,j
<
i=1
n
, ordered x
i ,j
¥ x
i -1,j
, and examine the distribution of the top n contribution Z
j
n
=

i §n n
x
j

i §n
x
j
, with n œ (0,1).
Figure 3.10. n=20/100, N= 10
7
. Even when it should be .0001/100, we tend to watch an average 75/20
50 | Fat Tails and (Anti)fragility - N N Taleb
3.2 Survivorship Bias (Casanova) Property
E[M’-M*] increases under the presence of an absorbing barrier for the process. This is the Casanova effect, or fallacy of silent evidence
see The Black Swan, Chapter 8. (Fallacy of silent evidence: Looking at history, we do not see the full story, only the rosier parts of the
process, in the Glossary)
History is a single sample path we can model as a Brownian motion, or something similar with fat tails (say Levy flights). What we observe is
one path among many “counterfactuals”, or alternative histories. Let us call each one a “sample path”, a succession of discretely observed states
of the system between the initial state S
0
and S
T
the present state.
1. Arithmetic process: We can model it as SHtL = SHt - DtL + Z
Dt
where Z
Dt
is noise drawn from any distribution.
2. Geometric process: We can model it as SHtL = SHt - DtL ‰
Wt
typically SHt - DtL ‰
m Dt +s Dt Zt
but W
t
can be noise drawn from any
distribution. Typically, logJ
S HtL
S Ht-i DtL
N is treated as Gaussian, but we can use fatter tails. The convenience of the Gaussian is stochastic
calculus and the ability to skip steps in the process, as S(t)=S(t-Dt) ‰
m Dt +s Dt Wt
, with W
t
~N(0,1), works for all Dt, even allowing for a
single period to summarize the total.
The Black Swan made the statement that history is more rosy than the “true” history, that is, the mean of the ensemble of all sample path.
Take an absorbing barrier H as a level that, when reached, leads to extinction, defined as becoming unobservable or unobserved at period T.
Barrier H
200 400 600 800 1000
Time
50
100
150
200
250
Sample Paths
Figure 3.11. Counterfactual historical paths subjected to an absorbing barrier.
When you observe history of a family of processes subjected to an absorbing barrier, i.e., you see the winners not the losers, there are biases. If
the survival of the entity depends upon not hitting the barrier, then one cannot compute the probabilities along a certain sample path, without
adjusting.
Definition: “true” distribution is the one for all sample paths, “observed” distribution is the one of the succession of points {S
iDt
<
i=1
T
.
1. Bias in the measurement of the mean. In the presence of an absorbing barrier H “below”, that is, lower than S
0
, the “observed mean” r
“true mean”
2. Bias in the measurement of the volatility. The “observed” variance b “true” variance
The first two results have been proved (see Brown, Goetzman and Ross (1995)). What I will set to prove here is that fat-tailedness increases the
bias.
First, let us pull out the “true” distribution using the reflection principle.
Fat Tails and (Anti)fragility - N N Taleb| 51
The reflection principle (graph from Taleb, 1997). The number of paths that go from point a to point b without hitting the barrier H is equivalent to the number of path from
the point -a (equidistant to the barrier) to b.
Thus if the barrier is H and we start at S
0
then we have two distributions, one f(S), the other f(S-2(S
0
-H))
By the reflection principle, the “observed” distribution p(S) becomes:
(3.1) pHSL = :
f HSL - f HS - 2 HS
0
- HLL if S > H
0 if S < H
Simply, the nonobserved paths (the casualties “swallowed into the bowels of history”) represent a mass of 1-
Ÿ
H

f HSL - f HS - 2 HS
0
- HLL „S
and, clearly, it is in this mass that all the hidden effects reside. We can prove that the missing mean is
Ÿ

H
S H f HSL - f HS - 2 HS
0
- HLLL „S and
perturbate f(S) using the previously seen method to “fatten” the tail.
Observed Distribution
H
Absorbed Paths
Figure 3.13. If you don’t take into account the sample paths that hit the barrier, the observed distribution seems more positive, and more stable, than the “true” one.
3.3 Left (Right) Tail Sample Insufficiency Under Negative (Positive) Skewness
E[M’-M*] increases (decreases) with negative (positive) skeweness of the true underying variable.
52 | Fat Tails and (Anti)fragility - N N Taleb
A naive measure of a sample mean, even without absorbing barrier, yields a higher oberved mean than “true” mean when the distribution is
skewed to the left.
Figure 3.14. The left tail has fewer samples. The probability of an event falling below K in n samples is F(K), where F is the cumulative distribution.
This can be shown analytically, but a simulation works well.
To see how a distribution masks its mean because of sample insufficiency, take a skewed distribution with fat tails, say the Pareto Distribution.
The "true" mean is known to be m=
a
a-1
. Generate {X
1, j
, X
2, j
, ...,X
N, j
} random samples indexed by j as a designator of a certain history j.
Measure m
j
=

i=1
N
Xi, j
N
. We end up with the sequence of various sample means 9m
j
=
j=1
T
, which naturally should converge to M with both N and T.
Next we calculate m
è
the median value of ⁄
j=1
T
mj
M* T
, such that P>m
è
=
1
2
where, to repeat, M* is the theoretical mean we expect from the generat-
ing distribution.
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
Ê
1.5 2.0 2.5
a
0.75
0.80
0.85
0.90
0.95
m
è
Figure 3.15. Median of ⁄
j =1
T
m
j
M T
in simulations (10
6
Monte Carlo runs). We can observe the underestimation of the mean of a skewed power law distribution as a exponent
gets lower. Note that lower a imply fatter tails.
Entrepreneurship is penalized by right tail insufficiency making performance look worse than it is. Figures 3.14 and 3.15 can be seen in
a symmetrical way, producing the exact opposite effect of negative skewness.
Fat Tails and (Anti)fragility - N N Taleb| 53
3.4 Why N=1 Can Be Very, Very Significant Statistically
Power of Extreme Deviations: Under fat tails, large deviations from the mean are vastly more informational than small ones. They are
not "anecdotal". (The last two properties corresponds to the black swan problem, inherently asymmetric).
The gist of the argument is the notion called masquerade problem in The Black Swan.
A thin-tailed distribution is less likely to deliver a single large deviation than a fat tailed distribution a series of long calm periods.
Now add negative skewness to the issue, which makes large deviations negative and small deviations positive, and a large negative deviation,
under skewness, becomes extremely informational.
3.5 Asymmetry in Inference
Asymmetry in Inference: Under both negative skewness and fat tails, negative deviations from the mean are more informational than
positive deviations.
3.6 A General Summary of The Problem of Reliance on Past Time Series
The four aspects of what we will call the nonreplicability issue, particularly for mesures that are in the tails. These are briefly presented here
and developed more technically throughout the book:
a- Definition of statistical rigor (or Pinker Problem). The idea that an estimator is not about fitness to past data, but related to how it can
capture future realizations of a process seems absent from the discourse. Much of econometrics/risk management methods do not meet this
simple point and the rigor required by orthodox, basic statistical theory.
b- Statistical argument on the limit of knowledge of tail events. Problems of replicability are acute for tail events. Tail events are impossible
to price owing to the limitations from the size of the sample. Naively rare events have little data hence what estimator we may have is noisier.
c- Mathematical argument about statistical decidability. No probability without metaprobability. Metadistributions matter more with tail
events, and with fat-tailed distributions.
† The soft problem: we accept the probability distribution, but the imprecision in the calibration (or parameter errors) percolates in the tails.
† The hard problem (Taleb and Pilpel, 2001, Taleb and Douady, 2009): We need to specify an a priori probability distribution from which
we depend, or alternatively, propose a metadistribution with compact support.
† Both problems are bridged in that a nested stochastization of standard deviation (or the scale of the parameters) for a Gaussian turn a
thin-tailed distribution into a power law (and stochastization that includes the mean turns it into a jump-diffusion or mixed-Poisson).
d- Economic arguments: The Friedman-Phelps and Lucas critiques, Goodhart’s law. Acting on statistical information (a metric, a response)
changes the statistical properties of some processes.
Conclusion
This chapter introduced the problem of “surprises” from the past of time series, and the invalidity of a certain class of estimators that seem to
only work in-sample. Before examining more deeply the mathematical properties of fat-tails, let us look at some practical aspects.
54 | Fat Tails and (Anti)fragility - N N Taleb
Fat Tails and (Anti)fragility - N N Taleb| 55
F
Appendix: Where Standard Diversification Fails
U
Overerestimation
of diversification
Underestimation
of risk
20 40 60 80 100
Number of Assets
Risk
Markowitz
Real World
Figure D.1. The “diversification effect”: difference between promised and delivered. Markowitz Mean Variance based portfolio construction will stand probably as the most
empirically invalid theory ever used in modern times. If irresponsible charlatanism cannot describe this, what can?
This part
will be
expanded
56 | Fat Tails and (Anti)fragility - N N Taleb
4
On the Difficulty of Risk Parametrization With Fat
Tails
This chapter presents case studies around the point that, simply, some models depend quite a bit on small variations in parameters. The effect
on the Gaussian is easy to gauge, and expected. But many believe in power laws as panacea. Even if I believed the r.v. was power law dis-
tributed, I still would not be able to make a statement on tail risks. Sorry, but that’s life.
This chapter is illustrative; it will initially focus on nonmathematical limits to producing estimates of M
T
X
HA, f L when A is limited to the
tail. We will see how things get worse when one is sampling and forecasting the maximum of a random variable.
4.1 Some Bad News Concerning power laws
We saw the shortcomings of parametric and nonparametric methods so far. What are left are power laws; they are a nice way to look at the
world, but we can never really get to know the exponent a, for a spate of reasons we will see later (the concavity of the exponent to parameter
uncertainty). Suffice for now to say that the same analysis on exponents yields a huge in-sample variance and that tail events are very sensitive
to small changes in the exponent.
For instance, for a broad set of stocks over subsamples, using a standard estimation method (the Hill estimator), we get subsamples of securi-
ties. Simply, the variations are too large for a reliable computation of probabilities, which can vary by > 2 orders of magnitude. And the effect
on the mean of these probabilities is large since they are way out in the tails.

1.5 2.0 2.5 3.0
a
50
100
150
200
250
300
350
1êPr
Figure 4.1. The effect of small changes in tail exponent on a probability of exceeding a certain point. Here Pareto(L,a), probability of exceeding 7 L ranges from 1 in 10 to 1 in
350. For further in the tails the effect is more severe.
The way to see the response to small changes in tail exponent with probability: considering P
>K
~ K
-a
, the sensitivity to the tail exponent
!P>K
!a
= -K
-a
logHKL.
Now the point that probabilities are sensitive to assumptions brings us back to the Black Swan problem. One might wonder, the change in
probability might be large in percentage, but who cares, they may remain small. Perhaps, but in fat tailed domains, the event multiplying the
probabilities is large. In life, it is not the probability that matters, but what one does with it, such as the expectation or other moments, and the
contribution of the small probability to the total moments is large in power law domains.
Fat Tails and (Anti)fragility - N N Taleb| 57
Now the point that probabilities are sensitive to assumptions brings us back to the Black Swan problem. One might wonder, the change in
probability might be large in percentage, but who cares, they may remain small. Perhaps, but in fat tailed domains, the event multiplying the
probabilities is large. In life, it is not the probability that matters, but what one does with it, such as the expectation or other moments, and the
contribution of the small probability to the total moments is large in power law domains.
For all powerlaws, when K is large, with a > 1, the unconditional shortfall S
+
=
Ÿ
K

x fHxL „x and S


-K
x fHxL „x approximate to
a
a-1
K
-a+1
and
-
a
a-1
K
-a+1
, which are extremely sensitive to a particularly at higher levels of K,
!S+
!a
= -
K
1-a
HHa-1L a logHKL+1L
Ha-1L
2
.
There is a deeper problem related to the effect of model error on the estimation of a, which compounds the problem, as a tends to be underesti-
mated by Hill estimators and other methods, but let us leave it for now.
4.2. Extreme Value Theory: Fuhgetaboudit
We saw earlier how difficult it is to compute risks using power laws, owing to excessive model sensitivity. Let us apply this to the so-
called Extreme Value Theory, EVT.
Extreme Value Theory has been considered a panacea for dealing with extreme events by some "risk modelers" . On paper it looks great. But
only on paper. The problem is the calibration and parameter uncertainty --in the real world we don't know the parameters. The ranges in the
probabilities generated we get are monstrous.
We start with a short presentation of the idea, followed by an exposition of the difficulty.
What is Extreme Value Theory? A Simplified Exposition
Let us proceed with simple examples.
Case 1, Thin Tailed Distribution
The Extremum of a Gaussian variable: Say we generate N Gaussian variables 8Z
i
<
i=1
N
with mean 0 and unitary standard deviation, and take
the highest value we find. We take the upper bound E
j
for the N-size sample run j
E
j
= Max 9Z
i, j
=
i=1
N
Assume we do so M times, to get M samples of maxima for the set E
E = 9Max 9Z
i, j
=
i=1
N
=
j=1
M
The next figure will plot a histogram of the result.
Figure 4.2. Taking M samples of Gaussian maxima; here N= 30,000, M=10,000. We get the Mean of the maxima = 4.11159 Standard Deviation= 0.286938; Median =
4.07344
Let us fit to the sample an Extreme Value Distribution (Gumbel) with location and scale parameters a and b, respectively: f(x;a,b) =

-‰
-x+a
b
+
-x+a
b
b
58 | Fat Tails and (Anti)fragility - N N Taleb
Figure 4.3. : Fitting an extreme value distribution (Gumbel) a= 3.97904, b= 0.235239
So far, beautiful. Let us next move to fat(ter) tails.
Case 2, Fat-Tailed Distribution
Now let us generate, exactly as before, but change the distribution, with N random power law distributed variables Z
i
, with tail exponent m=3,
generated from a Student T Distribution with 3 degrees of freedom. Again, we take the upper bound. This time it is not the Gumbel, but the
Fréchet distribution that would fit the result, Fréchet f(x; a,b)=

-
x
b
-a
a J
x
b
N
-1-a
b
, for x>0
Figure 4.4. Fitting a Fréchet distribution to the Student T generated with m=3 degrees of freedom. The Frechet distribution a=3, b=32 fits up to higher values of E. But next
two graphs shows the fit more closely.
Fat Tails and (Anti)fragility - N N Taleb| 59
Figure 4.5. Seen more closely
How Extreme Value Has a Severe Inverse Problem In the Real World
In the previous case we start with the distribution, with the assumed parameters, then get the corresponding values, as these “risk modelers” do.
In the real world, we don’t quite know the calibration, the a of the distribution, assuming (generously) that we know the distribution. So here
we go with the inverse problem. The next table illustrates the different calibrations of P
K
the probabilities that the maximum exceeds a certain
value K (as a multiple of b under different values of K and a.
a
1
P
>3 b
1
P
>10 b
1
P
>20 b
1. 3.52773 10.5083 20.5042
1.25 4.46931 18.2875 42.7968
1.5 5.71218 32.1254 89.9437
1.75 7.3507 56.7356 189.649
2. 9.50926 100.501 400.5
2.25 12.3517 178.328 846.397
2.5 16.0938 316.728 1789.35
2.75 21.0196 562.841 3783.47
3. 27.5031 1000.5 8000.5
3.25 36.0363 1778.78 16918.4
3.5 47.2672 3162.78 35777.6
3.75 62.048 5623.91 75659.8
4. 81.501 10000.5 160000.
4.25 107.103 17783.3 338359.
4.5 140.797 31623.3 715542.
4.75 185.141 56234.6 1.51319µ10
6
5. 243.5 100001. 3.2µ10
6
Consider that the error in estimating the a of a distribution is quite large, often > 1/2, and typically overstimated. So we can see that we get the
probabilities mixed up > an order of magnitude. In other words the imprecision in the computation of the a compounds in the evaluation of
the probabilities of extreme values.
4.3 Using Power Laws Without Being Harmed by Mistakes
We can use power laws in the “near tails” for information, not risk management. That is, not pushing outside the tails, staying within a part of
the distribution for which errors are not compounded.
I was privileged to get access to a database with cumulative sales for editions in print that had at least one unit sold that particular week (that is,
conditional of the specific edition being still in print). I fit a powerlaw with tail exponent a> 1.3 for the upper 10% of sales (graph), with
N=30K. Using the Zipf variation for ranks of powerlaws, with r
x
and r
y
the ranks of book x and y, respectively, S
x
and S
y
the corresponding
sales
60 | Fat Tails and (Anti)fragility - N N Taleb
I was privileged to get access to a database with cumulative sales for editions in print that had at least one unit sold that particular week (that is,
conditional of the specific edition being still in print). I fit a powerlaw with tail exponent a> 1.3 for the upper 10% of sales (graph), with
N=30K. Using the Zipf variation for ranks of powerlaws, with r
x
and r
y
the ranks of book x and y, respectively, S
x
and S
y
the corresponding
sales
(4.1)
S
x
S
y
=
r
x
r
y
-
1
a
So for example if the rank of x is 100 and y is 1000, x sells I
100
1000
M
-
1
1.3
= 5.87 times what y sells.
Note this is only robust in deriving the sales of the lower ranking edition (r
y
> r
x
) because of inferential problems in the presence of fat-tails.
a=1.3
Near tail
100 10
4
10
6
X
10
-4
0.001
0.01
0.1
1
P
>X
Figure 4.7 Log-Log Plot of the probability of exceeding X (book sales), and X
This works best for the top 10,000 books, but not quite the top 20 (because the tail is vastly more unstable). Further, the effective a for large
deviations is lower than 1.3. But this method is robust as applied to rank within the “near tail”.
Fat Tails and (Anti)fragility - N N Taleb| 61
5
How To Tell True Fat Tails from Poisson Jumps
5.1 Beware The Poisson
By the masquerade problem, any power law can be seen backward as a Gaussian plus a series of simple (that is, noncompound) Poisson
jumps, the so-called jump-diffusion process. So the use of Poisson is often just a backfitting problem, where the researcher fits a Poisson,
happy with the “evidence”.
The next exercise aims to supply convincing evidence of scalability and NonPoisson-ness of the data (the Poisson here is assuming a standard
Poisson). Thanks to the need for the probabililities add up to 1, scalability in the tails is the sole possible model for such data. We may not be
able to write the model for the full distribution --but we know how it looks like in the tails, where it matters.
The Behavior of Conditional Averages: With a scalable (or "scale-free") distribution, when K is "in the tails" (say you reach the point when
f HxL = C X
-a
where C is a constant and a the power law exponent), the relative conditional expectation of X (knowing that x>K) divided by K,
that is,
E@X X>KD
K
is a constant, and does not depend on K. More precisely, it is
a
a-1
.
(5.1)
Ÿ
K

x f Hx, aL „x
Ÿ
K

f Hx, aL „x
=
K a
a - 1
This provides for a handy way to ascertain scalability by raising K and looking at the averages in the data.
Note further that, for a standard Poisson, (too obvious for a Gaussian): not only the conditional expectation depends on K, but it "wanes", i.e.
(5.2) Limit
Kض
Ÿ
K
¶ m
x
GHxL
„x
Ÿ
K
¶ m
x
x!
„x
ìK = 1
Calibrating Tail Exponents. In addition, we can calibrate power laws. Using K as the cross-over point, we get the a exponent above it --the
same as if we used the Hill estimator or ran a regression above some point.
We defined fat tails in the previous chapter as the contribution of the low frequency events to the total properties. But fat tails can come from
different classes of distributions. This chapter will present the difference between two broad classes of distributions.
This brief test using 12 million pieces of exhaustive returns shows how equity prices (as well as short term interest rates) do not have a
characteristic scale. No other possible method than a Paretan tail, albeit of unprecise calibration, can charaterize them.
5.2 Leave it to the Data
We tried the exercise with about every piece of data in sight: single stocks, macro data, futures, etc.
Equity Dataset: We collected the most recent 10 years (as of 2008) of daily prices for U.S. stocks (no survivorship bias effect as we included
companies that have been delisted up to the last trading day), n= 11,674,825 , deviations expressed in logarithmic returns.
We scaled the data using various methods.
The expression in "numbers of sigma" or standard deviations is there to conform to industry language (it does depend somewhat on the
stability of sigma). In the "MAD" space test we used the mean deviation.
62 | Fat Tails and (Anti)fragility - N N Taleb
MADHiL =
Log
S
t
i
S
t-1
i
1
N

j=0
N
LogK
S
t-j
i
S
t-j-1
i
O
We focused on negative deviations. We kept moving K up until to 100 MAD (indeed) --and we still had observations.
Implied a
K
=
E@X
X<K
D
E@X
X<K
D - K
MAD E@X
X<K
D n Hfor X < KL
E@X X<KD
K
Implied a
-1. -1.75202 1.32517µ10
6
1.75202 2.32974
-2. -3.02395 300806. 1.51197 2.95322
-5. -7.96354 19285. 1.59271 2.68717
-10. -15.3283 3198. 1.53283 2.87678
-15. -22.3211 1042. 1.48807 3.04888
-20. -30.2472 418. 1.51236 2.95176
-25. -40.8788 181. 1.63515 2.57443
-50. -101.755 24. 2.0351 1.96609
-70. -156.709 11. 2.23871 1.80729
-75. -175.422 9. 2.33896 1.74685
-100. -203.991 7. 2.03991 1.96163
Short term Interest Rates
EuroDollars Front Month 1986-2006
n=4947
MAD E@X
X<K
D n Hfor X < KL
E@X X<KD
K
Implied a
-0.5 -1.8034 1520 3.6068 1.38361
-1. -2.41323 969 2.41323 1.7076
-5. -7.96752 69 1.5935 2.68491
-6. -9.2521 46 1.54202 2.84496
-7. -10.2338 34 1.46197 3.16464
-8. -11.4367 24 1.42959 3.32782
UK Rates 1990-2007
n=4143
MAD E@X
X<K
D n Hfor X < KL
E@X X<KD
K
Implied a
0.5 1.68802 1270 3.37605 1.42087
1. 2.23822 806 2.23822 1.80761
3. 4.97319 140 1.65773 2.52038
5. 8.43269 36 1.68654 2.45658
6. 9.56132 26 1.59355 2.68477
7. 11.4763 16 1.63947 2.56381
Literally, you do not even have a large number K for which scalability drops from a small sample effect.
Global Macroeconomic data
Fat Tails and (Anti)fragility - N N Taleb| 63
Taleb(2008), International Journal of Forecasting.

64 | Fat Tails and (Anti)fragility - N N Taleb
6
How Power Laws Emerge From Recursive Epistemic
Uncertainty
6.1 The Opposite of Central Limit
With the Central Limit Theorem: we start with a distribution and end with a Gaussian. The opposite is more likely to be true. Recall how we
fattened the tail of the Gaussian by stochasticizing the variance? Now let us use the same metaprobability method, put add additional layers of
uncertainty.
The Regress Argument (Error about Error)
The main problem behind The Black Swan is the limited understanding of model (or representation) error, and, for those who get it, a lack of
understanding of second order errors (about the methods used to compute the errors) and by a regress argument, an inability to continuously
reapplying the thinking all the way to its limit (particularly when they provide no reason to stop). Again, there is no problem with stopping the
recursion, provided it is accepted as a declared a priori that escapes quantitative and statistical methods.
Epistemic not statistical re-derivation of power laws: Note that previous derivations of power laws have been statistical (cumulative advantage,
preferential attachment, winner-take-all effects, criticality), and the properties derived by Yule, Mandelbrot, Zipf, Simon, Bak, and others result
from structural conditions or breaking the independence assumptions in the sums of random variables allowing for the application of the central
limit theorem. This work is entirely epistemic, based on standard philosophical doubts and regress arguments.
6.2 Methods and Derivations
Layering Uncertainties
Take a standard probability distribution, say the Gaussian. The measure of dispersion, here s, is estimated, and we need to attach some
measure of dispersion around it. The uncertainty about the rate of uncertainty, so to speak, or higher order parameter, similar to what called the
"volatility of volatility" in the lingo of option operators (see Taleb, 1997, Derman, 1994, Dupire, 1994, Hull and White, 1997) --here it would
be "uncertainty rate about the uncertainty rate". And there is no reason to stop there: we can keep nesting these uncertainties into higher orders,
with the uncertainty rate of the uncertainty rate of the uncertainty rate, and so forth. There is no reason to have certainty anywhere in the
process.
Higher order integrals in the Standard Gaussian Case
We start with the case of a Gaussian and focus the uncertainty on the assumed standard deviation. Define f(m,s,x) as the Gaussian PDF for
value x with mean m and standard deviation s.
A 2
nd
order stochastic standard deviation is the integral of f across values of s œ ]0,¶[, under the measure f Hs, s
1
, sL, with s
1
its scale
parameter (our approach to trach the error of the error), not necessarily its standard deviation; the expected value of s
1
is s
1
.
(6.1)
f HxL
1
=

0

fHm, s, xL f Hs, s
1
, sL „s
Generalizing to the N
th
order, the density function f(x) becomes
(6.2)
f HxL
N
=

0

...

0

fHm, s, xL f Hs, s
1
, sL f Hs
1
, s
2
, s
1
L ... f Hs
N-1
, s
N
, s
N-1
L „s„s
1
„s
2
... „s
N
The problem is that this approach is parameter-heavy and requires the specifications of the subordinated distributions (in finance, the lognormal
has been traditionally used for s
2
(or Gaussian for the ratio Log[
s
t
2
s
2
] since the direct use of a Gaussian allows for negative values). We would
need to specify a measure f for each layer of error rate. Instead this can be approximated by using the mean deviation for s, as we will see next.
Fat Tails and (Anti)fragility - N N Taleb| 65
The problem is that this approach is parameter-heavy and requires the specifications of the subordinated distributions (in finance, the lognormal
has been traditionally used for s
2
(or Gaussian for the ratio Log[
s
t
2
s
2
] since the direct use of a Gaussian allows for negative values). We would
need to specify a measure f for each layer of error rate. Instead this can be approximated by using the mean deviation for s, as we will see next.
Discretization using nested series of two-states for s- a simple multiplicative process
We saw in the last chapter a quite effective simplification to capture the convexity, the ratio of (or difference between) f(m,s,x) and
Ÿ
0

fHm, s, xL f Hs, s
1
, sL „s (the first order standard deviation) by using a weighted average of values of s, say, for a simple case of one-
order stochastic volatility:
s H1 # a H1LL, 0 § a H1L < 1
where a(1) is the proportional mean absolute deviation for s, in other word the measure of the absolute error rate for s. We use
1
2
as the
probability of each state. Unlike the earlier situation we are not preserving the variance, rather the STD.
Thus the distribution using the first order stochastic standard deviation can be expressed as:
(6.3) f HxL
1
=
1
2
HfHm, sH1 + aH1LL, xL + f Hm, sH1 - aH1LL, xLL
Now assume uncertainty about the error rate a(1), expressed by a(2), in the same manner as before. Thus in place of a(1) we have
1
2
a(1)( 1±
a(2)).
s
H1-aH1LL s
HaH1L +1L s
HaH1L +1L H1-aH2LL s
HaH1L +1L HaH2L +1L s
H1-aH1LL H1-aH2LL s
H1-aH1LL HaH2L +1L s
H1-aH1LL H1-aH2LL H1-aH3LL s
H1-aH1LL HaH2L +1L H1-aH3LL s
HaH1L +1L H1-aH2LL H1-aH3LL s
HaH1L +1L HaH2L +1L H1-aH3LL s
H1-aH1LL H1-aH2LL HaH3L +1L s
H1-aH1LL HaH2L +1L HaH3L +1L s
HaH1L +1L H1-aH2LL HaH3L +1L s
HaH1L +1L HaH2L +1L HaH3L +1L s
Figure 4:- Three levels of error rates for s following a multiplicative process
The second order stochastic standard deviation:
66 | Fat Tails and (Anti)fragility - N N Taleb
(6.4)
f HxL
2
=
1
4
8fHm, sH1 + aH1L H1 + aH2LLL, xL +
fHm, sH1 - aH1L H1 + aH2LLL, xL + fHm, sH1 + aH1L H1 - aH2LL, xL + f Hm, sH1 - aH1L H1 - aH2LLL, xL<
and the N
th
order:
(6.5) f HxL
N
=
1
2
N

i=1
2
N
fIm, sM
i
N
, xM
where M
i
N
is the i
th
scalar (line) of the matrix M
N
I2
N
µ 1M
(6.6)
M
N
= :‰
j=1
N
Ha HjL TPi, jT + 1L>
i=1
2
N
and T[[i,j]] the element of i
th
line and j
th
column of the matrix of the exhaustive combination of N-Tuples of (-1,1),that is the N-dimention-
alvector {1,1,1,...} representing all combinations of 1 and -1.
for N=3
T =
1 1 1
1 1 -1
1 -1 1
1 -1 -1
-1 1 1
-1 1 -1
-1 -1 1
-1 -1 -1
and M
3
=
H1 - aH1LL H1 - aH2LL H1 - aH3LL
H1 - aH1LL H1 - aH2LL HaH3L + 1L
H1 - aH1LL HaH2L + 1L H1 - aH3LL
H1 - aH1LL HaH2L + 1L HaH3L + 1L
HaH1L + 1L H1 - aH2LL H1 - aH3LL
HaH1L + 1L H1 - aH2LL HaH3L + 1L
HaH1L + 1L HaH2L + 1L H1 - aH3LL
HaH1L + 1L HaH2L + 1L HaH3L + 1L
so M
1
3
= 8H1 - aH1LL H1 - aH2LL H1 - aH3LL<, etc.
Figure 4: Thicker tails (higher peaks) for higher values of N; here N=0,5,10,25,50, all values of a=
1
10
Note that the various error rates a(i) are not similar to sampling errors, but rather projection of error rates into the future. They are, to repeat,
epistemic.
The Final Mixture Distribution
The mixture weighted average distribution (recall that f is the ordinary Gaussian with mean m, std s and the random variable x).
(6.7) f Hx m, s, M, NL = 2
-N

i=1
2
N
fIm, sM
i
N
, x M
It could be approximated by a lognormal distribution for s and the corresponding V as its own variance. But it is precisely the V that interest
us, and V depends on how higher order errors behave.
Fat Tails and (Anti)fragility - N N Taleb| 67
Next let us consider the different regimes for higher order errors.
6.3 Regime 1 (Explosive): Case of a Constant parameter a
Special case of constant a: Assume that a(1)=a(2)=...a(N)=a, i.e. the case of flat proportional error rate a. The Matrix M collapses into a
conventional binomial tree for the dispersion at the level N.
(6.8)
f Hx m, s, M, NL = 2
-N

j=0
N
K
N
j
O fIm, sHa + 1L
j
H1 - aL
N-j
, xM
Because of the linearity of the sums, when a is constant, we can use the binomial distribution as weights for the moments (note again the
artificial effect of constraining the first moment m in the analysis to a set, certain, and known a priori).
Order Moment
1 m
2 s
2
Ia
2
+1M
N
+m
2
3 3 m s
2
Ia
2
+1M
N
+m
3
4 6 m
2
s
2
Ia
2
+1M
N
+m
4
+3 Ia
4
+6 a
2
+1M
N
s
4
5 10 m
3
s
2
Ia
2
+1M
N
+m
5
+15 Ia
4
+6 a
2
+1M
N
m s
4
6 15 m
4
s
2
Ia
2
+1M
N
+m
6
+15 IIa
2
+1M Ia
4
+14 a
2
+1MM
N
s
6
+45 Ia
4
+6 a
2
+1M
N
m
2
s
4
7 21 m
5
s
2
Ia
2
+1M
N
+m
7
+105 IIa
2
+1M Ia
4
+14 a
2
+1MM
N
m s
6
+105 Ia
4
+6 a
2
+1M
N
m
3
s
4
8 28 m
6
s
2
Ia
2
+1M
N
+m
8
+105 Ia
8
+28 a
6
+70 a
4
+28 a
2
+1M
N
s
8
+420 IIa
2
+1M Ia
4
+14 a
2
+1MM
N
m
2
s
6
+210 Ia
4
+6 a
2
+1M
N
m
4
s
4
For clarity, we simplify the table of moments, with m=0
Order Moment
1 0
2 Ia
2
+1M
N
s
2
3 0
4 3 Ia
4
+6 a
2
+1M
N
s
4
5 0
6 15 Ia
6
+15 a
4
+15 a
2
+1M
N
s
6
7 0
8 105 Ia
8
+28 a
6
+70 a
4
+28 a
2
+1M
N
s
8
Note again the oddity that in spite of the explosive nature of higher moments, the expectation of the absolute value of x is both independent of a
and N, since the perturbations of s do not affect the first absolute moment =
2
p
s (that is, the initial assumed s). The situation would be
different under addition of x.
Every recursion multiplies the variance of the process by (1+a
2
). The process is similar to a stochastic volatility model, with the standard
deviation (not the variance) following a lognormal distribution, the volatility of which grows with M, hence will reach infinite variance at the
limit.
Consequences
For a constant a > 0, and in the more general case with variable a where a(n) ¥ a(n-1), the moments explode.
A- Even the smallest value of a >0, since I1 +a
2
M
N
is unbounded, leads to the second moment going to infinity (though not the first)
when Nض. So something as small as a .001% error rate will still lead to explosion of moments and invalidation of the use of the class of
!
2
distributions.
B- In these conditions, we need to use power laws for epistemic reasons, or, at least, distributions outside the !
2
norm, regardless of
observations of past data.
Note that we need an a priori reason (in the philosophical sense) to cutoff the N somewhere, hence bound the expansion of the second moment.
Convergence to Properties Similar to Power Laws
We can see on the example next Log-Log plot (Figure 1) how, at higher orders of stochastic volatility, with equally proportional stochastic
coefficient, (where a(1)=a(2)=...=a(N)=
1
10
) how the density approaches that of a power law (just like the Lognormal distribution at higher
variance), as shown in flatter density on the LogLog plot. The probabilities keep rising in the tails as we add layers of uncertainty until they
seem to reach the boundary of the power law, while ironically the first moment remains invariant.
68 | Fat Tails and (Anti)fragility - N N Taleb
We can see on the example next Log-Log plot (Figure 1) how, at higher orders of stochastic volatility, with equally proportional stochastic
coefficient, (where a(1)=a(2)=...=a(N)=
1
10
) how the density approaches that of a power law (just like the Lognormal distribution at higher
variance), as shown in flatter density on the LogLog plot. The probabilities keep rising in the tails as we add layers of uncertainty until they
seem to reach the boundary of the power law, while ironically the first moment remains invariant.
10.0 5.0 2.0 20.0 3.0 30.0 1.5 15.0 7.0
Log x
10
-13
10
-10
10
-7
10
-4
0.1
Log PrHxL
a=
1
10
, N=0,5,10,25,50
Figure x - LogLog Plot of the probability of exceeding x showing power law-style flattening as N rises. Here all values of a= 1/10
The same effect takes place as a increases towards 1, as at the limit the tail exponent P>x approaches 1 but remains >1.
Effect on Small Probabilities
Next we measure the effect on the thickness of the tails. The obvious effect is the rise of small probabilities.
Take the exceedant probability, that is, the probability of exceeding K, given N, for parameter a constant :
(6.9)
P > K N = ‚
j=0
N
2
-N-1
K
N
j
O erfc
K
2 sHa + 1L
j
H1 - aL
N-j
where erfc(.) is the complementary of the error function, 1-erf(.), erfHzL =
2
p
Ÿ
0
z
e
-t
2
dt
Convexity effect:The next Table shows the ratio of exceedant probability under different values of N divided by the probability in the case of a
standard Gaussian.
a =
1
100
N
P>3,N
P>3,N=0
P>5,N
P>5,N=0
P>10,N
P>10,N=0
5 1.01724 1.155 7
10 1.0345 1.326 45
15 1.05178 1.514 221
20 1.06908 1.720 922
25 1.0864 1.943 3347
a =
1
10
N
P>3,N
P>3,N=0
P>5,N
P>5,N=0
P>10,N
P>10,N=0
5 2.74 146 1.09µ10
12
10 4.43 805 8.99µ10
15
15 5.98 1980 2.21µ10
17
20 7.38 3529 1.20µ10
18
25 8.64 5321 3.62µ10
18
Fat Tails and (Anti)fragility - N N Taleb| 69
6.4 Regime 2: Cases of decaying parameters a(n)
As we said, we may have (actually we need to have) a priori reasons to decrease the parameter a or stop N somewhere. When the higher order
of a(i) decline, then the moments tend to be capped (the inherited tails will come from the lognormality of s).
Regime 2-a; First Method: “bleed” of higher order error
Take a "bleed" of higher order errors at the rate l, 0§ l < 1 , such as a(N) = l a(N-1), hence a(N) = l
N
a(1), with a(1) the conventional
intensity of stochastic standard deviation. Assume m=0.
With N=2 , the second moment becomes:
(6.10) M2H2L = IaH1L
2
+ 1M s
2
IaH1L
2
l
2
+ 1M
With N=3,
(6.11) M2H3L = s
2
I1 + aH1L
2
M I1 + l
2
aH1L
2
M I1 + l
4
aH1L
2
M
finally, for the general N:
(6.12) M3HNL = IaH1L
2
+ 1M s
2

i=1
N-1
IaH1L
2
l
2 i
+ 1M
We can reexpress H12L using the Q - Pochhammer symbol Ha; qL
N
= ‰
i=1
N-1
I1 - aq
i
M
(6.13) M2HNL = s
2
I-aH1L
2
; l
2
M
N
Which allows us to get to the limit
(6.14) Limit M2 HNL
Nض
= s
2
Hl
2
; l
2
L
2
HaH1L
2
; l
2
L

Hl
2
- 1L
2
Hl
2
+ 1L
As to the fourth moment:
By recursion:
(6.15) M4HNL = 3 s
4

i=0
N-1
I6 aH1L
2
l
2 i
+ aH1L
4
l
4 i
+ 1M
(6.16) M4HNL = 3 s
4
KK2 2 - 3O aH1L
2
; l
2
O
N
K-K3 + 2 2 O aH1L
2
; l
2
O
N
(6.17) Limit M4H NL
Nض
= 3 s
4
KK2 2 - 3O aH1L
2
; l
2
O

K-K3 + 2 2 O aH1L
2
; l
2
O

So the limiting second moment for l=.9 and a(1)=.2 is just 1.28 s
2
, a significant but relatively benign convexity bias. The limiting fourth
moment is just 9.88 s
4
, more than 3 times the Gaussian’s (3 s
4
), but still finite fourth moment. For small values of a and values of l close to
1, the fourth moment collapses to that of a Gaussian.
Regime 2-b; Second Method, a Non Multiplicative Error Rate
For N recursions
sH1 # HaH1L H1 # H a H2L H1 # aH3L H ... LLL
(6.18)
PHx, m, s, NL =

i=1
L
f Hx, m, sH1 + H T
N
. A
N
L
i
L
L
70 | Fat Tails and (Anti)fragility - N N Taleb
IM
N
.T + 1M
i
is the i
th
component of the HNä1L dot product of T
N
the matrix of Tuples in H6L ,
L the length of the matrix, and A is the vector of parameters
A
N
= 9a
j
=
j=1,... N
So for instance, for N=3, T= {1, a, a
2
, a
3
}
T
3
. A
3
=
a + a
2
+ a
3
a + a
2
- a
3
a - a
2
+ a
3
a - a
2
- a
3
-a + a
2
+ a
3
-a + a
2
- a
3
-a - a
2
+ a
3
-a - a
2
- a
3
The moments are as follows:
(6.19)
M1HNL = m
(6.20) M2HNL = m
2
+ 2 s
(6.21) M4HNL = m
4
+ 12 m
2
s + 12 s
2

i=0
N
a
2 i
at the limit of N ض
(6.22) Lim
N ض
M4HNL = m
4
+ 12 m
2
s + 12 s
2
1
1 - a
2
which is very mild.
6.5 Conclusion
Something Boring & something about epistemic opacity.
This part
will be
expanded
Fat Tails and (Anti)fragility - N N Taleb| 71
7
An Introduction to Metaprobability
Ludic fallacy (or uncertainty of the nerd): the manifestation of the Platonic fallacy in the study of uncertainty; basing studies of chance
on the narrow world of games and dice (where we know the probabilities ahead of time, or can easily discover them). A-Platonic
randomness has an additional layer of uncertainty concerning the rules of the game in real life. (in the Glossary of The Black Swan)
Epistemic opacity: Randomness is the result of incomplete information at some layer. It is functionally indistinguishable from “true” or
“physical” randomness.
Randomness as incomplete information: simply, what I cannot guess is random because my knowledge about the causes is incomplete, not
necessarily because the process has truly unpredictable properties.
7.1 Metaprobability
The Effect of Estimation Error, General Case
The idea of model error from missed uncertainty attending the parameters (another layer of randomness) is as follows.
Most estimations in economics (and elsewhere) take, as input, an average or expected parameter, a
-
=
Ÿ
a fHaL „a, where a is f distributed (
deemed to be so a priori or from past samples), and regardles of the dispersion of a, build a probability distribution for X that relies on the
mean estimated parameter, p HX)= pIX a
-
M, rather than the more appropriate metaprobability adjusted probability:
(7.1) pHXL =

pHX aL fHaL „a
In other words, if one is not certain about a parameter a, there is an inescapable layer of stochasticity; such stochasticity raises the expected
(metaprobability-adjusted) probability if it is <
1
2
and lowers it otherwise. The uncertainty is fundamentally epistemic, includes incertitude, in
the sense of lack of certainty about the parameter.
The model bias becomes an equivalent of the Jensen gap (the difference between the two sides of Jensen's inequality), typically positive since
probability is convex away from the center of the distribution. We get the bias w
A
from the differences in the steps in integration
(7.2) w
A
=

pHX aL fHaL „a - pKX

a fHaL „aO
With f(X) a function , f(X)=X for the mean, etc., we get the higher order bias w
A'
(7.3) w
A'
=
‡ ‡
f HXL pHX aL fHaL „a „X -

f HXL pKX

a fHaL „aO „X
Now assume the distribution of a as discrete n states, with a = 8a
i
<
i=1
n
each with associated probability f= 8f
i
<
i=1
n

i=1
n
f
i
= 1 . Then (1)
becomes
(7.4) pHXL = ‚
i=1
n
p HX a
i
L f
i
So far this holds for a any parameter of any distribution.
72 | Fat Tails and (Anti)fragility - N N Taleb
This part
will be
expanded
7.2 The Effect of Metaprobability on the Calibration of Power Laws
In the presence of a layer of metaprobabilities (from uncertainty about the parameters), the asymptotic tail exponent for a powerlaw corre-
sponds to the lowest possible tail exponent regardless of its probability. The problem explains “Black Swan” effects, i.e., why measurements
tend to chronically underestimate tail contributions, rather than merely deliver imprecise but unbiased estimates.
When the perturbation affects the standard deviation of a Gaussian or similar nonpowerlaw tailed distribution, the end product is the weighted
average of the probabilities. However, a powerlaw distribution with errors about the possible tail exponent will bear the asymptotic properties
of the lowest exponent, not the average exponent.
Now assume p(X) a standard Pareto Distribution with a the tail exponent being estimated, p HX aL = a X
-a-1
X
min
a
, where X
min
is the lower
bound for X,
(7.5) pHXL = ‚
i=1
n
a
i
X
-ai-1
X
min
ai
f
i
Taking it to the limit
Limit
Xض
X
a
*
+1

i=1
n
a
i
X
-ai-1
X
min
ai
f
i
= K
where K is a strictly positive constant and a
*
= min a
i
1§i§n
. In other words ⁄
i=1
n
a
i
X
-ai-1
X
min
ai
f
i
is asymptotically equivalent to a constant times
X
a
*
+1
. The lowest parameter in the space of all possibilities becomes the dominant parameter for the tail exponent.
pIX a
-
M
pHX a
*
L

i=1
n
p HX a
i
L f
i

5 10 50 100 500 1000
X
10
-7
10
-5
0.001
0.1
Prob
Figure 7.1 Log-log plot illustration of the asymptotic tail exponent with two states. The graphs shows the different situations, a) pIX a
-
M b) ⁄
i =1
n
p HX a
i
L f
i
and c) pHX a
*
L .
We can see how b) and c) converge
The asymptotic Jensen Gap w
A
becomes p HX a
*
L - p HX a
-
M
Implications
1. Whenever we estimate the tail exponent from samples, we are likely to underestimate the thickness of the tails, an observation made about
Monte Carlo generated a-stable variates and the estimated results (the “Weron effect”).
2. The higher the estimation variance, the lower the true exponent.
3. The asymptotic exponent is the lowest possible one. It does not even require estimation.
4. Metaprobabilistically, if one isn’t sure about the probability distribution, and there is a probability that the variable is unbounded and
“could be” powerlaw distributed, then it is powerlaw distributed, and of the lowest exponent.
Fat Tails and (Anti)fragility - N N Taleb| 73
The obvious conclusion is to in the presence of powerlaw tails, focus on changing payoffs to clip tail exposures to limit w
A'
and “robustify” tail
exposures, making the computation problem go away.
7.3 The Effect of Metaprobability on Fat Tails
74 | Fat Tails and (Anti)fragility - N N Taleb
8
Brownian Motion in the Real World (Under Path
Dependence and Fat Tails)
Most of the work concerning martingales and Brownian motion can be idealized to the point of lacking any match to reality, in spite of the
sophisticated, rather complicated discussions. This section discusses the (consequential) differences.
8.1 Path Dependence and History as Revelation of Antifragility
Let us examine the non-Markov property of antifragility. Something that incurred hard times but did not fall apart is giving us information
about its solidity, compared to something that has not been subjected to such stressors.
(The Markov Property for, say, a Brownian Motion X
N 8X1,X2,... XN-1<
= X
N 8XN-1<
, that is the last realization is the only one that matters. Now
if we take fat tailed models, such as stochastic volatility processes, the properties of the system are Markov, but the history of the past realiza-
tions of the process matter in determining the present variance. )
Take M realizations of a Brownian Bridge process pinned at S
t0
= 100 and S
T
= 120, sampled with N periods separated by Dt, with the sequence
S, a collection of Brownian-looking paths with single realizations indexed by j ,
(8.1) S
i
j
= :9S
iDt+t0
j
=
i=0
N
>
j=1
M
Nowtake m
*
= min
j
min
i
S
i
j
and 9 j : min S
i
j
= m
*
=
Take 1) the sample path with the most direct route (Path 1) defined as its lowest minimum , and 2) the one with the lowest minimum m
*
(Path
2). The state of the system at period T depends heavily on whether the process S
T
exceeds its minimum (Path 2), that is whether arrived there
thanks to a steady decline, or rose first, then declined.
If the properties of the process depend on (S
T
- m*), then there is path dependence. By properties of the process we mean the variance,
projected variance in, say, stochastic volatility models, or similar matters.
Path 1 , S
min
j
S
T
j
0.0 0.2 0.4 0.6 0.8 1.0
Time
80
100
120
140
S
Figure 8.1 Brownian Bridge Pinned at 100 and 120, with multiple realizations {S
0
j
,S
1
j
...,S
T
j
}, each indexed by j ; the idea is to find the path j that satisfies the maximum
distance D
j
= ¢S
T
-S
min
j

Fat Tails and (Anti)fragility - N N Taleb| 75
Figure 8.1 Brownian Bridge Pinned at 100 and 120, with multiple realizations {S
0
j
,S
1
j
...,S
T
j
}, each indexed by j ; the idea is to find the path j that satisfies the maximum
distance D
j
= ¢S
T
-S
min
j

This part
will be
expanded
8.2 Brownian Motion in the Real World
We mentioned in the discussion of the Casanova problem that stochastic calculus requires a certain class of distributions, such as the Gaussian.
It is not as we expect because of the convenience of the smoothness in squares (finite Dx
2
), rather because the distribution conserves across time
scales. By central limit, a Gaussian remains a Gaussian under summation, that is sampling at longer time scales. But it also remains a Gaussian
at shorter time scales. The foundation is infinite dividability.
The problems are as follows:
1. The results in the literature are subjected to the constaints that the Martingale M is member of the subset (H
2
) of square integrable
martingales, sup
t§T
E[M
2
] <¶
2. We know that the restriction does not work for lot or time series.
3. We know that, with q an adapted process, without
Ÿ
0
T
q
s
2
„s <¶ we can’t get most of the results of Ito’s lemma.
4. Even with
Ÿ
o
T
„W
2
< ¶, The situation is far from solved because of powerful, very powerful presamptotics.
Hint: Smoothness comes from
Ÿ
o
T
„W
2
becoming linear to T at the continuous limit --Simply dt is too small in front of dW
Take the normalized (i.e. sum=1) cumulative variance (see Bouchaud & Potters),

i=1
n
HW@i DtD-W@Hi-1L DtDL
2

i=1
TêDt
HW@i DtD-W@Hi-1L DtDL
2
. Let us play with finite variance
situations.
200 400 600 800 1000
0.2
0.4
0.6
0.8
1.0
,
200 400 600 800 1000
0.2
0.4
0.6
0.8
1.0
,
200 400 600 800 1000
0.2
0.4
0.6
0.8
1.0
Figure 8.2 Ito’s lemma in action. Three classes of processes with tail exonents: a = ¶ (Gaussian), a = 1.16 (the 80/20) and, a = 3. Even finite variance does not lead to the
smoothing of discontinuities except in the infinitesimal limit, another way to see failed asymptotes.
8.3 Stochastic Processes and Nonanticipating Strategies
There is a difference between the Stratonovich and Ito’s integration of a functional of a stochastic process. But there is another step missing in
Ito: the gap between information and adjustment.
This part
will be
expanded
76 | Fat Tails and (Anti)fragility - N N Taleb
This part
will be
expanded
8.4 Finite Variance not Necessary for Anything Ecological (incl. quant finance)
This part
will be
expanded
Fat Tails and (Anti)fragility - N N Taleb| 77
9
On the Difference Between Binaries and Vanillas with
Implications For “Prediction Markets”
This explains how and where prediction markets (or, more general discussions of betting matters) do not correspond to reality and have
little to do with exposures to fat tails and “Black Swan” effects. Elementary facts, but with implications. This show show, for instance, the
“long shot bias” is misapplied in real life variables, why political predictions are more robut than economic ones.
This discussion is based on Taleb (1997) which focuses largely on the difference between a binary and a vanilla option.
9.1 Definitions
1- A binary bet (or just “a binary” or “a digital”): a outcome with payoff 0 or 1 (or, yes/no, -1,1, etc.) Example: “prediction market”,
“election”, most games and “lottery tickets”. Also called digital. Any statistic based on YES/NO switch. Its estimator is M
T
X
HA, f L, which recall
1.x was

i=0
n
1A Xt
0
+i Dt

i=0
n
1
with f=1 and A either = (K,¶) or its complement HK, ¶L .
Binaries are effectively bets on probability, more specifically cumulative probabilities or their complement. They are rarely ecological,
except for political predictions.
More technically, they are mapped by the Heaviside function, q(K)= 1 if x>K and 0 if x<K . q(K) is itself the integral of the dirac delta
function which, in expectation, will deliver us the equivalent of probability densities.
2- A exposure or “vanilla”: an outcome with no open limit: say “revenues”, “market crash”, “casualties from war”, “success”, “growth”,
“inflation”, “epidemics”... in other words, about everything.
Exposures are generally “expectations”, or the arithmetic mean, never bets on probability, rather the pair probability ä payoff
3- A bounded exposure: an exposure(vanilla) with an upper and lower bound: say an insurance policy with a cap, or a lottery ticket. When the
boundary is close, it approaches a binary bet in properties. When the boundary is remote (and unknown), it can be treated like a pure exposure.
The idea of “clipping tails” of exposures transforms them into such a category.
Binary
Vanilla
Bet
Level
x
fHxL
The different classes of payoff.
A vanilla or continuous payoff can be decomposed as a series of binaries. But it can be an infinite series. It would correspond to, with L < K< H:
78 | Fat Tails and (Anti)fragility - N N Taleb
(9.1)
M
T
X
HHL, HL, xL ª lim
DKØ0

i=1
H-K
DK
HK + i DKL - ‚
i=1
K-L
DK
HK - i DKL
DK
K
and for unbounded vanilla payoffs, L=0,
(9.2) M
T
X
HH-¶, ¶L, xL ª lim
Kض
lim
DKØ0

i=1
H-K
DK
HK + i DKL - ‚
i=1
K
DK
HK - i DKL
DK
K
The Problem
The properties of binaries diverge from those of vanilla exposures. This note is to show how conflation of the two takes place: prediction
markets, ludic fallacy (using the world of games to apply to real life),
1. They have diametrically opposite responses to skewness (mean-preserving increase in skewness).
Proof TK
2. They respond differently to fat-tailedness (sometimes in opposite directions). Fat tails makes binaries more tractable.
Proof TK
3. Rise in complexity lowers the value of the binary and increases that of the vanilla.
Proof TK
Some direct applications:
1- Studies of “long shot biases” that typically apply to binaries should not port to vanillas.
2- Many are surprised that I find many econometricians total charlatans, while Nate Silver to be immune to my problem. This explains why.
3- Why prediction markets provide very limited information outside specific domains.
4- Etc.
The Elementary Betting Mistake
One can hold beliefs that a variable can go lower yet bet that it is going higher. Simply, the digital and the vanilla diverge. PHX > X
0
L>
1
2
, but
EHXL < EHX
0
L. This is normal in the presence of skewness and extremely common with economic variables. Philosophers have a related
problem called the lottery paradox which in statistical terms is not a paradox.
In Fooled by Randomness, a trader was asked:
"do you predict that the market if going up or down?" "Up", he said, with confidence. Then the questioner got angry when he discovered
that the trader was short the market, i.e., would benefit from the market going down. The questioner could not get the idea that the trader
believed that the market had a higher probability of going up, but that, should it go down, it would go down a lot. So the rational idea was
to be short.
This divorce between the binary (up is more likely) and the vanilla is very prevalent in real-world variables.
9.3 The Elementary Fat Tails Mistake
A slightly more difficult problem. To the question: “What happens to the probability of a deviation >1s when you fatten the tail (while
preserving other properties)?”, almost all answer: it increases (so far all have made the mistake). Wrong. Fat tails is the contribution of the
extreme events to the total properties, and that it is a pair probability ä payoff that matters, not just probability. This is the reason people
thought that I meant (and reported) that Black Swans were more frequent; I meant Black Swans are more consequential, more determining, but
not more frequent.
I’ve asked variants of the same question. “The Gaussian distribution spends 68.2% of the time between ± 1 standard deviation. The real world
has fat tails. In finance, how much time do stocks spend between ± 1 standard deviations?” The answer has been invariably “lower”. Why?
“Because there are more deviations.” Sorry, there are fewer deviations: stocks spend between 78% and 98% between ± 1 standard deviations
(computed from past samples).
Fat Tails and (Anti)fragility - N N Taleb| 79
I’ve asked variants of the same question. “The Gaussian distribution spends 68.2% of the time between ± 1 standard deviation. The real world
has fat tails. In finance, how much time do stocks spend between ± 1 standard deviations?” The answer has been invariably “lower”. Why?
“Because there are more deviations.” Sorry, there are fewer deviations: stocks spend between 78% and 98% between ± 1 standard deviations
(computed from past samples).
Some simple derivations
Let x follow a Gaussian distribution (m , s). Assume m=0 for the exercise. What is the probability of exceeding one standard deviation?
P
>1 s
= 1 -
1
2
erfcK-
1
2
O , where erfc is the complimentary error function, P
>1 s
= P
< 1 s
>15.86% and the probability of staying within the
"stability tunnel" between ± 1 s is > 68.2 %.
Let us fatten the tail in a varince-preserving manner, using the standard method of linear combination of two Gaussians with two standard
deviations separated by s 1 + a and s 1 - a , where a is the "vvol" (which is variance preserving, technically of no big effect here, as a
standard deviation-preserving spreading gives the same qualitative result). Such a method leads to immediate raising of the Kurtosis by a factor
of H1 + a
2
L since
E Ix
4
M
E Ix
2
M
2
= 3 Ha
2
+ 1L
P
>1 s
= P
<1 s
= 1 -
1
2
erfc -
1
2 1 - a
-
1
2
erfc -
1
2 a + 1
So then, for different values of a as we can see, the probability of staying inside 1 sigma increases.
-4 -2 2 4
0.1
0.2
0.3
0.4
0.5
0.6
Fatter and fatter tails: different values of a. We notice that higher peak ïlower probability of nothing leaving the ±1 s tunnel
9.4 The More Advanced Fat Tails Mistake and Great Moderation
Fatter tails increases time spent between deviations, giving the illusion of absence of volatility when in fact events are delayed and made worse
(my critique of the “Great Moderation”).
Stopping Time & Fattening of the tails of a Brownian Motion: Consider the distribution of the time it takes for a continuously monitored
Brownian motion S to exit from a "tunnel" with a lower bound L and an upper bound H. Counterintuitively, fatter tails makes an exit (at some
sigma) take longer. You are likely to spend more time inside the tunnel --since exits are far more dramatic.
y is the distribution of exit time t, where t ª inf {t: S – [L,H]}
From Taleb (1997) we have the following approximation
yH t sL =
1
HlogHHL - logHLLL
2

-
1
8
It s
2
M
p s
2

n=1
m
1
H L
H-1L
n

-
n
2
p
2
t s
2
2 HlogHHL-logHLLL
2
n S L sin
n p HlogHLL - logHSLL
logHHL - logHLL
- H sin
n p HlogHHL - logHSLL
logHHL - logHLL
and the fatter-tailed distribution from mixing Brownians with s separated by a coefficient a:
80 | Fat Tails and (Anti)fragility - N N Taleb
yHt s, aL =
1
2
pH t sH1 - aLL +
1
2
pHt sH1 + aLL
This graph shows the lengthening of the stopping time between events coming from fatter tails.
2 4 6 8
Exit Time
0.1
0.2
0.3
0.4
0.5
Probability
0.1 0.2 0.3 0.4 0.5 0.6 0.7
v
3
4
5
6
7
8
Expected t
More Complicated : MetaProbabilities
9.5 The Fourth Quadrant Mitigation (or “Solution”)
Let us return to M[A, f HxL] of chapter 1. A quite significant result is that M[A,x
n
] may not converge, in the case of, say power laws with
exponent a < n, but M[A,x
m
] where m<n, might converge. Well, where the integral
Ÿ


f HxL pHxL „x does not exist, by “clipping tails”, we
canmake the payoff integrable. There are two routes;
1) Limiting f (turning an open payoff to a binary): when f(x) is a constant as in a binary
Ÿ


K pHxL „x will necessarily converge if p is a
probability distribution.
2) Clipping tails: (and this is the business we will deal with in Antifragile, Part II), where the payoff is bounded, A = (L, H), or the integral
Ÿ
L
H
f HxL pHxL „x will necessarily converge.
This part will be expanded
Two types of Decisions
Fat Tails and (Anti)fragility - N N Taleb| 81
1. M0, depend on the 0
th
moment, that is, "Binary", or simple, i.e., as we saw, you just care if something is true or false. Very true or very
false does not matter. Someone is either pregnant or not pregnant. A statement is "true" or "false" with some confidence interval. (I call
these M0 as, more technically, they depend on the zero
th
moment, namely just on probability of events, and not their magnitude —you just
care about "raw" probability). A biological experiment in the laboratory or a bet with a friend about the outcome of a soccer game belong
to this category.
2. M1
+
Complex,depend on the 1
st
or higher moments. You do not just care of the frequency—but of the impact as well, or, even more
complex, some function of the impact. So there is another layer of uncertainty of impact. (I call these M1+, as they depend on higher
moments of the distribution). When you invest you do not care how many times you make or lose, you care about the expectation: how
many times you make or lose times the amount made or lost.
Two types of probability structures:
There are two classes of probability domains—very distinct qualitatively and quantitatively. The first, thin-tailed: Mediocristan", the second,
thick tailed Extremistan:
Note the typo f(x)= 1 should be f(x) = x
82 | Fat Tails and (Anti)fragility - N N Taleb
The Map
Conclusion
The 4th Quadrant is mitigated by changes in exposures. And exposures in the 4th quadrant can be to the negative or the positive, depending on
if the domain subset A exposed on the left on on the right.
Fat Tails and (Anti)fragility - N N Taleb| 83
84 | Fat Tails and (Anti)fragility - N N Taleb
10
Nonlinear Transformations of Random Variables
10.1. The Conflation Problem: Exposures to X Confused With Knowledge About X
Exposure, not knowledge.Take X a random or nonrandom variable, and F(X) the exposure, payoff, the effect of X on you, the end bottom line.
(To be technical, X is higher dimensions, in !
N
but less assume for the sake of the examples in the introduction that it is a simple one-dimen-
sional variable).
The disconnect. As a practitioner and risk taker I see the following disconnect: people (nonpractitioners) talking to me about X (with the
implication that we practitioners should care about X in running our affairs) while I had been thinking about F(X), nothing but F(X). And the
straight confusion since Aristotle between X and F(X) has been chronic. Sometimes people mention F(X) as utility but miss the full payoff.
And the confusion is at two level: one, simple confusion; second, in the decision-science literature, seeing the difference and not realizing that
action on F(X) is easier than action on X.
Examples:
X is unemployment in Senegal, F
1
(X) is the effect on the bottom line of the IMF, and F
2
(X) is the effect on your grandmother (which I
assume is minimal).
X can be a stock price, but you own an option on it, so F(X) is your exposure an option value for X, or, even more complicated the utility of
the exposure to the option value.
X can be changes in wealth, F(X) the convex-concave value function of Kahneman-Tversky, how these “affect” you. One can see that F(X)
is vastly more stable or robust than X (it has thinner tails).
Variable
X
Funtion
fHXL
A convex and linear function of a variable X. Confusing f(X) (on the vertical) and X (the horizontal) is more and more significant when f(X) is nonlinear. The more convex f(X),
the more the statistical and other properties of f(X) will be divorced from those of X. For instance, the mean of f(X) will be different from f(Mean of X), by Jensen’s ineqality. But
beyond Jensen’s inequality, the difference in risks between the two will be more and more considerable. When it comes to probability, the more nonlinear f, the less the
probabilities of X matter compared to the nonlinearity of f. Moral of the story: focus on f, which we can alter, rather than the measurement of the elusive properties of X.
Fat Tails and (Anti)fragility - N N Taleb| 85
Probability Distribution of x Probability Distribution of fHxL
There are infinite numbers of functions F depending on a unique variable X.
All utilities need to be embedded in F.
Limitations of knowledge. What is crucial, our limitations of knowledge apply to X not necessarily to F(X). We have no control over X, some
control over F(X). In some cases a very, very large control over F(X).
This seems naive, but people do, as something is lost in the translation.
† The danger with the treatment of the Black Swan problem is as follows: people focus on X (“predicting X”). My point is that, although we
do not understand X, we can deal with it by working on F which we can understand, while others work on predicting X which we can’t
because small probabilities are incomputable, particularly in “fat tailed” domains. F(X) is how the end result affects you.
† The probability distribution of F(X) is markedly different from that of X, particularly when F(X) is nonlinear. We need a nonlinear
transformation of the distribution of X to get F(X). We had to wait until 1964 to get a paper on “convex transformations of random
variables”, Van Zwet (1964).
Bad news: F is almost always nonlinear, often “S curved”, that is convex-concave (for an increasing function).
The central point about what to understand: When F(X) is convex, say as in trial and error, or with an option, we do not need to understand X
as much as our exposure to H. Simply the statistical properties of X are swamped by those of H. That's the point of Antifragility in which
exposure is more important than the naive notion of "knowledge", that is, understanding X.
Fragility and Antifragility:
† When F(X) is concave (fragile), errors about X can translate into extreme negative values for F. When F(X) is convex, one is immune from
negative variations.
† The more nonlinear F the less the probabilities of X matter in the probability distribution of the final package F.
† Most people confuse the probabilites of X with those of F. I am serious: the entire literature reposes largely on this mistake.
So, for now ignore discussions of X that do not have F. And, for Baal’s sake, focus on F, not X.
10.2. Transformations of Probability Distributions
Say x follows a distribution p(x) and z= f(x) follows a distribution g(z). Assume g(z) continuous, increasing, and differentiable for now.
The density p at point r is defined by use of the integral
DHrL ª


r
p HxL „x
hence


r
pHxL „x =


f HrL
gHzL „z
In differential form
gHzL „z = pHxL „x
since x = f
H-1L
HzL, we get
gHzL „z = pI f
H-1L
HzLM „ f
H-1L
HzL
Now, the derivative of an inverse function f
H-1L
HzL =
1
f
£
I f
-1
HzLM
, which obtains the useful transformation heuristic:
86 | Fat Tails and (Anti)fragility - N N Taleb
(10.1) gHzL =
p I f
H-1L
HzLM
f ' HuL u = I f
H-1L
HzLM
In the event that g(z) is monotonic decreasing, then
(10.2) gHzL =
p I f
H-1L
HzLM
f ' HuL u = I f
H-1L
HzLM
Where f is convex,
1
2
H f Hx - DxL + f HDx + xLL > f HxL, concave if
1
2
H f Hx - DxL + f HDx + xLL< f(x). Let us simplify with sole condition,
assuming f(.) twice differentiable,
!
2
f
! x
2
> 0 for all values of x in the convex case and <0 in the concave one.
Some Examples.
Squaring x: p(x) is a Gaussian(with mean 0, standard deviation 1) , f(x)= x
2
g(x)=

-
x
2
2 2 p x
, x r 0
which corresponds to the Chi-square distribution with 1 degrees of freedom.
Exponentiating x :p(x) is a Gaussian(with mean m, standard deviation s)
g HxL =

-
HlogHxL-mL
2
2 s
2
2 p s x
which is the lognormal distribution.
10.3 Application 1: Happiness (f(x) does not have the same statistical properties as
wealth (x)
There is a conflation of fat-tailedness of Wealth and Utility
Case 1: The Kahneman Tversky Prospect theory, which is convex-concave
vHxL =
x
a
if x ¥ 0
-l H-x
a
L x < 0
with a & l calibrated a = 0.88 and l = 2.25
For x (the changes in wealth) following a T distribution with tail exponent a,
f HxL =
I
a
a+x
2
M
a+1
2
a BI
a
2
,
1
2
M
Ç
Where B is the Euler Beta function, BHa, bL "GHaL GHbL ê GHa + bL "
Ÿ
0
1
t
a-1
H1 - tL
b-1
dt; we get (skipping the details of z= v(u) and f(u) du =
z(x) dx), the distribution z(x) of the utility of happiness v(x)
zHx a, a, lL =
x
1-a
a J
a
a+x
2êa
N
a+1
2
a a BJ
a
2
,
1
2
N
x ¥ 0
I-
x
l
M
1-a
a
a
a+K-
x
l
O
2êa
a+1
2
a l a BJ
a
2
,
1
2
N
x < 0
Fat Tails and (Anti)fragility - N N Taleb| 87
Figure 1: Simulation, first. The distribution of the utility of changes of wealth, when the changes in wealth follow a power law with tail exponent =2 (5 million Monte Carlo
simulations).
Distribution of V(x)
Distribution of x
-20 -10 10 20
0.05
0.10
0.15
0.20
0.25
0.30
0.35
Figure 2: The graph in Figure 1 derived analytically
Fragility: as defined in the Taleb-Douady (2012) sense, on which later, i.e. tail sensitivity below K, v(x) is less “fragile” than x.
Tail of x
Tail of v(x)
-18 -16 -14 -12 -10 -8 -6
0.005
0.010
0.015
0.020
Figure 3: Left tail.
88 | Fat Tails and (Anti)fragility - N N Taleb
v(x) has thinner tails than x ñ more robust.
ASYMPTOTIC TAIL More technically the asymptotic tail for V(x) becomes
a
a
(i.e, for x and -x large, the exceedance probability for V, P
>x
~ K x
-
a
a , with K a constant, or
z HxL ~ K x
-
a
a
-1
We can see that V(x) can easily have finite variance when x has an infinite one. The dampening of the tail has an increasingly consequential
effect for lower values of a.
Case 2: Compare to the Monotone Concave of Classical Utility
Unlike the convex-concave shape in Kahneman Tversky, classical utility is monotone concave. This leads to plenty of absurdities, but the
worst is the effect on the distribution of utility.
Granted one (K-T) deals with changes in wealth, the second is a function of wealth.
Take the standard concave utility function g(x)= 1- e
-a x
. With a=1
-2 -1 1 2 3
x
-6
-5
-4
-3
-2
-1
1
gHxL
Plot of 1- e
-a x
The distribution of v(x) will be
vHxL = -

-
Hm+logH1-xLL
2
2 s
2
2 p sHx - 1L
-10 -8 -6 -4 -2 2
x
0.1
0.2
0.3
0.4
0.5
0.6
vHxL
With such a distribution of utility it would be absurd to do anything.
10.4 The effect of convexity on the distribution of f(x)
Note the following property.
Fat Tails and (Anti)fragility - N N Taleb| 89
Distributions that are skewed have their mean dependent on the variance (when it exists), or on the scale. In other words, more uncertainty
raises the expectation.
Demonstration 1:TK
Outcome
Probability
Low Uncertainty
High Uncertainty
Example: the Lognormal Distribution has a term
s
2
2
in its mean, linear to variance.
Example: the Exponential Distribution 1 - ‰
-x l
x ¥ 0 has the mean a concave function of the variance, that is,
1
l
, the square root of its
variance.
Example: the Pareto Distribution L
a
x
-1-a
a x ¥ L , a>2 has the mean a - 2 a " Standard Deviation,
a
a-2
L
a-1
(verify)
10.5 The Mistake of Using Regular Estimation Methods When the Payoff is Convex
Volume 2 - NONLINEARITIES
This Segment,
Part II corresponds to topics
covered in Antifragile
90 | Fat Tails and (Anti)fragility - N N Taleb
The previous chapters dealt mostly with probability rather than payoff. The next section deals with fragility and nonlinear effects.
Fat Tails and (Anti)fragility - N N Taleb| 91
11
Definition, Mapping, and Detection of (Anti)Fragility
Inserting Taleb
Douady 2013
92 | Fat Tails and (Anti)fragility - N N Taleb
Fat Tails and (Anti)Fragility




What is Fragility?

In short, fragility is related to how a system suffers from the variability of its environment beyond a certain preset threshold (when threshold is K, it is
called K-fragility), while antifragility refers to when it benefits from this variability —in a similar way to “vega” of an option or a nonlinear payoff,
that is, its sensitivity to volatility or some similar measure of scale of a distribution.
Simply, a coffee cup on a table suffers more from large deviations than from the cumulative effect of some shocks—conditional on being unbroken, it
has to suffer more from “tail” events than regular ones around the center of the distribution, the “at the money” category. This is the case of elements
of nature that have survived: conditional on being in existence, then the class of events around the mean should matter considerably less than tail
events, particularly when the probabilities decline faster than the inverse of the harm, which is the case of all used monomodal probability
distributions. Further, what has exposure to tail events suffers from uncertainty; typically, when systems – a building, a bridge, a nuclear plant, an
airplane, or a bank balance sheet– are made robust to a certain level of variability and stress but may fail or collapse if this level is exceeded, then they
are particularly fragile to uncertainty about the distribution of the stressor, hence to model error, as this uncertainty increases the probability of
dipping below the robustness level, bringing a higher probability of collapse. In the opposite case, the natural selection of an evolutionary process is
particularly antifragile, indeed, a more volatile environment increases the survival rate of robust species and eliminates those whose superiority over
other species is highly dependent on environmental parameters.
Figure 1 show the “tail vega” sensitivity of an object calculated discretely at two different lower absolute mean deviations. We use for the purpose of
fragility and antifragility, in place of measures in L
2
such as standard deviations, which restrict the choice of probability distributions, the broader
measure of absolute deviation, cut into two parts: lower and upper semi-deviation above the distribution center !.

Figure 1- A definition of fragility as left tail-vega sensitivity; the figure shows the effect of the perturbation of the lower semi-deviation s

on the tail integral
! of (x – ") below K, with " a entering constant. Our detection of fragility does not require the specification of f the probability distribution.
This article aims at providing a proper mathematical definition of fragility, robustness, and antifragility and examining how these apply to different
cases where this notion is applicable.
Intrinsic and Inherited Fragility: Our definition of fragility is two-fold. First, of concern is the intrinsic fragility, the shape of the probability
distribution of a variable and its sensitivity to s
-
, a parameter controlling the left side of its own distribution. But we do not often directly observe the
statistical distribution of objects, and, if we did, it would be difficult to measure their tail-vega sensitivity. Nor do we need to specify such
distribution: we can gauge the response of a given object to the volatility of an external stressor that affects it. For instance, an option is usually
analyzed with respect to the scale of the distribution of the “underlying” security, not its own; the fragility of a coffee cup is determined as a response
to a given source of randomness or stress; that of a house with respect of, among other sources, the distribution of earthquakes. This fragility coming
from the effect of the underlying is called inherited fragility. The transfer function, which we present next, allows us to assess the effect, increase or
decrease in fragility, coming from changes in the underlying source of stress.
Transfer Function: A nonlinear exposure to a certain source of randomness maps into tail-vega sensitivity (hence fragility). We prove that
Inherited Fragility ! Concavity in exposure on the left side of the distribution
and build H, a transfer function giving an exact mapping of tail vega sensitivity to the second derivative of a function. The transfer function will allow
us to probe parts of the distribution and generate a fragility-detection heuristic covering both physical fragility and model error.

K
Prob Density
Ξ￿K, s ￿ ￿

￿s ￿ ￿

￿ ￿ ￿￿

K ￿
x ￿ ￿￿ f
Λ￿s
_ ￿￿
s ￿ ￿

￿x￿ ￿x
Ξ￿K, s ￿ ￿

￿ ￿ ￿￿

K ￿
x ￿ ￿￿ f
Λ￿s
_ ￿

￿x￿ ￿x
Fat Tails and (Anti)Fragility



FRAGILITY AS SEPARATE RISK FROM PSYCHOLOGICAL PREFERENCES
Avoidance of the Psychological: We start from the definition of fragility as tail vega sensitivity, and end up with nonlinearity as a necessary attribute
of the source of such fragility in the inherited case —a cause of the disease rather than the disease itself. However, there is a long literature by
economists and decision scientists embedding risk into psychological preferences —historically, risk has been described as derived from risk aversion
as a result of the structure of choices under uncertainty with a concavity of the muddled concept of “utility” of payoff, see Pratt (1964), Arrow (1965),
Rothchild and Stiglitz(1970,1971). But this “utility” business never led anywhere except the circularity, expressed by Machina and Rothschild (2008),
“risk is what risk-averters hate.” Indeed limiting risk to aversion to concavity of choices is a quite unhappy result —the utility curve cannot be
possibly monotone concave, but rather, like everything in nature necessarily bounded on both sides, the left and the right, convex-concave and, as
Kahneman and Tversky (1979) have debunked, both path dependent and mixed in its nonlinearity.
Beyond Jensen’s Inequality: Furthermore, the economics and decision-theory literature reposes on the effect of Jensen’s inequality, an analysis
which requires monotone convex or concave transformations —in fact limited to the expectation operator. The world is unfortunately more
complicated in its nonlinearities. Thanks to the transfer function, which focuses on the tails, we can accommodate situations where the source is not
merely convex, but convex-concave and any other form of mixed nonlinearities common in exposures, which includes nonlinear dose-response in
biology. For instance, the application of the transfer function to the Kahneman-Tversky value function, convex in the negative domain and concave in
the positive one, shows that its decreases fragility in the left tail (hence more robustness) and reduces the effect of the right tail as well (also more
robustness), which allows to assert that we are psychologically “more robust” to changes in wealth than implied from the distribution of such wealth,
which happens to be extremely fat-tailed.
Accordingly, our approach relies on nonlinearity of exposure as detection of the vega-sensitivity, not as a definition of fragility. And nonlinearity in a
source of stress is necessarily associated with fragility. Clearly, a coffee cup, a house or a bridge don’t have psychological preferences, subjective
utility, etc. Yet they are concave in their reaction to harm: simply, taking z as a stress level and !(z) the harm function, it suffices to see that, with
n > 1,
!(nz) < n !( z) for all 0 < nz < Z*
where Z* is the level (not necessarily specified) at which the item is broken. Such inequality leads to !(z) having a negative second derivative at the
initial value z.
So if a coffee cup is less harmed by n times a stressor of intensity Z than once a stressor of nZ, then harm (as a negative function) needs to be concave
to stressors up to the point of breaking; such stricture is imposed by the structure of survival probabilities and the distribution of harmful events, and
has nothing to do with subjective utility or some other figments. Just as with a large stone hurting more than the equivalent weight in pebbles, if, for a
human, jumping one millimeter caused an exact linear fraction of the damage of, say, jumping to the ground from thirty feet, then the person would be
already dead from cumulative harm. Actually a simple computation shows that he would have expired within hours from touching objects or pacing
in his living room, given the multitude of such stressors and their total effect. The fragility that comes from linearity is immediately visible, so we rule
it out because the object would be already broken and the person already dead. The relative frequency of ordinary events compared to extreme events
is the determinant. In the financial markets, there are at least ten thousand times more events of 0.1% deviations than events of 10%. There are close
to 8,000 micro-earthquakes daily on planet earth, that is, those below 2 on the Richter scale —about 3 million a year. These are totally harmless, and,
with 3 million per year, you would need them to be so. But shocks of intensity 6 and higher on the scale make the newspapers. Accordingly, we are
necessarily immune to the cumulative effect of small deviations, or shocks of very small magnitude, which implies that these affect us
disproportionally less (that is, nonlinearly less) than larger ones.
Model error is not necessarily mean preserving. s
-
, the lower absolute semi-deviation does not just express changes in overall dispersion in the
distribution, such as for instance the “scaling” case, but also changes in the mean, i.e. when the upper semi-deviation from ! to infinity is invariant, or
even decline in a compensatory manner to make the overall mean absolute deviation unchanged. This would be the case when we shift the distribution
instead of rescaling it. Thus the same vega-sensitivity can also express sensitivity to a stressor (dose increase) in medicine or other fields in its effect on
either tail. Thus s

(!) will allow us to express the sensitivity to the “disorder cluster” (Taleb, 2012): i) uncertainty, ii) variability, iii) imperfect,
incomplete knowledge, iv) chance, v) chaos, vi) volatility, vii) disorder, viii) entropy, ix) time, x) the unknown, xi) randomness, xii) turmoil, xiii)
stressor, xiv) error, xv) dispersion of outcomes.

DETECTION HEURISTIC
Finally, thanks to the transfer function, this paper proposes a risk heuristic that "works" in detecting fragility even if we use the wrong model/pricing
method/probability distribution. The main idea is that a wrong ruler will not measure the height of a child; but it can certainly tell us if he is
growing. Since risks in the tails map to nonlinearities (concavity of exposure), second order effects reveal fragility, particularly in the tails where
they map to large tail exposures, as revealed through perturbation analysis. More generally every nonlinear function will produce some kind of
positive or negative exposures to volatility for some parts of the distribution.
Fat Tails and (Anti)Fragility




Figure 2- Disproportionate effect of tail events on nonlinear exposures, illustrating the necessity of nonlinearity of the harm function and showing how we
can extrapolate outside the model to probe unseen fragility. It also shows the disproportionate effect of the linear in reverse: For small variations, the linear
sensitivity exceeds the nonlinear, hence making operators aware of the risks.
Fragility and Model Error: As we saw this definition of fragility extends to model error, as some models produce negative sensitivity to uncertainty,
in addition to effects and biases under variability. So, beyond physical fragility, the same approach measures model fragility, based on the difference
between a point estimate and stochastic value (i.e., full distribution). Increasing the variability (say, variance) of the estimated value (but not the
mean), may lead to one-sided effect on the model —just as an increase of volatility causes porcelain cups to break. Hence sensitivity to the
volatility of such value, the “vega” of the model with respect to such value is no different from the vega of other payoffs. For instance, the misuse of
thin-tailed distributions (say Gaussian) appears immediately through perturbation of the standard deviation, no longer used as point estimate, but
as a distribution with its own variance. For instance, it can be shown how fat-tailed (e.g. power-law tailed) probability distributions can be
expressed by simple nested perturbation and mixing of Gaussian ones. Such a representation pinpoints the fragility of a wrong probability model and
its consequences in terms of underestimation of risks, stress tests and similar matters.
Antifragility: It is not quite the mirror image of fragility, as it implies positive vega above some threshold in the positive tail of the distribution and
absence of fragility in the left tail, which leads to a distribution that is skewed right.

Fragility and Transfer Theorems
Table 1- Introduces the Exhaustive Taxonomy of all Possible Payoffs y=f(x)

Type Condition Left Tail
(loss domain)
Right Tail
(gains domain)
Nonlinear
Payoff Function
y = f(x)
"derivative",
where x is a
random
variable
Derivatives
Equivalent (Taleb,
1997)
Effect of fatailedness of f(x)
compared to primitive x
Type 1 Fragile (type 1) Fat
(regular or
absorbing
barrier)
Fat Mixed concave
left, convex right
(fence)
Long up -vega,
short down -
vega
More fragility if
absorbing barrier,
neutral otherwise
Type 2 Fragile (type 2) Fat Thin Concave Short vega More fragility
Type 3 Robust Thin Thin Mixed convex left,
concave right
(digital, sigmoid)
Short up -
vega,
long down - vega
No effect
Type 4 Antifragile Thin Fat
(thicker than left)
Convex Long vega More antifragility
Fat Tails and (Anti)Fragility





The central Table 1 introduces the exhaustive map of possible outcomes, with 4 mutually exclusive categories of payoffs.
Our steps in the rest of the paper are as follows:
a. We provide a mathematical definition of fragility, robustness and antifragility.
b. We present the problem of measuring tail risks and show the presence of severe biases attending the estimation of small probability
and its nonlinearity (convexity) to parametric (and other) perturbations.
c. We express the concept of model fragility in terms of left tail exposure, and show correspondence to the concavity of the payoff from a
random variable.
d. Finally, we present our simple heuristic to detect the possibility of both fragility and model error across a broad range of probabilistic
estimations.

Conceptually, fragility resides in the fact that a small – or at least reasonable – uncertainty on the macro-parameter of a distribution may have
dramatic consequences on the result of a given stress test, or on some measure that depends on the left tail of the distribution, such as an out-of-the-
money option. This hypersensitivity of what we like to call an “out of the money put price” to the macro-parameter, which is some measure of the
volatility of the distribution of the underlying source of randomness.
Formally, fragility is defined as the sensitivity of the left-tail shortfall (non-conditioned by probability) below a certain threshold K to the overall left
semi-deviation of the distribution.

Examples
a. Example: a porcelain coffee cup subjected to random daily stressors from use.
b. Example: tail distribution in the function of the arrival time of an aircraft.
c. Example: hidden risks of famine to a population subjected to monoculture —or, more generally, fragilizing errors in the application of
Ricardo’s comparative advantage without taking into account second order effects.
d. Example: hidden tail exposures to budget deficits’ nonlinearities to unemployment.
e. Example: hidden tail exposure from dependence on a source of energy, etc. (“squeezability argument”).\
TAIL VEGA SENSITIVITY
We construct a measure of “vega” in the tails of the distribution that depends on the variations of s, the semi-deviation below a certain level !,
chosen in the L
1
norm in order to insure its existence under “fat tailed” distributions with finite first semi-moment. In fact s would exist as a measure
even in the case of infinite moments to the right side of !.
Let X be a random variable, the distribution of which is one among a one-parameter family of pdf f
!
, ! ! I " !. We consider a fixed reference value
# and, from this reference, the left-semi-absolute deviation:
( ) ( ) ( ) s x f x dx
!
!
"
#
#$
= "#
%

We assume that ! $ s

(!) is continuous, strictly increasing and spans the whole range !! = [0, +%), so that we may use the left-semi-absolute
deviation s

as a parameter by considering the inverse function !(s) : !! ! I, defined by s

(!(s)) = s for s ! !!.
This condition is for instance satisfied if, for any given x < #, the probability is a continuous and increasing function of !. Indeed, denoting
( ) ( ) Pr ( )
x
f
F x X x f t dt
!
! !
"#
= < =
$
, an integration by part yields:
( ) ( ) s F x dx
!
!
"
#
#$
=
%

This is the case when ! is a scaling parameter, i.e. X ~ # + !(X
1
– #), indeed one has in this case
1
( )
x
F x F
!
!
" # $ %
= #+
& '
( )
,
( ) ( )
F x
x f x
!
!
! !
" #$
=
"
and s

(!) = !s

(1).
It is also the case when ! is a shifting parameter, i.e. X ~ X
0
– !, indeed, in this case
0
( ) ( ) F x F x
!
! = + and ( )
s
F
!
!
"
#
= $
#
.
Fat Tails and (Anti)Fragility



For K < ! and s "!!, let:
( )
( , ) ( ) ( )
K
s
K s x f x dx
!
"
#
#
#$
= %#
&

In particular, !(!, s

) = s

. We assume, in a first step, that the function !(K,s

) is differentiable on (–#, !] $ !!. The K-left-tail-vega sensitivity of X
at stress level K < ! and deviation level s

> 0 for the pdf f
"
is:

V( X, f
!
, K, s
"
) =
#$
#s
"
(K, s
"
) = (% " x)
# f
!
#!
(x) dx
"&
K
'
(
)
*
+
,
-
ds
"
d!
(
)
*
+
,
-
"1

As the in many practical instances where threshold effects are involved, it may occur that ! does not depend smoothly on s

. We therefore also define
a finite difference version of the vega-sensitivity as follows:

V( X, f
!
, K, s
"
, #s) =
1
2#s
$(K, s
"
+ #s) "$(K, s
"
" #s)
( )
= (%" x)
f
!(s
"
+#s)
(x) " f
!(s
"
"#s)
(x)
2#s
dx
"&
K
'

Hence omitting the input %s implicitly assumes that %s & 0.
Note that
( ) ( , ) E Pr
f f
K s X X K X K
! !
"
#
= # $ < % <
& '
. It can be decomposed into two parts:
( , ( )) ( ) ( ) ( )
( ) ( ) ( )
K
K s K F K P K
P K K x f x dx
! !
! !
" !
#
#$
= %# +
= #
&

Where the first part (! – K)F
'
(K) is proportional to the probability of the variable being below the stress level K and the second part P
"
(K) is the
expectation of the amount by which X is below K (counting 0 when it is not). Making a parallel with financial options, while s

(") is a “put at-the-
money”, !(K,s

) is the sum of a put struck at K and a digital put also struck at K with amount ! – K; it can equivalently be seen as a put struck at !
with a down-and-in European barrier at K.
Letting " = "(s

) and integrating by part yields
( , ( )) ( ) ( ) ( ) ( )
K
K
K s K F K F x dx F x dx
! ! !
" !
#
$
$% $%
= #$ + =
& &

Where
( ) ( ) ( ) min( , ) min ( ), ( )
K
F x F x K F x F K
! ! ! !
= = , so that
( )
( , , , ) ( , )
( )
K
F
x dx
V X f K s K s
F
s
x dx
!
!
!
"
!
!
#
$%
$ $
#
$%
&
&
&
= =
&
&
&
'
'

For finite differences
( )
,
1
( , , , , ) ( )
2
K
s
V X f K s s F x dx
s
! !
"
#
$
#%
$ = $
$
&

Where "
s
+
and "
s

are such that ( )
s
s s s !
"
+ "
= + # , ( )
s
s s s !
"
" "
= "# and
,
( ) ( ) ( )
s s
K K K
s
F x F x F x
!
! !
+ "
#
# = " .

Fat Tails and (Anti)Fragility




Figure 3- The different curves of F
!
(") and F’
!
(") showing the difference in sensitivity to changes in at different levels of K.

MATHEMATICAL EXPRESSION OF FRAGILITY
In essence, fragility is the sensitivity of a given risk measure to an error in the estimation of the (possibly one-sided) deviation parameter of a
distribution, especially due to the fact that the risk measure involves parts of the distribution – tails – that are away from the portion used for
estimation. The risk measure then assumes certain extrapolation rules that have first order consequences. These consequences are even more
amplified when the risk measure applies to a variable that is derived from that used for estimation, when the relation between the two variables is
strongly nonlinear, as is often the case.
Definition of Fragility: The Intrinsic Case
The local fragility of a random variable X
!
depending on parameter !, at stress level K and semi-deviation level s

(!) with pdf f
!
is its K-left-tailed
semi-vega sensitivity V(X, f
!
, K, s

).
The finite-difference fragility of X
!
at stress level K and semi-deviation level s

(!) ± !s with pdf f
!
is its K-left-tailed finite-difference semi-vega
sensitivity V(X, f
!
, K, s

, !s).
In this definition, the fragility relies in the unsaid assumptions made when extrapolating the distribution of X
!
from areas used to estimate the semi-
absolute deviation s

(!), around ", to areas around K on which the risk measure # depends.
Definition of Fragility: The Inherited Case
We here consider the particular case where a random variable Y = $(X) depends on another source of risk X, itself subject to a parameter !. Let us
keep the above notations for X, while we denote by g
!
the pdf of Y, "
Y
= $(") and u

(!) the left-semi-deviation of Y. Given a “strike” level L = $(K),
let us define, as in the case of X :
( , ( )) ( ) ( )
K
Y
L u y g y dy
!
" !
#
#$
= % #
&

The inherited fragility of Y with respect to X at stress level L = $(K) and left-semi-deviation level s

(!) of X is the partial derivative:

1
( , , , ( )) ( , ( )) ( ) ( )
K
X Y
g ds
V Y g L s L u y y dy
s d
!
!
"
! !
! !
#
#
# #
#$
% & ' ' % &
= = ( #
) * ) *
' '
+ ,+ ,
-

Note that the stress level and the pdf are defined for the variable Y, but the parameter which is used for differentiation is the left-semi-absolute
deviation of X, s

(!). Indeed, in this process, one first measures the distribution of X and its left-semi-absolute deviation, then the function $ is
applied, using some mathematical model of Y with respect to X and the risk measure % is estimated. If an error is made when measuring s

(!), its
impact on the risk measure of Y is amplified by the ratio given by the “inherited fragility”.
K #
F

K

F
!
K

F

!

F

F

K
#
Fat Tails and (Anti)Fragility



Once again, one may use finite differences and define the finite-difference inherited fragility of Y with respect to X, by replacing, in the above
equation, differentiation by finite differences between values !
+
and !

, where s

(!
+
) = s

+ !s and s

(!

) = s

– !s.
Implications of a Nonlinear Change of Variable on the Intrinsic Fragility
We here study the case of a random variable Y = "(X), the pdf g
!
of which also depends on parameter !, related to a variable X by the nonlinear
function ". We are now interested in comparing their intrinsic fragilities. We shall say, for instance, that Y is more fragile at the stress level L and
left-semi-deviation level u

(!) than the random variable X, at stress level K and left-semi-deviation level s

(!) if the L-left-tailed semi-vega sensitivity
of Y
!
is higher than the K-left-tailed semi-vega sensitivity of X
!
:
V(Y, g
!
, L, u

) > V(X, f
!
, K, s

)
One may use finite differences to compare the fragility of two random variables: V(Y, g
!
, L, u

, !u) > V(X, f
!
, K, s

, !s). In this case, finite variations
must be comparable in size, namely !u/u

= !s/s

.
Let us assume, to start with, that " is differentiable, strictly increasing and scaled so that "
Y
= "(") = ". We also assume that, for any given x < ",
( ) 0
F
x
!
!
"
>
"
. In this case, as observed above, ! ! s

(!) is also increasing.
Let us denote ( ) Pr ( )
g
G y Y y
!
!
= < . We have:

( ) ( ) Pr ( ( )) Pr ( ) ( )
g f
G x Y x X x F x
! !
! !
" " = < = < =
Hence, if #(L, u

) denotes the equivalent of $(K, s

) with variable (Y, g
!
) instead of (X, f
!
), then we have:

!(L,u
"
(#)) = G
#
L
( y) dy
"$
%
&
= F
#
K
(x)
d'
dx
(x) dx
"$
%
&

Because " is increasing and min("(x),"(K)) = "(min(x,K)). In particular
( ) ( , ( )) ( ) ( )
d
u u F x x dx
dx
!
"
! # !
$
% %
%&
= $ =
'

The L-left-tail-vega sensitivity of Y is therefore:
( ) ( )
( , , , ( ))
( ) ( )
K
F d
x x dx
dx
V Y g L u
F d
x x dx
dx
!
!
!
"
!
!
"
!
#
$%
$
#
$%
&
&
=
&
&
'
'

For finite variations:
,
1
( , , , ( ), ) ( ) ( )
2
K
u
d
V Y g L u u F x x dx
u dx
! !
"
!
#
$
%
$&
% = %
%
'

Where
u
!
"
+
and
u
!
"
"
are such that ( )
u
u u u !
"
+ "
= + # , ( )
u
u u u !
"
" "
= "# and
,
( ) ( ) ( )
u u
K K K
u
F x F x F x
!
! !
+ "
" "
#
= " .
Next, Theorem 1 proves how a concave transformation "(x) of a random variable x produces fragility.
THEOREM 1 (FRAGILITY TRANSFER THEOREM)
Let, with the above notations, " : ! # ! be a twice differentiable function such that "(") = " and for any x < ", ( ) 0
d
x
dx
!
> . The random variable
Y = "(X) is more fragile at level L = "(K) and pdf g
!
than X at level K and pdf f
!
if, and only if, one has:

2
2
( ) ( ) 0
K
d
H x x dx
dx
!
"
#
$%
<
&

Fat Tails and (Anti)Fragility



Where
( ) ( ) ( ) ( ) ( )
K K
K
P P P P
H x x x
! ! ! !
!
! ! ! !
" " " "
= # $ #
" " " "

and where ( ) ( )
x
P x F t dt
! !
"#
=
$
is the price of the “put option” on X
!
with “strike” x and ( ) ( )
x
K K
P x F t dt
! !
"#
=
$
is that of a “put option” with
“strike” x and “European down-and-in barrier” at K.
H can be seen as a transfer function, expressed as the difference between two ratios. For a given level x of the random variable on the left hand side of
!, the second one is the ratio of the vega of a put struck at x normalized by that of a put “at the money” (i.e. struck at !), while the first one is the
same ratio, but where puts struck at x and ! are “European down-and-in options” with triggering barrier at the level K.
Proof
Let ( )
X
F
I x dx
!
!
!
"
#$
%
=
%
&
, ( )
K
K
X
F
I x dx
!
!
!
"
#$
%
=
%
&
, ( ) ( )
Y
F d
I x x dx
dx
!
!
"
!
#
$%
&
=
&
'
and ( ) ( )
K
L
Y
F d
I x x dx
dx
!
!
"
!
#
$%
&
=
&
'
. One
has ( , , , ( ))
K
X X
V X f K s I I
! !
!
!
"
= and ( , , , ( ))
L
Y Y
V Y g L u I I
! !
!
!
"
= hence:
( , , , ( )) ( , , , ( ))
L K K L
Y X X Y Y
K
Y X Y X X
I I I I I
V Y g L u V X f K s
I I I I I
! ! ! ! !
! ! ! ! !
! !
! !
" "
# $
" = " = " % &
% &
' (

Therefore, because the four integrals are positive, ( , , , ( )) ( , , , ( )) V Y g L u V X f K s
! !
! !
" "
" has the same sign as
L K
Y X Y X
I I I I
! ! ! !
" . On the other hand, we have ( )
X
P
I
!
!
!
"
= #
"
, ( )
K
K
X
P
I
!
!
!
"
= #
"
and

2
2
2
2
( ) ( ) ( ) ( ) ( ) ( )
( ) ( ) ( ) ( ) ( ) ( )
Y
K K K
L
Y
F P P d d d
I x x dx x x dx
dx dx dx
F P P d d d
I x x dx x x dx
dx dx dx
!
!
! ! !
! ! !
" " "
! ! !
" " "
! ! !
# #
$% $%
# #
$% $%
& & &
= = # # $
& & &
& & &
= = # # $
& & &
' '
' '

An elementary calculation yields:

1 1
2 2
2 2
2
2
( ) ( ) ( ) ( )
( )
L
K K
Y Y
K
X X
K
I I
P P P P d d
x dx x dx
I I dx dx
d
H x dx
dx
! !
! !
! ! ! !
!
" "
! ! ! !
"
# #
$ $
#% #%
$
#%
& ' & ' ( ( ( (
# = # $ + $
) * ) *
( ( ( (
+ , + ,
= #
- -
-

!
Let us now examine the properties of the function ( )
K
H x
!
. For x ! K, we have ( ) ( ) 0
K
P P
x x
! !
! !
" "
= >
" "
(the positivity is a consequence of that of
"F
!
/"!), therefore ( )
K
H x
!
has the same sign as ( ) ( )
K
P P
! !
! !
" "
# $ #
" "
. As this is a strict inequality, it extends to an interval on the right hand side of
K, say (–#, K" ] with K < K" < !.
But on the other hand:
( ) ( ) ( ) ( ) ( )
K
K
P P F F
x dx K K
! ! ! !
! ! ! !
" # # # #
" $ " = $ "$
# # # #
%

For K negative enough, ( )
F
K
!
!
"
"
is smaller than its average value over the interval [K, !], hence ( ) ( ) 0
K
P P
! !
! !
" "
# $ # >
" "
.
We have proven the following theorem.
Fat Tails and (Anti)Fragility



THEOREM 2 ( FRAGILITY EXACERBATION THEOREM)
With the above notations, there exists a threshold !
!
< " such that, if K ! !
!
then ( ) 0
K
H x
!
> for x # (–$, "
!
] with K < "
!
< ". As a consequence,
if the change of variable # is concave on (–$, "
!
] and linear on ["
!
, "], then Y is more fragile at L = #(K) than X at K.
One can prove that, for a monomodal distribution, !
!
< "
!
< " (see discussion below), so whatever the stress level K below the threshold !
!
, it
suffices that the change of variable # be concave on the interval (–$, !
!
] and linear on [!
!
, "] for Y to become more fragile at L than X at K. In
practice, as long as the change of variable is concave around the stress level K and has limited convexity/concavity away from K, the fragility of Y is
greater than that of X.
Figure 2 shows the shape of ( )
K
H x
!
in the case of a Gaussian distribution where ! is a simple scaling parameter (! is the standard deviation $) and
" = 0. We represented K = –2! while in this Gaussian case, !
!
= –1.585!.

Figure 4- The Transfer function H for different portions of the distribution: its sign flips in the region slightly below %
DISCUSSION
Monomodal case
We say that the family of distributions (f
!
) is left-monomodal if there exists "
!
< " such that 0
f
!
!
"
#
"
on (–$, "
!
] and 0
f
!
!
"
#
"
on [µ
!
, "]. In this
case
P
!
!
"
"
is a convex function on the left half-line (–$, µ
!
], then concave after the inflexion point µ
!
. For K ! µ
!
, the function
K
P
!
!
"
"
coincides with
P
!
!
"
"
on (–$, K], then is a linear extension, following the tangent to the graph of
P
!
!
"
"
in K (see graph below). The value of ( )
K
P
!
!
"
#
"
corresponds
to the intersection point of this tangent with the vertical axis. It increases with K, from 0 when K % –$ to a value above ( )
P
!
!
"
#
"
when K = µ
!
. The
threshold !
!
corresponds to the unique value of K such that ( ) ( )
K
P P
! !
! !
" "
# = #
" "
. When K < !
!
then ( ) ( ) ( )
P P
G x x
! !
!
! !
" "
= #
" "
and
( ) ( ) ( )
K K
K
P P
G x x
! !
!
! !
" "
= #
" "
are functions such that ( ) ( ) 1
K
G G
! !
" = " = and which are proportional for x ! K, the latter being linear on
[K, "]. On the other hand, if K < !
!
then ( ) ( )
K
P P
! !
! !
" "
# < #
" "
and ( ) ( )
K
G K G K
! !
< , which implies that ( ) ( )
K
G x G x
! !
< for x ! K. An
elementary convexity analysis shows that, in this case, the equation ( ) ( )
K
G x G x
! !
= has a unique solution "
!
with µ
!
< "
!
< ". The “transfer”
function ( )
K
H x
!
is positive for x < "
!
, in particular when x ! µ
!
and negative for "
!
< x < ".

K
&

&

&
H

K

Fat Tails and (Anti)Fragility




Figure 5- The distribution G
!
and the various derivatives of the unconditional shortfalls
Scaling Parameter
We assume here that ! is a scaling parameter, i.e. X
!
= " + !(X
1
– "). In this case, as we saw above, we have
1
1
( )
x
f x f
!
! !
" # $ %
= #+
& '
( )
,
1
( )
x
F x F
!
!
" # $ %
= #+
& '
( )
,
1
( )
x
P x P
!
!
!
" # $ %
= #+
& '
( )

and s

(!) = !s

(1). Hence

( )
1 1
2
( , ( )) ( )
1 1
( , ) ( , ) ( ) ( ) ( ) ( ) ( )
(1) ( )
K K
K s K F P
K s K P K K F K K f K
s s s
! ! !
" ! !
! !
" "
!
! !
#
#
# # #
#$ #$ % & % &
= $ # $ + + $ +
' ( ' (
) * ) *
+ +
= = + $ # + $ #
+ +

When we apply a nonlinear transformation ", the action of the parameter ! is no longer a scaling: when small negative values of X are multiplied by a
scalar !, so are large negative values of X. The scaling ! applies to small negative values of the transformed variable Y with a coefficient (0)
d
dx
!
,
but large negative values are subject to a different coefficient ( )
d
K
dx
!
, which can potentially be very different.

Fragility Drift
Fragility is defined at as the sensitivity – i.e. the first partial derivative – of the tail estimate # with respect to the left semi-deviation s

. Let us now
define the fragility drift:
2
( , , , ) ( , )
K
V X f K s K s
K s
!
"
# #
#
$
% =
$ $

!"
#"
G

K

G

!P

/
!$
!P

K
/!$
"
#

K
H

K
> 0
H

K
< 0
Fat Tails and (Anti)Fragility



In practice, fragility always occurs as the result of fragility, indeed, by definition, we know that !(!, s

) = s

, hence V(X, f
"
, !, s

) = 1. The fragility
drift measures the speed at which fragility departs from its original value 1 when K departs from the center !.
Second-order Fragility
The second-order fragility is the second order derivative of the tail estimate ! with respect to the semi-absolute deviation s

:
( )
2
2
( , , , ) ( , )
s
V X f K s K s
s
!
"
#
# #
#
$
% =
$

As we shall see later, the second-order fragility drives the bias in the estimation of stress tests when the value of s

is subject to uncertainty, through
Jensen inequality.
Definitions of Robustness and Antifragility

Antifragility is not the si mpl e opposite of fragility, as we saw in Table 1. Measuring antifragility, on the one hand, consists of the flipside of
fragility on the right-hand side, but on the other hand requires a control on the robustness of the probability distribution on the left-hand side. From
that aspect, unlike fragility, antifragility cannot be summarized in one single figure but necessitates at least two of them.
When a random variable depends on another source of randomness: Y
"
= #(X
"
), we shall study the antifragility of Y
"
with respect to that of X
"
and to
the properties of the function #.
DEFINITION OF ROBUSTNESS
Let (X
"
) be a one-parameter family of random variables with pdf f
"
. Robustness is an upper control on the fragility of X, which resides on the left hand
side of the distribution.
We say that f
"
is b-robust beyond stress level K < ! if V(X
"
, f
"
, K’, s(")) " b for any K’ " K. In other words, the robustness of f
"
on the half-line (–
!, K] is
( ] ,
( , , , ( )) max ( , , , ( ))
K
K K
R X f K s V X f K s
! ! ! !
! !
" "
"#
$%
$ = , so that b-robustness simply means
( ] ,
( , , , ( ))
K
R X f K s b
! !
!
"
"#
$ .
We also define b-robustness over a given interval [K
1
, K
2
] by the same inequality being valid for any K’ # [K
1
, K
2
]. In this case we use
[ ]
1 2
1 2
,
( , , , ( )) max ( , , , ( ))
K K
K K K
R X f K s V X f K s
! ! ! !
! !
" "
# $ $
# = .
Note that the lower R, the tighter the control and the more robust the distribution f
"
.
Once again, the definition of b-robustness can be transposed, using finite differences V(X
"
, f
"
, K’, s

("), $s).
In practical situations, setting a material upper bound b to the fragility is particularly important: one need to be able to come with actual estimates of
the impact of the error on the estimate of the left-semi-deviation. However, when dealing with certain class of models, such as Gaussian, exponential
of stable distributions, we may be lead to consider asymptotic definitions of robustness, related to certain classes.
For instance, for a given decay exponent a > 0, assuming that f
"
(x) = O(e
ax
) when x –!, the a-exponential asymptotic robustness of X
"
below the
level K is:


R
exp
( X
!
, f
!
, K, s
"
(!), a) = max
# K $K
e
a(%" # K )
V( X
!
, f
!
, # K , s
"
(!))
( )

If one of the two quantities
( )
( )
a K
e f K
!
" #$
" or
( )
( , , , ( ))
a K
e V X f K s
! !
!
" #$ $
" is not bounded from above when K! –!, then R
exp
= +! and X
"
is
considered as not a-exponentially robust.
Similarly, for a given power $ > 0, and assuming that f
"
(x) = O(x
–$
) when x –!, the $-power asymptotic robustness of X
"
below the level K is:
( )
2
pow
( , , , ( ), ) max ( ) ( , , , ( ))
K K
R X f K s a K V X f K s
!
" " " "
" "
# # #
$%
$ $ = &#
Fat Tails and (Anti)Fragility



If one of the two quantities ( ) ( ) K f K
!
"
# # $% or
2
( ) ( , , , ( )) K V X f K s
!
" "
"
# #
$ $ %# is not bounded from above when K! –!, then R
pow
= +!
and X
!
is considered as not "-power robust. Note the exponent " – 2 used with the fragility, for homogeneity reasons, e.g. in the case of stable
distributions.
When a random variable Y
!
= #(X
!
) depends on another source of risk X
!
.
Definition 2a, Left-Robustness (monomodal distribution). A payoff y = #(x) is said (a,b)-robust below L = #(K) for a source of randomness
X with pdf f
!
assumed monomodal if, letting g
!
be the pdf of Y = #(X), one has, for any K! ! K and L! = #(K!) :


( ) ( )
, , , ( ) , , , ( )
X
V Y g L s aV X f K s b
! !
! !
" "
# # $ +
(4)
The quantity b i s of order deemed of “negligible utility” (subjectively), that is, does not exceed some tolerance level in relation with the context,
while a is a scaling parameter between variables X and Y.
Note that robustness is in effect impervious to changes of probability distributions. Also note that this measure robustness ignores first order
variations since owing to their higher frequency, these are detected (and remedied) very early on.
Example of Robustness (Barbells):
a. trial and error with bounded error and open payoff
b. for a "barbell portfolio" with allocation to numeraire securities up to 80% of portfolio, no perturbation below K set at 0.8 of valuation
will represent any difference in result, i.e. q = 0. The same for an insured house (assuming the risk of the insurance company is not a
source of variation), no perturbation for the value below K, equal to minus the insurance deductible, will result in significant changes.
c. a bet of amount B (limited liability) is robust, as it does not have any sensitivity to perturbations below 0.

DEFINITION OF ANTIFRAGILITY
The second condition of antifragility regards the right hand side of the distribution. Let us define the right-semi-deviation of X :
( ) ( ) ( ) s x f x dx
!
!
+"
+
#
= $#
%

And, for H > L > " :

!
+
(L, H, s
+
(")) = (x # $) f
"
(x) dx
L
H
%
W( X, f
"
, L, H, s
+
) =
&!
+
(L, H, s
+
)
&s
+
= (x # $)
& f
"
&"
(x) dx
L
H
%
'
(
)
*
+
,
(x # $)
& f
"
&"
(x) dx
$
+-
%
'
(
)
*
+
,
#1

When Y = #(X) is a variable depending on a source of noise X, we define:


W
X
(Y, g
!
,"(L),"(H), s
+
) = ( y #"($))
%g
!
%!
( y) dy
"( L)
"( H)
&
'
(
)
*
+
,
(x # $)
% f
!
%!
(x) dx
$
+-
&
'
(
)
*
+
,
#1


Definition 2b, Antifragility (monomodal distribution). A payoff y = #(x) is locally antifragile over the range [L, H] if

1. It is b-robust below " for some b > 0
2.
( ) ( )
, , ( ), ( ), ( ) , , , , ( )
X
W Y g L H s aW X f L H s
! !
" " ! !
+ +
#

where

( )
( )
u
a
s
!
!
+
+
=

The scaling constant a provides homogeneity in the case where the relation between X and y is linear. In particular, nonlinearity in the relation
between X and Y impacts robustness.
The second condition can be replaced with finite differences #u and #s, as long as #u/u = #s/s.
Fat Tails and (Anti)Fragility




REMARKS
Fragility is K-specific. We are only concerned with adverse events below a certain pre-specified level, the breaking point. Exposures A can be
more fragile than exposure B for K = 0, and much less fragile if K is, say, 4 mean deviations below 0. We may need to use finite !s to avoid
situations as we will see of vega-neutrality coupled with short left tail.
Effect of using the wrong distribution f: Comparing V(X, f, K, s

, !s) and the alternative distribution V(X, f*, K, s*, !s), where f* is the
“true” distribution, the measure of fragility provides an acceptable indication of the sensitivity of a given outcome – such as a risk measure – to model
error, provided no “paradoxical effects” perturb the situation. Such “paradoxical effects” are, for instance, a change in the direction in which certain
distribution percentiles react to model parameters, like s

. It is indeed possible that nonlinearity appears between the core part of the distribution and
the tails such that when s

increases, the left tail starts fattening – giving a large measured fragility – then steps back – implying that the real fragility
is lower than the measured one. The opposite may also happen, implying a dangerous under-estimate of the fragility. These nonlinear effects can stay
under control provided one makes some regularity assumptions on the actual distribution, as well as on the measured one. For instance, paradoxical
effects are typically avoided under at least one of the following three hypotheses:
a. The class of distributions in which both f and f* are picked are all monomodal, with monotonous dependence of percentiles with
respect to one another.
b. The difference between percentiles of f and f* has constant sign (i.e. f* is either always wider or always narrower than f at any given
percentile)
c. For any strike level K (in the range that matters), the fragility measure V monotonously depends on s

on the whole range where the
true value s* can be expected. This is in particular the case when partial derivatives !
k
V/!s
k
all have the same sign at measured s

up to
some order n, at which the partial derivative has that same constant sign over the whole range on which the true value s* can be expected.
This condition can be replaced by an assumption on finite differences approximating the higher order partial derivatives, where n is large
enough so that the interval [s

± n!s] covers the range of possible values of s*. Indeed, in this case, the finite difference estimate of
fragility uses evaluations of ! at points spanning this interval.
Unconditionality of the shortfall measure ! : Many, when presenting shortfall, deal with the conditional shortfall
( ) ( )
K K
x f x dx f x dx
!" !"
# #
;
while such measure might be useful in some circumstances, its sensitivity is not indicative of fragility in the sense used in this discussion. The
unconditional tail expectation
( )
K
xf x dx !
"#
=
$
is more indicative of exposure to fragility. It is also preferred to the raw probability of falling below
K, which is
( )
K
f x dx
!"
#
, as the latter does not include the consequences. For instance, two such measures
( )
K
f x dx
!"
#
and
( )
K
g x dx
!"
#
may be equal
over broad values of K; but the expectation
( )
K
x f x dx
!"
#
can be much more consequential than
( )
K
xg x dx
!"
#
as the cost of the break can be more
severe and we are interested in its “vega” equivalent.


Applications to Model Error
In the cases where Y depends on X, among other variables, often x is treated as non-stochastic, and the underestimation of the volatility of x maps
immediately into the underestimation of the left tail of Y under two conditions:
a- X is stochastic and its stochastic character is ignored (as if it had zero variance or mean deviation)
b- Y is concave with respect to X in the negative part of the distribution, below "
"Convexity Bias" or Jensen's Inequality Effect: Further, missing the stochasticity under the two conditions a) and b) , in the event of the concavity
applying above " leads to the negative convexity bias from the lowering effect on the expectation of the dependent variable Y.
CASE 1- APPLICATION TO DEFICITS
Example: A government estimates unemployment for the next three years as averaging 9%; it uses its econometric models to issue a
forecast balance B of 200 billion deficit in the local currency. But it misses (like almost everything in economics) that unemployment is a
stochastic variable. Employment over 3 years periods has fluctuated by 1% on average. We can calculate the effect of the error with the
following:
• Unemployment at 8% , Balance B(8%) = -75 bn (improvement of 125bn)
Fat Tails and (Anti)Fragility



• Unemployment at 9%, Balance B(9%)= -200 bn
• Unemployment at 10%, Balance B(10%)= --550 bn (worsening of 350bn)

The convexity bias from underestimation of the deficit is by -112.5bn, since 5 . 312
2
%) 10 ( %) 8 (
! =
+ B B


Further look at the probability distribution caused by the missed variable (assuming to simplify deficit is Gaussian with a Mean Deviation of
1% )



Figure 6 CONVEXITY EFFECTS ALLOW THE DETECTION OF BOTH MODEL BIAS AND FRAGILITY. Illustration of the example;
histogram from Monte Carlo simulation of government deficit as a left-tailed random variable simply as a result of randomizing
unemployment of which it is a convex function. The method of point estimate would assume a Dirac stick at -200, thus underestimating
both the expected deficit (-312) and the skewness (i.e., fragility) of it.


Adding Model Error and Metadistributions: Model error should be integrated in the distribution as a stochasticization of parameters. f and g
should subsume the distribution of all possible factors affecting the final outcome (including the metadistribution of each). The so-called
"perturbation" is not necessarily a change in the parameter so much as it is a means to verify whether f and g capture the full shape of the final
probability distribution.
Any situation with a bounded payoff function that organically truncates the left tail at K will be impervious to all perturbations affecting the
probability distribution below K.

For K = 0, the measure equates to mean negative semi-deviation (more potent than negative semi-variance or negative semi-standard deviation often
used in financial analyses).


MODEL ERROR AND SEMI-BIAS AS NONLINEARITY FROM MISSED STOCHASTICITY OF VARIABLES
Model error often comes from missing the existence of a random variable that is significant in determining the outcome (say option pricing without
credit risk). We cannot detect it using the heuristic presented in this paper but as mentioned earlier the error goes in the opposite direction as model
tend to be richer, not poorer, from overfitting. But we can detect the model error from missing the stochasticity of a variable or
underestimating its stochastic character (say option pricing with non-stochastic interest rates or ignoring that the “volatility” ! can vary).

Missing Effects: The study of model error is not to question whether a model is precise or not, whether or not it tracks reality; it is to ascertain the
first and second order effect from missing the variable, insuring that the errors from the model don’t have missing higher order terms that cause
severe unexpected (and unseen) biases in one direction because of convexity or concavity, in other words, whether or not the model error causes a
change in z.
Fat Tails and (Anti)Fragility





Model Bias, Second Order Effects, and Fragility

Having the right model (which is a very generous assumption), but being uncertain about the parameters will invariably lead to an increase in
model error in the presence of convexity and nonlinearities.
As a generalization of the deficit/employment example used in the previous section, say we are using a simple function:

f x !
( )
(5)

Where

!

is supposed to be the average expected rate, where we take ! as the distribution of " over its domain !
"


! = ! "(!) d!
#
!
$
(6)
The mere fact that ! is uncertain (since it is estimated) might lead to a bias if we perturb from the outside (of the integral), i.e. stochasticize the
parameter deemed fixed. Accordingly, the convexity bias is easily measured as the difference between a) f integrated across values of potential
! and b) f estimated for a single value of ! deemed to be its average. The convexity bias #
A
becomes:

!
A
! f x "
( )
! "
( )
d! dx
"
!
#
"
x
#
$ f (x ! " !
( )
d!
"
!
#
%
&
'
( "
x
#
)dx
(7)

And #
B
the missed fragility is assessed by comparing the two integrals below K, in order to capture the effect on the left tail:

!
B
(K) ! f x !
( )
! "
( )
d! dx
"
!
#
$%
K
#
$ f (x ! " !
( )
d!
"
!
#
&
'
(
) $%
K
#
)dx
(8)

Which can be approximated by an interpolated estimate obtained with two values of " separated from a mid point by "" a mean deviation of "
and estimating

( ) ( ) ( )
1
( ) | | ( | )
2
K K
B
K f x f x dx f x dx ! " " " " "
#$ #$
% + & + #& #
' '
(8)

We can probe #
B
by point estimates of f at a level of X # K

( ) ( ) ( )
1
( ) | | ( | )
2
B
X f X f X f X ! " " " " " # = + $ + %$ %
(9)
So that
( ) ( )
K
B B
K x dx ! !
"#
$ =
%

which leads us to the fragility heuristic. In particular, if we assume that #’
B
(X) has a constant sign for X # K, then #
B
(K) has the same sign.


The Fragility/Model Error Detection Heuristic (detecting $A and $B
when cogent)

Example 1 (Detecting Tail Risk Not Shown By Stress Test, $
B
). The famous firm Dexia went into financial distress a few days after
passing a stress test “with flying colors”.
If a bank issues a so-called “stress test” (something that has not proven very satisfactory), off a parameter (say stock market) at -15%. We
ask them to recompute at -10% and -20%. Should the exposure show negative asymmetry (worse at -20% than it improves at -10%), we deem
that their risk increases in the tails. There are certainly hidden tail exposures and a definite higher probability of blowup in addition to
exposure to model error.
Note that it is somewhat more effective to use our measure of shortfall in Definition, but the method here is effective enough to show
hidden risks, particularly at wider increases (try 25% and 30% and see if exposure shows increase). Most effective would be to use power-law
Fat Tails and (Anti)Fragility



distributions and perturb the tail exponent to see symmetry.

Example 2 (Detecting Tail Risk in Overoptimized System, !
B
). Raise airport traffic 10%, lower 10%, take average expected traveling time
from each, and check the asymmetry for nonlinearity. If asymmetry is significant, then declare the system as overoptimized. (Both !A and !B as
thus shown.

The same procedure uncovers both fragility and consequence of model error (potential harm from having wrong probability distribution, a thin-
tailed rather than a fat-tailed one). For traders (and see Gigerenzer’s discussions, i n Gigerenzer and Brighton (2009), Gigerenzer and
Goldstein(1996)) simple heuristics tools detecting the magnitude of second order effects can be more effective than more complicated and
harder to calibrate methods, particularly under multi-dimensionality. See also the intuition of fast and frugal in Derman and Wilmott (2009),
Haug and Taleb (2011).

The Heuristic applied to a Model:

1- First Step (first order). Take a valuation. Measure the sensitivity to all parameters p determining V over finite ranges !p. If materially
significant, check if stochasticity of parameter is taken into account by risk assessment. If not, then stop and declare the risk as grossly
mismeasured (no need for further risk assessment).
2-Second Step (second order). For all parameters p compute the ratio of first to second order effects at the initial range "p = estimated mean
deviation.
H !p ( ) "
µ'
µ
,
where
µ' !p ( ) "
1
2
f p +
1
2
!p
#
$
%
&
'
(
+ f p )
1
2
!p
#
$
%
&
'
(
#
$
%
&
'
(

2-Third Step. Note parameters for which H is significantly > or < 1.
3- Fourth Step: Keep widening "p to verify the stability of the second order effects.



The Heuristic applied to a stress test:

In place of the standard, one-point estimate stress test S1, we issue a "triple", S1, S2, S3, where S2 and S3 are S1 ± "p. Acceleration of losses
is indicative of fragility.



Remarks:

a. Simple heuristics have a robustness (in spite of a possible bias) compared to optimized and calibrated measures. Ironically, it is from
the multiplication of convexity biases and the potential errors from missing them that calibrated models that work in-sample
underperform heuristics out of sample (Gigerenzer and Brighton, 2009).
b. Heuristics allow to detection of the effect of the use of the wrong probability distribution without changing probability distribution (just
from the dependence on parameters).
c. The heuristic improves and detects flaws in all other commonly used measures of risk, such as CVaR, “expected shortfall”, stress-
testing, and similar methods have been proven to be completely ineffective (Taleb, 2009).
d. The heuristic does not require parameterization beyond varying !p.




Further Applications

In parallel works, applying the "simple heuristic" allows us to detect the following “hidden short options” problems by merely perturbating a
certain parameter p:
a. Size and negative stochastic economies of scale.
i. size and squeezability (nonlinearities of squeezes in costs per unit)
b. Specialization (Ricardo) and variants of globalization.
i. missing stochasticity of variables (price of wine).
ii. specialization and nature.
c. Portfolio optimization (Markowitz)
Fat Tails and (Anti)Fragility



d. Debt
e. Budget Deficits: convexity effects explain why uncertainty lengthens, doesn’t shorten expected deficits.
f. Iatrogenics (medical) or how some treatments are concave to benefits, convex to errors.
g. Disturbing natural systems




!
Fat Tails and (Anti)fragility - N N Taleb| 93
11.1 Fragility, As Linked to Nonlinearity
1 2 3 4 5 6 7
-700000
-600000
-500000
-400000
-300000
-200000
-100000
,
2 4 6 8 10 12 14
-2.5 µ10
6
-2.0 µ10
6
-1.5 µ10
6
-1.0 µ10
6
-500000
,
5 10 15 20
-1 µ10
7
-8 µ10
6
-6 µ10
6
-4 µ10
6
-2 µ10
6
Mean Dev 10 L 5 L 2.5 L L Nonlinear
1 -100000 -50000 -25000 -10000 -1000
2 -200000 -100000 -50000 -20000 -8000
5 -500000 -250000 -125000 -50000 -125000
10 -1000000 -500000 -250000 -100000 -1000000
15 -1500000 -750000 -375000 -150000 -3375000
20 -2000000 -1000000 -500000 -200000 -8000000
25 -2500000 -1250000 -625000 -250000 -15625000
94 | Fat Tails and (Anti)fragility - N N Taleb
Mean Dev 10 L 5 L 2.5 L L Nonlinear
1 -100000 -50000 -25000 -10000 -1000
2 -200000 -100000 -50000 -20000 -8000
5 -500000 -250000 -125000 -50000 -125000
10 -1000000 -500000 -250000 -100000 -1000000
15 -1500000 -750000 -375000 -150000 -3375000
20 -2000000 -1000000 -500000 -200000 -8000000
25 -2500000 -1250000 -625000 -250000 -15625000
11.2 Nonlinearity of Harm Function and Probability Distribution
Generalized LogNormal
x~ Lognormal (m,s)
Ax
k
~F@A, x, k, m, sD =

-
log K
x
A
O
1
k -m
2
2 s
2
2 p k s x
As the response harm becomes more convex.
12.5 13.0 13.5 14.0 14.5 15.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
FH1, x, 1, 0, 1L
FI
1
2
, x,
3
2
, 0, 1M
FI
1
4
, x, 2, 0, 1M
FI
1
4
, x, 3, 0, 1M
Survival Functions
12.5 13.0 13.5 14.0 14.5 15.0
0.2
0.4
0.6
0.8
1.0
SH1, 15 - k, 1, 0, 1L
SI
1
2
, 15 - K,
3
2
, 0, 1M
SI
1
4
, 15 - K, 2, 0, 1M
SI
1
4
, 15 - K, 3, 0, 1M
Fat Tails and (Anti)fragility - N N Taleb| 95
11.3 Coffee Cup and Barrier-Style Payoff
One period model
Broken glass !
0 5 10 15 20
Dose
0.5
1.0
1.5
2.0
2.5
3.0
Response
Clearly there is path dependence.
5 10 15 20
Dose 0
1
2
3
4
Response
This part will be expanded
96 | Fat Tails and (Anti)fragility - N N Taleb
Appendix II (Very Technical):
WHERE MOST ECONOMIC MODELS
FRAGILIZE AND BLOW PEOPLE UP
When I said “technical” in the main text, I may have been fibbing. Here I am not.
The Markowitz incoherence: Assume that someone tells you that the probability of
an event is exactly zero. You ask him where he got this from. “Baal told me” is the
answer. In such case, the person is coherent, but would be deemed unrealistic by
non- Baalists. But if on the other hand, the person tells you “I estimated it to be
zero,” we have a problem. The person is both unrealistic and inconsistent. Some-
thing estimated needs to have an estimation error. So probability cannot be zero if it
is estimated, its lower bound is linked to the estimation error; the higher the estima-
tion error, the higher the probability, up to a point. As with Laplace’s argument of
total ignorance, an infinite estimation error pushes the probability toward ½.
We will return to the implication of the mistake; take for now that anything es-
timating a parameter and then putting it into an equation is different from estimat-
ing the equation across parameters (same story as the health of the grandmother, the
average temperature, here “estimated” is irrelevant, what we need is average health
across temperatures). And Markowitz showed his incoherence by starting his “semi-
nal” paper with “Assume you know E and V” (that is, the expectation and the vari-
ance). At the end of the paper he accepts that they need to be estimated, and what is
worse, with a combination of statistical techniques and the “judgment of practical
men.” Well, if these parameters need to be estimated, with an error, then the deriva-
tions need to be written differently and, of course, we would have no paper— and no
Markowitz paper, no blowups, no modern finance, no fragilistas teaching junk to
students. . . . Economic models are extremely fragile to assumptions, in the sense
that a slight alteration in these assumptions can, as we will see, lead to extremely
consequential differences in the results. And, to make matters worse, many of these
models are “back-fit” to assumptions, in the sense that the hypotheses are selected
to make the math work, which makes them ultrafragile and ultrafragilizing.
Simple example: Government deficits.
We use the following deficit example owing to the way calculations by govern-
ments and government agencies currently miss convexity terms (and have a hard
time accepting it). Really, they don’t take them into account. The example illustrates:
Tale_9781400067824_3p_all_r1.indd 447 10/10/12 8:44 AM
This is from the Appendix of ANTIFRAGILE. Technical, but there
is much, much more technical in FAT TAILS AND AF (textbook).
448 APPENDI X I I
(a) missing the stochastic character of a variable known to affect the model
but deemed deterministic (and fixed), and
(b) F, the function of such variable, is convex or concave with respect to the
variable.
Say a government estimates unemployment for the next three years as averaging
9 percent; it uses its econometric models to issue a forecast balance B of a two-
hundred-billion deficit in the local currency. But it misses (like almost everything in
economics) that unemployment is a stochastic variable. Employment over a three-
year period has fluctuated by 1 percent on average. We can calculate the effect of the
error with the following:
Unemployment at 8%, Balance B(8%) = −75 bn (improvement of 125 bn)
Unemployment at 9%, Balance B(9%)= −200 bn
Unemployment at 10%, Balance B(10%)= −550 bn (worsening of 350 bn)
The concavity bias, or negative convexity bias, from underestimation of the deficit is
−112.5 bn, since ½ {B(8%) + B(10%)} = −312 bn, not −200 bn. This is the exact case
of the inverse philosopher’s stone.
8 9 10
Unemployment
1200
1000
800
600
400
200
Deficit bil
–1000 –800 –600 –400 –200 0
Deficit
Missed fragility (unseen left tail)
Missed cpnvexity effect
FI GURE 37. Nonlinear transformations allow the detection of both model convexity bias
and fragility. Illustration of the example: histogram from Monte Carlo simulation of
government deficit as a left- tailed random variable simply as a result of randomizing
unemployment, of which it is a concave function. The method of point estimate
would assume a Dirac stick at −200, thus underestimating both the expected deficit
(−312) and the tail fragility of it. (From Taleb and Douady, 2012).
Application: Ricardian Model and Left Tail— The Price of Wine Happens to Vary
For almost two hundred years, we’ve been talking about an idea by the economist
David Ricardo called “comparative advantage.” In short, it says that a country
should have a certain policy based on its comparative advantage in wine or clothes.
Say a country is good at both wine and clothes, better than its neighbors with whom
it can trade freely. Then the visible optimal strategy would be to specialize in either
wine or clothes, whichever fits the best and minimizes opportunity costs. Everyone
would then be happy. The analogy by the economist Paul Samuelson is that if some-
one happens to be the best doctor in town and, at the same time, the best secretary,
Deficit (bil)
Unemployment
Deficit
Missed fragility
(unseen left tail)
Missed convexity effect
Tale_9781400067824_3p_all_r1.indd 448 10/10/12 8:44 AM
APPENDI X I I 449
then it would be preferable to be the higher- earning doctor— as it would minimize
opportunity losses— and let someone else be the secretary and buy secretarial ser-
vices from him.
I agree that there are benefits in some form of specialization, but not from the
models used to prove it. The flaw with such reasoning is as follows. True, it would
be inconceivable for a doctor to become a part- time secretary just because he is good
at it. But, at the same time, we can safely assume that being a doctor insures some
professional stability: People will not cease to get sick and there is a higher social
status associated with the profession than that of secretary, making the profession
more desirable. But assume now that in a two- country world, a country special-
ized in wine, hoping to sell its specialty in the market to the other country, and
that suddenly the price of wine drops precipitously. Some change in taste caused
the price to change. Ricardo’s analysis assumes that both the market price of wine
and the costs of production remain constant, and there is no “second order” part of
the story.
TABLE 11 • RICARDO’S ORIGINAL EXAMPLE
(COSTS OF PRODUCTION PER UNIT)
CLOTH WINE
Britain 100 110
Portugal 90 80
The logic: The table above shows the cost of production, normalized to a selling
price of one unit each, that is, assuming that these trade at equal price (1 unit of
cloth for 1 unit of wine). What looks like the paradox is as follows: that Portugal
produces cloth cheaper than Britain, but should buy cloth from there instead, using
the gains from the sales of wine. In the absence of transaction and transportation
costs, it is efficient for Britain to produce just cloth, and Portugal to only produce
wine.
The idea has always attracted economists because of its paradoxical and coun-
terintuitive aspect. For instance, in an article “Why Intellectuals Don’t Understand
Comparative Advantage” (Krugman, 1998), Paul Krugman, who fails to under-
stand the concept himself, as this essay and his technical work show him to be
completely innocent of tail events and risk management, makes fun of other intel-
lectuals such as S. J. Gould who understand tail events albeit intuitively rather than
analytically. (Clearly one cannot talk about returns and gains without discounting
these benefits by the offsetting risks.) The article shows Krugman falling into the
critical and dangerous mistake of confusing function of average and average of
function.
Now consider the price of wine and clothes variable— which Ricardo did not
assume— with the numbers above the unbiased average long- term value. Further as-
sume that they follow a fat- tailed distribution. Or consider that their costs of pro-
duction vary according to a fat- tailed distribution.
If the price of wine in the international markets rises by, say, 40 percent, then
there are clear benefits. But should the price drop by an equal percentage, −40 per-
cent, then massive harm would ensue, in magnitude larger than the benefits should
there be an equal rise. There are concavities to the exposure— severe concavities.
Tale_9781400067824_3p_all_r1.indd 449 10/10/12 8:44 AM
450 APPENDI X I I
And clearly, should the price drop by 90 percent, the effect would be disastrous.
Just imagine what would happen to your household should you get an instant and
unpredicted 40 percent pay cut. Indeed, we have had problems in history with coun-
tries specializing in some goods, commodities, and crops that happen to be not just
volatile, but extremely volatile. And disaster does not necessarily come from varia-
tion in price, but problems in production: suddenly, you can’t produce the crop be-
cause of a germ, bad weather, or some other hindrance.
A bad crop, such as the one that caused the Irish potato famine in the decade
around 1850, caused the death of a million and the emigration of a million more
(Ireland’s entire population at the time of this writing is only about six million, if one
includes the northern part). It is very hard to reconvert resources— unlike the case in
the doctor- typist story, countries don’t have the ability to change. Indeed, monocul-
ture (focus on a single crop) has turned out to be lethal in history— one bad crop
leads to devastating famines.
The other part missed in the doctor- secretary analogy is that countries don’t
have family and friends. A doctor has a support community, a circle of friends, a
collective that takes care of him, a father- in- law to borrow from in the event that he
needs to reconvert into some other profession, a state above him to help. Countries
don’t. Further, a doctor has savings; countries tend to be borrowers.
So here again we have fragility to second- order effects.
Probability Matching: The idea of comparative advantage has an analog in probabil-
ity: if you sample from an urn (with replacement) and get a black ball 60 percent of
the time, and a white one the remaining 40 percent, the optimal strategy, according
to textbooks, is to bet 100 percent of the time on black. The strategy of betting 60
percent of the time on black and 40 percent on white is called “probability match-
ing” and considered to be an error in the decision-science literature (which I remind
the reader is what was used by Triffat in Chapter 10). People’s instinct to engage in
probability matching appears to be sound, not a mistake. In nature, probabilities are
unstable (or unknown), and probability matching is similar to redundancy, as a buf-
fer. So if the probabilities change, in other words if there is another layer of random-
ness, then the optimal strategy is probability matching.
How specialization works: The reader should not interpret what I am saying to
mean that specialization is not a good thing— only that one should establish such
specialization after addressing fragility and second- order effects. Now I do believe
that Ricardo is ultimately right, but not from the models shown. Organically, sys-
tems without top- down controls would specialize progressively, slowly, and over a
long time, through trial and error, get the right amount of specialization— not
through some bureaucrat using a model. To repeat, systems make small errors, de-
sign makes large ones.
So the imposition of Ricardo’s insight- turned- model by some social planner
would lead to a blowup; letting tinkering work slowly would lead to efficiency— true
efficiency. The role of policy makers should be to, via negativa style, allow the emer-
gence of specialization by preventing what hinders the process.
A More General Methodology to Spot Model Error
Model second- order effects and fragility: Assume we have the right model (which is
a very generous assumption) but are uncertain about the parameters. As a general-
ization of the deficit/employment example used in the previous section, say we are
using f, a simple function: , where ᾱ is supposed to be the average expected
Tale_9781400067824_3p_all_r1.indd 450 10/10/12 8:44 AM
Instantaneous ad-
justment misses an
Ito term.
APPENDI X I I 451
input variable, where we take φ as the distribution of α over its domain ,
= ( ) d
.
The philosopher’s stone: The mere fact that α is uncertain (since it is estimated)
might lead to a bias if we perturbate from the inside (of the integral), i.e., stochasti-
cize the parameter deemed fixed. Accordingly, the convexity bias is easily measured
as the difference between (a) the function f integrated across values of potential α,
and (b) f estimated for a single value of α deemed to be its average. The convexity
bias (philosopher’s stone) ω
A
becomes:
*
The central equation: Fragility is a partial philosopher’s stone below K, hence ω
B
the
missed fragility is assessed by comparing the two integrals below K in order to cap-
ture the effect on the left tail:
( )
which can be approximated by an interpolated estimate obtained with two values of
α separated from a midpoint by ∆α its mean deviation of α and estimating
Note that antifragility ω
C
is integrating from K to infinity. We can probe ω
B
by
point estimates of f at a level of X ≤ K
so that
which leads us to the fragility detection heuristic (Taleb, Canetti, et al., 2012). In
particular, if we assume that ω´
B
(X) has a constant sign for X ≤ K, then ω
B
(K) has the
same sign. The detection heuristic is a perturbation in the tails to probe fragility, by
checking the function ω´
B
(X) at any level X.
* The difference between the two sides of Jensen’s inequality corresponds to a notion
in information theory, the Bregman divergence. Briys, Magdalou, and Nock, 2012.
Tale_9781400067824_3p_all_r1.indd 451 10/10/12 8:44 AM
452 APPENDI X I I
TABLE 12
MODEL SOURCE OF FRAGILITY REMEDY
Portfolio theory, Assuming knowledge of 1/n (spread as large a
mean- variance, etc. the parameters, not number of exposures as
integrating models across manageable), barbells,
parameters, relying on progressive and organic
(very unstable) correlations. construction, etc.
Assumes ω
A
(bias) and
ω
B
(fragility) = 0
Ricardian Missing layer of randomness Natural systems find
comparative in the price of wine may their own allocation
advantage imply total reversal of through tinkering
allocation. Assumes ω
A
(bias)
and ω
B
(fragility) = 0
Samuelson Concentration of sources of Distributed randomness
optimization randomness under concavity
of loss function. Assumes
ω
A
(bias) and ω
B
(fragility) = 0
Arrow- Debreu lattice Ludic fallacy: assumes Use of metaprobabilities
state- space exhaustive knowledge of changes entire model
outcomes and knowledge of implications
probabilities. Assumes
ω
A
(bias), ω
B
(fragility), and
ω
C
(antifragility) = 0
Dividend cash flow Missing stochasticity causing Heuristics
models convexity effects. Mostly
considers ω
C
(antifragility) =0
Portfolio fallacies: Note one fallacy promoted by Markowitz users: portfolio theory
entices people to diversify, hence it is better than nothing. Wrong, you finance fools:
it pushes them to optimize, hence overallocate. It does not drive people to take less
risk based on diversification, but causes them to take more open positions owing to
perception of offsetting statistical properties— making them vulnerable to model
error, and especially vulnerable to the underestimation of tail events. To see how,
consider two investors facing a choice of allocation across three items: cash, and se-
curities A and B. The investor who does not know the statistical properties of A and
B and knows he doesn’t know will allocate, say, the portion he does not want to lose
to cash, the rest into A and B— according to whatever heuristic has been in traditional
use. The investor who thinks he knows the statistical properties, with parameters σ
A
,
σ
B
, ρ
A
,
B
, will allocate ω
A
, ω
B
in a way to put the total risk at some target level (let us
ignore the expected return for this). The lower his perception of the correlation ρ
A
,
B
,
the worse his exposure to model error. Assuming he thinks that the correlation ρ
A
,
B
,
is 0, he will be overallocated by
1

3
for extreme events. But if the poor investor has the
illusion that the correlation is −1, he will be maximally overallocated to his A and B
Tale_9781400067824_3p_all_r1.indd 452 10/10/12 8:44 AM
APPENDI X I I 453
investments. If the investor uses leverage, we end up with the story of Long-Term
Capital Management, which turned out to be fooled by the parameters. (In real life,
unlike in economic papers, things tend to change; for Baal’s sake, they change!) We
can repeat the idea for each parameter σ and see how lower perception of this σ leads
to overallocation.
I noticed as a trader— and obsessed over the idea— that correlations were never
the same in different measurements. Unstable would be a mild word for them:
0.8 over a long period becomes −0.2 over another long period. A pure sucker game.
At times of stress, correlations experience even more abrupt changes— without
any reliable regularity, in spite of attempts to model “stress correlations.” Taleb
(1997) deals with the effects of stochastic correlations: One is only safe shorting a
correlation at 1, and buying it at −1— which seems to correspond to what the 1/n
heuristic does.
Kelly Criterion vs. Markowitz: In order to implement a full Markowitz- style optimi-
zation, one needs to know the entire joint probability distribution of all assets for the
entire future, plus the exact utility function for wealth at all future times. And with-
out errors! (We saw that estimation errors make the system explode.) Kelly’s method,
developed around the same period, requires no joint distribution or utility function.
In practice one needs the ratio of expected profit to worst- case return— dynamically
adjusted to avoid ruin. In the case of barbell transformations, the worst case is guar-
anteed. And model error is much, much milder under Kelly criterion. Thorp (1971,
1998), Haigh (2000).
The formidable Aaron Brown holds that Kelly’s ideas were rejected by economists—
in spite of the practical appeal— because of their love of general theories for all asset
prices.
Note that bounded trial and error is compatible with the Kelly criterion when
one has an idea of the potential return— even when one is ignorant of the returns, if
losses are bounded, the payoff will be robust and the method should outperform
that of Fragilista Markowitz.
Corporate Finance: In short, corporate finance seems to be based on point projec-
tions, not distributional projections; thus if one perturbates cash flow projections,
say, in the Gordon valuation model, replacing the fixed— and known— growth (and
other parameters) by continuously varying jumps (particularly under fat-tailed dis-
tributions), companies deemed “expensive,” or those with high growth, but low
earnings, could markedly increase in expected value, something the market prices
heuristically but without explicit reason.
Conclusion and summary: Something the economics establishment has been missing
is that having the right model (which is a very generous assumption), but being un-
certain about the parameters will invariably lead to an increase in fragility in the
presence of convexity and nonlinearities.
Fuhgetaboud Small Probabilities
Now the meat, beyond economics, the more general problem with probability and
its mismeasurement.
Tale_9781400067824_3p_all_r1.indd 453 10/10/12 8:44 AM
454 APPENDI X I I
HOW FAT TAILS (EXTREMISTAN) COME FROM
NONLINEAR RESPONSES TO MODEL PARAMETERS
Rare events have a certain property— missed so far at the time of this writing. We
deal with them using a model, a mathematical contraption that takes input param-
eters and outputs the probability. The more parameter uncertainty there is in a
model designed to compute probabilities, the more small probabilities tend to be
underestimated. Simply, small probabilities are convex to errors of computation, as
an airplane ride is concave to errors and disturbances (remember, it gets longer, not
shorter). The more sources of disturbance one forgets to take into account, the lon-
ger the airplane ride compared to the naive estimation.
We all know that to compute probability using a standard Normal statistical distri-
bution, one needs a parameter called standard deviation— or something similar that
characterizes the scale or dispersion of outcomes. But uncertainty about such standard
deviation has the effect of making the small probabilities rise. For instance, for a devia-
tion that is called “three sigma,” events that should take place no more than one in 740
observations, the probability rises by 60% if one moves the standard deviation up by
5%, and drops by 40% if we move the standard deviation down by 5%. So if your
error is on average a tiny 5%, the underestimation from a naive model is about 20%.
Great asymmetry, but nothing yet. It gets worse as one looks for more deviations, the
“six sigma” ones (alas, chronically frequent in economics): a rise of five times more.
The rarer the event (i.e., the higher the “sigma”), the worse the effect from small uncer-
tainty about what to put in the equation. With events such as ten sigma, the difference
is more than a billion times. We can use the argument to show how smaller and smaller
probabilities require more precision in computation. The smaller the probability, the
more a small, very small rounding in the computation makes the asymmetry massively
insignificant. For tiny, very small probabilities, you need near- infinite precision in the
parameters; the slightest uncertainty there causes mayhem. They are very convex to
perturbations. This in a way is the argument I’ve used to show that small probabilities
are incomputable, even if one has the right model— which we of course don’t.
The same argument relates to deriving probabilities nonparametrically, from
past frequencies. If the probability gets close to 1/ sample size, the error explodes.
This of course explains the error of Fukushima. Similar to Fannie Mae. To sum-
marize, small probabilities increase in an accelerated manner as one changes the
parameter that enters their computation.
FI GURE 38. The probability is
convex to standard deviation in
a Gaussian model. The plot
shows the STD effect on P>x,
and compares P>6 with an STD
of 1.5 compared to P>6 assum-
ing a linear combination of 1.2
and 1.8 (here a(1)=1/5).
The worrisome fact is that a perturbation in σ extends well into the tail of the
distribution in a convex way; the risks of a portfolio that is sensitive to the tails
P > x
STD
Tale_9781400067824_3p_all_r1.indd 454 10/10/12 8:44 AM
APPENDI X I I 455
would explode. That is, we are still here in the Gaussian world! Such explosive un-
certainty isn’t the result of natural fat tails in the distribution, merely small impreci-
sion about a future parameter. It is just epistemic! So those who use these models
while admitting parameters uncertainty are necessarily committing a severe inconsis-
tency.
*
Of course, uncertainty explodes even more when we replicate conditions of the
non- Gaussian real world upon perturbating tail exponents. Even with a powerlaw
distribution, the results are severe, particularly under variations of the tail exponent
as these have massive consequences. Really, fat tails mean incomputability of tail
events, little else.
COMPOUNDING UNCERTAINTY (FUKUSHIMA)
Using the earlier statement that estimation implies error, let us extend the logic: er-
rors have errors; these in turn have errors. Taking into account the effect makes all
small probabilities rise regardless of model— even in the Gaussian— to the point of
reaching fat tails and powerlaw effects (even the so- called infinite variance) when
higher orders of uncertainty are large. Even taking a Gaussian with σ the standard
deviation having a proportional error a(1); a(1) has an error rate a(2), etc. Now it
depends on the higher order error rate a(n) related to a(n−1); if these are in constant
proportion, then we converge to a very thick- tailed distribution. If proportional er-
rors decline, we still have fat tails. In all cases mere error is not a good thing for small
probability.
The sad part is that getting people to accept that every measure has an error has
been nearly impossible— the event in Fukushima held to happen once per million
years would turn into one per 30 if one percolates the different layers of uncertainty
in the adequate manner.
* This further shows the defects of the notion of “Knightian uncertainty,” since all tails
are uncertain under the slightest perturbation and their effect is severe in fat-tailed
domains, that is, economic life.
Tale_9781400067824_3p_all_r1.indd 455 10/10/12 8:44 AM
Tale_9781400067824_3p_all_r1.indd 456 10/10/12 8:44 AM

Sign up to vote on this title
UsefulNot useful