Abstract—We present consequential mistakes in uses of correlation in social science research:
1) use of subsampling, since (absolute) correlation is severely subadditive;
2) misinterpretation of the informational value of correlation owing to nonlinearities;
3) misapplication of correlation and PCA/factor analysis when the relationship between variables is nonlinear;
4) how to embody sampling error of the input variable;
5) intransitivity of correlation;
6) other similar problems, mostly focused on psychometrics (IQ testing is infected by the "dead man bias");
7) how fat tails cause R² to be fake.
We compare to the more robust entropy approaches.

CONTENTS

I. Correlation is subadditive (in absolute value)
   I-A. Intuition via one-dimensional representations
   I-B. Mutual Information is Additive
   I-C. Example of Quadrants
IV. Transitivity of Correlations
V. Nonlinearities and other defects in "IQ" studies and psychometrics in general
   V-A. Using a detector of disease as a detector of health
        V-A1. Sigmoidal functions
   V-B. ReLu type functions (ramp payoffs)
   V-C. Dead man bias
VII. Fat Tailed Residuals in Linear Regression Models
Appendix
   1. Mean Deviation vs Standard Deviation
   2. Relative Standard Deviation Error
   3. Relative Mean Deviation Error
   4. Finalmente, the Asymptotic Relative Efficiency For a Gaussian
   A. Effect of Fatter Tails on the "efficiency" of STD vs MD
References
I. CORRELATION IS SUBADDITIVE (IN ABSOLUTE VALUE)

[Fig. 1: scatter panels; measured values 0.525618 and 0.181141.]
Fig. 2. Dividing the space into smaller and smaller squares yields lower correlations. [Panels: Q = ((0,1),(0,1)), corr → 0.13096; Q = ((1,2),(0,1)), corr → 0.112353.]

Here v.(Q) is the conditional variance, and Cov.,.(Q) the conditional covariance:

v_x(Q) = \frac{1}{\pi(Q)} \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \mathbb{1}_{x,y\in Q}\, f(x,y)\,(x-\mu_x(Q))^2\, dy\, dx

\mathrm{Cov}_{x,y}(Q) = \frac{1}{\pi(Q)} \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \mathbb{1}_{x,y\in Q}\, f(x,y)\,(x-\mu_x(Q))(y-\mu_y(Q))\, dy\, dx
Finally, the local correlation:

\mathrm{Corr}(Q) = \frac{\mathrm{Cov}_{x,y}(Q)}{\sqrt{v_x(Q)\, v_y(Q)}}

Theorem 1
For all Q in \mathbb{R}^2, we have

|\mathrm{Corr}(Q)| \le |\rho|

Proof. Appendix.
Fig. 3. Total correlation ρ and the corresponding ones in the 4 quadrants (legend: Corr(1,3), Corr(2,4)).

Rule 1: Subadditivity
Correlation cannot be used for nonrandom subsamples.

Let X, Y be normalized random variables, Gaussian distributed with correlation ρ and pdf f(x, y). If we sample randomly from the distribution and break it up further into random sub-samples, then, under adequate conditions, the expected correlation of each sub-sample should, obviously, converge to ρ.

Fig. 4. As we sample in blocks in the tails separated by 1 standard deviation on the x axis, we observe a drop in standard deviation as the Gaussian distribution concentrates in the left side of the partition as we go further in the tails. Power laws have an opposite behavior. [Panels: p(x) over {1 to 2 σ}, {3 to 4 σ}, {11 to 12 σ}, {20 to 21 σ}.]

However, should we break up the data into nonrandom subsamples, say quadrants, octants, etc. along the x and y axes, as in Fig. 1, and measure the correlation in each square, we end up with considerably lower (probability) weighted sums.
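A minimal Monte Carlo sketch of the effect (ours, not from the paper; ρ = 0.75 is an assumed value):

```python
# Sketch of Rule 1: quadrant (nonrandom) subsamples show substantially
# lower correlation than the full sample, consistent with |Corr(Q)| <= |rho|.
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.75, 1_000_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

print(f"full sample: {np.corrcoef(x, y)[0, 1]:+.3f}")
for label, mask in [("x>0, y>0", (x > 0) & (y > 0)),
                    ("x>0, y<0", (x > 0) & (y < 0)),
                    ("x<0, y>0", (x < 0) & (y > 0)),
                    ("x<0, y<0", (x < 0) & (y < 0))]:
    print(f"quadrant {label}: {np.corrcoef(x[mask], y[mask])[0, 1]:+.3f}")
```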
Consider the additivity of measures on subintervals.
Let p(x) be the density of the normalized Gaussian, and a, b ∈ \mathbb{R}, a < b:

v(a,b) = \frac{1}{P(a,b)} \int_a^b p(x)\,(x-\mu(a,b))^2\, dx,   (1)

where

P(a,b) = \int_{-\infty}^{\infty} p(x)\, \mathbb{1}_{a<x<b}\, dx = \frac{1}{2}\left(\mathrm{erf}\!\left(\frac{b}{\sqrt{2}}\right) - \mathrm{erf}\!\left(\frac{a}{\sqrt{2}}\right)\right),   (2)

\mu(a,b) = \frac{\sqrt{\frac{2}{\pi}}\left(e^{-\frac{a^2}{2}} - e^{-\frac{b^2}{2}}\right)}{\mathrm{erf}\!\left(\frac{b}{\sqrt{2}}\right) - \mathrm{erf}\!\left(\frac{a}{\sqrt{2}}\right)}.   (3)
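A numerical sketch of (1)-(3) (ours), matching the pattern in Fig. 4, with the in-block variance dropping as the block moves into the tail:

```python
# In-window mean mu(a,b) and variance v(a,b) of a standard Gaussian
# truncated to [a,b], via scipy and via the closed form of Eq. (3).
import numpy as np
from scipy.special import erf
from scipy.stats import truncnorm

for a, b in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    tn = truncnorm(a, b)   # standard normal restricted to [a, b]
    mu = (np.sqrt(2 / np.pi) * (np.exp(-a**2 / 2) - np.exp(-b**2 / 2))
          / (erf(b / np.sqrt(2)) - erf(a / np.sqrt(2))))   # Eq. (3)
    print(f"[{a},{b}]: mu={tn.mean():.4f} (Eq. 3: {mu:.4f}), v={tn.var():.5f}")
```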
C. Example of Quadrants

Assume we are, as before, in a situation where X and Y follow a standardized bivariate Gaussian distribution with correlation ρ, and let us compare to the results shown in Fig. 1. Breaking I_{X,Y} into 4 quadrants:

\frac{I_{x<0,y\ge 0}}{P_{x<0,y\ge 0}} = -\frac{2\sqrt{1-\rho^2}\,\rho + \log\!\left(1-\rho^2\right)\cos^{-1}(\rho)}{4\pi}   (5)

\frac{I_{x\ge 0,y\ge 0}}{P_{x\ge 0,y\ge 0}} = \frac{2\sqrt{1-\rho^2}\,\rho + \log\!\left(1-\rho^2\right)\left(\cos^{-1}(\rho) - \pi\right)}{4\pi}   (6)

\frac{I_{x\ge 0,y<0}}{P_{x\ge 0,y<0}} = \frac{-2\rho\sqrt{1-\rho^2} + i\log\!\left(1-\rho^2\right)\cosh^{-1}(\rho)}{4\pi}   (7)

\frac{I_{x<0,y<0}}{P_{x<0,y<0}} = \frac{4\rho\sqrt{1-\rho^2} - \log\!\left(1-\rho^2\right)\left(2\sin^{-1}(\rho) + \pi\right)}{8\pi}   (8)

We can see that the four quadrants contribute unequally.
Rule 3
Correlation should never be interpreted linearly without translation via some rescaling.

In [1] it has been shown that a great many econometricians, while knowing their statistical equations down pat, don't get the real practical implication, all in one direction: the fooled-by-randomness one. The authors have a version of the effect in [2], as professionals and graduate students failed to realize that they interpreted mean deviation as standard deviation, therefore underestimating volatility, especially under fat tails.

Which means that the expected value of X given Y is normally distributed:

E(X|Y) \sim \mathcal{N}\!\left(E[X] + \rho\sqrt{\frac{V(X)}{V(Y)}}\,(y - E[Y]),\; \left(1-\rho^2\right) V(X)\right)   (12)
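A quick numerical check of (12) (a sketch; all parameter values are assumptions):

```python
# Check of Eq. (12): E(X | Y = y) = E[X] + rho*sqrt(V(X)/V(Y))*(y - E[Y]).
import numpy as np

rng = np.random.default_rng(2)
rho, mx, my, sx, sy, n = 0.6, 1.0, -2.0, 2.0, 0.5, 4_000_000
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x = mx + sx * z1
y = my + sy * (rho * z1 + np.sqrt(1 - rho**2) * z2)

y0 = -1.8
band = np.abs(y - y0) < 0.01              # condition on Y ~ y0
print(x[band].mean())                      # empirical E(X | Y = y0)
print(mx + rho * (sx / sy) * (y0 - my))    # Eq. (12) mean: 1.48
```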
Fig. 6. One needs to translate ρ into information. See how ρ = .5 is much closer to 0 than to ρ = 1. There are considerable differences between .9 and .99. [Six scatter panels at increasing ρ.]
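The translation can use the bivariate-Gaussian closed form I(ρ) = −½ log(1 − ρ²), a known result; a minimal sketch (ours):

```python
# Gaussian mutual information I(rho) = -0.5*log(1 - rho^2): the mapping
# from rho to information is severely nonlinear (cf. Fig. 6).
import numpy as np

for rho in (0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"rho = {rho:4}: I = {-0.5 * np.log(1 - rho**2):.3f} nats")
```

The output makes the caption concrete: ρ = .5 carries about 0.14 nats, ρ = .9 about 0.83, and ρ = .99 about 1.96, so the step from .9 to .99 is larger in information terms than the entire step from 0 to .9.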
Fig. 7. Various rescaling methods, linearizing information and putting correlation in perspective. [Legend: Real Information; ϕ metric; 1 − (Y|X); Mutual Information (unscaled); Mutual Information (truncated .99).]

Fig. 9. Correlation for twice the mutual information. [Two panels: 0.627271, 0.79506.]

The numerator

\int_{K-\Delta}^{K+\Delta}\!\int_{K-\Delta}^{K+\Delta} \frac{e^{-\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)}}}{2\pi\sqrt{1-\rho^2}}\, dx\, dy

does not integrate in closed form, which necessitates numerical methods.
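For instance (a sketch with assumed ρ, K, ∆), using scipy's adaptive quadrature:

```python
# Bivariate Gaussian mass over the block [K-Delta, K+Delta]^2.
import numpy as np
from scipy.integrate import dblquad

rho, K, delta = 0.7, 2.0, 0.5

def pdf(y, x):
    z = (x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2))
    return np.exp(-z) / (2 * np.pi * np.sqrt(1 - rho**2))

mass, err = dblquad(pdf, K - delta, K + delta,
                    lambda x: K - delta, lambda x: K + delta)
print(mass, err)
```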
Fig. 8. Various rescaling methods, seen around [0, |1/2|]. [Axes: real information vs ρ.]

D. PCA with Mutual Information

Now one can perform information-based PCA maps, if the data is Gaussian, by rescaling and substitution.

Let g(µ, σ; x) be the PDF of the normal distribution with mean µ and variance σ², and f(·, ·) the joint distribution for a multivariate Gaussian:

\rho' = \frac{\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} u\, y\, f(u, y)\, g(x, \kappa; u)\, du\, dx\, dy}{\sqrt{\cdots}}
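The role of sampling error in the input variable can be sketched numerically (our illustration; κ echoes the scale in g(x, κ; u) above, and all numbers are assumed): observing U = X + κZ in place of X attenuates the measured correlation to ρ/√(1 + κ²).

```python
# Measurement error in the input attenuates correlation:
# corr(X + kappa*Z, Y) = rho / sqrt(1 + kappa^2).
import numpy as np

rng = np.random.default_rng(3)
rho, kappa, n = 0.8, 0.7, 1_000_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
u = x + kappa * rng.standard_normal(n)   # noisy observation of x

print(np.corrcoef(u, y)[0, 1])           # empirical
print(rho / np.sqrt(1 + kappa**2))       # theory: ~0.655
```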
IV. TRANSITIVITY OF CORRELATIONS

… have much going for them: the error below is pervasive. They make the following inference:
(i) There is a positive correlation between genetics and IQ scores.
(ii) There is a positive correlation between IQ scores and performance.
(iii) Hence there is a positive correlation between genetics and performance.

The problem is that (iii) doesn't flow from (i) and (ii): one can have the first two correlations positive and the third one negative.

Let us organize the correlations pairwise, where ρ12, ρ13, ρ23 are indexed by 1 for genes, 2 for scores, and 3 for performance. Let σ(·)² be the respective variances, so the covariance matrix is

\Sigma = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}

Let us apply Sylvester's criterion (a necessary and sufficient criterion to determine whether a Hermitian matrix is positive-semidefinite) and, using determinants, produce conditions for the positive-definiteness of Σ. The criterion states that a Hermitian matrix M is positive-semidefinite if and only if the leading principal minors are nonnegative.

The constraint on the first principal minor is obvious (Cauchy-Schwarz):

\begin{vmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{vmatrix} = \sigma_1^2\sigma_2^2 - \rho_{12}^2\sigma_1^2\sigma_2^2 \ge 0,

so −1 ≤ ρ12 ≤ 1. The third minor yields

\rho_{12}\rho_{23} - \sqrt{\left(\rho_{12}^2 - 1\right)\left(\rho_{23}^2 - 1\right)} \le \rho_{13} \le \sqrt{\left(\rho_{12}^2 - 1\right)\left(\rho_{23}^2 - 1\right)} + \rho_{12}\rho_{23}   (17)

Example: if we have ρ12 = 1/3 and ρ23 = 1/3, we get the following bound:

-\frac{7}{9} \le \rho_{13} \le 1,

so obviously (iii) is false, as the correlation can be negative.

Conditions for transitivity: we assume transitivity when we have the identity ρ13 = ρ12ρ23. Consider the situation where ρ12² + ρ23² = 1. From (17):

\rho_{12}\rho_{23} - \sqrt{\rho_{23}^2 - \rho_{23}^4} \le \rho_{13} \le \rho_{12}\rho_{23} + \sqrt{\rho_{23}^2 - \rho_{23}^4}   (18)

(Note that with ρ12, ρ23 > 0 the lower bound is then exactly 0, so the sign of ρ13 is guaranteed only once ρ12² + ρ23² reaches 1.) The inequality tightens on both sides as ρ12² + ρ23² becomes greater than 1.

Background: it is remarkable that people take transitivity for granted; even maestro Terry Tao wasn't aware of it.
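A concrete check (a sketch; ρ13 = −0.7 is an assumed value inside the bound above):

```python
# rho12, rho23 > 0 with rho13 < 0 is a perfectly admissible correlation
# matrix (rho13 = -0.7 lies inside [-7/9, 1] derived above).
import numpy as np

rho12, rho23, rho13 = 1 / 3, 1 / 3, -0.7
C = np.array([[1.0,   rho12, rho13],
              [rho12, 1.0,   rho23],
              [rho13, rho23, 1.0]])
print(np.linalg.eigvalsh(C))   # all eigenvalues >= 0: valid (PSD) matrix
```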
V. NONLINEARITIES AND OTHER DEFECTS IN "IQ" STUDIES AND PSYCHOMETRICS IN GENERAL

We discuss the effect of nonlinearity in general, but IQ studies and psychometrics are a treasure trove of defective use of statistical metrics, especially correlation. See "IQ is largely a pseudoscientific swindle": https://medium.com/incerto/iq-is-largely-a-pseudoscientific-swindle-f131c101ba39

Rule 4: Nonlinearity
One cannot use total correlation between X1 and X2 when the association between X1 and X2 depends in expectation on X1 or X2. Nor can one use [principal] component reduction for r.v.s X1, ..., Xn if for all i ≠ j the association between Xi and Xj depends in expectation on the level of either Xi or Xj. (The flaw infects the "g" in psychometry.)

A. Using a detector of disease as a detector of health

A metric to detect disease will masquerade as a detector of health if one uses (Pearson) correlation, because of the nonlinearity of disease. Let us consider disease anything +K STDs away. We look for the induced correlation of performance as a binary variable {0, 1} for IQ > K STDs, assuming everything is Gaussian. Let I_{x>K} be the Heaviside theta function. We are looking at ρ, the correlation between X and I_{x>K}:

\rho = \frac{E\big((x - E(x))\,(I_{x>K} - E(I_{x>K}))\big)}{\sqrt{E\big((x - E(x))^2\big)\, E\big((I_{x>K} - E(I_{x>K}))^2\big)}}   (19)
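Under the Gaussian assumption, (19) evaluates to φ(K)/√(Φ(K)(1 − Φ(K))), with φ and Φ the standard normal pdf and cdf, a standard result; a minimal check (ours):

```python
# Correlation between Gaussian X and the indicator 1_{X>K} of Eq. (19):
# it decays as K grows, even though the detector is a function of X.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.standard_normal(2_000_000)
for K in (0.0, 1.0, 2.0, 3.0):
    emp = np.corrcoef(x, (x > K).astype(float))[0, 1]
    exact = norm.pdf(K) / np.sqrt(norm.cdf(K) * norm.sf(K))
    print(f"K = {K}: empirical {emp:.3f}, exact {exact:.3f}")
```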
\rho_{-R,R} = \frac{\int_{-R}^{R}(x - \mu_x)\left(f(x) - \mu_{f(x)}\right) dx}{\sqrt{\int_{-R}^{R}(x - \mu_x)^2\, dx\, \int_{-R}^{R}\left(f(x) - \mu_{f(x)}\right)^2 dx}}   (22)

where \mu_x = \frac{1}{2R}\int_{-R}^{R} x\, dx and \mu_{f(x)} = \frac{1}{2R}\int_{-R}^{R} f(x)\, dx.

In the special symmetric case where R_1 = 0, we get ρ = \frac{2}{\sqrt{5}} ≈ .894.

We have the second derivative no longer flat; as we see in the graph, the function is not just nonconstant but nonlinear, and nonlinearities show in the second derivative. Writing D(x) = 4\sigma_{11}^2\sigma_{22}^2 f(x)^2 + \sigma_{11}^4 - 2\sigma_{11}^2\sigma_{22}^2 + \sigma_{22}^4,

\lambda'(x) = \frac{1}{2}\left(-\frac{4\sigma_{11}^2\sigma_{22}^2\left(f(x)f''(x) + f'(x)^2\right)}{\sqrt{D(x)}} + \frac{16\sigma_{11}^4\sigma_{22}^4\, f(x)^2 f'(x)^2}{D(x)^{3/2}}\right)

C. Dead man bias

QUIZ: You administer IQ tests to 10K people, then give them a "performance test" for anything, any task. 2000 of them are dead. Dead people score 0 on IQ and 0 on performance. The rest have the IQ uncorrelated to the performance. What is the spurious correlation IQ/performance?

Answer: roughly 37%.

The systematic bias comes from the fact that if you hit lucky (again, a large difference) …
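A sketch of the quiz (ours; the score scales are assumptions, chosen here as uniform on [0, 1] among the living, which happens to reproduce the quoted figure):

```python
# Dead-man bias: a common cluster at zero manufactures correlation
# between two variables that are independent among the living.
import numpy as np

rng = np.random.default_rng(5)
n, n_dead = 10_000, 2_000
alive = n - n_dead
iq = np.concatenate([np.zeros(n_dead), rng.uniform(0, 1, alive)])
perf = np.concatenate([np.zeros(n_dead), rng.uniform(0, 1, alive)])
print(np.corrcoef(iq, perf)[0, 1])   # ~0.37 despite zero true association
```

The exact number depends on the assumed score distributions; the spurious co-movement from the shared zeros does not.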
Ten years later (2011) it was found that 50% of neuroscience papers (peer-reviewed in "prestigious journals") that compared variables got it wrong:

"In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, …)" (Nieuwenhuis et al., "Erroneous analyses of interactions in neuroscience: a problem of significance," Nature Neuroscience, 14(9), 1105-1107.)

Fooled by Randomness was read by many professionals (to put it mildly); the mistake is still being made. Ten years from now, they will still be making the mistake.
VII. FAT TAILED RESIDUALS IN LINEAR REGRESSION MODELS

We mentioned in Chapter ?? that linear regression fails to inform under fat tails. Yet it is practiced. For instance, it is patent that income and wealth variables are power law distributed (with a spate of problems, see our Gini discussions in [3]). However, IQ scores are Gaussian (seemingly by design). Yet people regress one on the other, failing to see that it is improper.

Consider the following linear regression in which the dependent and independent variables are of different classes:

Y = aX + b + \epsilon,

where X is standard Gaussian (\mathcal{N}(0,1)) and ϵ is power law distributed, with E(ϵ) = 0 and E(ϵ²) < +∞. There are no restrictions on the parameters.

Clearly we can compute the coefficient of determination R² as 1 minus the ratio of the expectation of the sum of squared residuals over the total squared variation, so we get the more general answer to our idiosyncratic model. Since X ∼ N(0,1), Y ∼ N(b, |a|), we have

E(R^2) = 1 - \frac{E\left(\epsilon^2\right)}{a^2 + E\left(\epsilon^2\right)}.   (24)

Proof. Since the expectation of the total variation of Y is E\big((aX + b - b + \epsilon)^2\big) = a^2 + E\left(\epsilon^2\right).

And of course, for infinite variance:

\lim_{E(\epsilon^2)\to +\infty} E(R^2) = 0.

We can also compute it by taking, simply, the square of the correlation between X and Y.
For instance, assume the distribution for ϵ is the Student T distribution with zero mean, scale σ, and tail exponent α > 2 (as we saw earlier, we get identical results with other ones so long as we constrain the mean to be 0). Let's start by computing the correlation: the numerator is the covariance, Cov(X, Y) = E((aX + b + ϵ)X) = a. The denominator (standard deviation for Y) becomes

\sqrt{E\big(((aX + \epsilon) - a)^2\big)} = \sqrt{\frac{2\alpha a^2 - 4a^2 + \alpha\sigma^2}{\alpha - 2}}.

So

E(R^2) = \frac{a^2(\alpha - 2)}{2(\alpha - 2)a^2 + \alpha\sigma^2}   (25)
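A quick simulation sketch of how slowly and erratically the sample R² approaches (25) (our illustration; parameter values are assumptions):

```python
# Sample R^2 with Student T residuals (alpha near 2): the population value
# per Eq. (25) is below 0.1 here, but finite samples run high and erratic.
import numpy as np

rng = np.random.default_rng(6)
a, alpha, sigma = 1.0, 2.2, 1.0   # assumed illustrative values
for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)
    y = a * x + sigma * rng.standard_t(alpha, n)
    print(f"n = {n}: sample R^2 = {np.corrcoef(x, y)[0, 1]**2:.3f}")
```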
The point is illustrated in Fig. 15. The point invalidates many studies of the relations IQ-wealth and IQ-income of the kind in [4]; we can see the striking effect in Fig. 14. Given that R² is bounded in [0, 1], it will reach its true value very slowly; see the P-Value problem in Chapter ??.

Property 1. When a fat tailed random variable is regressed over a thin tailed one, the coefficient of determination R² will be biased higher, and requires a much larger sample size to converge (if it ever does).

We will examine in [5] the slow convergence of power law distributed variables under the law of large numbers (LLN): it can be as much as 10¹³ times slower than the Gaussian.

APPENDIX

1) Mean Deviation vs Standard Deviation: The metric standard deviation itself is a computational metric and does not map to distances as interpreted. Mean absolute deviation does. Why the [REDACTED] did statistical science pick STD over MD?

2) Relative Standard Deviation Error: The characteristic function Ψ1(t) of the distribution of x²:

\Psi_1(t) = \int_{-\infty}^{\infty} \frac{e^{-\frac{x^2}{2} + itx^2}}{\sqrt{2\pi}}\, dx = \frac{1}{\sqrt{1 - 2it}}.

With the squared deviation z = x², the pdf for n summands becomes:

f_Z(z) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \exp(-itz)\left(\frac{1}{\sqrt{1-2it}}\right)^n dt = \frac{2^{-\frac{n}{2}}\, e^{-\frac{z}{2}}\, z^{\frac{n}{2}-1}}{\Gamma\!\left(\frac{n}{2}\right)},\quad z > 0.   (26)

For the standard deviation \sqrt{z}, the pdf becomes \frac{2^{1-\frac{n}{2}}\, e^{-\frac{z^2}{2}}\, z^{n-1}}{\Gamma\left(\frac{n}{2}\right)}, giving the relative standard deviation error

\frac{n\,\Gamma\!\left(\frac{n}{2}\right)^2}{2\,\Gamma\!\left(\frac{n+1}{2}\right)^2} - 1.

3) Relative Mean Deviation Error: The characteristic function for |x| is that of a folded Normal distribution, but let us redo it:

\Psi_2(t) = \int_0^{\infty} \sqrt{\frac{2}{\pi}}\, e^{-\frac{x^2}{2} + itx}\, dx = e^{-\frac{t^2}{2}}\left(1 + i\,\mathrm{erfi}\!\left(\frac{t}{\sqrt{2}}\right)\right),

where erfi is the imaginary error function, erf(iz)/i.

The first moment:

M_1 = -i\,\frac{\partial}{\partial t}\left[e^{-\frac{t^2}{2n^2}}\left(1 + i\,\mathrm{erfi}\!\left(\frac{t}{\sqrt{2}\,n}\right)\right)\right]^n\Bigg|_{t=0} = \sqrt{\frac{2}{\pi}}.

The second moment:

M_2 = (-i)^2\,\frac{\partial^2}{\partial t^2}\left[e^{-\frac{t^2}{2n^2}}\left(1 + i\,\mathrm{erfi}\!\left(\frac{t}{\sqrt{2}\,n}\right)\right)\right]^n\Bigg|_{t=0} = \frac{2n + \pi - 2}{\pi n}.

Hence \frac{V(\mathrm{Mad})}{E(\mathrm{Mad})^2} = \frac{M_2 - M_1^2}{M_1^2} = \frac{\pi - 2}{2n}.
4) Finalmente, the Asymptotic Relative Efficiency For a Gaussian:

\mathrm{ARE} = \lim_{n\to\infty} \frac{\frac{n\,\Gamma\left(\frac{n}{2}\right)^2}{2\,\Gamma\left(\frac{n+1}{2}\right)^2} - 1}{\frac{\pi - 2}{2n}} = \frac{1}{\pi - 2} \approx .875,

which means that the standard deviation is 12.5% more "efficient" than the mean deviation conditional on the data being Gaussian, and these blokes bought the argument. Except …
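A simulation check of the Gaussian-conditional claim (a sketch; n and the trial count are arbitrary):

```python
# Relative errors of the two dispersion estimators on Gaussian data
# (deviations around the known mean 0, as in the derivations above):
# V(Std)/E(Std)^2 ~ 1/(2n) vs V(Mad)/E(Mad)^2 = (pi-2)/(2n).
import numpy as np

rng = np.random.default_rng(7)
n, trials = 30, 200_000
x = rng.standard_normal((trials, n))
std = np.sqrt((x**2).mean(axis=1))    # per-sample root mean square
mad = np.abs(x).mean(axis=1)          # per-sample mean absolute deviation
re_std = std.var() / std.mean()**2
re_mad = mad.var() / mad.mean()**2
print(re_std, 1 / (2 * n))               # ~ 0.0167
print(re_mad, (np.pi - 2) / (2 * n))     # ~ 0.0190
print(re_std / re_mad, 1 / (np.pi - 2))  # ~ 0.876: STD "wins" only here
```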
[Figure: RE (relative efficiency) as a function of a.]
REFERENCES
[1] E. Soyer and R. M. Hogarth, "The illusion of predictability: How regression statistics mislead experts," International Journal of Forecasting, vol. 28, no. 3, pp. 695-711, 2012.
[2] D. Goldstein and N. Taleb, "We don't quite know what we are talking about when we talk about volatility," Journal of Portfolio Management, vol. 33, no. 4, 2007.
[3] A. Fontanari, N. N. Taleb, and P. Cirillo, "Gini estimation under infinite variance," Physica A: Statistical Mechanics and its Applications, vol. 502, pp. 256-269, 2018.
[4] J. L. Zagorsky, "Do you have to be smart to be rich? The impact of IQ on wealth, income and financial distress," Intelligence, vol. 35, no. 5, pp. 489-501, 2007.
[5] N. N. Taleb, Technical Incerto, Vol 1: The Statistical Consequences of Fat Tails, Papers and Commentaries. Monograph, 2019.
[6] P. J. Huber, Robust Statistics. Wiley, New York, 1981.