Two Simple Statistical Calculations and ClimateGate
Derek O’Connor
Started: February 14, 2008. Latest: February 6, 2010
Yet the errors do not come from the art but from those who practice the art.¹
— Isaac Newton (1642 – 1727)
1 Introduction
This note is prompted by reports of errors in the calculation of two simple statistics: the mean of a vector $x$, $\bar{x} = \frac{1}{n}\sum_i x_i$, and its variance, $\mathrm{Var}(x) = \frac{1}{n}\sum_i (x_i - \bar{x})^2$. Errors in such simple calculations usually indicate that the problem is ill-conditioned, that an unstable algorithm is being used, or both.

We illustrate these ideas by analyzing two problems where these errors arise: the Sea Surface Heights mean problem, which is ill-conditioned, and the Microsoft Excel variance problem, where an unstable algorithm is used.

A new section called Climategate has been added. It identifies and explains some of the programming errors found in the journal of Ian (Harry) Harris, who worked as a scientist-programmer in the Climatic Research Unit at the University of East Anglia.

We will see that Newton’s admonition needs to be taken to heart, especially in scientific computing.
¹ Attamen errores non sunt Artis sed Artificum, from the ‘Author’s Preface to the Reader’, Philosophiae Naturalis Principia Mathematica, First Edition, July 5, 1686.
Two Statistical Calculations Derek OConnor
1.1 The Sea Surface Heights Problem
The following information is taken from a paper [4] by Yun He & Chris Ding of the NERSC-Lawrence Berkeley Labs, who were doing a large-scale simulation of ocean circulation. At each step of the simulation the following was done:

1. Sea Surface Heights are calculated at each point on a 64 × 120 latitude-longitude grid.
2. The average of these 64 × 120 = 7680 numbers is then calculated.
3. This average is compared with satellite data.

The Fortran code below does the summation part of these calculations.
    sum = 0.0
    do i = 1, 64                  ! latitude index
       do j = 1, 120              ! longitude index
          sum = sum + ssh(i,j)
       end do
    end do
The order of summation can be changed by interchanging the i and j indices and by reversing their order.

Table 1: He and Ding’s Results using 16-digit Precision

| Summation Order         | Value of sum                 | Rel. Error | Corr. Digs. |
| Longitude First         | 34.4147682189941410          | 95.1       | 0           |
| Latitude First          | 0.67326545715332031          | 0.88       | 0           |
| Reverse Longitude First | 32.302734375                 | 89.2       | 0           |
| Reverse Latitude First  | 0.734375                     | 1.05       | 0           |
| Exact Value             | 0.35798583924770355224609375 | 0          | 26          |
Table 1 shows the results that He & Ding got with the Fortran code shown above, on a single processor, using IEEE double precision (about 16 decimal digits). He & Ding point out that these results are completely wrong: not one digit is correct. We will analyse this problem in Section 3 and explain why He & Ding got such inaccurate results.
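The order dependence is easy to reproduce on a toy grid. The following Python sketch is not He & Ding's code or data: the 2 × 2 array `ssh` and all function names are hypothetical, chosen so that two huge entries cancel and the exact sum is 2.0. It sums the same four numbers in three loop orders and compares them with `math.fsum`, which tracks partial sums to return an accurately rounded result.

```python
import math

# Hypothetical 2 x 2 stand-in for the 64 x 120 ssh grid (NOT He & Ding's data):
# two huge entries that cancel exactly, plus two small ones. Exact sum: 2.0.
ssh = [[1.0e17, 1.0],
       [-1.0e17, 1.0]]


def sum_rows_outer(grid):
    """Naive sum with i as the outer loop and j as the inner loop."""
    total = 0.0
    for row in grid:
        for value in row:
            total += value
    return total


def sum_cols_outer(grid):
    """Naive sum with the loops interchanged: j outer, i inner."""
    total = 0.0
    for j in range(len(grid[0])):
        for i in range(len(grid)):
            total += grid[i][j]
    return total


def sum_reversed(grid):
    """Same as sum_rows_outer, but both indices run backwards."""
    total = 0.0
    for row in reversed(grid):
        for value in reversed(row):
            total += value
    return total


def sum_reference(grid):
    """Reference value from math.fsum, which avoids the loss of precision."""
    return math.fsum(value for row in grid for value in row)


if __name__ == "__main__":
    for f in (sum_rows_outer, sum_cols_outer, sum_reversed, sum_reference):
        print(f.__name__, f(ssh))
```

In IEEE double precision the three naive loop orders return 1.0, 2.0 and 0.0 respectively, while the reference sum is 2.0. Adding 1.0 to 1.0e17 leaves the large number unchanged, so whether a small term survives depends only on whether the large terms have already cancelled when it is added.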
1.2 The Microsoft Excel Problem
Microsoft’s Excel spreadsheet has been in use for many years and has gone through many versions. Many millions of people in business, government, and universities use some version of Excel. Most users do not have the time or ability to test the quality of Excel’s calculations, and when errors do occur most users do not recognise the result as erroneous. Here is an example where Excel gets a wrong answer to a simple problem.
© Derek O’Connor, February 6, 2010
We wish to calculate the mean and standard deviation of the set of numbers

$$x_i = a_i + M, \qquad a_i = 1 \ \text{for}\ i = 1, 3, 5, 7, 9 \quad\text{and}\quad a_i = 2 \ \text{for}\ i = 2, 4, 6, 8, 10,$$

where $M$ is a large constant. This contrived example is designed to reveal flaws in the standard deviation calculation.

The exact values of the mean and variance are

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{10}\sum_{i=1}^{n} (a_i + M) = \frac{1}{10}(15 + 10M) = M + 1.5$$

$$\mathrm{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{10-1}\sum_{i=1}^{10} (a_i + M - M - 1.5)^2 = \frac{1}{9}\sum_{i=1}^{10} (a_i - 1.5)^2 = \frac{1}{9}\sum_{i=1}^{10} (\pm 0.5)^2 = \frac{1}{9}\,2.5 = 0.27\dot{7}\ldots$$

$$\mathrm{SDev}(x) = \sqrt{0.27\dot{7}\ldots} = 0.5270462766947299$$

rounded to 16 digits, and does not involve $M$.
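The cancellation of $M$ can be checked with exact rational arithmetic. The sketch below is an illustration in Python using the standard `fractions` module; the function name `exact_mean_and_variance` is hypothetical. Because no rounding occurs, the variance should come out as exactly 5/18 (that is, 2.5/9) whatever value of $M$ is used.

```python
from fractions import Fraction
import math


def exact_mean_and_variance(M):
    """Return (mean, variance) of x_i = a_i + M, i = 1..10, as exact Fractions.

    a_i = 1 for odd i and a_i = 2 for even i, as in the text.
    The variance uses the n - 1 divisor, matching the derivation.
    """
    a = [1 if i % 2 == 1 else 2 for i in range(1, 11)]
    x = [Fraction(a_i) + Fraction(M) for a_i in a]
    n = len(x)
    mean = sum(x) / n
    variance = sum((x_i - mean) ** 2 for x_i in x) / (n - 1)
    return mean, variance


if __name__ == "__main__":
    for M in (10**8, 10**10, 10**14, 10**15):
        mean, variance = exact_mean_and_variance(M)
        # The mean is M + 3/2; the variance is the same for every M.
        print(M, mean, variance, math.sqrt(variance))
```

Any floating-point method whose result changes with $M$ is therefore losing accuracy in the arithmetic, not reflecting a property of the data.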
Table 2 shows the results of Excel 2000’s calculations with $M = 10^{8}, 10^{10}, 10^{14}, 10^{15}$. The last line of Table 2 contains Excel 2000’s values for the standard deviation. None of these values is correct. This result is not new: for many years and versions the variance function in Excel has been calculated by a bad algorithm, which gives the bad results shown here.²

We will analyse this problem in Section 4 and explain why Excel 2000 and later versions get such inaccurate results.

Table 2: Excel 2000 Results
| i    | x_i                   | x_i + 10^8    | x_i + 10^10   | x_i + 10^14        | x_i + 10^15         |
| 1    | 1                     | 10000001      | 1000000001    | 100000000000001    | 1000000000000001    |
| 2    | 2                     | 10000002      | 1000000002    | 100000000000002    | 1000000000000002    |
| 3    | 1                     | 10000001      | 1000000001    | 100000000000001    | 1000000000000001    |
| 4    | 2                     | 10000002      | 1000000002    | 100000000000002    | 1000000000000002    |
| 5    | 1                     | 10000001      | 1000000001    | 100000000000001    | 1000000000000001    |
| 6    | 2                     | 10000002      | 1000000002    | 100000000000002    | 1000000000000002    |
| 7    | 1                     | 10000001      | 1000000001    | 100000000000001    | 1000000000000001    |
| 8    | 2                     | 10000002      | 1000000002    | 100000000000002    | 1000000000000002    |
| 9    | 1                     | 10000001      | 1000000001    | 100000000000001    | 1000000000000001    |
| 10   | 2                     | 10000002      | 1000000002    | 100000000000002    | 1000000000000002    |
| Sum  | 15                    | 100000015.0   | 10000000015.0 | 1000000000000015.0 | 10000000000000016.0 |
| Mean | 3/2                   | 10000001.5    | 1000000001.5  | 100000000000001.5  | 1000000000000001.6  |
| Sdev | $\sqrt{0.2\dot{7}}$   | 0.54...20E+00 | 0.00...00E+00 | 1.39...30E+06      | 0.00...00E+00       |
² Note that the sum in the last column is also wrong.
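This section does not say which algorithm Excel uses. As an assumption made for illustration only, the Python sketch below takes the unstable method to be the one-pass "textbook" formula $\left(\sum x_i^2 - (\sum x_i)^2/n\right)/(n-1)$ and compares it with the two-pass formula used in the exact derivation of the variance. The function names are hypothetical, and the output is not claimed to reproduce Excel's figures.

```python
def make_data(M):
    """x_i = a_i + M with a_i = 1 for odd i and a_i = 2 for even i, i = 1..10."""
    return [float(M) + (1.0 if i % 2 == 1 else 2.0) for i in range(1, 11)]


def variance_one_pass(x):
    """One-pass textbook formula (ASSUMED here to be the unstable method).

    It subtracts two nearly equal sums of order n * M**2, so the small
    true variance is swamped by their rounding errors when M is large.
    """
    n = len(x)
    s1 = 0.0  # running sum of x_i
    s2 = 0.0  # running sum of x_i squared
    for v in x:
        s1 += v
        s2 += v * v
    return (s2 - s1 * s1 / n) / (n - 1)


def variance_two_pass(x):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - 1)


if __name__ == "__main__":
    for M in (10**8, 10**10):
        x = make_data(M)
        print(M, variance_one_pass(x), variance_two_pass(x))
```

For $M = 10^{8}$ and $M = 10^{10}$ the two-pass result agrees with 5/18 to full double precision, because the sum, the mean and the deviations ±0.5 are all exactly representable. The one-pass result cannot be close: each of the two sums it subtracts is rounded to a multiple of 16 or more, so their difference is either zero or at least 16 in magnitude.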