
Temperatures, Trends, and Missing Data

Harald E. Krogstad, NTNU, January 2007

1 Trends
We all know that the temperature varies in a cyclic manner over the year. In addition,
there are considerable variations from year to year. It may come as a surprise, but there
exists no complete theory for the year-to-year variations, or whether this winter will be
warmer than last year’s winter. A common view among scientists is that, because the weather
system is chaotic, this variability will never be fully explained.
This note considers yearly mean temperatures, as these are found at www.rimfrost.no,
and the focus will be on statistics rather than climate science.
Starting with the temperature record from Trondheim, the yearly means (or averages) are
shown in Figure 1. First of all, we observe considerable gaps (missing data) in the series.
The graph also shows that the year-to-year variations appear to occur without any obvious
regularity. Finally, there seems to be some slow variation in the mean temperature; in
particular, the most recent data appear to be somewhat higher than the earlier
recordings. This slow variation is called a trend.

[Figure 1 plot: Trondheim (all measurements); temperature (°C) versus year, 1800-2000.]

Figure 1: Time series of yearly mean temperatures in Trondheim. Data are recorded at different
locations within the city and may therefore have some small systematic offsets.

The temperature measurements are called a time series. For the yearly means, the time
step is 1 year, and we write the measurements as $\{X_i\}$, where $X_i$ is the measured value at
time $t_i$.
In a trend analysis $X_i$ is expressed as a sum of two parts,

$$X_i = T_i + r_i, \tag{1}$$

where $T_i$ is the trend and the remainder, $r_i = X_i - T_i$, is called the residual. We prefer
that the trend is slowly varying, whereas the residual should vary from point to point with
no obvious regularity, which is called a random or stochastic variation. Unfortunately,
there is no unique or best way of making this division. Without going too far into the
theory, let us say that the trend curve is acceptable if it looks reasonable. Clearly, such
a curve can never be scientifically "true". An example of a trend curve and the residuals
for the temperatures from Blindern, Oslo, is shown in Fig. 2. The trend curve is slowly
varying, and the residuals spread out evenly around the trend. This is exactly what we
want from a good trend curve.

[Figure 2 plot: three panels titled "Yearly mean temperatures and trend curve, Blindern, Oslo", "Trend", and "Residuals"; temperature (°C) versus year, 1800-2000.]

Figure 2: Data, trend and residuals for yearly mean temperatures from Blindern, Oslo. The trend
curve is computed by means of the Hodrick-Prescott filter (described below).

One can say that a good trend curve is what you would draw by hand. The statistical
tradition has been to fit trend curves that are straight lines or polynomials of degree 2 to 3.
Attempting to fit polynomials of higher order is seldom a good idea. The trend curve in
Fig. 2 is not a polynomial, but it is possible to make up nice trend curves using 3rd order
polynomials glued together (spline curves).
The trend curve in Fig. 2 is produced using a very simple principle. We are looking for the
trend $T$ at the same points as we have the data. Since the trend curve should be centred
in the middle of the data, it is reasonable to require that

$$\mathrm{MSE} = \sum_{i=1}^{N} (X_i - T_i)^2 \tag{2}$$

is small, but not so small that the curve becomes too irregular. Obviously, MSE = 0
implies that $T_i = X_i$, which is useless.
It is simple to see that what is called the second difference,

$$D_i = T_{i+1} - 2T_i + T_{i-1}, \tag{3}$$

measures how straight the trend curve is around $t_i$. If $T_{i-1}$, $T_i$, and $T_{i+1}$ lie on a straight
line, $D_i = 0$. Thus, the quantity

$$\mathrm{DEV} = \sum_{i=2}^{N-1} D_i^2 \tag{4}$$

measures how straight the full curve is: if DEV = 0, all points lie on a straight line. It is
reasonable to try to make both MSE and DEV small, but since these requirements are in
conflict, one instead minimizes the sum

$$\mathrm{HP} = \mathrm{MSE} + \lambda \times \mathrm{DEV} \tag{5}$$

where λ is a parameter we are free to choose. If λ = 0, we get the trivial minimum $T_i = X_i$,
while an increasing λ straightens the trend curve. In the limit when λ → ∞, the curve
approaches the least-squares linear regression line. This is illustrated in Fig. 3. By varying
λ, we choose how straight the curve is, but it is not always easy to say which choice is best.
These trend curves, which seem to cover what we need, are called Hodrick-Prescott curves,
and the algorithm the Hodrick-Prescott filter, named after the people who introduced the
method to the economists in the 1990s (E. C. Prescott got the Nobel Prize in Economics
for 2004 together with the Norwegian Finn Kydland). Nevertheless, the method is much
older, dating back at least to the 1920s. We shall not discuss how to choose λ, but rather
rely on the subjective impression of the result.
So far, we have disregarded that it is not quite straightforward to compute the trend curve
by minimizing HP in Eqn. 5. This amounts to solving a linear system of equations for
{Ti }. When the number of data points is large, up to 200 here, this requires a computer
with appropriate software. According to the Internet, there exist free add-ins available for
Microsoft Excel™ (Matlab™ has been used here).
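
As a minimal sketch of this computation (not the Matlab code used for the figures), setting the gradient of HP in Eqn. 5 to zero gives the linear system (I + λ DᵀD) T = X, where D is the matrix that forms the second differences of Eqn. 3. The Python/NumPy formulation, the function name `hp_trend`, and the dense matrix solve are assumptions made for illustration; a dense solve should be adequate for the roughly 200 points considered here.

```python
import numpy as np

def hp_trend(x, lam):
    """Trend T minimizing sum((x - T)**2) + lam * sum(second differences of T squared)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 3:
        return x.copy()  # no second differences can be formed
    # D is the (n-2) x n second-difference matrix:
    # (D @ T)[i] = T[i] - 2*T[i+1] + T[i+2]
    D = np.zeros((n - 2, n))
    rows = np.arange(n - 2)
    D[rows, rows] = 1.0
    D[rows, rows + 1] = -2.0
    D[rows, rows + 2] = 1.0
    # Zero gradient of HP gives (I + lam * D^T D) T = x
    A = np.eye(n) + lam * (D.T @ D)
    return np.linalg.solve(A, x)

# Usage on synthetic data (hypothetical numbers, not the Blindern series)
rng = np.random.default_rng(0)
years = np.arange(1900, 2000)
temps = 5.0 + 0.01 * (years - 1900) + rng.normal(0.0, 1.0, years.size)
trend = hp_trend(temps, lam=1400.0)
residual = temps - trend
```

With `lam=0` the function returns the data unchanged, and a series that already lies on a straight line is returned unchanged for any `lam`, in line with the two limits described above.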
If one wants to keep things simple, the old-fashioned Moving Average (MA) over M points
is the first that comes to mind,
$$T_i = \frac{1}{M} \sum_{j=i-[M/2]}^{i+[M/2]} X_j. \tag{6}$$

When $M$ is an odd number, $[M/2] = (M - 1)/2$ (it is convenient to let $M$ be an odd
number so that $T_i$ is an average centred at $t_i$). The obvious problem with this formula is
that the computations run off the ends, also called the end-effect.

[Figure 3 plot: three panels for Blindern, Oslo, with λ = 100, λ = 1500, and λ = 100000; temperature (°C) versus year, 1800-2000.]

Figure 3: Examples of trend curves for varying values of λ in the Hodrick-Prescott filter.

[Figure 4 plot: "Yearly means and a Moving Average, 21 years, Blindern, Oslo"; temperature (°C) versus year, 1800-2000.]

Figure 4: Data and moving average over 21 years for the data from Blindern. See the text for how
the moving average is extended all the way to the ends.

One possibility to avoid the end-effect would be to reduce $M$ near the ends, but that would
make the trend curve unstable in the end-zones compared to the mid-range. Another choice is
to fit a linear regression line to $M$ points and use that for the $[M/2]$ points closest to the
ends. The result for $M = 21$ (years) is shown in Fig. 4. The trend curve is a straight line near
the ends, and the curve has more small wiggles than the Hodrick-Prescott curve. There is no
universal best way to deal with the end-effect, and the use of a straight regression line at the
ends will not be suitable in all cases (e.g. when dealing with physically positive data, where
the line may give negative values).
The wiggles on the MA-curve are reduced by repeating the MA-operation on the current
MA-curve several times. After $k$ repetitions (or iterations), where $k \geq 4$, we obtain a
moving average with an approximately Gaussian weight function. Apart from giving a smoother
curve, $k$ repetitions of an $M$-point moving average roughly amount to one moving average
over

$$M_k = \sqrt{k} \times M \tag{7}$$

points. However, contrary to the single MA operation, repeated averaging puts more weight
on neighbouring points compared to points further away. Fig. 5 illustrates the iterative
MA for the temperature data from Paris.
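
The moving average of Eqn. 6, the regression-line treatment of the ends, and the repeated application can be sketched as follows. This is an illustrative Python/NumPy version, not the author's code; the function names are hypothetical, and applying the same end treatment in every iteration is an assumption, since the text does not say how the ends are handled when the average is repeated.

```python
import numpy as np

def moving_average(x, m):
    """Centred m-point moving average (m odd); end zones use a straight regression line."""
    x = np.asarray(x, dtype=float)
    n, h = x.size, m // 2
    if m % 2 == 0 or m > n:
        raise ValueError("m must be odd and no larger than the series length")
    trend = np.empty(n)
    # Interior points: plain average over the m points centred at i (Eqn. 6)
    trend[h:n - h] = np.convolve(x, np.ones(m) / m, mode="valid")
    if h > 0:
        t = np.arange(n)
        # End zones: least-squares line through the first (last) m points,
        # evaluated at the h points closest to that end
        trend[:h] = np.polyval(np.polyfit(t[:m], x[:m], 1), t[:h])
        trend[n - h:] = np.polyval(np.polyfit(t[n - m:], x[n - m:], 1), t[n - h:])
    return trend

def iterated_moving_average(x, m, k=4):
    """Apply the m-point moving average k times in succession."""
    trend = np.asarray(x, dtype=float)
    for _ in range(k):
        trend = moving_average(trend, m)
    return trend

def effective_width(m, k):
    """Approximate width of the single MA matching k passes of an m-point MA."""
    return np.sqrt(k) * m
```

For example, `effective_width(15, 4)` gives 30, which is close to the 29-point single moving average that Fig. 5 compares with a 15-point average repeated 4 times.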
Even if iterative MA (IMA) and MA are similar in the interior, they differ slightly at the
ends. A similar effect appears also to be present for the HP curve shown in Fig. 6.
When comparing Hodrick-Prescott and iterative MA, there should be a correspondence between
the parameters $(M, k)$ for the IMA and the λ-parameter in Eqn. 5. A rough test,
based on artificial data, gave the following relation between $M_k$ and λ:

$$\lambda = M_k^{3.56} / 60. \tag{8}$$

($M_k = \sqrt{k} \times M$, and $k$ was equal to 4 in the study). For λ = 1400, used above, this formula
suggests $M_k \approx 24$.
As a subjective conclusion, it therefore seems that a 12 year MA (11 or 13 years are equally
good) iterated 4 times is a reasonable choice for the temperature data considered here.
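
As a worked check of these numbers, Eqn. 8 can be inverted for $M_k$ and combined with Eqn. 7, read here as $M_k = \sqrt{k} \times M$ (the rounding is approximate):

$$M_k = (60\,\lambda)^{1/3.56} = (60 \times 1400)^{1/3.56} = 84000^{1/3.56} \approx 24, \qquad M = \frac{M_k}{\sqrt{k}} \approx \frac{24}{\sqrt{4}} = 12.$$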

[Figure 5 plot: temperature (°C) versus year, 1750-2000; legend: Data, MA 29 pts., IMA 15 pts, 4 it.]

Figure 5: Data from Paris. One moving average over 29 points (blue), and a moving average over
15 points, repeated 4 times (red).

[Figure 6 plot: temperature (°C) versus year, 1750-2000; legend: Data, Hodrick-Prescott, IMA 11 pts, 4 it.]

Figure 6: Similar to previous figure. Iterative moving average (MA over 11 points and 4 iterations)
(red), and the Hodrick-Prescott curve with λ = 1400 (blue).

2 Missing Data
When looking through temperature data at rimfrost.no, we often find that some data are
missing. Restoring a data series by filling in missing data in a reasonable way is therefore
a common problem. An occasional missing month is not serious; an average of the mean
temperatures in the neighboring months would suffice in most cases. If yearly means are
missing, it would still be possible to find a reasonable trend curve if we only miss a few
years in a long data series. Missing data could then be restored by filling in the trend value.
Obviously, one could also fill in a random residual, but this would only make the graph
look nicer without really adding information to the data.
If we return to the Blindern temperatures in Fig. 2 and compute the so-called auto-correlation
function of the residual, we obtain the result in Fig. 7.

[Figure 7 plot: Blindern, Oslo; autocorrelation function versus time difference (years), 0-20.]

Figure 7: The auto-correlation function for the residuals in the yearly means from Blindern, Oslo.
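
An auto-correlation function of this kind can be estimated from a residual series roughly as sketched below. The Python/NumPy code, the function name `autocorrelation`, and the choice of the common biased estimator (dividing by the full series length at every lag) are assumptions for illustration; the text does not say which estimator was used for Fig. 7.

```python
import numpy as np

def autocorrelation(residuals, max_lag=20):
    """Sample auto-correlation of a series for time differences 0..max_lag."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()              # remove any remaining mean
    n = r.size
    c0 = np.dot(r, r) / n         # variance (lag 0)
    # Covariance at each lag, normalized by the lag-0 value
    return np.array([np.dot(r[:n - lag], r[lag:]) / n / c0
                     for lag in range(max_lag + 1)])

# Usage on synthetic, independent residuals (hypothetical data)
rng = np.random.default_rng(0)
acf = autocorrelation(rng.normal(0.0, 1.0, 2000), max_lag=20)
```

The value at lag 0 is 1 by construction, and for independent residuals the values at all other lags scatter around 0.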

The correlation drops nearly to 0 after only one year. This does not mean that the residuals
are completely independent, but it does imply that it is very hard to predict the residual
for one year from the neighboring years (at least for this and other locations with a similar
climate). The trend curves discussed above require complete data sets for their construction,
and for a quick but primitive restoration, with only a few missing data, the simplest
method is to fill in extra data by linear interpolation. However, in this case, it is important
to flag restored data on a graph and visually inspect that the trend curve is reasonable.
For larger gaps, like those in the Trondheim series shown in Fig. 1, this would be too primitive.
An example of the simple approach for the Warszawa time series is shown in Fig. 8. Whereas
simple interpolation in this case apparently works well for single missing points and occasional
larger gaps, it introduces an artificial hump in the trend for the missing years 1940 to
1950. Nevertheless, as long as the trend curve is visually inspected and the restored data
points are clearly flagged, this may also be acceptable. Instead of a linear interpolation between
the two points next to the gap, it is more reasonable to interpolate between averaged data on
both sides of the gap.
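
Both restoration variants can be sketched as below, with missing years marked as NaN and a boolean mask returned so that restored points can be flagged on a graph. This is an illustrative Python/NumPy version with hypothetical function names. The averaged variant is one possible reading of "averaged data on both sides of the gap": the window `width` and the choice to place the two averages at the years bordering the gap are assumptions.

```python
import numpy as np

def fill_gaps_linear(years, values):
    """Fill NaN entries by linear interpolation; return (filled, restored_mask)."""
    years = np.asarray(years, dtype=float)    # assumed sorted in increasing order
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    filled = values.copy()
    # np.interp holds the end values constant outside the range of valid years
    filled[missing] = np.interp(years[missing], years[~missing], values[~missing])
    return filled, missing

def fill_gaps_averaged(years, values, width=5):
    """Interpolate across each gap between means of up to `width` values on either side."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    filled = values.copy()
    idx = np.flatnonzero(missing)
    if idx.size == 0:
        return filled, missing
    # Split the missing positions into runs of consecutive indices (one run per gap)
    for gap in np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1):
        lo, hi = gap[0] - 1, gap[-1] + 1      # valid points bordering the gap
        if lo < 0 or hi >= values.size:
            continue                          # gap touches an end: left unfilled
        left_mean = np.nanmean(values[max(0, lo - width + 1):lo + 1])
        right_mean = np.nanmean(values[hi:hi + width])
        filled[gap] = np.interp(years[gap], [years[lo], years[hi]],
                                [left_mean, right_mean])
    return filled, missing
```

The returned mask marks every year that was missing in the input, so it can be used directly to plot restored points in a different colour, as recommended above.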
Instead of trying to restore data by looking only at the series itself, it is often better to
include data from neighboring locations.

[Figure 8 plot: Yearly mean, Warszawa; temperature (°C) versus year, 1750-2000; legend: Data, Inserted data, Hodrick-Prescott λ = 1400, IMA 11 pts., 4 it.]

Figure 8: Example showing available data (blue), data inserted using linear interpolation (red),
and the resulting trend curves.

In the following example we return to the data from Trondheim shown in Fig. 1, which, for
reasons unknown to this author, have large gaps. Since Norway has a relatively high density
of measurement sites, it is reasonable to use stations near Trondheim for filling in the
missing data. Trondheim is located 30 km from Værnes and about 60 km from Selbu, and both
locations have overlapping data sets with Trondheim.
The general methodology is as follows. First select a group of stations and data that cover
the gaps with good margin. Then extract the subset of the data where all stations have data.
This reduced set is called the calibration set and is used for establishing a relation between
the neighboring stations and the target station. Denoting the temperature at the target $T_0$
and at the neighbours $T_1, T_2, \cdots, T_n$, the typical expression has the form

$$T_0 = a_0 + a_1 T_1 + \cdots + a_n T_n, \tag{9}$$

and is called a multivariate linear regression. The constants $a_0, a_1, \cdots, a_n$ are found by
minimizing the expression

$$J(a_0, a_1, \cdots, a_n) = \frac{1}{C} \sum_{c=1}^{C} \left( T_{0,c} - a_0 - a_1 T_{1,c} - \cdots - a_n T_{n,c} \right)^2 \tag{10}$$

where $c$ runs over the calibration data set (the number of calibration data, $C$, should
exceed $n$). This involves solving the so-called Normal Equations and will not be discussed
here. When the $a$'s are determined, the size of $\sqrt{J}$ will be a measure of the precision by
which $T_0$ may be obtained from $T_1, T_2, \cdots, T_n$.
When carrying out this in practice, one should be restrictive when selecting stations to be
included. As a general rule, it is important to avoid nearly redundant stations and stations
with little/no influence on the result.
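
The steps above (select the calibration set, fit Eqn. 9 by minimizing Eqn. 10, then predict the missing target values) can be sketched as follows. This is an illustrative Python/NumPy version with hypothetical function names; it uses the least-squares solver `np.linalg.lstsq` instead of forming the Normal Equations explicitly, which gives the same minimizer, and it assumes that missing values are stored as NaN.

```python
import numpy as np

def fit_station_regression(target, neighbours):
    """Fit T0 = a0 + a1*T1 + ... + an*Tn on the years where all stations have data.

    target: shape (years,); neighbours: shape (years, n); NaN marks missing data.
    Returns (coeffs, rms_error, calibration_mask), with coeffs = [a0, a1, ..., an].
    """
    target = np.asarray(target, dtype=float)
    neighbours = np.asarray(neighbours, dtype=float).reshape(target.size, -1)
    # Calibration set: years where the target and every neighbour are present
    calib = ~np.isnan(target) & ~np.isnan(neighbours).any(axis=1)
    A = np.column_stack([np.ones(calib.sum()), neighbours[calib]])
    coeffs, *_ = np.linalg.lstsq(A, target[calib], rcond=None)
    # sqrt(J): RMS prediction error over the calibration set
    rms = np.sqrt(np.mean((target[calib] - A @ coeffs) ** 2))
    return coeffs, rms, calib

def restore_target(target, neighbours, coeffs):
    """Fill missing target years for which all neighbours are available."""
    target = np.asarray(target, dtype=float)
    neighbours = np.asarray(neighbours, dtype=float).reshape(target.size, -1)
    fillable = np.isnan(target) & ~np.isnan(neighbours).any(axis=1)
    restored = target.copy()
    restored[fillable] = coeffs[0] + neighbours[fillable] @ coeffs[1:]
    return restored, fillable
```

The `fillable` mask plays the same role as the flag discussed for interpolation: it records which yearly means are predictions rather than measurements.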
Fig. 9 shows the Selbu/Trondheim and the Værnes/Trondheim calibration data sets, comprising
47 years of data.
Apart from the systematic difference between the sites (the bias), the variability between
Selbu and Trondheim is slightly greater than between Værnes and Trondheim.
The idea is now to try to predict the Trondheim temperature from the Selbu and Værnes
temperatures, and the result came out as shown in Table 1.

[Figure 9 plot: two scatter panels titled "Calibration data"; Trondheim (°C) versus Selbu (°C), and Trondheim (°C) versus Værnes (°C), both axes 2-8.]

Figure 9: Simultaneous yearly means for Selbu and Trondheim, and Værnes and Trondheim (47
common years).

Order   Formula
0       T_T = 0.48 °C + T_S                            (0.26 °C)
0       T_T = −0.38 °C + T_V                           (0.18 °C)
1       T_T = 0.79 °C + 0.93 × T_S                     (0.25 °C)
1       T_T = 0.00 °C + 0.93 × T_V                     (0.17 °C)
2       T_T = 0.07 °C + 0.17 × T_S + 0.78 × T_V        (0.17 °C)

Table 1: Results of the regression analysis. T_T, T_S, and T_V are the temperatures at Trondheim,
Selbu and Værnes, respectively. The RMS prediction error, quantifying the reliability of the
result, is given in the brackets behind each formula.

[Figure 10 plot: Trondheim, predicted temperature (°C) versus Trondheim, actual temperature (°C), both axes 2-8.]

Figure 10: Actual vs. predicted temperature in Trondheim based on the calibration data set.

Order 0 simply means compensating for the bias between the stations, Order 1 is linear
regression, and Order 2 is multilinear regression. The prediction error for the
Værnes-Trondheim linear regression formula is slightly higher (0.173 °C) than the prediction
error for the multilinear regression (0.168 °C). The improvement by going to Order 2 is thus
negligible, and if one should select one of the relations in Table 1, the bias-compensating
formula T_T = −0.38 °C + T_V is clearly the simplest.
Figure 10 shows the actual Trondheim temperature vs. the predicted temperature from
the multivariate regression. It could be mentioned that it is good practice to split the
calibration data set into two parts; the first one is used for deriving the calibration formula
and the second one for checking the result.
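
Such a split check might look like the sketch below. It is an illustrative Python/NumPy version with a hypothetical function name, and it assumes a complete calibration set (no missing values) that is split by position, with the first `n_fit` years used for fitting and the remainder for checking; the text does not specify how the two parts should be chosen.

```python
import numpy as np

def split_calibration_check(target, neighbours, n_fit):
    """Fit the regression on the first n_fit calibration years, check on the rest.

    Returns (coeffs, fit_rms, check_rms), with coeffs = [a0, a1, ..., an].
    """
    target = np.asarray(target, dtype=float)
    neighbours = np.asarray(neighbours, dtype=float).reshape(target.size, -1)
    A = np.column_stack([np.ones(target.size), neighbours])
    # Derive the calibration formula from the first part only
    coeffs, *_ = np.linalg.lstsq(A[:n_fit], target[:n_fit], rcond=None)
    # RMS error on the data used for fitting
    fit_rms = np.sqrt(np.mean((target[:n_fit] - A[:n_fit] @ coeffs) ** 2))
    # RMS error on the held-back data, which the fit has not seen
    check_rms = np.sqrt(np.mean((target[n_fit:] - A[n_fit:] @ coeffs) ** 2))
    return coeffs, fit_rms, check_rms
```

A `check_rms` that is much larger than `fit_rms` would suggest that the formula describes the fitting years better than it generalizes, for instance because too many stations were included.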
With 3 stations available, it is possible to predict one that is missing from the two others.
This is shown in Figure 11. In the present case, the prediction error is smaller than 0.2
degrees, and therefore insignificant compared to the year-to-year variations at each of the
locations.

3 Acknowledgement
Thanks to Dr. Stephen F. Barstow, Fugro Oceanor (FOAS), for useful comments and
corrections.

[Figure 11 plot: three panels titled Selbu, Værnes, and Trondheim; temperature (°C) versus year, 1920-2010.]

Figure 11: Restored yearly means in the time series from Trondheim, Værnes and Selbu. Only
points where both the other stations were available have been restored.
