
CLIVE ORTON

TESTING SIGNIFICANCE OR TESTING CREDULITY?

Summary. The paper examines two recent attempts to analyse a set of data
relating to Icenian coin hoards. The methodology of the second attempt, which
is based on significance tests made on ‘permillia’ data (i.e. coins per
thousand), is shown to be fatally flawed, and its conclusions unfounded. The
danger of using analytical statistical techniques without consideration of the
assumptions on which they are based, is stressed. Alternative approaches, one
analytical and one graphical, are presented: they suggest an interpretation
which differs from those of both the original analyses.

The use of statistical techniques in archaeology is full of pitfalls, as many archaeologists know to their cost. Occasionally, papers are published which attempt to ‘set the record straight’ and bring back straying archaeologists to the straight and narrow path (e.g. Thomas 1978). It is especially important that such papers should themselves be free of serious statistical error, as they are likely to be taken as a model by archaeologists seeking some certainty in a sea of statistical confusion.

It is particularly unfortunate, therefore, that a recent paper in this journal (van Arsdell 1996), while attempting to improve the methodology of an earlier contribution (Creighton 1994), has only succeeded in digging an even deeper pit to entrap unwary archaeologists. It is necessary to point out the fatal flaws in the methodology before it has a chance to become an accepted technique.

The issue concerns the statistical analysis of a simple dataset: counts of coins of 9 types found in 12 hoards. The traditional view is that the hoards are very similar to each other, and all date to around the Boudican rebellion. Creighton (1994) suggested a chronological ordering of the hoards, based on a simple seriation. In turn, van Arsdell (1996) criticised Creighton on the grounds that he had not demonstrated that there were statistically significant differences between the hoards, and went on to apparently show that there were no such differences. The first point is perfectly valid: it is all too easy to spot patterns in ‘random’ data, and an objective check that any possible patterns are ‘real’ is an essential prerequisite to their interpretation. The problem comes with the method used to demonstrate the apparent lack of statistical significance.

The first step of this method was to convert all the counts to permillia (i.e. coins per thousand, in contrast to the more common percentages, or coins per hundred), following the example of Reece (1981). This is a perfectly respectable technique of data presentation, and lends itself well to the

OXFORD JOURNAL OF ARCHAEOLOGY 16(2) 1997


© Blackwell Publishers Ltd. 1997, 108 Cowley Road, Oxford OX4 1JF, UK
and 350 Main Street, Malden, MA 02148, USA.

production of comparative graphs. The next step was to compare each hoard against the aggregate of all the others, using the permillia of each coin type as the basis for comparison. For each coin type in turn, a range of ± two standard deviations was plotted about the mean permillia figure for ‘other hoards’, and the corresponding figure for the chosen hoard was plotted to see if it fell within the range; if it did, this was taken as evidence that this difference was not statistically significant.

This is where the method breaks down, because two key assumptions, on which this ‘2 sigma’ rule is based, do not hold in this particular dataset. First, the data are not normally distributed, but in fact have a skewed distribution (this can be seen from the way in which the ± 2 standard deviation bars frequently encompass impossible negative values). Second, and more serious, the individual permillia values from which each standard deviation is calculated do not have the same statistical distribution. Even under the implicit null hypothesis that all the hoards have the same pattern (and hence all hoards have the same mean permillia for any chosen coin type), they all have different standard deviations because they are based on hoards of different sizes. The calculation of the standard deviation in this way (i.e. on percentages or permillia), and its use in constructing a hypothesis test (in effect, a t-test, although it is not called such), is invalid. Discussion of this point led Reece to stop using this technique (Reece 1988, 22–3).

To the archaeologist, this may seem like an obscure technical point, but it is a technicality that ‘pulls the rug from beneath the feet’ of the method. At a common-sense level, the fatal weakness of the method can be seen in the observation that the apparent significance of the differences in no way depends on the sizes of the hoards. If, for example, all the hoards were ten times the size, the conclusion would be exactly the same. But with ten times more evidence, it is much more likely that the observed pattern would be judged to be statistically significant. The method fails to take account of the size of the dataset and thus of the weight of the evidence. The problem of the relationship between sample size and statistical significance has been widely discussed, for example by Shennan (1988, 77–8).

The error is compounded by van Arsdell’s misleading explanation of the meaning of a significance test (e.g. 1996, 236). If a hoard is proven to be different from the others at the two standard deviation level, it is not true that ‘there will still be a five percent chance that they are actually the same’. What can be said is that if they were the same, there would be a five percent chance of the difference between them appearing to be as big as (or bigger than) the observed difference. This misinterpretation is repeated at several stages in the argument.

So what can be done? In statistical terms, the dataset is a small two-way contingency table, amenable to analysis by any of the techniques designed for such tables, e.g. the χ² (chi-squared) test, the related G² test, or log-linear analysis (Bishop et al. 1975). Here I shall first use the simplest and most familiar, the χ² test, which compares the observed data with the figures that would have been obtained if (in this case) each coin type had occurred in the same proportions in all the hoards, and gives a measure of the statistical significance of the difference between the two (known as the ‘observed’ and the ‘expected’ respectively).

If we apply this test to the data as they stand, we obtain a value of χ² = 696 on 88 degrees of freedom (d.f.), which is statistically significant at the 0.0001 level, i.e. if the hoards were really ‘the same’, there would be


a chance of less than 0.0001 (1 in 10,000) of them appearing as different as they do.

Unfortunately, it is not quite that simple. The calculation of χ² is only an approximation, which can become inaccurate if some of the ‘expected’ values are ‘too small’ (Cochran 1954). This implies that two small hoards (Brettenham, 5 coins, and March, 8 coins) and three rare coin types (D, 8 examples, E, 7 examples, and M, 9 examples) must either be deleted before the analysis, or merged with other categories. If they are deleted, the resulting 6-by-10 table gives χ² = 149 on 45 d.f., and p is again < 0.0001. My preference would be to delete the two small hoards and to merge coin types D and E with C, and M with LN (as done by van Arsdell in his Tables 1 and 2), again leading to a 6-by-10 table, but this time with χ² = 226, and again p < 0.0001. So there is clearly a statistically significant pattern in the data; the question is: where is it?

We can approach this question by a closer examination of the dataset itself, or by trying to make a picture of it. I shall do each in turn.

The χ² statistic is made up from individual ‘contributions’ from each cell (i.e. each combination of hoard and coin type) in the dataset. We can examine these contributions to see which particular combinations of hoard and coin type contribute most to the statistical significance of the overall value of χ² that we have already observed. There is unfortunately no statistical test which would enable us to say which of these contributions are statistically significant, but one can often form a clear impression of which cells contribute strongly. It is these cells which give a dataset its distinctive pattern. The values of the contributions to χ² for the reduced dataset (i.e. as immediately above) are given in Table 1.

Table 1 shows that some coin types contribute strongly to χ² and to the pattern, and others scarcely at all. Type C+D+E makes several strong contributions, indicating that there are more coins than ‘expected’ of this group of types in hoards Honingham, Lakenheath, Weston and Wimblington, and fewer in hoards Field Baulk, Scole, and possibly Eriswell. There are individual large contributions arising from there being more coins than ‘expected’ (i) of type L+M+N (more specifically, type M) at Joist Fen, and (ii) of type F at Eriswell. Type G shows a remarkably uniform pattern, as does the Fring hoard.

Even with this approach, we have the task of scanning tables to look for interesting features. A visual approach might give us a more accessible route into this dataset. The
Table 1
Contributions to the overall chi-squared statistic from each cell of the ‘reduced’ dataset. Very large contributions are shown in bold. The ... indicates a number smaller than 0.01 but greater than zero.

                                   Coin type
Hoard              C+D+E       F       G    HIJK   L+M+N       O
Honingham           6.21    1.00     ...    1.03    0.47    1.07
Lakenheath          6.19    3.68    0.02    0.05    0.08    0.01
Joist Fen           0.01    0.01    0.05    0.91   53.33    1.05
Weston             11.47    1.75    0.02    4.62    0.50    1.52
Santon Downham      2.69    1.19    0.61    0.17    0.02    0.06
Wimblington        39.53    0.84    1.01    2.09    0.41    1.96
Fring               0.28    1.71    0.09    0.81    2.48    1.06
Eriswell            4.24   29.04    0.16    2.42    0.50    4.71
Field Baulk        14.39    1.18    0.69    2.61    0.04    1.82
Scole               7.32    0.70    0.79    1.50    1.07    0.62
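The cell contributions in Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only (Python is chosen for convenience and the variable names are assumptions, not part of the original analysis). It copies the tabulated values, entering the ‘...’ cell as 0, and checks that they sum to roughly the χ² = 226 quoted in the text, that a 10-by-6 table has (10 − 1)(6 − 1) = 45 degrees of freedom, and which cells dominate. Each contribution is (O − E)²/E, where O and E are the observed and expected counts for a cell; the raw counts are not reproduced in this note, so only the published contributions are used.

```python
# Contributions to chi-squared, copied from Table 1 (rows = hoards, columns = coin types).
# The "..." entry (below 0.01) is entered as 0.0, which is an approximation.
coin_types = ["C+D+E", "F", "G", "HIJK", "L+M+N", "O"]
contributions = {
    "Honingham":      [6.21, 1.00, 0.0, 1.03, 0.47, 1.07],
    "Lakenheath":     [6.19, 3.68, 0.02, 0.05, 0.08, 0.01],
    "Joist Fen":      [0.01, 0.01, 0.05, 0.91, 53.33, 1.05],
    "Weston":         [11.47, 1.75, 0.02, 4.62, 0.50, 1.52],
    "Santon Downham": [2.69, 1.19, 0.61, 0.17, 0.02, 0.06],
    "Wimblington":    [39.53, 0.84, 1.01, 2.09, 0.41, 1.96],
    "Fring":          [0.28, 1.71, 0.09, 0.81, 2.48, 1.06],
    "Eriswell":       [4.24, 29.04, 0.16, 2.42, 0.50, 4.71],
    "Field Baulk":    [14.39, 1.18, 0.69, 2.61, 0.04, 1.82],
    "Scole":          [7.32, 0.70, 0.79, 1.50, 1.07, 0.62],
}

# The overall chi-squared statistic is the sum of the cell contributions.
chi_squared = sum(sum(row) for row in contributions.values())

# Degrees of freedom for an r-by-c contingency table: (r - 1)(c - 1).
n_hoards, n_types = len(contributions), len(coin_types)
dof = (n_hoards - 1) * (n_types - 1)

# Rank the cells to see which hoard/type combinations carry the pattern.
cells = [
    (value, hoard, coin_types[j])
    for hoard, row in contributions.items()
    for j, value in enumerate(row)
]
top_three = sorted(cells, reverse=True)[:3]

print(f"chi-squared ~ {chi_squared:.1f} on {dof} d.f.")
for value, hoard, coin_type in top_three:
    print(f"{hoard:15s} {coin_type:6s} {value:6.2f}")
```

The three largest cells found this way are L+M+N at Joist Fen, C+D+E at Wimblington and F at Eriswell.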


Figure 1
Plot of the first two axes of a correspondence analysis of the ‘reduced’ dataset.

appropriate statistical technique for this sort of dataset is correspondence analysis (Greenacre 1984; 1992), which represents the rows and columns of a contingency table as points on a scatterplot, in which (roughly speaking) the point representing a particular row will lie ‘in the direction of’ the points representing the columns in which it is prominent, and vice versa (Greenacre 1984, 6). In our example, points representing coin types will lie ‘in the direction of’ the hoards in which they occur more frequently than ‘expected’.

The first run of this technique was with the ‘reduced’ dataset as described above, and produced the plot shown as Figure 1. Here we can see a central core of types and hoards,


Figure 2
Plot of the first two axes of a correspondence analysis of the ‘reduced’ dataset, after the omission of coin type M.
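For readers unfamiliar with correspondence analysis, the computation behind plots of this kind can be sketched as a singular value decomposition of the standardised residuals of the contingency table. The Python below (using NumPy) is a minimal illustration under stated assumptions, not the procedure actually used to prepare the figures: the small table of counts is hypothetical, since the hoard counts are not reproduced in this note, and the function name `correspondence_analysis` is an assumption.

```python
import numpy as np

def correspondence_analysis(counts):
    """Return row coordinates, column coordinates and principal inertias."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # correspondence matrix (proportions)
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardised residuals: (observed - expected) / sqrt(expected), in proportions.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sing) / np.sqrt(r)[:, None]      # principal coordinates of rows
    cols = (Vt.T * sing) / np.sqrt(c)[:, None]   # principal coordinates of columns
    return rows, cols, sing ** 2

# Hypothetical counts: 4 hoards (rows) by 3 coin types (columns).
table = [[20, 10, 5],
         [18, 12, 6],
         [5, 10, 30],
         [22, 9, 4]]
rows, cols, inertia = correspondence_analysis(table)

# The total inertia equals chi-squared divided by the grand total.
N = np.asarray(table, dtype=float)
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
chi_squared = ((N - expected) ** 2 / expected).sum()
print(np.isclose(inertia.sum(), chi_squared / N.sum()))

print(rows[:, :2])   # hoard positions on the first two axes
print(cols[:, :2])   # coin-type positions on the first two axes
```

In this hypothetical table the third hoard is rich in the third coin type, so the two points fall on the same side of the first axis, which is the sense in which a row lies ‘in the direction of’ the columns in which it is prominent.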

with two sets of outliers defining a pattern: (i) on the horizontal axis, type C+D+E and the hoards Wimblington and to a lesser extent Weston, indicating an association between this type and these hoards, (ii) on the vertical axis, type L+M+N and the Joist Fen hoard, indicating an association between this type and this hoard. In other words, there are more coins of type C+D+E at Wimblington and Weston than ‘expected’, and more of type L+M+N at Joist Fen. This agrees with the high contributions shown in Table 1.

It is clear that the pattern is to some extent dominated by the exclusive relationship between type M and Joist Fen (i.e. type M is only found there in this dataset). There is


no more to be said statistically about this type, so we delete it and re-run the analysis, creating Figure 2.

Figure 2 shows a central core of types and hoards, with C+D+E and Wimblington and Weston to its right, and type F and Eriswell to the top of the plot, reflecting the association already noted from Table 1. The odd position of Weston reflects partly the high proportion of C+D+E and partly the low proportion of HIJK in this hoard. Some detailed patterning can be observed within the central core: for example, Santon Downham and Honingham are to the right, towards type C+D+E, of which they have an above-average proportion.

Overall, the pattern is clear both from Table 1 and from Figures 1 and 2 together: there are large and statistically significant associations between (i) type C+D+E and the Wimblington and Weston hoards, (ii) type M and the Joist Fen hoard, and (iii) type F and the Eriswell hoard. If a χ² test is carried out on the ‘core’ types and hoards (as shown in Figure 2), it gives a value of χ² = 13.8 on 18 d.f., which is not at all significant statistically.

Van Arsdell’s claim that there are no statistically significant differences between the hoards must therefore be rejected as the product of a faulty methodology. Four hoards are statistically significantly different from the others, in each case because they possess a higher-than-expected proportion of a particular coin type. However, the remaining hoards do look ‘remarkably similar’, especially as it is quite unusual to find coin hoards of an appreciable size that are not statistically significantly different from each other (Lockyear, pers. comm.).

The fact that van Arsdell’s criticism of Creighton’s seriation is ill-founded does not necessarily mean that the data support Creighton’s view. A seriation of types in assemblages (e.g. hoards) usually shows itself as a distinctive parabolic curve (the ‘horse-shoe’) in the correspondence analysis plot (Madsen 1988, 24). No such curve is apparent in either Figure 1 or Figure 2. From a statistical point of view, we can just demonstrate as clearly as possible the nature of the patterning in this dataset. The reasons for this pattern remain an open archaeological question.

A final methodological point: the use of analytical statistical techniques, such as significance tests, on datasets that are composed of percentages or permillia should be viewed with the deepest suspicion. They are almost always invalid.

Acknowledgements

I am grateful to Kris Lockyear for his comments on a draft of this note, and for preparing Figures 1 and 2.

Institute of Archaeology
University College London
31–34 Gordon Square
London WC1H 0PY
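The objection to significance tests on permillia can be made concrete with a small numerical sketch. The Python below is illustrative only: the two-hoard table and the proportion p = 0.1 are hypothetical, the helper names are assumptions, and the standard deviation formula assumes a simple binomial model. It shows that multiplying every count by ten leaves the permillia figures unchanged while multiplying χ² by ten, and that under a common underlying proportion the standard deviation of a permillia figure, 1000·sqrt(p(1 − p)/n), shrinks as the hoard size n grows. The sizes 5 and 8 are those quoted for the Brettenham and March hoards; 500 is an arbitrary larger size.

```python
import math

def permillia(row):
    """Convert a row of counts to coins per thousand."""
    total = sum(row)
    return [1000 * x / total for x in row]

def chi_squared(table):
    """Pearson chi-squared statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def permillia_sd(p, n):
    """Binomial standard deviation of a permillia figure for a hoard of n coins."""
    return 1000 * math.sqrt(p * (1 - p) / n)

# Hypothetical counts: 2 hoards (rows) by 3 coin types (columns).
small = [[30, 10, 10], [20, 20, 10]]
large = [[10 * x for x in row] for row in small]   # same hoards, ten times the size

print(permillia(small[0]), permillia(large[0]))    # the permillia are identical
print(chi_squared(small), chi_squared(large))      # chi-squared grows tenfold

# Same underlying proportion, very different spreads for different hoard sizes.
for n in (5, 8, 500):
    print(n, round(permillia_sd(0.1, n), 1))
```

A test built on the permillia alone sees the small and the large table as identical, whereas χ² on the counts registers the extra weight of evidence in the larger one.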

REFERENCES

Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. 1975: Discrete Multivariate Analysis: Theory and Practice (Cambridge, Massachusetts, The MIT Press).

Cochran, W.G. 1954: Some methods for strengthening the common chi-squared tests. Biometrics 10, 417–451.

Creighton, J.D. 1994: A time of change: the Iron Age to Roman monetary transition in East Anglia. OJA 13, 325–333.

Greenacre, M.J. 1984: Theory and Applications of Correspondence Analysis (London, Academic Press).

Greenacre, M.J. 1992: Correspondence Analysis in Practice (London, Academic Press).

Madsen, T. 1988: Multivariate statistics and archaeology. In Madsen, T. (ed.), Multivariate Archaeology (Jutland Archaeological Society Publications XXI), 7–27.

Reece, R. 1981: The ‘Normal’ Hoard. PACT 5. Statistics and Numismatics.

Reece, R. 1988: My Roman Britain. Cotswold Studies 3.

Shennan, S. 1988: Quantifying Archaeology (Edinburgh, Edinburgh University Press).

Thomas, D.H. 1978: The awful truth about statistics in archaeology. American Antiquity 43, 231–244.

van Arsdell, R. 1996: A statistical analysis of Icenian coin hoards. OJA 15, 235–242.
