European Journal of Soil Science, June 2001, 52, 331–340

Statistics to support soil research and their presentation
Rothamsted Experimental Station, Harpenden, Hertfordshire AL5 2JQ, UK

Much soil research needs statistics to support and confirm impressions and interpretations of investigations in field and laboratory. Many soil scientists have not been trained in statistical method and as a result apply quite elementary techniques out of context and without understanding. This article concentrates on the most common abuses and misunderstandings and points authors to proper use. It distinguishes variance and standard deviation for measuring dispersion from standard error to indicate confidence in estimates of means. It describes the strictly limited context in which to use the coefficient of variation. It stresses the importance of quoting means and differences between them in contrast to statistical significance, which is at best of secondary interest. It guides readers to inspect and explore their data before deciding to transform them for analysis and illustrates what can be achieved by taking logarithms of single variates and by principal component analysis of multivariate data.


Readers of both this Journal and other journals of soil science will have noticed an increasing polarization in the use of mathematics and statistics in soil science. On the one hand we have masters of modern advanced theory applying both it and the not-so-new with powerful computers and software. They have made substantial progress in geostatistics, fractal geometry, and mathematical morphology, to name a few fields of application, in recent years. Long may they continue to advance the technology and our understanding and ability to predict arising from it.
Received 16 October 2000; accepted 1 December 2000

On the other hand, there are many field pedologists and laboratory scientists still struggling with elementary statistics and experimental design. Their plight is not helped by the emphasis on inferential tests in many students' texts, such as that by Dytham (1999) which proudly proclaims that it contains no equations (though I did espy one skulking in the corner of a graph). Its author evidently thinks it enough to tell students how to test for significance and which buttons to press on their personal computers for the purpose. Matters are made worse by misleading introductory chapters in what appear as authoritative works on soil analysis and laboratory practice. For example, the new Practical Environmental Analysis by Radojevic & Bashkin (1999), published by the Royal Society of Chemistry, omits mention of standard error, perpetuates the muddle over regression, and invokes the Central Limit Theorem out of context. With books of this kind it is small wonder that I spend much of my editorial time explaining to authors how to present and summarize their data properly and correcting their statistical mistakes. Evidently authors need guidance on elementary statistics and summarizing data, and my purpose here is to provide that.

© 2001 Blackwell Science Ltd

Dispersion, standard deviation, standard error, and confidence

‘Variety is the spice of life’, and soil science would be pretty uninteresting if there were no variation. How would field pedologists entertain us if all soil were uniformly drab grey loam to 3 m? What would they have to argue about? They would be out of work, and so would most of the rest of us. But variation there is aplenty, and coping with it demands our intellect, ingenuity and resources to the full. The natural variation that has taxed systematists and soil surveyors is one kind. Farmers have created variation by enclosing, clearing, reclaiming, and fertilizing the land, though within fields they have removed some by cultivation and drainage. Data on soil contain additional sources of variation arising from our observations; these can be attributed to the people who make the observations, to the imprecision of instruments, to laboratory technique, and to sampling fluctuation.

What to present

Whatever the source of the variation we want to be able to express it quantitatively. The variation in a set of N measurements, $z_1, z_2, \ldots, z_N$, is usually expressed as variance:

$$S^2 = \frac{1}{N}\sum_{i=1}^{N}(z_i - \bar{z})^2, \tag{1}$$

where $\bar{z}$ is the mean of the data. It is the average squared difference between the observations and their mean. Its square root, S, is the standard deviation, which is often preferred because it is in the same units as the measurements and is more intelligible therefore.

In many instances we should like the variance obtained from a set of data to estimate the variance of some larger population, of which the N observations are a sample. This variance is by definition

$$\sigma^2 = \mathrm{E}[(z - \mu)^2], \tag{2}$$

where $\mu$ is the mean of z in the population and E denotes expectation. Equation (1) gives us a biased result, however; $S^2$ is a biased estimator. The reason is that $\bar{z}$ in the equation, the estimate of $\mu$ from the same data, is itself more or less in error. To remove the bias we must replace N by N − 1 in the denominator:

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(z_i - \bar{z})^2. \tag{3}$$

The result, $s^2$, is now unbiased, provided the sampling was unbiased in the first place.

You may also find it helpful to visualize the variation in your data. If you have enough observations then draw a histogram of them, as in Figure 1, and see how the standard deviation relates to the spread of values.

Having measured the variation we want to use our measurement properly in context, and here I encounter the first common misunderstanding. The point I stress here is the distinction between the standard deviation and the standard error. The former describes the variation in the sample; it should be included in any summary of a large body of data, as in Table 1. The latter is the error associated with a mean.

Whichever we choose, however, S or s measures the dispersion in our observations; it does not immediately tell us how reliable is our estimate of $\mu$. Intuitively we should know that the smaller is either in relation to $\bar{z}$ the more confident we can be in our estimate, but that is not the only factor involved. To assess the confidence we may have in an estimate of the mean we want the expected squared deviation of it from the true mean, i.e. $\mathrm{E}[(\bar{z} - \mu)^2]$. Its estimate is derived simply from $s^2$ by

$$s^2(\bar{z}) = s^2/N. \tag{4}$$

This is the estimation variance, and its square root is the standard error, which is the standard deviation of means of samples of size N. The larger is our sample the smaller is this error, other things being equal, and the more confident we can be in our estimate. Taking the results from the first column of Table 1, for example, we should report, ‘mean phosphorus concentration 4.86 mg l⁻¹, with standard error 0.25 mg l⁻¹’. Similarly, standard errors should accompany means in tables compiled from replicated measurements, and they may be shown by error bars on graphs.

Many authors use the ± sign to express error. Some authors use ± to mean the standard error, others use it for the standard deviation. This itself is a source of ambiguity. From now on I shall discourage its use and ask authors to spell out ‘standard error’ in the first instance and abbreviate it to ‘SE’ thereafter if they wish.

Authors often imply some kind of confidence interval when they put error bars on graphs. If the data are drawn from a normal (Gaussian) distribution and there are many of them then a bar of length $2s/\sqrt{N}$ centred on $\bar{z}$ spans a symmetric confidence interval of approximately 68%. Wider confidence intervals are readily calculated by multiplying by factors that can be found in standard texts: 1.64 for 90%, 1.96 for 95%, for example. Usually, however, authors have few data, and so the above factors need to be replaced by Student's t for the number of degrees of freedom, f.
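The distinction between the standard deviation and the standard error can be made concrete with a short sketch in Python. It computes the quantities of Equations (1), (3) and (4) and confidence limits of the form mean ± t × standard error. The data, the function name `summarize` and the dictionary keys are hypothetical, chosen only for illustration; the factor t is assumed to be looked up in tables of Student's t for N − 1 degrees of freedom (2.262 for 9 degrees of freedom at 95%).

```python
import math
from statistics import NormalDist


def summarize(z, t_factor):
    """Return dispersion and confidence statistics for a list of measurements.

    t_factor is Student's t for len(z) - 1 degrees of freedom, taken from tables.
    """
    n = len(z)
    mean = sum(z) / n
    ss = sum((zi - mean) ** 2 for zi in z)  # sum of squared deviations
    s2_biased = ss / n                      # Equation (1): divisor N
    s2 = ss / (n - 1)                       # Equation (3): divisor N - 1
    sd = math.sqrt(s2)                      # standard deviation: spread of the data
    se = math.sqrt(s2 / n)                  # Equation (4): standard error of the mean
    half_width = t_factor * se              # half-length of the confidence bar
    return {
        "mean": mean,
        "S2": s2_biased,
        "s2": s2,
        "sd": sd,
        "se": se,
        "lower": mean - half_width,
        "upper": mean + half_width,
    }


# Hypothetical measurements (mg per litre), for illustration only.
phosphorus = [2.1, 3.4, 1.8, 6.2, 4.0, 2.9, 9.5, 3.3, 5.1, 2.7]
result = summarize(phosphorus, t_factor=2.262)  # t for 9 degrees of freedom, 95%
print(f"mean {result['mean']:.3g}, standard deviation {result['sd']:.3g}, "
      f"standard error {result['se']:.3g}")
print(f"95% confidence limits: {result['lower']:.3g} to {result['upper']:.3g}")

# Large-sample factor for a 95% interval from the normal distribution (about 1.96).
z95 = NormalDist().inv_cdf(0.975)
```

In this sketch the standard error shrinks as N grows, whereas the standard deviation does not, which is one way to see why the two should not be reported interchangeably.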

Thus, the lower and upper symmetric confidence limits about a mean $\bar{z}$ are

$$\bar{z} - t_f\, s/\sqrt{N} \quad\text{and}\quad \bar{z} + t_f\, s/\sqrt{N}, \tag{5}$$

and bars should be drawn to these limits to express confidence. Standard deviations themselves estimated from small sets of data have wide confidence intervals, and they are of little interest therefore.

In many instances, however, data are transformed to approximate a normal distribution before analysis (see below). The computed standard errors then apply to the transformed scale and are not easily transformed back into the original units. If you want to show error bounds on the original scale then compute the confidence limits on the transformed scale and then back-transform them. There are formulae for calculating the confidence limits for other theoretical distributions.

Figure 1 Histograms of the full set of data on available phosphorus (a) in mg l⁻¹, and (b) transformed to common logarithms. The curve in (b) is that of the fitted normal distribution, and that in (a) shows the lognormal distribution.

Table 1 Summary statistics of available phosphorus at Broom's Barn of both the full set of data (433 points) and subset (44 points). Values of χ², with 18 degrees of freedom for the full set of data and 9 degrees for the subset, are added for the hypothesis that the data or their logarithms are from a normal distribution

                      Full set                 Subset
Statistic             P /mg l⁻¹   log10 P      P /mg l⁻¹   log10 P
Mean                  4.86        0.546        4.59        0.510
Variance              26.52       0.1142       27.60       0.1197
Standard deviation    5.15        0.338        5.25        0.346
Skewness              3.95        0.23         3.43        0.33
χ² (df 18)            368.2       26.7
χ² (df 9)                                      64.5        9.1

Which words?

Two words are used for variation in the scripts I receive. One is ‘variation’ itself, the other is ‘variability’, and many authors evidently regard them as synonyms and use them interchangeably. Yet synonyms they are not.

Let us start with the adjective ‘variable’, which, according to my (Oxford) dictionary, means liable to change or changeable. From there it is a small step to the mathematical meaning of a quantity that may take more than one value. A variable (noun) is a quantity that is subject to change or has the ability to change. It could be the organic carbon content of the soil or the soil's hydraulic conductivity, as examples. Statisticians distinguish a variable as defined above from a ‘variate’, which is a set of observed values of a variable. The 433 values of available phosphorus (P) in the soil summarized in Table 1 constitute a variate of the variable, available P. The data on seven trace metals in the Swiss Jura used to illustrate principal component analysis, below, are seven variates.

From these follow the noun ‘variability’, which means the potential to vary. What we observe and represent as variates is therefore ‘variation’. Geostatisticians formalize this in the distinction between the actuality on the ground and the random process that is assumed to generate that actuality. The random process, which is a model of the reality, embodies variability. When we draw a realization from that process what we obtain is variation. So, in most instances ‘variation’ is the word you want to describe either what you have observed and recorded as one or more variates or the population that you sampled or could sample. Keep the word ‘variability’ for those situations where it is the potential that is of concern.

Another word that is more often used incorrectly than correctly is ‘parameter’. ‘Parameter’ is not synonymous with ‘variable’. A parameter is a quantity that is constant in the case being considered, as in many models. In statistics it is generally reserved for quantities such as means and variances of populations. Further, we do not measure parameters, rather we estimate them from measurements of the variables that interest us. In computing a parameter has a similar meaning of a quantity that is held constant for a particular run of a program or model.

Modelling physical and biological processes introduces a duality, however. The models contain parameters, and their creators and users have in many instances to estimate those parameters from data at specific places and times.

The result is that the estimates may vary in either space or time or both, and so these quantities can become variables in their own rights with their own distributions.

Coefficient of variation

The coefficient of variation (CV) is the standard deviation divided by the mean, $s/\bar{z}$. It is often multiplied by 100 and quoted as a percentage. Its merit is that it expresses variation as pure numbers independent of the scales of measurement. It enables investigators and those reading their reports to appreciate quickly the degree of variation present and to compare one region with another and one experiment with another. Unfortunately, this facility is too readily used to compare variation in different variables, even ones having different dimensions. This is usually an abuse.

The coefficient should be restricted to variables measured on scales with an absolute zero. Otherwise the arbitrary choice of the zero affects it. A few examples will illustrate the matter.

Suppose we want to express variation in temperature. Usually we measure temperature in degrees Celsius, and the CV directly obtained from the measurements will be determined by the mean above 0°C. If we were to record the temperatures in Kelvin then the standard deviation would remain the same, but the CV would be smaller, and a lot smaller if we were concerned with ambient soil temperatures. If we measured them in degrees Fahrenheit (Heaven forbid, but it still happens) then the standard deviation would be larger by 9/5, but the zero would be different again, as would be the CV, so that results expressed as CV would not be comparable.

Other variables with arbitrary zeros are colour hue, which is approximately circular and for which we could choose any particular hue we like as our zero, and pH. The zero of the pH scale is set at −log10 of the hydrogen ion concentration in mol l⁻¹. Although it is convention, it is arbitrary. We could choose some other concentration, and were we to do so its logarithm would be different. If, for example, we were to define pH as −log10[H⁺] in mol ml⁻¹ we should increase the current conventional values by 3. The standard deviation would remain unchanged, but the CV would be diminished.

The sensible use of the coefficient of variation for comparing two variables, say y and z, relies on the assumption that they are the same apart from some multiplying factor, b, so that

$$y = bz. \tag{6}$$

Then the mean of y is $\bar{y} = b\bar{z}$, its variance $s_y^2 = b^2 s_z^2$, and its standard deviation is $s_y = b s_z$. From there simple arithmetic shows that their CVs are the same. Lewontin (1966) explains the matter at greater length.

This principle offers a better way of comparing variation by taking logarithms of the observations. Equation (6) becomes

$$\log y = \log b + \log z. \tag{7}$$

Since log b is a constant the variances, $s^2_{\log y}$ and $s^2_{\log z}$, are equal, as are their standard deviations. Thus we have a measure of variation that is independent of the original scale of measurement. We can use it to compare variation in two groups of observations. To return to our example of pH, let us suppose that we wish to compare the variation in acidity of a group of Luvisols with one of Podzols. We treat the hydrogen ion concentration as the original variable, transform it to pH, and compute the variances of pH. Whichever has the larger variance is the more variable. Further, we can make a formal significance test (see below) by computing $F = s^2_{\log y}/s^2_{\log z}$ and compare the result with F for $N_y - 1$ and $N_z - 1$ degrees of freedom.

For some soil properties physics sets limits on the utility of the CV. Consider, for example, the soil's bulk density. For mineral soils on dry land a working minimum bulk density is around 1 g cm⁻³. Its minimum is determined by the physical structures that keep particles apart. Particles must touch one another, otherwise the soil collapses. At the other end of the scale the bulk density cannot exceed the average density of the mineral particles, approximately 2.7 g cm⁻³. So the CV of bulk density is fairly tightly constrained, regardless of the mean, and it makes no sense to compare it with that of, say, the available phosphorus content of the soil.

Finally in this section I comment on the practice of averaging coefficients of variation from two or more sets of data. Variances are additive, but their square roots, the corresponding standard deviations, are not. So if you want an average measure of variation from several sets of data then you should compute the arithmetic mean of their variances, weighted as appropriate by the numbers of degrees of freedom, and then take the square root of it to give an ‘average’ standard deviation. You can then divide it by the mean to obtain an ‘average’ CV. More generally, the additive nature of variances confers great flexibility in analysis, enabling investigators to distinguish variation from two or more sources and estimate their components according to the design as by the analysis of variance. You should convert any outcome to errors only at the end of the analysis.

Significance tests

One of the commonest misunderstandings of statistics arises with significance testing. In the minds of many soil scientists no statistical analysis is complete without an inferential test at the end, and all too often the only results reported are statements such as ‘Treatment A yielded significantly more than did treatment B (P = 0.05)’, ‘The topsoil is significantly more acid than the subsoil (P < 0.01)’, and ‘The regression of CEC on organic matter was highly significant (P < 0.001)’.

These statements may be supported by tables in which the values of P, signifying probability, are replaced by sets of one, two and three stars. They are of virtually no interest to readers unless the author tells them how much more the yield was or how much more acid is the soil. When we come to regression we want to know at least two things about it: how steep it is (given by the regression coefficient) and how close the observations lie to the regression line (given by the product–moment correlation coefficient or the variance of the residuals). These are potentially interesting, and statistical significance is of secondary importance.

There is more to the matter than this, however. First, let us recognize what significance means in a statistical context. Before making an inferential test we put forward a hypothesis, which our tests are designed to reject (not confirm). This is usually that there is no real difference between populations or treatments and that any differences among the means of our observations are due to sampling fluctuation. That is the ‘null hypothesis’. To judge whether two means differ we compute from the sampling error the probability of obtaining the observed difference if the true means were identical, assuming that we know the form of the distribution, and in particular the variances' being constant. So the sensitivity of the test depends on the precision with which we have estimated the means, i.e. on their standard errors or on the standard error of their difference. Other things' being equal, larger samples result in smaller standard errors and hence more sensitive comparisons. If you take large enough samples you can establish that any soil is different from almost any other for whatever property of them that you care to choose.

You should also realize that, when a difference is deemed statistically significant because you reject the null hypothesis that does not mean that it is important or physically or biologically meaningful. For example, if you found the mean pH of the topsoil to be 6.7 and that in subsoil to be 6.9 you would be unlikely to regard the difference of much consequence, whatever the probability of rejecting the null hypothesis.

The significance test is valuable in preventing false claims on inadequate evidence. It is to some extent a matter of personal choice, however. While you may regard a difference as significant only if P ≤ 0.05 or, more stringently, only if P ≤ 0.01, a reader may be willing to recognize one for which 0.05 < P ≤ 0.1. This point needs to be made in your reporting, and if you give the standard errors then readers can reach their own judgements. Thus, you might summarize a result as, ‘The mean measured pH of the topsoil was 5.5 compared with 6.9 in the subsoil, but because the samples were small the difference was not statistically significant’.

Finally in this section, bear in mind that the null hypothesis is highly implausible when comparing different horizons and different types of soil; they are different, full stop!

Significant figures and choice of units

‘Significance’ takes on another meaning in reporting quantitative results. I receive many papers in which measured values and means and errors are quoted with five and six figures, and some with even more. I get the impression that the authors simply and thoughtlessly copy the numbers from their computer output. In some instances they respond to my chiding with remarks such as ‘Yes, but the values are in the tens of thousands’.

First, let us be clear that few measurements of soil properties are accurate to more than three figures. Determining the concentration of an element in the soil typically incurs a laboratory error of 2–5% of the true value, so that only the first two figures are likely to be meaningful and in that sense significant. Sampling fluctuation is likely to swell the error substantially. Mean values from replicated sampling are more precise, however, and you are generally justified in reporting them with three figures, but no more. Likewise, the quoted values of standard deviations and errors should be limited to three figures. Table 1 shows what is required. You may feel justified in quoting variances with four or even five figures because you present them as intermediate results which readers may wish to process further by adding them, subtracting one from another, and by taking their square roots.

Your values may run into tens of thousands or more in the SI units. As an example take matric suction, which at wilting point is around 1 500 000 Pa. Rather than report a result as, say, 1 540 000 Pa, change the units to kPa and write 1540 kPa, or even 1.54 MPa. If the values are so large or so small that they lie well outside what we can express using familiar prefixes then use powers of 10 to scale them. In the above example the suction would be written ‘1.54 × 10⁶ Pa’. But please do not write ‘1.54 E06’ or ‘1540 * 10³’; they are computer code.

So, to summarize this section, present your numerical results to three figures, unless you are sure that more are significant, and choose familiar units that enable you to do so conveniently. If you need further guidance then see Monteith's (1984) eminently sensible article on the matter.

Transformations

Many authors are vaguely aware of transformations. Their textbooks tell them of the dire consequences of analysing nonnormal data, and that unless they do something about such data damnation will surely follow. Would the consequences really be so dire? Should they transform?

The reason put forward in most textbooks is for statistical inference. The usual parametric tests are based on the normal distribution, and so if the data come from some other distribution then the tests might mislead. This may be true in some instances, but the usual t tests and F ratios computed by the analysis of variance are robust for comparing means, i.e. they are rather insensitive to departures from normality. Further, statistical significance is of secondary importance anyway, as above, so we must probe deeper.

The most serious departure from normality usually encountered with soil data is skewness, i.e. asymmetry, as in Figure 1. The normal distribution is symmetric, its mean is at its centre (its mode), and so the mean of the data estimates this central value without ambiguity. The mean of data from a skewed distribution does not estimate the mode, nor does the median (the central value in the data). We are therefore left uncertain as to the meaning of the statistics.

A second feature of skewed data is that the variances of subsets depend on their means. If the data are positively skewed (again the usual situation) then the variances increase with increasing mean. This is undesirable when making comparisons, a matter that I illustrate below.

Third, estimation is ‘inefficient’ where data are skewed, by which I mean that the errors are greater than they need be or, put another way, more data are required to achieve a given precision than would be if the distribution were normal.

If we can transform data to approximate normality we overcome these disadvantages. We achieve symmetry, and hence remove ambiguity concerning the centre. We stabilize the variances. And we make estimation efficient. These are the principal reasons for it, and of these the second is perhaps the most important.

So, when should we transform? How large a departure from normality should we tolerate? There is no firm answer. Of course, no real data are exactly normal; all deviate more or less from normality. Certainly the solution is not to apply a significance test. What is needed is a little exploration of the data plus judgement.

Start by drawing a histogram. Does it look symmetrical? If so superimpose on it a normal curve computed from the mean and variance of the data. The normal curve has the formula

$$y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(z-\mu)^2}{2\sigma^2}\right\}, \tag{8}$$

where y is the probability density. Scale it to fit the histogram by multiplying by the number of observations and by the width of the bins. Now, does the curve look as though it fits well? If it does then go no further along this road; you need not transform.

If the histogram is skewed then compute the skewness coefficient in addition to the mean and variance. This dimensionless quantity can be obtained via the third moment of the data about their mean:

$$\gamma_1 = \frac{1}{N S^3}\sum_{i=1}^{N}(z_i - \bar{z})^3. \tag{9}$$

A symmetric histogram has $\gamma_1 = 0$. Values of $\gamma_1$ greater than zero signify positive skewness, i.e. long upper tails to the distribution and a mean that exceeds the median, which is common. Negative values of $\gamma_1$ signify negative skewness, and are unusual. So, if $\gamma_1$ is positive and less than 0.5 you should not need to transform the data. If $0.5 < \gamma_1 \le 1$ then you might try taking square roots, and if $\gamma_1 > 1$ then a transformation to logarithms is likely to give approximate normality, so try it.

It will be helpful, and salutary, to illustrate this with an example. The data are 433 measurements of available phosphorus, P, in the topsoil (0–20 cm) of Broom's Barn Farm in 1960 (Webster & McBratney, 1987). Their histogram (Figure 1a) shows them to be strongly positively skewed, and this is corroborated by the skewness coefficient of 3.95, in Table 1. Transforming to logarithms makes the histogram (Figure 1b) more nearly symmetric, and as the skewness in the logarithms is now only 0.34 we can be satisfied. To emphasize the point I have drawn the normal curve on Figure 1(b), and it fits reasonably well. The curve on Figure 1(a) is that of the lognormal distribution.

Were we to judge the goodness of fit by a significance test we might compute a χ² from the differences between the observed frequencies and the theoretical ones and compare the result with χ² for probability 0.05. We should reject the hypothesis of normality on the original scale. But we might be tempted to reject it for the logarithms too; the probability associated with χ² = 26.7 with 18 degrees of freedom (Table 1) is only 0.088. It is again the matter of statistical significance. It illustrates how sensitive significance tests can be when you have many data and why you should not use such tests for judging whether to transform.

Suppose, as many authors do, we had only 44 data; how should we judge then? Figure 2 shows the histograms (a) on the original scale, and (b) for the logarithms. They correspond well with the full set of data, as do the summary statistics. In particular, the skewness coefficient is almost the same. If now we compute χ² and apply our significance test we still conclude that the distribution on the original scale is nonnormal, but we should accept without any hesitation that the logarithms are near enough normally distributed; with χ² = 9.1 with 9 degrees of freedom the probability is 0.42.

We can see how the transformation stabilizes the variances. I have drawn subsets of 44 from the full set of data, computed their means and variances, and plotted their variances against their means. Figure 3(a) shows that the variance increases strongly with increasing mean. Taking logarithms produces a result in which there is virtually no relation (Figure 3b).
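The exploratory steps recommended in this section can be sketched in Python as follows. The function `skewness` follows Equation (9), with S computed with divisor N as in Equation (1), and `suggest_transform` encodes the rule of thumb for choosing between no transformation, square roots and logarithms. Both function names and the data are hypothetical. The handling of negative skewness and of values falling exactly on a threshold is an assumption of this sketch, because the text does not prescribe it.

```python
import math


def skewness(z):
    """Skewness coefficient from the third moment about the mean, Equation (9)."""
    n = len(z)
    mean = sum(z) / n
    big_s = math.sqrt(sum((v - mean) ** 2 for v in z) / n)  # S, divisor N
    return sum((v - mean) ** 3 for v in z) / (n * big_s ** 3)


def suggest_transform(g1):
    """Rule of thumb: logarithms above 1, square roots between 0.5 and 1.

    Assumption of this sketch: negative skewness and values of 0.5 or less
    return "none".
    """
    if g1 > 1.0:
        return "log"
    if g1 > 0.5:
        return "sqrt"
    return "none"


# Hypothetical positively skewed measurements, for illustration only.
values = [1, 1, 2, 2, 3, 4, 6, 10, 20, 50]
logs = [math.log10(v) for v in values]

g1_raw = skewness(values)
g1_log = skewness(logs)
print(f"skewness on the original scale: {g1_raw:.3g} -> {suggest_transform(g1_raw)}")
print(f"skewness of the logarithms:     {g1_log:.3g}")
```

The coefficient is only a guide; the histogram and the fitted curve remain the first things to inspect before deciding to transform.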

Figure 2 Histograms of a subset of 44 data on available phosphorus (a) in mg l⁻¹, and (b) transformed to common logarithms. The curves in (b) and (a) are of the normal and lognormal distributions, respectively.

Figure 3 Graphs of variance against mean for 10 subsets of 44 phosphorus data (left) on the original scale in mg l⁻¹ and (right) after transformation to common logarithms.

Finally, you should realize that these simple functions change only the general form of the distribution; they do not change the detail. If you want to normalize the detail then you will need the more elaborate normal score transform. Goovaerts (1997) describes how to do it.

Principal components

Another kind of transformation is to principal components, and it is one that has recently returned to fashion and brought with it its own brand of misunderstanding.

When biologists and pedologists first gained access to general-purpose computers in the 1960s principal component analysis (PCA) was one of their delights. For the first time they had a tool to enable them to analyse large sets of correlated multivariate data and identify ‘structure’ in them. The structure of interest might be the relations of the units (individuals, sampling points) to one another and perhaps clustering, or it might be the general relations among the variates and how they cluster. Other scientists, especially psychologists, sought to interpret the principal components, and they elaborated the technique for the purpose.

Once the pedologists had mastered PCA they put it into their repertory, and they got it out and dusted it down when desired. It did not make the news, even though it has been instrumental in several penetrating studies. But its popularity is increasing again as soil biologists seek to understand the microbial ecology of soil. Samples of soil are subject to a battery of tests, such as those of Biolog, or their phospholipids are fractionated and their concentrations measured (see the recent paper by Fritze et al. (2000) in this Journal). The result is a set of p values (variates) for each of N samples of soil (units). These are assembled into a data matrix with N rows and p columns.

One may imagine this matrix as a distribution of units as N points in a Euclidean space of p dimensions, as if it were a scatter graph with p orthogonal axes. However, even the most elastic imagination has difficulty in stretching to a space of 20 or 30 dimensions, and when there is correlation among the variates we may rightly seek to reduce the dimensionality to one that we can envisage and hope that the structure we seek is revealed in just a few dimensions. Principal component analysis helps us in that search. Pursuing the geometric representation, the analysis finds new axes in the multidimensional space such that the first lies in the direction of maximum variance, the second, orthogonal to the first, lies in the direction of maximum variance in the residuals from the first, the third takes up the maximum variance in the residuals from the second, and so on.

Analysing matrices S and C will give sensibly the same result apart from a scaling factor.25 then the P will swamp pH.90 78. They are best plotted as scatter diagrams. 2001 Blackwell Science Ltd.162 0.31 1. gives the third set of values. In general the results will differ from those obtained by analysing matrix R. and the documentation does not always state which. !j is the jth eigenvalue. The eigenvalues.e. it is easy to do the analyses now in many statistical packages. the eigenvectors (or latent vectors). or even to summarize it. (2000).17 9. namely the eigenvalues (or latent roots). The magnitude of the elements can help to give meaning to the components. ¼. Which matrix? The analysis may be done on any one of three matrices. and herein lie the ®rst pitfalls for the unwary. denoted !1. here is a second potential pitfall.07 87.123 1. we shall not notice the effect of the latter. C.00 statistical package is doing then you should standardize the variates to unit variance before doing the PCA. one against another. the component scores. Multiplying the centred data by the eigenvectors. for the following reason. i. ap. The ®rst two are S = XTX. The mathematics of PCA can be found in any good book on multivariate statistics. You should always report the ®rst few in a table. Figure 4. If you are in any doubt of what your # where aij denotes the ith element of the jth eigenvector. (10) where matrix X contains the data from which the means have been subtracted. or the correlation matrix. the elements are effectively weights. the matrix of sums of squares and products. by bij ˆ aij q !j a'2 Y i …13† Order 1 2 3 4 5 6 7 Eigenvalue 4. What to report The output from a PCA should contain three sets of values. are the variances along the new axes.344 0. as in Table 2 on the left and Table 6 of Fritze et al. contain the cosines of the angles between the original axes and the new ones.120 Percentage 58. 
The outcomes in the ®rst two instances will be dominated by those original variates with the largest variances. Thus. to which it is orthogonal. Mardia et al. columns in matrix A.g. for the ®rst few components. (14) The columns of Y are the new variates. p. Working with matrix R gives them equal weight. 331±340 . e. The eigenvectors.72 4. i. a2.91 3.70 95. and component scores. Webster Table 2 Eigenvalues from a principal component analysis of the standardized values of seven trace metals in the Swiss Jura (from Webster et al.79 92. if our data comprise available P with a variance of 26. Many packages produce tables of output called `loadings'.5 (mg l±1)2 and pH with a variance of 0.342 0. the variance± covariance matrix. and clearly this is much more sensible. However. of the relations of trace metals in the soil of the Swiss Jura.e. These are also of interest.28 100. 1994) Accumulated percentage 58. Y: Y = XA.2. European Journal of Soil Science. These scatter graphs show the relations between the units in the reduced dimensions. and they are nonplussed when we editors insist on their telling us precisely what they are reporting. and their proximity to one another shows how related the original variables are to one another. !2. to the product±moment correlation coef®cients between the principal components and the original data.90 19. ¼. and 'i2 is the variance of the ith original variate. is an example. and the larger they are in absolute value the stronger is their in¯uence. and it is not my purpose here to repeat that. These loadings may be either eigenvectors or correlation coef®cients.72 from the ®rst and second.338 R. !p.. R. N ± 1.27 2. a1. You will often ®nd that interpretation is aided by converting the eigenvectors to correlations.229 0. is formed from the variance± covariance matrix by p rij ˆ cij a cii cjj …12† for all i and j = 1. ¼. Authors copy the results into their papers without knowing.97 98. and the superscript T signi®es the transpose. 
S. (1979) and Webster & Oliver (1990). and Cˆ 1 XT XX NÀ1 …11† The correlation matrix. and they are usually tabulated in order from largest to smallest with the proportions of the total variance for which they account. and it continues until there is no variance left.681 0. A scatter diagram of these in a circle of unit radius shows how strong they are (the nearer they plot to the circumference the stronger they are in those dimensions). effectively it standardizes the variances to 1. and you may tabulate them or graph their elements as points in as many dimensions as you think ®t. centred data. Further. 52.
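The contrast between analysing the variance–covariance matrix and the correlation matrix, and the conversion of eigenvectors to correlations in Equation (13), can be illustrated for just two variates, for which the eigen-analysis has a closed form. The sketch below is not the author's computation: the variances 26.5 and 0.25 are those quoted for P and pH, but the correlation of 0.4 between them is a hypothetical value chosen for illustration, the function names are invented, and the eigenvalues labelled Table 2 are as read from that table.

```python
import math

def eigen_2x2(a, b, d):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, d]].

    Assumes b != 0, i.e. the two variates are correlated.
    """
    centre = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    eigenvalues = [centre + radius, centre - radius]
    eigenvectors = []
    for lam in eigenvalues:
        x, y = b, lam - a          # satisfies (a - lam) * x + b * y = 0
        if x < 0:                  # the sign of an eigenvector is arbitrary; fix it
            x, y = -x, -y
        norm = math.hypot(x, y)
        eigenvectors.append((x / norm, y / norm))
    return eigenvalues, eigenvectors

def to_correlations(eigenvalues, eigenvectors, variances):
    """Equation (13): b_ij = a_ij * sqrt(lambda_j / sigma_i^2); rows are variates."""
    return [[eigenvectors[j][i] * math.sqrt(eigenvalues[j] / variances[i])
             for j in range(len(eigenvalues))]
            for i in range(len(variances))]

def variance_shares(eigenvalues):
    """Percentage and accumulated percentage of the total variance per component."""
    total = sum(eigenvalues)
    shares, running = [], 0.0
    for lam in eigenvalues:
        pct = 100.0 * lam / total
        running += pct
        shares.append((pct, running))
    return shares

VAR_P, VAR_PH = 26.5, 0.25     # variances quoted in the text
R_ASSUMED = 0.4                # hypothetical correlation, for illustration only
cov = R_ASSUMED * math.sqrt(VAR_P * VAR_PH)

evals_c, evecs_c = eigen_2x2(VAR_P, cov, VAR_PH)    # analysis of matrix C
evals_r, evecs_r = eigen_2x2(1.0, R_ASSUMED, 1.0)   # analysis of matrix R
corr_c = to_correlations(evals_c, evecs_c, [VAR_P, VAR_PH])

TABLE2_EIGENVALUES = [4.123, 1.342, 0.681, 0.344, 0.229, 0.162, 0.120]
shares = variance_shares(TABLE2_EIGENVALUES)

print("C, first eigenvector (P, pH):", [round(v, 3) for v in evecs_c[0]])
print("R, first eigenvector (P, pH):", [round(v, 3) for v in evecs_r[0]])
print("Table 2, first two components: %.1f%% of variance" % shares[1][1])
```

With matrix C the first eigenvector is almost entirely P, whereas with matrix R the two variates carry equal weight. For real data with more than two variates a general eigen-solver from a statistical or numerical package would replace `eigen_2x2`.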

330 0. or just plain wrong. on the contrary. we can transform the new components further to extract more meaning. so that the leading few principal axes account for large proportions of the variance. we can examine the scatter in those few dimensions for structure. To conclude this section.722 ±0. Bear in mind ®nally that statistical processing and analyses are means to ends in soil research and that their outcomes must make pedological sense. but their outcomes are at best guides. inappropriate. guided authors on its correct application.Statistics to support soil research 339 Table 3 First three eigenvectors from the PCA of the standardized trace metal data Order 1 Cd Co Cr Cu Ni Pb Zn 0.388 0. not as an end in itself. The analyses have been programmed and are readily available in computer packages. This is not to say that PCA is not valuable. it does not lead È directly to any tests of signi®cance. The only thing that seems to have changed since Mark & Church (1977) reported the dismal record of regression in earth science is that it is easier to do ± every computing system now has a regression button.313 0. In the example from which Tables 2 and 3 derive. Cr and Ni.396 0.303 0.327 0. The second component. and the group of Cu. 1994). It is no more and no less. and we usually have to make our own judgements on what is meaningful and good sense. and I do not know what more I can do to educate authors on the subject. 52. principal component analysis may be regarded as a rigid rotation of the data to new axes.338 0.457 2 ±0.034 ±0. all seven metals lie well to the right of centre in Figure 4.. 1997).086 Figure 4 Scatter of the seven trace metals plotted as their correlation coef®cients between them and the ®rst two principal components in the unit circle (from Webster et al.580 ±0. it has proved remarkably.248 ±0. and if you do not understand it please consult a professional statistician.327 ±0. and we should be able to trust the outcome. 
and a projection from the full space on to them will give a picture containing most of the information in the data.416 0. discriminates between the metals Cd.178 0. The ®rst component is often one of size. and the authors thought it not worth investigating the remaining 20%. How many dimensions? As above. however.187 ±0. Tests have been proposed for the purpose. If you are contemplating a regression analysis for this Journal then read my article ®rst. There are often moderate to strong correlations in data. # helpful. even surprisingly.558 0. It is mathematical rather than statistical. Yet misunderstanding and abuse continue. It embodies no distributional assumptions. and it is no exaggeration to say that in most the regression is inadequately explained. Soil scientists' main failings are now in choosing the statistics appropriate for their purposes and in presenting their results with understanding. 2001 Blackwell Science Ltd. See text for explanation. and we can analyse them as representatives of the full set of data. Epilogue Most soil research requires only well-established standard statistics. an exploration of your data. unnecessary. It is also naõve in the sense that it takes no account of anything you know apart from the data themselves. Papers in which regression has been applied thoughtlessly continue to pour into my of®ce.551 ±0. We shall have reduced the dimensionality to one that is manageable. and warned of its improper use in a previous article (Webster. You should see PCA as the beginning of an investigation. trace metals in the soil of the Swiss Jura. These together account for almost 80% of the variance among the seven metals. If the results turn out to have pedological or biological meaning then that is good fortune rather than a matter of design. Regression I described regression.133 3 ±0. showing that units are either rich in all the metals or poor in them all. Pb and Zn. Co. 
the main aim of PCA is to reduce the dimensionality and obtain a few variates that capture most of the information in the data. 331±340 . The question almost inevitably arises: how few? ± how many components should one retain? There is rarely an unequivocal answer. European Journal of Soil Science.274 0.

References

Dytham, C. 1999. Choosing and Using Statistics: A Biologist's Guide. Blackwell Science, Oxford.
Fritze, H., Pietikäinen, J. & Pennanen, T. 2000. Distribution of microbial biomass and phospholipid fatty acids in Podzol profiles under coniferous forest. European Journal of Soil Science, 51, 565–573.
Goovaerts, P. 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press, New York.
Lewontin, R.C. 1966. On the measurement of relative variability. Systematic Zoology, 15, 171–172.
Mardia, K.V., Kent, J.T. & Bibby, J.M. 1979. Multivariate Analysis. Academic Press, London.
Mark, D.M. & Church, M. 1977. On the misuse of regression in earth science. Journal of the International Association of Mathematical Geology, 9, 63–77.
Monteith, J.L. 1984. Consistency and convenience in the choice of units for agricultural science. Experimental Agriculture, 20, 105–117.
Radojević, M. & Bashkin, V.N. 1999. Practical Environmental Analysis. Royal Society of Chemistry, Cambridge.
Webster, R. 1997. Regression and functional relations. European Journal of Soil Science, 48, 557–566.
Webster, R. & McBratney, A.B. 1987. Mapping soil fertility at Broom's Barn by simple kriging. Journal of the Science of Food and Agriculture, 38, 97–115.
Webster, R. & Oliver, M.A. 1990. Statistical Methods in Soil and Land Resource Survey. Oxford University Press, Oxford.
Webster, R., Atteia, O. & Dubois, J.-P. 1994. Coregionalization of trace metals in the soil in the Swiss Jura. European Journal of Soil Science, 45, 205–218.