
LC•GC Europe Online Supplement statistics and data analysis 19

Missing Values, Outliers,
Robust Statistics &
Non-parametric Methods
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This article, the fourth and final part of our statistics refresher series, looks
at how to deal with ‘messy’ data that contain transcription errors or extreme
and skewed results.

This is the last article in a series of short papers introducing basic statistical methods of use in analytical science. In the three previous papers (1–3) we have assumed the data has been ‘tidy’; that is, normally distributed with no anomalous and/or missing results. In the real world, however, we often need to deal with ‘messy’ data, for example data sets that contain transcription errors or unexpected extreme results, or that are skewed. How we deal with this type of data is the subject of this article.

Transcription errors
Transcription errors can normally be corrected by implementing good quality control procedures before statistical analysis is carried out. For example, the data can be independently checked or, more rarely, the data can be entered, again independently, into two separate files and the files compared electronically to highlight any discrepancies. There are also a number of outlier tests that can be used to highlight anomalous values before other statistics are calculated. These tests do not remove the need for good quality assurance; rather they should be seen as an additional quality check.

Missing data
No matter how well our experiments are planned there will always be times when something goes wrong, resulting in gaps in the data. Some statistical procedures will not work as well, or at all, with some data missing. The best recourse is always to repeat the experiment to generate the complete data set. Sometimes, however, this is not feasible, particularly where readings are taken at set times or the cost of retesting is prohibitive, so alternative ways of addressing this problem are needed. Current statistical software packages typically deal with missing data by one of three methods:

Casewise deletion excludes all examples (cases) that have missing data in at least one of the selected variables. For example, in ICP–AAS (inductively coupled plasma–atomic absorption spectroscopy) calibrated with a number of standard solutions containing several metal ions at different concentrations, if the aluminium value were missing for a particular test portion, all the results for that test portion would be disregarded (see Table 1). This is the usual way of dealing with missing data, but it does not guarantee correct answers. This is particularly so in complex (multivariate) data sets, where it is possible to end up deleting the majority of your data if the missing data are randomly distributed across cases and variables.

              Al         B       Fe     Ni
Solution 1    (missing)  94.5    578    23.1
Solution 2    567        72.1    673    7.6
Solution 3    (missing)  34.0    674    44.7
Solution 4    234        97.4    429    82.9

Casewise deletion. Statistical analysis only carried out on the reduced data set.

              Al         B       Fe     Ni
Solution 2    567        72.1    673    7.6
Solution 4    234        97.4    429    82.9

Table 1 Casewise deletion.

Pairwise deletion can be used as an alternative to casewise deletion in situations where parameters (correlation coefficients, for example) are calculated on successive pairs of variables. For example, in a recovery experiment we may be interested in the correlations between material recovered and extraction time, temperature, particle size, polarity, etc. With pairwise deletion, if one solvent polarity measurement was missing, only this single pair would be deleted from the correlation, and the correlations for recovery versus extraction time and particle size would be unaffected (see Table 2).
Pairwise deletion can, however, lead to serious problems. For example, if there is a ‘hidden’ systematic distribution of missing points then a bias may result when calculating a correlation matrix (i.e., different correlation coefficients in the matrix can be based on different subsets of cases).

Mean substitution replaces all missing data in a variable by the mean value for that variable.
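The three approaches described in this section can be sketched in a few lines of Python. This is only an illustration using the recovery data of Table 2, not the procedure of any particular statistics package; the function names (`casewise`, `pairwise_r`, `mean_substitute`) and the use of `None` to mark the missing solvent polarity value are assumptions made for the example.

```python
from math import sqrt

# Recovery experiment data as given in Table 2; None marks the missing value.
samples = {
    "recovery": [93, 105, 99, 73],        # %
    "time":     [20, 120, 180, 10],       # extraction time, mins
    "size":     [90, 150, 50, 500],       # particle size, um
    "polarity": [None, 1.8, 1.0, 1.5],    # solvent polarity
}

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def casewise(data):
    """Casewise deletion: keep only cases with no missing value at all."""
    n = len(next(iter(data.values())))
    keep = [i for i in range(n)
            if all(col[i] is not None for col in data.values())]
    return {name: [col[i] for i in keep] for name, col in data.items()}

def pairwise_r(data, a, b):
    """Pairwise deletion: correlate a and b over cases where both exist."""
    pairs = [(x, y) for x, y in zip(data[a], data[b])
             if x is not None and y is not None]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys), len(pairs)

def mean_substitute(data):
    """Mean substitution: fill each gap with the mean of the observed values."""
    filled = {}
    for name, col in data.items():
        observed = [v for v in col if v is not None]
        mean = sum(observed) / len(observed)
        filled[name] = [mean if v is None else v for v in col]
    return filled

reduced = casewise(samples)                          # Sample 1 is dropped
r_time, n_time = pairwise_r(samples, "recovery", "time")
r_size, n_size = pairwise_r(samples, "recovery", "size")
r_pol, n_pol = pairwise_r(samples, "recovery", "polarity")
completed = mean_substitute(samples)                 # gap filled by the mean

print(round(r_time, 6), n_time, round(r_size, 5), n_size, round(r_pol, 6), n_pol)
```

With pairwise deletion the first two correlations use all four samples and only the recovery versus polarity correlation loses a case, whereas casewise deletion discards Sample 1 from every calculation.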

Though this looks as if the data set is now complete, mean substitution has its own disadvantages. The variability in the data set is artificially decreased in direct proportion to the number of missing data points, leading to underestimates of dispersion (the spread of the data). Mean substitution may also considerably change the values of some other statistics, such as linear regression statistics (3), particularly where correlations are strong (see Table 3).
Examples of these three approaches are illustrated in Figure 1, where the correlation coefficient (r) (3) is determined for each paired combination of the five variables, A to E. Note how the r value can increase, diminish or even reverse sign depending on which method is chosen to handle the missing data (i.e., the A, B correlation coefficients).

            Recovery   Extraction time   Particle size   Solvent polarity
            (%)        (mins)            (µm)            (pKa)
Sample 1    93         20                90              (missing)
Sample 2    105        120               150             1.8
Sample 3    99         180               50              1.0
Sample 4    73         10                500             1.5

                           Recovery vs       Recovery vs      Recovery vs
                           extraction time   particle size    solvent polarity
r                          0.728886          -0.87495         0.033942
(number of data points     (4)               (4)              (3)
in the correlation)

Table 2 Pairwise deletion. Statistical analysis unaffected except when one of a pair of data points is missing.

              Al       B       Fe     Ni
Solution 1    400.5    94.5    578    23.1
Solution 2    567      72.1    673    7.6
Solution 3    400.5    34.0    674    44.7
Solution 4    234      97.4    429    82.9

Table 3 Mean substitution. Statistical analysis carried out on pseudo completed data with no allowance made for errors in estimated values.

Box 1: Imputation (4, 5) is yet another method that is increasingly being used to handle missing data. It is, however, not yet widely available in statistical software packages. In its simplest ad hoc form an imputed value is substituted for the missing value (e.g., mean substitution already discussed above is a form of imputation). In its more general/systematic form, however, the imputed missing values are predicted from patterns in the real (non-missing) data. A total of m possible imputed values are calculated for each missing value (using a suitable statistical model derived from the patterns in the data) and then m possible complete data sets are analysed in turn by the selected statistical method. The m intermediate results are then pooled to yield the final result (statistic) and an estimate of its uncertainty. This method works well providing that the missing data is randomly distributed and the model used to predict the imputed values is sensible.

Extreme values, stragglers and outliers
Extreme values are defined as observations in a sample, so far separated in value from the remainder as to suggest that they may be from a different population, or the result of an error in measurement (6). Extreme values can also be subdivided into stragglers, extreme values detected between the 95% and 99% confidence levels, and outliers, extreme values at > 99% confidence level.
It is tempting to remove extreme values automatically from a data set, because they can alter the calculated statistics, e.g., increase the estimate of variance (a measure of spread), or possibly introduce a bias in the calculated mean. There is one golden rule however: no value should be removed from a data set on statistical grounds alone. ‘Statistical grounds’ include outlier testing. Outlier tests tell you, on the basis of some simple assumptions, where you are most likely to have a technical error; they do not tell you that the point is ‘wrong’. No matter how extreme a value is in a set of data, the suspect value could nonetheless be a correct piece of information (1). Only with experience or the identification of a particular cause can data be declared ‘wrong’ and removed.
So, given that we understand that the tests only tell us where to look, how do we test for outliers? If we have good grounds for believing our data is normally distributed then a number of ‘outlier tests’ (sometimes called Q-tests) are available that identify extreme values in an objective way (7, 8). Good grounds for believing the data is normal are
• past experience of similar data
• passing normality tests, e.g., Kolmogorov–Smirnov–Lilliefors test, Shapiro–Wilk’s test, skewness test, kurtosis test (7, 9) etc.
• plots of the data, e.g., frequency histogram, normal probability plots (1, 7).
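As a rough illustration of the skewness and kurtosis checks named in the list above, the sketch below computes the moment-based sample skewness and excess kurtosis, both of which are close to zero for normally distributed data. It is a screening aid only and not one of the formal tests: those compare such statistics against tabulated critical values, which are not reproduced here. The function name and both data sets are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both ~0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# Hypothetical data: a symmetric set and a set with one long right-hand tail.
symmetric = [1, 2, 3, 4, 5]
long_tail = [1, 1, 1, 1, 10]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_tail, kurt_tail = skewness_and_excess_kurtosis(long_tail)
print(skew_sym, kurt_sym)
print(skew_tail, kurt_tail)
```

A clearly non-zero skewness, as in the second set, is a warning that outlier tests assuming normality may flag genuine values in the tail.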

Note that the tests used to check normality usually require a significant amount of data (a minimum of 10–15 results are recommended depending on the normality test applied). For this reason there will be many examples in analytical science where either it will be impractical to carry out such tests, or the tests will not tell us anything meaningful.
If we are not sure the data set is normally distributed then robust statistics and/or non-parametric (distribution independent) tests can be applied to the data. These three approaches (outlier tests, robust estimates and non-parametric methods) are examined in more detail below.

Figure 1 Effect of missing data on a correlation matrix. [Correlation matrices for variables A to E with different approaches selected for missing data: no missing data (15 cases), casewise deletion (only 5 cases remain), pairwise deletion (variable number of cases) and mean substitution (15 cases). The figure also lists the 15 cases, marks the data removed to show the effects of missing data and the mean values replacing missing data, tabulates critical values of r against n, and indicates the correlations that are significant at the 95% confidence level.]

Outlier tests
In analytical chemistry it is rare that we have large numbers of replicate data, and small data sets often show fortuitous grouping and consequent apparent outliers. Outlier tests should, therefore, be used with care and, of course, identified data points should only be removed if a technical reason can be found for their aberrant behaviour.
Most outlier tests look at some measure of the relative distance of a suspect point from the mean value. This measure is then assessed to see if the extreme value could reasonably be expected to have arisen by chance. Most of the tests look for single extreme values (Figure 2(a)), but sometimes it is possible for several ‘outliers’ to be present in the same data set. These can be identified in one of two ways:
• by iteratively applying the outlier test
• by using tests that look for pairs of extreme values, i.e., outliers that are masking each other (see Figure 2(b) and 2(c)).
Note, as a rule of thumb, if more than 20% of the data are identified as outlying you should start to question your assumption about the data distribution and/or the quality of the data collected.

Figure 2 Outliers and masking. [Three panels: (a) a single outlier, (b) two outliers, (c) a pair of outliers.]

Other outlier tests, such as Dixon or Nalimov, are available; we will concentrate on the three Grubbs’ tests (7). The appropriate outlier tests for the three situations described in Figure 2 are: 2(a) Grubbs 1, 2(b) Grubbs 2 and 2(c) Grubbs 3. The test values are calculated using the formulae below, after the data are arranged in ascending order.

$$G_1 = \frac{|\bar{x} - x_i|}{s} \qquad G_2 = \frac{x_n - x_1}{s} \qquad G_3 = 1 - \frac{(n-3) \times s_{n-2}^{2}}{(n-1) \times s^{2}}$$

where s is the standard deviation for the whole data set, x_i is the suspected single outlier, i.e., the value furthest away from the mean, | | is the modulus (the value of a calculation ignoring the sign of the result), x̄ is the mean, n is the number of data points, x_n and x_1 are the most extreme values, and s_{n-2} is the standard deviation for the data set excluding the suspected pair of outlier values, i.e., the pair of values furthest away from the mean.
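The three Grubbs' formulae can be checked with a short Python sketch. It uses the 13 replicate values of the worked example in Box 2 and the 95% critical values quoted there for n = 13. The function name is hypothetical, and one step is an assumption rather than part of the article: for G3 the sketch evaluates both candidate pairs (the two lowest and the two highest values) and keeps the larger statistic.

```python
from statistics import mean, stdev, variance

def grubbs_statistics(values):
    """Return (G1, G2, G3) for a data set, following the formulae above."""
    x = sorted(values)                    # data arranged in ascending order
    n = len(x)
    m, s = mean(x), stdev(x)              # s uses the n - 1 denominator

    # G1: distance of the single most extreme value from the mean.
    suspect = max(x, key=lambda v: abs(v - m))
    g1 = abs(m - suspect) / s

    # G2: range of the data relative to the standard deviation.
    g2 = (x[-1] - x[0]) / s

    # G3: variance remaining after removing a suspected pair.
    # Assumption: try both ends and keep the more extreme result.
    without_low_pair, without_high_pair = x[2:], x[:-2]
    g3 = max(1 - ((n - 3) * variance(rest)) / ((n - 1) * s ** 2)
             for rest in (without_low_pair, without_high_pair))
    return g1, g2, g3

replicates = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
              48.559, 48.634, 48.711, 49.005, 49.166, 49.484]

g1, g2, g3 = grubbs_statistics(replicates)
critical_95 = (2.331, 4.00, 0.6705)       # n = 13, 95% confidence level

# A test value above its critical value would flag an extreme value.
flagged = [g > c for g, c in zip((g1, g2, g3), critical_95)]
print(round(g1, 2), round(g2, 2), round(g3, 3), flagged)
```

Working from unrounded intermediate values gives a G3 that differs from the hand calculation in Box 2 in the third decimal place, which does not change the conclusion.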

If the test values (G1, G2, G3) are greater than the critical value obtained from tables (see Table 4) then the extreme value(s) are unlikely to have occurred by chance at the stated confidence level (see Box 2).

Box 2: Grubbs’ tests (worked example).
13 replicates are ordered in ascending order.

x1 ... xn
47.876  47.997  48.065  48.118  48.151  48.211  48.251  48.559  48.634  48.711  49.005  49.166  49.484

n = 13, mean = 48.479, s = 0.498, s²(n–2) = 0.123

$$G_1 = \frac{49.484 - 48.479}{0.498} = 2.02 \qquad G_2 = \frac{49.484 - 47.876}{0.498} = 3.23 \qquad G_3 = 1 - \frac{10 \times 0.123}{12 \times 0.498^{2}} = 0.587$$

Grubbs’ critical values for 13 values are G1 = 2.331 and 2.607, G2 = 4.00 and 4.24, G3 = 0.6705 and 0.7667 for the 95% and 99% confidence levels. Since the test values are less than their respective critical values, in all cases, it can be concluded there are no outlying values.

Pitfalls of outlier tests
Figure 3 shows three situations where outlier tests can misleadingly identify an extreme value.
Figure 3(a) shows a situation common in chemical analysis. Because of limited measurement precision (rounding errors) it is possible to end up comparing a result which, no matter how close it is to the other values, is an infinite number of standard deviations away from the mean of the remaining results. This value will therefore always be flagged as an outlier.
In Figure 3(b) there is a genuine long tail on the distribution that may cause successive outlying points to be identified. This type of distribution is surprisingly common in some types of chemical analysis, e.g., pesticide residues.
If there is very little data (Figure 3(c)) an outlier can be identified by chance. In this situation it is possible that the identified point is closer to the ‘true value’ and it is the other values that are the outliers. This occurs more often than we would like to admit; how many times do your procedures state ‘average the best two out of three determinations’?

Outliers by variance
When the data are from different groups (for example when comparing test methods via interlaboratory comparison) it is not only possible for individual points within a group to be outlying but also for the group means to have outliers with respect to each other. Another type of ‘outlier’ that can occur is when the spread of data within one particular group is unusually small or large when compared with the spread of the other groups (see Figure 4).
• The same Grubbs’ tests that are used to determine the presence of within group outlying replicates may also be used to test for suspected outlying means.
• The Cochran’s test can be used to test for the third case, that of a suspected outlying variance.
To carry out the Cochran’s test, the suspect variance is compared with the sum of all group variances. (The variance is a measure of spread and is simply the square of the standard deviation (1).)

$$C_{\bar{n}} = \frac{\text{suspected } s^{2}}{\sum_{i=1}^{g} S_i^{2}} \qquad \bar{n} = \frac{\sum_{i=1}^{g} n_i}{g}$$

where g is the number of groups and n̄ is the average number of sample results produced by all groups.
If this calculated ratio, C_n̄, exceeds the critical value obtained from statistical tables (7) then the suspect group spread is extreme. The Cochran’s test assumes the number of replicates within the groups are the same or at least similar (± 1). It also assumes that none of the data have been rounded and there are sufficient numbers of replicates to get a reasonable estimate of the variance. The Cochran’s test should not be used iteratively as this could lead to a large percentage of data being removed (see Box 3).

Robust statistics
Robust statistics include methods that are largely unaffected by the presence of extreme values. The most commonly used of these statistics are as follows:
Median: The median is a measure of central tendency (1) and can be used instead of the mean. To calculate the median (x̃) the data are arranged in order of magnitude and the median is then the central member of the series (or the mean of the two central members when there is an even number of data, i.e., there are equal numbers of observations smaller and greater than the median). For a symmetrical distribution the mean and median have the same value.

$$\tilde{x} = \begin{cases} x_m & \text{when } n \text{ is odd } (1, 3, 5, \ldots) \\ \dfrac{x_m + x_{m+1}}{2} & \text{when } n \text{ is even } (2, 4, 6, \ldots) \end{cases} \qquad \text{where } m = \text{round up } \frac{n}{2}$$

Median Absolute Deviation (MAD): The MAD value is an estimate of the spread in the data similar to the standard deviation.
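A minimal Python sketch of these robust estimates follows, applied to the 13 replicates of Box 2. The MAD is taken here as the median of the absolute deviations from the median, and it is scaled by the factor 1.483 that the article uses to make it comparable with a standard deviation (the MADE value). The function names and the deliberately corrupted data set, which mimics a decimal-point transcription error, are assumptions made for illustration.

```python
from statistics import mean, stdev

def median(values):
    """Median using the article's rule: m = n/2 rounded up (1-based)."""
    x = sorted(values)
    n = len(x)
    m = (n + 1) // 2                      # n/2 rounded up
    if n % 2 == 1:
        return x[m - 1]                   # central member when n is odd
    return (x[m - 1] + x[m]) / 2          # mean of two central members

def mad(values):
    """Median of the absolute deviations from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])

def mad_e(values):
    """MAD scaled by 1.483 so that it is comparable with a standard deviation."""
    return 1.483 * mad(values)

replicates = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
              48.559, 48.634, 48.711, 49.005, 49.166, 49.484]

# Hypothetical transcription error: 49.484 entered as 494.84.
corrupted = replicates[:-1] + [494.84]

print(median(replicates), mad(replicates), mad_e(replicates), stdev(replicates))
print(median(corrupted), mad(corrupted), mean(corrupted), stdev(corrupted))
```

The single gross error moves the mean and standard deviation substantially but leaves the median and MAD unchanged, which is the sense in which these estimates are described as robust.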

For n values,

$$\mathrm{MAD} = \mathrm{median}\,|x_i - \tilde{x}| \qquad i = 1, 2, \ldots, n$$

If the MAD value is scaled by a factor of 1.483 it becomes comparable with a standard deviation; this is the MADE value.

$$\mathrm{MAD_E} = 1.483 \times \mathrm{MAD}$$

Other robust statistical estimates include trimmed mean and deviations, Winsorized mean and deviation, least median of squares (robust regression), Levene’s test (heterogeneity in ANOVA), etc. A discussion of robust statistics in analytical chemistry can be found elsewhere (10, 11).

Box 3: Cochran’s test (worked example).
An interlaboratory study was carried out by 13 laboratories to determine the amount of cotton in a cotton/polyester fabric; 85 determinations were carried out in total. The standard deviations of the data obtained by each of the 13 laboratories were as follows:

Std. Dev.
0.202  0.402  0.332  0.236  0.318  0.452  0.210  0.074  0.525  0.067  0.609  0.246  0.198

$$C_{\bar{n}} = \frac{0.609^{2}}{0.202^{2} + 0.402^{2} + \ldots + 0.198^{2}} = \frac{0.371}{1.474} = 0.252 \qquad \bar{n} = \frac{85}{13} = 6.54 \approx 7$$

Cochran’s critical value for n̄ = 7 and g = 13 is 0.23 at the 95% confidence level (7).
As the test value is greater than the critical value it can be concluded that the laboratory with the highest standard deviation (0.609) has an outlying spread of replicates and this laboratory’s results therefore need to be investigated further. It is normal practice in inter-laboratory comparisons not to test for low variance outliers, i.e., laboratories reporting unusually precise results.

        95% confidence level          99% confidence level
n       G(1)     G(2)     G(3)        G(1)     G(2)     G(3)
3       1.153    2.00     ---         1.155    2.00     ---
4       1.463    2.43     0.9992      1.492    2.44     1.0000
5       1.672    2.75     0.9817      1.749    2.80     0.9965
6       1.822    3.01     0.9436      1.944    3.10     0.9814
7       1.938    3.22     0.8980      2.097    3.34     0.9560
8       2.032    3.40     0.8522      2.221    3.54     0.9250
9       2.110    3.55     0.8091      2.323    3.72     0.8918
10      2.176    3.68     0.7695      2.410    3.88     0.8586
12      2.285    3.91     0.7004      2.550    4.13     0.7957
13      2.331    4.00     0.6705      2.607    4.24     0.7667
15      2.409    4.17     0.6182      2.705    4.43     0.7141
20      2.557    4.49     0.5196      2.884    4.79     0.6091
25      2.663    4.73     0.4505      3.009    5.03     0.5320
30      2.745    4.89     0.3992      3.103    5.19     0.4732
35      2.811    5.026    0.3595      3.178    5.326    0.4270
40      2.866    5.150    0.3276      3.240    5.450    0.3896
50      2.956    5.350    0.2797      3.336    5.650    0.3328
60      3.025    5.500    0.2450      3.411    5.800    0.2914
70      3.082    5.638    0.2187      3.471    5.938    0.2599
80      3.130    5.730    0.1979      3.521    6.030    0.2350
90      3.171    5.820    0.1810      3.563    6.120    0.2147
100     3.207    5.900    0.1671      3.600    6.200    0.1980
110     3.239    5.968    0.1553      3.632    6.268    0.1838
120     3.267    6.030    0.1452      3.662    6.330    0.1716
130     3.294    6.086    0.1364      3.688    6.386    0.1611
140     3.318    6.137    0.1288      3.712    6.437    0.1519

Table 4 Grubbs’ critical value table (5).

Non-parametric tests
Typical statistical tests incorporate assumptions about the underlying distribution of data (such as normality), and hence rely on distribution parameters. ‘Non-parametric’ tests are so called because they make few or no assumptions about the distributions, and do not rely on distribution parameters. Their chief advantage is improved reliability when the distribution is unknown. There is at least one non-parametric equivalent for each parametric type of test (see Table 5). In a short article, such as this, it is impossible to describe the methodology for all these tests but more information can be found in other publications (12, 13).

Conclusions
• Always check your data for transcription errors. Outlier tests can help to identify them as part of a quality control check.
• Delete extreme values only when a technical reason for their aberrant behaviour can be found.
• Missing data can result in misinterpretation of the resulting statistics so care should be taken with the method chosen to handle the gaps. If at all possible, further experiments should be carried out to fill in the missing points.

• Outlier tests assume the data distribution is known. This assumption should be checked for validity before these tests are applied.
• Robust statistics avoid the need to use outlier tests by down-weighting the effect of extreme values.
• When knowledge about the underlying data distribution is limited, non-parametric methods should be used.
NB: It should be noted that following a judgement in a US court, the Food and Drug Administration (FDA), in its Guide to Inspection of Pharmaceutical Quality Control Laboratories, has specifically prohibited the use of outlier tests.

Figure 4 Different types of outlier in grouped data. [Box & whisker plot of analyte concentration against laboratory ID (1 to 16), with one group labelled as an outlying mean and another as an outlying variance.]

Types of comparison: Differences between independent groups of data
  Parametric methods: t-test for independent groups (2); analysis of variance (ANOVA/MANOVA) (2)
  Non-parametric methods (12, 13): Wald–Wolfowitz runs test; Mann–Whitney U test; Kolmogorov–Smirnov two-sample test; Kruskal–Wallis analysis of ranks; Median test

Types of comparison: Differences between dependent groups of data
  Parametric methods: t-test for dependent groups (2); ANOVA with replication (2)
  Non-parametric methods (12, 13): Sign test; Wilcoxon’s matched pairs test; McNemar’s test; χ² (Chi-square) test; Friedman’s two-way ANOVA; Cochran Q test

Types of comparison: Relationships between continuous variables
  Parametric methods: Linear regression (3); Correlation coefficient (3)
  Non-parametric methods (12, 13): Spearman R; Kendall Tau; coefficient Gamma

Types of comparison: Relationships between counted variables
  Non-parametric methods (12, 13): χ² (Chi-square) test; Phi coefficient; Fisher exact test; Kendall coefficient of concordance

Types of comparison: Homogeneity of variance
  Parametric methods: Bartlett’s test (7)
  Non-parametric methods (12, 13): Levene’s test; Brown & Forsythe

Table 5 Non-parametric alternatives to parametric statistical tests.

Acknowledgement
The preparation of this paper was supported under a contract with the UK’s Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (14).

References
(1) S. Burke, Scientific Data Management 1(1), 32–38, 1997.
(2) S. Burke, Scientific Data Management 2(1), 36–41, 1998.
(3) S. Burke, Scientific Data Management 2(2), 32–40, 1998.
(4) J.L. Schafer, Monographs on Statistics and Applied Probability 72: Analysis of Incomplete Multivariate Data, Chapman & Hall (1997), ISBN 0-412-04061-1.
(5) R.J.A. Little & D.B. Rubin, Statistical Analysis With Missing Data, John Wiley & Sons (1987), ISBN 0-471-80243-9.
(6) ISO 3534, Statistics: Vocabulary and Symbols. Part 1: Probability and general statistical terms, section 2.64, Geneva 1993.
(7) T.J. Farrant, Practical statistics for the analytical scientist: A bench guide, Royal Society of Chemistry 1997 (ISBN 0 85404 442 6).
(8) V. Barnett & T. Lewis, Outliers in Statistical Data, 3rd Edition, John Wiley (1994).
(9) William H. Kruskal & Judith M. Tanur, International Encyclopaedia of Statistics, Collier Macmillan Publishers, 1978, ISBN 0-02-917960-2.
(10) Analytical Methods Committee, Robust Statistics: How Not to Reject Outliers, Part 2, Analyst 1989, 114, 1693–7.
(11) D.C. Hoaglin, F. Mosteller & J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons (1983), ISBN 0-471-09777-2.
(12) M. Hollander & D.A. Wolfe, Non-parametric statistical methods, Wiley & Sons, New York 1973.
(13) W.W. Daniel, Applied non-parametric statistics, Houghton Mifflin, Boston 1978.
(14) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist, Autumn 1995.