Spatial outlier detection in the PM10 monitoring network of Normandy (France)

Atmospheric Pollution Research 6 (2015) 476‐483

www.atmospolres.com
Michel Bobbia 1, Michel Misiti 2, Yves Misiti 2, Jean‐Michel Poggi 2,3, Bruno Portier 4
1 Air Normand, 3, place de la pomme d'or, 76 000 Rouen, France
2 Université d'Orsay, Lab. de Mathématiques, 91405 Orsay, France
3 Université Paris Descartes, Paris, France
4 Normandie Université, INSA de Rouen, BP 08 – 76800 Saint–Etienne du Rouvray, France
ABSTRACT
We consider hourly PM10 measurements from 22 monitoring stations located in the Basse–Normandie and Haute–Normandie regions (France) and in the neighboring regions. All considered monitoring stations are either urban background stations or rural ones. The paper focuses on the statistical detection of outliers in the hourly PM10 concentrations from a spatial point of view. The general strategy uses a jackknife type approach and is based on the comparison of the actual measurement with some robust spatial prediction. Two spatial predictions are considered: the first one is based on the median of the concentrations of the closest neighboring stations, which directly considers weighted concentrations, while the second one is based on kriging increments, instead of more traditional pseudo–innovations. The two methods are applied to the PM10 monitoring network in Normandy and are fully implemented by Air Normand (the official association for air quality monitoring in Haute–Normandie) in the Measurements Quality Control process. Some numerical results are provided on recent data from January 1, 2013 to May 31, 2013 to illustrate and compare the two methods.

Keywords: Air quality, kriging, nearest neighbors, particulate matter, spatial outlier detection

Corresponding Author: Jean‐Michel Poggi
Tel.: +33‐0 1 69 15 57 44
Fax: +33‐0 1 69 15 72 34
E‐mail: jean‐michel.poggi@math.u‐psud.fr

Article History:
Received: 16 July 2014
Revised: 01 December 2014
Accepted: 02 December 2014
doi: 10.5094/APR.2015.053

© Author(s) 2015. This work is distributed under the Creative Commons Attribution 3.0 License.

1. Introduction

In France, air quality is monitored in each region by an official association. In Normandy (consisting of two regions), Air Normand, based in Rouen, and Air C.O.M. (Air COM for short), based in Caen, monitor air quality. In addition to these primary functions, their role is also to inform the population regarding air quality. Thus, to fulfill their missions, Air Normand and Air COM measure air quality with automatic analyzers scattered throughout the region, and make these measurements publicly available, mainly through the website, to inform the public on exposure to air pollution. Indeed, Air Normand and Air COM work closely together to publish their measurements on a common website (www.airnormand.fr). In particular, measurements are spatially interpolated to produce maps of air quality, also available from the website. More precisely, Air Normand provides a map of air quality on the Normandy region updated every hour. The maps of air quality for two pollutants (O3 and PM10) are obtained by combining hourly measurements of concentrations and the maps provided by the numerical model outputs. Each pollutant is mapped by correcting the numerical model outputs by the measurements provided by the monitoring stations, using assimilation methods (Grancher et al., 2005; de Fouquet et al., 2011). Thus undiagnosed measurement errors could seriously affect the quality of the spatial reconstruction of concentrations, leading to erroneous maps.

The aim of this work is to provide tools for outlier detection in the spatial sense, which could help in the validation of the measurements at each specific location of the monitoring network. More precisely, we consider in this paper the problem of spatial outlier detection in the context of particulate matter and especially PM10, which is the most crucial pollutant in Normandy, but this is a general pattern and it can be applied to many other contexts.

A short survey of the literature about outliers among a large number of references can be quickly performed. For example, we can first highlight the classical book of Barnett (2004), which contains a chapter especially dedicated to this topic, as well as some survey papers (Ben–Gal, 2005; Planchon, 2005 or more recently Chandola et al., 2009). However, these references are mainly concerned with univariate or multivariate outliers but not specifically dedicated to the spatial nature of the data. Haslett et al. (1991), as well as Laurent et al. (2012), use analytic tools to explore spatial data and to deduce some outlier detection procedures as in Filzmoser et al. (2014). In the case of spatial data, a classical distinction is to be made between a “global” atypical value, which consists in reasoning starting from the behavior of the majority of data, and a “local” atypical value, which consists in reasoning from the behavior of the observations that are geographical neighbors. Then four classes of observations can be defined: typical, global atypical only (detected using standard tools), local atypical only and the last one, local and global atypical. In this paper, we are interested in detecting local atypical

observations. We can notice that a local atypical observation is often defined as an observation that differs from the closest observations, so it is implicitly assumed that the data exhibit a positive spatial autocorrelation. Of course it is important to check that this autocorrelation is realistic in each specific application. Some references particularly favor the detection of spatially anomalous observations. Cerioli and Riani (1999) define a procedure based on kriging schemes and dedicated to multiple outliers, while Shekhar et al. (2003), Lu et al. (2003) and Kou et al. (2006) develop some simple, intuitive and robust ways to detect spatial outliers. Let us finally mention the very recent paper of Li et al. (2013) about outlier hypothesis testing studied in a universal setting.

The basic idea of the detection algorithm we propose consists in comparing the measured concentration to some spatial prediction, following a classical jackknife type approach (i.e. leaving out the observation at the considered location from the dataset used to calculate the estimate, see Efron, 1982). The decision rule is then based on thresholds coming from the distributions of prediction residuals along time. We consider two methods to perform the detection of spatial outliers depending on the prediction method. The first one, inspired by Lu et al. (2003) and by Kou et al. (2006), follows a non–stationary spatial way by directly comparing the concentration of a given site to the median of the weighted concentrations of its neighbors with respect to a pre–specified neighborhood system. The second one is based on kriging increments, namely the difference between the current observations and a reference set of past observations or numerical model outputs, and not directly on concentrations.

The paper is organized as follows. Section 2 presents the PM10 monitoring network, the Measurements Quality Control process and the PM10 data. Finally some basics about kriging methodology are recalled. Section 3 describes the general principle of the outlier detection procedure and the two methods for spatial outlier detection. Section 4 first presents the results of the spatial outlier detection procedures on a recent database and then proposes a discussion. Finally, some concluding remarks are collected in Section 5.

2. Materials and Kriging Methodology

Normandy (over 3.3 million people in 2008) is located in Northwestern France, along the south coast of the English Channel and to the Northwest of Paris. Normandy is composed of two regions: Basse–Normandie and Haute–Normandie. Haute–Normandie is heavily industrialized with two large urban areas, Rouen and Le Havre, including more than 490 000 and 250 000 inhabitants respectively, while Basse–Normandie is more agricultural with only one urban area of significant size, Caen, with more than 400 000 inhabitants.

2.1. PM10 monitoring network

We have a set of PM10 hourly mean concentrations coming from 22 monitoring stations located in the Basse–Normandie and Haute–Normandie regions and also in the neighboring regions (see in Figure 1 the location of PM10 monitoring stations and main cities of Normandy). All considered monitoring stations are either urban background stations or rural ones.

The spatial outlier detection only concerns PM10 measurements coming from stations located in Normandy, namely AIL (near Dieppe), HRI, NEI (Le Havre), JUS, PQV, POS (Rouen), EVT (Evreux) for the Air Normand network (Haute–Normandie) and CHD (Cherbourg), SLO (Saint–Lo), CAE, IFS (Caen), LIS (Lisieux), ALE (Alençon) and MRA for the Air COM network (Basse–Normandie). The last eight stations ARR, BOV, DRE, FRE, LUC, PRU, MAZ and REN are stations of other monitoring agencies of the neighboring regions. Stations AIL, POS, MRA and ARR are rural stations while the others are urban background ones. Monitoring devices are mainly TEOM (Tapered Element Oscillating Microbalance) except for stations CHD, IFS, LIS and MRA where Beta gauges are used.

Figure 1. Location of PM10 monitoring stations and main cities.

2.2. Measurements quality control process

Air Normand and Air COM make measurements freely available and update them at a rate depending on the communication medium that is going to be used (i.e., paper report, web). Of course, the most common one is the Internet: measurements are collected at various monitoring stations, stored in the database and automatically published every hour. To ensure a high quality of service, Air Normand and Air COM have developed several procedures for data validation.

First, the maintenance of the analyzers is the primary task of any air quality monitoring association. Technicians regularly calibrate the analyzers, according to the manufacturers’ recommendations and the references prescribed by CEN standards (see www.cen.eu).

Then technicians check measurements twice a day (morning and evening). It is a first validation level performed on a strictly technical basis. At this stage, the physical state of measuring equipment is controlled, leading to three decisions: validation, invalidation or sometimes correction of the data in the database. The website is then updated accordingly. A second level of validation is performed daily by a single expert on a different temporal scale, examining the sequence of measurements on each site. Finally, an environmental validation of the measurements is performed every month during a meeting of experts. At this stage, the measurement network is considered as a whole: rather than reviewing site by site, from a metrological point of view, it is examined spatially.

Of course, the notion of outlier is not necessarily the same as that of invalidated data. For example, in case of a smooth drift, an expert will invalidate the data even if it is not an outlier. Conversely, an outlier could be a validated measurement in case of a local pollution episode. In summary, it is not possible to perform the complete validation of measurements in real–time. The scope of this work is to provide some automatic ways to identify possible spatial outliers using statistical methods. These possible outliers will then be invalidated or not.

2.3. Data

For each station, we have hourly or bi–hourly average PM10 concentrations. Each value can be a NaN (Not A Number) when the

concentration is not available or a value with a tag: validated or invalidated. For stations IFS, CHD, MRA and LIS, where average PM10 concentrations are measured every two hours, we have duplicated measurements to obtain sets of data compatible with those of other stations. Note that for stations outside Normandy, every available value is validated.

PM10 concentrations were collected over the period from January 1, 2013 to May 31, 2013 (set of 3 624 hours). We split this set in two parts: learning and test sets. An hour h belongs to the learning set if h and h–1 are validated data and without missing value (across stations). The test set is defined as the remaining hours. Outlier detection algorithms have been developed using data of the learning set (of size 1 560) and evaluated on the test set (2 064 observations).

Boxplots of validated PM10 concentrations of the learning set, for the 22 stations, are presented in Figure 2. One can observe a relatively homogeneous PM10 pollution in the region, which is consistent with the spatial outlier detection algorithm based on neighbors and presented later.

Figure 2. Boxplots of validated PM10 concentrations (in µg m–3) for the 22 stations.

In Table 1, we can find some basic statistics about the invalidated hourly PM10 concentrations, tagged by the Normand networks. The last column of the table gives the number of invalidated data per station. First of all, we can notice there are few invalidated data with respect to the total number of data (3 624 hours). The mean value is not very informative for invalidated data, but some median values are rather low and show that a large part of invalidated data has a normal order of magnitude. Note the presence of negative values, which are due to a monitor failure, a device maintenance or metrological characteristics of the monitor. We can also note the presence of some huge values in five stations. Technicians automatically invalidate these negative and huge values without further analysis. To give an idea of the usual range of observations for the Normandy region, hourly PM10 concentrations, validated since January 2013, are non negative and less than 300 µg m–3. Finally, it is clear that some invalidated data will be easily tagged as outliers, such as –200 µg m–3 at SLO and 8 513 µg m–3 at LIS. However, many other invalidated data, which have an order of magnitude comparable with validated data, will not be detected as outliers. Of course, the spatial nature of the data is not taken into account in these basic statistics.

Table 1. Summary statistics of PM10 invalidated data. The last column (N_inval) stands for the number of invalidated data

Station     Min.    1st Qu.    Median      Mean    3rd Qu.      Max.    N_inval
AIL           –2       7         11       15.21      16.5         90        19
ALE           –3       5          9       11.56      20           28         9
CAE           11      17.25      21       27.33      31.5         60         6
CHD            0       0         35       84.53      59          859        19
EVT           –7       4         12       17.4       33           57        25
HRI           –2       9         12       19.2       27           56        25
IFS            0       7.5       29       24.61      40           53        51
JUS          –13      24.75      39.5     39.78      55.75       119       242
LIS            0       4.5      233      1 224      1 254       8 513       28
MRA            0       0          4       13.43      25           65        21
NEI           –1       4          9.5     20.66      14.75       300        38
POS          –20      28         41      108.4       78.5       1 742       47
PQV           –1       8.75      23       24.24      35           78        80
SLO         –200       8.25      16      100         24.5        735         6

2.4. Kriging methodology

Let us recall some basic facts about the kriging method; the reader can refer to the book of Cressie (1993) for details. Kriging is a statistical method which allows one to estimate a quantity Z(s) at spatial location s∈S (S denotes the mapping area), using an interpolation scheme which takes into account the spatial correlation structure and allows in addition to quantify the estimation uncertainty. Namely, starting from K measurements (Z(sk))1≤k≤K coming from K monitoring sites (sk)1≤k≤K, the prediction $\hat{Z}(s_0)$ at location s0 is a linear combination of data:

$$\hat{Z}(s_0) = \sum_{k=1}^{K} \lambda_k \, Z(s_k) \qquad (1)$$

The underlying spatial interpolation model is of the following general form:

$$Z(s) = \mu(s) + \varepsilon(s), \quad s \in S \qquad (2)$$

where µ(s) is a deterministic function of position and ε(s) is a stationary random function of position, with zero mean and where the structure of spatial dependence is supposed to be known. The function µ(s) is known or unknown. It defines the mean function of Z, and its form specifies the type of kriging and the spatial dependence of ε(s), which is actually given by the variogram analysis. The two terms of the model can be specified a little more. For the mean function, we consider the so–called ordinary kriging, which assumes that µ(s)=m where m is an unknown constant to be estimated.

To model the spatial dependence, the key tool is the variogram, defined as a function of the positive distance h between points:

$$\gamma(h) = \frac{1}{2} \operatorname{var}\big(Z(s+h) - Z(s)\big) = C(0) - C(h) \qquad (3)$$

where C(h)=cov(Z(s),Z(s+h)) is the stationary covariance function. Of course γ(h) is in general unknown and must be estimated prior to the kriging step, by fitting the empirical variogram using a parametric model.

A usual way to use kriging in the spatial context of pollution data is for merging predictions coming from a numerical model and measurements coming from the network of monitoring stations. The idea is to correct the predictions by measurements. Since kriging requires some spatial stationarity, it is usual to try to apply kriging to innovations instead of concentrations.
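As an illustration of the linear predictor of Equation (1) under ordinary kriging, here is a minimal sketch in Python. It is not the implementation used in the paper, which relies on R. It assumes an exponential variogram whose sill and range are fixed by hand rather than fitted to an empirical variogram, planar coordinates, and no nugget effect. The function names (`exp_variogram`, `solve`, `ordinary_kriging`) and the default parameter values are hypothetical.

```python
import math


def exp_variogram(h, sill, range_a):
    """Exponential variogram: sill * (1 - exp(-h / range_a))."""
    return sill * (1.0 - math.exp(-h / range_a))


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ordinary_kriging(sites, values, target, sill=1.0, range_a=50.0):
    """Return (prediction, weights) at `target` from `values` observed at `sites`.

    sill and range_a are assumed known here; in practice they are fitted.
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    k = len(sites)
    # Variogram matrix bordered by the constraint that the weights sum to 1.
    a = [[exp_variogram(dist(sites[i], sites[j]), sill, range_a) for j in range(k)] + [1.0]
         for i in range(k)]
    a.append([1.0] * k + [0.0])
    b = [exp_variogram(dist(s, target), sill, range_a) for s in sites] + [1.0]
    sol = solve(a, b)
    weights = sol[:k]  # the last unknown is the Lagrange multiplier
    return sum(w * z for w, z in zip(weights, values)), weights
```

Because the weights are constrained to sum to one, a constant field is reproduced exactly, and without a nugget the predictor interpolates the data at the monitoring sites.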

Namely, let us consider Zt (s) the pollutant of interest at time t and at the location s of the S domain. At each given time t, we suppose we have, on one hand, the K measurements Zt (sk) for 1≤k≤K coming from the monitoring sites, and on the other hand, Pt (s), for s∈S, the most recent predicted map given by the deterministic model ESMERALDA (ESMERALDA, 2015), which is based on the CHIMERE model (CHIMERE, 2015). One can find in Hodzic et al. (2005) and Honore et al. (2008) detailed studies of the performance of the CHIMERE model. ESMERALDA provides predictions on a grid with a mesh of size 3 km x 3 km for 4 days from day (J–1) to day (J+2), where day J is the current day. In our study, we associate each of the PM10 monitoring sites to the nearest grid point of the ESMERALDA model. Then considering the pseudo–innovations defined by the prediction errors,

$$I_t(s_k) = Z_t(s_k) - P_t(s_k), \quad 1 \le k \le K \qquad (4)$$

it is possible to apply kriging to this innovation process and then obtain estimates $\hat{I}_t(s)$ for s∈S, and then obtain a corrected map:

$$\hat{Z}_t(s) = P_t(s) + \hat{I}_t(s), \quad s \in S \qquad (5)$$

In this work, to model the spatial variance, we use an exponential variogram of the following form:

$$\gamma(h) = \sigma^2 \left(1 - e^{-h/a}\right) \qquad (6)$$

where the sill σ² and the range a are estimated by fitting the empirical variogram. Indeed it allows efficiently fitting most of the situations encountered with our data set. From a computational point of view, it is implemented in R (R Core Team, 2013).

3. Two Methods for Spatial Outlier Detection

The general strategy uses a jackknife type approach and is based on the comparison of the actual measurement with some robust prediction. Two methods for spatial prediction are considered: the first one follows a non–stationary spatial approach by comparing directly the concentration of a given site with the weighted median of the concentrations of its neighbors. The second one follows a stationary spatial approach based on kriging the differences between the current observations and a reference set of past observations or numerical model outputs.

3.1. General principle

The general principle of the detection algorithm is based on comparing the measured concentration to some robust spatial prediction, supposed to be free of outliers. This strategy is a classical one; see for example Rousseeuw and Leroy (2005).

More precisely, the residuals are defined by:

$$R_t(s_k) = Z_t(s_k) - \hat{Z}_t(s_k), \quad 1 \le k \le K \qquad (7)$$

where Zt (sk) is the measured concentration at time t and at the measurement site sk and where $\hat{Z}_t(s_k)$ is some spatial prediction based on the measurements of the neighboring stations, except the kth according to a jackknife type approach.

Hence, starting from the residuals (Rt (sk))1≤k≤K, the basic idea is to compare each residual to two values, a lower bound L(sk), usually negative, and an upper bound U(sk), usually positive. Then three cases can arise:

• All the residuals are inside bounds, then no outlier is detected;
• A unique residual is outside its associated bounds, then the corresponding measurement is an outlier;
• Several residuals are outside their associated bounds. In that case, the measurement corresponding to the maximum residual excess beyond the threshold in absolute value is considered as an outlier. Then the whole detection process is iterated without the corresponding concentration, which is replaced by a missing value.

A natural question is how to choose the rejection limits L(sk) and U(sk). First, we can notice that, as for any testing procedure, the rejection limits are computed under the null hypothesis, corresponding here to validated data. Second, since we consider the concentrations, the residuals are not spatially homogeneous. Then, we choose, at each fixed monitoring station and without spatial normalization, to set thresholds to extreme quantiles of the empirical distribution of residuals on the history of validated data. These limits depend on the station.

Note that the quality and the size of the history of validated data affect the actual performance of the procedure. Further down in the paper, we have chosen for L(sk) and U(sk) the extreme quantiles q0.25%(sk) and q99.75%(sk) of the empirical distribution of (Rt (sk))t∈T where T is a subset of validated data.

The two graphs of Figure 3 describe, for PQV station, the density of validated data together with the detection bounds of the learning set residuals obtained from the NN– and KK–algorithms, superimposed with the residuals associated with invalidated data.

Figure 3. For PQV station, density of learning set residuals obtained from NN–algorithm (a) and KK‐algorithm (b) superimposed with vertical lines for detection bounds (0.25% and 99.75% quantiles). Ticks correspond to the residuals associated with invalidated data for PQV station.

As can be seen, only a few invalidated data are associated with residuals exceeding the detection limits. We can see that there is no relation between the magnitude of the residual and the invalidated data. Residuals associated with invalidated data can be within or outside the bounds. This gives an idea of the difficulty of assessing the proposed methods, since we only have information about data validity and not about typicality.

The procedure involves local spatial predictions based on the measurements of the neighboring stations. Let us examine the two different experimented approaches.
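The decision rule of Section 3.1 (station–specific quantile bounds, then iterative removal of the worst exceedance) can be sketched as follows in Python. This is only an illustration under stated assumptions. The quantile estimator (linear interpolation between order statistics) is our choice, since the text does not specify one. The spatial predictor is passed in as a function, so any jackknife prediction can be plugged in. Stopping after a unique exceedance is our reading of the three cases. All function names are hypothetical.

```python
def empirical_quantile(sample, p):
    """Empirical quantile of level p (linear interpolation between order statistics)."""
    xs = sorted(sample)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def detection_bounds(history, p_low=0.0025, p_high=0.9975):
    """history: {station: residuals on validated data} -> {station: (L, U)}."""
    return {s: (empirical_quantile(r, p_low), empirical_quantile(r, p_high))
            for s, r in history.items()}


def detect_outliers(values, predict, bounds):
    """Flag spatial outliers at one time step.

    values:  {station: measured concentration}
    predict: function(station, other_values) -> jackknife spatial prediction
    bounds:  {station: (L, U)}
    """
    current = dict(values)
    outliers = []
    while len(current) > 1:
        excess = {}
        for s, z in current.items():
            others = {m: v for m, v in current.items() if m != s}
            r = z - predict(s, others)  # residual: measurement minus prediction
            low, up = bounds[s]
            if r < low:
                excess[s] = low - r
            elif r > up:
                excess[s] = r - up
        if not excess:
            break  # all residuals inside bounds
        worst = max(excess, key=excess.get)
        outliers.append(worst)
        if len(excess) == 1:
            break  # unique exceedance: flagged, no further pass
        del current[worst]  # treated as missing in the next pass
    return outliers
```

Removing the worst station before re–testing matters because a single aberrant value contaminates the predictions, and hence the residuals, of its neighbors.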

3.2. Nearest neighbors weighted median

The first prediction method is based on the nearest neighbors weighted median, which considers concentrations, and is called NN–algorithm further in the paper. The prediction is defined by:

$$\hat{Z}_t(s_k) = \text{weighted median}\,\{ Z_t(s_m) \ \text{for} \ s_m \in N(s_k) \} \qquad (8)$$

where N(sk) denotes the set of neighbors for the site sk and where the weighted median is simply the ordinary median for a finite number of weighted observations (the weights summing to 1). Of course, since the spatial correlation of the concentrations is not homogeneous, the sets of neighbors depend on the specific location s, as well as the associated weights.

Then, we need a method to determine the sets of neighbors and the corresponding weights. Two different methods have been used. The first one is not fully automatic: a reference needs to be constructed and validated by experts. The second one is a fully automatic variant based on variable importance indices provided by random forests (Genuer et al. 2010), whose results appear to be consistent with the former one. It should be noted that such a procedure to define the neighborhoods implies that the number of neighbors depends on the site and leads to non–necessarily symmetric neighborhoods, i.e. sm∈N(sk) does not imply that sk∈N(sm). It is consistent with the lack of spatial stationarity of the pollution phenomenon.

3.3. Kriging increments

The second prediction method is based on kriging. The idea is to use for the prediction of Zt (sk) the output of a kriging model, similar to the one described in Section 2.4, but fitted by excluding the site sk, i.e. only based on the other sites (Z(sm);sm≠sk).

To apply kriging, spatial stationarity is required and this assumption is not satisfied in the air pollution framework. Then we cannot directly consider the PM10 concentrations. However, this property can be approximately satisfied using some pseudo–innovations instead of concentrations, i.e. the differences between actual observations and some quantities reflecting the mean value of local concentrations.

Two variants are considered leading to two algorithms:

• The first one (denoted KK–algorithm) considers differences with the most recent validated data network measurements, typically at time t–1;
• The second one (denoted KM–algorithm) considers differences with the predicted values at time t coming from the most recent ESMERALDA deterministic model outputs.

We denote by Kx–algorithms the prediction methods (KK– and KM–) based on kriging pseudo–innovations of differences with a validated network.

Let us be a little more explicit, and assume that the data (Zt–1 (sk))1≤k≤K for time t–1 are available and free of outliers. Then for the prediction of Zt (sk) we consider the jackknife subset of measurements (D(sm)=Zt(sm)–Zt–1(sm);sm≠sk), we estimate the exponential variogram and construct kriging optimal weights to predict D(sk) by $\hat{D}(s_k)$. Finally, getting back to concentrations, we deduce an estimate $\hat{Z}_t(s_k) = Z_{t-1}(s_k) + \hat{D}(s_k)$. This estimate can then be compared with the actual measurement Zt (sk) by computing the residual.

It should be noted that even if kriging increments or pseudo–innovations removes the major part of the mean field and a significant part of non–stationary aspects, some adaptation to the site remains since K different kriging procedures are necessary to get the K residuals and perform the outlier diagnosis.

4. Results and Discussion

We study in this section the performance of the algorithm based on the neighbors (NN–algorithm) and the two variants based on the kriging method (Kx‐algorithms).

4.1. Algorithm specifications

Let us specify the main elements necessary for the implementation of the algorithms. For the NN–algorithm, for each station, we must choose neighbors and weights as well as quantiles for decision rules. Neighbors and weights are obtained using the variant based on random forests and are not reported in the paper. Essentially, two situations arise. The first one corresponds to weights of the neighbors of the same order of magnitude (see the rural station MRA for a typical example). The second one exhibits one or two large values corresponding to sites of the same town (see the Rouen station JUS for a typical example). A graphical illustration of these two situations can be found in Figure 4.

Figure 4. Two typical situations for weighted neighbors, JUS and MRA stations.

Quantiles are calculated from the learning set (see Section 2.3 for definition) and are reported in Table 2. For each algorithm and for each station, we calculate the residuals and then derive the empirical quantiles of level 0.25% and 99.75%.

The parameters of the Kx–algorithms are as follows: the theoretical variogram is the exponential one with a range and a sill estimated from the data at each time period. The R package geoR (Ribeiro and Diggle, 2001) is used to estimate the parameters of the exponential variogram and the R package gstat (Pebesma, 2004) is used to predict the value at a given station.

As previously mentioned, two variants of the kriging algorithm are used. They differ in the choice of the network considered for the computation of the pseudo–innovations. More precisely:

• KK–algorithm uses a kriged version of the network at time (t–1) if necessary, i.e. unavailable or invalidated data are replaced by the predictions coming from the kriging algorithm;
• KM–algorithm uses the predicted values at time t coming from the most recent ESMERALDA model output.
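To make the NN–algorithm prediction of Section 3.2 concrete, the following Python sketch computes a weighted median over the neighbors of a station. It is a minimal illustration under stated assumptions. The weighted median is taken as the smallest value whose cumulative weight reaches half of the total weight, which is one common convention that the text does not spell out. When a neighbor is missing, the remaining weights are implicitly renormalized. The station weights in the usage note are invented for the example and are not those of the paper.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(w for _, w in pairs)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= half:
            return v
    raise ValueError("weighted_median requires at least one observation")


def nn_prediction(station, measurements, neighbors):
    """Jackknife prediction at `station` from its available neighbors.

    measurements: {station: concentration, or None when missing}
    neighbors:    {station: {neighbor: weight}} (a station is never its own neighbor)
    Returns None when no neighbor is available.
    """
    avail = [(measurements[m], w) for m, w in neighbors[station].items()
             if measurements.get(m) is not None]
    if not avail:
        return None
    vals, wts = zip(*avail)
    return weighted_median(vals, wts)
```

For instance, with hypothetical weights `{"JUS": {"PQV": 0.6, "POS": 0.3, "EVT": 0.1}}` and measurements `{"JUS": 40.0, "PQV": 22.0, "POS": 18.0, "EVT": None}`, the prediction at JUS is the PQV value, and the residual is the JUS measurement minus that prediction. Since the neighborhoods are defined per station, `neighbors` does not need to be symmetric.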

The quantiles associated with the two variants are given in Table 2. We can notice that the values of the lower and upper quantiles of the distribution of the residuals are much more symmetric with the KK–algorithm (see Figure 3 for PQV station).

Let us close this section with a remark about the quality of the fit. We use the root–mean–square error (RMSE) to quantify it. This measurement is defined as the square root of the mean over time of squared differences between values predicted by the model and the values actually observed at a given spatial location:

$$\mathrm{RMSE}(s_k) = \sqrt{\frac{1}{|T|} \sum_{t \in T} \big(\hat{Z}_t(s_k) - Z_t(s_k)\big)^2} \qquad (9)$$

where T denotes the set of time instants considered. One can find in Table 2 the RMSE obtained on the learning set at each station for the three algorithms.

The KK–algorithm leads to the best fit, with an RMSE around 4 µg m–3 (between 3.58 and 5.92 µg m–3 in Table 2), to be compared with RMSE values generally larger than 6 µg m–3 for the two other algorithms. The NN–algorithm leads for 10 stations to a better RMSE than the KM–algorithm.

4.2. Results

It is important to note that the quality of the historical data is not really high since only a small part of data can be considered as true outliers from the statistical viewpoint. Indeed, as mentioned in Section 2.2, there are various other sources of invalidation and our methods are designed to detect spatial outliers only. Nevertheless, the application to the test period, i.e. the set of all time instants that are not used to compute the different detection bounds (2 064 hours), allows a retrospective analysis of historical data, which opens the assessment of the methods by comparing the diagnostics coming from the statistical detection rules with the analysis of the experts. We can find the results obtained for each algorithm in Table 3.

Table 2. Detection bounds for the three different algorithms and root–mean–square error (RMSE) of predictions on the learning set, for the 14 stations

               NN–algorithm               KK–algorithm               KM–algorithm
Station     Qinf     Qsup    RMSE      Qinf     Qsup    RMSE      Qinf     Qsup    RMSE
AIL       –24.77    35.77    7.87    –18.65    18.31    4.51    –36.85    32.58    8.87
ALE       –24.00    23.77    6.22    –14.02    11.21    3.58    –25.27    22.83    6.64
CAE       –18.77    19.00    5.58    –16.97    14.79    4.14    –26.57    20.80    7.25
CHD       –42.36    47.00   11.44    –23.87    26.28    5.92    –41.06    45.07   11.69
EVT       –17.00    31.30    5.78    –19.67    18.02    4.42    –14.44    35.06    6.85
HRI       –17.00    44.71    7.47    –19.28    18.85    4.55    –25.84    32.34    6.84
IFS       –18.00    40.36    6.83    –15.17    18.35    4.30    –21.48    33.04    7.91
JUS       –14.00    25.53    5.64    –21.97    21.57    4.37    –23.17    18.94    5.79
LIS       –15.00    41.53    9.72    –18.38    20.38    5.03    –21.88    35.39    8.71
MRA       –24.00    20.77    7.16    –16.84    18.84    4.22    –23.86    22.49    7.60
NEI       –17.00    32.00    6.61    –19.39    17.12    4.65    –16.10    34.96    6.51
POS       –18.77    23.77    5.29    –14.47    22.71    4.03    –24.88    35.97    6.24
PQV        –8.77    38.30    8.14    –17.67    22.05    4.46    –14.45    36.59    6.55
SLO       –19.00    30.30    6.73    –15.47    14.71    3.83    –24.27    24.70    7.25
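The RMSE reported in Table 2 is the square root of the mean over time of the squared prediction errors at one station. A minimal Python sketch of this computation is given below. Skipping time steps where either value is missing is our assumption, since the text does not describe how gaps are handled, and the function name is hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two aligned series at one station.

    Time steps where either value is None (missing) are skipped.
    """
    pairs = [(p, o) for p, o in zip(predicted, observed)
             if p is not None and o is not None]
    if not pairs:
        raise ValueError("no time step with both a prediction and an observation")
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))
```

Since the RMSE squares the errors, a few large prediction errors weigh heavily on the value obtained for a station.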

Table 3. Results of algorithms of outlier detection on the test period

            Invalidated data diagnosed     Number of         Validated data diagnosed
            as spatial outlier             invalidated       as spatial outlier
Station      NN      KK      KM            data               NN      KK      KM
AIL           1       2       1             19                16       9       0
ALE           1       3       0              9                 3      16       4
CAE           1       1       1              6                21      19      14
CHD           5       8       5             19                 6      11      14
EVT           3       2       2             25                11       9      14
HRI           2       2       0             25                 4       7       5
IFS          14      25      14             51                 7       8       3
JUS          13       7      10            242                11       1       6
LIS          19      18      18             28                24      21      12
MRA           3       9       3             21                15       7      19
NEI           5       5       6             38                13      12       7
POS          14      21      13             47                12      14       7
PQV          21      17      18             80                36      16      29
SLO           2       2       2              6                 4      19      13


For each station, the number of invalidated data diagnosed as outliers and the number of validated data diagnosed as outliers for the most “active” method are in italic boldface. It is clear that globally it is very difficult to consider that a method is uniformly more efficient than another. However, with respect to the detection objective, the methods are similar in quality while a great variability occurs for the validated data diagnosed as outliers. In addition, the detection performance varies among stations. This comes from two facts. First, as the prediction methods are based on spatial algorithms, involving for each station the use of a specific neighborhood, the quality of the prediction methods depends on the station (see the values of the RMSE in Table 2) and leads to detection performance which differs among stations. Second, the number of validated data depends on the station since different sensors are used; some are more fragile than others, and therefore subject to more frequent failures or metrological problems.

For six months, these procedures have been implemented in the monitoring network, helping the experts but of course without any automatic decision rule. To illustrate the difference between the methods, let us consider three typical examples.

Further in this section, we examine in detail the behavior of the three algorithms on three recent cases: a typical invalidated data diagnosed as outlier, then a typical validated data diagnosed as outlier and finally a special case of local pollution episode.

JUS, MRA, SLO, AIL, POS) and 5 (EVT, SLO, MRA, AIL, POS) outliers while KK–algorithm only detects SLO as outlier.

The reason is that the KK–algorithm is based on the 1–hour increments, which are uniformly small due to the 1–hour persistency of the pollution episode, while the others failed to capture it. The NN–algorithm generally ignores time and the pollution is, in that case, far from spatial uniformity. The KM–algorithm uses increments with a flat reference map.

4.3. Discussion

The quality and the size of the history of available data naturally introduce some limitation to a fair evaluation of the performance of the procedures. New data with a suitable tag corresponding to the status of outlier coming from a large–scale operational implementation could improve the methods according to the following remarks.

Several parameters or steps can be modified in the detection algorithm. Let us cite a few of them. First, the value of quantiles used (0.25%, 99.75%) can of course be modified according to the detection objectives and the localization of the stations, for example. The method used to estimate the selected quantiles can also be reconsidered. For instance, a bootstrap method can be used to estimate the quantiles of the distribution of the residuals.
Second, the subset of observations considered for the
First, let us examine on the left of Figure 5, the map of January estimation of quantiles: the validated data only or all values except
13, 2014 at 12h UTC. It contains a red spot which is abnormal with missing values or some set selected according to some stratified
respect to the real situation: a concentration of 107 µg m–3 is sampling?
measured at JUS station while all the other measurements are less
than 35 µg m–3. This high value has been a posteriori invalidated by Third, about the kriging strategy: the kriging differences
Air Normand. Unsurprisingly, the three detection algorithms have between the actual map (in fact only a network) and the last
detected this high value as an outlier. This scenario typically measured valid map (typically 1–hour before) is of course a
corresponds to the ideal situation in which algorithms have helped sensible choice, and when the last valid map is outdated, the idea
to avoid the publishing of a map that could lead to a to use the map at time t–1, eventually obtained from kriging the
misunderstanding by the public. available validated observations is the current choice. These
aspects, which are typically operational ones, open up some
Second, let us consider the map of January 22, 2014 at 14h prospects: more generally we could also think to generate maps by
UTC (see the middle part of Figure 5). This map shows some bootstrap (instead of jackknife) or by conditional simulation
pollution over the City of Caen. We observe a level of 51 µg m–3 at techniques in order to obtain maps free of outliers. In our case,
CAE station whereas the other stations measure concentrations however, and at this scale, with a small number of sites, we prefer
less than 22 µg m–3. The three algorithms have detected this restricting our strategy to jackknife for the time being.
relatively high value as an outlier. However Air COM did not
invalidate this measurement deeming it corresponds to real local In addition, mention that in the future we hope to have at our
pollution. disposal historical data covering a longer period and conveniently
tagged concerning the outlying status. We could then divide the
Finally, let us consider another case corresponding to a pollu‐ dataset in three subsets, as it is classical, in machine learning. The
tion episode leading to health advices to the public. We examine first one, the learning set, is used to compute the weights of the
the map of December 11, 2013 at 11h UTC (see right side of the neighbors in the NN–method (see Section 3.2) and the upper and
Figure 5). We can see high concentrations (up to 118 µg m–3) in the lower limits of the detection procedure (see Section 3.1). The
east of Normandy and an isolated hot spot (SLO, 90 µg m–3) located second one, the validation set, absent in the present paper, will be
in the west of Normandy. Air Normand and Air COM have classified used to determine the best parameters of the model used to
all the measurements as valid while the three algorithms have compute the weights and also the best choice of the quantiles.
detected many outliers. However the behavior of the three Finally the last subset, the test set, will be used to evaluate the
algorithms is very different and instructive. For this special detection performance of the proposed method.
situation, the NN and KM algorithms respectively detect 6 (EVT,
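The three-subset scheme outlined in the discussion can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used by Air Normand: the split fractions, the candidate quantile levels, the F1 selection criterion and all function names are hypothetical, residuals are assumed to be measurement minus spatial prediction for one station with no missing values, and the detection limits are estimated on validated learning data only (one of the options raised above).

```python
import numpy as np


def chronological_split(n_hours, learn_frac=0.6, valid_frac=0.2):
    """Return slices for the learning, validation and test sets, in time order."""
    if learn_frac <= 0 or valid_frac <= 0 or learn_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    learn_end = int(n_hours * learn_frac)
    valid_end = int(n_hours * (learn_frac + valid_frac))
    return slice(0, learn_end), slice(learn_end, valid_end), slice(valid_end, n_hours)


def detection_limits(residuals, alpha):
    """Empirical lower and upper limits: the alpha and 1 - alpha quantiles."""
    low, high = np.quantile(residuals, [alpha, 1.0 - alpha])
    return low, high


def flag(residuals, low, high):
    """Boolean mask of residuals falling outside the detection limits."""
    return (residuals < low) | (residuals > high)


def f1_score(flags, labels):
    """F1 score of the flags against the expert tags (assumed criterion)."""
    tp = np.sum(flags & labels)
    fp = np.sum(flags & ~labels)
    fn = np.sum(~flags & labels)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def select_and_evaluate(residuals, labels, alphas=(0.001, 0.0025, 0.005, 0.01)):
    """Choose the quantile level on the validation set, then score it on the test set."""
    residuals = np.asarray(residuals, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    learn, valid, test = chronological_split(residuals.size)
    # Limits are estimated on learning data that the experts validated.
    clean = residuals[learn][~labels[learn]]
    best_alpha, best_score = None, -1.0
    for alpha in alphas:
        low, high = detection_limits(clean, alpha)
        score = f1_score(flag(residuals[valid], low, high), labels[valid])
        if score > best_score:
            best_alpha, best_score = alpha, score
    low, high = detection_limits(clean, best_alpha)
    test_score = f1_score(flag(residuals[test], low, high), labels[test])
    return best_alpha, test_score
```

The split is chronological rather than random so that the test set always lies after the data used to fit and tune the limits, which mimics operational use on hourly series.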

Figure 5. (a) Map of January 13, 2014 at 12h UTC, (b) map of January 22, 2014 at 14h UTC, and (c) map of December 11, 2013 at 11h UTC.
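The discussion suggests that a bootstrap method could be used to estimate the quantiles of the residual distribution. A minimal sketch of one way to do this is given below; it is an assumption-laden illustration rather than the authors' implementation. The function names, the number of bootstrap replicates and the choice of averaging the replicate quantiles are hypothetical, and the residuals are assumed to be the differences between measurements and spatial predictions for a single station.

```python
import numpy as np


def bootstrap_quantile_limits(residuals, lower=0.0025, upper=0.9975,
                              n_boot=500, seed=0):
    """Bootstrap estimates of the lower and upper detection limits.

    The residuals are resampled with replacement n_boot times; the two
    quantiles are computed on each replicate and then averaged.
    """
    rng = np.random.default_rng(seed)
    residuals = np.asarray(residuals, dtype=float)
    residuals = residuals[~np.isnan(residuals)]  # drop missing values
    n = residuals.size
    if n == 0:
        raise ValueError("no non-missing residuals available")
    # Index matrix of shape (n_boot, n): one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    replicates = residuals[idx]
    # Result has shape (2, n_boot): row 0 is the lower quantile, row 1 the upper.
    q = np.quantile(replicates, [lower, upper], axis=1)
    return float(q[0].mean()), float(q[1].mean())


def flag_outliers(residuals, low, high):
    """Flag residuals outside [low, high]; missing values are never flagged."""
    residuals = np.asarray(residuals, dtype=float)
    return (residuals < low) | (residuals > high)
```

The default levels correspond to the (0.25%, 99.75%) quantiles quoted in the text. The spread of the replicate quantiles could also be inspected to judge how stable the limits are for a given length of history.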


5. Conclusions

We have developed a general statistical outlier detection strategy from a spatial point of view. It uses a jackknife type approach and is based on the comparison of the actual measurement with some robust prediction. Two ways to handle spatial prediction are considered: the first one is based on the median of nearest neighbors, which directly considers concentrations, while the second one is based on kriging increments, instead of more traditional pseudo–innovations, with two different variants. The proposed methods have been applied to the hourly PM10 concentrations of the monitoring network in Normandy (France).

As a result, the current practice is to compute all three indicators coming from the different variants and to alert the technicians, who take the final decision on whether to keep the data, when it is possible. In off–line mode, the provided methods are currently used as new tools for the validation of data.

In a fully automatic mode for the construction of maps, especially during the night, the problem of the decision rule is under consideration. An adaptive solution could be to follow the best method up to the current hour for each station.

In addition to an effective implementation of a real–world solution and an example of successful collaboration between academics and experts in air quality, the originality of our statistical contribution is, to our knowledge, twofold. First, about the nearest neighbors method, our idea is to introduce the historical data, that is to say the time, both in the choice of limits of detection and in the definition of the neighborhoods by station with the weights associated to neighbors. Second, about the kriging method of pseudo–innovations, our idea is to introduce time for the calculation of detection bounds and to work on kriging increments, which can be considered as a crude way to handle an autoregressive modeling of the map of concentrations.

Acknowledgments

This work comes from a scientific collaboration between Air Normand (see the website http://www.airnormand.fr) from the applied side and Orsay University and INSA Rouen from the academic side. We would like to thank Veronique Delmas, from Air Normand, for providing the problem and the data, as well as for supporting the statistical study. In addition, we would also like to thank the three anonymous referees for their thorough comments and suggestions, which really helped to improve the clarity of the paper.

References

Barnett, V., 2004. Environmental Statistics: Methods and Applications, Wiley Series in Probability and Statistics.
Ben–Gal, I., 2005. Outlier detection, in Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, edited by Maimon, O., Rokach, L., Kluwer Academic Publishers, pp. 131–146.
CEN (European Committee for Standardization), 2015. www.cen.eu, accessed in January 2015.
Cerioli, A., Riani, M., 1999. The ordering of spatial data and the detection of multiple outliers. Journal of Computational and Graphical Statistics 8, 239–258.
Chandola, V., Banerjee, A., Kumar, V., 2009. Anomaly Detection: A Survey. ACM Computing Surveys 41, Article 15.
CHIMERE (The Chimere Chemistry‐Transport Model), 2015. http://www.lmd.polytechnique.fr/chimere/chimere.php, accessed in January 2015.
Cressie, N., 1993. Statistics for Spatial Data, Revised Edition, John Wiley and Sons, New York.
de Fouquet, C., Malherbe, L., Ung, A., 2011. Geostatistical analysis of the temporal variability of ozone concentrations. Comparison between CHIMERE model and surface observations. Atmospheric Environment 45, 3434–3446.
Efron, B., 1982. The Jackknife, the Bootstrap, and Other Resampling Plans. SIAM, Philadelphia.
ESMERALDA (EtudeS MultiRegionALes De L’Atmosphere), 2015. http://www.esmeralda-web.fr, accessed in January 2015.
Filzmoser, P., Ruiz–Gazen, A., Thomas–Agnan, C., 2014. Identification of local multivariate outliers. Statistical Papers 55, 29–47.
Genuer, R., Poggi, J.M., Tuleau–Malot, C., 2010. Variable selection using random forests. Pattern Recognition Letters 31, 2225–2236.
Grancher, D., Bel, L., Vautard, R., 2005. Estimation de champs de pollution par adaptation statistique locale et approche non stationnaire. Journal Européen des Systèmes Automatisés 39, 475–492.
Haslett, J., Bradley, R., Craig, P., Unwin, A., Wills, G., 1991. Dynamic graphics for exploring spatial data with application to locating global and local anomalies. The American Statistician 45, 234–242.
Honore, C., Rouil, L., Vautard, R., Beekmann, M., Bessagnet, B., Dufour, A., Elichegaray, C., Flaud, J.M., Malherbe, L., Meleux, F., Menut, L., Martin, D., Peuch, A., Peuch, V.H., Poisson, N., 2008. Predictability of European air quality: Assessment of 3 years of operational forecasts and analyses by the PREV'AIR system. Journal of Geophysical Research: Atmospheres 113, art. no. D04301.
Hodzic, A., Vautard, R., Bessagnet, B., Lattuati, M., Moreto, F., 2005. On the quality of long–term urban particulate matter simulation with the CHIMERE model. Atmospheric Environment 39, 5851–5864.
Kou, Y., Lu, C.T., Chen, D., 2006. Spatial weighted outlier detection. Proceedings of the 2006 SIAM Conference on Data Mining, April 20–22, 2006, Bethesda, Maryland, USA, pp. 614–618.
Laurent, T., Ruiz–Gazen, A., Thomas–Agnan, C., 2012. GeoXp: An R Package for Exploratory Spatial Data Analysis. Journal of Statistical Software 47, 1–23.
Li, Y., Nitinawarat, S., Veeravalli, V.V., 2013. Universal outlier detection. 2013 Information Theory and Applications Workshop.
Lu, C.T., Chen, D., Kou, Y., 2003. Algorithms for spatial outlier detection. Proceedings of the Third IEEE International Conference on Data Mining (ICDM’03), November 19–22, 2003, Melbourne, Florida, USA, pp. 597–600.
Pebesma, E.J., 2004. Multivariable geostatistics in S: The Gstat package. Computers & Geosciences 30, 683–691.
Planchon, V., 2005. Traitement des valeurs aberrantes: Concepts actuels et tendances générales. Biotechnology, Agronomy, Society and Environment 9, 19–34.
R Core Team, 2013. R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
Ribeiro Jr, P.J., Diggle, P.J., 2001. geoR: A package for geostatistical analysis. R News 1, 15–18.
Rousseeuw, P.J., Leroy, A.M., 2005. Robust regression and outlier detection, Volume 589, John Wiley & Sons.
Shekhar, S., Lu, C.T., Zhang, P.S., 2003. A unified approach to detecting spatial outliers. GeoInformatica 7, 139–166.
