Spatial Outlier Detection in the PM10 Monitoring Network of Normandy, France
Atmospheric Pollution Research 6 (2015) 476‐483
ABSTRACT
We consider hourly PM10 measurements from 22 monitoring stations located in Basse–Normandie and Haute–
Normandie regions (France) and also in the neighboring regions. All considered monitoring stations are either urban
background stations or rural ones. The paper focuses on the statistical detection of outliers of the hourly PM10
concentrations from a spatial point of view. The general strategy uses a jackknife type approach and is based on the
comparison of the actual measurement with some robust spatial prediction. Two spatial predictions are considered: the
first one is based on the median of the concentrations of the closest neighboring stations and directly considers
weighted concentrations while the second one is based on kriging increments, instead of more traditional pseudo–
innovations. The two methods are applied to the PM10 monitoring network in Normandy and are fully implemented by
Air Normand (the official association for air quality monitoring in Haute–Normandie) in the Measurements Quality
Control process. Some numerical results are provided on recent data from January 1, 2013 to May 31, 2013 to illustrate
and compare the two methods.
Keywords: Air quality, kriging, nearest neighbors, particulate matter, spatial outlier detection
Corresponding Author:
Jean-Michel Poggi
+33‐0 1 69 15 57 44
+33‐0 1 69 15 72 34
jean‐michel.poggi@math.u‐psud.fr
Article History:
Received: 16 July 2014
Revised: 01 December 2014
Accepted: 02 December 2014
doi: 10.5094/APR.2015.053
1. Introduction

In France, air quality is monitored in each region by an official association. In Normandy (consisting of two regions), Air Normand, based in Rouen, and Air C.O.M. (Air COM for short), based in Caen, monitor air quality. In addition to these primary functions, their role is also to inform the population regarding air quality. Thus, to fulfill their missions, Air Normand and Air COM measure air quality with automatic analyzers scattered throughout the region, and make these measurements publicly available, mainly through the website, to inform the public on exposure to air pollution. Indeed, Air Normand and Air COM work closely together to publish their measurements on a common website (www.airnormand.fr). In particular, measurements are spatially interpolated to produce maps of air quality, also available from the website. More precisely, Air Normand provides a map of air quality on the Normandy region updated every hour. The maps of air quality for two pollutants (O3 and PM10) are obtained by combining hourly measurements of concentrations and the maps provided by the numerical model outputs. Each pollutant is mapped by correcting the numerical model outputs by the measurements provided by the monitoring stations, using assimilation methods (Grancher et al., 2005; de Fouquet et al., 2011). Thus undiagnosed measurement errors could seriously affect the quality of the spatial reconstruction of concentrations, leading to erroneous maps.

The aim of this work is to provide tools for outlier detection in the spatial sense, which could help in the validation of measurements at each specific location of the monitoring network. More precisely, we consider in this paper the problem of spatial outlier detection in the context of particulate matter and especially PM10, which is the most critical pollutant in Normandy, but this is a general pattern and it can be applied to many other contexts.

A short survey of the literature about outliers among a large number of references can be quickly performed. For example, we can first highlight the classical book of Barnett (2004), which contains a chapter especially dedicated to this topic, as well as some survey papers (Ben–Gal, 2005; Planchon, 2005 or more recently Chandola et al., 2009). However, these references are mainly concerned with univariate or multivariate outliers and are not specifically dedicated to the spatial nature of the data. Haslett et al. (1991), as well as Laurent et al. (2012), use analytic tools to explore spatial data and to deduce some outlier detection procedures, as in Filzmoser et al. (2014). In the case of spatial data, a classical distinction is to be made between a “global” atypical value, which consists in reasoning starting from the behavior of the majority of data, and a “local” atypical value, which consists in reasoning from the behavior of the observations that are geographical neighbors. Then four classes of observations can be defined: typical, global atypical only (detected using standard tools), local atypical only and the last one, local and global atypical. In this paper, we are interested in detecting local atypical
© Author(s) 2015. This work is distributed under the Creative Commons Attribution 3.0 License.
Bobbia et al. – Atmospheric Pollution Research (APR) 477
observations. We can notice that a local atypical observation is often defined as an observation that differs from the closest observations, so it is implicitly assumed that the data exhibit a positive spatial autocorrelation. Of course it is important to check that this autocorrelation is realistic in each specific application.

Some references particularly favor the detection of spatial anomalous observations. Cerioli and Riani (1999) define a procedure based on kriging schemes and dedicated to multiple outliers, while Shekhar et al. (2003), Lu et al. (2003) and Kou et al. (2006) develop some simple, intuitive and robust ways to detect spatial outliers. Let us finally mention the very recent paper of Li et al. (2013) about outlier hypothesis testing studied in a universal setting.

The basic idea of the detection algorithm we propose consists in comparing the measured concentration to some spatial prediction, following a classical jackknife type approach (i.e. leaving out the observation at the considered location from the dataset used to calculate the estimate, see Efron, 1982). The decision rule is then based on thresholds coming from the distributions of prediction residuals along time. We consider two methods to perform the detection of spatial outliers depending on the prediction method. The first one, inspired by Lu et al. (2003) and by Kou et al. (2006), follows a non–stationary spatial way by directly comparing the concentration of a given site to the median of the weighted concentrations of its neighbors with respect to a pre–specified neighborhood system. The second one is based on kriging increments, namely the difference between the current observations and a reference set of past observations or numerical model outputs, and not directly on concentrations.

The paper is organized as follows. Section 2 presents the PM10 monitoring network, the Measurements Quality Control process and the PM10 data. Finally some basics about kriging methodology are recalled. Section 3 describes the general principle of the outlier detection procedure and the two methods for spatial outlier detection. Section 4 first presents the results of detection procedures of spatial outliers on some recent database and then proposes a discussion. Finally, some concluding remarks are collected in Section 5.

2. Materials and Kriging Methodology

Normandy (over 3.3 million people in 2008) is located in Northwestern France, along the south coast of the English Channel and at the Northwest of Paris. Normandy is composed of two regions: Basse–Normandie and Haute–Normandie. Haute–Normandie is heavily industrialized with two large urban areas, Rouen and Le Havre, including more than 490 000 and 250 000 inhabitants respectively, while Basse–Normandie is more agricultural with only one urban area of significant size, Caen, with more than 400 000 inhabitants.

2.1. PM10 monitoring network

We have a set of PM10 hourly mean concentrations coming from 22 monitoring stations located in Basse–Normandie and Haute–Normandie regions and also in the neighboring regions (see in Figure 1 the location of PM10 monitoring stations and main cities of Normandy). All considered monitoring stations are either urban background stations or rural ones.

The spatial outlier detection only concerns PM10 measurements coming from stations located in Normandy, namely AIL (near Dieppe), HRI, NEI (Le Havre), JUS, PQV, POS (Rouen), EVT (Evreux) for the Air Normand network (Haute–Normandie) and CHD (Cherbourg), SLO (Saint–Lo), CAE, IFS (Caen), LIS (Lisieux), ALE (Alençon) and MRA for the Air COM network (Basse–Normandie). The last eight stations ARR, BOV, DRE, FRE, LUC, PRU, MAZ and REN are stations of other monitoring agencies of the neighboring regions. Stations AIL, POS, MRA and ARR are rural stations while the others are urban background ones. Monitoring devices are mainly TEOM (Tapered Element Oscillating Microbalance) except for stations CHD, IFS, LIS and MRA where Beta gauges are used.

Figure 1. Location of PM10 monitoring stations and main cities.

2.2. Measurements quality control process

Air Normand and Air COM make measurements freely available and update them at a rate depending on the communication media that is going to be used (i.e., paper report, web). Of course, the most common one is the Internet: measurements are collected at various monitoring stations, stored in the database and automatically published every hour. To ensure a high quality of service, Air Normand and Air COM have developed several procedures for data validation.

First, the maintenance of the analyzers is the primary task of any air quality monitoring association. Technicians regularly calibrate the analyzers, according to the manufacturers’ recommendations and the references prescribed by CEN standards (see www.cen.eu).

Then technicians check measurements twice a day (morning and evening). It is a first validation level performed on a strictly technical basis. At this stage, the physical state of measuring equipment is controlled, leading to three decisions: validation, invalidation or sometimes correction of the data in the database. The website is then updated accordingly. A second level of validation is performed daily by a single expert on a different temporal scale, examining the sequence of measurements on each site. Finally, an environmental validation of the measurements is performed every month during a meeting of experts. At this stage, the measurement network is considered as a whole: rather than reviewing site by site, from a metrological point of view, it is examined spatially.

Of course, the notion of outlier is not necessarily the same as invalidated data. For example, in case of a smooth drift, an expert will invalidate the data even if it is not an outlier. Conversely, an outlier could be a validated data in case of a local pollution episode. In summary, it is not possible to perform the complete validation of measurements in real–time. The scope of this work is to provide some automatic ways to identify possible spatial outliers using statistical methods. These possible outliers will then be invalidated or not.

2.3. Data

For each station, we have hourly or bi–hourly average PM10 concentrations. Each value can be a NaN (Not A Number) when the
concentration is not available or a value with a tag: validated or invalidated. For stations IFS, CHD, MRA and LIS, where average PM10 concentrations are measured every two hours, we have duplicated measurements to obtain sets of data compatible with those of other stations. Note that for stations outside Normandy, every available data is validated.

PM10 concentrations were collected over the period from January 1, 2013 to May 31, 2013 (set of 3 624 hours). We split this set in two parts: learning and test sets. An hour h belongs to the learning set if h and h–1 are validated data and without missing value (across stations). The test set is defined as the remaining hours. Outlier detection algorithms have been developed using data of the learning set (of size 1 560) and evaluated on the test set (2 064 observations).

Boxplots of validated PM10 concentrations of the learning set, for the 22 stations, are presented in Figure 2. One can observe a relatively homogeneous PM10 pollution in the region, which is consistent with the spatial outlier detection algorithm based on neighbors and presented later.

Figure 2. Boxplots of validated PM10 concentrations (in µg m–3) for the 22 stations.

In Table 1, we can find some basic statistics about the invalidated hourly PM10 concentrations, tagged by the Normand networks. The last column of the table gives the number of invalidated data per station. First of all, we can notice there are few invalidated data with respect to the total number of data (3 624 hours). The mean value is not very informative for invalidated data, but some median values are rather low and show that a large part of invalidated data has a normal order of magnitude. We can note the presence of negative values, which are due to either a monitor failure, a device maintenance operation or metrological characteristics of the monitor. We can also note the presence of some huge values in five stations. Technicians automatically invalidate these negative and huge values without further analysis. To give an idea of the usual range of observations for the Normandy region, hourly PM10 concentrations, validated since January 2013, are non negative and less than 300 µg m–3. Finally, it is clear that some invalidated data will be easily tagged as outliers, such as –200 µg m–3 at SLO and 8 513 µg m–3 at LIS. However, many other invalidated data, which have a comparable order of magnitude to validated data, will not be detected as outliers. Of course, the spatial nature of the data is not taken into account in these basic statistics.

Table 1. Summary statistics of PM10 invalidated data (concentrations in µg m–3). The last column stands for the number of invalidated data

Station    Min.   1st Qu.   Median      Mean   3rd Qu.     Max.   N_inval
AIL          –2      7        11       15.21     16.5        90        19
ALE          –3      5         9       11.56     20          28         9
CAE          11     17.25     21       27.33     31.5        60         6
CHD           0      0        35       84.53     59         859        19
EVT          –7      4        12       17.4      33          57        25
HRI          –2      9        12       19.2      27          56        25
IFS           0      7.5      29       24.61     40          53        51
JUS         –13     24.75     39.5     39.78     55.75      119       242
LIS           0      4.5     233     1 224    1 254       8 513        28
MRA           0      0         4       13.43     25          65        21
NEI          –1      4         9.5     20.66     14.75      300        38
POS         –20     28        41      108.4      78.5     1 742        47
PQV          –1      8.75     23       24.24     35          78        80
SLO        –200      8.25     16      100        24.5       735         6

2.4. Kriging methodology

Let us recall some basic facts about the kriging method; the reader can refer to the book of Cressie (1993) for details. Kriging is a statistical method which allows estimating a quantity $Z(s)$ at spatial location $s \in S$ ($S$ denotes the mapping area), using an interpolation scheme which takes into account the spatial correlation structure and allows in addition to quantify the estimation uncertainty. Namely, starting from $K$ measurements $(Z(s_k))_{1 \le k \le K}$ coming from $K$ monitoring sites $(s_k)_{1 \le k \le K}$, the prediction $\hat{Z}(s_0)$ at location $s_0$ is a linear combination of data:

$$\hat{Z}(s_0) = \sum_{k=1}^{K} \lambda_k Z(s_k) \qquad (1)$$

The underlying spatial interpolation model is of the following general form:

$$Z(s) = \mu(s) + \varepsilon(s), \quad s \in S \qquad (2)$$

where $\mu(s)$ is a deterministic function of position and $\varepsilon(s)$ is a stationary random function of position, with zero mean and where the structure of spatial dependence is supposed to be known. The function $\mu(s)$, known or unknown, defines the mean function of $Z$, and its form specifies the type of kriging; the spatial dependence of $\varepsilon(s)$ is actually given by the variogram analysis. The two terms of the model can be specified a little more. For the mean function, we consider the so–called ordinary kriging, which assumes that $\mu(s) = m$ where $m$ is an unknown constant to be estimated.

To model the spatial dependence, the key tool is the variogram, defined as a function of the positive distance $h$ between points:

$$\gamma(h) = \frac{1}{2}\,\mathrm{var}\big(Z(s+h) - Z(s)\big) = C(0) - C(h) \qquad (3)$$

where $C(h) = \mathrm{cov}(Z(s), Z(s+h))$ is the stationary covariance function. Of course $\gamma(h)$ is in general unknown and must be estimated prior to the kriging step, by fitting the empirical variogram using a parametric model.

A usual way to use kriging in the spatial context of pollution data is for merging predictions coming from a numerical model and measurements coming from the network of monitoring stations. The idea is to correct the predictions by measurements. Since kriging requires some spatial stationarity, it is usual to try to apply kriging to innovations instead of concentrations.
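As an illustration of the empirical variogram that has to be estimated before the kriging step, the following Python sketch computes the classical (Matheron) estimator by averaging half squared differences over station pairs grouped by separation distance. It is only a minimal sketch under stated assumptions: the function name `empirical_variogram`, the distance bins and the toy coordinates are hypothetical and are not taken from the paper, and the estimator actually used by the authors is not specified in this section.

```python
import numpy as np

def empirical_variogram(coords, values, bin_edges):
    """Classical (Matheron) empirical variogram (illustrative sketch).

    coords    : (K, 2) array of station coordinates (hypothetical units)
    values    : (K,) array of measurements at one time step, without NaN
    bin_edges : increasing distance bin edges, e.g. [0, 10, 20, 30]
    Returns the mean of 0.5 * (z_i - z_j)^2 per distance bin and the
    number of station pairs falling in each bin.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)        # each pair counted once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    semi = 0.5 * (values[i] - values[j]) ** 2       # half squared differences
    which = np.digitize(dist, bin_edges) - 1        # bin index of each pair
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)                 # NaN for empty bins
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = which == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            gamma[b] = semi[mask].mean()
    return gamma, counts

# Toy usage: three aligned stations (made-up values, not data from the paper)
gamma, counts = empirical_variogram(
    coords=[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    values=[1.0, 3.0, 6.0],
    bin_edges=[0.0, 1.5, 2.5],
)
```

A parametric variogram model would then be fitted to the pairs (bin center, `gamma`), typically giving more weight to bins with larger `counts`.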
Namely, let us consider $Z_t(s)$ the pollutant of interest at time $t$ and at the location $s$ of the $S$ domain. At each given $t$, we suppose we have, on one hand, the $K$ measurements $Z_t(s_k)$ for $1 \le k \le K$ coming from the monitoring sites, and on the other hand, $P_t(s)$, for $s \in S$, for the most recent predicted map given by the deterministic model ESMERALDA (ESMERALDA, 2015) which is based on the CHIMERE model (CHIMERE, 2015). One can find in Hodzic et al. (2005) and Honore et al. (2008) detailed studies of the performance of the CHIMERE model. ESMERALDA provides predictions on a grid with a mesh of size 3 km × 3 km for 4 days from day (J–1) to day (J+2), where day J is the current day. In our study, we associate each of the PM10 monitoring sites to the nearest grid point of the ESMERALDA model. Then considering the pseudo–innovations defined by the prediction errors,

$$I_t(s_k) = Z_t(s_k) - P_t(s_k), \quad 1 \le k \le K \qquad (4)$$

it is possible to apply kriging to this innovation process and then obtain estimates $\hat{I}_t(s)$ for $s \in S$, and then obtain a corrected map:

$$\hat{Z}_t(s) = P_t(s) + \hat{I}_t(s), \quad s \in S \qquad (5)$$

In this work, to model the spatial variance, we use an exponential variogram of the following form:

$$\gamma(h) = \sigma^2 \left(1 - e^{-h/a}\right) \qquad (6)$$

where the sill $\sigma^2$ and the range $a$ are estimated by fitting the empirical variogram. Indeed it allows efficiently fitting most of the situations encountered with our data set. From a computational point of view, it is implemented in R (R Core Team, 2013).

3. Two Methods for Spatial Outlier Detection

The general strategy uses a jackknife type approach and is based on the comparison of the actual measurement with some robust prediction. Two methods for spatial prediction are considered: the first one follows a non–stationary spatial approach by comparing directly the concentration of a given site with the weighted median of the concentrations of its neighbors. The second one follows a stationary spatial approach based on kriging the differences between the current observations and a reference set of past observations or numerical model outputs.

3.1. General principle

The general principle of the detection algorithm is based on comparing the measured concentration to some robust spatial prediction, supposed to be free of outliers. This strategy is a

residual excess beyond the threshold in absolute value is considered as an outlier. Then the whole detection process is iterated without the corresponding concentration, which is replaced by a missing value.

A natural question is how to choose the rejection limits $L(s_k)$ and $U(s_k)$. First, we can notice that, as for any testing procedure, the rejection limits are computed under the null hypothesis, corresponding here to validated data. Second, since we consider the concentrations, the residuals are not spatially homogeneous. Then, we choose, at each fixed monitoring station and without spatial normalization, to set thresholds to extreme quantiles of the empirical distribution of residuals on the history of validated data. These limits depend on the station.

Note that the quality and the size of the history of validated data affect the actual performance of the procedure. Further down in the paper, we have chosen for $L(s_k)$ and $U(s_k)$ the extreme quantiles $q_{0.25\%}(s_k)$ and $q_{99.75\%}(s_k)$ of the empirical distribution of $(R_t(s_k))_{t \in T}$ where $T$ is a subset of validated data.

The two graphs of Figure 3 describe, for PQV station, the density of validated data together with the detection bounds of the learning set residuals obtained from the NN– and KK–algorithms, superimposed with the residuals associated with invalidated data.

[Figure 3, panels (a) and (b): estimated density of residuals, plotted over residual values from -40 to 40.]
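The station-wise threshold rule described in this section (rejection limits set to the 0.25% and 99.75% empirical quantiles of the residuals of validated data, then flagging any residual outside these limits) can be sketched as follows in Python. This is only an illustrative sketch: the function names `rejection_limits` and `flag_outliers` and the synthetic residuals are assumptions made for the example, and the code is not the authors' R implementation. It also leaves out the iteration step in which a detected concentration is replaced by a missing value before the detection is run again.

```python
import numpy as np

def rejection_limits(learning_residuals, lower_q=0.0025, upper_q=0.9975):
    """Return (L, U) for one station: extreme empirical quantiles of the
    residuals computed on validated (learning) data. NaN are ignored."""
    r = np.asarray(learning_residuals, dtype=float)
    r = r[~np.isnan(r)]
    return np.quantile(r, lower_q), np.quantile(r, upper_q)

def flag_outliers(residuals, lower, upper):
    """Boolean mask: True where a residual falls outside [lower, upper].
    Missing residuals (NaN) are never flagged, since comparisons with NaN
    evaluate to False."""
    r = np.asarray(residuals, dtype=float)
    return (r < lower) | (r > upper)

# Toy usage with synthetic residuals (not data from the paper)
learning = np.linspace(-10.0, 10.0, 2001)      # stand-in for R_t(s_k), t in T
L_k, U_k = rejection_limits(learning)
flags = flag_outliers([0.0, -12.0, 9.99, np.nan], L_k, U_k)
```

Because the limits are computed separately for each station, no spatial normalization of the residuals is needed, which matches the choice stated in the text.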