
Atmospheric Environment 45 (2011) 7005e7014


PM10 forecasting using clusterwise regression


Jean-Michel Poggi (a, c), Bruno Portier (b, *)

a Laboratoire de Mathématiques d'Orsay, Université Paris-Sud, 91405 Orsay, France
b Laboratoire de Mathématiques, INSA de Rouen, BP 08, Avenue de l'Université, 76800 Saint-Etienne du Rouvray, France
c Université Paris-Descartes, France

Article info

Article history:
Received 20 May 2011
Received in revised form 7 September 2011
Accepted 9 September 2011

Keywords:
Particulate matter
Forecasting
Clusterwise linear models
Generalized additive models
Random forests
Rouen

Abstract

In this paper, we are interested in the statistical forecasting of the daily mean PM10 concentration. Hourly concentrations of PM10 have been measured in the city of Rouen, in Haute-Normandie, France, a heavily industrialised region located northwest of Paris, near the south side of the Channel (Manche) sea. We consider three monitoring stations reflecting the diversity of situations: an urban background station, a traffic station and an industrial station near the cereal harbour of Rouen. We have focused our attention on data for the months that register higher values, from December to March, in the years 2004-2009. The models are obtained from the winter days of the four seasons 2004/2005 to 2007/2008 (training data) and the forecasting performance is then evaluated on the winter days of the season 2008/2009 (test data). We show that it is possible to accurately forecast the daily mean concentration by fitting a function of meteorological predictors and the average concentration measured on the previous day. The values of observed meteorological variables are used for fitting the models and are also considered for the test data. We have compared the forecasts produced by three different methods: persistence, generalized additive nonlinear models and clusterwise linear regression models. This last method gives very impressive results, and the end of the paper analyzes the reasons for such good behavior.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The atmosphere consists of gases but also of particulate matter condensed in liquid or solid form. The particles have multiple origins and come from natural sources (sea salt, volcanic eruptions, forest fires, wind erosion of soils, ...) as well as from human activities (transport, heating, industry, agriculture, ...). They can also be of secondary origin, i.e., formed by combination according to a complex physicochemical process. Their characteristics are extremely diverse and may vary over time. They are often classified according to a standard size or chemical composition, which often determines the intensity of their health impact.

The problem of air pollution by particulate matter, although complex, is a real public health issue requiring a response from the public authorities. New standards designed to comply with new monitoring requirements, updating the regulation of airborne particles, were adopted by the European Council and, at the national level, by the "tightening up" of the thresholds triggering information, caution and warning procedures respectively. The European regulation sets that the PM10 (particles whose diameter is less than 10 µm) daily mean cannot exceed 50 µg m−3 more than 35 days per year, and cannot exceed an annual mean of 40 µg m−3. Recently, the French government adopted a national plan on particles. Airborne particulate matter such as PM10, measured by Tapered Element Oscillating Microbalance (TEOM) continuous monitors, has been routinely measured in France and in Europe for more than 10 years.

Pollution forecasting for suspended particles in the air is obviously an important issue for the Haute-Normandie region, but also for France as a whole. Indeed, for several years, the limit values for PM10 concentrations have been exceeded in several French regions. The development of a statistical forecasting technique for PM10 concentration, aiming in particular to improve early warning procedures useful for sensitive people, would be an important tool for Air Normand, the local air quality agency, and therefore for the region of Haute-Normandie.

Many references in the literature consider PM10 forecasting; see for example Grivas and Chaloulakou (2006) or the long and detailed introduction of Dong et al. (2009) highlighting methods and models. The references related to the use of statistical approaches can be distinguished according to the adopted model and the predictors involved.

* Corresponding author. E-mail address: bruno.portier@insa-rouen.fr (B. Portier).
1352-2310/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.atmosenv.2011.09.016

A large panel of methods is available, among which we find neural networks, linear regression models, nonlinear parametric or

nonparametric models, but also new strategies such as mixtures of predictions from different models or clusterwise regression models. Let us briefly review some of them.

First of all, neural networks are the most frequently used. The recent paper of Paschalidou et al. (2011) offers a synthesis of neural-network-based forecasting and compares different neural network architectures, including the multilayer perceptron as well as radial basis neural networks. But of course, due to the lack of theoretical results from the statistical viewpoint as well as the low interpretability of this class of black-box models, some alternative strategies have been considered.

Indeed, multiple linear modeling is also frequently used, as in Stadlober et al. (2008), for example, focusing on winter days in the basin areas of the Alps. The results are very satisfactory, even if low values are often overestimated while high values are frequently underestimated. This method is then often considered as a competitor against which to compare a simpler actual scheme.

For example, Slini et al. (2006) use and compare classification and regression trees (CART), neural networks, linear regression and regression on principal components in Thessaloniki, Greece. The conclusion is that CART and neural networks capture PM10 trends, while CART is better for alarm forecasting.

In Milan, Italy, where PM10 pollution is extremely important and easy to predict, Corani (2005) uses a local-polynomial-based nonparametric method to estimate a nonlinear regression model, which delivers good results.

The introduction of clusterwise linear models for PM10 forecasting, promoted in this paper, is already present, for example, in Sfetsos and Vlachogiannis (2010), considering Athens, Helsinki and London. A simple global model is first considered and then a two-step approach is proposed: first, the days are classified; second, a linear model is used inside each class for forecasting. The benefit with respect to a single global model is significant in performance, but also in the interpretability and simplicity of the models. The clusters come from a supervised classification integrating observed PM10, leading to more coherent models inside each class. With respect to this previous work, we propose in this paper a more global way to handle such clusterwise linear regression, simultaneously optimizing the clusters and the local models. The underlying statistical modeling framework is the so-called mixture of linear regressions (see McLachlan and Peel, 2000). We still use in the sequel the expression "clusterwise regression" because it is more intuitive, even if it is a bit of an abuse of language since the induced clusters are designed using the response variable.

The variables used in forecasting schemes may simply combine conventional pollutants and meteorological variables, but often also include model output statistics or constructed variables. We find of course classical variables involving the wind speed and direction, solar radiation (considered in some cases as a precursor of other pollutants), relative humidity, temperatures and temperature gradients, atmospheric pressure and rain. Sometimes, cloud cover and dew point are included in the first set of candidates, before a variable selection step. A persistence term is very often introduced through the average PM10 of the day before. For example, Chaloulakou et al. (2003a) compare a neural network and a multiple linear regression model for PM10 forecasting in Athens. They show that the lagged PM10 is an important predictor: it appears to be as informative alone as the considered set of meteorological variables.

In this paper, we consider daily mean concentrations of PM10 measured in the city of Rouen, in Haute-Normandie, France. Located northwest of Paris, near the south side of the Channel (Manche) sea, the Haute-Normandie region is heavily industrialised. We consider three monitoring stations of the Air Normand network, reflecting the diversity of situations: an urban background station, a traffic station and an industrial station near the cereal harbour of Rouen. We have focused our attention on data for the months that register higher values, from December to March, in the years 2004-2009. The models are obtained from the winter days of the four seasons 2004/2005 to 2007/2008 (training data) and the forecasting performance is then evaluated on the winter days of the season 2008/2009 (test data).

To give some general ideas, let us note that the total numbers of exceedances of the threshold value 50 µg m−3 over the six years at the three sites are respectively about 16, 35 and 13 for winter days and 21, 17 and 44 for the whole period. Considering a smaller threshold value, the average numbers of exceedances of 30 µg m−3 per year are about 20, 42 and 22 for winter days, and twice that for a whole year. The annual averages of PM10 are about 21, 26 and 20 µg m−3, while the levels in the winter season increase by about 2 µg m−3.

We show that it is possible to accurately forecast the daily mean concentration by fitting a function of meteorological predictors and the average concentration measured on the previous day. The values of observed meteorological variables are used for fitting the models and are also considered for the test data. We have compared the forecasts produced by three different statistical models: persistence, the generalized additive model and clusterwise linear models. This last method gives impressive results. Of course, a lot of competitors could be chosen in order to assess the present proposition, including neural networks for example. But since we are primarily interested in understanding the good behavior of the forecasting model, we favor two explicit global models: the simplest one (persistence) as the basic reference, and a second competitor, the generalized additive model, which contains the linear model and is of intermediate complexity. Such models have been recently used in a similar pollution context by Aldrin and Haff (2005) and Barmpadimos et al. (2011) to study the influence of meteorology on air pollution traffic volume or on PM10 trends.

Our paper is organized as follows. Section 2 describes the data, PM10 and meteorological, and recalls some basics about the considered statistical methods. Section 3 presents the results obtained using the different models on the three considered stations, including the choice of predictors and the forecasting performance on test data as well as on training data. Section 4 analyzes more deeply the clusterwise linear model capabilities, focusing on the urban background station. Finally, some concluding remarks are collected in Section 5.

2. Materials and methods

Rouen (latitude 49°25′ N, longitude 1°4′ E) is located in the Haute-Normandie region of northern France, near the south side of the Channel (Manche) sea and northwest of Paris (about 100 km away). Haute-Normandie is heavily industrialised and more than 490,000 people live in Rouen and its agglomeration.

We have focused our attention on data for the months from December to March, in the years 2004-2009. Indeed, during these months, higher values are registered, both on average and in number of exceedances, and the data can be considered homogeneous in terms of the pollution sources which are not included in the models. The data set consists of PM10 daily mean concentrations and many meteorological data over the period 2004-2009. The data are presented in the next two subsections.

The statistical tools used in this work are not classical (more precisely, they are not widely known) but have been recently used for analyzing PM10 pollution in Haute-Normandie (see Jollois et al., 2009 and Bobbia et al., 2011). We recall some elements about these methods in the third subsection.

Finally, it should be noted that published studies for PM10 forecasting in France are very difficult to find. Let us mention the

paper of Zolghadri and Cazaurang (2006), which considers only one station in Bordeaux and uses an extended Kalman filter.

2.1. PM10 data

PM10 data come from three monitoring stations of the Air Normand network. They are located in Rouen city, namely Palais de Justice (JUS), Guillaume-Le-Conquérant (GUI) and Grand-Couronne (GCM). This choice reflects the diversity of situations by considering the urban background station JUS, the roadside station GUI, which is the second most polluted in the region, and the industrial station GCM, which is near the cereal harbour of Rouen, one of the most important in Europe, in order to have the widest possible panel. In Fig. 1, we can find the boxplots of PM10 from each monitoring station. The boxplots are based on the 697 winter days of the six years.

Fig. 1. Boxplots of daily mean PM10 concentrations (in µg m−3) for the three stations.

In addition to these boxplots, Table 1, containing a summary of basic statistics, complements this synthetic view.

Table 1
Summary statistics of daily mean PM10 concentrations (in µg m−3) for the three stations.

                 JUS      GCM      GUI
Minimum          7        5        9
1st Quartile     16       13       19
Median           20       19       26
Mean             22.34    21.43    27.97
3rd Quartile     26       26       33
Maximum          81       88       100
SD               10.17    10.76    12.31
Missing values   10       0        2
No. of values    697      697      697

As can be seen, the three monitoring stations differ in their daily mean PM10 concentration distributions, especially GUI from the two others. In addition, we emphasize that, from our previous studies (see Jollois et al., 2009), they are also very different in terms of the intrinsic difficulty of modeling PM10 concentrations with the same predictors: namely, JUS is easy, GCM is hard and GUI is medium. So this subset of PM10 monitoring stations of the Rouen network captures the meaningful part of the crucial difficulties of the forecasting problem.

2.2. Meteorological data

Meteorological data are provided by Météo-France (the French national meteorological service) and come from one monitoring site located near Rouen. We have retained classical meteorological indicators including the daily minimum (Tmin), mean (Tmoy) and maximum (Tmax) temperature (in °C), the daily total rain (PLsom, in mm), the daily mean atmospheric pressure (PAmoy, in hPa), the daily maximum (VVmax, in m/s) and mean (VVmoy, in m/s) wind speed, the daily most frequently observed wind direction (DVdom, in °), the wind direction associated with the daily maximum wind speed (DVmaxvv, in °) and the daily minimum (HRmin, in percentage), maximum (HRmax) and mean (HRmoy) relative humidity. In addition, we use a measure of the vertical difference between temperatures (GTrouen, in °C), giving us an idea of the mixing height. In Table 2, a summary of basic statistics gives some general information about their distributions. The last column of the table gives the number of missing values, and we note that it is always very small.

Table 2
Summary statistics for meteorological variables. The last column gives the number of missing values.

           Min.    1st Qu.  Median  Mean    3rd Qu.  Max.   SD     Missing
Tmin       -10.8   0.9      1.7     1.7     4.6      12.1   3.87   2
Tmoy       -6.7    1.9      4.9     4.8     7.6      14.3   4.11   2
Tmax       -2.5    4.7      7.7     7.6     10.6     22     3.8    2
VVmoy      1.2     3.0      4.2     4.6     5.9      11.9   2      1
VVmax      2       5        7       7.2     9        17     2.8    1
DVdom      0       90       180     162     225      315    98.7   0
DVmaxvv    10      100      200     185     260      360    98     1
PAmoy      980     1011     1019    1018    1028     1042   11.8   2
HRmin      24      65       73      71.61   80       98     8.4    8
HRmoy      46      81.18    87      85      91       99     4.9    3
HRmax      56      92       95      93.8    97       100    2.2    3
GTrouen    -1.5    0.3      0.8     1.3     2        14     3.8    0
PLsom      0       0        0       1.88    2        28     3.8    0

2.3. Methodologies

We give here some elements of the statistical tools used in this paper.

2.3.1. Variable selection using random forests

In the forecasting problem, the choice of well-suited predictors is of course crucial. To deal with this difficult problem, we use a recent, highly nonlinear and nonparametric statistical method introduced by Breiman (2001) and called random forests. A random forest is an aggregation of many binary decision trees obtained using the CART method (see Breiman et al., 1984) or some unpruned version of CART.

In this context, the CART model is a binary decision tree, very easy to understand and very general: input data do not need to be Gaussian, and nonlinear relationships between the response and the explanatory variables can be handled. A model is built by first performing a growing step, recursively partitioning the set of observations by choosing the best split (a split is defined by a variable and a threshold) dividing the current set of observations in two subsets, minimizing the local internal variance of the response. To prevent overfitting, a pruning step is then performed to select a convenient subtree. To cope with the instability of this method with respect to perturbations of the training data set, Breiman (2001) introduced a new method called Random Forests (RF for short). The principle of RF is to combine many binary decision trees built using several bootstrap samples coming from the set of

observations and choosing randomly at each node a subset of explanatory variables: first, at each node, a given number of input variables are randomly chosen and the best split is calculated only within this subset; second, no pruning step is performed, so all the trees are maximal trees. To evaluate the quality of the fitted model, it is common to estimate the generalization error on a test set. Breiman introduced the OOB scheme: the error is estimated through the Out-Of-Bag (OOB) error, calculated along the iterations of the algorithm. The OOB error corresponds to the prediction error for the data not belonging to the bootstrap sample used to build the tree.

For sure, the RF method builds a black-box prediction model, difficult to interpret, but one which efficiently computes individual importance scores for the regressors. The quantification of variable importance is an important issue. In the RF framework, the most widely used importance score of a given variable is the mean increase of the error of a tree in the forest (quantified by the mean square error) when the observed values of this variable are randomly permuted in the OOB samples. The idea is simple: if a variable is not important, then a random change would not degrade the prediction too much, while it hugely deteriorates it if the variable is important: the higher the importance, the stronger the variable's influence.

We perform the statistical analyses using R (http://www.r-project.org/) with the associated R package randomForest; see Liaw and Wiener (2002).

2.3.2. Nonlinear additive models

Nonlinear additive models have been introduced by Breiman and Friedman (1985) and have been widely used since the work of Hastie and Tibshirani (1990). These models are more flexible than traditional linear models since they allow nonlinear effects to be modeled instead of simple linear ones, while retaining the additivity property, which preserves the ease of interpretation through the separability of the effects of each regressor. For these reasons, such models are attractive for modeling and forecasting problems (see for example Aldrin and Haff, 2005 and Barmpadimos et al., 2011).

To estimate nonlinear additive models, we use the R package mgcv developed by Wood (2006). Nonlinear functions are estimated using the backfitting algorithm based on penalized regression splines.

2.3.3. Clusterwise linear models

Mixture models, Gaussian ones in particular, are widely used in statistics, including for classification (see McLachlan and Peel, 2000). They have recently been extended by mixing standard linear regression models. The main hypothesis is that observations come from a mixture of s components in some unknown proportions, and in each component observations are modeled using a linear regression model. The purpose is then to estimate the parameters of each linear model and the parameters defining the components. In the clustering context, each object is supposed to be generated by one of the components of the mixture model being fitted. The partition is derived from these parameters using the maximum a posteriori (MAP) principle, from the posterior probabilities for an object to belong to a component.

Finite mixture models with a fixed number of components are usually estimated, in the maximum-likelihood framework, using the expectation-maximization (EM) algorithm (Dempster et al., 1977). This algorithm iteratively repeats two steps until convergence: the E step computes the conditional expectation of the complete log-likelihood, and the M step computes the parameters maximizing the complete log-likelihood. The number of components is generally unknown and needs to be estimated. A classical approach is to fit models with various numbers of components and to compare them using the BIC criterion (Schwarz, 1978), for example.

To compute a mixture of linear regressions, we use the flexmix R package, described in Leisch (2004) and Grün and Leisch (2007). This package has recently been extended to handle more general models in each cluster (see Grün and Leisch, 2008), but this is out of the scope of this paper.

Instead of talking about mixtures of linear regressions, we use in the sequel the expression "clusterwise regression" because it is more intuitive, even if it is a bit of an abuse of language since the induced clusters are designed using the response variable.

3. Results and discussion

Let us now present the different results obtained for the two classes of models on the three stations considered. For each station, we first give the forecasting performance obtained on the training data (the winter days of the seasons 2004/2005 to 2007/2008) and then the forecasting performance on the test data (the winter days of December 2008 to March 2009).

Usual statistical indices are used to provide indications of the quality of the prediction or the forecast with respect to the observed PM10 concentration. We must distinguish (see for example Chaloulakou et al., 2003b for a detailed definition of these indicators) those based on forecast errors from those based on level exceedances.

In the first category: the percentage of explained variance (EV), given by 1 minus the ratio of the residual variance and the observed PM10 variance; the correlation coefficient between the prediction (or forecast) and the observed PM10 concentration (R); the mean absolute percentage error (MAPE); the root mean square error (RMSE); the index of agreement (IA); and finally the skill score (SS). More precisely, IA gives the degree to which model predictions are error free (range [0, 1], with a best value of 1); SS measures the relative improvement with respect to the persistence forecast (range [−1, 1], and a value of 0.5 or more indicates a significant improvement in skill). Note that the percentage of explained variance EV does not exactly correspond to the square of the correlation coefficient R, even if this is true for linear models.

In the second category, classical indicators are the probability of detection (POD, range [0, 1] with a best value of 1), the false alarm rate (FAR, range [0, 1] with a best value of 0) and the threat score (TS, range [0, 1] with a best value of 1). Let us mention that the threshold value used for computing POD, TS and FAR is set to 30 µg m−3 instead of 50 µg m−3, in order to take into account the fact that TEOM measurements do not integrate the volatile fraction.

3.1. Choice of predictors

The choice of the predictors used in the models is based on the random forest variable importance. More precisely, the variables are ranked for each station using the average importance over twenty random forests, to minimize sampling effects.

We first examine the problem of the choice of lagged PM10 (denoted by PM10hier), which plays a crucial role especially when the pollution level is high. To illustrate it, we examine, for the JUS station only, the average importance of variables based on 20 random forests in three different situations, collected in Table 3. The first column involves pollutants (NO, NO2 and SO2) and meteorological variables, the second one adds the lagged PM10 (PM10hier, in italics) to this set of variables and the last column is obtained by removing the pollutants.

Table 3
JUS station. Importance of variables based on 20 random forests and quantified by the mean increase of the mean square error of a tree in the forest when the observed values of this variable are randomly permuted in the OOB samples.

Pollutants and meteo      Adding PM10hier        Removing NO, NO2, SO2
NO2         26            NO          24         PM10hier    25
NO          25            NO2         24         VVmoy       22
SO2         16            PM10hier    21         GTrouen     22
VVmoy       15            VVmoy       14         VVmax       12
PAmoy       14            SO2         13         PAmoy       12
GTrouen     14            GTrouen     12         PLsom       10
HRmoy       12            PAmoy       12         Tmoy         8
PLsom       12            PLsom       12         Tmin         8
VVmax       10            HRmoy       11         DVdom        6
DVmaxvv      9            VVmax        9         Tmax         6
Tmoy         9            Tmoy         7         DVmaxvv      5
Tmax         8            DVmaxvv      7         HRmoy        5
HRmax        8            HRmax        7         HRmax        5
Tmin         7            Tmax         7         HRmin        2
HRmin        7            Tmin         6
DVdom        6            DVdom        5
                          HRmin        4
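The importance scores of Table 3 come from the permutation scheme described in Section 2.3.1. The paper uses the R package randomForest; the generic recipe can be sketched in Python as follows (a minimal illustration, not the paper's code; `predict` stands for any fitted model and all names are ours).

```python
import random

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase of MSE when each column of X is randomly permuted.

    predict: a fitted model, mapping a list of rows to predictions
    X: list of rows (lists of floats); y: list of observed targets
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    base_mse = sum((yh - yi) ** 2 for yh, yi in zip(predict(X), y)) / n
    importance = []
    for j in range(p):
        increases = []
        for _ in range(n_repeats):
            Xp = [row[:] for row in X]          # copy the data
            col = [row[j] for row in Xp]
            rng.shuffle(col)                    # permute column j only
            for row, v in zip(Xp, col):
                row[j] = v
            mse = sum((yh - yi) ** 2 for yh, yi in zip(predict(Xp), y)) / n
            increases.append(mse - base_mse)
        importance.append(sum(increases) / n_repeats)
    return importance
```

In the RF setting the permutation is done inside the OOB sample of each tree; the sketch permutes over the whole sample, which conveys the same intuition: permuting an unimportant column barely changes the error, while permuting an important one inflates it.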

Starting from the first model as a reference (first column of Table 3), adding the lagged variable slightly modifies the importance scores but promotes PM10hier to third position, while removing the pollutants pushes it to the first rank, leaving the importance scores of the meteorological variables almost unchanged. This illustrates why the lagged variable is considered very useful in the forecasting context (from day d for tomorrow): PM10hier is available since it is measured, while NO, NO2 and SO2 are not observable and need to be replaced by forecasts, which are of medium to poor accuracy.

Let us now select the meteorological parameters. In Fig. 2, one can find a typical result coming from one forest for each station. The results are extremely homogeneous: the same top 3 variables are highlighted and the plots look similar.

Fig. 2. Variable importance using random forests.

For each station, the variables are selected according to the average importance over twenty random forests, to minimize sampling effects. A first (huge) gap around 20 and a second one around 10 help us choose a small subset of 6 variables. So, we select the same predictors for the three stations: PM10hier, GTrouen and VVmoy, which are clearly the most important, and in addition PAmoy and Tmoy. Let us remark that the daily total rain (variable PLsom) is not retained in the final set of variables. In fact, this variable is of small average importance, especially in winter, and in addition it is hard to predict; so, even if it is useful to explain exceedances, we chose to leave this variable out of the considered models. But we must remark that, while total rain is not easy to predict, some applications show that rain as a binary variable is easy to predict and can be a crucial variable in the model, and some recent studies show that PM10 concentrations decrease even when there is only a little rain.

In the sequel we focus on these five predictors. This selection captures the most frequently retained predictors, according to the literature.

3.2. Forecasting performances on training data

In addition to the persistence method, we consider a reference model given by a generalized additive model (GAM for short), which naturally contains the linear regression model and generally outperforms it in such a context. Then we consider a clusterwise linear model (CLM for short) with an unprescribed number of clusters. So the first step is to choose this number according to a model selection criterion: we compare the BIC values given for each clusterwise linear model and we choose the number of clusters leading to the smallest BIC value. We can see in Fig. 3 that 2 or 3 clusters lead to the smallest values. In particular, for the GUI station, the BIC value is the same for 2 or 3 clusters. Of course, in such situations the parsimony argument tends to favor the smaller one, but it could be of interest, from a descriptive or explicative perspective, to also inspect the model with 3 clusters.

Fig. 3. BIC values versus number of clusters of CLM for each of the three stations.
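The clusterwise fit and the BIC-based choice of the number of clusters can be sketched as follows. This is a minimal EM implementation under simplifying assumptions (a single predictor, Gaussian noise with one variance per cluster), for illustration only: the paper itself uses the flexmix R package, and all names below are ours.

```python
import math
import random

def fit_clusterwise(x, y, k, n_iter=200, seed=0):
    """EM for a mixture of k simple linear regressions y = a + b*x + noise.

    Returns (params, loglik, bic), where params holds one tuple
    (weight, intercept, slope, sigma) per cluster.
    """
    rng = random.Random(seed)
    n = len(x)
    # random initial responsibilities, normalized per observation
    resp = [[rng.random() for _ in range(k)] for _ in range(n)]
    for r in resp:
        s = sum(r)
        for j in range(k):
            r[j] /= s
    for _ in range(n_iter):
        # M step: weighted least squares inside each cluster
        params = []
        for j in range(k):
            w = [resp[i][j] for i in range(n)]
            sw = sum(w)
            mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
            my = sum(wi * yi for wi, yi in zip(w, y)) / sw
            sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
            sxy = sum(wi * (xi - mx) * (yi - my)
                      for wi, xi, yi in zip(w, x, y))
            b = sxy / sxx if sxx > 0 else 0.0
            a = my - b * mx
            s2 = sum(wi * (yi - a - b * xi) ** 2
                     for wi, xi, yi in zip(w, x, y)) / sw
            params.append((sw / n, a, b, math.sqrt(max(s2, 1e-9))))
        # E step: posterior probability of each cluster for each point
        loglik = 0.0
        for i in range(n):
            dens = [pj / (sj * math.sqrt(2 * math.pi))
                    * math.exp(-0.5 * ((y[i] - aj - bj * x[i]) / sj) ** 2)
                    for (pj, aj, bj, sj) in params]
            tot = sum(dens) + 1e-300
            loglik += math.log(tot)
            resp[i] = [d / tot for d in dens]
    n_par = 4 * k - 1  # per-cluster (weight, a, b, sigma), weights sum to 1
    bic = -2 * loglik + n_par * math.log(n)
    return params, loglik, bic
```

Fitting k = 1, 2, 3, ... and keeping the smallest BIC mimics the selection illustrated in Fig. 3; the MAP cluster of an observation is the argmax of its responsibilities.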



Table 4
Forecasting errors on training data (winter days 2004/2005 - 2007/2008).

Station  Statistical indices  GAM    CLM (2 clusters)  CLM (3 clusters)
JUS      R                    0.78   0.87              0.93
         EV                   0.61   0.76              0.85
         IA                   0.86   0.92              0.96
         SS                   0.53   0.71              0.83
         MAPE                 0.23   0.18              0.14
         RMSE                 6.34   4.98              3.85
GUI      R                    0.74   0.86
         EV                   0.55   0.72
         IA                   0.83   0.91
         SS                   0.49   0.68
         MAPE                 0.25   0.20
         RMSE                 8.20   6.43
GCM      R                    0.74   0.86              0.93
         EV                   0.54   0.73              0.85
         IA                   0.83   0.91              0.95
         SS                   0.44   0.67              0.81
         MAPE                 0.29   0.21              0.17
         RMSE                 7.32   5.63              4.23

Table 6
Forecasting errors on test data (winter days 2008-2009).

Station  Statistical indices  GAM    CLM (2 clusters)  CLM (3 clusters)
JUS      R                    0.81   0.87              0.92
         EV                   0.65   0.74              0.81
         IA                   0.89   0.90              0.93
         SS                   0.66   0.73              0.80
         MAPE                 0.22   0.17              0.14
         RMSE                 6.16   5.47              4.72
GUI      R                    0.76   0.85
         EV                   0.57   0.68
         IA                   0.85   0.87
         SS                   0.64   0.73
         MAPE                 0.24   0.18
         RMSE                 8.34   7.21
GCM      R                    0.69   0.79              0.90
         EV                   0.45   0.62              0.80
         IA                   0.82   0.87              0.94
         SS                   0.50   0.66              0.83
         MAPE                 0.30   0.20              0.15
         RMSE                 7.45   6.15              4.43
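To make the indices reported in Tables 4-7 concrete, here is a minimal sketch of how they can be computed, following the usual textbook definitions (the paper refers to Chaloulakou et al., 2003b for the exact ones); all function and variable names are ours, and the inputs are assumed non-degenerate.

```python
import math

def error_indices(obs, fc, pers):
    """Error-based indices: EV, R, MAPE, RMSE, IA, SS (vs persistence)."""
    n = len(obs)
    mo = sum(obs) / n
    mf = sum(fc) / n
    res = [f - o for f, o in zip(fc, obs)]
    mr = sum(res) / n
    var_o = sum((o - mo) ** 2 for o in obs) / n
    var_r = sum((r - mr) ** 2 for r in res) / n
    ev = 1 - var_r / var_o                         # explained variance
    cov = sum((f - mf) * (o - mo) for f, o in zip(fc, obs)) / n
    var_f = sum((f - mf) ** 2 for f in fc) / n
    r = cov / math.sqrt(var_f * var_o)             # correlation coefficient
    mape = sum(abs(f - o) / o for f, o in zip(fc, obs)) / n
    mse = sum(e ** 2 for e in res) / n
    rmse = math.sqrt(mse)
    ia = 1 - sum(e ** 2 for e in res) / sum(       # index of agreement
        (abs(f - mo) + abs(o - mo)) ** 2 for f, o in zip(fc, obs))
    mse_p = sum((p - o) ** 2 for p, o in zip(pers, obs)) / n
    ss = 1 - mse / mse_p                           # skill score vs persistence
    return dict(EV=ev, R=r, MAPE=mape, RMSE=rmse, IA=ia, SS=ss)

def exceedance_indices(obs, fc, threshold=30.0):
    """Exceedance-based indices POD, FAR, TS at a given threshold."""
    hits = sum(1 for o, f in zip(obs, fc) if o > threshold and f > threshold)
    misses = sum(1 for o, f in zip(obs, fc) if o > threshold and f <= threshold)
    false_al = sum(1 for o, f in zip(obs, fc) if o <= threshold and f > threshold)
    pod = hits / (hits + misses) if hits + misses else float("nan")
    far = false_al / (hits + false_al) if hits + false_al else 0.0
    ts = (hits / (hits + misses + false_al)
          if hits + misses + false_al else float("nan"))
    return dict(POD=pod, FAR=far, TS=ts)
```

A perfect forecast yields EV = IA = SS = 1 and RMSE = MAPE = 0, while SS = 0 means no improvement over persistence, matching the reading of the tables above.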

So, we compare the nonlinear additive model and the two clusterwise linear models, with 2 or 3 clusters.

The forecasting performances evaluated on winter days of years 2004–2008, for the three sites, are given in Table 4 for indices based on forecasting errors. First, the results for the three models are very good. For example, for the urban background station JUS, the worst model explains 61% of the variance and the best one 85%.

To complement this view, Table 5 contains the forecasting performances for indices based on level exceedances. The threat score (TS) is about 0.7 for the best models, which is remarkable.

Let us start the global comparison with the persistence strategy, which is the worst: bad TS (around 35% instead of the 70% considered as good), poor POD (around 55% instead of 75%) and poor FAR (around 45% instead of 10%).

The comparison between the three other models is clear: the nonlinear additive model is less accurate than the clusterwise linear model with 2 clusters, and the one with 3 clusters is, as expected on the training data, slightly better. It turns out that a global multiple linear model fitted on this data set leads to less accurate forecasts, since it is outperformed by the GAM model. However, this last one is only slightly better than persistence. Note that, even if this is of little interest on the training data, the persistence forecasting model performs even slightly better than GAM on the training data set of the GCM station (R = 0.81, EV = 0.65 and TS = 0.37). This certainly comes from the fact that, in winter, exceedances appear as sequences of consecutive observations exhibiting a locally increasing trend.

Let us remark that, at this stage, focusing on simple models, we have neglected possible weekend effects. So the comparison between clusterwise regression and nonlinear additive regression is not completely fair. Indeed, instead of clustering, one could study weekend effects and include corresponding dummy variables. Some statistical studies in other regions indicate lower levels on Saturdays compared to working days, and lower levels on Sundays compared to Saturdays, so one could study models with two dummy variables for the weekend.

3.3. Forecasting performance on test data

Let us now study the forecasting performances of the three models. We use the models estimated on the training data and analyzed in the previous section, and we evaluate their performances on winter days of years 2008–2009, in order to obtain a fair evaluation. The different performance statistics for the three sites are summarized in Table 6 for indices based on forecasting errors.

The results are remarkably stable from the training period to the test period and remain of good quality: explained variance from 45% to 74%, depending on the station and the model.

As in previous studies (see Jollois et al., 2009) modeling PM10 concentrations using pollutants and meteorological variables together, the stations exhibit different difficulties for forecasting: namely, JUS is easy, GCM is harder and GUI is medium.

The comparison with persistence is of interest for the test data. Let us notice that the skill score SS is always greater than 0.5, which means that the considered model outperforms the persistence model. More details can be found in Table 7, giving the performance indices based on level exceedances: the performances of persistence are considerably degraded with respect to those attained on the training data set.

The results obtained for the GAM and CLM models are stable, even if the results in exceedances are based on a small number of days, and can be very satisfactory, for example for the JUS station where the TS is about 0.80 and the FAR is zero.
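The exceedance indices used throughout this comparison have standard contingency-table definitions: POD = hits/(hits + misses), FAR = false alarms/(hits + false alarms) and TS = hits/(hits + misses + false alarms). As an illustration (not the authors' code; the short sample series below is invented), a Python sketch computing them for a persistence forecast at the 50 µg m−3 threshold:

```python
import numpy as np

def exceedance_scores(obs, forecast, threshold=50.0):
    """POD, FAR and TS from the exceedance contingency table."""
    o = np.asarray(obs, dtype=float) >= threshold       # observed exceedances
    f = np.asarray(forecast, dtype=float) >= threshold  # forecast exceedances
    hits = np.sum(o & f)
    misses = np.sum(o & ~f)
    false_alarms = np.sum(~o & f)
    pod = hits / (hits + misses) if hits + misses else np.nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else np.nan
    ts = hits / (hits + misses + false_alarms) if hits + misses + false_alarms else np.nan
    return pod, far, ts

# Persistence forecast: yesterday's daily mean predicts today's.
pm10 = np.array([30, 55, 60, 40, 52, 48, 70, 65], dtype=float)
persistence = np.r_[pm10[0], pm10[:-1]]
pod, far, ts = exceedance_scores(pm10[1:], persistence[1:])
# pod -> 0.4, far -> 0.5, ts -> 2/7
```

On this toy series persistence detects only lagged exceedance runs, which mirrors why its POD and FAR degrade as reported above.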

Table 5
Forecasting performances on training data (winter days 2004–2008).

Station  Statistical indices  Persistence  GAM   CLM (2 clusters)  CLM (3 clusters)
JUS      POD                  0.52         0.56  0.77              0.80
JUS      TS                   0.34         0.45  0.76              0.77
JUS      FAR                  0.49         0.30  0.03              0.05
GUI      POD                  0.57         0.66  0.78              –
GUI      TS                   0.40         0.53  0.73              –
GUI      FAR                  0.44         0.28  0.09              –
GCM      POD                  0.54         0.38  0.62              0.73
GCM      TS                   0.37         0.31  0.55              0.67
GCM      FAR                  0.46         0.38  0.18              0.10

Table 7
Forecasting performances on test data (winter days 2008–2009).

Station  Statistical indices  Persistence  GAM   CLM (2 clusters)  CLM (3 clusters)
JUS      POD                  0.36         0.71  0.79              0.93
JUS      TS                   0.23         0.53  0.79              0.93
JUS      FAR                  0.62         0.33  0                 0
GUI      POD                  0.59         0.74  0.67              –
GUI      TS                   0.43         0.60  0.66              –
GUI      FAR                  0.39         0.24  0.03              –
GCM      POD                  0.25         0.45  0.45              0.45
GCM      TS                   0.14         0.36  0.38              0.39
GCM      FAR                  0.75         0.38  0.31              0.25

4. Analysis of the clusterwise linear model on the JUS station

In this section, we develop the clusterwise linear model more carefully, focusing on the JUS station. We first analyze, together with the cluster assignment distribution, the models built in the different clusters. We compare them and study their discriminative properties. Then, we discuss the different ways to compute the forecasts by combining the predictions given by each model (or, equivalently, each cluster).

As previously mentioned, the clusterwise linear model is obtained from all winter days of years 2004–2008, using the R package flexmix and varying the number of clusters from 1 to 7. Let us mention that we apply the method to standardized variables, which then requires rescaling the estimates back to PM10 concentrations. The BIC criterion is minimal for 2 or 3 clusters (see Fig. 3), so we first examine the model with 2 clusters and close the section with a short discussion of the model with 3 clusters.

4.1. The model

Recall that a clusterwise regression model is a mixture of linear models, that is, a way to compute the posterior probabilities of belonging to clusters together with the local linear models.

Let us first examine these probabilities to describe the two clusters. We will show that cluster 1 corresponds, roughly, to the class of polluted days. Fig. 4 shows the posterior probabilities of belonging to cluster 1 versus the observed PM10 concentrations (of course, the probabilities of belonging to cluster 2 are the complements to 1). Two horizontal lines allow one to easily distinguish unpolluted days from medium to highly polluted days. Since the assignment is based on the maximum likelihood, we adopt the following convention: cluster 1 (resp. 2) is made of the days such that the posterior probability is greater (resp. less) than 0.5. The cluster sizes on the training data are then: 139 days in cluster 1 and 415 days in cluster 2. We mark by circles the days corresponding to cluster 1 and by crosses those of cluster 2.

Cluster 2 does not contain any day of concentration greater than 35 µg m−3, while all the polluted days (PM10 concentration ≥ 50 µg m−3) are assigned to cluster 1 with a probability equal to 1. So the interpretation is easy: cluster 1 is essentially the class of polluted days and cluster 2 the class of unpolluted days.

Let us now analyze the linear model across the 2 clusters involving the previously selected predictors (PM10hier, VVmoy, GTrouen, Tmoy and PAmoy). Table 8 gives the estimated coefficients together with the p-value of the corresponding significance test.

Table 8
JUS Station, winter days 2004–2008. CLM with 2 clusters. Estimated model coefficients with the p-value of the corresponding significance test.

Predictors    Cluster 1 (139 days)        Cluster 2 (415 days)
              Coefficient  p-value        Coefficient  p-value
(Intercept)   0.32         6.97e−05       −0.32        <2.2e−16
PM10hier      0.53         <2.2e−16       0.02         0.60
Tmoy          0.08         0.172          0.004        0.92
VVmoy         0.25         0.00128        0.10         0.007
PAmoy         0.09         0.267          0.14         8.2e−07
GTrouen       0.26         0.00024        0.27         1.5e−15

A graphical counterpart of this table (where the coefficient values are explicit) is given in Fig. 5, facilitating the comparison between the two models.

Let us first notice that the two models are similar except for two main differences: the sign of the intercept and the prominent role of PM10hier for cluster 1. More precisely, the intercept is positive in cluster 1 and negative in cluster 2. Influential predictors are not the same in each cluster. The lagged PM10 variable (PM10hier) is the most important predictor in cluster 1, while it is not significant in cluster 2. Indeed, in cluster 1, Tmoy and PAmoy are not significant (p-value > 5%), while in cluster 2 PM10hier and Tmoy are not significant.
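The paper fits this mixture with the R package flexmix; as an illustration of the underlying EM mechanics only (a sketch on synthetic data, not the authors' code, data or predictors), a two-component mixture of linear regressions can be estimated with a few lines of Python:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with two linear regimes, loosely mimicking
# "polluted" vs "unpolluted" days (all values invented).
n = 400
x = rng.normal(size=n)
z = rng.random(n) < 0.3                           # latent cluster labels
y = np.where(z, 0.5 + 0.8 * x, -0.3 - 0.1 * x) + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])              # design matrix with intercept
K = 2
# Initialize responsibilities from a crude split on y, then iterate EM.
post = np.column_stack([y <= np.median(y), y > np.median(y)]).astype(float)
beta = np.zeros((K, 2))
sigma = np.ones(K)

for _ in range(100):
    # M-step: weighted least squares and residual scale in each cluster
    for k in range(K):
        w = post[:, k]
        Xw = X * w[:, None]
        beta[k] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        sigma[k] = np.sqrt((w * (y - X @ beta[k]) ** 2).sum() / w.sum())
    pi = post.mean(axis=0)                        # mixing proportions
    # E-step: posterior probability of each cluster for each day
    resid = y[:, None] - X @ beta.T               # (n, K) residuals
    dens = pi * np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    post = dens / dens.sum(axis=1, keepdims=True)

labels = post.argmax(axis=1)                      # hard cluster assignment
```

Here `post` plays the role of the posterior probabilities shown in Fig. 4, and `labels` is the maximum-likelihood assignment used above to define clusters 1 and 2.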

Fig. 4. JUS Station, winter days 2004–2008. CLM with 2 clusters. Posterior probabilities of belonging to class 1.

Fig. 5. JUS Station, winter days 2004–2008. CLM with 2 clusters. Confidence intervals 95% and significance of the coefficients of linear models. (Comp. stands for Cluster.)

Fig. 6. JUS Station, winter days 2004–2008. CLM with 2 clusters, (observed PM10, predicted PM10); «hard» forecast method on the left panel and «fuzzy» forecast method on the right panel.

The previous description of the clusters helps us interpret the discrimination. For example, the intercept can be interpreted as a difference with respect to the global PM10 mean (in fact 0, since PM10 is standardized): more polluted for cluster 1 and less polluted for cluster 2.

4.2. How to build a forecast of PM10 concentration from a given clusterwise model?

In addition to the local linear models, the method provides a way to compute the posterior probabilities of belonging to clusters. Assignment to a class is then based on the maximum likelihood. So, two ways can be explored to build a forecast:

- to assign the new observation to a class and use the model of this class to predict (we refer to this method as hard since the assignment is hard); the hard forecast is then obtained using the linear model of the chosen cluster;
- to combine the forecasts delivered by the different models, weighted by the posterior probabilities (we refer to this method as fuzzy since the assignment is a probability distribution).

On the left panel of Fig. 6, we find the forecasts obtained by the hard strategy versus the observed PM10 concentrations. The forecasts obtained using the model of cluster 1 are represented by a circle, and those coming from the model of cluster 2 by a cross. The right panel displays the forecasts obtained using the fuzzy strategy; all the points are represented using the same symbol since each day is forecasted using both models with different weights.

It should be noted that all PM10 concentrations above 30 µg m−3 are estimated using the linear model of cluster 1, see the circles on the left panel of Fig. 6.

Table 9
JUS station. Performances for forecasting winter days 2004–2008. Hard and fuzzy forecasting methods.

Statistical indices  Hard method  Fuzzy method
R      0.87  0.87
EV     0.75  0.76
IA     0.92  0.92
SS     0.70  0.71
MAPE   0.18  0.18
RMSE   4.98  4.98

Table 10
JUS station. Performances for forecasting winter days 2009. Hard and fuzzy forecasting methods.

Statistical indices  Hard method  Fuzzy method
R      0.87  0.87
EV     0.75  0.74
IA     0.90  0.90
SS     0.72  0.73
MAPE   0.18  0.17
RMSE   5.47  5.47
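Once the posterior probabilities and the per-cluster predictions are available, the two strategies reduce to a few array operations. A minimal sketch (the probabilities and predictions below are invented, not taken from the paper):

```python
import numpy as np

# Posterior probabilities of the 2 clusters and the prediction of each
# cluster's linear model, for three hypothetical days (invented values).
post = np.array([[0.95, 0.05],
                 [0.20, 0.80],
                 [0.55, 0.45]])
preds = np.array([[52.0, 24.0],
                  [31.0, 18.0],
                  [40.0, 27.0]])

# "Hard" forecast: keep only the model of the most probable cluster.
hard = preds[np.arange(len(post)), post.argmax(axis=1)]

# "Fuzzy" forecast: posterior-weighted combination of both models.
fuzzy = (post * preds).sum(axis=1)
# hard  -> [52.0, 18.0, 40.0]
# fuzzy -> [50.6, 20.6, 34.15]
```

The two forecasts coincide when a posterior is close to 0 or 1, which is why the hard and fuzzy columns of Tables 9 and 10 are so similar.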

Fig. 7. JUS Station, winter days 2004–2008. CLM with 3 clusters. Posterior probabilities of belonging to each cluster.

Even if the second approach seems more appropriate for the forecasting problem, since the assignment to a cluster is not necessary, comparing the two panels does not reveal significant differences between the two scatter plots. Table 9 contains the numerical performance indices and confirms that the accuracy is similar.

To end this paragraph, let us briefly comment on the results obtained by the two different forecasting methods on the test data (see Table 10). Unsurprisingly, the results are very similar to those obtained on the training data and are also very close to each other for the two forecasting methods. This comes from the fact that polluted days are assigned to cluster 1 with a probability very close to 1.

4.3. A model with 3 clusters

To close this descriptive part, let us comment on the clusterwise regression model with 3 clusters, instead of only 2 as in the previous section. The cluster sizes are as follows: 223 days for cluster 1, 257 for cluster 2 and 74 for cluster 3. Again, by examining the posterior probabilities of belonging to each cluster, we easily get the interpretation of the clusters. From right to left in Fig. 7: the polluted days belong to cluster 3 with a probability close to 1, the days belonging to cluster 2 with a probability greater than 0.5 are unpolluted, and cluster 1 corresponds to an intermediate class.

The linear models for the 3 clusters are given in Fig. 8. This synthetic view is to be compared with Fig. 5. It is clear that the models of the new clusters 2 and 3 are roughly the same as those of the previous clusters 2 and 1, respectively. The intercept of the model of the new cluster 1 is about zero, confirming the intermediate situation.

Fig. 8. JUS Station, winter days 2004–2008. CLM with 3 clusters. Confidence intervals 95% and significance of the coefficients of linear models (Comp. stands for Cluster).

The forecasts obtained by combining the predictions given by the three models lead to the graph of predicted versus observed PM10 concentrations of Fig. 9. This scatter plot exhibits an exceptional concentration of points around the ideal diagonal, even if the model slightly underestimates the medium to high values. The explained variance reaches 86%, which is very satisfactory.

Fig. 9. JUS Station, winter days 2004–2008. CLM with 3 clusters, (observed PM10, predicted PM10), «hard» forecasting method.

5. Conclusion and discussion

For three monitoring stations of Rouen (Haute-Normandie, France) reflecting the diversity of urban situations (background, traffic and industrial stations), we have built statistical models for daily mean PM10 concentrations from the winter days 2004/2005 to 2007/2008. We have shown that it is possible to accurately forecast the daily mean PM10 concentration by fitting a function of meteorological predictors and the average PM10 concentration measured on the previous day. We have compared the forecasting performance, evaluated on winter days 2008/2009, of three different methods: persistence, generalized additive nonlinear models and clusterwise linear regression models, and have analyzed the reasons for the especially good behavior of the last one.

To discuss the future directions of this promising work, let us sketch two different questions.

The first one is about the forecasting results in an actual forecasting context. Indeed, the results obtained in the literature are often of excellent quality, especially when local particularities (geographic or meteorological) lead to easy-to-predict situations (see for example Diaz-Robles et al., 2008, about Temuco, Chile). In addition, as in the present work, many papers evaluate the forecasting performance on observed meteorological variables; this should be completed by an effective evaluation of the additional uncertainty brought by the replacement of observed values by predicted ones. The perfect prognosis strategy for forecasting (see Wilks, 1995) consists in fitting the models on observed variables using training data and then, in an actual forecasting situation, replacing unavailable variables by forecasts, generally coming from the meteorological institute. This allows one, for example, to introduce the temperature of the day to explain PM10 by considering the forecasted temperature in an actual situation. Even if the perfect prognosis strategy is used almost everywhere, Caselli et al. (2009)

for Bari (Italy) use a classical neural network fitted directly on the forecasts of meteorological variables. Nevertheless, focusing on an application in operational mode requires additional experiments to assess the quality of the forecasting procedure. Indeed, for the model considered in our paper, it may be mentioned that in forecasting PM10 for day d, e.g. in the evening of day d−1, one cannot use PM10hier. Instead, one may use a model with a moving average of PM10 from 16:00 on day d−2 to 16:00 on day d−1.

The second direction is to take advantage of deterministic forecasting coming from large-scale numerical modelization. A lot of ideas can be used to make the two modeling approaches cooperate. For example, in Hooyberghs et al. (2005), the considered models are very simple and the main originality is to include the mixing height. A model including only this variable and the PM10 of the last day gives satisfactory results in terms of alarms. The introduction of four additional classical variables leads to slightly improved performances. In Konovalov et al. (2009), a deterministic forecasting model is compared to statistical ones and appears to be less accurate. What is interesting is that the statistical models can be easily improved by introducing the deterministic forecast as an additional predictor. In the same line, Perez and Reyes (2006) introduce the Meteorological Potential of Atmospheric Pollution given by a deterministic model, which leads to linearized and simplified forecasting models. Finally, let us mention that Cobourn (2010), using a nonlinear parametric model, shows that adding the back-trajectories is useful in terms of average error as well as in terms of alarms.

Acknowledgements

We want to thank Véronique Delmas and Michel Bobbia from Air Normand for fruitful discussions. In addition, we want to thank Air Normand and Météo-France for providing the PM10 and meteorological data, respectively. Finally, the authors thank the reviewers for constructive comments and recommendations.

References

Aldrin, M., Haff, I.H., 2005. Generalised additive modelling of air pollution, traffic volume and meteorology. Atmospheric Environment 39, 2145–2155.
Barmpadimos, I., Hueglin, C., Keller, J., Henne, S., Prévôt, A.S.H., 2011. Influence of meteorology on PM10 trends and variability in Switzerland from 1991 to 2008. Atmospheric Chemistry and Physics 11, 1813–1835.
Bobbia, M., Jollois, F.X., Poggi, J.M., Portier, B., 2011. Quantifying local and background contributions to PM10 concentrations in Haute-Normandie using random forests. Environmetrics 22, 758–768.
Breiman, L., Friedman, J.H., Ohlsen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.
Breiman, L., Friedman, J.H., 1985. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580–619.
Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.
Caselli, V., Trizio, L., de Gennaro, G., Ielpo, P., 2009. A simple feedforward neural network for the PM10 forecasting: comparison with a radial basis function network and a multivariate linear regression model. Water, Air, & Soil Pollution 201, 365–377.
Chaloulakou, A., Grivas, G., Spyrellis, N., 2003a. Neural network and multiple regression models for PM10 prediction in Athens: a comparative assessment. Journal of the Air & Waste Management Association 53, 1183–1190.
Chaloulakou, A., Saisana, M., Spyrellis, N., 2003b. Comparative assessment of neural networks and regression models for forecasting summertime ozone in Athens. The Science of the Total Environment 313, 1–13.
Cobourn, W.G., 2010. An enhanced PM2.5 air quality forecast model based on nonlinear regression and back-trajectory concentrations. Atmospheric Environment 44, 3015–3023.
Corani, G., 2005. Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecological Modelling 185, 513–529.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
Diaz-Robles, L.A., Ortega, J.C., Fu, J.S., Reed, G.D., Chow, J.C., Watson, J.G., Moncada-Herrera, J.A., 2008. A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile. Atmospheric Environment 42, 8331–8340.
Dong, M., Yang, D., Kuang, Y., He, D., Erdal, S., Kenski, D., 2009. PM2.5 concentration prediction using hidden semi-Markov model-based time series data mining. Expert Systems with Applications 36, 9046–9055.
Grivas, G., Chaloulakou, A., 2006. Artificial neural network models for prediction of PM10 hourly concentrations, in the greater area of Athens, Greece. Atmospheric Environment 40, 1216–1229.
Grün, B., Leisch, F., 2007. Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis 51, 5247–5252.
Grün, B., Leisch, F., 2008. FlexMix version 2: finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software 28, 1–35.
Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Chapman & Hall.
Hooyberghs, J., Mensink, C., Dumont, G., Fierens, F., Brasseur, O., 2005. A neural network forecast for daily average PM10 concentrations in Belgium. Atmospheric Environment 39, 3279–3289.
Jollois, F.X., Poggi, J.M., Portier, B., 2009. Three nonlinear statistical methods to analyze PM10 pollution in Rouen area. Case Studies in Business, Industry and Government Statistics 3, 1–17.
Konovalov, I.B., Beekmann, M., Meleux, F., Dutot, A., Foret, G., 2009. Combining deterministic and statistical approaches for PM10 forecasting in Europe. Atmospheric Environment 43, 6425–6434.
Leisch, F., 2004. FlexMix: a general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11, 1–18.
Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley.
Paschalidou, A.K., Karakitsios, S., Kleanthous, S., Kassomenos, P.A., 2011. Forecasting hourly PM10 concentration in Cyprus through artificial neural networks and multiple regression models: implications to local environmental management. Environmental Science and Pollution Research 18, 316–327.
Perez, P., Reyes, J., 2006. An integrated neural network model for PM10 forecasting. Atmospheric Environment 40, 2845–2851.
Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Sfetsos, A., Vlachogiannis, D., 2010. Time series forecasting of hourly PM10 using localized linear models. Journal of Software Engineering and Applications 3, 374–383.
Slini, T., Kaprara, A., Karatzas, K., Moussiopoulos, N., 2006. PM10 forecasting for Thessaloniki, Greece. Environmental Modelling & Software 21, 559–565.
Stadlober, E., Hörmann, S., Pfeiler, B., 2008. Quality and performance of a PM10 daily forecasting model. Atmospheric Environment 42, 1098–1109.
Wilks, D.S., 1995. Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press.
Wood, S.N., 2006. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.
Zolghadri, A., Cazaurang, F., 2006. Adaptive nonlinear state-space modelling for the prediction of daily mean PM10 concentrations. Environmental Modelling & Software 21, 885–894.
