Professional Documents
Culture Documents
Abstract— The use of false or erroneous data can lead to drift, breakdowns, etc.), producing false flow data readings.
wrong decisions when operating a system. In case of a water These false data must also be detected and replaced by esti-
distribution network, the use of incorrect data could lead mated data, since flow data are used for several network wa-
to errors in the billing system, waste of energy, incorrect
management of control elements, etc. This paper is focused ter management tasks, namely: planning, investment plans,
on detecting flow meters reading abnormalities by exploiting operations, maintenance, and billing/consumer services and
the temporal redundancy of the demand time series by means operational control [1], [2].
of artificial neural networks (ANN). Communication problems This paper is related to works in the area of Artificial
with the sensor generate missing data and bad maintenance Neural Network (ANN) for the task of modelling time series
service in the flow meters produce false data. In this work,
a methodology to detect the false data (validate) and replace [3], [4], [5], [6], but differ in the way of creating the
the missing or false data (reconstruct) is proposed. As a architecture of the ANN, and also the way to train it. In
core methodology, ANNs are used to model the time series previous works ([7], [8], [9]), we have used Evolutionary
generated from the water demand flow meters, and use the Algorithms (EA) to define the architecture of the ANN and
confidence intervals to validate the information. To illustrate train it. The results show that the forecast accuracy achieved
the proposed methodology, the application to flow meters in
the water distribution network of Barcelona is used. using EA presents a better performance than traditional
gradient-based training algorithms.
I. I NTRODUCTION In this paper, we propose a methodology to detect (vali-
The purpose of a water distribution system is to deliver date) and reconstruct data of water demand time series by
water to consumers with appropriate quality, quantity and modelling the time series with an ANN. With this model,
pressure. The water supply system collects, stores, treats, and it is possible to detect anomalies and replace the missing
distributes the water for homes, commercial establishments, data. One of the inputs of the ANN is the mode (or regime)
industry, and irrigation, as well as for such public needs as of the next day to model. In previous works related to
firefighting, street flushing, etc. The main goal of the water regime recognition, the identification of regimes behaviour
distribution system is to satisfy the consumers’ demand (with has used qualitative information. Benmouiza and Cheknane
appropriate quality). This implies the continuous delivering [10] proposed the implementation of a global NAR (Non-
of quality water at a reasonable pressure to users. To maintain Linear Autoregressive) neural network predictor to estimate
the quality of service, the water distribution system requires the regimes associated to another local NAR neural network
a real-time monitoring system. This system will facilitate predictor for the hourly global solar radiation. Kumar and
finding faults in the water distribution system. Patel [11] propose a predictive algorithm using data clus-
Among the activities of the water distribution system, it is tering and local training models that combined produce the
to periodically measure the amount of water supplied to each forecast. Martinez Alvarez et al. [12] use clustering to group
user. In a complex water distribution network, a telecontrol the days with similar patterns, regarding the variation of
system must acquire, store, and validate data from many the electricity cost in working days and holidays. In [13], a
flow meters and other sensors every few minutes to achieve daily seasonality ARIMA model with hourly pattern with the
accurate monitoring of the whole network in real-time. goal of working in daily and hourly scales is proposed. The
Frequent operation problems occurs in the communication seasonal ARIMA predicts total days consumption and a daily
system between the set of the sensors and the data logger, or pattern is selected according to a calendar for distributing
in the telecontrol itself, generate missing data during certain the consumption hourly along the day. The proposed case
periods of time. Missing data must, therefore, be replaced study, the Barcelona water network model has been already
by a set of estimated data [1], [2]. studied by [2] also with the focus on the validation and
Missing data is not the only problem, a second common reconstruction of flow meter data.
problem is the lack of reliability of the flow meters (offset, The structure of the paper is the following: In Section
II, the approach proposed for the modelling the flow meter
1 Hector and Vicenç are with the Institut de Robòtica i Informàtica Indus-
data series is introduced. Section ?? describes the flow
trial (CSIC-UPC), Carrer LLorens Artigas, 4-6, Barcelona (Spain), 08028.
hrodriguez@iri.upc.edu and vicenc.puig@upc.edu meter data validation and reconstruction approach. Section
2 Juan F. works at División de Estudios de Postgrado, Facultad de
IV presents the case study used to illustrate the proposed
Ingenierı́a Eléctrica, Universidad Michoacana de San Nicolás de Hidalgo, approach. Section V summarizes the main results obtained
Morelia, Michoacán, Mexico juanf@umich.mx
2 Rodrigo is with IMT Institute for Advanced Studies, Lucca, (Italy). in the considered case study. Finally, in Section VI draws
rdglpz@gmail.com the main conclusions.
II. M ODELLING THE T IME S ERIES Given a time series, we need to provide a neural model
A. Background capable of producing an acceptable model that fits with the
behavior of the time series. The ANN architecture used for
Modelling a water demand time series is not something
this task is a Multi-Layer Perceptron (MLP). A MLP as a
new, since water is one of the most basic non renewable
universal approximator, can learn any function given, as long
natural resources to sustain life and ecosystems. The devel-
as it has enough neurons in the hidden layer. That fact allows
opment of new forecasting models and strategies are strongly
the network to capture the different forms of the function to
related to multidisciplinary novel research, mainly in physics,
be modelled. The network used in this work is a three layer
computing, and statistics.
MLP network. The past observations of the time serie are
In the literature regarding modelling time series (e.g. [14],
used as input of the ANN. The hidden layer has m neurons,
[15]), we found that there is a strong effort on finding the
and the output layer corresponds to the forecast (ŷt+1 )), and
best way to decompose time series in several simpler time
the sigmoid function as the activation function. The ANN
series to better fit an accurate model. This is not an easy task
that forecast ŷt+i is defined as:
since in many real cases there is not an available model that
describes the dynamic fluctuation of the data. In the early n
X
successful stage in statistics the divide and conquer strategy ŷt+1 = f1 ( w i xi )
has been used. For example, the decomposition on different i
h (1)
basic components that might explain the general dynamics of X
the time series, such as trend, seasonal, random, and cyclical xi = f2 ( wij yt−j )
components that are integrated in the ARIMA methodology j
[16], [17]. where f1 and f2 are the activation functions, and w are the
Nowadays, with the growing of computational resources coefficients or the weight connections.
and the development of machine learning and pattern recog- An example of this kind of architecture is depicted in
nition algorithms (e.g. [18], [19], [20]), it is possible to Figure 1. The proposed network has k inputs, which are
model more complex time series. There are practical cases selected from the k previous measurements, plus the next
where the single linear modeling approach is not enough for mode. It is understood as next mode, the estimation of the
systems that present different behaviors along time [10], [11], next day qualitative behavior (e.g. labor day or holiday). It
[12], [13]. These behavior changes might be produced by has m neurons in the hidden layer, one output (the corre-
changes on dynamical regimes. Although the multi-modeling sponding hour prediction (ŷt+1 )), and the sigmoid function
approach was born with the analysis of partially known as the activation function.
real systems, the same ideas can be adopted for time series
modelling where there is no knowledge about the system
behind the dynamics. That is the case of water demand time Next
series. Mode
Regarding the water demand, different methodologies for
yt
modelling have been explored. In particular, revising the
literature, several methodologies (e.g. based on Box Jenkins, Yt-1 Ŷt+1
ANNs, Holt-Winters, ARIMA, etc.) that deal with this prob-
lem [21], [22], [13], [23] already were satisfactorily applied. Yt-2
...
B. Proposed approach Yt-k
Recent research activities in ANNs have shown that they
have powerful pattern classification and pattern recognition
capabilities. Inspired by biological systems, particularly by
k inputs + m hidden One output
research into the human brain, ANNs are able to learn and mode neurons
generalise from experience. Currently, ANNs are used for a
wide variety of tasks in many different fields of business, Fig. 1. MLP architecture with k + 1 past observations plus the next mode,
industry, and science [24]. m hidden neurons in the hidden layer and one output (ŷt+1 ).
One major application area of ANNs is forecasting (mod-
elling time series). ANNs provide an attractive alternative
tool for both forecasting researchers and practitioners. Sev- C. Regimes
eral distinguishing features of ANNs make them valuable Time series can present different dynamic behaviors that
and attractive for a forecasting task [24]. cannot be fitted with a single model. These different behav-
Determination of the optimal number of hidden neurons is iors might be considered as changes of regimes that might
a crucial issue. If the hidden layer is too small, the network be unknown. Examples of this kind of behaviors are found
can not possess sufficient information processing power, and in natural processes, such as temperature, solar radiation,
thus yields inaccurate forecasting results. On the other hand, wind speed, etc., where the behavior changes according to
if it is too large, the training process will be very long. some rules. The behavior in these cases is different during
the winter than during the summer, but regular in its own of the vector corresponds to the weights connections from
season. Also, behavioral changes can be found in time series the input neurons to the neurons in the hidden layer (second
corresponding to human activities such as water demand, part); its size is (n+1)∗h (n is the number of inputs, h is the
electricity consumption, etc. In those cases the behavior of number of neurons in the hidden layer). And the third part
the series is reflected according to human activities (i.e. corresponds to the weights from the neurons in the hidden
holiday or working days). Each behavior can be represented layer to the output layer (third part)its size is h. Figure 2
by a regime that is characterized by a qualitative behavior. shows the structure of the chromosome.
In [25], to predict the time series water consumption, it
Number of hidden neurons Second set of weigh
is decomposed in two kinds of time series: a quantitative connections
and qualitative one. A quantitative time series that represents 0.1053 0.4253 0.8817 0.531 0.7571 … 0.4966 0.5177 0.5059 0.8924 0.6013 0.7681 … 0.9589 0.4426 0.2989
For the training process, taking into account the results I = Integer(LimM ax ∗ realData) (3)
obtained in previous works ([7], [8], [9]) where we tested
traditional methods (gradient-based [26]) versus evolutionary where LimM ax is the largest possible number defining
computation, we use Genetic Algorithms (GA) to define the the boundary of the architecture. We have one LimM ax
architecture of the net and the training process for finding to define the number of inputs, and another one to define
the best structure and weights of the ANN. the number of neurons in the hidden layer. I is an integer
GA is an optimisation technique inspired on Darwin’s number between zero and LimM ax; realData is the value
principle of evolution. That is, it mimics a simplistic version contained in the vector of reals, and the function Integer
of the process of biological evolution, which consists of returns the integer part of the product of LimM ax and
creating a population of individuals, where each individ- realData.
ual represents a prospective solution of the problem being III. DATA VALIDATION AND R ECONSTRUCTION
solved. GA modifies this population using genetic operators:
To validate and reconstruct the data from the flow meter
selection, mutation, recombination, etc. This stage, called a
sensors of the distribution network, first we need to model
generation, repeats until a termination criterion is met. At the
the behavior of the corresponding time series. To model the
end of the process, the best individual (e.g. the one with best
water demand time series, as explained before we use an
fit) found during the evolution is returned as the solution of
ANN. With the model of the time series, it is possible to
the problem.
start the process of validation and reconstruction of data.
Determining the best ANN architecture for forecasting is
The process starts from the acquisition of data, and ends
an optimization problem; we use GA to find the optimal
until they are stored in the operational database.
ANN architecture and its weights. The function to minimise
The information from flow meter is sent to the telecontrol
is:
center. If system does not receive the data, the process of
N
X reconstruction is started. If the data is received, the data goes
min(M SE) = (yk − ŷk )2 (2) to a validation process. If the data is out of the boundary
k of a valid range, the data is discarded and the process of
where yk is the real data, ŷk is the estimate data, N is the reconstruction is started. After this processes the data are
number of points in the training set, and M SE (Mean Square stored in the operational database. Figure 3 shows the flow
Error) is the function to minimise. of the validation process and reconstruction of information.
The validation process verify if the data is between a
GA defines the architecture of the ANN and the weights of
valid boundary. The valid boundary is defined by the model
the neurons connections. Each individual (the chromosome)
obtained with the ANN. The model produces an estimation,
is defined as a vector of real numbers. The vector is coded in
and with statistical methods a valid threshold can be deter-
three parts. The first part of the vector defines the architecture
mined. In this case, to quantify the model uncertainty, we use
of the net (n and h of Equation 1 ). This part is a a vector
the confidence interval assuming that error follows normal
of two elements: the first element is coded and defines the
distribution as follows:
number of previous measurements to use as an input (n),
and the second element, also coded, defines the number of σ σ
x̄ − 1.96 √ < µ < x̄ + 1.96 √ (4)
neurons in the hidden layer (h). The second and third part N N
of 1 hour. For the predictive control scheme a prediction
Flow
Meter horizon of 24 hr is chosen. This record is updated at each
time interval.
Yraw(k)
Due to the geographical topology of Barcelona and its
Si
NO surroundings, the water network supply in the metropolitan
Is there area is structured in pressure flows. Indeed, the topology of
data ?
Barcelona, with a big difference in height between the sea
Yraw(k) Yraw(k) level and the highest point to be supplied that is about 500 m
above the sea level makes it necessary to homogenise the
NO Reconstruc*on
pressure by intermediate tanks and pumping stations.
Is Valid?
Yraw(k) of
Data
The water supply system responds to changes in network
topology (e.g. ruptures), typical daily profiles, as well as
Yreconst(k)
Yraw(k) Si major changes in water demand, etc. The demand varies
Operational hourly, following a pattern that repeats every day. There
DataBase are slight changes during weekends; additionally, there are
changes in demand between winter and summer.
Fig. 3. Validation and reconstruction process of the flow meter data sensor.
V. R ESULTS
The considered water demand flow meter sensor is named
where µ is interval with a 95% confidence, x̄ is the mean of p10025 of Barcelona water distribution network. The time
the error, σ is the standard deviation, and N is length of the serie of this sensor of one month in hourly scale is depicted
population. The valid range is defined as: in the Figure 4.
Fig. 5. Real versus predicted demand using the ANN obtained using the
0.8
GA optimisation process.
TABLE I 0.6
PARAMETERS OF THE GA
0.4
Concept Value
Length of data 720
Length of the validation set 216 0.2
inputs of the ANN 35
neurons in the hidden layer 28
Individual mutation probability 35% 0.0
Gen mutation probability 0.05%
Crossover probability 70% Fig. 7. Missing data replaced by the reconstruction process. The grey
bars represent the real data magnitude and the the red ones represent the
reconstructed data.
ì
Princeton, 1994.
ì
ì
ì
ì æ
æ
ì
[15] W. W.-S. Wei, Time series analysis, Addison-Wesley publ, 1994.
æ
0.8
æ
æ ì
à à æ
ì
[16] J. Contreras, R. Espinola, F. J. Nogales, A. J. Conejo, Arima models to
ì ì à
ì
æ
à
à
æ
à ì
à
à
æ ì ì
predict next-day electricity prices, Power Systems, IEEE Transactions
æ ì
ò ò
0.6
ì
à æ æ
ì
ì ì
ì
ì ì
ì
æ ò à
ì ì
ì
æ
ì
on 18 (3) (2003) 1014–1020.
à à ò
ì
ò
ò
ò
à
à
æ
ì
ò æ
æ
à à
æ [17] A. C. Harvey, Forecasting, structural time series models and the
æ
à
æ
0.4
ì
ì
æ
à
ò
ò
à
à à
æ à
æ
à à
æ
æ ì
ò
æ
à à
æ
à
à
ì Kalman filter, Cambridge university press, 1990.
ò à ì ò
ì
ì
ì
ì
à
ò
ì
ì ì
ì ì ì
ì
à
ò
ò ò
æ
ì
ì
[18] R. T. Olszewski, Generalized feature extraction for structural pattern
ì ì ò ò æ æ ò
0.2
à
à
æ
ò ò
ò
ò ò
ò
à ò ò à
ò æ
recognition in time-series data, Tech. rep., DTIC Document (2001).
æ æ à
ò
à
æ
à
à
æ
æ
à
æ
à æ à à à à
æ
ò
à
æ
[19] T. W. Liao, A clustering procedure for exploratory mining of vec-
æ à
æ à à
à à
ò
æ æ æ ò
ò æ
æ æ
æ
ò
ò æ
Time
tor time series, Pattern Recognition 40 (9) (2007) 2550 – 2562.
10 20 30 40 50
ò
ò
ò
ò
ò
ò ò
ò ò ò
ò
ò
ò
doi:http://dx.doi.org/10.1016/j.patcog.2007.01.005.
ò ò
[20] T. W. Liao, Clustering of time series data?a survey, Pattern recognition
38 (11) (2005) 1857–1874.
[21] M. S. Al-Hafid, G. H. Al-maamary, Short term electrical load fore-
Fig. 8. Confidence intervals or valid boundary of the model. casting using holt-winters method, Al-Rafadain Engineering Journal
20 (6).
[22] S. Zhou, T. McMahon, A. Walton, J. Lewis, Forecasting daily urban
Magnitude water demand: a case study of melbourne, Journal of Hydrology
1.0
Ï Ï
Ï
236 (3?4) (2000) 153 – 164. doi:http://dx.doi.org/10.1016/S0022-
Ê
Ï
Ï
Ï
Ï Ê
Ê
Ï
1694(00)00287-0.
Ê
0.8
Ï
Ê
Ê Ï
‡ ‡ Ê
‡
Ï
[23] S. Alvisi, M. Franchini, A. Marinelli, A short-term, pattern-based
Ï Ï
Ê
‡
‡
Ê
‡ Ï
‡
‡
Ï
Ï Ï
model for water-demand forecasting, Journal of Hydroinformatics
Ê Ú Ú Ï
Ï
0.6 ‡ Ê Ê
Ï
Ï Ï
Ï
Ï Ï
Ï
Ê Ú ‡
Ï Ï Ê
Ï
9 (1) (2007) 39–50.
‡ ‡ Ú
‡
Ï
Ú
Ú
Ú ‡
Ê
Ï
Ú Ê
Ê
Ê
‡ ‡
Ê
[24] G. Zhang, B. E. Patuwo, M. Y. Hu, Forecasting with artificial neural
Ê ‡
0.4
Ï
Ï
‡
Ú
Ú
‡
‡ ‡
Ê ‡
Ê
‡
Ê ‡
Ê
‡
Ï
Ï Ú
Ú
Ê
‡ ‡
Ê
‡
‡
Ï networks:: The state of the art, International journal of forecasting
Ú
Ï
Ï
Ï
Ï
‡
Ú
Ï
Ï Ï
Ï Ï Ï
Ï
‡
Ú
Ú Ú
Ê
Ï
Ï
14 (1) (1998) 35–62.
Ï Ï Ú Ú Ê Ê Ú
0.2
‡
‡
Ê
Ú Ú
Ú
Ú Ú
Ú
‡ Ú Ú ‡
Ú Ê
[25] R. Lopez, V. Puig, H. Rodriguez, An implementation of a multi-model
Ê Ê ‡ Ê
Ú Ú
‡
Ê
‡
‡
Ê
Ê
‡
Ê
Ê ‡ ‡ ‡ ‡
‡ ‡ Ê
‡
Ê
predictor based on the qualitative and quantitative decomposition of
Ê ‡
Ê ‡ ‡
‡
Ú
Ê Ê Ê Ú
Ú Ê
Ê Ê
Ê
Ú
Ú Ê
Time the time-series, International work-conference on Time Series 1 (1).
Ú 10 20 30 40 50
Ú
Ú
Ú
Ú
Ú Ú
Ú Ú Ú
Ú
Ú
Ú
[26] Neurolab, Neurolab a python plugin, Available:
Ú Ú
https://pypi.python.org/pypi/neurolab [Online].
[27] S. Sociedad General de Aguas de Barcelona, Annual corporate gov-
ernance report (2008).
Fig. 9. False data detected by the boundary of the CIs