Advanced Engineering Informatics 44 (2020) 101092

Transfer learning for long-interval consecutive missing values imputation without external features in air pollution time series

Jun Ma (a), Jack C.P. Cheng (a), Yuexiong Ding (b), Changqing Lin (a), Feifeng Jiang (c), Mingzhu Wang (a), Chong Zhai (d,*)

(a) Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
(b) Department of Research and Development, Big Bay Innovation Research and Development Limited, Hong Kong, China
(c) Department of Architecture and Civil Engineering, City University of Hong Kong, Hong Kong, China
(d) Shenzhen Qianhai Bruco Consulting Company Limited, Shenzhen, China

* Corresponding author. E-mail address: zhaichong0131@outlook.com (C. Zhai).

https://doi.org/10.1016/j.aei.2020.101092
Received 9 September 2019; Received in revised form 17 March 2020; Accepted 26 March 2020
1474-0346/© 2020 Elsevier Ltd. All rights reserved.

ARTICLE INFO

Keywords:
Air quality
Deep learning
Long-interval consecutive missing values
Long short-term memory (LSTM)
Neural network
Transfer learning

ABSTRACT

Air pollution has become one of the world's largest health and environmental problems. Studies focusing on air quality prediction, influential factor analysis, and control policy evaluation are increasing. When conducting these studies, valid and high-quality air pollution data are required to generate reasonable results. Missing data, which are frequently contained in the collected raw data, have therefore become a significant barrier. Existing methods for missing data either cannot effectively capture the temporal and spatial mechanisms of air pollution or focus on sequences with low missing rates and random missing positions. To address this problem, this paper proposes a new imputation methodology, namely transferred long short-term memory-based iterative estimation (TLSTM-IE), to impute consecutive missing values with large missing rates. A case study is conducted in New York City to verify the effectiveness and superiority of the proposed methodology. Long-interval consecutive missing PM2.5 concentration data are filled. Experimental results show that the proposed model can effectively learn from long-term dependencies and transfer the learned knowledge. The imputation accuracy of the TLSTM-IE model is 25-50% higher than that of other commonly used methods. The novelty of this study lies in two aspects. First, we target long-interval consecutive missing data, which has not been addressed by existing studies in atmospheric research. Second is the novel application of transfer learning to missing value imputation. To our best knowledge, no research on air quality has applied this technique to this problem before.

1. Introduction

1.1. Background

Rapid economic growth, industrialization, and urbanization in recent years have led to extremely severe air pollution worldwide. Air pollution negatively influences the climate, human health, and economic development [1-3]. To control air pollution and mitigate its impacts on the environment and human society, numerous studies have been conducted to investigate the influential factors of air quality [4], predict air pollutant concentrations [5], and explore practical methods and measurements for air quality prevention [6,7]. When conducting these studies, complete and high-quality air pollution data are required for generating reliable results [8]. However, currently, air quality data are usually obtained using automated machine sensors located at different sites. Due to facility issues, routine maintenance, changes in sensor settings, human error, insufficient sampling, and other reasons, the collected raw data always contain missing values [9]. These missing data not only increase the difficulty of air quality prediction but also limit the discoveries in influential factor analysis. Therefore, to better support studies on cleaner production, it is essential to address the missing data problem.

In fact, missing data is not a problem unique to air pollution but a ubiquitous concern in many scientific fields, such as epidemiology, traffic, building, and finance [10-14]. In past years, scholars have proposed many approaches to mitigate the impacts of missing values. Deletion and imputation are two commonly used methods. Deletion further includes listwise deletion and pairwise deletion. Listwise deletion discards observations from the analysis if they contain missing values in some variables, while pairwise deletion excludes a particular



variable if it has a missing value [15]. Although these two deletion methods are easy to implement and have been widely used, studies have shown that handling missing data by deletion can reduce statistical power and introduce substantial bias into the study [16], especially when there is a significant fraction of missing data.

Instead of removing the observations or variables with missing data, imputation methods keep the full sample size by filling missing values with substituted values. A variety of imputation approaches have been proposed and used, ranging from operationally straightforward to complicated. For example, linear interpolations such as mean and median imputation are perhaps the easiest ways to impute [10]. They fill the missing values with the mean or median of the corresponding observed values. The expectation-maximization (EM) iterative algorithm is also a simple and widely used imputation method [17]. It computes maximum likelihood estimates from incomplete datasets to fill the missing values. K-nearest neighbor (KNN) is another imputation method. It fills the missing values with the mean of the corresponding column over the nearest neighboring rows that have no missing values [9,18].

Other imputation methods are more complicated. For example, Li et al. [19] proposed a filling method based on the least squares support vector machine (LSSVM) for multivariate time series. Li et al. [20] also proposed an algorithm based on time series similarity measurement for missing data imputation. Che et al. [21] developed a deep learning model based on the Gated Recurrent Unit (GRU) for missing data imputation. They took two representations of missing patterns, masking and time interval, and incorporated them into the model to capture the long-term temporal dependencies in time series and utilize the missing patterns. Wei and Tang [22] used the distance concept and the self-organizing map (SOM) neural network to fill missing values. They first filled missing data with default values and clustered all records through SOM; they then analyzed value distribution patterns in sub-groups and re-filled the missing data within sub-groups. Sahin et al. [23] developed a neural network named Cellular Neural Network (CNN) to predict missing air pollution data.

1.2. Research gaps and objectives

Unfortunately, although various approaches have been proposed for missing data imputation, most of them were applied in the domains of clinical disease, computer science, and economics [19-22,24]. Few studies have focused on the missing data problem in air pollution. Because the data structure and modeling environment differ between air pollution and other domains (e.g., image recognition and speech recognition), it is necessary to design a specialized strategy for missing data imputation for air pollution data.

Moreover, previous studies have proved that complicated imputation methods can deliver state-of-the-art performance for missing data handling compared with deletion methods and simple imputation methods; however, most of these techniques were applied to discontinuous missing values, or to consecutive missing values with a small missing rate of less than 30% [14,21]. Discontinuous missing values are expected to be easier to fill, because consecutive missing values provide no information in the middle of the time series to adjust the imputation. Therefore, the performance of these methods becomes unsatisfactory when the missing values are consecutive and the missing rate is significant. For example, Fig. 1 simulates the imputation results of several commonly used methods for long-period consecutive missing data. Assume the values within the dotted box are consecutively missing. Lines 1 and 2 represent the imputation results of mean imputation and median imputation, respectively. Line 3 presents the imputation performance based on the nearest station, while Line 4 gives the imputation results of models that predict the current observation from previous observations using ARIMA. Line 5 represents the original data. It can be observed that for long-interval consecutive missing values, linear interpolations such as mean and median imputation replace the gap with a straight line. The algorithms based on similarity measurement can roughly estimate the changing trend of the data, but their filled values deviate substantially from the real values. The results of models like ARIMA, furthermore, exhibit good consistency with the actual values at first but become flat later, since such models cannot learn from the long-term dependencies of air pollutants and the number of observed values is limited. In fact, few studies have tried to find an appropriate method to improve the imputation performance for long-interval consecutive missing values [25].

To address these gaps, this paper proposes a method, namely transferred long short-term memory-based iterative estimation (TLSTM-IE). The novelty and contributions of this study lie in two aspects. First, we target long-interval consecutive missing data, which has not been addressed by existing studies in atmospheric research. Second is the novel application of transfer learning to missing value imputation. To our best knowledge, no research on air quality has applied this technique to this problem before. The results of our experiments exhibit promising performance, and this could provide a new methodological direction for follow-up research.

The proposed model takes advantage of the long short-term memory (LSTM) neural network, transfer learning, and iterative estimation. Its main idea is to first learn from the complete sequence that is most similar to the sequence with missing values, then to transfer the knowledge to the incomplete sequence using the LSTM model, and finally to fill the missing values with iterative estimation. Therefore, the missing values can be estimated effectively based on the observations of similar time series and follow their long-term moving trends.

To validate the effectiveness of the proposed method, this paper performs the imputation of long-period consecutive missing values on PM2.5 concentrations in New York City as a case study. A series of experiments are conducted to compare the performance of the proposed model with other models. The research framework of this paper is presented in Fig. 2. Results show that the proposed method can effectively improve the estimation accuracy for long-interval consecutive missing values.

2. Methodology framework

Fig. 3 presents the methodology framework proposed in this paper. It is composed of two parts: preprocessing and modeling. The first part collects, cleans, and normalizes the experimental data. The most similar sequence for the incomplete sequence is also defined as the reference. Then, to achieve better imputation performance, the base LSTM models built on the reference sequence and the transferred LSTM models on the target sequence are trained and optimized in the second part. After that, the proposed algorithm combines iterative estimation with the transferred LSTM. The model keeps iterating until the difference between the initialized and the predicted values is smaller than a pre-set threshold. Details of the framework are presented as follows.

2.1. Most similar sequence

As shown in Fig. 3, LSTM and transfer learning are the two primary techniques utilized in this paper. Transfer learning is applied based on the similarity between the base domain and the target domain. The more similar the domains are, the better the transfer performance will be. Therefore, after data collection, cleaning, and normalization, the most similar sequence (MSS) of the target sequence needs to be defined first [26]. In this paper, the sequence containing missing values is the target sequence, while the other complete sequences are the potential base sequences.

There are several approaches for similarity measurement, such as the Chi-square similarity [27], Pearson correlation coefficient [28], and Kendall tau coefficient [29]. In general, the selection of the similarity measurement relies on the data attributes of the features or time series.

Fig. 1. Imputation results of several commonly seen methods for consecutive missing values.
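For reference, the simple baselines compared in Fig. 1 can be written in a few lines. The sketch below is illustrative only (it is not the implementation used in this paper) and assumes a 1-D NumPy array in which NaNs mark the missing block; the k-NN variant uses scikit-learn's KNNImputer on an hours-by-stations matrix.

    import numpy as np
    from sklearn.impute import KNNImputer

    def mean_impute(series):
        # Fill every missing value with the mean of the observed values.
        out = series.copy()
        out[np.isnan(out)] = np.nanmean(out)
        return out

    def median_impute(series):
        # Same idea with the median, which is more robust to outliers.
        out = series.copy()
        out[np.isnan(out)] = np.nanmedian(out)
        return out

    def knn_impute(matrix, k=5):
        # Rows are time steps, columns are stations; a gap in one column
        # is filled from the k most similar complete rows.
        return KNNImputer(n_neighbors=k).fit_transform(matrix)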

For categorical features, Chi-square is a better choice; for ordinal or non-normally distributed features, the Kendall or Spearman coefficient is more suitable; and for numerical features, the Pearson correlation is the most common choice. Compared with Kendall, although Pearson is less non-linear, it can better capture the details between the trends of two numerical time series [30]. Therefore, this paper adopts the Pearson correlation coefficient to find the MSS.

In this paper, we mark the incomplete sequence as $S^m$. It can be further expressed as $S^m = [S_1^m; S_2^m; S_3^m]$, where $S_2^m$ is the missing segment, and $S_1^m$ and $S_3^m$ are the data before and after the missing segment. When $S_2^m$ is at the beginning or the end of $S^m$, $S_1^m$ or $S_3^m$ can be $\varnothing$. A complete sequence $S^i$, $1 \le i \le N$, where $N$ is the total number of air stations and $i \ne m$, can also be divided into three sub-sequences based on the same time points as $S^m$, so that $S^i = [S_1^i; S_2^i; S_3^i]$. The most similar sequence is selected through the Pearson correlation between the incomplete sequence $S^m$ and each complete sequence $S^i$ as follows:

$$S^S = \arg\max_{S^i} |\mathrm{Corr}(S^m, S^i)|, \quad 1 \le i \le N, \; i \ne m \tag{1}$$

$$\mathrm{Corr}(S^m, S^i) = \rho([S_1^m; S_3^m], [S_1^i; S_3^i]) \tag{2}$$

where $S^S$ represents the most similar sequence and $\rho$ denotes the Pearson correlation coefficient.

Furthermore, besides the most similar sequence $S^S$, $S^{SS}$ is also calculated and selected for further use. It is the complete sequence most similar to $S^S$.

2.2. Rolling window sampling

$S^S$ needs to be transformed into time series samples for model training after it is determined. To this end, the rolling window (RW) method is utilized.

For a temporal sequence $S = [s_0, s_1, \ldots, s_{T-1}, s_T]$ with sequence length $T$, RW transforms the observations into time-series samples based on the rolling window size $r$ [31]. The formulation of a time series sample is shown in Eq. (3):

$$\{[s_{t-r}, \ldots, s_{t-2}, s_{t-1}] \to s_t\} \tag{3}$$

where $s_t$ represents the prediction target and $[s_{t-r}, \ldots, s_{t-2}, s_{t-1}]$ represents the inputs of the model. Typically, a larger $r$ means fewer time-series samples but more inputs per sample, while a smaller $r$ means more time-series samples but fewer inputs per sample. The most appropriate value of the rolling window size therefore needs to be determined later.

In this study, for a complete sequence $S^i$, the formation of its time-series samples follows Eq. (3). For the incomplete sequence $S^m = [S_1^m; S_2^m; S_3^m]$ with $S_2^m$ as the missing segment, the time series samples constructed by RW can be expressed as Eq. (4):

$$\{[s_{t_1-r}^m, \ldots, s_{t_1-2}^m, s_{t_1-1}^m] \to s_{t_1}^m\} \cup \{[s_{t_3-r}^m, \ldots, s_{t_3-2}^m, s_{t_3-1}^m] \to s_{t_3}^m\} \tag{4}$$

where $t_1$ and $t_3$ denote time points in the sub-sequences $S_1^m$ and $S_3^m$, respectively.
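A minimal sketch of Eqs. (1)-(4) is given below, assuming each station's series is a 1-D NumPy array with NaNs in the missing segment; it is meant to make the matching and sampling schemes concrete rather than to reproduce the authors' code.

    import numpy as np

    def most_similar_sequence(target, candidates):
        # Eqs. (1)-(2): correlate the observed parts [S1m; S3m] of the
        # target with the corresponding parts of each complete candidate.
        observed = ~np.isnan(target)
        scores = [abs(np.corrcoef(target[observed], c[observed])[0, 1])
                  for c in candidates]
        return int(np.argmax(scores))          # index of the MSS, S^S

    def rolling_window_samples(series, r):
        # Eq. (3): inputs [s_{t-r}, ..., s_{t-1}] -> target s_t. Samples
        # that touch the missing block (NaN) are skipped, yielding Eq. (4).
        X, y = [], []
        for t in range(r, len(series)):
            window = series[t - r:t]
            if not np.isnan(window).any() and not np.isnan(series[t]):
                X.append(window)
                y.append(series[t])
        return np.asarray(X), np.asarray(y)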

Fig. 2. Research framework.


Fig. 3. Methodology framework.

2.3. Long short-term memory (LSTM)

The estimation model, LSTM, is a special architecture of the Recurrent Neural Network (RNN) proposed by Hochreiter and Schmidhuber [32]. It overcomes the exploding/vanishing gradient problem that results from the gradient propagation of the recurrent network over many layers and is capable of learning from long-term dependencies [33,34]. Similar to other artificial neural networks (ANNs), LSTM contains three kinds of layers: an input layer, an output layer, and a plurality of hidden layers [35]. The difference between LSTM and other ANNs is that the hidden LSTM layers are composed of special LSTM units. These units allow a value or gradient that flows into the unit to be preserved and subsequently retrieved at the required time step [36]. The architecture of an LSTM unit is presented in Fig. 4. Its primary component is the memory cell $C_t$, which is controlled by several gates and can flexibly keep or reset information. To be more specific, three gates are designed to control whether to forget the knowledge of the last cell (forget gate $f_t$), to admit the inputs at the current moment (input gate $i_t$), and to output the new cell state (output gate $o_t$). Mathematically, the cell and the three gates at time $t$ can be expressed as follows:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{5}$$

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{6}$$

$$\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{7}$$

$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \tag{8}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{9}$$

$$h_t = o_t \circ \tanh(C_t) \tag{10}$$

where $x_t$ represents the input to the current unit, $h_{t-1}$ is the hidden state of the last unit, $h_t$ is the state of the current unit, $\tilde{C}_t$ represents a modulation of the input (the candidate cell state), $W$ and $U$ represent the input and recurrent connection weights between neurons, $b$ represents the bias, $\circ$ denotes element-wise multiplication, and $\sigma$ denotes the sigmoid activation used by the gates.

Fig. 4. The architecture of an LSTM unit.
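Written out in NumPy, a single unit update of Eqs. (5)-(10) looks as follows; this is a didactic re-statement with separate input weights W and recurrent weights U, not the library cell used in the experiments.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        # One LSTM unit update; W, U, b are dicts keyed by 'f', 'i', 'c', 'o'.
        f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # Eq. (5)
        i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # Eq. (6)
        c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. (7)
        c_t = f_t * c_prev + i_t * c_tilde                          # Eq. (8)
        o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # Eq. (9)
        h_t = o_t * np.tanh(c_t)                                    # Eq. (10)
        return h_t, c_t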
applied. Transfer learning is a technique that can learn and transfer the
Owing to its powerful learning ability for temporal sequential data, LSTM has been adopted by many studies in the past decades and has achieved excellent performance in speech recognition, image segmentation, traffic volume prediction, etc. [37-39]. Nevertheless, few studies have utilized LSTM for missing data imputation. Tian et al. [14] employed LSTM to fill missing data for traffic flow. However, their study had the problem of generating flat values when the missing data are long-period consecutive. This is because temporal models and other machine learning algorithms, such as LSTM, use previous values to predict the current value. If too much historical information is missing, the model cannot learn enough knowledge and generates dumb, flat results.

In this paper, this problem is addressed by adjusting the inputs of the model. Specifically, for an incomplete sequence $S^m$, instead of using $[s_{t-r}^m, \ldots, s_{t-2}^m, s_{t-1}^m]$ as the input, this paper replaces $s_{t-1}^m$ with $s_{t-1}^S$, which is the value of $S^S$ at time $t-1$. Therefore, the value of $s_t^m$ is predicted using $[s_{t-r}^m, \ldots, s_{t-2}^m, s_{t-1}^S]$. In this way, the model not only avoids the flat value problem but also keeps the information of its own series to adjust the predicted value.

2.4. Transfer learning

The scarcity of temporal information is another reason for the unsatisfactory imputation performance of temporal models on long-interval consecutive missing values. To address this concern, transfer learning is applied. Transfer learning is a technique that can learn and transfer knowledge from one domain to another related field [40]. It is often used when the model for the target domain is too complicated or when the target domain does not contain enough data. In this paper, transfer learning is responsible for learning and transferring the knowledge from the MSS to the target sequence, which contains long-interval consecutive missing values.

Based on the transferred contents, transfer learning can be divided into sample transfer, model transfer, feature transfer, and relation transfer [40]. In this paper, since we need to learn and model the pattern and trend of a complete sequence, model transfer is adopted. Although model transfer may take a longer training time than other types of transfer learning, it usually has the best performance [40]. It pre-trains a base model on the data of the base domain and fine-tunes and tests the model using the data of the target domain. The working process of model transfer herein is presented in Fig. 5. A base LSTM model, marked as $Model_B$, is developed based on the MSS $S^S$ and $S^{SS}$. Note that the input form of the base model is $\{[s_{t-r}^S, \ldots, s_{t-2}^S, s_{t-1}^{SS}] \to s_t^S\}$ instead of $\{[s_{t-r}^S, \ldots, s_{t-2}^S, s_{t-1}^S] \to s_t^S\}$, for the reason presented in Section 2.3. After the base model is obtained, its first several hidden layers are frozen, and the remaining layers are fine-tuned and tested using the data from $S^m$ and $S^S$. The sample form is $\{[s_{t-r}^m, \ldots, s_{t-2}^m, s_{t-1}^S] \to s_t^m\}$. The resulting model is marked as $Model_{Tr}$, which is used for missing data estimation in the following steps.

Fig. 5. Model transfer.
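In Keras, the model transfer of Fig. 5 can be sketched as below: clone the base network, freeze its first LSTM layers, and fine-tune the rest on the target samples. This is an illustrative sketch rather than the authors' published code; the default of one frozen layer follows the optimum found later in Section 4.3.

    from tensorflow import keras

    def make_transferred_model(model_b, n_frozen=1, learning_rate=1e-3):
        # Copy Model_B (architecture and weights) to obtain Model_Tr.
        model_tr = keras.models.clone_model(model_b)
        model_tr.set_weights(model_b.get_weights())
        frozen = 0
        for layer in model_tr.layers:
            # Freeze the first n_frozen LSTM layers; fine-tune the rest.
            if isinstance(layer, keras.layers.LSTM) and frozen < n_frozen:
                layer.trainable = False
                frozen += 1
        model_tr.compile(optimizer=keras.optimizers.Adam(learning_rate),
                         loss='mse')
        return model_tr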

2.5. Transferred LSTM-based iterative estimation (TLSTM-IE)

To further improve the imputation accuracy for long-interval consecutive missing data, iterative estimation is combined with LSTM and transfer learning. Iterative estimation is an important algorithm for signal filtering, parameter and state estimation, and the solution of matrix equations [41,42]. It has been widely adopted and has achieved excellent performance. Instead of generating results from a simple one-time estimation, iterations help the estimator recursively improve the estimation accuracy until a variable meets the pre-set threshold.

In this paper, a transferred LSTM-based iterative estimation (TLSTM-IE) model is proposed. Specifically, for an incomplete sequence $S^m$, TLSTM-IE initializes the missing values using the mean values of $S^m$ and $S^S$. Then it uses $Model_{Tr}$ to iteratively estimate the missing values. During training, the memory cell $C_0$ and the hidden state $h_0$ of the LSTM units are initialized as zeros. The missing values calculated by $Model_{Tr}$ in each iteration are combined with the values from the last iteration for an update. When the difference between the missing values given by two successive iterations is smaller than the pre-set value, TLSTM-IE stops training and generates the final results. The pseudo-code of this process is presented in Table 1, and the notation used is described in Table 2.

Table 1
TLSTM-IE algorithm.

Algorithm 1: TLSTM-IE
Function TLSTM-IE(Model_Tr, S^m, S^S, k, l, max_iterators, min_delta, α, Δα):
    M_{S,m} = Mean(S^S) - Mean([S^m_(0,k-1); S^m_(k+l,Lm-1)])
    S^m_(k,k+l-1) = S^S_(k,k+l-1) - M_{S,m}
    for i = 0, 1, ..., max_iterators - 1 do:
        S_temp = []
        for j = 0, 1, ..., l - 1:
            S_temp[j] = Model_Tr([s^m_(k+j-r), ..., s^m_(k+j-2), s^S_(k+j-1)])
        S_pred = α * S^m_(k,k+l-1) + (1 - α) * S_temp
        if |S^m_(k,k+l-1) - S_pred| < min_delta:
            break
        S^m_(k,k+l-1) = S_pred
        α = α + Δα
    End

Table 2
Summary of notation.

S^m: time sequence with long-interval consecutive missing values
Len(·): length of a time sequence; L_m represents the length of S^m
S^S: complete time sequence that is the most similar sequence (MSS) of S^m
S^SS: complete time sequence that is the MSS of S^S
r: rolling window size
t: time point
Mean(·): mean value of a time sequence
l: length of the missing block
k: index of the first missing value in the sequence
max_iterators: maximum number of iterations
min_delta: minimum change of the missing values between two consecutive iterations
α: fraction of the initialized missing values (or of the missing values given by the last iteration) that is retained in the current missing values
Δα: change of α between two consecutive iterations
S_temp: missing values given by Model_Tr in each loop
S_pred: missing values of the current iteration after combining the missing values of the last iteration with S_temp
j: the jth missing value

To sum up, this paper proposes a transferred LSTM-based iterative estimation (TLSTM-IE) model built on the long short-term memory (LSTM) model, transfer learning, and iterative estimation, to improve the imputation accuracy for long-interval consecutive missing values. To validate the effectiveness of the proposed methodology, a case study is conducted in which the imputation performance of the model for consecutive PM2.5 concentration missing values is explored.

3. Case study

3.1. Data collection and preprocessing

This paper selected New York City (NYC) for the case study because NYC offers a convenient open data platform where researchers can access many kinds of city-related data, such as air pollution, transportation, and meteorological data. This could potentially help extend our method to other interdisciplinary research topics.

Two years (2016-2017) of hourly PM2.5 concentration data from 11 monitoring stations in NYC are collected from the United States Environmental Protection Agency (EPA). The distribution of the 11 stations is presented in Fig. 6; each green point represents one station. The collected data of each station contain 17,544 records, and the temporal resolution is 1 h. The mean values and standard deviations of the records of the 11 stations are listed in Table 3. To mitigate the impacts of dimension and speed up the training, the PM2.5 data are normalized into [0,1].
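Before moving to the experiments, note that Algorithm 1 of Section 2.5 translates almost line for line into Python. In the sketch below, model_tr is assumed to wrap the transferred network's one-step prediction, s_m and s_s are NumPy arrays with k >= r; it is a reading aid for Table 1 rather than the authors' exact implementation.

    import numpy as np

    def tlstm_ie(model_tr, s_m, s_s, k, l, r, max_iterators=100,
                 min_delta=1e-4, alpha=0.7, delta_alpha=0.05):
        # Initialize the missing block from the MSS, shifted by the mean
        # offset between the two series (first two lines of Algorithm 1).
        observed = np.concatenate([s_m[:k], s_m[k + l:]])
        m_sm = s_s.mean() - observed.mean()
        s_m[k:k + l] = s_s[k:k + l] - m_sm
        for _ in range(max_iterators):
            s_temp = np.empty(l)
            for j in range(l):
                # Window ends with the MSS value at the previous step
                # (input adjustment of Section 2.3).
                window = np.append(s_m[k + j - r:k + j - 1], s_s[k + j - 1])
                s_temp[j] = model_tr(window)
            s_pred = alpha * s_m[k:k + l] + (1.0 - alpha) * s_temp
            if np.max(np.abs(s_m[k:k + l] - s_pred)) < min_delta:
                break
            s_m[k:k + l] = s_pred
            alpha += delta_alpha
        return s_m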


Fig. 6. The distribution of the 11 monitoring stations in NYC.
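The preprocessing of Section 3.1 amounts to pivoting the EPA records into an hours-by-stations matrix and min-max scaling each column. The file and column names in the sketch below are placeholders, not the actual EPA field names.

    import pandas as pd

    # Placeholder file/column names: one row per (station, timestamp) record.
    df = pd.read_csv('epa_pm25_nyc_2016_2017.csv', parse_dates=['datetime'])

    # Pivot to an hours-by-stations matrix (17,544 rows x 11 columns).
    pm = df.pivot_table(index='datetime', columns='site_id',
                        values='pm25').sort_index()

    # Min-max normalize each station's series into [0, 1].
    pm_norm = (pm - pm.min()) / (pm.max() - pm.min())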

Table 3
Statistical characteristics of the 11 stations.

Site ID | Mean | Standard deviation | 25% | 75%
1 | 6.8162 | 5.3471 | 3.2 | 9.4
2 | 6.5605 | 5.0953 | 3.0 | 9.3
3 | 8.2790 | 5.7911 | 4.1 | 11.2
4 | 8.0322 | 4.9986 | 4.6 | 10.6
5 | 8.3842 | 6.3900 | 4.1 | 11.3
6 | 7.3429 | 5.1045 | 3.8 | 10.0
7 | 6.2749 | 4.6407 | 3.0 | 8.7
8 | 6.3918 | 4.9893 | 2.9 | 9.0
9 | 6.4958 | 5.1640 | 2.8 | 9.3
10 | 6.8930 | 4.5666 | 3.7 | 9.2
11 | 7.1370 | 5.1540 | 3.3 | 9.7
Total | 7.1461 | 5.2775 | 3.5 | 9.8

3.2. Experiment setup

Based on the 11 monitoring stations, 11 air quality sequences are obtained. Since the collected data do not contain any missing values, to evaluate the imputation performance of the proposed TLSTM-IE model for long-interval consecutive missing values, this paper treats one sequence $S^m$ as an incomplete sequence. For illustration, $S^8$ is selected. As mentioned in Section 2.1, $S^m$ can be divided into three segments, $S_1^m$, $S_2^m$, and $S_3^m$, based on the observed time. Then, according to $k$, $l$, and $L_m$, these three segments can be further expressed as $S^m_{(0,k-1)}$, $S^m_{(k,k+l-1)}$, and $S^m_{(k+l,L_m-1)}$, respectively, and the missing rate is $mr = l/L_m$. In this paper, $S_2^m$ is set as the missing segment. Together, $k$, $l$, and $mr$ determine the relative position ($rp$) and the length of the missing block $S_2^m$. To fill the missing values of $S^m$, TLSTM-IE requires the most similar sequence of $S^m$ to be identified first. To this end, the Pearson correlation coefficient is utilized to calculate the correlation between sequences. For illustration purposes, we pre-set the missing rate as 0.1 and assume $S_2^m$ is in the middle of $S^m$. Then $k = (\mathrm{Len}(S^m) - l)/2$ and $l = mr \cdot \mathrm{Len}(S^m)$.

Based on the calculation process described in Section 2.1, correlation coefficients between the incomplete sequence and each complete sequence are computed. Results show that when $S^8$ is selected as the incomplete sequence (target sequence), $S^7$ is the MSS (base sequence) and $S^6$ is the $S^{SS}$. Therefore, in the following parts of the study, these three sequences are used as examples; modeling results for the other sequences are discussed later.

Note that when calculating the MSS, this study did not include station attributes, such as whether the station is near a factory, park, or road, due to data availability. The authors think the station type may help in identifying a more similar station for transfer learning. In fact, however, these "type" characteristics are all reflected in the time series values. The numerical patterns in the time series can be quite different between a "factory" station and a "park" station; such stations are less likely to be identified as the most similar through correlation coefficients and will not be selected for transfer learning.

After the experimental sequences are set up, the numerical values of these sequences are transformed into time-series samples. The rolling window (RW) is adopted to accomplish this task, as described in Section 2.2. Based on RW, the time-series samples of $S^8$ can be constructed as Eq. (11), and the time series samples of $S^7$ are formulated as Eq. (12):

$$\{[s_{t_1-r}^8, \ldots, s_{t_1-2}^8, s_{t_1-1}^8] \to s_{t_1}^8\} \cup \{[s_{t_3-r}^8, \ldots, s_{t_3-2}^8, s_{t_3-1}^8] \to s_{t_3}^8\}, \quad 0 < t_1 < k-1, \; k+l < t_3 < L_m-1 \tag{11}$$

$$\{[s_{t_1-r}^7, \ldots, s_{t_1-2}^7, s_{t_1-1}^7] \to s_{t_1}^7\} \tag{12}$$

where $r$ represents the rolling window size, whose most appropriate value is optimized later.

4. Results and discussion

4.1. Base model training

After the time series samples are obtained, they are input into the proposed model. According to the methodology framework, a base LSTM model needs to be trained first using the base sequence. Based on the results of the Pearson correlation, $S^7$ is selected as the base sequence for this part. Note that for base model training, the form of the model input is $\{[s_{t_1-r}^7, \ldots, s_{t_1-2}^7, s_{t_1-1}^6] \to s_{t_1}^7\}$ instead of $\{[s_{t_1-r}^7, \ldots, s_{t_1-2}^7, s_{t_1-1}^7] \to s_{t_1}^7\}$.

Following the study of Li et al. [5], a base stacked LSTM model with four kinds of layers is designed in this paper, as shown in Fig. 7. It is composed of one input layer, a plurality of LSTM layers, one fully connected (FC) layer, and one output layer.
Fig. 7. Architecture of the base LSTM model.
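A Keras sketch of the stacked architecture in Fig. 7 is given below; the defaults follow the grid-search optimum reported in Section 4.1 (r = 6, three LSTM layers of 32 neurons, an FC layer of 1024 neurons), while the ReLU activation of the FC layer is an assumption.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_base_lstm(r=6, lstm_layers=3, lstm_neurons=32, fc_neurons=1024):
        # Input -> stacked LSTM layers -> fully connected layer -> 1-step output.
        model = keras.Sequential([keras.Input(shape=(r, 1))])
        for i in range(lstm_layers):
            # Intermediate LSTM layers return full sequences for stacking.
            model.add(layers.LSTM(lstm_neurons,
                                  return_sequences=(i < lstm_layers - 1)))
        model.add(layers.Dense(fc_neurons, activation='relu'))  # assumption
        model.add(layers.Dense(1))
        model.compile(optimizer='adam', loss='mse')
        return model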


An LSTM network involves many hyperparameters, such as the batch size (batch_size), the number of LSTM layers (LSTM_Layers), the number of neurons in each LSTM layer (LSTM_Neurons), and the number of neurons in the FC layer (FC_Neurons). These parameters profoundly influence the accuracy and convergence speed of the model. The rolling window size $r$, which affects the number of inputs and samples of the model, is also optimized. Therefore, to achieve better performance, the optimal values of these parameters need to be determined.

Root mean square error (RMSE) and R-square are used to evaluate model performance. The reasons for choosing these two indicators lie in two aspects. First, the benchmark should be reasonable and commonly used; otherwise, it is not easy to judge the level of performance. Based on this rule, we first picked RMSE. Second, it is preferable if the selected indicator allows further comparison with similar studies in atmospheric research, even if they use different datasets to test missing value imputation. Therefore, we picked one of the most commonly used indicators of the explained variance of regression models, the R-square value. The calculations of RMSE and R-square are presented in Eqs. (13) and (14); a smaller RMSE and a higher R-square reflect better modeling performance:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \tag{13}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \tag{14}$$

where $n$ represents the number of samples, $y_i$ denotes the observed value of the $i$th sample, $\hat{y}_i$ is the predicted value of the $i$th sample, and $\bar{y}$ is the mean value of the observations.

To figure out the optimal values of the previously mentioned parameters, we implemented a grid search over the candidates. After referring to the existing literature [29,30,43-45], we pre-set the candidates of batch_size as {32, 64, 128, 256, 512, 1024}, $r$ as {3, 6, 12, 24, 48, 96}, LSTM_Layers as {3, 4, 5, 6}, LSTM_Neurons as {32, 64, 128}, and FC_Neurons as {128, 256, 512, 1024, 2048}. For a stable and reliable comparison, we implemented 10-fold cross-validation during modeling and testing [46]. This technique first separates the dataset into ten non-overlapping segments and then conducts the training and testing process ten times. Each time, it uses the pre-set parameters to train the model on nine segments and tests the model performance on the remaining segment; averaging the test performance of these ten rounds gives the result. During this process, each sample is used for training and is tested once, which provides more reliable and stable test results than a traditional 3/7 or 2/8 testing and training partition [47].

After these experiments, the optimal values are determined as batch_size = 512, r = 6, FC_Neurons = 1024, LSTM_Layers = 3, and LSTM_Neurons = 32. This set of parameters provides the minimum 10-fold CV RMSE and the corresponding R-square, which are 0.0864 and 0.7161, respectively.

4.2. Base model comparison

Furthermore, to prove that LSTM is a reasonable choice for the base model, its imputation performance is compared with six other commonly used machine learning models: autoregressive integrated moving average (ARIMA), support vector regression (SVR), LASSO regression, Ridge regression, artificial neural network (ANN), and recurrent neural network (RNN). For comparison, the parameters of ARIMA are optimized using the Bayesian information criterion (BIC). The parameters of SVR, LASSO regression, and Ridge regression are also optimized using 10-fold cross-validation. The network structures and parameters of ANN and RNN are optimized using the same method used for LSTM in this paper. Since the window size ($r$) is set as 6, the number of samples generated for base model cross-validation is 17,538 in $S^7$.

The comparison results of these models are shown in Table 4. Compared with traditional machine learning models such as ARIMA, SVR, LASSO regression, and Ridge regression, the neural networks have lower RMSE values on average. Compared with typical ANNs, temporal neural networks such as RNN and LSTM yield smaller imputation errors. Furthermore, the LSTM adopted in this paper has the lowest RMSE of 0.0864 (with an R-square of 0.7161) compared with the other models. This proves that LSTM is a reasonable choice for this study.

Table 4
Base model comparison.

Model | RMSE | R2
ARIMA | 0.1094 | 0.5445
SVR | 0.0873 | 0.7044
LASSO | 0.0872 | 0.7053
Ridge | 0.0872 | 0.7054
ANN | 0.0870 | 0.7063
RNN | 0.0870 | 0.7066
LSTM | 0.0864 | 0.7161

4.3. Frozen layers identification

After the base LSTM model is obtained from the base sequence, it is fine-tuned and tested on the target sequence to generate a transferred model, $Model_{Tr}$. As shown in Section 2.4, before fine-tuning the base model, the number of frozen layers needs to be determined. If the number of frozen layers is too large, the model cannot learn enough patterns of the target sequence; if the number is too small, the model may not retain enough patterns or knowledge gained from the base sequence. Therefore, it is necessary to figure out the optimal number of frozen layers. To this end, this paper sets the number of frozen layers as {1, 2, 3, 4} and compares the performance of $Model_{Tr}$ with different numbers of frozen layers using 10-fold CV. Samples generated from $S_1^8$ and $S_3^8$ were used to tune the number of frozen layers. The results are presented in Table 5. When the number of frozen layers equals 1, the model has the lowest RMSE of 0.0716. Therefore, the number of frozen layers is set as 1, and the remaining two layers are fine-tuned using the data of $S^8$. Note that the RMSEs in Table 5 are lower than those in Table 4, while the $R^2$ values are also lower. This is because the numbers in Table 5 are calculated for $S^8$ using the transferred model, while those in Table 4 are for $S^7$.

Table 5
The optimization results for the number of frozen layers.

Number of frozen layers | 1 | 2 | 3 | 4
RMSE | 0.0716 | 0.0718 | 0.0717 | 0.0720
R2 | 0.6764 | 0.6755 | 0.6763 | 0.6729

4.4. α determination

The last part of the proposed methodology framework is combining $Model_{Tr}$ with iterative estimation to estimate the missing values. For the proposed TLSTM-IE, $\alpha$ is an important parameter. It represents the fraction of the initialized missing values, or of the missing values given by the last iteration, that is retained in the current missing values. Therefore, it highly influences the values of the predicted data. If $\alpha$ is too large, the final results will be similar to the initialized values; if $\alpha$ is too small, the final results will be very close to the values given by directly applying $Model_{Tr}$. To identify the most appropriate value of $\alpha$, the RMSE values of TLSTM-IE with different $\alpha$ are calculated.
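The α selection can be scripted as a simple sweep over the candidate values, reusing the tlstm_ie sketch from Section 2.5 on an artificially masked block; the helper below is hypothetical and only illustrates the procedure described in the text.

    import numpy as np

    def select_alpha(model_tr, s_true, s_s, k, l, r,
                     alphas=np.arange(0.1, 1.0, 0.1), delta_alpha=0.05):
        # Try each candidate alpha and keep the one with the lowest RMSE
        # on the artificially masked block.
        best_alpha, best_rmse = None, np.inf
        for a in alphas:
            s_m = s_true.copy()
            s_m[k:k + l] = np.nan
            filled = tlstm_ie(model_tr, s_m, s_s, k, l, r,
                              alpha=a, delta_alpha=delta_alpha)
            rmse = np.sqrt(np.mean((filled[k:k + l] - s_true[k:k + l]) ** 2))
            if rmse < best_rmse:
                best_alpha, best_rmse = a, rmse
        return best_alpha, best_rmse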


The candidates of $\alpha$ are chosen from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, $\Delta\alpha$ is set as 0.05, and min_delta for the RMSE is set as 0.0001. It was found that when $\alpha = 0.7$, TLSTM-IE has the lowest RMSE of 0.0607.

In addition, comparing the lowest RMSE of TLSTM-IE with that of $Model_{Tr}$, it can be observed that the value decreases from 0.0716 to 0.0607, improving the imputation accuracy by around 15.22%. This proves that it is necessary and effective to combine iterative estimation with the transferred LSTM model.

4.5. Model evaluation

4.5.1. Experiments on different missing positions

To further evaluate the model performance, three groups of experiments are designed to examine the generalization of the proposed model to different missing positions, different missing rates, and different sequences.

First, to investigate the performance of TLSTM-IE for different missing positions, the experiment sets the missing rate as 0.1. $S^8$ is selected as $S^m$, $S^7$ as $S^S$, and $S^6$ as $S^{SS}$. The relative position of the missing segment $S_2^8$ can be expressed as $rp = \mathrm{Len}(S_1^8)/\mathrm{Len}(S^8)$. Then $k = rp \cdot \mathrm{Len}(S^8)$ and $l = 0.1 \cdot \mathrm{Len}(S^8)$. The values of $rp$ are set as {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}; when $rp = 0$, $S_2^8$ is at the beginning of $S^8$. For a more comprehensive experiment, we also referred to some of the latest literature [14,18-21] and picked other methods for comparison: mean imputation, the expectation-maximization algorithm (EM), k-nearest neighbor imputation (k-NN), ARIMA, SVR, LSTM, and the transferred LSTM model without iterative estimation (TLSTM).

The RMSE values of TLSTM-IE and the contrast methods for missing blocks at different positions are shown in Table 6. It can be seen that for missing blocks at different positions, the RMSE scores of LSTM, TLSTM, and TLSTM-IE are always lower than those of the other imputation methods.

To further evaluate the superiority of the proposed model, the rate of improvement on RMSE (RIR) of TLSTM-IE over the other methods is introduced. The calculation of RIR is presented in Eq. (15), and the results are shown in Table 6. The last column shows that all the RIR values of TLSTM-IE over the other methods are positive, which means the proposed TLSTM-IE outperforms the other methods in the experiment. Especially for the mean imputation, EM algorithm, and k-NN models, the RIR values are higher than 40%. For the original LSTM rolling window model, the RIR values are around 27%, which indicates that transfer learning can effectively improve the interpolation accuracy of traditional machine learning methods and neural networks. The RIR of TLSTM-IE over TLSTM reflects that the iterative method can improve the interpolation accuracy by around 10%.

$$\mathrm{RIR}_B^A = \frac{\mathrm{RMSE}_{\mathrm{Benchmark}\,B} - \mathrm{RMSE}_{\mathrm{Result}\,A}}{\mathrm{RMSE}_{\mathrm{Benchmark}\,B}} \times 100\% \tag{15}$$

Table 6
RMSE values of TLSTM-IE and contrast methods for missing blocks at different positions when rp is set as {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}.

Algorithm | Average RMSE | RIR of TLSTM-IE over this method
Mean | 0.1311 | 42.08%
EM | 0.1325 | 42.67%
k-NN | 0.1437 | 47.15%
ARIMA | 0.1263 | 39.66%
SVR | 0.1165 | 34.61%
LSTM | 0.1047 | 27.46%
TLSTM | 0.0848 | 10.43%
TLSTM-IE | 0.0759 | 0.00%

4.5.2. Experiments on different missing rates

To explore the performance of the proposed model for long-interval consecutive missing values with different missing rates, this experiment assumes that the relative position of the missing segment is in the middle of the sequence. $S^8$ is selected as $S^m$, $S^7$ as $S^S$, and $S^6$ as $S^{SS}$; $k = (\mathrm{Len}(S^8) - l)/2$ and $l = mr \cdot \mathrm{Len}(S^8)$. The values of $mr$ are set as {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Again, the other seven methods are adopted for comparison. The results are presented in Table 7 and Fig. 8. For all the methods, the RMSE generally increases as the missing rate becomes larger. This is reasonable: the more missing data there are, the less information the methods can utilize for imputation, which leads to a higher RMSE. It can also be seen from Fig. 8 that these methods exhibit three different trends; for better illustration, the three groups are plotted in different colors. The first group includes mean imputation, the EM algorithm, and k-NN. These three methods have high RMSE values overall, and their performance does not change much as the missing rate gets higher. The second group comprises the model-based methods, which build either a time series model or a rolling window model to train and predict. Their performance exceeds that of the first group quite significantly in the beginning, but as $mr$ gets higher than 0.5, their RMSE increases considerably. This is also reasonable because their performance relies more on the recent historical values around the missing spot; if longer-interval consecutive values are missing, their performance worsens.

The last group comprises the transfer learning-based methods. It can be seen from Fig. 8 that this group generally keeps a stable performance like the first group. This is because the idea behind these methods is to learn from a similar series: when modeling the inputs and the output, our methods also include data from similar stations to better control the trend. This helps maintain high performance even when the missing rate gets larger.

Table 7
Modeling performance of TLSTM-IE and other methods for missing segments when mr is set as {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}.

Algorithm | Average RMSE | RIR of TLSTM-IE over this method
Mean | 0.1306 | 48.04%
EM | 0.1318 | 48.52%
k-NN | 0.1373 | 50.58%
ARIMA | 0.1260 | 46.17%
SVR | 0.1213 | 44.07%
LSTM | 0.1131 | 40.01%
TLSTM | 0.0768 | 11.62%
TLSTM-IE | 0.0678 | 0.00%

Fig. 8. RMSE performance of different methods with different missing rates.

The average RIR values of TLSTM-IE over the other methods under different missing rates are also calculated and presented in Table 7. Overall, the RIR values of TLSTM-IE over the non-transfer-learning methods are around 40-50%, which reflects the advantage of utilizing transfer learning in missing data imputation.
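The evaluation protocol of Sections 4.5.1-4.5.2 (mask a block, impute it, score it) and Eq. (15) can be summarized in a short sketch; impute_fn stands for any of the compared methods.

    import numpy as np

    def rir(rmse_benchmark, rmse_result):
        # Eq. (15): rate of improvement on RMSE, in percent.
        return (rmse_benchmark - rmse_result) / rmse_benchmark * 100.0

    def evaluate(impute_fn, series, rp, mr):
        # Mask a consecutive block at relative position rp with missing
        # rate mr, impute it, and return the RMSE on the masked values.
        n = len(series)
        l = int(mr * n)
        k = int(rp * n)
        corrupted = series.copy()
        corrupted[k:k + l] = np.nan
        filled = impute_fn(corrupted)
        return float(np.sqrt(np.mean((filled[k:k + l] - series[k:k + l]) ** 2)))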


Table 8
RIR (%) of TLSTM-IE over mean imputation, based on the most similar sequence of each sequence, with different missing rates.

Target station | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | Average

S1 32.0518 36.7616 34.5940 42.5701 39.6624 0.7417 −3.8171 29.1642 35.5823 27.4790
S2 35.8414 41.2631 35.7056 39.7388 37.2139 33.5536 34.1017 36.4584 43.5826 37.4955
S3 30.1273 27.8828 25.9887 24.9858 25.3204 23.5314 33.6258 36.9157 36.2344 29.4014
S4 19.9778 22.7974 33.2735 36.7720 28.2827 29.2348 22.4602 37.6227 −4.5289 25.0991
S5 13.4754 0.8214 6.2473 1.2023 1.3608 6.3638 10.0751 16.0693 34.9274 10.0603
S6 38.2550 35.7626 31.2061 34.9686 34.6381 33.4632 34.4818 31.6783 36.7550 34.5787
S7 39.1776 45.8839 41.3736 46.2490 45.8994 42.5574 42.9679 46.3264 52.3288 44.7516
S8 52.6589 45.4181 47.4678 50.2435 47.6871 47.8420 47.3814 47.2792 48.7725 48.3056
S9 20.1565 29.1362 14.7423 22.2628 20.4173 25.2311 23.9796 39.7802 45.7065 26.8236
S10 49.0273 33.0710 39.2777 29.0601 30.0223 25.4132 29.5819 26.4072 33.2543 32.7906
S11 43.6372 40.6715 34.1193 32.7588 33.2147 36.2621 34.5738 30.5503 36.4727 35.8067

4.5.3. Experiments on different sequences

The experiments presented previously are performed with $S^8$ and $S^7$ as examples. To further explore the performance of TLSTM-IE on other sequences, this paper experiments on all the remaining ten stations with different missing rates and calculates the interpolation accuracy. Table 8 shows the RIR values of TLSTM-IE over traditional mean imputation. Although for some stations, such as $S^1$ with $mr = 0.7$ and $S^4$ with $mr = 0.9$, the RIR becomes negative due to data variance, the proposed method generally exhibits 25-50% improvement over traditional mean imputation.

4.6. Discussion

4.6.1. Three segments or four-plus segments

In the previous sections of this study, the experiments are all defined using three-segment missing data. Three-segment missing data refers to the problem modeling presented in the methodology section: the data of the air quality problem in this study are formalized as $S = [S_1; S_2; S_3]$, where $S_1$, $S_2$, and $S_3$ are three portions of the time series, ordered on a time basis. When we have $S_1$ and $S_2$ but miss $S_3$, it is the typical forecasting problem, which uses $S_1$ and $S_2$ to forecast $S_3$. When we have $S_1$ and $S_3$ but miss $S_2$, it turns into a typical missing data problem, which uses $S_1$ and $S_3$ to interpolate $S_2$. When we have $S_2$ and $S_3$ but miss $S_1$, it becomes the problem of studying non-recorded historical data. It is all about the missing position: the forecasting problem, the missing data problem, and the analysis of history can all be formalized as the missing data problem. It is therefore more than just a preprocessing step; it is a more generalized problem.

The three-segment division is the most basic format for long-interval consecutive missing data and the most challenging type for missing data imputation. But how will four or more segments affect the results? To address this question, we added another experiment to compare the differences. In this experiment, the total missing rate ($mr$) was set as 30%, 60%, and 90% to represent different situations. Under each pre-set missing rate, we test the performance when the data are separated into three, five, seven, and nine segments. With three segments, $S_2$ is the missing part; with five segments, $S_2$ and $S_4$ represent the missing data; with seven, they are $S_2$, $S_4$, and $S_6$; and with nine, they are $S_2$, $S_4$, $S_6$, and $S_8$. The lengths of these missing segments are random and non-zero, and their summed length equals the total pre-set missing rate (0.3, 0.6, and 0.9). The center positions of $S_1$ to $S_n$ are set as evenly distributed. The imputation performance is tested 10 times, and the average RMSE performance is shown in Table 9. The LSTM algorithm on a rolling window model is also added for comparison.

Table 9
RMSE performance for different numbers of segments with a pre-set total missing rate.

Segments | Total missing rate (mr) | LSTM | TLSTM-IE
3 | 0.3 | 0.1019 | 0.0644
  | 0.6 | 0.1208 | 0.0689
  | 0.9 | 0.1497 | 0.0756
  | average | 0.1241 | 0.0696
5 | 0.3 | 0.0877 | 0.0612
  | 0.6 | 0.0981 | 0.0654
  | 0.9 | 0.1016 | 0.0675
  | average | 0.0958 | 0.0647
7 | 0.3 | 0.0853 | 0.0583
  | 0.6 | 0.0893 | 0.0651
  | 0.9 | 0.0965 | 0.0657
  | average | 0.0904 | 0.0630
9 | 0.3 | 0.0803 | 0.0553
  | 0.6 | 0.0850 | 0.0614
  | 0.9 | 0.0892 | 0.0634
  | average | 0.0848 | 0.0600

It can be seen from Table 9 that the average RMSE of both methods reaches its lowest value at nine segments, and as the number of segments gets smaller, the average RMSE increases. This is because with three segments there are no data in the middle of the missing part, and therefore no information to adjust or positively update the model during imputation, which results in a larger error. Also, from three segments to nine segments, the RMSE of the typical LSTM rolling window method improves by 31.67%, while that of TLSTM-IE improves by 13.78%. This is because the proposed method utilizes the inputs from the most similar station and relies less on its own time series with missing data, so it exhibits a more stable performance.

4.6.2. Different years and seasons

Besides the number of segments, using a different length of base data can also affect the imputation performance due to inter-year influence. Therefore, we expanded the experiment to include four years of data and tested the imputation performance on three datasets: the first includes only the 2017 one-year data, the second the 2016-2017 two-year dataset used in the previous sections, and the third the 2014-2017 four-year data. After applying the same procedure and algorithms to these datasets, we obtained the results shown in Table 10, which gives the performance at different missing rates under the three datasets. On average, the RMSE performance improves as the base data include more years. This is because PM2.5 is generally a yearly cyclic time series, so including more years helps the model learn more patterns and therefore reduces the imputation error.

The reviewers also suggested showing the performance of missing value imputation in different seasons, which concerns different missing positions. For example, for the 2016-2017 dataset, to test the performance in spring, we can set April-June of 2016 and April-June of 2017 as missing and calculate the results.


Table 10
RMSE performance of filling different rates of missing values based on different lengths of data.

Missing rate | 2017 | 2016-2017 | 2014-2017
30% | 0.0790 | 0.0644 | 0.0571
60% | 0.0790 | 0.0689 | 0.0610
90% | 0.0800 | 0.0756 | 0.0657
Average | 0.0793 | 0.0697 | 0.0634

Table 11
Missing value imputation in different seasons.

 | January to March | April to June | July to September | October to December
RMSE | 0.0717 | 0.0789 | 0.0734 | 0.0694
R-square | 0.6332 | 0.6619 | 0.6621 | 0.6573

Table 12
The most similar sequence of S^m at missing rates from 0.0 to 0.9. Each row lists the MSS at mr = 0.0, followed by the MSSs it changes to at higher missing rates; an omitted entry means the station stays the same.

Target station | MSS (mr = 0.0, then changes at higher rates)
S1 | S2, S6, S6, S8, S8
S2 | S4, S1
S3 | S4, S1, S1, S4
S4 | S2, S8
S5 | S6, S8
S6 | S9, S8, S8
S7 | S8
S8 | S7, S6, S7
S9 | S6, S8, S8
S10 | S11, S7, S7
S11 | S10

This is equivalent to setting mr = 0.125 and rp = 0.125/0.625. Since the experiments on different missing positions have already been conducted in Section 4.5.1, here we present the results for the different seasons in Table 11. The R-squares of the four seasons do not vary much. The comparisons of the true values and the predicted missing values in the different seasons of 2017 are also shown in Fig. 9.

4.6.3. Similarity measurement

Another problem worthy of discussion is the similarity measurement. As the missing rate of the target station gets larger, the calculated MSS may not be the most similar one due to data variance. The MSS of each sequence with different missing rates is presented in Table 12. It can be seen that when $mr$ is smaller than or equal to 0.5, the MSSs of all the sequences do not change, but when $mr$ gets larger than 0.5, the MSSs of some sequences start to differ. This is because when too much data is missing, the interference from noise and data variance becomes larger. As a result, the identified sequence with the largest correlation may shift, and consequently, the interpolation accuracy might decrease.

Therefore, to further explore whether the proposed TLSTM-IE model can maintain good performance for missing blocks with large missing rates, the interpolation performance of the proposed model based on the closest sequence is evaluated. Based on Fig. 6, for $\{S^1, S^2, S^3, S^4, S^5, S^6, S^7, S^8, S^9, S^{10}, S^{11}\}$, the closest sequences (CS) are $\{S^2, S^3, S^2, S^2, S^6, S^5, S^6, S^9, S^8, S^{11}, S^{10}\}$. For each sequence, the missing rate is set as {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and the relative position of the missing segment is in the middle of the sequence. Directly comparing RMSE values may not be clear enough, so to compare the interpolation performance of the TLSTM-IE model based on the MSS and on the CS, this paper selects mean imputation as the benchmark and calculates MSS_RIR and CS_RIR, where the former represents the RIR of TLSTM-IE over mean imputation based on the MSS and the latter represents the value based on the CS. The difference between the two RIR values is marked as $\Delta\mathrm{RIR}$ and is calculated using Eq. (16):

$$\Delta\mathrm{RIR} = \mathrm{MSS\_RIR} - \mathrm{CS\_RIR} \tag{16}$$

The results are shown in Table 13. The average $\Delta\mathrm{RIR}$ values over all sequences at each missing rate are calculated and presented in the last two rows. It can be seen that the averages excluding the minimum and maximum stay positive for all missing rates, which suggests that the MSS is appropriate for the proposed model.

Fig. 9. True value vs predicted missing value using TLSTM-IE for the four seasons in 2017 based on the 2016–2017 two-year data.
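The seasonal test of Section 4.6.2 can be reproduced by masking the same calendar window in each year before running the imputation. The pandas sketch below, using the hypothetical pm_norm frame from Section 3.1 and station S8 as the target, illustrates the spring case.

    import numpy as np

    # Mask April-June of every year in the hours-by-stations frame,
    # then impute the gap and score only the masked hours.
    spring = (pm_norm.index.month >= 4) & (pm_norm.index.month <= 6)
    corrupted = pm_norm.copy()
    corrupted.loc[spring, 8] = np.nan   # hypothetical target column, S8
    # filled = tlstm_ie(...)            # impute as in Section 2.5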


Table 13
ΔRIR (%) of TLSTM-IE over mean imputation for different sites under different missing rates.

Target station | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9

S1 0 0 0 0 0 −27.8619 −41.7709 −6.7363 −8.8851


S2 20.3578 36.7511 30.2336 25.9868 19.6199 13.0746 11.1195 7.4846 13.4282
S3 2.1852 −0.9816 0.1980 −5.8866 −7.6950 −7.5768 0.2788 1.1993 −5.3537
S4 0 0 0 0 0 0 0 0 −43.5351
S5 0 0 0 0 0 0 0 0 10.0410
S6 13.2684 6.2775 7.8009 7.6977 7.4239 6.0377 5.1854 −0.7964 −0.9275
S7 13.8296 −0.4392 8.6524 5.8178 6.5552 6.5257 13.4071 17.5234 15.7581
S8 10.3709 7.2665 8.8923 2.9583 −1.3272 4.1972 9.4559 9.8571 6.0758
S9 −5.4259 0.8973 −4.5403 1.2970 0.1952 −5.1307 −0.4636 0 0
S10 0 0 0 0 0 0 0 0 0
S11 0 0 0 0 0 0 0 −4.8441 −5.3389
Average 4.9624 4.5247 4.6579 3.4428 2.2520 −0.9758 −0.2534 2.1534 −1.7034
Average excluding min and max 3.9754 1.4202 2.5844 1.8171 1.3347 0.4653 2.6276 1.3701 0.9940

smaller than 0.6, the average values, including the min and max, are positive. When the missing rate is larger than or equal to 0.6, the averages become negative. This suggests that CS is also an appropriate option for transfer learning when the missing rate is large.

4.6.4. Consideration of other variables

This study investigates methodologies for filling missing values in PM2.5 data. The proposed method does not utilize additional information from other variables, such as meteorological factors and location factors. Meteorological factors such as temperature, pressure, and wind have been reported to exhibit important impacts on PM2.5 predictions [48,49], but we did not include those features to support missing value imputation, for two reasons.

The first is data availability. The studied region is a city in the U.S., and the highest resolution at which we can obtain meteorological factors is the city level. This means that more detailed meteorological data around the eleven stations in Fig. 6 cannot be obtained. Therefore, we only utilized the PM2.5 data from surrounding stations to help fill the missing values.

Secondly, imputing missing values without additional information from other variables is more challenging and more practical. As reported in the related literature [48], meteorological factors can help predict PM2.5 values, and such additional series would make the imputation easier. In real cases, however, researchers often cannot obtain data on many kinds of variables; they may have only the air quality time series itself. Meanwhile, limited studies have explored high-performance models when only PM2.5 data are available. This assumption makes the proposed method more robust and practical. It is expected that the performance of the proposed TLSTM-IE method could be further improved if meteorological factors were included. A minimal sketch of the station-only setup is given below.
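For concreteness, the station-only setup described above can be sketched as follows. This is a minimal illustration under our own assumptions rather than the exact implementation of this study: it assumes the PM2.5 records sit in a pandas DataFrame with one hypothetical column per station, and it uses one natural similarity criterion, the Pearson correlation coefficient [28], computed over the timestamps where the target station has observations.

```python
import numpy as np
import pandas as pd

def select_most_similar_station(df: pd.DataFrame, target: str) -> str:
    """Pick the surrounding station whose PM2.5 series correlates best
    (Pearson's r) with the observed part of the target series."""
    observed = df[target].notna()              # timestamps with valid target data
    best_station, best_r = None, -np.inf
    for station in df.columns.drop(target):
        pair = df.loc[observed, [target, station]].dropna()
        if len(pair) < 2:                      # not enough overlap to correlate
            continue
        r = pair[target].corr(pair[station])   # Pearson correlation coefficient
        if r > best_r:
            best_station, best_r = station, r
    return best_station

# Hypothetical usage: eleven NYC stations, one with a long consecutive gap.
# df = pd.read_csv("nyc_pm25.csv", index_col="time", parse_dates=True)
# mss_station = select_most_similar_station(df, target="station_03")
```

Only the PM2.5 series of the stations themselves enter this selection; no meteorological or location features are required.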
5. Conclusion

Along with the rapid development of the Internet of Things (IoT), big data, and Artificial Intelligence (AI), data-driven studies have become one of the biggest trends in atmospheric research. As scholars deal with ever larger datasets, various kinds of data-related problems are becoming more and more typical, such as missing data, noisy data, and imbalanced data, and there is growing concern about how to solve these problems efficiently and scientifically. This study therefore targets one of the most typical issues: missing data. We proposed a transferred LSTM-based iterative estimation (TLSTM-IE) model for long-interval consecutive missing values imputation for air pollutants. The model combines the long short-term memory (LSTM) model, transfer learning, and iterative estimation. LSTM can effectively learn from long-term dependencies. Transfer learning transfers the patterns and knowledge of the most similar sequence (MSS) to the incomplete sequence and therefore improves the imputation accuracy of LSTM. Iterative estimation further enhances the imputation accuracy by iteratively minimizing the difference between the estimated values and the observed values. A simplified code sketch of this three-step workflow is given below.
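The sketch below outlines the workflow in Python with PyTorch. It is a conceptual sketch, not the exact architecture or configuration reported in this paper: the window length, network size, training epochs, initial guess, and stopping rule are placeholder assumptions, and the stand-in arrays `mss` and `target` are hypothetical data.

```python
import numpy as np
import torch
import torch.nn as nn

class LSTMImputer(nn.Module):
    """One-step-ahead predictor: a window of past values -> next value."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])         # next-value prediction

def make_windows(series, window):
    """Slice a 1-D array into (window -> next value) training pairs."""
    xs = np.stack([series[i:i + window] for i in range(len(series) - window)])
    x = torch.tensor(xs, dtype=torch.float32).unsqueeze(-1)
    y = torch.tensor(series[window:], dtype=torch.float32).unsqueeze(-1)
    return x, y

def fit(model, series, window, epochs, lr=1e-3):
    x, y = make_windows(series, window)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

window = 24                                     # placeholder window length
model = LSTMImputer()

# Hypothetical stand-ins for real PM2.5 series; `gap` marks the
# long-interval consecutive missing segment in the target station.
mss = np.sin(np.linspace(0, 20, 500))           # complete most similar sequence
target = np.sin(np.linspace(0.1, 20.1, 500))    # incomplete target sequence
gap = slice(300, 400)

# Step 1 (transfer learning): pre-train on the complete MSS.
fit(model, mss, window, epochs=50)

# Step 2: fine-tune the transferred weights on the observed part
# of the incomplete sequence (here, the segment before the gap).
fit(model, target[:gap.start], window, epochs=20)

# Step 3 (iterative estimation): fill the gap with model predictions,
# refit on the updated series, and repeat until the estimates stabilize.
filled = target.copy()
filled[gap] = filled[gap.start - 1]             # crude initial guess
for _ in range(5):
    prev = filled[gap].copy()
    with torch.no_grad():
        for t in range(gap.start, gap.stop):    # roll the window forward
            x = torch.tensor(filled[t - window:t], dtype=torch.float32)
            filled[t] = model(x.view(1, window, 1)).item()
    if np.abs(filled[gap] - prev).max() < 1e-3: # estimates stopped changing
        break
    fit(model, filled, window, epochs=5)        # refine on updated estimates
```

The design point is the order of the three steps: knowledge is first learned where data are complete, then adapted to the incomplete sequence, and only then are the gap estimates refined iteratively.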
In this paper, the proposed model is applied to fill missing values in air pollutant concentration data. PM2.5 concentration data from 11 monitoring stations in New York City (NYC) are collected for the case study. To validate the effectiveness of TLSTM-IE, a series of experiments is conducted, and RMSE is used to evaluate model performance (computed as in the short sketch after the following list). The key results of this study are as follows:

• Compared with other machine learning models, LSTM achieves lower RMSE values and better imputation performance as the base model, owing to its advantages in learning from long-term dependencies.
• Iterative estimation can further improve the imputation accuracy of the transfer learning-based LSTM model.
• The proposed TLSTM-IE generalizes well to long-interval consecutive missing segments at different relative positions, with different missing rates, and in different sequences.
• Compared with other imputation methods, the proposed TLSTM-IE method achieves 25–50% higher imputation accuracy.
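For reference, the metric behind these comparisons is a minimal computation, assuming `truth` and `filled` are equal-length numpy arrays and `gap` indexes the artificially masked segment:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between ground-truth and imputed values."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# e.g. rmse(truth[gap], filled[gap]) scores an imputation against held-out data
```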
The novelty of this paper lies in two aspects. The first is the problem definition. Although the existing literature has studied various methods for filling missing data, none has explored the problem of “long-interval consecutive missing values”; almost all previous work examined small proportions of missing values at randomly distributed positions. Long-interval consecutive missing values leave traditional rolling window-based methods with minimal information to draw from nearby observed values, which makes the problem more challenging.

The second is the novel application of transfer learning. To the best of our knowledge, this concept has not been proposed before in atmospheric research, especially the combination of transfer learning and iterative estimation for missing data imputation. This technique learns knowledge from other monitoring stations and utilizes that knowledge to fill the missing values. The experimental results show that the improvement of the proposed method in RMSE over traditional methods is significant.

The proposed method also has limitations, which likewise stem from the use of transfer learning. Because of the idea of “learning from similar time series”, the current version of the method can only be applied to missing data for which similar stations or other dependent features are available to learn from. If such learning sources are not available, it would be infeasible to apply transfer learning.

Future research could try to address this problem, for example, by transferring knowledge from meteorological features and examining whether that helps impute missing values when no data from similar stations are available.

Declaration of Competing Interest

The authors declare that there is no conflict of interest.
References

[1] Y. Xie, H. Dai, H. Dong, T. Hanaoka, T. Masui, Economic impacts from PM2.5 pollution-related health effects in China: a provincial-level analysis, Environ. Sci. Technol. 50 (2016) 4836–4843, https://doi.org/10.1021/acs.est.5b05576.
[2] N. Fann, A.D. Lamson, S.C. Anenberg, K. Wesson, D. Risley, B.J. Hubbell, Estimating the national public health burden associated with exposure to ambient PM2.5 and ozone, Risk Anal. 32 (2012) 81–95, https://doi.org/10.1111/j.1539-6924.2011.01630.x.
[3] P.L. Kinney, Climate change, air quality, and human health, Am. J. Prev. Med. 35 (2008) 459–467, https://doi.org/10.1016/j.amepre.2008.08.025.
[4] P.L. Haagenson, Meteorological and climatological factors affecting Denver air quality, Atmos. Environ. (1967) 13 (1979) 79–85, https://doi.org/10.1016/0004-6981(79)90247-6.
[5] X. Li, L. Peng, X. Yao, S. Cui, Y. Hu, C. You, T. Chi, Long short-term memory neural network for air pollutant concentration predictions: method development and evaluation, Environ. Pollut. 231 (2017) 997–1004, https://doi.org/10.1016/j.envpol.2017.08.114.
[6] S.E. Atkinson, D.H. Lewis, A cost-effectiveness analysis of alternative air quality control strategies, J. Environ. Econ. Manage. 1 (1974) 237–250, https://doi.org/10.1016/0095-0696(74)90005-9.
[7] Z. Lu, Q. Wang, R. Yin, B. Chen, Z. Li, A novel TiO2/foam cement composite with enhanced photodegradation of methyl blue, Constr. Build. Mater. 129 (2016) 159–162, https://doi.org/10.1016/j.conbuildmat.2016.10.105.
[8] J. Ma, J.C.P. Cheng, Estimation of the building energy use intensity in the urban scale by integrating GIS and big data technology, Appl. Energy 183 (2016) 182–192, https://doi.org/10.1016/j.apenergy.2016.08.079.
[9] M.N. Norazian, Y.A. Shukri, R.N. Azam, A.M.M. Al Bakri, Estimation of missing values in air pollution data using single imputation techniques, ScienceAsia 34 (2008) 341–345, https://doi.org/10.2306/scienceasia1513-1874.2008.34.341.
[10] A.R.T. Donders, G.J.M.G. van der Heijden, T. Stijnen, K.G.M. Moons, Review: a gentle introduction to imputation of missing values, J. Clin. Epidemiol. 59 (2006) 1087–1091, https://doi.org/10.1016/j.jclinepi.2006.01.014.
[11] W.L. Junger, A. Ponce de Leon, Imputation of missing data in time series for air pollutants, Atmos. Environ. 102 (2015) 96–104, https://doi.org/10.1016/j.atmosenv.2014.11.049.
[12] Z. Lu, J. Zhang, G. Sun, B. Xu, Z. Li, C. Gong, Effects of the form-stable expanded perlite/paraffin composite on cement manufactured by extrusion technique, Energy 82 (2015) 43–53, https://doi.org/10.1016/j.energy.2014.12.043.
[13] J. Ma, J.C.P. Cheng, F. Jiang, W. Chen, M. Wang, C. Zhai, A bi-directional missing data imputation scheme based on LSTM and transfer learning for building energy data, Energy Build. (2020) 109941, https://doi.org/10.1016/j.enbuild.2020.109941.
[14] Y. Tian, K. Zhang, J. Li, X. Lin, B. Yang, LSTM-based traffic flow prediction with missing data, Neurocomputing 318 (2018) 297–305, https://doi.org/10.1016/j.neucom.2018.08.067.
[15] J.C.P. Cheng, L.J. Ma, A data-driven study of important climate factors on the achievement of LEED-EB credits, Build. Environ. 90 (2015) 232–244, https://doi.org/10.1016/j.buildenv.2014.11.029.
[16] T.A. Myers, Goodbye, listwise deletion: presenting hot deck imputation as an easy and effective tool for handling missing data, Commun. Methods Meas. 5 (2011) 297–310, https://doi.org/10.1080/19312458.2011.624490.
[17] T.K. Moon, The expectation-maximization algorithm, IEEE Signal Process. Mag. 13 (1996) 47–60, https://doi.org/10.1109/79.543975.
[18] N. Ahmat Zainuri, A. Jemain, N. Muda, A comparison of various imputation methods for missing values in air quality data, Sains Malays. 44 (2015) 449–456.
[19] Z. Li, S. Wu, C. Li, Y. Zhang, Research on methods of filling missing data for multivariate time series, in: 2017 IEEE 2nd Int. Conf. Big Data Anal. (ICBDA), 2017, pp. 382–385, https://doi.org/10.1109/ICBDA.2017.8078845.
[20] H. Li, P. Wang, L. Fang, J. Liu, An algorithm based on time series similarity measurement for missing data filling, in: 2012 24th Chin. Control Decis. Conf. (CCDC), 2012, pp. 3933–3935, https://doi.org/10.1109/CCDC.2012.6244628.
[21] Z. Che, S. Purushotham, K. Cho, D. Sontag, Y. Liu, Recurrent neural networks for multivariate time series with missing values, Sci. Rep. 8 (2018) 6085, https://doi.org/10.1038/s41598-018-24271-9.
[22] W. Wei, Y. Tang, A generic neural network approach for filling missing data in data mining, in: IEEE, 2003, pp. 862–867.
[23] Ü.A. Şahin, C. Bayat, O.N. Uçan, Application of cellular neural network (CNN) to the prediction of missing air pollutant data, Atmos. Res. 101 (2011) 314–326, https://doi.org/10.1016/j.atmosres.2011.03.005.
[24] J. Ma, J.C.P. Cheng, F. Jiang, W. Chen, J. Zhang, Analyzing driving factors of land values in urban scale based on big data and non-linear machine learning techniques, Land Use Policy 94 (2020) 104537, https://doi.org/10.1016/j.landusepol.2020.104537.
[25] N.T.N. Anh, S.-H. Kim, H.-J. Yang, S.-H. Kim, Hidden dynamic learning for long-interval consecutive missing values reconstruction in EEG time series, in: IEEE, 2011, pp. 653–658.
[26] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (2010) 1345–1359, https://doi.org/10.1109/TKDE.2009.191.
[27] J. Ma, J.C.P. Cheng, Data-driven study on the achievement of LEED credits using percentage of average score and association rule analysis, Build. Environ. 98 (2016) 121–132, https://doi.org/10.1016/j.buildenv.2016.01.005.
[28] J. Benesty, J. Chen, Y. Huang, I. Cohen, Pearson correlation coefficient, in: I. Cohen, Y. Huang, J. Chen, J. Benesty (Eds.), Noise Reduct. Speech Process., Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 1–4, https://doi.org/10.1007/978-3-642-00296-0_5.
[29] Y. Zhou, F.-J. Chang, L.-C. Chang, I.-F. Kao, Y.-S. Wang, Explore a deep learning multi-output neural network for regional multi-step-ahead air quality forecasts, J. Clean. Prod. 209 (2019) 134–145, https://doi.org/10.1016/j.jclepro.2018.10.243.
[30] M.A. Jun, J.C.P. Cheng, Selection of target LEED credits based on project information and climatic factors using data mining techniques, Adv. Eng. Inform. 32 (2017) 224–236, https://doi.org/10.1016/j.aei.2017.03.004.
[31] J. Ma, J.C.P. Cheng, F. Jiang, V.J.L. Gan, M. Wang, C. Zhai, Real-time detection of wildfire risk caused by powerline vegetation faults using advanced machine learning techniques, Adv. Eng. Inform. 44 (2020) 101070, https://doi.org/10.1016/j.aei.2020.101070.
[32] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
[33] J.C.P. Cheng, L.J. Ma, A non-linear case-based reasoning approach for retrieval of similar cases and selection of target credits in LEED projects, Build. Environ. 93 (2015) 349–361, https://doi.org/10.1016/j.buildenv.2015.07.019.
[34] J. Ma, Y. Ding, V.J.L. Gan, C. Lin, Z. Wan, Spatiotemporal prediction of PM2.5 concentrations at different time granularities using IDW-BLSTM, IEEE Access 7 (2019) 107897–107907, https://doi.org/10.1109/ACCESS.2019.2932445.
[35] J. Ma, Y. Ding, J.C.P. Cheng, F. Jiang, Z. Wan, A temporal-spatial interpolation and extrapolation method based on geographic long short-term memory neural network for PM2.5, J. Clean. Prod. 237 (2019) 117729, https://doi.org/10.1016/j.jclepro.2019.117729.
[36] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5 (1994) 157–166, https://doi.org/10.1109/72.279181.
[37] A. Graves, S. Fernández, J. Schmidhuber, Bidirectional LSTM networks for improved phoneme classification and recognition, in: W. Duch, J. Kacprzyk, E. Oja, S. Zadrożny (Eds.), Artif. Neural Netw. Form. Models Their Appl. – ICANN 2005, Springer Berlin Heidelberg, 2005, pp. 799–804.
[38] Y.-L. Hu, L. Chen, A nonlinear hybrid wind speed forecasting model using LSTM network, hysteretic ELM and Differential Evolution algorithm, Energy Convers. Manage. 173 (2018) 123–142, https://doi.org/10.1016/j.enconman.2018.07.070.
[39] R. Fu, Z. Zhang, L. Li, Using LSTM and GRU neural network methods for traffic flow prediction, in: 2016 31st Youth Acad. Annu. Conf. Chin. Assoc. Autom. (YAC), 2016, pp. 324–328, https://doi.org/10.1109/YAC.2016.7804912.
[40] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (2009) 1345–1359.
[41] F. Ding, Two-stage least squares based iterative estimation algorithm for CARARMA system modeling, Appl. Math. Model. 37 (2013) 4798–4808, https://doi.org/10.1016/j.apm.2012.10.014.
[42] J. Ma, J.C.P. Cheng, Identification of the numerical patterns behind the leading counties in the U.S. local green building markets using data mining, J. Clean. Prod. 151 (2017) 406–418, https://doi.org/10.1016/j.jclepro.2017.03.083.
[43] J. Wang, G. Song, A deep spatial-temporal ensemble model for air quality prediction, Neurocomputing 314 (2018) 198–206, https://doi.org/10.1016/j.neucom.2018.06.049.
[44] Y. Huang, L. Shen, H. Liu, Grey relational analysis, principal component analysis and forecasting of carbon emissions based on long short-term memory in China, J. Clean. Prod. 209 (2019) 415–423, https://doi.org/10.1016/j.jclepro.2018.10.128.
[45] J. Ma, J.C.P. Cheng, Identifying the influential features on the regional energy use intensity of residential buildings based on Random Forests, Appl. Energy 183 (2016) 193–201, https://doi.org/10.1016/j.apenergy.2016.08.096.
[46] J. Ma, Y. Ding, J.C.P. Cheng, Y. Tan, V.J.L. Gan, J. Zhang, Analyzing the leading causes of traffic fatalities using XGBoost and grid-based analysis: a city management perspective, IEEE Access 7 (2019) 148059–148072, https://doi.org/10.1109/ACCESS.2019.2946401.
[47] J. Ma, Y. Ding, J.C.P. Cheng, F. Jiang, Z. Xu, Soft detection of 5-day BOD with sparse matrix in city harbor water using deep learning techniques, Water Res. 170 (2020) 115350, https://doi.org/10.1016/j.watres.2019.115350.
[48] J. Ma, Y. Ding, J.C.P. Cheng, F. Jiang, Y. Tan, V.J.L. Gan, Z. Wan, Identification of high impact factors of air quality on a national scale using big data and machine learning techniques, J. Clean. Prod. 244 (2020) 118955, https://doi.org/10.1016/j.jclepro.2019.118955.
[49] J. Song, C. Zhao, T. Lin, X. Li, A.V. Prishchepov, Spatio-temporal patterns of traffic-related air pollutant emissions in different urban functional zones estimated by real-time video and deep learning technique, J. Clean. Prod. 238 (2019) 117881, https://doi.org/10.1016/j.jclepro.2019.117881.
