
Advanced Techniques for Missing Data Imputation in Time Series Analysis
Phan Thi Thu Hong, Hoang Thao Nhien, Tran Van Hoai Thuong, Nguyen Phu Quang, Vo Manh Hai

Abstract: This paper discusses various methods for imputing missing data in time series datasets, with a
focus on a humidity dataset from the field of environmental management and forecasting. We compare the
performance of the Auto-Regressive Integrated Moving Average (ARIMA) model and machine learning
methods (k-Nearest Neighbors (KNN), Support Vector Machine (SVM), AdaBoost, XGBoost, Decision Tree,
Random Forest) on imputation tasks. We also explore the use of a hybrid approach that combines machine
learning and statistical methods for improved time series forecasting. Our study shows that machine learning
methods, particularly Support Vector Regression with a radial basis function kernel and KNN, provide
favorable predictability for the humidity dataset. We hope our findings will be useful for researchers and
practitioners in the field of environmental management and forecasting.

Keywords: Missing data imputation, Time series analysis, Machine learning, Humidity, ARIMA

I. Introduction
Humidity is a critical factor in various fields, including agriculture, weather forecasting, natural
resource management, and many other applications. Variations in humidity can impact human decision-making
and daily life planning in multiple aspects. Therefore, monitoring and predicting humidity is a crucial task for
environmental management and forecasting. In this context, time series data of humidity becomes an essential
data source for research and applications. However, data collection can encounter issues related to missing
values. These missing values may arise from various reasons, including sensor malfunctions, errors in data
collection, or simply the absence of information at specific time points. Addressing this issue becomes
extremely important to ensure data integrity and the usability of time series humidity data. In addition, humidity
time series typically contain continuous temporal information that exhibits seasonality, daily cycles, and other
patterns. To capture and forecast future humidity trends, researchers and practitioners
often use statistical methods and time series forecasting techniques. In this paper, we focus on addressing the
problem of missing values in time series humidity data. To do this, we combine machine learning methods and
ARIMA modeling. ARIMA is a popular model for time series forecasting, while machine learning provides
the capability to learn and predict from data. We propose an integrated approach that combines ARIMA and
machine learning algorithms to fill in the missing values in time series humidity data. This method aims to
enhance prediction accuracy and effectively reconstruct time series humidity data. By leveraging the strengths
of ARIMA and machine learning, we hope that this research contributes to improving the ability to predict and
utilize time series humidity data, while mitigating the impact of missing values in environmental research and
management. Below, we will present specific methods related to the combination of ARIMA and machine
learning to address the issue of missing values in time series humidity data.

II. Related work


1. ARIMA
ARIMA (Autoregressive Integrated Moving Average) is one of the popular linear statistical models
used for time series prediction. The authors of (1) applied the ARIMA model to forecast groundwater levels
based on information about rainfall, surface flow, and evaporation; the results showed that ARIMA performed
well for short-term predictions. The authors of (2) found that using five time series models, namely Moving
Average (MA), Autoregressive Moving Average (ARMA), ARIMA, Seasonal ARIMA (SARIMA), and their
combination, improved the accuracy of groundwater level predictions. This is particularly important when
considering the nonlinear and non-stationary nature of time series data.
2. Machine learning
The authors of (3) compared the performance of three statistical machine learning methods, LASSO,
Random Forest (RF), and Support Vector Regression (SVR), in predicting water levels in the Mekong River;
SVR yielded the best results over the first 5 days and was considered an effective method for flood prediction.
The authors of (4) used the Random Forest algorithm to predict water levels in the Cagayan River basin in the
Philippines; the results showed the favorable predictability of this method and suggested that it could be
deployed for other large river basins. The authors of (5) employed various statistical machine learning methods,
including SVR, to predict water levels in the Chao Phraya River in Thailand, where SVR with a radial basis
function kernel provided the best results.
3. Hybrid Approach
The authors of (6) combined statistical machine learning models with ARIMA and tested the approach on
real data from the Red River to predict hourly water levels, with promising results. Combining these two
techniques to improve time series forecasting is not only rational but also well documented in the literature,
and numerous studies have reported positive results with this integrated approach ((7); (8–13)).
III. Methodology
In this segment, we initially provide an overview of the datasets and the preprocessing steps
undertaken as part of our experiments. Following that, we outline the criteria used for performance evaluation,
and ultimately, we delve into the experimental outcomes, highlighting significant findings and noteworthy
observations.
1. Data representation and preprocessing
Our data consist of 680 monthly humidity values collected at the Phu Lien astronomical station (Viet Nam)
from 1959 to 2015. In the original table, the horizontal axis (Ox) indexes the months January to December and
the vertical axis (Oy) indexes the years.
Step 1: First, convert the table into two-dimensional data with Ox as time (in months) and Oy as the
humidity value over time, and remove null values.

Figure 3.1: Convert data
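A minimal sketch of Step 1 in Python, assuming the raw data are stored in a CSV file (here called phu_lien_humidity.csv, a hypothetical name) with one row per year and one column per month, as described above:

```python
import pandas as pd

def load_humidity_series(path: str) -> pd.Series:
    """Flatten the year-by-month table into a single monthly time series."""
    raw = pd.read_csv(path, index_col=0)   # rows: years (1959-2015), columns: Jan..Dec
    series = raw.stack()                   # flatten row-wise; null cells are dropped
    return pd.Series(series.values)        # re-index as consecutive monthly time steps

humidity = load_humidity_series("phu_lien_humidity.csv")
print(humidity.head())
```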


Step 2: Use the sliding window technique to create the X and y sets. That is, the n previous values up to
and including time t are used to predict the value at position t+1.
Figure 3.2: Sliding window
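A minimal sketch of the sliding-window construction in Step 2; the window length n is an illustrative parameter, not necessarily the value used in the experiments:

```python
import numpy as np

def sliding_window(series: np.ndarray, n: int):
    """Build (X, y) pairs: the n previous values predict the value at the next position."""
    X, y = [], []
    for t in range(n, len(series)):
        X.append(series[t - n:t])   # window of the n past values
        y.append(series[t])         # next value to predict
    return np.array(X), np.array(y)

# Toy usage: with n = 3, the first sample uses [0, 1, 2] to predict 3
X, y = sliding_window(np.arange(10, dtype=float), n=3)
```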

Step 3: Create missing gaps of 1%, 6%, 10%, and 12.5% in the middle of the series so that predictions can
be evaluated easily and objectively in both directions (forward and backward).
2. The proposed methods:
2.1. ARIMA
The ARIMA (Auto Regressive Integrated Moving Average)(14) model is one of the most popular
time series forecasting models in statistics and data science. ARIMA combines auto-regressive (AR),
integrated (I), and moving average (MA) components to describe and predict time series data. The components
of the ARIMA model are:

Autoregressive (AR): This component examines the relationship between the current value of the time series
and past values in the series. It measures the long-term dependence of the data.

Integrated (I): This part deals with performing transformations to make the time series stationary. Stationarity
means that the mean and variance do not change over time. Typically, differencing of the current value and the
previous value is performed to achieve this transformation.

Moving Average (MA): This component models the relationship between the current value and past forecast
errors over a moving window of lags. It measures short-term dependence in the data.

The ARIMA model is represented as ARIMA(p, d, q), where:


● p is the number of autoregressive (AR) terms.
● d is the degree of differencing required to make the series stationary.
● q is the number of moving average (MA) terms.
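As a concrete illustration, the following sketch fits an ARIMA(p, d, q) model with statsmodels and forecasts over a missing gap; the order (2, 1, 2) and the synthetic series are purely illustrative, since in practice the order is selected from the data (e.g. via ACF/PACF inspection or AIC):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_impute(observed: np.ndarray, gap_length: int, order=(2, 1, 2)) -> np.ndarray:
    """Fit ARIMA on the observed part of the series and forecast the missing gap."""
    fitted = ARIMA(observed, order=order).fit()
    return fitted.forecast(steps=gap_length)   # multi-step ahead prediction over the gap

# Toy usage on a synthetic monthly-looking signal
rng = np.random.default_rng(0)
series = 80 + 5 * np.sin(np.arange(120) * 2 * np.pi / 12) + rng.normal(0, 1, 120)
imputed_gap = arima_impute(series[:100], gap_length=6)
```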

2.2. Machine learning methods:


2.2.1. KNN:
The k-Nearest Neighbors (KNN) regressor, dating back to 1967, works by finding the data points closest
to the point being predicted and using the average or weighted average of those points as the prediction.
The KNN regressor algorithm works as follows:
Determine the number of neighbors (K): First, determine the number of nearest neighbors to use when
predicting the value of a new data point. The K value is usually tuned based on the data and validation
analysis.

Find nearest neighbors: Given a new data point to predict, the algorithm finds the K data points in the
training set that are closest in Euclidean distance (or other distance) from the new point.

Calculating the predicted value: For the regression problem, the predicted value for the new point
is calculated as the average (or weighted average) of the values of the K nearest neighbors. Weights can be
used to reflect the importance of each neighbor in the prediction.
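A minimal sketch of KNN regression on sliding-window features with scikit-learn; K = 5, distance weighting, and the random placeholder data are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 12))   # placeholder windows of 12 past values
y = rng.random(200)         # placeholder next-step targets

knn = KNeighborsRegressor(n_neighbors=5, weights="distance", metric="euclidean")
knn.fit(X[:180], y[:180])        # train on the earlier windows
y_pred = knn.predict(X[180:])    # predict the later, held-out windows
```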

2.2.2. SVR:
Support Vector Regression (SVR) is an extension of the Support Vector Machine (SVM) to regression
problems (15). SVR differs from conventional regression methods by creating a "safe zone" (epsilon-tube)
around the regression line instead of trying to fit every data point exactly. This safe zone is defined by an
upper and a lower boundary, and the goal of SVR is to find a function (curve) such that the error (the
difference between predicted and actual values) stays within the boundaries of the tube, while errors that fall
outside it are penalized and minimized. Important elements of SVR include:

Kernel Trick: Similar to SVM for classification, SVR also uses kernel tricks to handle non-linear data
by mapping the original data space to a higher dimensional space to create a non-linear regression line.

Parameters C and epsilon: C and epsilon are important parameters in SVR. The parameter C controls the
penalty applied to errors that fall outside the safe zone, i.e., the tolerance level for errors, while the epsilon
parameter determines the width (thickness) of the safe zone.

Figure 2.2.2. The parameters for support vector regressor (16)
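A minimal sketch of SVR with an RBF kernel in scikit-learn; the values of C, epsilon, and gamma are illustrative and would normally be tuned by cross-validation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 12))   # placeholder sliding-window features
y = rng.random(200)         # placeholder targets

# Scaling the inputs matters for SVR; C penalizes errors outside the tube,
# epsilon sets the tube width.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale"))
svr.fit(X[:180], y[:180])
y_pred = svr.predict(X[180:])
```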


2.2.3. Random forest:
Random Forest (17) is a machine learning algorithm used in classification and prediction tasks. It
belongs to the category of ensemble machine learning algorithms, meaning it combines multiple decision trees
to create a more powerful model. Random Forest has been widely applied in many fields, including image
classification, time series prediction, and many other complex problems in machine learning and data science.

Figure 2.2.3. Random Forest Regressor (18)


2.2.4. AdaBoost
AdaBoost (Adaptive Boosting) (19) is an ensemble machine learning algorithm used in classification
and prediction. AdaBoost belongs to the category of boosting algorithms, which focus on building a strong
model from many weak models.
The process starts by training an initial weak model, usually a simple decision tree. A new training set is
then formed by increasing the weights of the samples that are misclassified by the current weak model, the
next weak model is trained on this reweighted data, and this process repeats many times. Finally, the weak
models are combined into a strong model, with a final weight calculated for each weak model.
AdaBoost has many advantages, including the ability to handle noisy data and the flexibility to use
simple weak models. It also allows measuring the importance of features in prediction, providing insight into
how features influence the final decision.
Figure 2.2.4. Schematic diagram of AdaBoost regression.(20)
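A minimal sketch of AdaBoost regression boosting shallow decision trees as weak models, using scikit-learn; the tree depth, number of estimators, and learning rate are illustrative settings, not those of the paper:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 12))   # placeholder sliding-window features
y = rng.random(200)         # placeholder targets

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),  # simple weak learner; 'estimator=' assumes scikit-learn >= 1.2
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X[:180], y[:180])
print(ada.feature_importances_)   # per-feature importances mentioned above
```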

3. Evaluation metrics

3.1. Similarity: defines the percentage of similarity between the imputed values (y) and the respective true
values (x). It is calculated by formula (2.1):

$$\mathrm{Sim}(y, x) = \frac{1}{T}\sum_{i=1}^{T}\frac{1}{1 + \frac{|y_i - x_i|}{\max(x) - \min(x)}} \qquad (2.1)$$

where T is the number of missing values. A higher similarity (similarity value ∈ [0, 1]) indicates a better
method for the task of completing missing values. If the signal is constant (x = constant), we set
max(x) − min(x) = 1.

3.2. NMAE: The Normalized Mean Absolute Error between the imputed values y and the respective true
values of the time series x is computed as (2.2):
$$\mathrm{NMAE}(y, x) = \frac{1}{T}\sum_{i=1}^{T}\frac{|y_i - x_i|}{V_{\max} - V_{\min}} \qquad (2.2)$$

where Vmax and Vmin are the maximum and minimum values of the input time series (the series with missing
data), ignoring the missing values. The NMAE value lies in the range 0 to ∞. In the case of a constant signal,
we set Vmax − Vmin = 1. A lower NMAE value indicates a better-performing method for the imputation task.

3.3. R² score: is calculated as the square of Pearson's correlation coefficient between y and x. The coefficient
is a measure of the strength of the linear relationship between two variables. In the imputation
context, this coefficient measures the degree of association between the imputed values (y) and the
corresponding actual values (x). The R² value ranges between 0 and 1; hence, a value closer to 1
indicates a strong predictive ability (imputed values are very close to the true values). The
correlation coefficient is computed as follows (2.3):
$$R(y, x) = \frac{\sum_{i=1}^{T}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{T}(x_i - \bar{x})^2 \sum_{i=1}^{T}(y_i - \bar{y})^2}} \qquad (2.3)$$

3.4. RMSE: The Root Mean Square Error is a frequently used measure to evaluate the quality of a model
(an estimator or a predictor). RMSE is defined as the square root of the average squared difference
between the imputed values y and the respective true values x. Formally, it is computed as (2.4):

$$\mathrm{RMSE}(y, x) = \sqrt{\frac{1}{T}\sum_{i=1}^{T}(y_i - x_i)^2} \qquad (2.4)$$

This indicator is very useful for measuring overall precision or accuracy. The range of RMSE lies
between 0 and ∞. An RMSE of zero indicates a perfect imputation model, but in practice this cannot be
achieved. In general, the most effective method has the lowest RMSE.

3.5. FSD (Fraction of Standard Deviation) of y and x is defined as follows (2.5):


$$\mathrm{FSD}(y, x) = 2 \cdot \frac{|\mathrm{SD}(y) - \mathrm{SD}(x)|}{\mathrm{SD}(y) + \mathrm{SD}(x)} \qquad (2.5)$$

This fraction indicates whether a method is acceptable or not (here SD stands for Standard Deviation).
For the imputation task, if FSD is closer to 0, the imputation values are closer to the real values.

3.6. NSE: The Nash-Sutcliffe Efficiency (NSE) is a metric specifically employed to assess the predictive
performance of hydrological models. NSE values range from −∞ to 1, where higher values indicate
a better match between the observed and predicted values. In other words, a higher NSE value
signifies a better fit between the predicted and actual data, according to the criterion proposed by
Nash and Sutcliffe in 1970. It is computed as (2.6):
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{T}(x_i - y_i)^2}{\sum_{i=1}^{T}(x_i - \bar{x})^2} \qquad (2.6)$$
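For reference, a minimal NumPy sketch of the metrics in equations (2.1)-(2.6), where y holds the imputed values and x the corresponding true values; the constant-signal conventions follow the definitions above:

```python
import numpy as np

def similarity(y, x):                              # eq. (2.1)
    rng = np.max(x) - np.min(x)
    rng = rng if rng != 0 else 1.0                 # constant signal: set max(x) - min(x) = 1
    return np.mean(1.0 / (1.0 + np.abs(y - x) / rng))

def nmae(y, x, vmax, vmin):                        # eq. (2.2), Vmax/Vmin from the input series
    denom = (vmax - vmin) if vmax != vmin else 1.0
    return np.mean(np.abs(y - x)) / denom

def r2(y, x):                                      # eq. (2.3), squared Pearson correlation
    return np.corrcoef(x, y)[0, 1] ** 2

def rmse(y, x):                                    # eq. (2.4)
    return np.sqrt(np.mean((y - x) ** 2))

def fsd(y, x):                                     # eq. (2.5)
    return 2.0 * abs(np.std(y) - np.std(x)) / (np.std(y) + np.std(x))

def nse(y, x):                                     # eq. (2.6)
    return 1.0 - np.sum((x - y) ** 2) / np.sum((x - np.mean(x)) ** 2)
```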
4. Details of the proposed method
4.1. Pipeline
Figure 4.1.1 The pipeline of our research

We propose a multi-step method to predict two dependent variables Y1 and Y2 based on a complex
data processing procedure. This process includes the following steps:

1. Data Collection and Processing: Start with collecting original data from the respective source. The
data then goes through a rigorous processing step to clean, normalize and transform the data to remove noise
and prepare it for prediction.

2. Predict Y1 through ARIMA and Machine Learning: We use the ARIMA model to predict Y1 and obtain
the residuals. These residuals are then fed into a separate machine learning model to generate the residual
predictions used for Y2.

3. Predict Y2 using Machine Learning: We use a machine learning model trained directly on the processed
data and then combine its output with the residual predictions from the previous stage to generate Y2.

We use the ARIMA model to handle data continuity, capturing time-varying variables and trends in the data.
ARIMA helps us predict Y1 through time series analysis and modeling, and allows us to generate residuals
from the predictions. Besides, using Machine Learning helps us capture the discreteness and complexity of
data. Machine Learning focuses on understanding complex relationships between input variables and
predicting Y2 effectively. The combination of both methods allows us to take advantage of data continuity and
discreteness to ensure accurate and reliable predictions for Y1 and Y2.
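The sketch below illustrates one possible reading of this pipeline, under the following assumptions: ARIMA forecasts the gap to give Y1 and provides in-sample residuals; one machine learning model is trained on sliding windows of the real values, another on sliding windows of the ARIMA residuals; their rolled-forward predictions are summed to give Y2. The choice of SVR for the real values, AdaBoost for the residuals, the window length, and the ARIMA order are all illustrative, not the exact configuration of the experiments:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor

def sliding_window(series, n):
    X = np.array([series[t - n:t] for t in range(n, len(series))])
    return X, np.array(series[n:])

def hybrid_impute(observed, gap_length, n=12, order=(2, 1, 2)):
    # Step 1: ARIMA on the observed series -> gap forecast (Y1) and in-sample residuals
    arima = ARIMA(observed, order=order).fit()
    y1 = arima.forecast(steps=gap_length)
    residuals = arima.resid

    # Step 2: ML model on the real values (discrete component)
    Xr, yr = sliding_window(observed, n)
    real_model = SVR(kernel="rbf").fit(Xr, yr)

    # Step 3: ML model on the ARIMA residuals (continuity correction)
    Xe, ye = sliding_window(residuals, n)
    res_model = AdaBoostRegressor(random_state=0).fit(Xe, ye)

    # Step 4: roll forward over the gap, adding value and residual predictions (Y2)
    window, res_window, y2 = list(observed[-n:]), list(residuals[-n:]), []
    for _ in range(gap_length):
        v = real_model.predict(np.array(window[-n:]).reshape(1, -1))[0]
        e = res_model.predict(np.array(res_window[-n:]).reshape(1, -1))[0]
        y2.append(v + e)
        window.append(v)
        res_window.append(e)
    return y1, np.array(y2)
```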
IV. Experiments and results
In this paper, we develop multi-step ahead prediction models based on ARIMA, the above-mentioned
statistical machine learning methods, namely KNN, SVR, RF, and AdaBoost, and the proposed hybrid models.
This means forecasting a range of T values. In particular, the data for the machine learning models are
prepared with the sliding window method, i.e., using n previous values to predict the (n+1)th value (the
discrete component), combined with the corresponding prediction of the ARIMA model (the continuous
component). In addition, two machine learning models are used in parallel to predict the value and the residual
of the ARIMA model, and their outputs are then added together. We experiment with the proposed methods on
different sizes of missing gaps (1%, 6%, 10%, and 12.5%) in both forward and backward stages to evaluate and
compare the tested methods.
For the residuals obtained from ARIMA, we train SVR, Decision Tree, and AdaBoost models to predict
them, while the real values from the dataset are modeled with SVR, AdaBoost, KNN, and Random Forest,
among others. We then compare and evaluate the outcome of each method.
1. Gap 1%
A missing gap size of 1% corresponds to 6 missing values in the data set.
1.1. Forward

Table 4.1.1 Results of applying the methods on missing gap 1% at the forward stage.

Table 4.1.1 presents the results of the proposed methods on the missing gap of 1% when training and
testing forward. With a missing percentage of 1%, ARIMA achieved the best performance, with a Similarity
score of 0.76, MAE of 17.00, RMSE of 17.61, FSD of 0.76, R of 0.76, and NSE of -0.44. It was followed by
Random Forest + AdaBoost, with a Similarity score of 0.73, MAE of 17.23, RMSE of 17.82, FSD of 0.77, R of
0.73, and NSE of -0.48, and SVR + AdaBoost, with a Similarity score of 0.70 but a lower MAE of 14.26,
RMSE of 17.30, FSD of 1.10, R of 0.70, and NSE of -0.39. Although it did not achieve the best Similarity
score, the SVR+SVR method achieved the lowest MAE and RMSE scores.
Figure 4.1.1 The imputation of ARIMA with actual values

Figure 4.1.2 The imputation of Machine Learning methods with actual values
Figure 4.1.3 The prediction of Machine Learning methods with residual values

1.2. Backward

Table 4.2.1 Results of applying the methods on missing gap 1% at the backward stage.

Table 4.2.1 presents the results of the proposed methods on the missing gap of 1% when training and
testing backward. With a missing percentage of 1%, the machine learning methods perform better, with the
best results belonging to AdaBoost, with a Similarity score of 0.80, MAE of 13.04, RMSE of 14.03, FSD of
0.75, R of 0.80, and NSE of 0.08. It was followed by AdaBoost+SVR and AdaBoost+AdaBoost, with the same
Similarity score but higher MAE scores. In this stage, ARIMA still achieves a good outcome, with a Similarity
score of 0.70, MAE of 13.22, RMSE of 14.49, FSD of 0.89, R of 0.70, and NSE of 0.02. However, the
combined methods involving SVR show weak results, achieving only low Similarity scores while their MAE
and RMSE scores are quite high.
Figure 4.2.1 The imputation of ARIMA with actual values

Figure 4.2.2 The imputation of Machine Learning methods with actual values
Figure 4.2.3 The prediction of Machine Learning methods with residual values

2. Gap 6%
A missing gap size of 6% corresponds to 36 missing values in the data set.
2.1. Forward

Table 4.3.1 Results of applying the methods on missing gap 6% at the forward stage.
The tested algorithms included ARIMA (Sim: 0.68, MAE: 15.77, RMSE: 20.40), Support Vector
Regression (SVR) (Sim: 0.52, MAE: 17.7, RMSE: 21.52), Decision Tree Regression (Sim: 0.56, MAE: 18.35,
RMSE: 22.94), AdaBoost (Sim: 0.67, MAE: 16.39, RMSE: 19.76), XGBoost (Sim: 0.67, MAE: 16.24, RMSE:
19.29), K-Nearest Neighbors (KNN) (Sim: 0.7, MAE: 14.44, RMSE: 17.67), and Random Forest (Sim: 0.66,
MAE: 15.21, RMSE: 19.15).

The results show that KNN and the various Random Forest variants perform best in predicting the target
variable, with relatively high values of R (correlation coefficient) and NSE (Nash-Sutcliffe Efficiency) and
low errors. AdaBoost and XGBoost also demonstrate reasonably good results. In contrast, SVR and Decision
Tree have lower performance in this scenario. We also experimented with combining algorithms by using one
main algorithm and then using another algorithm to predict the residuals; the specific results vary depending
on the combination and model optimization.

Figure 4.3.1 The imputation of ARIMA with actual values


Figure 4.3.2 The imputation of Machine Learning methods with actual values
Figure 4.3.3 The prediction of Machine Learning methods with residual values
2.2. Backward

Table 4.4.1 Results of applying the methods on missing gap 6% at the backward stage.
Firstly, ARIMA exhibits poor performance with a low R-value (0.49) and high MAE and RMSE values (14.93
and 19.53), indicating its limited predictive capability.
A comparison of the original models reveals that AdaBoost performs the best with an R-value of 0.58 and NSE
of 0.5. Other variants such as Decision Tree Regression and XGBoost also demonstrate relatively good results.
We also experimented with combined models, such as "Random Forest(real) + SVR(residual)," "KNN(real) +
Decision Tree(residual)," and various other models. The results suggest that combining models can enhance
performance compared to the original models in some cases, but the effectiveness of such combinations
depends on the specific approach and dataset.
Models that predict residuals, such as "SVR(real) + SVR(residual)," exhibit the poorest performance, with the
lowest R value (0.37) and a negative NSE. This indicates that such a model is not suitable for this prediction task.

Figure 4.4.1 The imputation of ARIMA with actual values

Figure 4.4.2 The imputation of Machine Learning methods with actual values
Figure 4.4.3 The prediction of Machine Learning methods with residual values
3. Gap 10%
3.1. Forward

Table 4.5.1 Results of applying the methods on missing gap 10% at the forward stage.
Among the standalone methods, ARIMA has the highest R and NSE values, with R = 0.71 and NSE = 0.19.
However, its MAE and RMSE values are quite high (16.68 and 21.35), showing that a deviation between the
predictions and reality still exists. Decision Tree Regression has the highest MAE and RMSE in the list (18.75
and 24.67), and its R and NSE values suggest overfitting (R = 0.4, NSE = -0.08); this may indicate that the
Decision Tree is not suitable for this data. A series of experiments combined Random Forest with other
methods to improve prediction performance; however, the outcome cannot be predicted in advance and
depends on the specific combination. For example, combining Random Forest with AdaBoost produces the
highest NSE value (0.29), while combining it with Decision Tree results in the lowest NSE (-0.17). The KNN
method shows relatively stable performance, with NSE values ranging from 0.01 to 0.33 depending on the
combined method, which may indicate that KNN is suitable for prediction on this data. AdaBoost provides
relatively stable NSE values (0.13 to 0.42) when combined with other methods, with the highest NSE
appearing when it is combined with KNN. The SVR method gives relatively stable results, with NSE values
from 0.05 to 0.10 when combined with other methods.

Figure 4.5.1 The imputation of ARIMA with actual values


Figure 4.5.2 The imputation of Machine Learning methods with actual values
Figure 4.5.3 The prediction of Machine Learning methods with residual values
3.2. Backward

Table 4.6.1 Results of applying the methods on missing gap 10% at the backward stage.
ARIMA has high Sim accuracy (0.67) and relatively low error (MAE = 14.74, RMSE = 19.53). This means
that ARIMA can predict the data well in this particular case. The SVR method has a lower Sim (0.52) and high
error (MAE = 17.82, RMSE = 24.43), along with a high FSD (0.95). This suggests that the SVR is not
performing well in this case and may need further adjustment. Decision Tree Regression has low Sim (0.35)
and high error (MAE = 20.7, RMSE = 26.64). It has a large FSD value (1.01), indicating instability in
prediction. AdaBoost has quite high Sim accuracy (0.67) and low error (MAE = 15.67, RMSE = 19.57). It also
has a low FSD value (0.74), indicating stability in prediction. XGBoost has a quite high Sim (0.57) and
relatively low error (MAE = 16.56, RMSE = 22.03). It has a relatively low FSD value (0.82), indicating that
this model is a good choice. KNN shows high Sim accuracy (0.67) and low error (MAE = 14.15, RMSE =
18.49), along with a low FSD value (0.74). This shows that KNN works well in this case. Methods that
combine basic models such as "Random Forest(real) + SVR(residual)" or "KNN(real) + SVR(residual)" also
achieve relatively good results, with good Sim values and low errors. However, there are some methods such as "SVR(real)
+ Decision Tree(residual)" or "AdaBoost(real) + Decision Tree(residual)" that have low FSD and Sim values,
along with high errors, indicating that they have poor predictive power in this case.

Figure 4.6.1 The imputation of ARIMA with actual values


Figure 4.6.2 The imputation of Machine Learning methods with actual values
Figure 4.6.3 The prediction of Machine Learning methods with residual values

4. Gap 12.5%
A missing gap size of 12.5% corresponds to 75 missing values in the data set.
4.1 Forward

Table 4.7.1 Results of applying the methods on missing gap 12.5% at the forward stage.

The data highlights the diverse performance of various algorithms in predicting the target variable.
Significantly, ARIMA stands out as the top performer with the highest Similarity score of 0.76, along with
a comparatively low MAE of 17.1 and RMSE of 20.55. AdaBoost follows closely with a Similarity score
of 0.61, MAE of 17.14, and RMSE of 21.3, showcasing its effectiveness in handling the given dataset.

Other machine learning methods like Random Forest (Sim: 0.63, MAE: 16.46, RMSE: 20.73) and KNN
(Sim: 0.59, MAE: 15.6, RMSE: 20.66) also demonstrate notable performance.

However, Support Vector Regression (SVR) presents a comparatively lower Similarity score of 0.55,
coupled with higher MAE (17.46) and RMSE (22.1) values, suggesting a weaker predictive performance.
Decision Tree, especially in combination with SVR and AdaBoost, shows mixed results with varying
Similarity scores and negative NSE values.

The ensemble methods, such as AdaBoost+SVR and AdaBoost+AdaBoost, exhibit promising results with
Similarity scores matching that of AdaBoost alone, albeit with higher MAE scores.

Figure 4.7.1 The imputation of ARIMA with actual values


Figure 4.7.2 The imputation of Machine Learning methods with actual values

Figure 4.7.3 The prediction of Machine Learning methods with residual values

4.2 Backward
Table 4.8.1 Results of applying the methods on missing gap 12.5% at the backward stage.
The provided data highlights the diverse performance of various algorithms in predicting the target
variable. Significantly, ARIMA stands out as the top performer with the highest Similarity score of
0.76, along with a comparatively low MAE of 17.1 and RMSE of 20.55. AdaBoost follows closely with
a Similarity score of 0.61, MAE of 17.14, and RMSE of 21.3, showcasing its effectiveness in handling
the given dataset.

Other machine learning methods like Random Forest (Sim: 0.63, MAE: 16.46, RMSE: 20.73) and KNN
(Sim: 0.59, MAE: 15.6, RMSE: 20.66) also demonstrate notable performance.

However, Support Vector Regression (SVR) presents a comparatively lower Similarity score of 0.55,
coupled with higher MAE (17.46) and RMSE (22.1) values, suggesting a weaker predictive
performance. Decision Tree, especially in combination with SVR and AdaBoost, shows mixed results
with varying Similarity scores and negative NSE values.

The ensemble methods, such as AdaBoost+SVR and AdaBoost+AdaBoost, exhibit promising results
with Similarity scores matching that of AdaBoost alone, albeit with higher MAE scores.

The analysis emphasizes the importance of algorithm selection and combination strategies, with
ARIMA leading the pack in terms of Similarity score for the given dataset.
Figure 4.8.1 The imputation of ARIMA with actual values

Figure 4.8.2 The imputation of Machine Learning methods with actual values
Figure 4.8.3 The prediction of Machine Learning methods with residual values

V. Conclusion
Our study shows that machine learning methods, including SVM, AdaBoost, and KNN, can
effectively handle missing data in time series datasets. Combining these machine learning methods with
ARIMA can be effective in terms of both imputation accuracy and computational efficiency. Our findings suggest
that machine learning methods can be effective in handling missing data in time series datasets. However,
further research is needed to evaluate the performance of these methods on other datasets and to explore the
use of other machine learning models.
In summary, our study provides valuable insights into the use of machine learning methods for imputing
missing data in time series datasets. Our findings suggest that machine learning methods can be effective in
handling missing data in time series datasets and that further research is needed to explore the use of other
machine learning and deep learning models. We hope that our study will contribute to the development of
more accurate and efficient methods for handling missing data in time series analysis.

VI. References
1. Birylo M, Rzepecka Z, Kuczynska-Siehien J, Nastula J. Analysis of water budget prediction accuracy using
ARIMA models. Water Supply. 2017 Jul 25;18(3):819–30.
2. Mirzavand M, Ghazavi R. A Stochastic Modelling Technique for Groundwater Level Forecasting in an
Arid Environment Using Time Series Methods. Water Resour Manag. 2014 Nov 12;29.
3. Nguyen TT, Nguyen Huu Q. Forecasting Time Series Water Levels on Mekong River Using Machine
Learning Models. In 2015.
4. Development of a predictive model for on-demand remote river level nowcasting: Case study in Cagayan
River Basin, Philippines | Semantic Scholar [Internet]. [cited 2023 Nov 9]. Available from:
https://www.semanticscholar.org/paper/Development-of-a-predictive-model-for-on-demand-in-Garcia-Ret
amar/4bc5f585eeea2e68d2fea4fc327c11fca5ca8c37
5. Pasupa K, Jungjareantrat S. Water levels forecast in Thailand: A case study of Chao Phraya river.
In 2016. p. 1–6.
6. Phan TTH, Nguyen XH. Combining statistical machine learning models with ARIMA for water level
forecasting: The case of the Red river. Adv Water Resour. 2020 Aug 1;142:103656.
7. Chen KY, Wang CH. A hybrid SARIMA and support vector machines in forecasting the production values
of the machinery industry in Taiwan. Expert Syst Appl. 2007 Jan 31;32:254–64.
8. Khashei M, Bijari M. An artificial neural network (p, d, q) model for timeseries forecasting. Expert Syst
Appl. 2010 Jan 31;37:479–89.
9. Ömer Faruk D. A hybrid neural network and ARIMA model for water quality time series prediction. Eng
Appl Artif Intell. 2010 Jun 1;23(4):586–94.
10. Wongsathan R, Jaroenwiriyapap W. A Hybrid ARIMA and RBF Neural Network Model for Tourist
Quantity Forecasting: A Case Study for Chiangmai Province. KKU Research Journal. 2016;21(1).
11. Pannakkong W, Pham VH, Huynh VN. A novel hybridization of ARIMA, ANN, and K-means for time
series forecasting. Int J Knowl Syst Sci. 2017 Oct 1;8:30–53.
12. Zhong C, Guo T, Jiang Z, Liu X, Chu X. A hybrid model for water level forecasting: A case study of
Wuhan station. In 2017. p. 247–51.
13. Short-term prediction of groundwater level using improved random forest regression with a combination of
random features | Applied Water Science [Internet]. [cited 2023 Nov 9]. Available from:
https://link.springer.com/article/10.1007/s13201-018-0742-6
14. Tunnicliffe Wilson G. Time Series Analysis: Forecasting and Control,5th Edition, by George E. P. Box,
Gwilym M. Jenkins, Gregory C. Reinsel and Greta M. Ljung, 2015. Published by John Wiley and Sons
Inc., Hoboken, New Jersey, pp. 712. ISBN: 978-1-118-67502-1. J Time Ser Anal. 2016 Mar 1;37:n/a-n/a.
15. Predicting time series with support vector machines | SpringerLink [Internet]. [cited 2023 Nov 9]. Available
from: https://link.springer.com/chapter/10.1007/BFb0020283
16. Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support Vector Regression Machines. In:
Advances in Neural Information Processing Systems [Internet]. MIT Press; 1996 [cited 2023 Nov 9].
Available from:
https://proceedings.neurips.cc/paper_files/paper/1996/hash/d38901788c533e8286cb6400b40b386d-Abstrac
t.html
17. Breiman L. Random Forests. Mach Learn. 2001 Oct 1;45(1):5–32.
18. ResearchGate [Internet]. [cited 2023 Nov 9]. Fig. A10. Random Forest Regressor. The regressor used here
is formed of... Available from:
https://www.researchgate.net/figure/Fig-A10-Random-Forest-Regressor-The-regressor-used-here-is-formed
-of-100-trees-and-the_fig3_313489088
19. Freund Y, Schapire R. Experiments with a New Boosting Algorithm. In 1996 [cited 2023 Nov 9]. Available
from:
https://www.semanticscholar.org/paper/Experiments-with-a-New-Boosting-Algorithm-Freund-Schapire/68c
1bfe375dde46777fe1ac8f3636fb651e3f0f8
20. ResearchGate [Internet]. [cited 2023 Nov 9]. Fig. 3. Schematic diagram of AdaBoost regression. Available
from: https://www.researchgate.net/figure/Schematic-diagram-of-AdaBoost-regression_fig1_303599540
