
Expert Systems With Applications 122 (2019) 303–318


MOWM: Multiple Overlapping Window Method for RBF based missing value prediction on big data

Brijraj Singh∗, Durga Toshniwal
Department of Computer Science & Engineering, Indian Institute of Technology Roorkee, India

Article history: Received 13 May 2018; Revised 28 November 2018; Accepted 31 December 2018; Available online 2 January 2019

Keywords: Missing value imputation; Interpolation; RBF; Kernel regression; MOWM; Big data; Machine learning; Curve fitting

Abstract: The problem of missing values in the process of data acquisition is becoming critical due to both hardware failure and human error. Radial Basis Function (RBF) based interpolation, which predicts missing values of the dependent variable through surface fitting, has long been a viable solution. The approach builds one equation with 'N' weight variables, one for each sampled data point in a set of 'N' samples; because of the inherent computations it demands a large primary memory when 'N' becomes large, so on a memory-restricted setup it can suffer from memory overflow. In this paper, we propose a novel data decomposition based, RBF enabled surface fitting approach that builds and aggregates multiple small models over overlapping data windows, and therefore works well even with a small primary memory at minimal loss of generality. We also introduce two hyperparameters, the window size and the overlapping index, to tune the bias-variance tradeoff. The proposed approach is applied to ten real-world datasets of multiple dimensions, and the results are competitive with a single trained model and with Kernel Ridge Regression. We therefore believe this approach can rejuvenate RBF based surface fitting and help it attain better performance in the Big data setting as well.

© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

Missing value prediction is one of the most challenging tasks of data pre-processing. Data pre-processing is the phase of a machine learning pipeline responsible for converting raw data into a form on which statistical analysis can be performed. Among several popular methods (Ding & Ross, 2012; Farhangfar, Kurgan, & Dy, 2008; de França, Coelho, & Von Zuben, 2013; Liu, Dai, & Yan, 2010; Liu, Pan, Dezert, & Martin, 2016; Raghunathan, Lepkowski, Van Hoewyk, & Solenberger, 2001; Royston, 2004; Schafer & Graham, 2002; Xia et al., 2017; Zhang, Cheng, Deng, Zong, & Deng, 2017), regression based curve fitting and interpolation of missing values is a well-accepted solution to this problem (Alaoui & Mahoney, 2015; Chang, Lin, Wang et al., 2017; Chen & Xie, 2014; Guha et al., 2012; Hardy, 1971; Hsieh, Si, & Dhillon, 2014; Zhang, Duchi, & Wainwright, 2015). The challenge with this solution, however, is that it works well only as long as all the data under consideration can reside in primary memory; otherwise it causes a memory error (Anagnostopoulos & Triantafillou, 2014; Li, Lin, & Li, 2013; Seeger, 2007; Sun, Li, Wu, Zhang, & Li, 2010).

Intelligent machines are actively involved in all verticals of modern society. Machines perceive the real world by acquiring data from their surroundings. The increasing dependency of humans on machines has raised the need for more and more data acquisition, consequently giving birth to the buzzword 'Big data', which comes with certain inherent properties, i.e. Volume, Variety, Velocity, Veracity, etc. (Toshniwal, Venkoparao et al., 2017). Our concern in this work is with the Volume of the data.

Because of hardware limitations, memory can hold a model only up to a certain size, and a bigger model creates a problem. This inevitable need for a large memory prompted statisticians and computer scientists to develop algorithms and techniques that can build a model even with limited primary memory. The implicit expectation is that such a model performs comparably to one developed on a computer whose primary memory is larger than the data size. The problem attracted the attention of the machine learning community around 2010, and consequently a few techniques emerged to handle it. The primary objective was to divide the whole data so that it can reside in memory, and so all the previously developed methods used a divide and conquer strategy. Most of the existing algorithms are based on the idea of decomposing the total data ('N' samples) randomly among 'm' data chunks, such that each data chunk has 'N/m' samples. In a successive step these methods train models over all 'm' data chunks independently and then aggregate the models obtained from the distributed sites.

∗ Corresponding author.
E-mail address: bsingh1@cs.iitr.ac.in (B. Singh).
https://doi.org/10.1016/j.eswa.2018.12.060

Radial Basis Function (RBF) is among the most versatile surface fitting methods and is capable of dealing with multiple dimensions. However, because of its O(N^3) time complexity and O(N^2) space complexity, it becomes infeasible to use with a large value of 'N'. The motivation of this work is to enable RBF to cover the case of large 'N'.

In this work, we apply a window based method to solve the problem of high memory requirement: following the divide and conquer approach, we deal with multiple small data chunks and build multiple small models. The proposed approach is nonparametric in nature and uses a novel method for data decomposition. Previous methods are based on the idea of decomposing the data randomly, while the proposed method first sorts the given multidimensional data and then generates overlapping data chunks using predefined hyperparameters. All the data chunks build their respective models independently and in parallel on the Spark architecture, and the models are aggregated for prediction. The proposed method uses the Multiquadric RBF for fitting the curve, which can be seen in Table 1. The objective of the experiments is to obtain, with a relatively small primary memory, results comparable to those of the conventional method that builds a single model in a large primary memory. The smoothing parameter of the RBF is fine-tuned for each dataset using 10-fold cross-validation.

Table 1
Commonly used radial basis functions.

Type of basis function            φ(r), r >= 0
Infinitely smooth RBFs
  Gaussian                        e^(−(εr)^2)
  Inverse quadratic               1 / (1 + (εr)^2)
  Inverse multiquadric            1 / √(1 + (εr)^2)
  Multiquadric                    √(1 + (εr)^2)
Piecewise smooth RBFs
  Linear                          r
  Cubic                           r^3
  Thin plate spline (TPS)         r^2 log(r)

The novelty and contribution of our work can be summarized as follows:

• The proposed method decomposes data in an ordered way and uses a novel non-parametric approach for training and for aggregating the results. The in-order decomposition gives the model more capacity for handling nonlinearity in the data, whereas parameter tuning helps in maintaining generality.
• The proposed model enables RBF based curve (or surface) fitting on Big data and gives better performance than the popular Kernel Ridge Regression technique in terms of missing value prediction.

This work mainly targets showcasing the performance of the Multiple Overlapping Window Method (MOWM) relative to the conventional Single Window Method (SWM) and Kernel Ridge Regression (KRR) in terms of accuracy.

2. Related work

This section deals with existing approaches developed with the same motivation, i.e. reducing the memory requirement through parallel execution of decomposed data in a distributed environment. Recent work on enhancing the performance of RBF interpolation by analyzing its parameters is also covered here.

Guha et al. (2012) introduced the divide and recombine approach for training on a large dataset that does not fit in memory. The authors proposed decomposing the dataset into 'm' subsets and performing the operation on each subset at a distributed site; a framework built with 'R' and the Hadoop architecture then combines the results from the distributed sites. The paper demonstrated the improved performance of the model using the map-reduce framework, with large internet packet-level data used to simulate the technique. Chen and Xie (2014) proposed a split and conquer approach for reducing the computation time over a large dataset, using penalized regression to generate estimators on data subsets. The paper gave a theoretical proof that the combined estimator attains asymptotic normality with the same variance as an estimator developed from all the data at once on a high-capacity computer. The main focus of the paper was on finding efficient methods for aggregating all the trained models; both majority voting and averaging were considered for aggregating the independently trained models. Zhang et al. (2015) also proposed a divide and conquer based nonparametric approach, using kernel ridge regression to fit a curve on scattered data. The dataset is partitioned into a certain number of subsets, kernel ridge regression is applied to each subset independently, and the local solutions are then aggregated into a global hypothesis. The motivation was the same: reduce the time and memory requirements without compromising much on accuracy. The authors provided theoretical support for the argument and reported the performance of the model on data with sample counts ranging from 2^12 to 2^17, divided among 1 to 1024 subsets; the results were quite stable, with only a minute increase in error as the subset count grew. Xu, Zhang, Li, and Wu (2016) studied the feasibility of kernel ridge regression on Big data, addressing the gap between theoretical and practical performance on real-world data when aggregating the hypotheses collected from distributed processing sites. To show empirical performance on real-world data, the study used a Twitter Discussion dataset of 583,250 instances with 77 features, decomposed the data into 40, 120, 300, 400 and 1000 partitions, and reported the performance of Ridge, Lasso and Lad regression. Working along the same theme, Chang et al. (2017) extended the contemporary state of the art, which simply averaged all local solutions collected from the different blocks. The paper suggested using local average regression, such as the Nadaraya-Watson kernel (NWK) and K-Nearest Neighbor (KNN) estimators, to average only those outputs whose corresponding inputs satisfy certain localization assumptions. The authors note that average mixture-local average regression (AVM-LAR) is a good approach but places very strict restrictions on 'm', the number of blocks, and they provide two concrete variants of AVM-LAR. The first is based on simulation results showing that the value of 'm' required for optimal learning in AVM-KNN is much higher than for AVM-NWK, attributed to the data-dependent localization parameter of KNN. In the second variant, the output of a model at a new input is driven by the nearest input samples, so if a data block contains no such inputs, the value predicted from that block should have no effect; AVM, by contrast, directly averages the values predicted from all data blocks, which leads to inaccurate prediction. For this reason, the approach adds extra constraints on whether a data block should participate in prediction or not.
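Several of the approaches surveyed above share one skeleton: split the data into 'm' chunks, fit a local estimator on each, and average the local predictions. The following sketch illustrates that skeleton with scikit-learn's KernelRidge as the local learner; the chunk count, kernel, penalty and toy data are illustrative assumptions, not the settings used in the cited papers.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

def divide_and_conquer_krr(X, y, m=8, alpha=1e-2):
    # Randomly split the samples into m chunks and fit one kernel ridge
    # regressor per chunk (the "local estimators" of the surveyed methods).
    idx = np.random.permutation(len(X))
    return [KernelRidge(alpha=alpha, kernel='rbf').fit(X[c], y[c])
            for c in np.array_split(idx, m)]

def averaged_predict(models, X_new):
    # Global hypothesis = plain average of the local predictions.
    return np.mean([mdl.predict(X_new) for mdl in models], axis=0)

# toy usage
X = np.random.uniform(-3, 3, size=(2000, 3))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.05 * np.random.randn(2000)
models = divide_and_conquer_krr(X, y)
print(averaged_predict(models, X[:3]))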

Recent work on RBF itself is summarized next. Wright, in his Ph.D. thesis (Wright, 2003), explored the dimensions of RBF and presented the stepwise developments of the basis functions. In his chapter-wise analysis he examined the performance of RBF near boundary values, stable computations of the multiquadric basis function, and the behavior of the limiting RBF (when ε = 0); Chenoweth and Sarra (2009) explored the generalized Multiquadric basis function. Working with the generalized equation φ(r) = (1 + ε^2 r^2)^β, they performed successive experiments to find the optimal value of β and strongly claimed that there is no performance improvement in moving away from the standard value β = 1/2, thereby disproving earlier claims about finding an optimal β experimentally. Mongillo (2011) worked on the choice of the RBF shape parameter, using Leave One Out Cross Validation (LOOCV), Generalized Cross Validation (GCV) and Maximum Likelihood Estimation (MLE) to tune the parameters and select the most appropriate basis function. He pointed out an instability issue in the interpolation of missing values and, on a performance basis, stated that LOOCV and GCV are not the best choices for unstable problems; MLE, although very accurate in general, also produced high-variance interpolation results in unstable regions. Li et al. (2013) presented the statistical aspect of managing large datasets, describing methods that segment Big data into smaller blocks and join them so as not to lose generality, attaining the same performance as operating on all the data in one go. The authors performed an asymptotic analysis, established asymptotic normality, and proposed a standard error formula for performance verification, which they tested empirically.

Most of the previous work is based on the same idea of decomposing the dataset into 'm' subsets, distributing the subsets among various sites, building a model at each site, and aggregating the models into a general model for prediction. These works give statistical, theoretical proofs of the performance of their models and rarely use real-world datasets to test the generality of the model. None of the previous studies considered decomposing the dataset in order; therefore, despite the availability of decomposed models, their approaches cannot be considered a potential choice for online learning. The proposed method explores a novel nonparametric approach that uses Multiquadric-RBF based curve fitting for interpolation and missing value prediction, based on a divide and conquer strategy. It is observed that MOWM performs better than the other existing techniques for prediction on Big data.

3. Preliminaries

This section gives a brief idea of the existing techniques which are either used directly as intermediaries or used for comparing the performance of the observed results.

3.1. Kernel ridge regression (KRR)

KRR is a strict form of regression which uses a kernel to handle nonlinearity in the data.

• Ridge regression
Ridge regression (Hoerl & Kennard, 1970; Zhang, Shah, & Kakadiaris, 2017) is a constrained and robust form of regression in terms of model capacity. It penalizes the output with the L2 norm of the coefficients as a regularizer, which makes it immune to overfitting, as shown in Eq. (1):

C = \frac{1}{2}\sum_{i=1}^{N}(y_i - w^T x_i)^2 + \frac{1}{2}\lambda \|w\|^2    (1)

where \frac{1}{2}\lambda\|w\|^2 is the regularizer term imposing the penalty for overfitting, λ is a penalty parameter that arises through constrained optimization with a Lagrange multiplier, x_i and y_i are the ith input and output vectors respectively, w is the weight vector and C is the cost. Differentiating the cost C w.r.t. w:

\frac{\partial C}{\partial w} = \sum_{i=1}^{N}(y_i - w^T x_i)(-x_i) + \lambda w

Setting \frac{\partial C}{\partial w} = 0 for the optimal value:

\lambda w = \sum_{i=1}^{n}(y_i - w^T x_i)\, x_i

\lambda w = \sum_{i=1}^{n} y_i x_i - \Big(\sum_{i=1}^{n} x_i x_i^T\Big) w

w = \Big(\lambda I + \sum_{j=1}^{n} x_j x_j^T\Big)^{-1} \sum_{i=1}^{n} y_i x_i    (2)

Eq. (2) gives the weight w corresponding to the optimal value of the cost C; the term x_i x_i^T is the crucial quantity, which is handled using the kernel trick.

• Kernel trick
The kernel trick allows computations in a high-dimensional feature space without explicitly mapping into it (Welling, 2013). The inner product x_i^T x_j in the objective (Eq. (2)) is replaced by K(x_i, x_j), where K is the kernel function. Let φ_i be the feature map of the ith input and ϕ_i the corresponding residual:

\varphi_i = y_i - w^T \phi_i

If L_P is the primal Lagrangian, the objective as a constrained QP can be written as:

L_P = \sum_i \varphi_i^2 \quad \text{subject to} \quad y_i - w^T \phi_i = \varphi_i

Applying the Lagrangian method of optimization over these equations:

L_P = \sum_i \varphi_i^2 + \sum_i \beta_i [\, y_i - w^T\phi_i - \varphi_i\,] + \lambda(\|w\|^2 - w_0^2)

The KKT conditions give insight into the solution:

\varphi_i = \frac{\beta_i}{2}\ \forall i, \qquad 2\lambda w = \sum_i \beta_i \phi_i

Substituting these values into the primal Lagrangian L_P gives the dual Lagrangian L_D as an optimization problem for λ >= 0:

L_D = -\frac{1}{4}\sum_i \beta_i^2 + \sum_i \beta_i y_i - \frac{1}{4\lambda}\sum_{ij}\beta_i\beta_j K_{ij} - \lambda w_0^2

where β and λ are Lagrange multipliers. Substituting α_i = β_i/(2λ) gives:

L_D = -\lambda^2 \sum_i \alpha_i^2 + 2\lambda \sum_i \alpha_i y_i - \lambda \sum_{ij}\alpha_i \alpha_j K_{ij} - \lambda w_0^2, \qquad \lambda >= 0

Differentiating with respect to α gives:

\alpha = (K + \lambda I)^{-1} y

This shows how the kernel trick solves the problem of computing inner products by implicitly working in the higher-dimensional space. KRR thus combines the kernel trick with ridge regression to compute the inner products and learns a linear or nonlinear function according to a linear or nonlinear kernel (An, Liu, & Venkatesh, 2007; Welling, 2013).
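For reference, a minimal NumPy sketch of the closed-form dual solution α = (K + λI)^(-1) y derived above; the Gaussian kernel, toy data and parameter values are arbitrary illustrations, not the paper's implementation.

import numpy as np

def kernel_matrix(A, B, gamma=1.0):
    # Pairwise Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=1e-2, gamma=1.0):
    # Dual coefficients: alpha = (K + lam * I)^-1 y.
    K = kernel_matrix(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    # y_hat(x) = sum_i alpha_i * K(x, x_i).
    return kernel_matrix(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
alpha = krr_fit(X, y)
print(krr_predict(X, alpha, X[:5]))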

3.2. Radial basis function

Missing values in a dataset are a prevalent problem of data engineering. There are several methods to fill missing values, e.g. using the average of the nearest neighbors as fillers for continuous data, or the most frequent terms as fillers for categorical data. Interpolation is a curve fitting based technique which usually works very well for continuous data: it finds the most appropriate curve (one independent variable) or hyperplane/surface (more than one independent variable) by imitating the behavior of the data distribution (Wan & Bone, 1997), where the word appropriate covers the cases of underfitting and overfitting. The method is based on finding an appropriately weighted basis function φ(x) for the data points {x_i}, i = 1, ..., n, whose linear combination fulfills the interpolation criterion, i.e. s(x_i) = Y_i: the value predicted by the system of equations should approximate the actual function value.

s(x) = \sum_{i=1}^{n} \lambda_i\, \phi(x - x_i)    (3)

Eq. (3) describes the weighted aggregation of the basis functions, where φ(x − x_i) is the basis function for data point x_i and λ_i is its weight. If the actual value of the data at point x is Y, the weights are set so that s(x) approximates Y, and the selection of the basis function is independent of the data points. The system of linear equations generated in this manner is always nonsingular for distinct data points. As per the implications of the Haar theorem, this system works well when the data points are one-dimensional, but for multidimensional data the linear system of equations becomes singular and does not work. To make the system usable for multidimensional data, instead of taking a linear combination of a set of basis functions that are independent of the data points, a combination of translates of a single basis function (radially symmetric about its center) is used. This approach was developed by R.L. Hardy and is known as the Radial Basis Function method (Hardy, 1971).

Mhaskar (1992) proposed various conditions on φ(x) under which the system is nonsingular and always satisfies the interpolation condition. A few common examples of uniquely solvable methods are shown in the first five entries of Table 1; among them, the Gaussian and the Multiquadric are the most used for interpolation and classification tasks. The ε in these equations is the shape parameter, a data-dependent hyperparameter that needs to be tuned by cross-validation. Similarly, the generalized multiquadric function (1 + (εr)^2)^β has two hyperparameters, ε and β, which have been explored in previous studies (Mongillo, 2011). However, Chenoweth and Sarra (2009) disproved the relationship between the accuracy of the result and the optimality of β, and showed that appropriate conditions on the system matrix improve the accuracy regardless of the value of the exponent β.

s(x) = \sum_{i=1}^{n} \lambda_i\, \phi(\|x - x_{c_i}\|)    (4)

Eq. (4) expresses that for 'n' data points there are 'n' translates of the basis function, one centered at each data point. The generalized form of RBF can be seen in Botros and Atkeson (1991) and Losser, Li, and Piltner (2014).
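The idea of Eq. (4) can be tried directly with SciPy's Rbf interpolator, which implements the multiquadric basis of Table 1. The sketch below fits a surface over two independent variables and evaluates it at unseen points; the data, smoothing value and shape parameter are illustrative assumptions.

import numpy as np
from scipy.interpolate import Rbf

# Two independent variables (x1, x2) and one dependent variable y.
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 5, 400)
x2 = rng.uniform(0, 5, 400)
y = np.sin(x1) * np.cos(x2)

# Multiquadric basis; 'epsilon' plays the role of the shape parameter and
# 'smooth' that of the smoothing parameter tuned by cross-validation.
surface = Rbf(x1, x2, y, function='multiquadric', epsilon=1.0, smooth=0.01)

# Predict the dependent variable at points where it is missing.
print(surface(np.array([1.3, 2.7]), np.array([0.4, 4.1])))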
3.3. SPARK

Spark is a general-purpose cluster computing engine for large-scale data processing (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2010). Spark decomposes an application program into independent groups of processes which run on a cluster of worker nodes. Communication among processes is established by the sparkContext object of the main program, also known as the driver program, and communication between the driver program and the worker nodes is established through the cluster manager.

Fig. 1. Spark architecture.

The architecture¹ shown in Fig. 1 lets applications run in isolation from one another by assigning a different executor process to each application, which reduces cases of database anomalies such as dirty reads. The dataset abstraction of Spark is the Resilient Distributed Dataset (RDD). An RDD has a specific property known as lazy evaluation: nothing is computed until it has to be. Following this property, an RDD only builds a logical plan of operations with each line of code and starts computing when an action such as count is performed. Each RDD is a collection of data objects partitioned across the cluster; in this way Spark supports an important feature of functional programming known as MapReduce. MapReduce is a processing technique for distributed environments, composed of two elementary tasks, Map and Reduce, always performed in that order. Map takes a set of data as input and converts it into a different set of data by decomposing each data element into key-value (k_i, v_i) pairs, whereas Reduce, always performed after Map, takes the output of Map and combines the data elements to produce a smaller number of data elements. This technique helps in processing multiple operations in parallel with each other.

¹ https://spark.apache.org/docs/latest/cluster-overview.html
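A tiny PySpark sketch of the lazy RDD and the Map/Reduce pattern described above; the toy data and key function are arbitrary.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Lazy RDD: only a logical plan is built until an action (collect/count) runs.
rdd = sc.parallelize(range(10), numSlices=4)

# Map each element to a (key, value) pair, then Reduce by key.
pairs = rdd.map(lambda x: (x % 2, x))
totals = pairs.reduceByKey(lambda a, b: a + b)

print(totals.collect())   # e.g. [(0, 20), (1, 25)]
sc.stop()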

4. Methodology

This section contains the basic notation, the problem formulation, the hyperparameters used, and the detailed method. The importance of the hyperparameters is covered in Theorems 1 and 2.

4.1. Basic notations

Here we define the semantics of the symbols and acronyms used throughout the paper and in the algorithmic framework. A vector of items is written as v with an arrow, where v is the name and the arrow indicates vector behavior; similarly [v] denotes a matrix of items. The input dataset is [D], which contains N data samples, each of the form {x_i, y_i}, where {D} ∈ R × R and i ∈ {1, N}. The hyperparameters used in the method are the window size and the sliding index, referred to as win_size and slid_by and denoted by {ω, S} ∈ R respectively. The basis function is denoted φ and the array of models M. MAE: Mean Absolute Error; MSE: Mean Squared Error; RDD: Resilient Distributed Dataset; SWM: Single Window Method; MOWM: Multiple Overlapping Window Method; KRR: Kernel Ridge Regression.

Fig. 2. Visualization of MOWM's working in 2-d and 3-d.

4.2. Problem formulation

We wish to train multiple small models which together learn the overall pattern hidden in the data without loss of generality. The input data is divided among a few overlapping windows as shown in Fig. 2, which depicts how multiple trained models are developed for the case of one dependent variable and one (Fig. 2(a)) or two (Fig. 2(b)) independent variables. For each window we obtain a set of equations (one per data point), which are solved for the coefficients (λ) as represented in Eq. (5). With these coefficients, the outputs of the overlapping models are aggregated to approximate the actual values corresponding to the input value.

Y_{im} = \sum_{j=(m-1)k+1}^{(m-1)k+n} \lambda_{jm}\, \phi_i(x_{im} - x_j)    (5)

where m ∈ {1, (N−n+k)/k}, i ∈ {(m−1)k+1, (m−1)k+n} and Y_{im} is the ith output value of the mth window. Eq. (5) depicts a set of equations for a particular value of m. Since there are (N−n+k)/k windows in total and each window has the same n equations as the number of points in that window, the n equations of each window are solved for the coefficients λ_{jm}, where j ∈ {(m−1)k+1, (m−1)k+n}.

\hat{F}(X_i) = \frac{1}{(p-l)} \sum_{m=l}^{p} \sum_{j=(m-1)k+1}^{(m-1)k+n} \lambda_{jm}\, \phi_i(X_i - x_j)    (6)

In Eq. (6), F̂ is the average of the values predicted by all (p − l) overlapping models for the input value X_i, i ∈ {1, N}. Our objective is to make this predicted value F̂ approximate the actual value Y.

Table 2
Window size and error (columns are win_size values).

Dataset      Error type   1/2     1/4     1/6     1/8     1/10
Airfoil      MAE          0.402   0.415   0.424   0.434   0.446
             MSE          0.308   0.324   0.334   0.345   0.359
White Wine   MAE          0.623   0.626   0.629   0.634   0.638
             MSE          0.642   0.649   0.653   0.659   0.665
Red Wine     MAE          0.612   0.617   0.622   0.626   0.630
             MSE          0.632   0.642   0.648   0.654   0.659
CBM          MAE          0.293   0.318   0.319   0.367   0.449
             MSE          0.143   0.69    1.026   3.869   5.101
Concrete     MAE          0.328   0.328   0.338   0.350   0.360
             MSE          0.222   0.2310  0.2473  0.2602  0.2704
CCPP         MAE          0.149   0.150   0.152   0.154   0.155
             MSE          0.044   0.045   0.045   0.046   0.047

Table 3
Training results.

             SWM              MOWM
Dataset      MAE     MSE      MAE     MSE
Airfoil      0.198   0.082    0.178   0.071
White Wine   0.603   0.599    0.542   0.488
Red Wine     0.576   0.544    0.533   0.478
CBM          0.261   0.106    0.253   0.103
Concrete     0.172   0.054    0.330   0.223
CCPP         0.104   0.022    0.092   0.018
Beijing PM   Fail    Fail     0.415   0.366
PCP          Fail    Fail     0.394   0.311

4.3. Hyperparameters

The proposed method includes two hyperparameters, win_size and slid_by, where win_size gives the size of the window and slid_by gives the index by which the window is slid to create a new instance. If win_size = 1/2, slid_by = 1/4 and the total number of data points is N, then one window contains ω = N · win_size = N/2 points and the window is slid by S = ω · slid_by = N/(2 · 4) data points. These parameters guide the search for the optimum window size and sliding index that give the best results with the given resources. Their impact on the properties of the data is explained through Theorems 1 and 2, which show how variation in the parameters identifies cases of high or low variance. The effect of these parameters on the performance of the method can be seen in Figs. 4-11 and in Tables 2 and 3.
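A back-of-the-envelope illustration of how win_size and slid_by translate into the window length ω, the sliding step S and the window count; the sample count used here is an arbitrary example.

# Hyperparameters of Section 4.3 turned into concrete window geometry.
N = 10000            # total number of (sorted) training samples (example value)
win_size = 1 / 2     # fraction of N per window
slid_by = 1 / 4      # fraction of the window by which the start index advances

omega = int(N * win_size)       # points per window            -> 5000
S = int(omega * slid_by)        # sliding step                 -> 1250
twc = (N - omega) // S          # window count, as in Eq. (12) -> 4

starts = [j * S for j in range(twc + 1)]   # +1 so the last window ends at N
print(omega, S, twc)            # 5000 1250 4
print(starts)                   # [0, 1250, 2500, 3750, 5000]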

Fig. 3. Visualization of a spatially relaxed window with two sliding instances.

4.4. Proposed method: MOWM

The Multiple Overlapping Window Method (MOWM) follows the divide and conquer approach and operates on a small data chunk at a time. The data chunk is referred to as a window, and the number of data points in a window determines its size (w_len); the spatial size of the window therefore depends on the data density, so two windows of the same window size may differ spatially. In the preliminary stage the method sorts the data in increasing order of Euclidean distance. The first ω · N data points, corresponding to the first window, are picked and a model is prepared for them, where ω is the windowing index and N is the total number of data points. Next, the window is slid by S and another model is prepared. This process of model preparation is repeated until all the data points are covered, and the models collected in this way are aggregated to produce the global model. The window size and sliding index are taken as hyperparameters and are tuned for each dataset; their values are data dependent and are decided on the basis of affordability and the variance in the data (Theorem 1). A bigger data window possesses more generalization capability and so provides a better estimate, but due to resource limitations it must be kept small, under an affordable loss in generalization. The general behavior of the method can be seen in Fig. 2 and the analytical aspect in Fig. 3, where the initial window has size ω, variance V1 and mean μ1, and after sliding by S1 and S2 the new windows have variances V2, V3 and means μ2, μ3. The method is divided into four phases, A, B, C and D: Phase-A performs the preprocessing work, Phase-B develops the data windows and converts them to RDDs for parallelism, Phase-C creates the models, and Phase-D tests the outcome. A pictorial view of the proposed method for one or two independent variables and one dependent variable is given in Fig. 2.

Theorem 1. The variation in variance between two consecutive windows of the same size under the sliding window environment depends on the sliding index.

Proof. Let τ be a set of three windows, τ = {τ1, τ2, τ3}, where τ_i is the parameter set of the ith window, τ_i = {V_i, μ_i, s_i, e_i, x}, with V the variance, μ the mean, s the start index, e the end index and x the data points in the window. Let the variance of the first window be V1; if the window is slid by S1 the variance is V2, and if it is slid by S2 such that S1 < S2 the variance becomes V3. The theorem requires proving that |V1 − V2| < |V1 − V3|.

The (unnormalized) variances of the three windows are:

V_1 = \sum_{i=1}^{\omega}(x_i - \mu_1)^2, \quad V_2 = \sum_{i=1+S_1}^{\omega+S_1}(x_i - \mu_2)^2, \quad V_3 = \sum_{i=1+S_2}^{\omega+S_2}(x_i - \mu_3)^2

The difference between the variances of window 1 and window 2 is:

V_1 - V_2 = \sum_{i=1}^{\omega}(x_i - \mu_1)^2 - \sum_{i=1+S_1}^{\omega+S_1}(x_i - \mu_2)^2
          = \Big(\sum_{i=1}^{\omega}x_i^2 - \sum_{i=1+S_1}^{\omega+S_1}x_i^2\Big) - \omega\mu_1^2 + \omega\mu_2^2

using \sum_{i=1}^{\omega}x_i = \omega\mu_1 and \sum_{i=1+S_1}^{\omega+S_1}x_i = \omega\mu_2. Cancelling the overlapping terms of the two sums of squares gives:

V_1 - V_2 = \sum_{i=1}^{S_1}x_i^2 - \sum_{i=\omega+1}^{\omega+S_1}x_i^2 + \omega(\mu_2^2 - \mu_1^2)    (7)

Similarly, the difference between the variances of window 1 and window 3 is:

V_1 - V_3 = \sum_{i=1}^{S_2}x_i^2 - \sum_{i=\omega+1}^{\omega+S_2}x_i^2 + \omega(\mu_3^2 - \mu_1^2)    (8)

Let V_1 − V_2 = δ_1 and V_1 − V_3 = δ_2 in Eqs. (7) and (8), where δ_1, δ_2 are the variations between the variances of successive windows. Subtracting Eq. (7) from Eq. (8) gives the difference between window 3 and window 2:

\delta_2 - \delta_1 = \sum_{i=S_1+1}^{S_2}x_i^2 - \sum_{i=\omega+S_1+1}^{\omega+S_2}x_i^2 + \omega(\mu_3^2 - \mu_2^2)

Since S_1 and S_2 are consecutive slid_by indexes, they are very close; hence \sum_{i=S_1+1}^{S_2}x_i^2 is small, and \sum_{i=S_1+1}^{S_2}x_i^2 - \sum_{i=\omega+S_1+1}^{\omega+S_2}x_i^2 is even smaller with respect to \omega(\mu_3^2 - \mu_2^2). Ignoring the smaller terms:

\delta_2 - \delta_1 = \omega(\mu_3^2 - \mu_2^2) = \frac{1}{\omega}\Big[\Big(\sum_{i=S_2+1}^{\omega+S_2}x_i\Big)^2 - \Big(\sum_{i=S_1+1}^{\omega+S_1}x_i\Big)^2\Big]

Since the data points are in increasing order, δ_2 − δ_1 is always positive and depends on the value of S_2 − S_1. This simply depicts that the variance of the decomposed data distribution depends on the value of the slid_by index. The behavior can be tracked in Fig. 4, which shows the change in variance for a varying sliding index at a constant window size; Fig. 4(a) shows all variables and Fig. 4(b) the dependent variable. □

Theorem 2. The cumulative variance of all windows under the sliding window method depends on the selected window size.

Proof. Let V_j be the variance of the jth window having ω data points x_i, i ∈ {1, ω}, and let there be t small windows in total, j ∈ {1, t}.

The variance of window j is:

V_j = \frac{1}{\omega}\sum_{i=1}^{\omega}(x_i - \mu_j)^2
    = \frac{1}{\omega}\sum_{i=1}^{\omega}x_i^2 + \mu_j^2 - 2\mu_j\,\frac{\sum_{i=1}^{\omega}x_i}{\omega}
    = \frac{1}{\omega}\sum_{i=1}^{\omega}x_i^2 - \mu_j^2

Similarly, if there are N data points in total, the variance of the single window covering all the data is:

V_{net} = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2    (9)

The cumulative variance of all the small windows is:

V_{Tot} = V_1 + V_2 + \ldots + V_t
        = \frac{1}{\omega}\Big(\sum_{i=1}^{\omega}x_i^2 + \sum_{i=\omega+1}^{2\omega}x_i^2 + \ldots + \sum_{i=(t-1)\omega+1}^{t\omega}x_i^2\Big) - \big(\mu_1^2 + \mu_2^2 + \ldots + \mu_t^2\big)

Since \sum_{i=1}^{N}x_i^2 = \sum_{i=1}^{\omega}x_i^2 + \sum_{i=\omega+1}^{2\omega}x_i^2 + \ldots + \sum_{i=(t-1)\omega+1}^{t\omega}x_i^2, it follows that:

V_{Tot} = \frac{1}{\omega}\sum_{i=1}^{N}x_i^2 - \big(\mu_1^2 + \mu_2^2 + \ldots + \mu_t^2\big)    (10)

Subtracting Eq. (9) from Eq. (10):

V_{Tot} - V_{net} = \frac{1}{\omega}\sum_{i=1}^{N}x_i^2 - \big(\mu_1^2 + \ldots + \mu_t^2\big) - \frac{1}{N}\sum_{i=1}^{N}x_i^2 + \mu^2

With t small windows we have t = N/ω. Assuming \mu_1^2 + \mu_2^2 + \ldots + \mu_t^2 \approx t\mu^2:

V_{Tot} - V_{net} = \Big(\frac{t}{N} - \frac{1}{N}\Big)\sum_{i=1}^{N}x_i^2 - (t-1)\mu^2 = (t-1)\Big(\frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2\Big)

Since \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2 >= 0, V_{Tot} − V_{net} is always positive, which shows that the cumulative variance of the small windows is greater than that of one big window. The effect of this behavior can be witnessed in the performance of the method shown in Table 2: bigger windows perform well, and the performance keeps decreasing as the size of the window decreases. □

Fig. 4. Variance vs. sliding width, keeping the window size constant.
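The quantities analyzed in Theorems 1 and 2 can be inspected numerically. The sketch below computes, on synthetic distance-sorted data, how a window's variance moves away from V1 as the slide grows, and compares the cumulative variance of the small windows with the variance of the single full window; the data, window size and slide values are arbitrary.

import numpy as np

# Synthetic sorted data standing in for the distance-sorted training set.
x = np.sort(np.random.default_rng(2).exponential(size=2000))
omega = 500                      # window length (illustrative)

def window_variance(start):
    return x[start:start + omega].var()

V1 = window_variance(0)
for slide in (50, 100, 200, 400):
    # |V1 - V_slid| as a function of the slide (the quantity of Theorem 1).
    print(slide, abs(V1 - window_variance(slide)))

# Cumulative variance of the non-overlapping windows vs. one full window
# (the two sides compared in Theorem 2).
t = len(x) // omega
V_tot = sum(window_variance(j * omega) for j in range(t))
print(V_tot, x.var())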

Fig. 5. MAE vs. RBF: smoothing parameter.

4.4.1. Phase-A

We assume the raw data is ready to serve and has already been processed with standard preprocessing techniques. We divide the data randomly into two parts: 70% (split index = 0.7) of the data for training and the remaining 30% for testing. Since the proposed method is based on the sliding window concept, the input data is first sorted in increasing order of Euclidean distance, following the logic of Algorithm 2. To compute the distances we need to choose a source point S = (s_0, s_1, ..., s_{m-1}) for m-dimensional data, whose value must be less than the minimum value of each attribute, i.e. s_j < min[x_{ij}] over i for j ∈ (0, m − 1).

\Big[\, d_i = \sqrt{(x_{i,0} - s_0)^2 + (x_{i,1} - s_1)^2 + \ldots + (x_{i,m-1} - s_{m-1})^2}\, \Big]_{i=0}^{n}    (11)

Eq. (11) gives the general idea of calculating the Euclidean distances for n points over m dimensions.
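A NumPy sketch of the Phase-A preparation: pick a source point just below the per-dimension minima, compute the distances of Eq. (11) and sort the samples by them, in the spirit of Algorithm 2; the offset and toy data are illustrative.

import numpy as np

def arrange_data(data, offset=1e-3):
    # Source point slightly below the minimum of every attribute.
    source = data.min(axis=0) - offset
    # Euclidean distance of every sample from the source point (Eq. (11)).
    dist = np.sqrt(((data - source) ** 2).sum(axis=1))
    order = np.argsort(dist)                 # ArrangeData: sort by distance
    return data[order], dist[order], source

X = np.random.default_rng(3).normal(size=(1000, 5))
X_sorted, d_sorted, src = arrange_data(X)
print(src, d_sorted[:3])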
tially, all the prepared windows are marked for its beginning and
4.4.2. Phase-B

This phase takes the sorted data as input from the previous phase and performs the operations explained in the following steps.

Step-1. The sorted data received from the previous stage is converted to data vectors. The two hyperparameters, window length (ω) and sliding index (S), are used to generate multiple data windows as explained in Algorithm 3. The algorithm takes the training data, window size and sliding index as inputs and calculates the total window count (twc) corresponding to the given hyperparameters (line 8 in Algorithm 3):

twc = \frac{(train - \omega)}{S}    (12)

Eq. (12) gives the total number of windows generated through the sliding process, where train is the length of the received training data and twc is the total window count. Lines 9-15 of Algorithm 3 create the windows corresponding to the given window length and sliding index.

Step-2. In this step, the data windows generated in Step-1 are parallelized on the Spark architecture. Initially, all the prepared windows are marked with their beginning and ending distance from the source point (line 7, Algorithm 1). Next, the marked windows are converted to an RDD and forwarded to the mapper function, which returns the trained model pwm_i for the ith window (lines 8-10). The mapper function is explained in Algorithm 4; it trains the windows in parallel on different nodes. At this point the number of windows to be plugged into the RBF for training in parallel is decided judiciously, based on the availability of primary memory to hold the trained models of all the data windows simultaneously.
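A condensed sketch of the Phase-B pipeline: slice the distance-sorted training matrix into overlapping windows (the role of Algorithm 3) and fit one multiquadric RBF per window in parallel with Spark (the role of Algorithm 4). To keep the example self-contained only a per-window error is collected back; the window sizes, smoothing value and synthetic data are assumptions, and SciPy/NumPy are assumed to be available on the Spark workers.

import numpy as np
from scipy.interpolate import Rbf
from pyspark import SparkContext

def make_windows(data, omega, S):
    # Overlapping windows over the rows of the sorted training matrix.
    twc = (len(data) - omega) // S
    return [data[j * S: j * S + omega] for j in range(twc + 1)]

def fit_and_score(window, smooth=0.05):
    # Fit a multiquadric RBF on one window (inputs first, output last column)
    # and return only its in-window MAE, so plain floats travel back.
    cols = [window[:, k] for k in range(window.shape[1] - 1)]
    y = window[:, -1]
    model = Rbf(*(cols + [y]), function='multiquadric', smooth=smooth)
    return float(np.mean(np.abs(model(*cols) - y)))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.uniform(0, 5, size=(3000, 2))
    X = X[np.argsort(np.linalg.norm(X, axis=1))]        # Phase-A style sorting
    y = np.sin(X[:, 0]) + np.cos(X[:, 1])
    train = np.column_stack([X, y])

    sc = SparkContext(appName="mowm-phase-b")
    windows = make_windows(train, omega=len(train) // 2, S=len(train) // 8)
    scores = sc.parallelize(windows, numSlices=4).map(fit_and_score).collect()
    sc.stop()
    print(scores)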

4.4.3. Phase-C

In order to make accurate interpolations, we need to learn the pattern of the data without any loss of generality. Here is a crucial point: because of the limitation of primary memory we cannot afford to make a single model using all the data. Algorithm 1 uses ρ, which is decided on the basis of the available memory resource and tells how many windows can reside in memory simultaneously.

Assumption 1. The data is big enough that a single model trained on the whole data cannot fit in memory.

Assumption 2. Increasing the system memory is not a feasible option.

Therefore, we can only reduce the data size for model creation; however, data reduction affects the generalization performance. The best remaining option is to decompose the whole dataset into multiple data chunks and generate several trained models, one per chunk. Creating a disjoint decomposition leads to the problem of model isolation, where all models are developed independently and even two consecutive models share no common information, which leads to high variance, i.e. overfitting and poor generalization. To overcome this conceptual setback, this approach decomposes the data in an overlapping manner, which retains the connection between successive windows (Theorem 1). The models trained on the decomposed dataset easily reside in memory, and because of the overlapping connections every data point influences the other data points in its vicinity up to a particular distance. This helps in controlling the variance.

S_j(x) = \sum_{i=1+j\cdot S}^{\omega+j\cdot S} \lambda_i\, \phi(\|x - x_{c_i}\|)    (13)

Y(x_i) = \frac{S}{\omega} \sum_{k=(l-1)/S}^{(l-1)/S + \omega/S} S_k(x_i)    (14)

Eq. (13) expresses that the jth window of the m-dimensional data is plugged into the RBF function to generate the jth model, and Eq. (14) shows the aggregation of all the overlapping models to predict the value of the dependent variable. The smoothing index is tuned by cross-validation for each dataset. In the next step, the trained models are swapped from main memory to auxiliary memory in order to make room for new models.

Algorithm 1 Missing_Value_Prediction([D], ω, S, ρ)
1: [train], [test] = SplitRandomly([D], SplitIndex)
   Training:
2: dtn = EuclideanDistance([train], origin)
3: [Train], d = ArrangeData([train], dtn)
4: dim = length([D][0])
5: [w_m] = CWM([Train], dim, ω, S)
6: w_count = CountRows([w_m])
7: s, e = Marker(ω, S, d, [Train])
8: for i = 0 to w_count do
9:    RDD = to_Rdd([w_m][i·ρ : (i+1)·ρ, :])
10:   pwm_i = RDD.map(Fun)
11: end for
   Testing:
12: dtt = EuclideanDistance([test], origin)
13: [Test], dtt = ArrangeData([test], dtt)
14: for i = 0 to len([Test]) do
15:   M = SelectModels([Test][i], s, e, dtt)
16: end for
17: models.enqueue(M)
18: pop = 0
19: push = ρ − 1
20: for i = 0 to len([Test]) do
21:   count = 0, s = 0
22:   if (M[i][0] != M[i−1][0]) then
23:     pop = pop + 1
24:     if (pop == ρ) then
25:       models.dequeue(removed_from_Primary)
26:       pop = 0
27:     end if
28:   end if
29:   if ([M][i][−1] != [M][i−1][−1]) then
30:     push = push − 1
31:     models.enqueue(load_next_model)
32:     push = 1
33:   end if
34:   observed_val[i] = CalcValues(models, [Test][i])
35: end for
36: MAE = MeanAbsoluteError(observed, label)
37: MSE = MeanSquaredError(observed, label)
38: return (MAE, MSE)

Algorithm 2 ArrangeData([data], dist)
1: s_argu = argsort(dist)
2: for i = 0 to len(s_argu) do
3:   [s_data][i, :] = [data][s_argu[i], :]
4:   [s_dist][i] = [dist][s_argu[i]]
5: end for
6: return [s_data], s_dist

Algorithm 3 CWM([train], ω, S, dim)
1: Creating the row vector:
2: beg = 0, end = len([train])
3: for i = 0 to dim do
4:   [temp][beg : end] = [train][0 : len([train]), i]
5:   beg = end, end = end + len([train])
6: end for
7: Creating the window stacks:
8: twc = (len([train]) − ω)/S
9: for j = 0 to twc do
10:  a = 0
11:  for i = 0 to dim do
12:    [w_m][j, ω·i : (i+1)·ω] = [temp][a + S·j : a + ω + S·j]
13:    a = a + len([train])
14:  end for
15: end for
16: return [w_m]

Algorithm 4 Fun(RDD)
1: Mapper:
2: for i = 0 to dim do
3:   [v][i] = RDD[i·ω : (i+1)·ω]
4: end for
5: pwm = RBF([v][0], [v][1], [v][2], ..., [v][dim−1], smooth)
6: return pwm

4.4.4. Phase-D

This is the last phase, which deals with testing and evaluation of the trained models. As input it takes all the trained models, the training data, the source point and the testing data. The core of this phase is to identify which of the multiple trained models accurately cover the queried input data point. The method of selecting the appropriate models and putting markers on them is explained in Algorithm 5, labeled Marker. The idea of a marker is to find the extreme points of each trained model, the ones nearest to and farthest from the source, which are marked as sentinels. For each testing data point, its Euclidean distance from the source is calculated and compared with the sentinels, and all the trained models that cover the queried testing point are taken into account; this is explained in Algorithm 6, labeled SelectModels.

All the selected models are loaded in a carefully managed manner, as explained in Algorithm 1, which provides deterrence from the memory overflow problem. The models selected as suitable for a particular testing point using Algorithm 6 can be given weights:

value_i = \frac{1}{n}\sum_{i=1}^{n}\big[\, w_i \cdot model_i(x)\,\big]    (15)

Eq. (15) is the mean of the weighted predictions of the selected models on the testing data. The weights can be decided using the data distribution properties of each window and its alignment with the global data behavior; however, this work uses uniform weights for all models and therefore a uniform mean rather than a weighted mean. We have used the 10-fold cross-validation technique for generating the results.

Algorithm 5 Marker(ω, S, d, [Train])
1: for i = 0 to len([Train]) do
2:   s[i] = d[S·i]
3:   e[i] = d[S·i + ω]
4: end for
5: return s, e

Algorithm 6 SelectModels([Test], s, e, dtt)
1: for i = 0 to len([Test]) do
2:   k = 0
3:   for j = 0 to len(s) do
4:     if ([Test][i] > s[j] and [Test][i] < e[j]) then
5:       [M][i, k] = j
6:     end if
7:     k = k + 1
8:   end for
9: end for
10: return [M]
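A minimal sketch of the Phase-D logic around Algorithms 5 and 6 and Eq. (15): every window model carries the distance range [s[j], e[j]] it was trained on, a query is routed to the models whose range covers its distance from the source, and their predictions are combined with the uniform mean used in this work. The fallback to the nearest window and the toy "models" are illustrative assumptions of this sketch.

import numpy as np

def select_models(query_dist, s, e):
    # Algorithm 6 in spirit: indices of the windows whose sentinel range
    # [s[j], e[j]] covers the query's distance from the source point.
    return [j for j in range(len(s)) if s[j] < query_dist < e[j]]

def predict(query_x, query_dist, models, s, e):
    chosen = select_models(query_dist, s, e)
    if not chosen:
        # Fallback (an assumption of this sketch): use the window whose
        # range midpoint is closest to the query distance.
        mid = (np.asarray(s) + np.asarray(e)) / 2.0
        chosen = [int(np.argmin(np.abs(mid - query_dist)))]
    preds = [models[j](query_x) for j in chosen]
    return float(np.mean(preds))     # uniform weights, Eq. (15)

# toy usage: two fake window "models" with overlapping distance ranges
models = [lambda x: 2 * x, lambda x: 2 * x + 0.1]
s, e = [0.0, 1.0], [2.0, 3.0]
print(predict(1.2, query_dist=1.5, models=models, s=s, e=e))   # averages both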

5. Experiments and results

This section contains detailed information on the datasets used and the system setup.

5.1. Datasets

Open source datasets from the UCI repository (Lichman, 2013; Rana, Sharma, Bhattacharya, & Shukla, 2015) are used in this work for validating the algorithm.

• Airfoil self-noise: A NACA-0012 airfoil dataset for different sizes, wind tunnel speeds and angles of attack, provided by NASA. It has five attributes, 'Frequency', 'Angle of attack', 'Chord length', 'Free-stream velocity' and 'Suction side displacement thickness', as input variables and 'Scaled sound pressure level' as the output variable. In total it contains 1503 samples, each with six dimensions.
• Wine quality: Data on the Portuguese 'Vinho Verde' wine. The dataset contains 11 input variables, 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulfates' and 'alcohol', with the 'quality score' as the output variable. Wine quality-red: the red wine dataset has 1600 samples. Wine quality-white: the white wine dataset has 4899 samples.
• Condition based maintenance of naval propulsion plants (CBM): Measurements of the 15 features of a gas turbine taken at a steady physical state: 'Lever position (lp)', 'Ship speed (v) [knots]', 'Gas Turbine (GT) shaft torque (GTT) [kN m]', 'GT rate of revolutions (GTn) [rpm]', 'Gas Generator rate of revolutions (GGn) [rpm]', 'Starboard Propeller Torque (Ts) [kN]', 'Port Propeller Torque (Tp) [kN]', 'High Pressure (HP) Turbine exit temperature (T48) [C]', 'GT Compressor inlet air temperature (T1) [C]', 'GT Compressor outlet air temperature (T2) [C]', 'HP Turbine exit pressure (P48) [bar]', 'GT Compressor inlet air pressure (P1) [bar]', 'GT Compressor outlet air pressure (P2) [bar]', 'GT exhaust gas pressure (Pexh) [bar]', 'Turbine Injection Control (TIC) [%]' and 'Fuel flow (mf) [kg/s]' as input variables, and the 'GT Turbine decay state coefficient' as the output variable. It contains 11934 samples, each with 15 dimensions.
• Concrete compressive strength: 1030 samples, each with nine attributes: 'Cement (kg in a m³ mixture)', 'Blast Furnace Slag (kg in a m³ mixture)', 'Fly Ash (kg in a m³ mixture)', 'Water (kg in a m³ mixture)', 'Superplasticizer (kg in a m³ mixture)', 'Coarse Aggregate (kg in a m³ mixture)', 'Fine Aggregate (kg in a m³ mixture)', 'Age (days, 1 to 365)' and 'Concrete compressive strength (MPa)' as the output variable. The dataset is about finding the pattern of concrete strength for the given attributes.
• Combined cycle power plant (CCPP): A yearly dataset of a power plant. The features are hourly averaged ambient variables recorded by sensors across the plant; the variables are recorded on a per-second scale and averaged hourly. There are 9568 samples over five dimensions, with input variables 'Temperature (T), 1.81 °C to 37.11 °C', 'Ambient Pressure (AP), 992.89-1033.30 millibar', 'Relative Humidity (RH), 25.56% to 100.16%' and 'Exhaust Vacuum (V), 25.36-81.56 cm Hg', and output variable 'Net hourly electrical energy output (EP), 420.26-495.76 MW'.
• Physicochemical properties of protein tertiary structure (PCP): Physicochemical properties of proteins, with 'Total Surface Area', 'Euclidean distance (ED)', 'Total empirical energy', 'Secondary structure penalty (SS)', 'Sequence length (SL)' and 'Pair number (PN)' as independent variables and 'RMSD (Root mean standard deviation)' as the dependent variable. It contains 99986 samples, each with seven dimensions.
• Beijing PM2.5: A time-based pollution dataset of Beijing city containing 'year', 'month', 'day', 'hour', 'DEWP (Dew Point)', 'TEMP (Temperature)', 'PRES (Pressure)', 'cbwd (Combined wind direction)', 'Iws (Cumulated wind speed)', 'Is (Cumulated hours of snow)' and 'Ir (Cumulated hours of rain)' as independent variables and 'PM2.5 concentration (ug/m³)' as the dependent variable. It contains 41754 samples, each with 11 dimensions.

5.2. System setup

• Hardware specifications: All the experiments are performed on a Fujitsu 64-bit workstation with 64 GB of primary memory, 48 cores and two TB of auxiliary memory.

• Software specifications: Operating System: Ubuntu 14.04, Pro- in Mean Absolute Error (MAE) subjected to variation in win_size
gramming Language: Python 3.5, Programming Ide: Spyder on (ω) and variation in sliding width i.e. slid_by (S), subfigure (b)
Anaconda environment, Big data Architecture: spark-2.1.1 (pys- tells variation in MAE by variation in win_size (ω) (lower one ‘l’)
park). with a constant sliding index and variation in MAE by variation
in sliding width (S) (upper one ‘u’) keeping the constant window
5.3. Experimental detail size respectively. Subfigure (c) and subfigure (d) follows the same
pattern of subfigure (a) and subfigure (b) and they are for Mean
Each of the previously explained datasets goes through prepro- Squared Error (MSE). In the present work, hyper-parameters are
cessing step before logging into main Algorithm 1. Where data pre- tuned using k-fold cross-validation technique where k is fixed as 10
processing involves data standardization as per the Eq. (16). for all datasets. The proposed algorithm is tested on the regression
data sets explained in Section 5.1 and testing results are shown in
dataset − mean(dataset )
dataset = (16) Table 4. The table contains results of RBF based interpolation using
V ariance(dataset ) conventional Single Window Method (SWM) approach, proposed
In order to sort the data using Algorithm 2, it is required to de- MOWM, and results of KRR with d=1 (all data), d=4 (data decom-
fine the source point. Where source point is defined as a point in posed among 4 data chunks randomly). First three columns of ta-
same dimensions with the value slightly lesser than the minimum ble show data set information next two columns show RBF based
value of any point in that dimension. The standardized dataset MAE using conventional method and of the proposed method, next
passes through RBF having randomly chosen smoothing parame- two columns show results of KRR for decomposition index as one
ter to train a model. The training is performed using 10-fold cross- and for decomposition index as four next columns show the same
validation technique corresponding to particular smoothing param- results in case of MSE. Last 2 column shows the value of tuned
eter. The behavior of variation in smoothing parameter and its im- hyperparameters for the performance of the proposed method i.e.
pact on error can be understood from Fig. 5. We have selected win_size, slid_by respectively. Table 4shows the mean absolute er-
smoothing value corresponding to least MAE as the final choice ror and mean squared error corresponding to tuned hyperparame-
throughout the experiment for the particular dataset. Big data ar- ters where-as training results are shown in Table 3. From both of
chitecture (Spark) helps in training multiple small models simul- these tables, one analysis can be drawn very easily which explains
taneously. In order to find the appropriate value of window size that the training result of MOWM is always better than SWM,
and sliding index, we have cross-validated it with several values where-as testing results don’t behave proportionally. Which indi-
which can be referred from Table 2 and Figs. 6–11. The figures cates inclination of the proposed approach towards high variance.
clearly explain the nature of error with increasing or decreasing This high capacity of the model makes it the obvious choice for
the values of mentioned hyperparameters. At testing point extra the system of having a long input range of independent variables
care is taken while bringing the trained windows on the mem- with local influence on dependent variables. Since the objective of
ory so that the stack of brought-up models should not overwhelm experiments is to analyze the relative performance of both SWM
the primary memory. In order to expedite the process, we have and MOWM approaches, therefore, other parameters of RBF inter-
used testing set in sorted order. We have used Queue data struc- polation are kept as constant throughout including shape param-
ture to perform this task and reduced the number of disk ac- eter which is tuned separately corresponding to each dataset and
cess by keeping the neighborhood windows in memory, removing shown is Fig. 5.
only the obsolete windows and invoking only few windows as per On most of the datasets (Table 4) best generalization perfor-
demand. mance of the MOWM is found when win_size was 1/2 which goes
5.4. Results and discussion

The proposed method is developed to solve the out-of-memory problem faced while building a model using huge training data. The algorithm is based on a divide-and-conquer approach and pays attention to controlling the bias-variance trade-off by tuning the hyperparameters precisely. The insight of the proposed method can be witnessed from Fig. 12, which pictorially describes the case of two (3-D) and one (2-D) independent variables and one dependent variable. Fig. 12(a) shows two neighborhood sliding instances and the coverage of a multidimensional window, and Fig. 12(b) and (c) give its views in lower dimensions. This pictorial representation signifies the importance of the 'N'-dimensional sorting of the given data points and its meaningful impact on the neighborhood.

The proposed algorithm takes the values of win_size and the slid_by index as hyperparameters, where win_size decides the size of each model and slid_by decides the overlapping index. The method proceeds by decomposing the total data points into multiple small data chunks, each having a smaller number of data points (equal to win_size). Generating training models for all these chunks and aggregating them together provides a higher capacity to capture the nonlinearity of the pattern existing in the data points. The parameters need to be tuned in a sophisticated manner, else the approach may lead to high variance, causing overfitting and poor generalization.
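A minimal single-machine sketch of this decomposition is shown below. It assumes SciPy's `Rbf` as the per-window learner, sorting along one dimension as a stand-in for the paper's 'N'-dimensional sorting, and simple averaging as the aggregation; the helper names (`train_mowm`, `predict_mowm`) are hypothetical.

```python
import numpy as np
from scipy.interpolate import Rbf

def train_mowm(X, y, win_size=0.5, slid_by=0.1, sort_dim=0):
    """Fit one small RBF surface per overlapping window of the sorted data."""
    order = np.argsort(X[:, sort_dim])         # neighbouring rows become spatial neighbours
    X, y = X[order], y[order]
    n = len(X)
    win = max(int(n * win_size), 2)            # points per chunk  (win_size fraction)
    step = max(int(n * slid_by), 1)            # slide per chunk   (slid_by fraction)
    models = []
    for start in range(0, n - win + 1, step):  # overlapping chunks; tail handling omitted
        Xw, yw = X[start:start + win], y[start:start + win]
        model = Rbf(*Xw.T, yw, function='multiquadric')
        models.append((Xw[0, sort_dim], Xw[-1, sort_dim], model))   # extent + model
    return models

def predict_mowm(models, X_new, sort_dim=0):
    """Aggregate (here: average) every window that covers a query point."""
    preds = []
    for x in X_new:
        outs = [m(*x) for lo, hi, m in models if lo <= x[sort_dim] <= hi]
        if not outs:                           # fall back to the nearest window
            lo, hi, m = min(models, key=lambda t: min(abs(x[sort_dim] - t[0]),
                                                      abs(x[sort_dim] - t[1])))
            outs = [m(*x)]
        preds.append(float(np.mean(outs)))
    return np.array(preds)
```

With win_size = 1/2 only half of the samples ever enter one RBF system of equations, which is what keeps the per-model memory footprint bounded.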
The optimization of errors on the various datasets is shown in Figs. 6–11. In each of these figures, subfigure (a) depicts the variation of the error. The parameters of the interpolation are kept constant throughout, including the shape parameter, which is tuned separately for each dataset and is shown in Fig. 5.

On most of the datasets (Table 4), the best generalization performance of MOWM is found when win_size is 1/2, and this performance goes on decreasing as win_size decreases. This is the point where one needs to take a judicious decision, taking into account the system configuration, the error tolerance, and the affordable time. If a system can afford to retain a training model built using 50% of the data, then win_size = 1/2 (or > 1/2) would be the choice for obtaining the best performance. Similarly, depending on how much a system can hold in its memory, a proportional performance is achieved. The training time can be reduced by increasing the parallelism in the model, which is achieved by having a larger number of smaller models (keeping win_size small) so that more models can run simultaneously. A decrease in the slid_by index, on the other hand, leads to more overlapping and therefore generates a larger number of windows; this is more time-consuming but gives better generalization performance. Table 2 shows the behavior of the error on various datasets with decreasing win_size. The results clearly state that on a few datasets reducing win_size does not make much difference to the model's performance: for instance, on the WhiteWine dataset reducing win_size from 50% to 25% increases the error by only 0.3%, and on the CCPP dataset reducing win_size from 50% to 33% increases the error by only 0.5%. This shows that a substantial reduction in memory requirement comes with very little error overhead, which makes the method suitable for many applications.

The error curves in Figs. 6–11 show the influence of the window size and the sliding index on the behavior of MAE and MSE. It is quite clear that win_size has more influence on performance than the slid_by index. In most of the datasets, the error-versus-slid_by curves are found to be very stable and depend only weakly on the slid_by index, whereas changes in win_size cause sudden deflections in the error curve.
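To make this trade-off concrete, the number of windows (and hence trained sub-models) implied by the two hyperparameters can be estimated as follows; this uses the simplified windowing of the sketch above, so the paper's exact counts may differ.

```python
def window_count(n_samples, win_size, slid_by):
    """Approximate number of overlapping windows / sub-models."""
    win = int(n_samples * win_size)
    step = int(n_samples * slid_by)
    return (n_samples - win) // step + 1

print(window_count(10_000, 1/2, 1/15))   # 8  -> few large models, more memory each
print(window_count(10_000, 1/10, 1/10))  # 10 -> many small non-overlapping models
```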

Fig. 6. AirFoil data.

Fig. 7. WhiteWine data.

Table 4
Results of the SWM and MOWM approaches.

                              MAE                               MSE                               Parameters of MOWM
                              RBF             KRR               RBF             KRR
Name          Dim  Samples    SWM     MOWM    d#1     d#4       SWM     MOWM    d#1     d#4      W_Size   Slid_by
Airfoil       6    1503       0.392   0.402   0.558   0.577     0.300   0.308   0.542   0.570    1/2      1/25
White Wine    12   4899       0.629   0.623   0.668   0.668     0.654   0.642   0.750   0.749    1/2      1/15
Red Wine      12   1600       0.617   0.612   0.627   0.639     0.632   0.632   0.680   0.681    1/2      1/10
CBM           15   11,934     0.289   0.293   0.458   0.605     0.124   0.232   0.309   0.531    1/2      1/2
Concrete      9    1030       0.337   0.329   0.588   0.591     0.224   0.222   0.588   0.581    1/2      1/10
CCPP          5    9569       0.150   0.149   0.212   0.212     0.044   0.044   0.071   0.071    1/2      1/15
Beijing PM    11   41,754     Fail    0.513   0.625   0.624     Fail    0.540   0.759   0.759    1/10     1/10
PCP           7    99,986     Fail    0.492   Fail    0.617     Fail    0.565   Fail    0.616    1/10     1/10
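The MAE and MSE columns in Table 4 are the usual error metrics computed on the held-out test points; for reference (not the authors' code):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```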

Fig. 8. RedWine data.

Fig. 9. CBM data.

• Bias-variance tradeoff: A higher value of win_size covers a broader perspective and generates a more meaningful model, but with a higher memory requirement. Therefore, as per the resource availability, we decide on a smaller value of win_size and then tune the second hyperparameter, the slid_by index, under the condition of not compromising much on the model's performance. Keeping the slid_by index high leads to less overlapping between the current window and the next window, thereby generating a model that is more sensitive to new data points, which is the case of high variance. A smaller slid_by index, in contrast, leads to more overlapping between two consecutive windows and hence less sensitivity to new data points, so the model will be more biased. The effect of these hyperparameters on the data variance can be analyzed through Theorems 1 and 2, and their effect on the model's performance can be realized through Tables 2–4 (a short numeric illustration of this overlap follows the list below).

Fig. 10. Concrete data.

Fig. 11. CCPP data.

• Applications: MOWM provides an approach to managing the training of huge data with smaller memory. The method can be used for interpolation in general and for use-case-specific applications such as handling missing values in genome sequences, sensor data in the power industry, losses in streaming data, etc.
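As a back-of-the-envelope illustration of the overlap effect mentioned in the first item above (again using the simplified windowing of the earlier sketch, not the paper's exact layout):

```python
def overlap_fraction(win_size, slid_by):
    """Share of a window's points that are also in the next window."""
    return max(0.0, (win_size - slid_by) / win_size)

print(overlap_fraction(1/2, 1/25))  # 0.92 -> heavy overlap: smoother, more biased models
print(overlap_fraction(1/2, 1/2))   # 0.0  -> disjoint windows: more variance
```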

Fig. 12. Visualization of neighborhood analysis.

6. Conclusion

This paper presents an RBF based surface fitting method for interpolating missing values in the context of Big data. In order to solve the memory overflow problem, we propose a data decomposition based approach of building multiple small trained models on overlapped data windows. We further aggregate the independent results of the models in a way that retains the generality of the architecture as a whole. The size of the data chunk is decided by the hyperparameter window length (win_len), which facilitates the window to shrink and expand spatially as per the data density. The overlapping index between two successive windows is decided by the second, spatially relaxed hyperparameter, named the sliding index (slid_by). Fine tuning of these hyperparameters leads to the development of a generalized model; however, win_len must be derived as per the available primary memory. In this work, we use the Spark architecture to treat multiple mutually independent data windows in parallel and to train their corresponding models. Our results show that even with a smaller primary memory MOWM performs with minimal impact on accuracy; however, the performance degrades if we keep decreasing the window size.
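A minimal PySpark-style sketch of this parallel training step is shown below; the paper does not publish its Spark code, so `windows` (the overlapping chunks), `X_test`, and the averaging rule are assumptions carried over from the earlier sketches.

```python
import numpy as np
from pyspark.sql import SparkSession
from scipy.interpolate import Rbf

spark = SparkSession.builder.appName("mowm-sketch").getOrCreate()
sc = spark.sparkContext

# `windows` is a list of (X_window, y_window) chunks from the overlapping
# decomposition; `X_test` holds the points whose missing values are predicted.
bc_test = sc.broadcast(np.asarray(X_test))

def fit_and_predict(chunk):
    Xw, yw = np.asarray(chunk[0]), np.asarray(chunk[1])
    model = Rbf(*Xw.T, yw)                     # one small RBF surface per window
    lo, hi = Xw[:, 0].min(), Xw[:, 0].max()    # window extent along the sorted dimension
    return [(i, float(model(*x)))              # predict only where this window applies
            for i, x in enumerate(bc_test.value) if lo <= x[0] <= hi]

# Each window is fitted independently on an executor; only (index, value) pairs return.
pairs = sc.parallelize(windows, numSlices=len(windows)).flatMap(fit_and_predict).collect()

# Aggregate overlapping windows by averaging their predictions per test point.
votes = {}
for i, v in pairs:
    votes.setdefault(i, []).append(v)
y_pred = np.array([np.mean(votes[i]) for i in sorted(votes)])
```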
The proposed method assumes that all the data are available at each point in time and so has no provision for streaming data. The method uses a cross-validation technique for tuning the hyperparameters, which may prove to be a time-consuming job on Big data.

This method gives promising results on various datasets, but a theoretical analysis of the method, along with dynamic modeling for streaming data and enhancing its capacity for extrapolation, is left as future work.

Acknowledgement

This research is supported by the Ministry of Electronics and Information Technology, Govt. of India under the Visvesvaraya Ph.D. Scheme.
