
International Journal of Computer Techniques - Volume 3 Issue 1, Jan- Feb 2016

RESEARCH ARTICLE

OPEN ACCESS

Missing Value Imputation Based on MINNS-SVM


XueLing Chan¹, Yanjun Zhong², Yujuan Quan³
*(Computer Science, Jinan University, China)

----------------------------------------************************----------------------------------

Abstract:
Missing value processing is an unavoidable problem of data pre-processing in the field of machine
learning. Most traditional missing value imputation methods are based on probability distributions and the
like, which might not be suitable for high-dimensional data. Inspired by the many unique advantages of the
Support Vector Machine (SVM) in high-dimensional models, this paper proposes a missing value
imputation method based on nearest neighbors similarity and support vector machine (MINNS-SVM). Four
commonly used data sets from the UCI machine learning database are adopted in the experiment, with the
experimental results showing that MINNS-SVM is effective.
Keywords: missing value, data imputation, support vector machine.
----------------------------------------************************----------------------------------

I. INTRODUCTION

Many factors in real life lead to missing values, such as failed data acquisition in an experiment, loss of data during transmission, incomplete data because of unanswered questions in a survey, etc. Even in the UCI database, the most famous database in the field of machine learning, more than 40% of the data sets contain missing values [1]. Missing values are usually divided into two categories: missing completely at random (MCAR) and missing at random (MAR), where an MCAR missing value does not depend on any of the observed data, while MAR is just the opposite [2,3]. Since MAR meets the requirements of reality to the largest extent, most research on imputation algorithms is based on MAR data.

Faced with inevitable MAR missing values, existing research has proposed a large number of commonly used imputation methods, which can be divided into three categories: simple processing, probability theory imputation, and imputation based on neighbor similarity. Simple processing mainly includes the simple discarding method and the mean value imputation method. Neither can estimate whether the missing value will affect the results of machine learning, so simply discarding the missing value or substituting it with the mean is likely to affect the objectivity of the data and reduce the accuracy of machine learning outcomes [4,5]. Probability theory imputation includes the missing value imputation methods based on Bayesian probability and probability weighting [6,7]; according to the prior probability of the data distribution, Chiu and Sedransk put forward a Bayesian method for predicting missing values [8]. The probability imputation methods have good anti-interference capability. The principle of neighbor similarity imputation is that things of the same kind or of a similar nature have similar attributes, so a neighbor similarity-based imputation method can effectively complement the missing values.

As the most classic of the neighbor similarity-based imputation methods, the K-means clustering approach proposed by Amanda first finds the K samples nearest to the missing value, then takes a weighted average of the attribute corresponding to the missing value over these K samples, and uses the calculated result as the imputation of the missing value [9]. The difficulty of this method lies in selecting the similarity measure between samples and the value of K; different similarities and K values affect the accuracy of the results, and the selection of these two parameters is even more difficult in a high-dimensional model. The support vector machine exhibits many unique advantages in high-dimensional pattern recognition, and every item containing a real missing value has many attributes, forming a high-dimensional model. Therefore, MINNS-SVM (Missing value Imputation based on Nearest Neighbors Similarity and Support Vector Machine) is proposed in this paper.
In order to verify the effectiveness of the MINNS-SVM algorithm, the experimental investigation is conducted from two aspects. On the one hand, four complete data sets are chosen from the UCI database, a certain percentage of entries is randomly selected and treated as missing, and the MINNS-SVM algorithm is used for missing value imputation; the validity of the algorithm is verified by comparing the error between the original values and the imputed values. On the other hand, data sets in the field of machine learning usually need to be classified. In the experiment, commonly used classification algorithms of machine learning are applied to the complete data sets for verification, and then the data sets with missing values and the data sets with imputed values are classified with the identical classification algorithm to contrast the accuracy of classification. The experimental analysis shows that the values imputed by the MINNS-SVM algorithm have a better quality.

d = "*"

attributes. ij
indicates that the attributes of
this value is missing.
Dataset D is divided into two parts: D = M C
and C = (C1 , C2 ,...., Cn ) with complete attributes.
M = ( M 1 , M 2 ,...., M n ) is the dataset withmissing

attributes.
A. Configuration of SVM

Support vector machine is an algorithm of


machine learningbased on the Statistical Learning
Theory (SLT). Principle of this algorithm is to find
a hyperplane, so that points belonging to different
categories are located on both sides of the
hyperplanewith the largest empty region [10, 11].
1) Linear SVM

= "*"

In the dataset with missing value, ipi


. pi
represents the location of missing attribute in the
missing value, while the training sample set deletes
this attribute of the same category, shown as
follows:

(C * E ,T ) ,i = 1, 2,3,, k ,T {+1, 1}
'

0
E pi 1
II.
MINNS-SVM ALGORITHM DESIGN
E' =

MINNS-SVM algorithm is a mix of the ideas of


En pi
0
(1)
nearest neighbor similarity and the methods of
T
support vector machine classification. The dataset is
k isthe line width of matrix C . i isthe value of
divided into two setswith missing value and
(Ci * E ' ) + b = 0 ,
without missing value. The dataset without missing classification, so hyperplane is
value serves as the training set, and the one with which should satisfy the condition of
missing value serves as the test set, finally missing Ti (Ci * E ' ) + b 1,i = 1, 2,3,, k
. At this point,
value is imputed with the use of the ideas of nearest
2
neighbor similarity.
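As a concrete illustration of this split, the following is a minimal Python sketch (NumPy assumed; the data and variable names are illustrative), with np.nan playing the role of the "*" marker:

    import numpy as np

    # Toy data matrix D: rows are data items, columns are attributes; np.nan marks "*".
    D = np.array([
        [5.1, 3.5, 1.4, 0.2],
        [4.9, np.nan, 1.4, 0.2],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3.0, np.nan, 1.8],
    ])

    has_missing = np.isnan(D).any(axis=1)   # rows that contain at least one "*"
    C = D[~has_missing]                      # complete subset C
    M = D[has_missing]                       # subset M with missing attributes
    print(C.shape, M.shape)                  # (2, 4) (2, 4)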

A. Configuration of SVM

The support vector machine is a machine learning algorithm based on Statistical Learning Theory (SLT). The principle of this algorithm is to find a hyperplane such that points belonging to different categories lie on the two sides of the hyperplane with the largest empty region between them [10, 11].

1) Linear SVM

In the dataset with missing values, m_{i,p_i} = "*", where p_i denotes the location of the missing attribute within the missing value. The training sample set deletes this attribute from the samples of the same category, shown as follows:

    (Ci · E', Ti),  i = 1, 2, 3, ..., k,  Ti ∈ {+1, -1}

    E' = [ E_{p_i-1}     0      ]
         [     0         0      ]                                  (1)
         [     0      E_{n-p_i} ]

where E_r denotes the r × r identity matrix, so that Ci · E' is Ci with its p_i-th attribute removed; k is the number of rows of matrix C and Ti is the classification label. The separating hyperplane is ω · (Ci · E') + b = 0, which should satisfy the condition Ti[ω · (Ci · E') + b] ≥ 1, i = 1, 2, 3, ..., k. At this point the classification interval is 2/||ω||, so the problem of finding the optimal hyperplane is transformed into the constrained minimization:

    min (1/2)||ω||²
    subject to Ti[ω · (Ci · E') + b] - 1 ≥ 0,  i = 1, 2, ..., k        (2)

In order to solve this constrained optimization problem, the Lagrange function is introduced:

    L(ω, b, a) = (1/2)||ω||² - Σ_{i=1}^{n} ai ( Ti[ω · (Ci · E') + b] - 1 )        (3)


According to the Saddle Point Theorem, the result is a* = (a1*, a2*, ..., an*)^T and

    ω* = Σ_j aj* Tj (Cj · E')        (4)

where the subscript j ∈ { j | aj* > 0 }. Therefore, the optimal hyperplane is ω* · (C · E') + b* = 0, and the optimal classification function is:

    f(x) = sgn{ (ω* · x) + b* },  x ∈ M · E'        (5)
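To make the attribute-deletion step and the linear decision rule concrete, the following is a minimal sketch (assuming scikit-learn and NumPy; the data, labels and the index p_i are illustrative) that drops the missing attribute's column from the complete samples, which is exactly the effect of multiplying by E', and fits a linear SVM on the reduced data as in Eqs. (1)-(2):

    import numpy as np
    from sklearn.svm import SVC

    C = np.array([[5.1, 3.5, 1.4, 0.2],
                  [4.7, 3.2, 1.3, 0.2],
                  [6.2, 3.4, 5.4, 2.3],
                  [5.9, 3.0, 5.1, 1.8]])
    T = np.array([+1, +1, -1, -1])           # class labels T_i in {+1, -1}
    p_i = 2                                  # index of the missing attribute (0-based)

    C_reduced = np.delete(C, p_i, axis=1)    # same effect as C_i . E': drop the p_i-th column
    clf = SVC(kernel="linear").fit(C_reduced, T)
    print(clf.coef_, clf.intercept_)         # omega* and b* of the separating hyperplane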

2) Non-linear SVM

For the case of non-linear data, SVM maps the training set into a higher-dimensional vector space and solves the maximum-interval hyperplane in that space. In the corresponding nonlinear case, x is mapped into a high-dimensional space H:

    x → Φ(x) = (φ1(x), φ2(x), ..., φn(x))^T        (6)

The SVM decision function obtained by solving the above problem in the nonlinear situation is:

    f(x) = sgn( Σ_{i=1}^{n} ai Ti Φ(xi) · Φ(x) + b )        (7)

As seen from the above formulas, the objective function and the decision function only involve inner product operations between data, which is why the support vector machine is dominant in high-dimensional data calculation.
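As an illustration of Eq. (7), the sketch below (assuming scikit-learn; the data are toy values) evaluates the decision function by hand from the fitted dual coefficients ai·Ti, the support vectors and the bias, and checks the result against the library's own decision_function:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # a labelling that is not linearly separable

    clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

    def rbf(a, b, gamma=0.5):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    x_new = np.array([0.3, -0.7, 1.1])
    # Eq. (7): sum_i a_i T_i K(x_i, x) + b; dual_coef_ stores a_i T_i for the support vectors.
    f = sum(c * rbf(sv, x_new)
            for c, sv in zip(clf.dual_coef_[0], clf.support_vectors_)) + clf.intercept_[0]
    print(np.sign(f), clf.decision_function([x_new]))   # the manual value agrees with the library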

B. Configuration of MINNS-SVM

With the use of the nearest neighbors similarity idea, missing value imputation is carried out on the basis of the determined category of the missing value. Similarity here means that the specific values of relative indicators are put on a unified scale: using the principles of fuzzy comprehensive evaluation, the evaluation criteria are determined and an item's difference values between the standard value and those of the different indexes are obtained. Similarity measures the degree of resemblance between two data items and is usually calculated from the distance between them. The most common distance is the Euclidean distance:

    dist(Ci, Mi) = sqrt( Σ_{j=1}^{k} (cij - mij)² )        (8)

where k is the dimension of the data and cij, mij are the j-th components of Ci and Mi. The transformational relation between similarity and distance can then be expressed as:

    sim(Ci, Mi) = g( dist(Ci, Mi) ),  subject to Σ_{i=1}^{m_k} sim(Ci, Mi) = 1        (9)

where g is a one-to-one mapping from the distance dist(Ci, Mi) to the similarity sim(Ci, Mi), with an inverse mapping g⁻¹ such that dist(Ci, Mi) = g⁻¹( sim(Ci, Mi) ). Since the distance is the displacement of the vector from its starting point to its end point, the relationship between the distance dist(Ci, Mi) and the similarity sim(Ci, Mi) is monotonically decreasing.
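One simple choice of g that satisfies Eq. (9) is a normalized inverse distance. The sketch below (NumPy assumed; the particular mapping g is an illustrative choice, since the paper does not fix one) converts the distances between a missing item and its candidate neighbors into similarities that sum to 1 and decrease monotonically with distance:

    import numpy as np

    def similarities(neighbors, target):
        """Normalized inverse-distance similarities: larger distance -> smaller weight."""
        dists = np.sqrt(((neighbors - target) ** 2).sum(axis=1))   # Eq. (8)
        inv = 1.0 / (dists + 1e-12)                                 # guard against zero distance
        return inv / inv.sum()                                      # weights sum to 1, as in Eq. (9)

    C = np.array([[1.0, 2.0], [2.0, 2.5], [8.0, 9.0]])
    m = np.array([1.5, 2.0])
    print(similarities(C, m))    # the closest rows receive the largest similarity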
According to the nearest neighbors similarity idea, the imputed value is then obtained as:

    m_{i,p_i} = Σ_{i=1}^{m_k} sim(Ci, Mi) · c_{i,p_i} / m_k        (10)

C. MINNS-SVM algorithm flow

This algorithm combines the support vector machine algorithm with the nearest neighbors similarity idea. For missing value imputation, the first step is to scan the data set and separate the complete values from the missing values. Next, the records of the missing dataset are taken out one by one, and the location of the lost attribute in each acquired missing value is recorded. The complete dataset is then taken out, the lost attribute is extracted as the basis for classification, and the remaining data are used as the training set that is input into the support vector machine. The maximum-interval hyperplane is taken as the classification basis of the support vector machine, and only the inner product between data is used in the computation, so multidimensional data are fitted well. Then the similarity between the missing value and the similar values is calculated based on the classification result, and the imputed value is computed from the K neighbors obtained in accordance with the nearest neighbors similarity.

Core pseudo-code of the algorithm:

Inputs:
    X^original: the n × d dataset containing missing values
Outputs:
    X^predicted: the n × d dataset after imputation

Step 1: Separate the missing values from the dataset
    X^original = X^complete ∪ X^missing,  X^complete ∩ X^missing = ∅
    X^complete: n_c × d matrix without missing values
    X^missing: n_m × d matrix in which every row has a missing value

Step 2: Impute the missing value of each row based on nearest neighbors similarity and support vector machine
    for i = 1 to n_c
        X_i = i-th row of the X^complete matrix
        I_missing = location of the missing value in X^missing
        X^input_i = X_i without the attribute at location I_missing
        SVMtrain.add(X^input_i)
    end for
    for j = 1 to n_m
        X_j = j-th row of the X^missing matrix
        X^target_j = SVM.predict(X_j)
        for k = 1 to n_c
            if X^target_k = X^target_j
                Dis_jk = STD(X_j, X_k)
            end if
        end for
        Sumdis = Σ_k 1/Dis_jk
        X^predicted_j = Σ_k X_k[I_missing] / (Dis_jk · Sumdis)
    end for
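The following is a compact Python sketch of this flow (assuming scikit-learn and NumPy; using the data set's class column as the SVM target, an RBF kernel and normalized inverse-distance weights are illustrative choices consistent with the pseudo-code above, not the authors' exact implementation):

    import numpy as np
    from sklearn.svm import SVC

    def minns_svm_impute(X, y):
        """Impute np.nan entries of X: train an SVM on complete rows (with the missing
        columns removed), predict the category of each incomplete row, then impute with
        an inverse-distance weighted average over complete rows of the same category."""
        X = X.astype(float).copy()
        complete = ~np.isnan(X).any(axis=1)
        X_c, y_c = X[complete], y[complete]
        for j in np.where(~complete)[0]:
            row = X[j]
            miss = np.where(np.isnan(row))[0]                  # locations p_i of the lost attributes
            keep = np.setdiff1d(np.arange(X.shape[1]), miss)
            clf = SVC(kernel="rbf").fit(X_c[:, keep], y_c)     # training set without the lost attributes
            target = clf.predict(row[keep].reshape(1, -1))[0]  # X^target_j
            neigh = X_c[y_c == target]                         # complete rows of the same category
            d = np.sqrt(((neigh[:, keep] - row[keep]) ** 2).sum(axis=1)) + 1e-12
            w = (1.0 / d) / (1.0 / d).sum()                    # inverse-distance similarities
            X[j, miss] = w @ neigh[:, miss]                    # weighted imputation
        return X

    # Toy usage with a hypothetical class column y:
    X = np.array([[1.0, 2.0, 3.0], [1.1, np.nan, 3.2], [7.0, 8.0, 9.0], [7.2, 8.1, 9.1]])
    y = np.array([0, 0, 1, 1])
    print(minns_svm_impute(X, y))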
III. EXPERIMENT

In order to verify the effectiveness of the MINNS-SVM algorithm, two comparative experiments are designed in this paper. At the beginning of the experiment, a certain amount of data is randomly extracted from the complete data set and, assuming that one or more of its attribute values is missing, this data is regarded as the missing values. The classic k-means algorithm, which is also based on nearest neighbors similarity, and the Bayesian network algorithm, which is commonly used in probability theory, are selected for comparison.

The original values of the missing data are known, which allows a direct assessment of the effectiveness of the missing value imputation algorithm. One of the important indicators is the error between the value after imputation and the original value, so the most commonly used root-mean-square error is selected as the evaluation criterion for the first group of comparative experiments in this paper.

On the other hand, missing value imputation aims to improve the accuracy of machine learning, in which classification algorithms are extensively used, so the accuracy improvement achieved by a classification algorithm can be taken as an evaluation criterion. By classifying the data set after imputation and then comparing with the data without imputation, the ACC improvement (simple classification accuracy improvement) percentage can be calculated [12]:

    ACC = (1/n) Σ_{i=1}^{n} I( ŷ_i = y_i )        (11)

    ACC improvement (%) = 100 · ACC with imputation / ACC without imputation        (12)

where I(·) is a judgment function that returns 1 if the condition is true and 0 otherwise.

In order to fully test the effectiveness of the missing value imputation algorithm in this paper, five percent of the data is first extracted from each of the four data sets as missing values, and the missing ratio is then increased in five percent increments until it reaches 90%. For each data set, the missing values are selected at random, and the same experiment is repeated a thousand times, taking the average value as the result. The experimental hardware environment is an Intel(R) Core i3-3240 @ 3.4 GHz with 4 GB of memory; the software environment is the Windows 8 operating system.
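The two evaluation measures of Eqs. (11)-(12) amount to a ratio of classification accuracies. A minimal sketch of the computation (plain Python; the predictions are placeholders):

    def acc(y_true, y_pred):
        """Eq. (11): fraction of predictions for which the judgment I(.) returns 1."""
        return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

    def acc_improvement(y_true, pred_with_imputation, pred_without_imputation):
        """Eq. (12): ratio (x100) of the accuracy with imputed data to the accuracy without it."""
        return 100 * acc(y_true, pred_with_imputation) / acc(y_true, pred_without_imputation)

    y_true = [1, 1, -1, -1, 1, -1]
    print(acc_improvement(y_true, [1, 1, -1, -1, 1, 1], [1, -1, -1, 1, 1, 1]))   # about 166.7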

A. Datasets

Four real-life data sets are selected from the UCI machine learning library. In order to obtain accurate test data for comparison, the selected data sets are complete, without missing values, and a certain proportion of missing values is introduced at random in the experiment. The four data sets are the Iris Dataset, the Mushroom Dataset, the Steel Plates Faults Dataset and the Breast Cancer Wisconsin Dataset. Table 1 shows the basic information of the four data sets, all of which are selected from different perspectives of consideration. From the perspective of sample size, Iris and Breast Cancer Wisconsin are smaller, while Mushroom is larger; from the perspective of dimension, Iris is low-dimensional, while Mushroom, Steel Plates Faults and Breast Cancer Wisconsin are higher-dimensional. These data are collected from reality and are authentic.

TABLE 1 BASIC INFORMATION OF DATASETS

    Dataset                  | Sample number | Dimension
    Iris                     | 150           | 5
    Mushroom                 | 8124          | 22
    Steel Plates Faults      | 1941          | 27
    Breast Cancer Wisconsin  | 569           | 32

B. RMSE Assessment Results

Root-mean-square error formula:

    RMSE = sqrt( (1/N) Σ_{i=1}^{n} (y_i - ŷ_i)² )

where N = n - 1, n is the number of missing values, y_i is the original value, and ŷ_i is the imputed value.
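A direct transcription of this definition (NumPy assumed; note that the N = n - 1 denominator follows the paper's convention rather than the more common 1/n):

    import numpy as np

    def rmse(original, imputed):
        original, imputed = np.asarray(original, float), np.asarray(imputed, float)
        n = original.size                               # number of missing values
        return np.sqrt(((original - imputed) ** 2).sum() / (n - 1))

    print(rmse([3.0, 5.0, 4.0], [2.5, 5.5, 4.0]))       # 0.5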

[Fig. 1 Comparison chart of the root-mean-square error of the four data sets after imputation; panels: Iris Dataset, Mushroom Dataset, Steel Plates Faults Dataset, Breast Cancer Wisconsin Dataset]

It can be clearly seen from Fig. 1 that the algorithm proposed in this paper is better than the k-means and Bayesian missing value imputation in most cases, and the accuracy of the k-means algorithm is even unstable on some specific data sets, while the MINNS-SVM algorithm is stable. Particularly on the three high-dimensional datasets, the root-mean-square error of the MINNS-SVM algorithm is obviously smaller, indicating that MINNS-SVM is feasible and effective for missing value imputation.

C. Classification Algorithm Assessment Results

The K-Nearest Neighbor (K-NN) and Artificial Neural Network (ANN) classifiers, which are common in the field of machine learning, are adopted as the classification algorithms [13].

The core idea of the K-NN algorithm is to select the k nearest neighbors of a sample from the feature space; if a majority of the selected neighbors belong to a certain category, the sample is assigned to this category:

    ŷ = arg max_{c_j} Σ_{x_i ∈ N_k(x)} I( y_i = c_j )        (13)

ANN is one of the most widely used machine learning algorithms. Classification is determined based on the input attributes and their weights:

    y_j = f( Σ_i ω_ij x_i - θ_j ),  j = 1, 2, ..., n        (14)
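A minimal sketch of this evaluation step (assuming scikit-learn; the array X_imputed and labels y stand in for a data set after imputation) that classifies the data with both K-NN and an ANN and reports the accuracies:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier

    X_imputed, y = load_iris(return_X_y=True)   # placeholder for a data set after imputation

    knn = KNeighborsClassifier(n_neighbors=5)
    ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)

    print("K-NN accuracy:", cross_val_score(knn, X_imputed, y, cv=5).mean())
    print("ANN accuracy:", cross_val_score(ann, X_imputed, y, cv=5).mean())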
TABLE 2 ACC IMPROVEMENT (%) OF K-NN CLASSIFICATION ALGORITHM
(column heads give the missing ratio in %)

    DataSet                  Imputation     5     10    15    20    25    30    35    40    45    50
    Iris                     k-means       0.90  0.88  0.95  0.96  1.08  1.11  1.13  1.15  1.31  1.42
                             Bayesian      1.03  1.09  1.16  1.19  1.29  1.35  1.49  1.54  1.61  1.67
                             MINNS-SVM     1.03  1.10  1.17  1.25  1.29  1.42  1.51  1.61  1.75  1.88
    Mushroom                 k-means       0.83  0.91  0.92  0.92  0.95  1.01  1.21  1.31  1.38  1.42
                             Bayesian      1.02  1.04  1.07  1.09  1.13  1.16  1.21  1.26  1.32  1.39
                             MINNS-SVM     1.03  1.07  1.10  1.17  1.25  1.39  1.36  1.51  1.71  1.76
    Steel Plates Faults      k-means       0.81  0.82  0.82  0.85  0.82  0.85  0.92  0.90  0.93  0.92
                             Bayesian      1.00  1.02  1.04  1.03  1.05  1.06  1.09  1.15  1.14  1.20
                             MINNS-SVM     1.04  1.10  1.15  1.20  1.28  1.39  1.46  1.59  1.74  1.87
    Breast Cancer Wisconsin  k-means       1.04  1.08  1.14  1.19  1.25  1.31  1.40  1.46  1.57  1.72
                             Bayesian      0.95  0.92  0.95  0.98  1.02  1.10  1.18  1.32  1.31  1.42
                             MINNS-SVM     1.05  1.09  1.14  1.21  1.27  1.32  1.41  1.48  1.60  1.72

TABLE 3 ACC IMPROVEMENT (%) OF ANN CLASSIFICATION ALGORITHM
(column heads give the missing ratio in %)

    DataSet                  Imputation     5     10    15    20    25    30    35    40    45    50
    Iris                     k-means       0.92  0.88  0.92  0.96  1.02  1.11  1.16  1.18  1.33  1.52
                             Bayesian      1.04  1.10  1.16  1.25  1.32  1.40  1.45  1.61  1.78  1.88
                             MINNS-SVM     1.04  1.11  1.17  1.25  1.33  1.43  1.52  1.67  1.77  1.97
    Mushroom                 k-means       0.88  0.91  0.91  0.90  0.93  1.02  1.15  1.33  1.38  1.42
                             Bayesian      1.01  1.04  1.06  1.09  1.17  1.18  1.19  1.27  1.36  1.34
                             MINNS-SVM     1.04  1.08  1.13  1.19  1.25  1.34  1.42  1.53  1.63  1.79
    Steel Plates Faults      k-means       0.81  0.83  0.85  0.85  0.83  0.86  0.99  0.95  0.95  0.95
                             Bayesian      1.01  1.04  1.03  1.04  1.07  1.10  1.24  1.13  1.14  1.42
                             MINNS-SVM     1.04  1.09  1.15  1.20  1.31  1.40  1.41  1.57  1.70  1.79
    Breast Cancer Wisconsin  k-means       1.04  1.08  1.14  1.19  1.26  1.33  1.40  1.49  1.58  1.72
                             Bayesian      0.90  0.92  0.95  0.93  1.01  1.12  1.18  1.35  1.36  1.52
                             MINNS-SVM     1.05  1.08  1.14  1.20  1.26  1.33  1.41  1.52  1.60  1.71

As can be seen from Table 2 and Table 3, all the methods for missing value imputation can improve the accuracy of classification, and MINNS-SVM is better than the traditional k-means and Bayesian missing value imputation in most cases. On the low-dimensional Iris Dataset, the accuracy of MINNS-SVM is on average 5% higher than that of the k-means algorithm and 3% higher than that of the Bayesian algorithm, while the average improvement on the high-dimensional Mushroom, Steel Plates Faults and Breast Cancer Wisconsin datasets reaches 15%, indicating that the MINNS-SVM algorithm is effective for high-dimensional missing value imputation.

IV. CONCLUSIONS

Incomplete datasets are common in the practical application of machine learning, and data preprocessing is inseparable from the handling of missing values. Therefore, this paper proposes MINNS-SVM. In order to test the effectiveness of the proposed MINNS-SVM algorithm, four popular data sets from the UCI machine learning database are adopted, and MINNS-SVM is compared with the traditional k-means algorithm and the Bayesian algorithm in the experiment. The experimental results show that the error between the original values and the values imputed by MINNS-SVM is smaller. Compared to the data without imputation, the data after imputation achieve a higher classification success rate when used with classification algorithms in the field of machine learning, particularly for high-dimensional missing values.

This algorithm still has several issues to be studied further. First, the attributes of the data sets selected in the experiment are all numerical, so the algorithm in this paper cannot handle datasets with non-numeric attributes. Second, the use of the SVM algorithm significantly increases the running time of the algorithm on large data sets. Therefore, parallelization on distributed systems is considered in future research work to handle large amounts of data.

ACKNOWLEDGMENT

The work was supported by the Smart City Research Program through the Province Research Foundation of GuangDong, China (No. 2013B090500030).

REFERENCES
[1] P. J. García-Laencina, J.-L. Sancho-Gómez, A. R. Figueiras-Vidal, M. Verleysen. K nearest neighbours with mutual information for simultaneous classification and missing data imputation [J]. Neurocomputing, 2009, 72(7):1483-1493.
[2] P. R. Rosenbaum, D. B. Rubin. The central role of the propensity score in observational studies for causal effects [J]. Biometrika, 1983, 70(1):41-55.
[3] BU Fan-yu, CHEN Zhi-kui, ZHANG Qing-che. Incomplete big data imputation algorithm based on deep learning [J]. Microelectronics & Computer, 2014(12):173-176.
[4] R. J. Little, D. B. Rubin. Statistical Analysis with Missing Data. 2002.
[5] A. R. T. Donders, G. J. van der Heijden, T. Stijnen, K. G. Moons. A gentle introduction to imputation of missing values [J]. Journal of Clinical Epidemiology, 2006, 59(10):1087-1091.
[6] H. Chiu, J. Sedransk. A Bayesian procedure for imputing missing values in sample surveys [J]. Journal of the American Statistical Association, 1986, 81(395):667-676.

[7] S. Dey. Bayesian estimation of the parameter and reliability function of an inverse Rayleigh distribution [J]. Malaysian Journal of Mathematical Sciences, 2012.
[8] S. J. Rizvi, J. R. Haritsa. Maintaining data privacy in association rule mining: Proceedings of the 28th International Conference on Very Large Data Bases, 2002 [C]. VLDB Endowment: 682-693.
[9] A. N. Baraldi, C. K. Enders. An introduction to modern missing data analyses [J]. Journal of School Psychology, 2010, 48(1):5-37.
[10] DAI Liang, XU Hong-ke, CHEN Ting, QIAN Chao. Least squares support vector machine regression model based on MapReduce [J]. Application Research of Computers, 2014(12):1060-1064.
[11] DING Shi-fei, QI Bing-juan, TAN Hong-yan. An overview on theory and algorithm of support vector machines [J]. Journal of University of Electronic Science and Technology of China, 2011(01):2-10.
[12] P. Kang. Locally linear reconstruction based missing value imputation for supervised learning [J]. Neurocomputing, 2013, 118:65-78.
[13] T. Cover, P. Hart. Nearest neighbor pattern classification [J]. IEEE Transactions on Information Theory, 1967, 13(1):21-27.
