RESEARCH ARTICLE
OPEN ACCESS
----------------------------------------************************----------------------------------
Abstract:
Missing-value processing is an unavoidable part of data pre-processing in machine learning. Most traditional missing-value imputation methods are based on probability distributions and the like, which may not be suitable for high-dimensional data. Inspired by the unique advantages of the Support Vector Machine (SVM) on high-dimensional data, this paper proposes a Missing-value Imputation method based on Nearest-Neighbor Similarity and the support vector machine (MINNS-SVM). Four commonly used data sets from the UCI machine learning database are adopted in the experiments, and the experimental results show that MINNS-SVM is effective.
Keywords: missing value, data imputation, support vector machine.
----------------------------------------************************----------------------------------
I. INTRODUCTION
Many factors in real life lead to missing values: a failed data acquisition in an experiment, loss of data during transmission, incomplete data caused by unanswered survey questions, and so on. Even the UCI database, the most famous in the field of machine learning, has missing values in more than 40% of its data sets [1]. Missing values are usually divided into two categories, missing completely at random (MCAR) and missing at random (MAR): MCAR values do not depend on the observed data at all, while MAR is just the opposite [2,3]. Since MAR best matches real-world conditions, most research on imputation algorithms is based on MAR data.

Faced with inevitable MAR missing values, existing research has proposed a large number of commonly used imputation methods, which can be divided into three categories: simple processing, probability-theory imputation, and neighbor-similarity imputation. Simple processing mainly includes the simple discarding method and the mean-value imputation method. These two methods cannot estimate whether a missing value will affect the results of machine learning, so simply discarding the record or substituting the missing value by the mean is likely to harm the objectivity of the data and reduce the accuracy of machine-learning outcomes [4,5]. Probability-theory imputation includes the imputation methods based on Bayesian probability and probability weighting [6,7]; based on the a priori probability of the data distribution, Chiu and Sedransk put forward a Bayesian method for predicting missing values [8]. Probability imputation has good anti-jamming capability. Neighbor-similarity imputation rests on the principle that things of the same kind, or of a similar nature, are similar in their attributes, so a neighbor-similarity-based method can effectively complete missing values.

The classic neighbor-similarity imputation method is the K-means clustering approach proposed by Amanda: first find the K samples nearest to the record with the missing value, then take a weighted average of the corresponding attribute over those K samples, and use the result as the imputed value [9]. The difficulty of this method lies in selecting the similarity measure and the value of K; different choices affect the accuracy of the results, and both are harder to select in the high-dimensional setting. The support vector machine, by contrast, exhibits many unique advantages in high-dimensional pattern recognition, and every real-world record with missing values has many attributes, so this paper combines the nearest-neighbor similarity idea with the SVM and proposes the MINNS-SVM imputation algorithm.
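The neighbor-similarity imputation principle described above can be sketched as a minimal K-nearest-neighbor imputer. The helper names and the inverse-distance weighting are illustrative assumptions, not the exact formulation of [9]:

```python
import math

def knn_impute(incomplete, complete, k=3):
    """Impute the missing entry (None) of `incomplete` from the k most
    similar complete records, weighting each neighbor by inverse distance.
    A minimal illustration of neighbor-similarity imputation."""
    miss = incomplete.index(None)  # position of the missing attribute
    obs = [j for j in range(len(incomplete)) if j != miss]

    def dist(row):  # Euclidean distance over the observed attributes only
        return math.sqrt(sum((incomplete[j] - row[j]) ** 2 for j in obs))

    neighbors = sorted(complete, key=dist)[:k]
    # Inverse-distance weights (small epsilon avoids division by zero)
    w = [1.0 / (dist(r) + 1e-9) for r in neighbors]
    return sum(wi * r[miss] for wi, r in zip(w, neighbors)) / sum(w)

data = [[5.1, 3.5, 1.4], [4.9, 3.0, 1.3], [6.3, 2.9, 5.6], [5.0, 3.4, 1.5]]
print(round(knn_impute([5.0, 3.4, None], data, k=2), 2))  # -> 1.5
```

As the introduction notes, the result depends directly on the choice of distance and of k, which is exactly the sensitivity that motivates the SVM-based approach below.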
ISSN: 2394-2231
http://www.ijctjournal.org
Page 6
II.
MINNS-SVM ALGORITHM DESIGN

The data set is written as an m × n matrix

    D = (D1, D2, ..., Dn) = (d_ij), i = 1, ..., m; j = 1, ..., n    (1)

where D_i = (d_i1, d_i2, ..., d_in) represents the i-th data record, D_j = (d_1j, d_2j, ..., d_mj)^T is the j-th attribute, and d_ij is the value of the j-th attribute of the i-th record; d_ij = "*" indicates that this attribute value is missing.

The data set D is divided into two parts, D = M ∪ C: C = (C1, C2, ..., Cn) is the subset of records with complete attributes, and M = (M1, M2, ..., Mn) is the subset of records with missing attributes.

A. Configuration of SVM

1) Linear SVM
The training set is built from the complete records as (C_i ∘ E', T_i), i = 1, 2, ..., k, T_i ∈ {+1, −1}, where E' is a mask vector whose component E'_p is 0 at the position of the missing attribute and 1 elsewhere, and ∘ denotes the element-wise product. The separating hyperplane is obtained from the constrained optimization:

    min (1/2)‖ω‖²  subject to  T_i (ω · (C_i ∘ E') + b) ≥ 1, i = 1, ..., k    (2)

In order to solve this constrained optimization problem, the Lagrange function is introduced:

    L(ω, b, a) = (1/2)‖ω‖² − Σ_{i=1}^{n} a_i [ T_i (ω · (C_i ∘ E') + b) − 1 ]    (3)
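The mask vector E' projects every record onto the attributes that the incomplete record actually has, so complete and incomplete records are compared in the same subspace. A small sketch of this masking step (helper names are illustrative):

```python
def mask_vector(record, missing_marker="*"):
    """Build E': 1 where the attribute is present, 0 where it is missing."""
    return [0 if v == missing_marker else 1 for v in record]

def apply_mask(record, e):
    """Element-wise product C_i * E': masked positions are zeroed out."""
    return [v * ei for v, ei in zip(record, e)]

m = [5.1, "*", 1.4]           # record with a missing second attribute
e = mask_vector(m)            # [1, 0, 1]
c = [4.9, 3.0, 1.3]           # a complete record
print(apply_mask(c, e))       # -> [4.9, 0.0, 1.3]
```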
Setting the partial derivatives of L with respect to ω and b to zero turns (3) into the dual optimization over the Lagrange multipliers:

    max Σ_{j=1}^{n} a_j − (1/2) Σ_{i,j} a_i a_j T_i T_j (C_i ∘ E') · (C_j ∘ E')  subject to  Σ_j a_j T_j = 0, a_j ≥ 0    (4)

The optimal weight vector is ω* = Σ_j a*_j T_j (C_j ∘ E'), where the subscript j runs over the support vectors (a*_j > 0). The optimal hyperplane is ω* · (C ∘ E') + b* = 0, and the decision function is

    f(x) = sgn(ω* · x + b*),  x ∈ M ∘ E'    (5)

2) Non-linear SVM
In the non-linear case the data are first mapped into a feature space,

    x → φ(x) = (φ_1(x), φ_2(x), ..., φ_n(x))^T    (6)

and the SVM decision function obtained by solving the above optimization in the non-linear situation is

    f(x) = sgn( Σ_{i=1}^{n} a_i T_i φ(x_i) · φ(x) + b )    (7)

As can be seen from the formulas above, the objective function and the decision function depend on the data only through inner products, so the support vector machine is well suited to high-dimensional data.

B. Configuration of MINNS-SVM
With the use of the nearest-neighbor similarity idea, missing-value imputation is carried out once the category of the record with the missing value has been determined. Similarity here means that the relative indicators are put on a unified scale, using the principles of fuzzy comprehensive evaluation to determine the evaluation criteria and to derive an item's difference values between the standard value and those of different indexes. The similarity measures the degree of resemblance of two data records and is usually computed from the distance between them. The most common distance is the Euclidean distance:

    dist(C_i, M_i) = √( Σ_{j=1}^{k} (c_ij − m_ij)² )    (8)

where k represents the data dimension and c_ij, m_ij are the j-th components of C_i and M_i.

The transformational relation between similarity and distance can therefore be expressed as:

    sim(C_i, M_i) = g( dist(C_i, M_i) ),  dist(C_i, M_i) = g⁻¹( sim(C_i, M_i) )    (9)

where g is a one-to-one, monotonically decreasing mapping. The similarities of the selected neighbors are normalized subject to Σ_{i=1}^{m_k} sim(C_i, M_i) = 1, where m_k is the number of nearest neighbors. According to the nearest-neighbor similarity idea, the imputed value m_p of the missing attribute p is then the similarity-weighted combination of the corresponding attribute values of the m_k nearest complete records:

    m_p = Σ_{i=1}^{m_k} sim(C_i, M_i) · c_ip    (10)

C. MINNS-SVM algorithm flow
This algorithm combines the support vector machine with the nearest-neighbor similarity idea. For missing-value imputation, the first step is to scan the data set and separate the complete records from the records with missing values. Next, records are taken from the missing data set one by one, and the locations of the lost attributes are recorded. The complete data set is then taken, the attribute at the lost position is extracted as the classification target, and the remaining data are used as a training set for the support vector machine. The maximum-interval hyperplane serves as the classification basis of the SVM, and only inner products between data are used in the computation, so high-dimensional data are fitted well. Finally, the similarity between the missing record and the similar records in the predicted class is calculated, and the imputed value is computed from the K nearest neighbors obtained in accordance with the nearest-neighbor similarity.
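The distance-to-similarity conversion and the normalized similarity weights can be sketched as follows. The Gaussian choice of g is an assumption made for illustration; the paper only requires g to be one-to-one and monotonically decreasing:

```python
import math

def euclidean(c, m, observed):
    # Eq. (8): distance over the observed components only
    return math.sqrt(sum((c[j] - m[j]) ** 2 for j in observed))

def similarities(neighbors, m, observed):
    """Map distances to similarities with a decreasing g (Gaussian here,
    one admissible choice), then normalize so the weights sum to 1."""
    raw = [math.exp(-euclidean(c, m, observed)) for c in neighbors]
    s = sum(raw)
    return [r / s for r in raw]

def impute(neighbors, m, observed, p):
    # Eq. (10): similarity-weighted average of attribute p of the neighbors
    w = similarities(neighbors, m, observed)
    return sum(wi * c[p] for wi, c in zip(w, neighbors))

neighbors = [[1.0, 2.0, 3.0], [1.2, 2.1, 3.4]]  # complete records C_i
missing = [1.1, 2.0, None]                      # third attribute is missing
print(round(impute(neighbors, missing, [0, 1], 2), 3))  # -> 3.196
```

Because the weights are normalized, the imputed value always lies between the smallest and largest neighbor value of that attribute.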
Core pseudo-code of the algorithm:

Inputs:
    X^original: the n × d data set containing missing values
Outputs:
    X^predicted: the n × d data set after imputation
Step 1: Separate the missing values from the data set:
    X^original = X^complete ∪ X^missing, X^complete ∩ X^missing = ∅
    X^complete: n_c × d matrix without missing values
    X^missing: n_m × d matrix in which each row has a missing value
Step 2: Impute the missing value of each row based on nearest-neighbor similarity and the support vector machine:
    for i = 1 to n_c
        X_i = i-th row of the X^complete matrix
        I_missing = location of the missing value in X^missing
        Xinput_i = X_i without the attribute at location I_missing
        SVMtrain.add(Xinput_i)
    end for
    for j = 1 to n_m
        X_j = j-th row of the X^missing matrix
        Xtarget_j = SVM.predict(X_j)
        for k = 1 to n_c
            if Xtarget_k = Xtarget_j
                Dis_jk = STD(X_j, X_k)
            end if
        end for
        Sumdis = Σ_k 1/Dis_jk
        X^predicted_j = Σ_k X_k[I_missing] / (Dis_jk · Sumdis)
    end for

III.
EXPERIMENT

In order to verify the effectiveness of the MINNS-SVM algorithm, two comparative experiments are designed in this paper. At the beginning of each experiment, a certain amount of data is randomly extracted from the complete data set, and one or more attribute values of each extracted record are treated as missing. The classic k-means algorithm, which is also based on nearest-neighbor similarity, and the Bayesian network algorithm, which is commonly used in probability-theory imputation, are selected for comparison.

Because the original values of the missing data are known, the effectiveness of a missing-value imputation algorithm can be assessed directly. One important indicator is the error between the imputed value and the original value, so the most commonly used root-mean-square error is selected as the evaluation criterion for the first group of comparative experiments in this paper.

On the other hand, missing-value imputation aims to improve the accuracy of machine learning, in which classification algorithms are extensively used, so the accuracy improvement achieved by a classification algorithm can also be taken as an evaluation criterion. By classifying the data set after imputation and comparing against the data set without imputation, the ACC improvement (simple classification accuracy improvement) percentage can be calculated [12]:

    ACC = (1/N) Σ_{i=1}^{N} I( f(x_i) = y_i )    (11)

    ΔACC = (ACC_imputed − ACC_original) / ACC_original × 100%    (12)

where I(·) is a judgment function that returns 1 if its condition is true and 0 otherwise, f(x_i) is the predicted label of sample x_i, and y_i is its true label.

In order to fully test the effectiveness of the imputation algorithm proposed in this paper, five percent of the data is first extracted from each of the four data sets as missing values, and the missing ratio is then increased in five-percent increments until it reaches 90%. For each data set, the missing values are selected at random, and the same experiment is repeated a thousand times, taking the average as the result. The experimental hardware environment is an Intel(R) Core i3-3240 @ 3.4 GHz with 4 GB of memory; the software environment is the Windows 8 operating system.
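The MINNS-SVM flow (classify the incomplete record first, then impute from same-class neighbors) can be rendered as a small runnable sketch. A nearest-centroid classifier stands in for the SVM step here, purely to keep the example dependency-free; the paper itself trains a support vector machine:

```python
import math

def dist(a, b, idx):
    # Euclidean distance restricted to the observed attribute positions
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in idx))

def impute_row(row, complete, labels):
    miss = row.index(None)                       # I_missing in the pseudo-code
    obs = [j for j in range(len(row)) if j != miss]
    # Group the complete records by class and form class centroids.
    groups = {}
    for r, lab in zip(complete, labels):
        groups.setdefault(lab, []).append(r)
    centroids = {lab: [sum(r[j] for r in rs) / len(rs) for j in range(len(row))]
                 for lab, rs in groups.items()}
    # "Xtarget_j = SVM.predict(X_j)" -- nearest centroid as a stand-in here.
    target = min(centroids, key=lambda lab: dist(row, centroids[lab], obs))
    # Similarity-weighted imputation from the records of the predicted class.
    same = groups[target]
    w = [1.0 / (dist(row, r, obs) + 1e-9) for r in same]
    return sum(wi * r[miss] for wi, r in zip(w, same)) / sum(w)

complete = [[1.0, 1.1, 1.0], [1.1, 0.9, 1.2], [5.0, 5.1, 5.0], [5.2, 4.9, 5.1]]
labels = [0, 0, 1, 1]
print(round(impute_row([1.05, 1.0, None], complete, labels), 2))  # -> 1.1
```

Restricting the weighted average to the predicted class is what distinguishes this flow from plain KNN imputation over the whole data set.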
For the first group of experiments the root-mean-square error is computed as

    RMSE = √( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² )

where N = n − 1, n is the number of missing values, y_i is the original value, and ŷ_i is the imputed value.

A. Datasets
The dimension of Iris is low, while the dimensions of Mushroom, Steel Plates Faults and Breast Cancer Wisconsin are higher. These data are collected from reality and are authentic.

Iris Dataset    Mushroom Dataset
Fig.1 Comparison chart of the root-mean-square error of the four data sets after imputation
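The two evaluation criteria can be computed as follows. The relative form of the accuracy improvement is an assumption made for illustration, as one plausible reading of the improvement percentage:

```python
import math

def rmse(original, imputed):
    """Root-mean-square error between original and imputed values.
    The paper's convention: N = n - 1, where n is the number of
    missing values (so at least two values are assumed)."""
    n = len(original)
    se = sum((y - yhat) ** 2 for y, yhat in zip(original, imputed))
    return math.sqrt(se / (n - 1))

def acc(pred, truth):
    # Eq. (11): mean of the judgment function I(pred == truth)
    return sum(int(p == t) for p, t in zip(pred, truth)) / len(pred)

def acc_improvement(acc_imputed, acc_original):
    # Eq. (12), read as relative accuracy improvement (an assumption)
    return (acc_imputed - acc_original) / acc_original
```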
TABLE 2 (columns: missing ratio, %)

DataSet                 | Imputation |  5   | 10   | 15   | 20   | 25   | 30   | 35   | 40   | 45   | 50
iris                    | k-means    | 0.90 | 0.88 | 0.95 | 0.96 | 1.08 | 1.11 | 1.13 | 1.15 | 1.31 | 1.42
iris                    | Bayesian   | 1.03 | 1.09 | 1.16 | 1.19 | 1.29 | 1.35 | 1.49 | 1.54 | 1.61 | 1.67
iris                    | MINNS-SVM  | 1.03 | 1.10 | 1.17 | 1.25 | 1.29 | 1.42 | 1.51 | 1.61 | 1.75 | 1.88
Mushroom                | k-means    | 0.83 | 0.91 | 0.92 | 0.92 | 0.95 | 1.01 | 1.21 | 1.31 | 1.38 | 1.42
Mushroom                | Bayesian   | 1.02 | 1.04 | 1.07 | 1.09 | 1.13 | 1.16 | 1.21 | 1.26 | 1.32 | 1.39
Mushroom                | MINNS-SVM  | 1.03 | 1.07 | 1.10 | 1.17 | 1.25 | 1.39 | 1.36 | 1.51 | 1.71 | 1.76
Steel Plates Faults     | k-means    | 0.81 | 0.82 | 0.82 | 0.85 | 0.82 | 0.85 | 0.92 | 0.90 | 0.93 | 0.92
Steel Plates Faults     | Bayesian   | 1.00 | 1.02 | 1.04 | 1.03 | 1.05 | 1.06 | 1.09 | 1.15 | 1.14 | 1.20
Steel Plates Faults     | MINNS-SVM  | 1.04 | 1.10 | 1.15 | 1.20 | 1.28 | 1.39 | 1.46 | 1.59 | 1.74 | 1.87
Breast Cancer Wisconsin | k-means    | 1.04 | 1.08 | 1.14 | 1.19 | 1.25 | 1.31 | 1.40 | 1.46 | 1.57 | 1.72
Breast Cancer Wisconsin | Bayesian   | 0.95 | 0.92 | 0.95 | 0.98 | 1.02 | 1.10 | 1.18 | 1.32 | 1.31 | 1.42
Breast Cancer Wisconsin | MINNS-SVM  | 1.05 | 1.09 | 1.14 | 1.21 | 1.27 | 1.32 | 1.41 | 1.48 | 1.60 | 1.72
TABLE 3 (columns: missing ratio, %)

DataSet                 | Imputation |  5   | 10   | 15   | 20   | 25   | 30   | 35   | 40   | 45   | 50
iris                    | k-means    | 0.92 | 0.88 | 0.92 | 0.96 | 1.02 | 1.11 | 1.16 | 1.18 | 1.33 | 1.52
iris                    | Bayesian   | 1.04 | 1.10 | 1.16 | 1.25 | 1.32 | 1.40 | 1.45 | 1.61 | 1.78 | 1.88
iris                    | MINNS-SVM  | 1.04 | 1.11 | 1.17 | 1.25 | 1.33 | 1.43 | 1.52 | 1.67 | 1.77 | 1.97
Mushroom                | k-means    | 0.88 | 0.91 | 0.91 | 0.90 | 0.93 | 1.02 | 1.15 | 1.33 | 1.38 | 1.42
Mushroom                | Bayesian   | 1.01 | 1.04 | 1.06 | 1.09 | 1.17 | 1.18 | 1.19 | 1.27 | 1.36 | 1.34
Mushroom                | MINNS-SVM  | 1.04 | 1.08 | 1.13 | 1.19 | 1.25 | 1.34 | 1.42 | 1.53 | 1.63 | 1.79
Steel Plates Faults     | k-means    | 0.81 | 0.83 | 0.85 | 0.85 | 0.83 | 0.86 | 0.99 | 0.95 | 0.95 | 0.95
Steel Plates Faults     | Bayesian   | 1.01 | 1.04 | 1.03 | 1.04 | 1.07 | 1.10 | 1.24 | 1.13 | 1.14 | 1.42
Steel Plates Faults     | MINNS-SVM  | 1.04 | 1.09 | 1.15 | 1.20 | 1.31 | 1.40 | 1.41 | 1.57 | 1.70 | 1.79
Breast Cancer Wisconsin | k-means    | 1.04 | 1.08 | 1.14 | 1.19 | 1.26 | 1.33 | 1.40 | 1.49 | 1.58 | 1.60
Breast Cancer Wisconsin | Bayesian   | 0.90 | 0.92 | 0.95 | 0.93 | 1.01 | 1.12 | 1.18 | 1.35 | 1.36 | 1.52
Breast Cancer Wisconsin | MINNS-SVM  | 1.05 | 1.08 | 1.14 | 1.20 | 1.26 | 1.33 | 1.41 | 1.52 | 1.60 | 1.71
As can be seen from Table 2 and Table 3, all of the missing-value imputation methods improve the accuracy of classification, and MINNS-SVM is better than the traditional k-means and Bayesian imputation in most cases. On the low-dimensional Iris data set, MINNS-SVM improves on the k-means algorithm by 5% on average and on the Bayesian algorithm by 3%. On the high-dimensional data sets Mushroom, Steel Plates Faults and Breast Cancer Wisconsin, however, the average improvement reaches 15%, indicating that the MINNS-SVM algorithm is effective for high-dimensional missing-value imputation.

IV.
CONCLUSIONS

Incomplete data sets are common in practical applications of machine learning, and data preprocessing is inseparable from the handling of missing values. This paper therefore proposes MINNS-SVM. To test the effectiveness of the proposed MINNS-SVM algorithm, four popular data sets from the UCI machine learning database are adopted, and MINNS-SVM is compared with the traditional k-means algorithm and the Bayesian algorithm in the experiments. The experimental results show that the error between the original values and the values imputed by MINNS-SVM is smaller, and that, compared with the data without imputation, the imputed data achieve a higher classification success rate when used with a classification algorithm, particularly for high-dimensional missing values.

This algorithm still leaves several issues to be studied further. First, the attributes of the data sets selected in the experiments are entirely numerical, so the algorithm in this paper cannot handle data sets with non-numeric attributes. Second, the SVM significantly increases the running time of the algorithm on large data sets; therefore, in future research work, parallelization on distributed systems is considered for handling large amounts of data.

ACKNOWLEDGMENT
The work was supported by the research program funded by the Smart City Research Program through the Province Research Foundation of GuangDong, China (No. 2013B090500030).

REFERENCES
[1] P. J. García-Laencina, J.-L. Sancho-Gómez, A. R. Figueiras-Vidal, M. Verleysen. K nearest neighbours with mutual information for simultaneous classification and missing data imputation. Neurocomputing, 2009, 72(7):1483-1493.
[2] P. R. Rosenbaum, D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 1983, 70(1):41-55.
[3] BU Fan-yu, CHEN Zhi-kui, ZHANG Qing-che. Incomplete big data imputation algorithm based on deep learning. Microelectronics & Computer, 2014(12):173-176.
[4] R. J. Little, D. B. Rubin. Statistical Analysis with Missing Data. 2002.
[5] A. R. T. Donders, G. J. van der Heijden, T. Stijnen, K. G. Moons. A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 2006, 59(10):1087-1091.
[6] H. Chiu, J. Sedransk. A Bayesian procedure for imputing missing values in sample surveys. Journal of the American Statistical Association, 1986, 81(395):667-676.
[8]
[9]
[10]
[11]
[12]
[13]