Chang-Hwan Lee
www.elsevier.com
PII: S0169-023X(16)30127-6
DOI: http://dx.doi.org/10.1016/j.datak.2017.11.002
Reference: DATAK1623
To appear in: Data & Knowledge Engineering
Received date: 2 August 2016
Revised date: 1 November 2017
Accepted date: 11 November 2017
Cite this article as: Chang-Hwan Lee, An Information-Theoretic Filter Approach
for Value Weighted Classification Learning in Naive Bayes, Data & Knowledge
Engineering, http://dx.doi.org/10.1016/j.datak.2017.11.002
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
An Information-Theoretic Filter Approach for Value
Weighted Classification Learning in Naive Bayes
Chang-Hwan Lee
Department of Information and Communications
Dongguk University, Seoul, Korea
Abstract
Assigning weights to features has been an important topic in some classification
learning algorithms. In this paper, we propose a new paradigm for assigning
weights in classification learning, called the value weighting method. While
current weighting methods assign a weight to each feature, we assign a different
weight to each value of a feature. The performance of naive Bayes learning
with the value weighting method is compared with that of other traditional
methods on a number of datasets. The experimental results show that the value
weighting method can improve the performance of naive Bayes significantly.
Keywords: Feature Weighting, Feature Selection, Naive Bayes,
Kullback-Leibler
1. Introduction
In some classifiers, the algorithms operate under the implicit assumption that
all features are of equal value as far as the classification problem is concerned.
However, when irrelevant and noisy features influence the learning task to the
same degree as highly relevant features, the accuracy of the model is likely to
deteriorate. Since the assumption that all features are equally important hardly
holds true in real-world applications, there have been attempts to relax
this assumption in classification. Zheng and Webb [1] provide a comprehensive
overview of work in this area. The first approach for relaxing this assumption is
to combine feature subset selection with classification learning. It is to combine
a learning method with a preprocessing step that eliminates redundant features
from the data. Feature selection methods usually adopt a heuristic search in
the space of feature subsets. Since the number of distinct feature subsets grows
exponentially, it is not reasonable to do an exhaustive search to find optimal
feature subsets. In the literature, it is known that the predictive accuracy
of naive Bayes can be improved by removing redundant or highly correlated
discusses the mechanisms of the new value weighting method. Section V shows
the experimental results of the proposed method, and Section VI summarizes
the contributions made in this paper.
2. Related Work
method, the hill climbing method, the Markov Chain Monte Carlo method, and
combinations of these methods. The performance of the methods is measured
using AUC, and the weighted naive Bayes is claimed to produce accurate rankings
that outperform naive Bayes.
Cardie [13] used an information gain measure based on the position of a feature in
a decision tree to derive feature weights for a nearest neighbor algorithm. They
designed case-based learning algorithms to improve the performance of minority
class predictions. They used a local weighting method, where weights are derived
for each test instance based on the path it takes in the tree.
Gartner [5] employs feature weighting performed by an SVM. The algorithm
looks for an optimal hyperplane that separates the two classes in a given space, and
the weights determining the hyperplane can be interpreted as feature weights
in the naive Bayes classifier. The weights are optimized such that the danger
of overfitting is reduced. The method solves binary classification problems, and
the feature weights are based on conditional independence. They showed that the
method compares favorably to state-of-the-art machine learning approaches.
Zaidi et al. [14] proposed a weighted naive Bayes algorithm, called WANBIA,
that selects weights to minimize either the negative conditional log likelihood or
the mean squared error objective functions. They perform numerous evaluations
and find that WANBIA is a competitive alternative to state of the art classifiers
like Random Forest, Logistic Regression and A1DE.
Taheri et al. [15] propose a novel attribute-weighted naive Bayes classifier
by assigning weights to the conditional probabilities. An objective function,
based on the structure of the naive Bayes classifier and the attribute weights,
is modeled and optimized. The optimal weights are determined
by a local optimization method using the quasi-secant method.
The idea of value-based weights was first proposed in Lee [12]. He applied a
gradient approach for calculating value weights, and showed that by assigning
different weights to feature values we can improve the performance of
classification learning. In this paper, we employ a static filter approach for
calculating value weights in naive Bayes learning, based on the Kullback-Leibler
measure combined with a reliability measure.
3. Background
dent given the class value, the classification on d is defined as follows:

$$V_{NB}(d) = \operatorname{argmax}_{c} P(c) \prod_{a_{ij} \in d} P(a_{ij}|c)$$

$$V_{FWNB}(d) = \operatorname{argmax}_{c} P(c) \prod_{a_{ij} \in d} P(a_{ij}|c)^{w_i} \qquad (1)$$
Meanwhile, observing the value male does not change the target distribution
much. The majority of the significance of the Gender feature arises when its
value is female. However, by assigning the same weight to every value of a
feature, we are unable to discriminate the embedded significance among the
feature values.
The proposed value weighting method is more fine-grained and provides a
new weighting paradigm in classification learning. The basic assumption be-
hind the value weighting method is that each feature value has different sig-
nificance with respect to class value. When we say a certain feature is impor-
tant/significant, we think the importance of the feature can be decomposed. As
we have seen in Example 1, feature values are not equally important. Some
feature values are more important than other feature values. In the case of
Example 1, we can see that the value of female is much more important than
that of male with respect to the target feature (pregnancy-related disease). If
we assign the same weight to each feature value, we lose the capability to
discriminate the predictive power residing in individual feature values.
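This asymmetry can be made concrete with the KL measure introduced later in Section 4. The numbers below are hypothetical, not from the paper; they only illustrate that two values of the same feature can carry very different amounts of information about the class.

```python
import math

# Hypothetical class distributions for Example 1 (numbers are made up):
# the target is a pregnancy-related disease and the population is mostly male.
prior = {"disease": 0.059, "healthy": 0.941}            # P(c)
posterior = {                                           # P(c | Gender = v)
    "female": {"disease": 0.50, "healthy": 0.50},       # strong shift from prior
    "male":   {"disease": 0.01, "healthy": 0.99},       # barely moves the prior
}

def kl(post):
    """KL(C|a) = sum_c P(c|a) * log(P(c|a) / P(c))."""
    return sum(p * math.log(p / prior[c]) for c, p in post.items())

kl_female, kl_male = kl(posterior["female"]), kl(posterior["male"])
# observing 'female' is far more informative about the class than 'male'
```

Under a feature-level weighting scheme, both values would be forced to share a single weight, averaging away exactly this difference.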
In this paper, we consider a problem in which weights are assigned in a more
fine-grained way. Unlike feature weighting methods, we assign a different weight
to each feature value. We call this method the value weighting method. The
formula of the value weighting method is defined as follows:

$$V_{VWNB}(d) = \operatorname{argmax}_{c} P(c) \prod_{a_{ij} \in d} P(a_{ij}|c)^{w_{ij}} \qquad (3)$$

where w_ij ∈ R represents the weight of feature value a_ij. Each feature value
is assigned a different weight, and w_ij can be any real value, representing the
significance of feature value a_ij.
If a problem is linear, it is best to use a simpler linear classifier. However,
if a problem is non-linear and its class boundaries cannot be approximated well
with linear hyperplanes, then non-linear classifiers are often more accurate than
linear classifiers.
Lemma 2: Suppose $u_{ij} = \log \frac{p(x_{ij}=1|y=1)}{p(x_{ij}=1|y=-1)}$ and $v_{ij} = \log \frac{p(x_{ij}=0|y=1)}{p(x_{ij}=0|y=-1)}$, respectively.
The value weighted naive Bayesian classifier in Equation 3 is represented
as follows.

$$\hat{y} = \operatorname{sign}\Big( \sum_i w_{ij}(u_{ij} - v_{ij})x_{ij} + \sum_i w_{ij}v_{ij} + \log \frac{p(y=1)}{p(y=-1)} \Big). \qquad (4)$$
The proof of Lemma 2 is very similar to that of Lemma 1, and is thus omitted.
In Lemma 2, the w_i which determines the degree of slope for the linear
classifier is changed to w_ij. This difference is subtle but critical. The value
of w_ij changes depending on the value of x_ij, which makes the shape of the class
boundary non-linear. The value weighted naive Bayesian classifier is no longer
represented in the form of $\hat{y} = \operatorname{sign}(\sum_i a_i x_i)$. The value weighted naive Bayes
classifier has more expressive power and the capability of finding non-linear
class boundaries.
Table 1: Legend of symbols
A feature takes on a number of discrete values, and each feature value has
a different importance with respect to the target value. In this paper, we propose
a new method for calculating the weights of feature values using a filter approach.
We use an information-theoretic filter method for assigning weights to feature
values; we choose the information-theoretic method because it has a strong
theoretical foundation for deciding and calculating weight values.
Many different symbols and variables are defined in this section; to
make the paper easier to read, the symbols/variables used in Section 4
are explained in Table 1.
4.1. KL measure
The basic assumption of the value weighting method is that when a certain
feature value is observed, it gives a certain amount of information to the target
feature. The more information a feature value provides to the target feature,
the more important the feature value becomes. The critical part now is how to
define or select a proper measure which can correctly measure the amount of
information.
The first candidate for this purpose is information gain. Information
gain is a widely used method for calculating the importance of features,
most notably in decision tree algorithms. It is quite an intuitive argument that a
feature with higher information gain deserves a higher weight. Quinlan proposed
the classification algorithm C4.5 [16], which introduced the concept of information
gain. C4.5 uses the information gain (or gain ratio) as the criterion
to construct the decision tree for classifying objects. It calculates the difference
between the entropy of the a priori distribution and that of the a posteriori
distribution of the class, and uses this value as the metric for deciding the
branching node. The information gain used in C4.5 is defined as follows.

$$H(C) - H(C|A) = \sum_a P(a) \sum_c P(c|a)\log P(c|a) - \sum_c P(c)\log P(c) \qquad (5)$$
Equation (5) represents the discriminative power of a feature and this can be
regarded as the weight of a feature. Since we need the discriminative power of
a feature value, we cannot directly use Equation (5) as the measure of discrim-
inative power of a feature value.
In this paper, let us define IG(C|a) as the instantaneous information that
the event A = a provides about C, i.e., the information gain that we receive
about C given that A = a is observed. IG(C|a) is the difference between the a
priori and a posteriori entropies of C given the observation a, and is defined as

$$IG(C|a) = H(C) - H(C|a) = \sum_c P(c|a)\log P(c|a) - \sum_c P(c)\log P(c) \qquad (6)$$
While the information gain used in C4.5 is the information content of a specific
feature, the information gain defined in Equation (6) is that of a specific observed
value. Therefore, Equation (6) can be a candidate for the weight measure of a
feature value (A = a).
However, although IG(C|a) is a well-known formula, there are fundamental
problems with using IG(C|a) as the measure of value weight. The first problem
is that IG(C|a) can be zero even if P(c|a) ≠ P(c) for some c. For instance,
consider the case of an n-valued class where a particular value C = c is
particularly likely a priori (P(c) = 1 − ε), while all other values of C are equally
unlikely with probability ε/(n − 1). IG(C|a) cannot distinguish
a permutation of these probabilities, i.e., an observation which predicts a
relatively rare event C = c. Since it cannot distinguish between such
events, IG(C|a) yields zero information for them.
Another problem with using IG(C|a) as the weight of a feature value is that the
formula can give a negative value. It is very unnatural for the weight of a feature
value to be negative. Due to the problems described so far, we do not use
IG(C|a) as the measure of value weight.
Instead of information gain, in this paper we employ the Kullback-Leibler
measure. This measure has been widely used in many learning domains since it
was originally proposed in [17]. The Kullback-Leibler measure (denoted as KL)
for a feature value a_ij is defined as

$$KL(C|a_{ij}) = \sum_c P(c|a_{ij})\log \frac{P(c|a_{ij})}{P(c)} \qquad (7)$$

where a_ij means the j-th value of the i-th feature. The formula KL(C|a_ij) is the
average mutual information between the events c and a_ij with the expectation
taken with respect to a posteriori probability distribution of C. (The original
notation should be KL(aij |C). However, we use KL(C|aij ) because it is more
meaningful in this paper.) The difference is subtle, yet significant enough that
the KL(C|aij ) is always non-negative, while IG(C|aij ) may be either negative
or positive.
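The permutation problem can be checked numerically. In the sketch below (hypothetical probabilities with n = 3 and ε = 0.1), the a posteriori distribution is a permutation of the a priori one: IG(C|a) from Equation (6) is exactly zero, while KL(C|a) from Equation (7) is strictly positive.

```python
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

eps = 0.1
prior = [1 - eps, eps / 2, eps / 2]        # one class very likely a priori
posterior = [eps / 2, 1 - eps, eps / 2]    # observation permutes the prior

# IG(C|a) = H(C) - H(C|a): zero, since entropy is blind to which class is which
ig = entropy(prior) - entropy(posterior)

# KL(C|a): strictly positive, since the observation reorders the class beliefs
kl = sum(q * math.log(q / p) for p, q in zip(prior, posterior))
```

Even though the observation radically changes which class is most likely, information gain assigns it zero importance; the KL measure does not.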
KL(C|aij ) appears in information theoretic literature under various guises.
For instance, it can be viewed as a special case of the cross-entropy or the
discrimination, a measure which defines the information theoretic similarity
between two probability distributions. In this sense, KL(C|a_ij) is a measure
of how dissimilar our a priori and a posteriori beliefs about C are; a useful feature
value should have a high degree of dissimilarity. Therefore, we employ the KL
measure in Equation 7 as a measure of divergence, and the information content
of a feature value a_ij is calculated with the use of the KL measure.
5. Reliability of KL Measure
In our value weighting method, each feature value has its own weight based
on its corresponding samples. However, the number of instances for a specific
feature value is sometimes very limited. The feature value must occur relatively
often for its weight to be deemed meaningful. Since the value weighting method
is more fine-grained, it increases the number of parameters in the naive
Bayes model. For the case of binary classification in naive
Bayes with m features, the number of parameters to estimate in normal naive
Bayes is 2m + 1. When we use feature weighting methods in naive Bayes, there
are additional m parameters for feature weights. Therefore, the number of
parameters in feature weighted naive Bayes is 2m + 1 + m = 3m + 1. In value
weighting method, assuming each feature has the same number of values r, the
number of parameters in value weighted naive Bayes becomes 2m + 1 + m · r =
(2 + r)m + 1.
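The parameter counts above can be written down directly (a small sketch; m is the number of features and r the assumed number of values per feature):

```python
def n_params_nb(m):
    """Plain naive Bayes with a binary class: 2m + 1 parameters."""
    return 2 * m + 1

def n_params_fwnb(m):
    """Feature weighting adds one weight per feature: 3m + 1."""
    return 3 * m + 1

def n_params_vwnb(m, r):
    """Value weighting adds one weight per feature value: (2 + r)m + 1."""
    return (2 + r) * m + 1

# e.g. m = 10 features, r = 4 values per feature
counts = [n_params_nb(10), n_params_fwnb(10), n_params_vwnb(10, 4)]
```

For m = 10 and r = 4, the counts grow from 21 (plain NB) to 31 (feature weighting) to 61 (value weighting), which is the source of both the added expressiveness and the overfitting risk discussed next.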
A larger number of parameters usually means that the model has more expressive
power in classification learning. At the same time, the model may be more
adaptable to the characteristics of the training data. If the training data contains
a certain degree of noise, there is a possibility that the value weighting method
is more sensitive to the characteristics of the training data. Overfitting
can happen especially when the data has high dimensions and the training data is
sparse.
We introduce the notion of the reliability of the KL measure to compensate for
this problem. The basic idea behind reliability is that the more often a specific
feature value occurs, the more reliable the KL measure becomes. For instance,
suppose that, in the case of Gender feature, the majority of data has the value
of male and very few are female. Also assume that the KL measure value of
female is very large. However, this measure comes from only a small number of
instances, and therefore, we can hardly trust the validity of the KL measure of
’female.’
This is also closely related to the overfitting problem of the KL(C|a_ij)
measure. The KL(C|a_ij) measure computed from a small number of samples is
heavily affected by the characteristics of the data, and thus may cause an
overfitting problem. Similar issues have been addressed in classification
learning with skewed distributions.
One method for solving this problem is to use a sampling method [18]. This
approach is to build a balanced training data set, and train the model on this
balanced set. For the minority feature values, we randomly select the desired
number of minority value instances, and also add an equal number of ran-
domly selected other feature value instances. This process ensures that each
feature value is represented with approximately equal proportions in training
data. However, in our work, we cannot assume that there is additional validation
data for sampling, and therefore we cannot use the sampling method to
solve this problem.
We solve this problem with respect to the estimation of proportions of class
values given a feature value. The KL measure calculates how different poste-
rior probabilities are from prior probabilities. The KL measures are calculated
based on the probabilities of P (c|aij ) and P (c). As for the value of P (c), this
probability is relatively reliable since this value is derived from the entire set of
training data. However, for P(c|a_ij), the probability may be based on a small
number of samples, which makes the probability unreliable. Therefore, the
reliability of the KL measure heavily depends on the reliability of the P(c|a_ij)
value.
For a specific feature value a_ij, its estimated conditional probability distribution is
given as $\{\hat{P}(c_k|a_{ij})\}_{k=1}^{L}$. Suppose its true (unknown) probability distribution is
denoted as $\{P(c_k|a_{ij})\}_{k=1}^{L}$. If the value a_ij has many corresponding instances,
it is very likely that $\hat{P}(c_k|a_{ij}) \approx P(c_k|a_{ij})$ for all k, and then the KL measure
becomes reliable.
We measure the reliability of the KL measure of a feature value by measuring
how close the estimated probabilities $\hat{P}(c_k|a_{ij})$ are to the true probabilities
$P(c_k|a_{ij})$. Since the true probabilities are unknown, we instead calculate
the error bound of each estimated probability.
We assume each instance of a specific feature value is an independent trial
of the same event. It is known that a succession of independent events follows
a Bernoulli process. For a certain feature value a_ij and class value k, #(a_ij)
trials are assumed to be taken from a Bernoulli process with mean $P(c_k|a_{ij})$.
Statistical learning theory provides us with confidence intervals for the true
underlying proportions. By applying Hoeffding's inequality [19] to estimate the
error bound, we have

$$Prob\big(|\hat{P}(c_k|a_{ij}) - P(c_k|a_{ij})| \ge \theta_{ij}\big) \le 2\exp(-2\theta_{ij}^2 m)$$

where m is the number of instances. Equivalently,

$$Prob\big(|\hat{P}(c_k|a_{ij}) - P(c_k|a_{ij})| < \theta_{ij}\big) \ge 1 - 2\exp(-2\theta_{ij}^2 m)$$

This means that $\hat{P}(c_k|a_{ij})$ will be within $\theta_{ij}$ of $P(c_k|a_{ij})$ with a probability of at
least $1 - 2\exp(-2\theta_{ij}^2 m)$.
By setting $\delta = 2\exp(-2\theta_{ij}^2 m)$ and solving for $\theta_{ij}$, we can represent the error
bound as

$$|\hat{P}(c_k|a_{ij}) - P(c_k|a_{ij})| \le \theta_{ij}$$

with a probability of at least 1 − δ.
Since m = #(a_ij), the length of the confidence interval for $P(c_k|a_{ij})$ is given
as

$$\theta_{ij} \le \sqrt{\frac{1}{2m}\log\frac{2}{\delta}} = \sqrt{\frac{1}{2\cdot\#(a_{ij})}\log\frac{2}{\delta}} \qquad (8)$$
We assume that each class value follows a Bernoulli distribution, and repeat
this process for each class value of the feature value aij . As seen in Equation
(8), the confidence interval for the true proportion of class values remains the
same regardless of those class values.
We will define the reliability value of the KL measure to be within the range
of zero and unity. Therefore, we convert $\theta_{ij}$ into a formula that ranges between
zero and one. Let us denote the reliability of the KL measure of a_ij as $r_{ij}$. The
$r_{ij}$ value is a non-negative reliability, and its value should decrease as the value
of $\theta_{ij}$ increases. A standard choice for the weight function, based on [20], is
given as

$$r_{ij} = \exp\left(\frac{-\theta_{ij}^2}{\alpha}\right) = \exp\left(\frac{-\frac{1}{2\cdot\#(a_{ij})}\log\frac{2}{\delta}}{\alpha}\right) = \exp\left(\frac{-\log\frac{2}{\delta}}{2\alpha\,\#(a_{ij})}\right) = \exp\left(\frac{\log\frac{\delta}{2}}{2\alpha\,\#(a_{ij})}\right) \qquad (9)$$

The bandwidth α determines how fast the weights decrease as the interval length
increases. Figure 1 shows the values of the reliability function ($r_{ij}$) as the
size of the data changes. Note that the value of $r_{ij}$ ranges between zero and one and
monotonically increases as the size of the data increases. For simplicity, we set the
values of α and δ to one and 0.05, respectively, in this paper.
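Equations (8) and (9) are straightforward to sketch in code. The snippet below is an illustration (not the authors' implementation) and reproduces the behavior shown in Figure 1: the reliability lies strictly between zero and one and increases monotonically with #(a_ij).

```python
import math

def theta(m, delta=0.05):
    """Hoeffding half-width of the confidence interval, Equation (8)."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

def reliability(m, alpha=1.0, delta=0.05):
    """r_ij = exp(-theta_ij^2 / alpha), Equation (9)."""
    return math.exp(-theta(m, delta) ** 2 / alpha)

# reliability grows toward 1 as the number of instances #(a_ij) grows
rs = [reliability(m) for m in (1, 5, 15, 100)]
```

A feature value seen only once receives a heavily discounted weight, while one seen a hundred times keeps almost all of its KL score.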
This section is concerned with a few common problems that arise in practical
value weighting learning: handling numeric features and handling missing
values.
The first practical issue in the value weighting method is how to handle
numeric features. For numeric features, the maximum number of
feature values is not known in advance, and can even become infinite. In
addition, the amount of data corresponding to each feature value might be very
small, which causes an overfitting problem. In light of these issues, assigning a
weight to each value of a numeric feature is not a plausible approach.
Figure 1: The reliability function $r_{ij}$ as a function of #(a_ij), plotted for α = 1, 2, and 5.
Therefore, for numerical features, we need to discretize the feature values into
a number of nonoverlapping intervals.
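As a minimal stand-in for discretization (not necessarily the method used in the paper's experiments, where an entropy-based scheme such as Fayyad and Irani's [23] would be the usual choice), equal-width binning already illustrates mapping a numeric feature onto a small set of non-overlapping intervals:

```python
def equal_width_bins(values, k):
    """Map each numeric value to one of k non-overlapping interval indices.
    A minimal sketch; entropy-based discretization is the usual choice."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against constant features
    return [min(k - 1, int((v - lo) / width)) for v in values]

bins = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
```

After discretization, each interval acts as an ordinary discrete feature value and receives its own weight w_ij.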
The second issue is how to handle missing values in the value weighting
method. The presence of incompletely described instances complicates learning,
and many approaches to this problem have been proposed in the literature.
Since the value weighting method assigns a weight to each feature value, the
handling of missing values affects the performance of value weighting
significantly more than that of feature weighting. Therefore, the method
for handling missing values becomes an important issue in the value weighting
method. This section describes how missing values are handled in our value
weighting method. It is important to distinguish two contexts: values may be
missing in training data or in test data.
When values are missing in training data, a simple approach for handling
missing values is listwise deletion, which removes an instance if any of its
features has a missing value. However, this approach can result in a very
small data set. According to the study by Quinten and Raaijmakers [21], the
use of listwise deletion resulted in a loss of statistical power ranging between
35% (for scales with 10% missing values) and 98% (for scales with 30% missing
values).
In this paper, we use an imputation method to handle missing data. Although
value imputation is more common in the data mining literature, we employ
distribution-based imputation. Instead of estimating a single value for the
missing value (value imputation), distribution-based imputation estimates the
distribution of the missing value, and learning for missing values is based
on this estimated distribution.
In our work, a straightforward extension of the proposed value weighting
method allows us to process instances with missing values. Rather than trying
to estimate the missing feature values, we treat a missing value as a new
possible value for each feature and deal with it in the same way as the other values.
Algorithm 1 Value Weighted Naive Bayes
Input: S: training data, T: test data, C: target feature, α: bandwidth of
reliability, δ: confidence level

read training data S
for each feature i do
    for each feature value j do
        if (∃ missing value a_i*) then
            KL(C|a_i*) = Σ_c P(c|a_i*) log [P(c|a_i*) / P(c)]
                // calculate the KL measure for the missing value in training data
        else
            KL(C|a_i*) = 0
        end if
        KL(C|a_ij) = Σ_c P(c|a_ij) log [P(c|a_ij) / P(c)] + P(a_ij) · KL(C|a_i*)
            // KL measure of the feature value, augmented with that of the missing value
        r_ij = exp( log(δ/2) / (2α · #(a_ij)) )
            // calculate the reliability of the value weight
        w_ij = r_ij · KL(C|a_ij)
            // the final weight of the feature value
    end for
end for
for each test instance d ∈ T do
    if (a_ij ∈ d is a missing value) then
        P_ij = Σ_{j|i} P(a_ij) P(a_ij|c)
            // missing value is split into multiple pseudo-instances
    else
        P_ij = P(a_ij|c)
    end if
    class value of d = argmax_{c∈C} P(c) Π_{a_ij∈d} P_ij^{w_ij}
end for
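A compact Python sketch of the core of Algorithm 1 is given below. It is an illustration under simplifying assumptions, not the authors' code: the missing-value augmentation is omitted, and Laplace smoothing of the probability estimates is our own addition.

```python
import math
from collections import Counter, defaultdict

class ValueWeightedNB:
    """Sketch of value weighted naive Bayes (Algorithm 1, simplified).
    Missing-value handling is omitted; Laplace smoothing is assumed."""

    def __init__(self, alpha=1.0, delta=0.05):
        self.alpha, self.delta = alpha, delta

    def fit(self, X, y):
        n, m = len(y), len(X[0])
        self.classes = sorted(set(y))
        self.cls_count = Counter(y)
        self.prior = {c: self.cls_count[c] / n for c in self.classes}
        self.joint = [defaultdict(Counter) for _ in range(m)]  # count(v, c)
        self.val_count = [Counter() for _ in range(m)]         # count(v) = #(a_ij)
        for xs, c in zip(X, y):
            for i, v in enumerate(xs):
                self.joint[i][v][c] += 1
                self.val_count[i][v] += 1
        self.nvals = [len(vc) for vc in self.val_count]
        k = len(self.classes)
        # w_ij = r_ij * KL(C | a_ij), Equations (7) and (9)
        self.w = [dict() for _ in range(m)]
        for i in range(m):
            for v, cnt in self.val_count[i].items():
                kl = 0.0
                for c in self.classes:
                    post = (self.joint[i][v][c] + 1) / (cnt + k)  # P(c | a_ij)
                    kl += post * math.log(post / self.prior[c])
                r = math.exp(math.log(self.delta / 2) / (2 * self.alpha * cnt))
                self.w[i][v] = r * kl
        return self

    def predict(self, xs):
        # argmax_c P(c) * prod_ij P(a_ij | c) ** w_ij, computed in log space
        best, best_score = None, -math.inf
        for c in self.classes:
            score = math.log(self.prior[c])
            for i, v in enumerate(xs):
                p = (self.joint[i][v][c] + 1) / (self.cls_count[c] + self.nvals[i])
                score += self.w[i].get(v, 0.0) * math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best
```

On a toy training set where the class is determined entirely by the first feature, the learned value weights suppress the uninformative second feature (its values get KL ≈ 0, hence weight ≈ 0) and the classifier recovers the rule.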
Then we measure the information content that the missing value contains and
distribute it among the other category values according to their frequencies in
the feature.
For instance, suppose a feature i has k categories as its possible values,
denoted as a_i1, a_i2, ..., a_ik, and a_i*, where a_i* means the missing value. From
Equation (7), the amount of information that the missing value a_i* carries is
determined as

$$KL(C|a_{i*}) = \sum_c P(c|a_{i*}) \log \frac{P(c|a_{i*})}{P(c)}$$

So, for the rest of the values in feature i, the amount of information for the
category a_ij is augmented as follows.

$$KL(C|a_{ij}) = \sum_c P(c|a_{ij}) \log \frac{P(c|a_{ij})}{P(c)} + P(a_{ij})\,KL(C|a_{i*})$$

At classification time, the probability term for each feature value is given as

$$P_{ij} = \begin{cases} \sum_{j|i} P(a_{ij})\,P(a_{ij}|c) & \text{if } a_{ij} = a_{i*} \\ P(a_{ij}|c) & \text{otherwise} \end{cases}$$
Unlike the value of IG(C|a), we can see that w_ij is a non-negative value,
which is a necessary condition for valid weight values. Due to the characteristics
of the KL(C|a_ij) measure, we can see that the value of w_ij is bounded as follows.

Theorem 1: For a certain value a_ij, suppose $a = \min_c \frac{P(c|a_{ij})}{P(c)}$ and
$b = \max_c \frac{P(c|a_{ij})}{P(c)}$. The range of w_ij is bounded as follows:

$$0 \le w_{ij} \le \frac{(a-b)^2}{4ab}$$
8. Experimental Evaluation
Table 2: Accuracies of the methods
Table 3: Comparison with feature weighting method

dataset          | VWNB    | DTNB    | W_DTNB | FWNB   | W_FWNB
balance          | *71.28  | 70.72   | 10     | 71.4   | 4
cmc              | 52.17   | *54.04  | 16     | 52.2   | 3
credit           | *75.14  | 69.90   | 43     | 75.3   | 5
crx              | 86.00   | 85.90   | 3      | 85.9   | 3
dermatology      | *98.05  | 96.93   | 12     | 97.9   | 4
diabetes         | 77.96   | 77.60   | 6      | 77.9   | 2
flare            | 74.29   | *75.14  | 9      | 74.5   | 6
glass            | 74.25   | *76.17  | 24     | 74.2   | 3
haberman         | 72.41   | *73.69  | 16     | 72.1   | 7
hayes roth       | *80.76  | 80.30   | 11     | 80.7   | 2
heart            | *64.44  | 60.37   | 38     | 64.1   | 6
ionosphere       | 88.92   | *90.31  | 16     | 88.8   | 3
iris             | 94.53   | 94.60   | 3      | 94.4   | 2
kr               | *†97.75 | 96.62   | 15     | 87.8   | 79
lung cancer      | 63.67   | *74.07  | 82     | †65.1  | 18
lymph            | *85.30  | 75.68   | 86     | 85.1   | 4
monks            | *†65.31 | 64.13   | 12     | 63.4   | 17
nursery          | 90.28   | *94.75  | 38     | 90.2   | 2
post operative   | *68.68  | 65.52   | 26     | †69.2  | 10
promoters        | *91.17  | 90.57   | 9      | 91.0   | 5
sonar            | †60.36  | *67.79  | 54     | 59.5   | 9
spambase         | 90.21   | *92.60  | 33     | 90.2   | 3
spect            | *73.63  | 68.75   | 42     | 73.8   | 6
splice           | 95.42   | 95.14   | 4      | 95.4   | 3
tae              | *†52.20 | 47.00   | 44     | 51.2   | 11
vehicle          | 62.26   | *68.09  | 38     | 62.0   | 4
wine             | *97.70  | 94.30   | 46     | 97.6   | 3
zoo              | †93.57  | 93.65   | 4      | 93.0   | 10
results of the accuracies of these methods. Numbers marked with ∗ are
the best accuracies among the methods, and † marks the second best
accuracy.
As we can see in Table 2, the performance of the value weighting method is
quite impressive. The proposed method (VWNB) shows the top or the second best
performance in 11 cases out of 28 datasets. Its performance is quite competitive
with that of other algorithms such as Logistic, CART, TAN, and Random
Forests. These results indicate that assigning weights to each feature value can
improve the performance of the classification task for the naive Bayesian method
in many cases.
Table 3 compares the performance of VWNB with that of DTNB (feature-weighted
NB) in a pairwise way. We cannot apply Student's t-test since there
is no guarantee that the accuracy values are normally distributed. Furthermore,
it is known that a t-test does not perform well in cases
where the data sets are not completely independent of each other. Due to these
limitations, we use a Wilcoxon signed rank test, instead of a t-test, to compare
the differences between these two algorithms. We use a two-tailed Wilcoxon test with
α = 0.05, for which the critical value is given as 8.
The ∗ symbol indicates that the method is significantly better than the other.
VWNB presents better performance than DTNB in 13 cases out of 28, while
DTNB outperforms VWNB in 10 cases. For the remaining five data sets, we cannot reject
the null hypothesis, meaning that there were no significant differences between
the mean accuracies of these algorithms. Based on the results of these
experiments, we can posit that the value weighting method can improve the
performance of the naive Bayes algorithm.
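The Wilcoxon signed rank statistic used in this comparison can be sketched as follows. The accuracy pairs below are hypothetical, and the small pure-Python implementation (average ranks for ties, zero differences dropped) only illustrates the mechanics of the test.

```python
def signed_rank_W(a, b):
    """Two-sided Wilcoxon signed rank statistic W = min(W+, W-)."""
    d = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    order = sorted(range(len(d)), key=lambda k: abs(d[k]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):                              # average ranks over ties
        j = i
        while j + 1 < len(d) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, diff in zip(ranks, d) if diff > 0)
    w_minus = sum(r for r, diff in zip(ranks, d) if diff < 0)
    return min(w_plus, w_minus)

# hypothetical paired accuracies (method A vs. method B) on five datasets
W = signed_rank_W([71.3, 52.2, 75.1, 86.0, 98.1],
                  [70.7, 54.0, 69.9, 85.9, 96.9])
```

The resulting statistic is then compared against the critical value for the number of non-zero differences; a W at or below the critical value rejects the null hypothesis of no difference between the paired methods.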
Figure 2 shows the weights of feature values in various datasets. As shown
in Figure 2, each feature value has a different weight value, which means some
values are more important than other values in a given feature. For the cases
of the left weight feature and the right distance feature of Balance data, value 1
has the highest weight among feature values in both cases, as shown in Figure
2 (a) and 2 (b). Among the datasets we tested in this paper, we found that
virtually all feature values have different weights within a given feature. Figure 2
(c) shows the value weights of a binary feature, and Figure 2 (d) shows the value
weights of a discretized feature.
The second experiment concerns the robustness of the value weighted learning
method. The value weighting method increases the number of parameters
in the naive Bayes model; as mentioned in Section 5, a large
number of parameters sometimes means the model is more sensitive to noise in
the data. To see the effect of noise on the proposed algorithm, we intentionally
inserted noise values into the vehicle data, and ran the VWNB algorithm along
with DTNB, SMO (SVM), Logistic, RF, and NB.
Figure 3 shows the results of the experiment. The performance of VWNB
does not drop as significantly as that of RF and DTNB. The degree to which
the performance of the algorithm drops is quite similar to that of NB, Logistic,
and SMO. This result shows that the algorithm is quite robust to noise in the data
while maintaining its expressive power.
Figure 2: (a) The weights of the Left weight feature values in Balance. (b) The weights of
the Right distance feature values in Balance. (c) The weights of the F1 feature values in
Spect. (d) The weights of the Plasma glucose concentration feature values in Diabetes.
9. Conclusions
Acknowledgements
References
[1] Z. Zheng, G. I. Webb, Lazy learning of bayesian rules, in: Machine Learn-
ing, Kluwer Academic Publishers, 2000, pp. 53–84.
[6] M. Hall, A decision tree-based attribute weighting filter for naive bayes,
Knowledge-Based Systems 20 (2) (2007) 120–126.
[7] H. Zhang, S. Sheng, Learning weighted naive bayes with accurate ranking,
in: ICDM ’04: Proceedings of the Fourth IEEE International Conference
on Data Mining, 2004.
[21] A. Quinten, W. Raaijmakers, Effectiveness of different missing data treat-
ments in surveys with likert-type data: introducing the relative mean sub-
stitution approach, Educational and Psychological Measurement 59 (5)
(1999) 725–748.
[22] A. Frank, A. Asuncion, UCI machine learning repository (2010).
URL http://archive.ics.uci.edu/ml
[23] U. M. Fayyad, K. B. Irani, Multi-interval discretization of continuous-
valued attributes for classification learning, in: Int’l Joint Conference on
Artificial Intelligence, 1993, pp. 1022–1029.
[24] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten,
The WEKA data mining software: An update, SIGKDD Explorations 11 (1) (2009).
[25] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regres-
sion Trees, Wadsworth and Brooks, Monterey, CA, 1984.
[26] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[27] T. M. Cover, J. A. Thomas, Elements of Information Theory, Wiley-
Interscience, New York, NY, USA, 1991.
[28] P. Henrici, Two remarks on the Kantorovich inequality, American Mathe-
matical Monthly 68 (1961) 904–906.
Appendices
Proof of Lemma 1:
(sketch) Applying the logarithm to the definition of feature weighted naive Bayes,
we get

$$\hat{y} = \operatorname{argmax}_y \Big( \sum_i w_i \log p(x_i|y) + \log p(y) \Big). \qquad (11)$$

Therefore,

$$\hat{y} = \operatorname{sign}\Big( \Big[\sum_i w_i \log p(x_i|y=1) + \log p(y=1)\Big] - \Big[\sum_i w_i \log p(x_i|y=-1) + \log p(y=-1)\Big] \Big)
= \operatorname{sign}\Big( \sum_i w_i \log \frac{p(x_i|y=1)}{p(x_i|y=-1)} + \log \frac{p(y=1)}{p(y=-1)} \Big) \qquad (12)$$

We can represent

$$\log \frac{p(x_i|y=1)}{p(x_i|y=-1)} = x_i(u_i - v_i) + v_i \qquad (13)$$

From Equations 12 and 13,

$$\hat{y} = \operatorname{sign}\Big( \sum_i w_i(u_i - v_i)x_i + \sum_i w_i v_i + \log \frac{p(y=1)}{p(y=-1)} \Big) \qquad (14)$$
Proof of Theorem 1:
The proof of the first inequality is given in [27].
For the second inequality: since $f(x) = -\log(x)$ is a convex function and
$\log(x) \le x - 1$,

$$\sum_c P(c|a_{ij}) \log \frac{P(c|a_{ij})}{P(c)} \le \log \sum_c \frac{P(c|a_{ij})^2}{P(c)} \le \sum_c \frac{P(c|a_{ij})^2}{P(c)} - 1 \qquad (15)$$

Suppose

$$p_c = P(c|a_{ij}) \quad \text{and} \quad x_c = \frac{P(c|a_{ij})}{P(c)} \qquad (16)$$

for a certain target value c. Let $\hat{b} = \max_c \frac{P(c|a_{ij})}{P(c)} + \epsilon$, where $\epsilon$ represents a
very tiny constant. Then it is clear that $0 \le p_c$ and $0 < a \le x_c < \hat{b}$. From the
Kantorovich inequality [28], we have the following inequality.

$$\Big(\sum_c p_c x_c\Big)\Big(\sum_c \frac{p_c}{x_c}\Big) \le \frac{(a+\hat{b})^2}{4a\hat{b}} \Big(\sum_c p_c\Big)^2 = \frac{(a+\hat{b})^2}{4a\hat{b}} \qquad (17)$$

From Equation (16) and Equation (17), and noting that $\sum_c p_c/x_c = \sum_c P(c) = 1$,
we have

$$\sum_c \frac{P(c|a_{ij})^2}{P(c)} = \Big(\sum_c p_c x_c\Big)\Big(\sum_c \frac{p_c}{x_c}\Big) \le \frac{(a+\hat{b})^2}{4a\hat{b}} \qquad (18)$$

From Equations (15) and (18),

$$\sum_c P(c|a_{ij}) \log \frac{P(c|a_{ij})}{P(c)} \le \frac{(a+\hat{b})^2}{4a\hat{b}} - 1 = \frac{(a-\hat{b})^2}{4a\hat{b}} \le \frac{(a-b)^2}{4ab}$$
Proof of Proposition 1:
From $IG(C|a_{ij}) = IG(C|a_{ik})$,

$$\sum_c P(c|a_{ij}) \log P(c|a_{ij}) = \sum_c P(c|a_{ik}) \log P(c|a_{ik}) \qquad (19)$$