
Author’s Accepted Manuscript

An Information-Theoretic Filter Approach for


Value Weighted Classification Learning in Naive
Bayes

Chang-Hwan Lee

www.elsevier.com

PII: S0169-023X(16)30127-6
DOI: http://dx.doi.org/10.1016/j.datak.2017.11.002
Reference: DATAK1623
To appear in: Data & Knowledge Engineering
Received date: 2 August 2016
Revised date: 1 November 2017
Accepted date: 11 November 2017
Cite this article as: Chang-Hwan Lee, An Information-Theoretic Filter Approach
for Value Weighted Classification Learning in Naive Bayes, Data & Knowledge
Engineering, http://dx.doi.org/10.1016/j.datak.2017.11.002
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
An Information-Theoretic Filter Approach for Value
Weighted Classification Learning in Naive Bayes

Chang-Hwan Lee
Department of Information and Communications
Dongguk University, Seoul, Korea

Abstract
Assigning weights to features has been an important topic in some classification
learning algorithms. In this paper, we propose a new paradigm for assigning
weights in classification learning, called the value weighting method. While
current weighting methods assign a weight to each feature, we assign a different
weight to the values of each feature. The performance of naive Bayes learning
with the value weighting method is compared with that of some other traditional
methods on a number of datasets. The experimental results show that the value
weighting method can improve the performance of naive Bayes significantly.
Keywords: Feature Weighting, Feature Selection, Naive Bayes,
Kullback-Leibler

1. Introduction

In some classifiers, the algorithms operate under the implicit assumption that
all features are of equal value as far as the classification problem is concerned.
However, when irrelevant and noisy features influence the learning task to the
same degree as highly relevant features, the accuracy of the model is likely to
deteriorate. Since the assumption that all features are equally important hardly
holds true in real-world applications, there have been some attempts to relax
this assumption in classification. Zheng and Webb [1] provide a comprehensive
overview of work in this area. The first approach for relaxing this assumption is
to combine feature subset selection with classification learning. It is to combine
a learning method with a preprocessing step that eliminates redundant features
from the data. Feature selection methods usually adopt a heuristic search in
the space of feature subsets. Since the number of distinct feature subsets grows
exponentially, it is not reasonable to do an exhaustive search to find optimal
feature subsets. In the literature, it is known that the predictive accuracy
of naive Bayes can be improved by removing redundant or highly correlated

Email address: chlee@dgu.ac.kr (Chang-Hwan Lee)

Preprint submitted to Elsevier November 15, 2017


features [2]. This makes sense, as these features violate the assumption that
the features are independent of one another.
Another major way to mitigate this weakness, the feature independence
assumption, is to assign weights to important features in classification. Since
features do not play the same role in many real-world applications, some of
them are more important than others. Therefore, a natural way to extend
classification learning is to assign each feature a different weight to relax the
conditional independence assumption. Feature weighting is a technique used
to approximate the optimal degree of influence of individual features using a
training set. While feature selection methods assign 0 or 1 values as the weights
of features, feature weighting is more flexible than feature subset selection by
assigning continuous weights.
When successfully applied, important features are attributed a high weight
value, whereas unimportant features are given a weight value close to zero.
There have been many feature weighting methods proposed in the machine
learning literature, mostly in the domain of nearest neighbor algorithms [3].
They have significantly improved the performance of classification algorithms.
In this paper, we propose a new paradigm of weighting method, called value
weighting method. While the current weighting methods assign a weight to
each feature, we assign a weight to each feature value. Therefore, the value
weighting method is a more fine-grained weighting method than the feature
weighting method. The new value weighting method has the potential to expand
the expressive power of classification learning, and possibly to improve its
performance. Since feature weighting methods improve the performance
of classification learning, we will investigate whether assigning weights to
feature values can improve the performance even further.
The main contribution of this paper is to provide a new paradigm of weight-
ing method in classification learning. While there has been some work focused
on feature weighting in the literature, to the best of our knowledge, there has been
no work which assigns a different weight to each feature value in classification
learning. We extend the current hypothesis space of classification learning to the
next level by introducing a new set of weights to the problem.
In this paper, we study the value weighting method in the context of naive
Bayesian algorithm. We have chosen naive Bayesian algorithm as the template
algorithm since it is one of the most common classification algorithms, and
many researchers have studied the theoretical and empirical results of this ap-
proach. It has been widely used in many data mining applications, and performs
surprisingly well in practice [4].
There have been only a few methods for combining feature weighting with
naive Bayesian learning [5] [6] [7]. The feature weighting methods in naive
Bayes are known to improve the performance of classification learning. We
will investigate, in this paper, whether the value weighting method provides
enhanced performance in the context of naive Bayes.
The rest of this paper is structured as follows. In Section 2, we describe
related work on weighting methods in naive Bayesian learning. Section 3 presents
the basic concepts of value weighted naive Bayesian learning, and Section 4
discusses the mechanisms of the new value weighting method. Section 5 presents
the experimental results of the proposed method, and Section 6 summarizes
the contributions made in this paper.

2. Related Work

A number of approaches have been proposed in the literature on feature
selection [8] [2] [9] [10]. While feature selection methods assign a 0 or 1 value to
each feature, feature weighting assigns a continuous weight to each feature.
Feature weighting can be viewed as a learning bias, and many feature weighting
methods have been applied, mostly to nearest neighbor algorithms [3]. While
there has been much research on assigning feature weights in the context of
nearest neighbor algorithms, very little work on weighting features has been done
in naive Bayesian learning.
In this section, we focus on feature weighting methods in naive Bayesian
learning. The methods for calculating feature weights can be roughly divided
into two categories: filter methods and wrapper methods [11]. These methods
are distinguished based on the interaction between the feature weighting and
classification. The class of filter-based methods contains algorithms that use
no input other than the training data itself to calculate the feature weights,
whereas wrapper-based algorithms use feedback from a classifier to guide the
search. In filter methods, the bias is pre-determined in advance, and the method
incorporates this bias as a preprocessing step. Filters are data driven,
and weights are assigned based on some property or heuristic measure of the
data. Wrapper-based algorithms are inherently more powerful than their filter-
based counterparts, as they implicitly take the inductive bias of the classifier into
account.
Hall [6] proposed a feature weighting algorithm for naive Bayes using decision
trees, called DTNB. This method estimates the degree of feature dependency by
constructing unpruned decision trees and looking at the depth at which features
are tested in the tree. A bagging procedure is used to stabilize the estimates.
Features that do not appear in the decision trees receive a weight of zero. They
show that using feature weights with naive Bayes improves the quality of the
model compared to standard naive Bayes.
Lee et al. [12] proposed a method for calculating feature weights in naive
Bayes. They calculated the feature weights of naive Bayes using Kullback-
Leibler measure. The averaged amount of information for each feature is calculated,
and the method showed improvement over normal naive Bayes and other supervised
learning methods.
In the case of wrapper (feedback) methods, the performance feedback from the
classification algorithm is incorporated in determining feature weights. In wrap-
per methods, the weights of features are determined by how well the specific
feature settings perform in classification learning.
Zhang and Sheng [7] investigated a weighted naive Bayes with accurate rank-
ing from data. They used various weighting methods including the gain ratio

method, the hill climbing method, the Markov Chain Monte Carlo method, and
combinations of these methods. The performance of the methods is measured
using AUC, and the weighted naive Bayes is claimed to produce accurate rankings
that outperform those of naive Bayes.
Cardie [13] used an information gain measure based on the position of a feature in
a decision tree to derive feature weights for nearest neighbor algorithms. They
designed case-based learning algorithms to improve the performance of minority
class predictions. They used a local weighting method, in which weights are derived
for each test instance based on the path it takes in the tree.
Gartner [5] employs feature weighting performed by an SVM. The algorithm
looks for an optimal hyperplane that separates two classes in a given space, and
the weights determining the hyperplane can be interpreted as feature weights
in the naive Bayes classifier. The weights are optimized such that the danger
of overfitting is reduced. The method solves binary classification problems, and
the feature weights are based on conditional independence. They showed that the
method compares favorably to state-of-the-art machine learning approaches.
Zaidi et al. [14] proposed a weighted naive Bayes algorithm, called WANBIA,
that selects weights to minimize either the negative conditional log likelihood or
the mean squared error objective function. They perform numerous evaluations
and find that WANBIA is a competitive alternative to state-of-the-art classifiers
such as Random Forest, Logistic Regression and A1DE.
Taheri et al. [15] propose a novel attribute weighted Naive Bayes classifier
by assigning weights to the conditional probabilities. An objective function is
modeled and taken into account, which is based on the structure of the Naive
Bayes classifier and the attribute weights. The optimal weights are determined
by a local optimization method using the quasisecant method.
The idea of value-based weights was first proposed in Lee [12], who applied a
gradient approach for calculating value weights, and showed that by assigning
different weights to feature values we can improve the performance of
classification learning. In this paper, we employ a static filter approach for
calculating value weights in naive Bayes learning, based on the Kullback-Leibler
measure and a reliability measure.

3. Background

The naive Bayesian classifier is a straightforward and widely used method
for supervised learning. It is one of the fastest learning algorithms, and can
deal with any number of features or classes. Despite its simplicity, naive
Bayesian learning performs surprisingly well in a variety of problems. Furthermore,
it is robust enough that small amounts of noise do not perturb the results.
In this paper we implement the value weighting method in the context of naive
Bayes using a filter method.
Naive Bayesian learning uses Bayes' theorem to calculate the most likely
class label of a new instance. Since all features are considered to be independent
given the class value, the classification of an instance d is defined as follows:

$$V_{NB}(d) = \operatorname*{argmax}_{c} \; P(c) \prod_{a_{ij} \in d} P(a_{ij} \mid c)$$

where $a_{ij}$ represents the $j$-th value of the $i$-th feature.


The naive Bayesian classification with feature weighting is now represented
as follows:

$$V_{FWNB}(d) = \operatorname*{argmax}_{c} \; P(c) \prod_{a_{ij} \in d} P(a_{ij} \mid c)^{w_i} \tag{1}$$

where $w_i \in \mathbb{R}$ represents the weight of the $i$-th feature. In this formula, unlike the
traditional naive Bayesian approach, each feature $i$ has its own weight $w_i$. Since feature
weighting is a generalization of feature selection, it involves a much larger search
space than feature selection.
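As a concrete illustration, the log-space scoring implied by Equation (1) can be sketched as follows. The data structures (dictionaries keyed by feature index, value, and class) and all probability numbers are hypothetical, not taken from the paper.

```python
import math

def feature_weighted_nb(instance, class_priors, cond_probs, weights):
    """Return argmax_c [ log P(c) + sum_i w_i * log P(a_i | c) ],
    the log-space form of Equation (1)."""
    best_class, best_score = None, -math.inf
    for c, prior in class_priors.items():
        score = math.log(prior)
        for i, value in enumerate(instance):
            score += weights[i] * math.log(cond_probs[(i, value, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical two-feature example: feature 0 carries twice the weight of feature 1.
priors = {"+": 0.5, "-": 0.5}
cp = {(0, "a", "+"): 0.8, (0, "a", "-"): 0.2,
      (1, "b", "+"): 0.3, (1, "b", "-"): 0.7}
print(feature_weighted_nb(["a", "b"], priors, cp, [2.0, 1.0]))  # "+"
```

Raising the weight of a feature amplifies the influence of its likelihood term on the final decision, exactly as the exponent in Equation (1) does.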
In spite of its good performance, the naive Bayesian classifier is known to be a
linear classifier, and thus has trouble solving non-linearly separable problems.
The feature weighted naive Bayesian classifier assigns a weight to each feature,
and can improve the performance of regular naive Bayesian learning. However,
we will show that it still belongs to the category of linear classifiers. For
simplicity, suppose an instance $X$ has $n$ binary features, $X = (x_1, \ldots, x_n)$, and
there are only two classes, $y = +1$ and $y = -1$. By definition, a classifier is
linear if it can be represented in the form $\hat{y} = \operatorname{sign}(\sum_i a_i x_i)$,
where $a_i$ is a constant coefficient associated with $x_i$.
Lemma 1: Suppose $u_i = \log \frac{p(x_i=1 \mid y=1)}{p(x_i=1 \mid y=-1)}$ and $v_i = \log \frac{p(x_i=0 \mid y=1)}{p(x_i=0 \mid y=-1)}$, respectively. The feature weighted naive Bayesian classifier in Equation (1) is a linear classifier, given as

$$\hat{y} = \operatorname{sign}\left( \sum_i w_i (u_i - v_i) x_i + \sum_i w_i v_i + \log \frac{p(y=1)}{p(y=-1)} \right). \tag{2}$$

The proof of Lemma 1 is given in the Appendix.


Current feature weighting methods have the limitation that all values of a
feature are treated as having the same significance with respect to the target
concept. A feature takes on a number of discrete values, and each value might
have different importance with respect to the target values. Since current feature
weighting methods assign a weight to each feature, all values of a given feature
are given the same weight. As an illustrative example of the value weighting
paradigm, consider the following.
Example 1: Suppose the Gender feature has values male and female, and the
target feature has values y and n. Suppose their corresponding probabilities
are given as {p(y) = 0.9, p(n) = 0.1, p(y|female) = 0.1, p(n|female) =
0.9, p(y|male) = 0.99, p(n|male) = 0.01}.
Traditional feature weighting methods give the same weight in both cases,
whether the feature value is male or female. When the value female is observed,
it significantly impacts the probability distribution of the target feature.
Meanwhile, the observation of the value male does not change the target
distribution much. The majority of the significance of the Gender feature arises
when its value is female. However, by assigning the same weight to each feature
value, we are not able to discriminate the embedded significance among the
feature values. □
The proposed value weighting method is more fine-grained and provides a
new weighting paradigm in classification learning. The basic assumption be-
hind the value weighting method is that each feature value has different sig-
nificance with respect to class value. When we say a certain feature is impor-
tant/significant, we think the importance of the feature can be decomposed. As
we have seen in Example 1, feature values are not equally important. Some
feature values are more important than other feature values. In the case of
Example 1, we can see that the value of female is much more important than
that of male with respect to the target feature (pregnancy-related disease). If
we assign the same weight to each feature value, we will lack the capability to
discriminate the predictive power residing across feature values.
In this paper, we consider a problem in which weights are assigned in a more
fine-grained way. Unlike feature weighting methods, we assign a different weight
to each feature value. We call this method the value weighting method. The
formula of the value weighting method is defined as follows:

$$V_{VWNB}(d) = \operatorname*{argmax}_{c} \; P(c) \prod_{a_{ij} \in d} P(a_{ij} \mid c)^{w_{ij}} \tag{3}$$

where $w_{ij} \in \mathbb{R}$ represents the weight of feature value $a_{ij}$. One can easily see
that each feature value is assigned a different weight. The $w_{ij}$ can be any real
value, representing the significance of feature value $a_{ij}$.
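The only change from feature weighting is that the exponent is looked up per (feature, value) pair rather than per feature. A minimal sketch of Equation (3) follows; the probabilities and weights are hypothetical, loosely inspired by Example 1.

```python
import math

def value_weighted_nb(instance, class_priors, cond_probs, value_weights):
    """Score each class c by log P(c) + sum_ij w_ij * log P(a_ij | c),
    the log-space form of Equation (3); the weight depends on the observed value."""
    best_class, best_score = None, -math.inf
    for c, prior in class_priors.items():
        score = math.log(prior)
        for i, value in enumerate(instance):
            score += value_weights[(i, value)] * math.log(cond_probs[(i, value, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical single-feature (Gender) setup: 'female' gets a much larger
# weight than 'male', reflecting its larger impact on the class distribution.
priors = {"y": 0.9, "n": 0.1}
cp = {(0, "female", "y"): 0.05, (0, "female", "n"): 0.45,
      (0, "male", "y"): 0.95, (0, "male", "n"): 0.55}
weights = {(0, "female"): 2.0, (0, "male"): 0.1}
print(value_weighted_nb(["female"], priors, cp, weights))  # "n"
```

Here observing female overturns the strong prior toward y, while the tiny weight on male leaves the prior essentially intact.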
If a problem is linear, it is best to use a simpler linear classifier. However,
if a problem is non-linear and its class boundaries cannot be approximated well
with linear hyperplanes, then non-linear classifiers are often more accurate than
linear classifiers.
Lemma 2: Suppose $u_{ij} = \log \frac{p(x_{ij}=1 \mid y=1)}{p(x_{ij}=1 \mid y=-1)}$ and $v_{ij} = \log \frac{p(x_{ij}=0 \mid y=1)}{p(x_{ij}=0 \mid y=-1)}$, respectively. The value weighted naive Bayesian classifier in Equation (3) is represented as follows:

$$\hat{y} = \operatorname{sign}\left( \sum_{ij} w_{ij} (u_{ij} - v_{ij}) x_{ij} + \sum_{ij} w_{ij} v_{ij} + \log \frac{p(y=1)}{p(y=-1)} \right). \tag{4}$$

The proof of Lemma 2 is very similar to that of Lemma 1, and is thus omitted.
In Lemma 2, the $w_i$ which determines the slope of the linear classifier is
changed to $w_{ij}$. This difference is subtle but critical. The value of $w_{ij}$
changes depending on the value of $x_{ij}$, which makes the shape of the class
boundary non-linear. The value weighted naive Bayesian classifier can no longer
be represented in the form $\hat{y} = \operatorname{sign}(\sum_i a_i x_i)$. The value
weighted naive Bayes classifier has more expressive power and the capability of
finding non-linear class boundaries.

Table 1: Legend of symbols

symbol        definition                           symbol        definition
d             data instance                        c             class value
a_ij          j-th value of i-th feature           a_i*          missing value of i-th feature
w_ij          weight of a_ij                       #(a_ij)       number of data instances with a_ij
KL(C|a_ij)    Kullback-Leibler measure for a_ij    KL(C|a_i*)    Kullback-Leibler measure for a_i*
P(C|a_ij)     true prob. dist. given a_ij          r_ij          reliability of a_ij

By providing more parameters in naive Bayes learning, the value weighting
method expands the dimension of the hypothesis space of naive Bayes learning.
Introducing this new dimension increases the expressive power of naive Bayes, which in turn can
possibly improve the performance of naive Bayesian learning. In this paper, we
will investigate whether the proposed value weighting method can improve the
performance of naive Bayesian learning even further, and its performance will
be compared with that of other methods.

4. Value Weighted Learning

A feature takes on a number of discrete values, and each feature value has a
different importance with respect to the target value. In this paper, we propose
a new method for calculating the weights of feature values using a filter approach.
We use an information-theoretic filter method for assigning weights to feature
values. We choose the information-theoretic method because it has a strong
theoretical basis for deciding and calculating weight values.
Many different symbols and variables are defined in this section; to make
the paper easier to read, the symbols and variables used in Section 4
are explained in Table 1.

4.1. KL measure
The basic assumption of the value weighting method is that when a certain
feature value is observed, it gives a certain amount of information to the target
feature. The more information a feature value provides to the target feature,
the more important the feature value becomes. The critical part now is how to
define or select a proper measure which can correctly measure the amount of
information.
The first candidate we can use for this purpose is information gain. Infor-
mation gain is a widely used method for calculating the importance of features,
including in decision tree algorithms. It is quite an intuitive argument that a
feature with higher information gain deserves a higher weight. Quinlan proposed
the classification algorithm C4.5 [16], which introduced the concept of information
gain. C4.5 uses information gain (or the gain ratio) as the criterion
to construct a decision tree for classifying objects. It calculates the difference
between the entropy of the a priori class distribution and that of the a posteriori
class distribution, and uses this value as the metric for selecting the branching node. The
information gain used in C4.5 is defined as follows:
$$H(C) - H(C \mid A) = \sum_{a} P(a) \sum_{c} P(c \mid a) \log P(c \mid a) - \sum_{c} P(c) \log P(c) \tag{5}$$

Equation (5) represents the discriminative power of a feature and this can be
regarded as the weight of a feature. Since we need the discriminative power of
a feature value, we cannot directly use Equation (5) as the measure of discrim-
inative power of a feature value.
In this paper, let us define IG(C|a) as the instantaneous information that
the event A = a provides about C, i.e., the information gain that we receive
about C given that A = a is observed. IG(C|a) is the difference between the a
priori and a posteriori entropies of C given the observation a, and is defined as

$$IG(C \mid a) = H(C) - H(C \mid a) = \sum_{c} P(c \mid a) \log P(c \mid a) - \sum_{c} P(c) \log P(c) \tag{6}$$

While the information gain used in C4.5 is the information content of a specific
feature, the information gain defined in Equation (6) is that of a specific observed
value. Therefore, Equation (6) can be a candidate for the weight measure of a
feature value (A = a).
However, although IG(C|a) is a well-known formula, there is a fundamental
problem with using IG(C|a) as the measure of value weight. The first problem
is that IG(C|a) can be zero even if P(c|a) ≠ P(c) for some c. For instance,
consider the case of an n-valued class variable where a particular value C = c is
particularly likely a priori (p(c) = 1 − ε), while all other values of C are equally
unlikely, with probability ε/(n − 1). IG(C|a) cannot distinguish a
permutation of these probabilities, i.e., an observation which predicts a
relatively rare class value. Since it cannot distinguish between such
events, IG(C|a) yields zero information for them.
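This permutation problem is easy to verify numerically. The sketch below builds the prior described above and a posterior that permutes its probabilities; because entropy depends only on the multiset of probabilities, the two entropies coincide and IG(C|a) is exactly zero even though the distributions differ (the values n = 4 and ε = 0.1 are illustrative choices).

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

n, eps = 4, 0.1
prior = [1 - eps] + [eps / (n - 1)] * (n - 1)                     # one class very likely a priori
posterior = [eps / (n - 1), 1 - eps] + [eps / (n - 1)] * (n - 2)  # same values, permuted

ig = entropy(prior) - entropy(posterior)
print(posterior != prior, ig)  # True 0.0
```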
Another problem with using IG(C|a) as the weight of a feature value is that
the formula can take a negative value. It is very unnatural for the weight of a
feature value to be negative. Due to the problems described so far, we do not
use the IG(C|a) form as the measure of value weight.
Instead of information gain, in this paper we employ the Kullback-Leibler
measure. This measure has been widely used in many learning domains since it
was originally proposed in [17]. The Kullback-Leibler measure (denoted as KL)
for a feature value $a_{ij}$ is defined as

$$KL(C \mid a_{ij}) = \sum_{c} P(c \mid a_{ij}) \log \frac{P(c \mid a_{ij})}{P(c)} \tag{7}$$

where $a_{ij}$ denotes the $j$-th value of the $i$-th feature. The formula $KL(C \mid a_{ij})$ is the
average mutual information between the events $c$ and $a_{ij}$, with the expectation
taken with respect to the a posteriori probability distribution of C. (The original
notation would be $KL(a_{ij} \mid C)$; however, we use $KL(C \mid a_{ij})$ because it is more
meaningful in this paper.) The difference from IG is subtle, yet significant:
$KL(C \mid a_{ij})$ is always non-negative, while $IG(C \mid a_{ij})$ may be either negative
or positive.
KL(C|aij ) appears in information theoretic literature under various guises.
For instance, it can be viewed as a special case of the cross-entropy or the
discrimination, a measure which defines the information theoretic similarity
between two probability distributions. In this sense, $KL(C \mid a_{ij})$ is a measure
of how dissimilar our a priori and a posteriori beliefs about C are; a useful feature
value should have a high degree of dissimilarity. Therefore, we employ the KL
measure in Equation (7) as a measure of divergence, and the information content
of a feature value $a_{ij}$ is calculated using the KL measure.
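Applying Equation (7) to the probabilities of Example 1 illustrates why KL is a suitable value weight: observing female yields a far larger KL value than observing male. The base of the logarithm is a free choice; base 2 is used in this sketch.

```python
import math

def kl_value(posterior, prior):
    """KL(C|a) = sum_c P(c|a) * log( P(c|a) / P(c) ), Equation (7), in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(posterior, prior) if p > 0)

prior = [0.9, 0.1]                        # {P(y), P(n)} from Example 1
kl_female = kl_value([0.1, 0.9], prior)   # observing 'female' reshapes the distribution
kl_male = kl_value([0.99, 0.01], prior)   # observing 'male' barely changes it
print(kl_female > kl_male)  # True
```

Unlike IG, the result is non-negative by construction, and it is zero only when the posterior equals the prior.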

5. Reliability of KL Measure
In our value weighting method, each feature value has its own weight based
on its corresponding samples. However, the number of instances for a specific
feature value is sometimes very limited. The feature value must occur relatively
often for its weight to be deemed meaningful. Since the value weighting method
is more fine-grained, it increases the number of parameters in the naive Bayes
model. For the case of binary classification in naive Bayes with m features, the
number of parameters to estimate in normal naive Bayes is 2m + 1. When we
use feature weighting methods in naive Bayes, there are m additional parameters
for the feature weights. Therefore, the number of parameters in feature weighted
naive Bayes is 2m + 1 + m = 3m + 1. In the value weighting method, assuming
each feature has the same number of values r, the number of parameters in value
weighted naive Bayes becomes 2m + 1 + m · r = (2 + r)m + 1.
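These counts can be captured in a small helper; the function below simply restates the arithmetic above and is illustrative, not part of the proposed algorithm.

```python
def num_parameters(m, r):
    """Parameter counts for binary-class naive Bayes with m discrete features,
    each with r values (for the value weighted variant)."""
    plain = 2 * m + 1                  # conditional probabilities + class prior
    feature_weighted = plain + m       # one extra weight per feature
    value_weighted = plain + m * r     # one extra weight per feature value
    return plain, feature_weighted, value_weighted

print(num_parameters(10, 3))  # (21, 31, 51)
```

For m = 10 features with r = 3 values each, value weighting more than doubles the parameter count of plain naive Bayes, which motivates the reliability correction discussed next.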
A larger number of parameters usually means that the model has more expressive
power in classification learning. At the same time, the model may be more
adaptable to the characteristics of the training data. If the training data contains
a certain degree of noise, there is a possibility that the value weighting method
is more sensitive to the characteristics of the training data. Overfitting
can occur especially when the data has high dimensionality and the training data is
sparse.
We introduce the notion of the reliability of the KL measure to compensate for this
problem. The basic idea behind reliability is that the more often a specific
feature value occurs, the more reliable its KL measure becomes. For instance,
suppose that, in the case of the Gender feature, the majority of the data has the
value male and very few instances have the value female. Also assume that the KL
measure of female is very large. This measure comes from only a small number of
instances, and therefore we can hardly trust the validity of the KL measure of
female.
This is also closely related to the overfitting problem of the $KL(C \mid a_{ij})$
measure. The $KL(C \mid a_{ij})$ measure with a small number of samples is heavily
affected by the characteristics of the data, and thus may cause overfitting.
Similar issues have been addressed in classification learning with skewed
distributions.
One method for solving this problem is to use a sampling method [18]. This
approach builds a balanced training data set, and trains the model on this
balanced set. For the minority feature values, we randomly select the desired
number of minority value instances, and also add an equal number of randomly
selected instances of other feature values. This process ensures that each
feature value is represented in approximately equal proportions in the training
data. However, in our work, we cannot assume that there is additional validation
data for sampling, and therefore we cannot use the sampling method to
solve this problem.
Instead, we address this problem through the estimation of the proportions of class
values given a feature value. The KL measure calculates how different the posterior
probabilities are from the prior probabilities; it is calculated from the
probabilities $P(c \mid a_{ij})$ and $P(c)$. The value of $P(c)$ is relatively
reliable, since it is derived from the entire set of training data. However,
$P(c \mid a_{ij})$ may be based on a small number of samples, which makes the
probability unreliable. Therefore, the reliability of the KL measure depends
heavily on the reliability of the $P(c \mid a_{ij})$ value.
For a specific feature value $a_{ij}$, its estimated conditional probability distribution
is given as $\{\hat{P}(c_k \mid a_{ij})\}_{k=1}^{L}$. Suppose its true (unknown) probability
distribution is denoted as $\{P(c_k \mid a_{ij})\}_{k=1}^{L}$. If the value $a_{ij}$ has many
corresponding instances, it is very likely that $\hat{P}(c_k \mid a_{ij}) \approx P(c_k \mid a_{ij})$
for all $k$, and then the KL measure becomes reliable.
We measure the reliability of the KL measure of a feature value by measuring
how close the estimated probabilities $\hat{P}(c_k \mid a_{ij})$ are to the true probabilities
$P(c_k \mid a_{ij})$. Since the true probabilities are unknown, we will instead calculate
the error bound of each estimated probability.
We assume each instance of a specific feature value is an independent trial
of the same event. It is known that a succession of independent events follows
a Bernoulli process. For a certain feature value $a_{ij}$ and class value $c_k$, the
$\#(a_{ij})$ trials are assumed to be taken from a Bernoulli process with mean
$P(c_k \mid a_{ij})$. Statistical learning theory provides us with confidence intervals
for the true underlying proportions. Applying Hoeffding's inequality [19] to
estimate the error bound, we have

$$Prob\left(\left|\hat{P}(c_k \mid a_{ij}) - P(c_k \mid a_{ij})\right| \geq \theta_{ij}\right) \leq 2\exp(-2\theta_{ij}^2 m)$$

where $m$ is the number of instances. Equivalently,

$$Prob\left(\left|\hat{P}(c_k \mid a_{ij}) - P(c_k \mid a_{ij})\right| < \theta_{ij}\right) \geq 1 - 2\exp(-2\theta_{ij}^2 m)$$

This means that $\hat{P}(c_k \mid a_{ij})$ will be within $\theta_{ij}$ of $P(c_k \mid a_{ij})$ with a probability of at
least $1 - 2\exp(-2\theta_{ij}^2 m)$.

By setting $\delta = 2\exp(-2\theta_{ij}^2 m)$ and solving for $\theta_{ij}$, we can represent the error
bound as

$$\left|\hat{P}(c_k \mid a_{ij}) - P(c_k \mid a_{ij})\right| \leq \theta_{ij}$$

with a probability of at least $1 - \delta$.
Since $m = \#(a_{ij})$, the length of the confidence interval for $P(c_k \mid a_{ij})$ is given
as

$$\theta_{ij} \leq \sqrt{\frac{1}{2m}\log\frac{2}{\delta}} = \sqrt{\frac{1}{2 \cdot \#(a_{ij})}\log\frac{2}{\delta}} \tag{8}$$

We assume that each class value follows a Bernoulli distribution, and repeat
this process for each class value of the feature value $a_{ij}$. As seen in Equation
(8), the confidence interval for the true proportion of class values remains the
same regardless of the class value.
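Equation (8) is straightforward to evaluate. The sketch below computes the bound $\theta_{ij}$ for hypothetical instance counts, showing that the interval tightens as instances of a feature value accumulate.

```python
import math

def theta(count, delta=0.05):
    """Hoeffding bound from Equation (8):
    theta_ij = sqrt( log(2/delta) / (2 * #(a_ij)) )."""
    return math.sqrt(math.log(2 / delta) / (2 * count))

# The interval tightens as instances of the feature value accumulate.
print(theta(10) > theta(100) > theta(1000))  # True
```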
We will define the reliability value of the KL measure to lie between zero and
one. Therefore, we convert $\theta_{ij}$ into a formula that ranges between
zero and one. Let us denote the reliability of the KL measure of $a_{ij}$ as $r_{ij}$. Here
the $r_{ij}$ value is a non-negative reliability, and its value should decrease
as the value of $\theta_{ij}$ increases. A standard choice for the weight function, based
on [20], is given as

$$r_{ij} = \exp\left(\frac{-\theta_{ij}^2}{\alpha}\right) = \exp\left(-\frac{1}{\alpha}\left(\sqrt{\frac{1}{2 \cdot \#(a_{ij})}\log\frac{2}{\delta}}\right)^{2}\right) = \exp\left(\frac{-\log\frac{2}{\delta}}{2\alpha \cdot \#(a_{ij})}\right) \tag{9}$$

The bandwidth $\alpha$ determines how fast the weights decrease as the interval length
increases. Figure 1 shows the values of the reliability function $r_{ij}$ as
the size of the data changes. Note that the value of $r_{ij}$ ranges between zero and
one and monotonically increases as the size of the data increases. For simplicity,
we set the values of $\alpha$ and $\delta$ to one and 0.05, respectively, in this paper.
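Equation (9) collapses to a simple expression in $\#(a_{ij})$. A sketch with the paper's default settings $\alpha = 1$ and $\delta = 0.05$:

```python
import math

def reliability(count, alpha=1.0, delta=0.05):
    """r_ij = exp(-theta_ij**2 / alpha) = exp( -log(2/delta) / (2*alpha*count) ),
    Equation (9); it approaches 1 as #(a_ij) grows."""
    return math.exp(-math.log(2 / delta) / (2 * alpha * count))

# Mirrors the monotone growth shown in Figure 1.
print(0 < reliability(1) < reliability(5) < reliability(15) < 1)  # True
```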

6. Incorporating Missing Values in Value Weighting Method

This section is concerned with two common problems that arise in practical
value weighting learning: handling numeric features and handling missing
values.
The first practical issue in the value weighting method is how to handle
numeric features. For the case of numeric features, the maximum number of
feature values is not known in advance, and can even be infinite. In addition,
the amount of data corresponding to each feature value might be very
small, which causes overfitting. In light of these issues, assigning a
weight to each feature value of a numeric feature is not a plausible approach.

[Figure 1: The characteristics of the $r_{ij}$ function, plotting $r_{ij}$ against $\#(a_{ij})$ for $\alpha$ = 1, 2, and 5.]

Therefore, for numeric features, we need to discretize the feature values into
a number of non-overlapping intervals.
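The paper does not commit to a particular discretization scheme; as one simple possibility, equal-width binning can be sketched as follows.

```python
def equal_width_bins(values, k):
    """Map each numeric value to one of k non-overlapping equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0            # guard against a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width_bins([1.0, 2.5, 4.0, 9.9, 10.0], 3))  # [0, 0, 1, 2, 2]
```

Each resulting interval index then acts as an ordinary discrete feature value and receives its own weight.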
The second issue is how to handle missing values in the value weighting
method. The presence of incompletely described instances complicates learning,
and many approaches to this problem have been proposed in the literature.
Since the value weighting method assigns a weight to each feature value, the
handling of missing values affects the performance of the value weighting method
more significantly than that of feature weighting. Therefore, the method
for handling missing values is an important issue in the value weighting
method. This section describes how missing values are handled in our value
weighting method. It is important to distinguish two contexts: values may be
missing in the training data or in the test data.
When values are missing in training data, the simple approach for handling
missing values is to use listwise deletion, which removes instances if there is a
missing value on any of the features. However, this approach results in a very
small data set. According to the study by Quinten and Raaijmakers [21], the
use of listwise deletion resulted in a loss of statistical power ranging between
35% (for scales with 10% missing values) and 98% (for scales with 30% missing
values).
In this paper, we use an imputation method to handle missing data. Although value imputation is more common in the data mining literature, we employ distribution-based imputation. Instead of estimating a single value for the missing value (value imputation), distribution-based imputation estimates the distribution of the missing value, and learning for missing values is then based on this estimated distribution.
In our work, a straightforward extension of the proposed value weighting method allows us to process instances with missing values. Rather than trying to estimate the missing feature values, we treat the missing value as an additional possible value of each feature and handle it in the same way as the other values.

Algorithm 1 Value Weighted Naive Bayes
Input: S: training data, T: test data, C: target feature, α: bandwidth of reliability, δ: confidence level

read training data S
for each feature i do
    for each feature value j do
        if (∃ missing value a_{i*}) then
            KL(C|a_{i*}) = Σ_c P(c|a_{i*}) log( P(c|a_{i*}) / P(c) )   // calculate the KL measure for missing value in training data
        else
            KL(C|a_{i*}) = 0
        end if
        KL(C|a_{ij}) = Σ_c P(c|a_{ij}) log( P(c|a_{ij}) / P(c) ) + P(a_{ij}) KL(C|a_{i*})   // KL measure of feature value is augmented with that of missing value
        r_{ij} = exp( log(δ/2) / (2α #(a_{ij})) )   // calculate the reliability of value weight
        w_{ij} = r_{ij} · KL(C|a_{ij})   // the final weight of feature value
    end for
end for
for each test instance d ∈ T do
    if (a_{ij} ∈ d is a missing value) then
        P_{ij} = Σ_{j|i} P(a_{ij}) P(a_{ij}|c)   // missing value is split into multiple pseudo instances
    else
        P_{ij} = P(a_{ij}|c)
    end if
    class value of d = argmax_{c∈C} P(c) Π_{a_{ij}∈d} P_{ij}^{w_{ij}}
end for

We then measure the information content of the missing value and distribute it among the other category values in proportion to their frequencies in the feature.
For instance, suppose a feature i has k categories as its possible values, denoted a_{i1}, a_{i2}, ..., a_{ik}, together with a_{i*}, where a_{i*} denotes the missing value. From Equation (7), the amount of information that the missing value a_{i*} carries is determined as

$$ KL(C|a_{i*}) = \sum_c P(c|a_{i*}) \log \frac{P(c|a_{i*})}{P(c)} $$

So, for the rest of the values in feature i, the amount of information for the category a_{ij} is augmented as follows.

$$ KL(C|a_{ij}) = KL(C|a_{ij}) + KL(C|a_{i*}) P(a_{ij}). $$

We also need a method for treating missing values at prediction time in test data. We employ the same approach used for training data: distribution-based imputation. When the algorithm encounters a missing value in test data, the missing value is split into multiple pseudo instances. Each pseudo instance comes with a different value for the missing feature and a weight corresponding to the estimated probability of that particular value. Each pseudo instance is then applied to the value weighting method (Equation (3)). Therefore, the final formula for value weighted naive Bayes with missing values is given as follows:

$$ V_{vwnb}(d, w_{ij}) = \operatorname{argmax}_c\, P(c) \prod_{a_{ij} \in d} P_{ij}^{w_{ij}} $$

where

$$ P_{ij} = \begin{cases} \sum_{j|i} P(a_{ij}) P(a_{ij}|c) & \text{if } a_{ij} = a_{i*} \\ P(a_{ij}|c) & \text{otherwise} \end{cases} $$
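A sketch of this classification rule, assuming the priors, likelihoods, value probabilities, and value weights have already been estimated from training data; the dictionary layout and function name are our own illustration, not the authors' implementation. Missing entries are marked None, and the pseudo instances of a missing value are left unweighted (w = 1), which is one reasonable reading of the formula above:

```python
import math

def predict(instance, classes, prior, likelihood, p_value, weights):
    """Value-weighted classification with distribution-based imputation.
    prior[c] = P(c); likelihood[(i, v, c)] = P(a_iv | c);
    p_value[(i, v)] = P(a_iv); weights[(i, v)] = w_iv.
    All dictionaries are assumed to be precomputed from training data."""
    best_c, best_score = None, float("-inf")
    for c in classes:
        score = math.log(prior[c])
        for i, v in enumerate(instance):
            if v is None:
                # missing: average the likelihood over the known values of
                # feature i, weighted by their estimated probabilities
                p = sum(p_value[(i, u)] * likelihood[(i, u, c)]
                        for (fi, u) in p_value if fi == i)
                w = 1.0  # pseudo instances carry no value-specific weight
            else:
                p, w = likelihood[(i, v, c)], weights[(i, v)]
            score += w * math.log(p)
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

The product over values is computed in log space, so the value weight wij simply scales each log-likelihood term.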

7. Calculating Value Weights

The final weight of a feature value is the product of the reliability term and the KL measure of the feature value. Therefore, the final formula for the weight w_{ij} of the feature value a_{ij} is given as

$$ w_{ij} = r_{ij}\, KL(C|a_{ij}) = \exp\left(\frac{\log\frac{\delta}{2}}{2\alpha\#(a_{ij})}\right) KL(C|a_{ij}) \tag{10} $$

The formula for the value weight incorporating missing values is given as

$$ w_{ij} = r_{ij}\left(KL(C|a_{ij}) + KL(C|a_{i*}) P(a_{ij})\right) $$

This formula has a direct interpretation as a multiplicative combination of the reliability and the KL measure of a given feature value. Algorithm 1 gives the pseudo code for the value weighted naive Bayesian learning method.
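A minimal training-side sketch of Equation (10), computing the weight of a single feature value from raw counts; it omits Laplace smoothing and the missing-value augmentation, and all names are hypothetical:

```python
import math
from collections import Counter

def value_weight(rows, feature, value, alpha=1.0, delta=0.05):
    """Compute w_ij = r_ij * KL(C | a_ij) for one feature value from a
    list of (features_dict, class_label) training rows.  A bare sketch of
    Equation (10): no smoothing, no missing-value augmentation."""
    class_counts = Counter(c for _, c in rows)
    n = len(rows)
    matching = [c for f, c in rows if f.get(feature) == value]
    m = len(matching)
    if m == 0:
        return 0.0  # value never observed: no evidence, no weight
    cond = Counter(matching)
    # KL(C | a_ij) = sum_c P(c|a_ij) log( P(c|a_ij) / P(c) )
    kl = sum((cnt / m) * math.log((cnt / m) / (class_counts[c] / n))
             for c, cnt in cond.items())
    # r_ij = exp( log(delta/2) / (2 * alpha * #(a_ij)) )
    r = math.exp(math.log(delta / 2.0) / (2.0 * alpha * m))
    return r * kl
```

A value that perfectly separates the classes gets a large KL term, while a value whose posterior equals the prior gets weight zero regardless of its count.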

Unlike the value of IG(C|a), we can see that wij is a non-negative value, which is a necessary condition for valid weight values. Due to the characteristics of the KL(C|aij) measure, the value of wij is bounded as follows.

Theorem 1: For a certain value a_{ij}, suppose $a = \min_c \frac{P(c|a_{ij})}{P(c)}$ and $b = \max_c \frac{P(c|a_{ij})}{P(c)}$. The range of w_{ij} is bounded as follows:

$$ 0 \le w_{ij} \le \frac{(a-b)^2}{4ab} $$

The proof of Theorem 1 is given in the Appendices. We can see, from Theorem 1, that the value of wij does not grow indefinitely, which is another important property for weight values.
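Since rij ≤ 1, it suffices to check the Theorem 1 bound against the KL term itself. A quick numeric spot-check on random distributions, which is our own illustration and not part of the proof:

```python
import math
import random

def kl_and_bound(prior, posterior):
    """Return KL(C|a_ij) together with the Theorem 1 bound (a-b)^2/(4ab),
    where a and b are the min and max of the ratios P(c|a_ij)/P(c)."""
    ratios = [q / p for p, q in zip(prior, posterior)]
    a, b = min(ratios), max(ratios)
    kl = sum(q * math.log(r) for q, r in zip(posterior, ratios) if q > 0)
    return kl, (a - b) ** 2 / (4 * a * b)

# spot-check the bound on random strictly positive distributions
random.seed(0)
for _ in range(100):
    raw_p = [random.random() + 0.05 for _ in range(4)]
    raw_q = [random.random() + 0.05 for _ in range(4)]
    p = [x / sum(raw_p) for x in raw_p]
    q = [x / sum(raw_q) for x in raw_q]
    kl, bound = kl_and_bound(p, q)
    assert 0 <= kl <= bound + 1e-9
```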
The weight of each value becomes identical when the following conditions
are satisfied.
Proposition 1: For a certain feature i, suppose wij and wik are the weights of the feature values aij and aik, respectively. Then wij = wik if both of the following hold:

$$ IG(C|a_{ij}) = IG(C|a_{ik}) \quad \text{and} \quad E_{P(c|a_{ij})}[\log P(C)] = E_{P(c|a_{ik})}[\log P(C)] $$

where $E_{P(c|a_{ij})}[\log P(C)]$ denotes the expectation of log P(C) with respect to the posterior distribution given aij. The proof of Proposition 1 is given in the Appendices.

Proposition 2: As a special case of Proposition 1, wij = wik when P(c|aij) = P(c|aik) for all c values. The proof of Proposition 2 is trivial.
Finally, when calculating the value weights, we use Laplace smoothing when estimating the probability values. We use this smoothing to avoid degenerate estimates in which a probability, or the denominator of a probability ratio, becomes zero.
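As a one-line sketch of the Laplace-smoothed estimate, where k is the number of distinct outcomes of the quantity being estimated (an assumption about the exact smoothing variant, which the paper does not spell out):

```python
def laplace(count, total, k):
    """Laplace-smoothed probability estimate: (count + 1) / (total + k),
    where k is the number of distinct outcomes.  Keeps every estimate
    strictly positive, even for combinations never seen in training."""
    return (count + 1) / (total + k)
```

The smoothed estimates over the k outcomes still sum to one, so they remain a valid probability distribution.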

8. Experimental Evaluation

In this section, we provide empirical results on several benchmark datasets and compare the value weighted naive Bayes (VWNB) algorithm against other state-of-the-art supervised algorithms.
We selected 28 datasets from the UCI repository [22]. The continuous features in the datasets were discretized using the method of [23]. The characteristics of the datasets we used are omitted due to space limitations. To evaluate the performance, we used 10-fold cross validation. We used the Weka software [24] to run these programs.
Firstly, since the NB-related algorithms are the reference algorithms in this paper, we compare the performance of VWNB with that of other naive Bayes variants with feature weighting: (1) naive Bayes (NB), (2) DTNB [6], and (3) naive Bayes with feature weighting (FWNB) [12]. VWNB is also compared with four other well-known classification methods: (1) Logistic, (2) CART [25], (3) TAN [10], and (4) Random Forests (RF) [26]. Table 2 shows the

Table 2: Accuracies of the methods

dataset VWNB DTNB FWNB NB Logistic CART TAN RF


balance † 71.2 70.7 ∗ 71.4 70.7 69.7 70.0 71.0 70.5
cmc 52.1 54.0 52.2 52.8 † 55.8 ∗ 55.9 54.6 52.3
credit 75.1 69.9 75.3 † 75.9 ∗ 76.1 73.5 72.4 72.8
crx ∗ 86.0 85.9 † 85.9 85.9 85.6 85.9 85.8 85.1
dermatology ∗ 98.0 96.9 97.9 † 98.0 97.4 94.9 97.3 96.3
diabetes † 77.9 77.6 77.9 77.8 † 78.6 76.1 ∗ 78.7 77.3
flare 74.2 † 75.1 74.5 74.3 75.0 ∗ 75.7 75.0 72.8
glass 74.2 76.1 74.2 74.3 75.2 ∗ 76.6 73.7 ∗ 76.6
haberman 72.4 ∗ 73.6 72.1 72.7 73.3 † 72.5 71.6 71.5
hayes roth † 80.7 80.3 80.7 80.3 78.0 ∗ 83.3 64.9 71.2
heart 64.4 60.3 64.1 64.4 ∗ 66.6 62.5 64.7 † 65.5
ionosphere 88.9 90.3 88.8 88.8 86.8 † 93.1 92.7 ∗ 94.8
iris 94.5 † 94.6 94.4 94.1 93.3 94.0 94.2 ∗ 94.6
kr 97.7 96.6 87.8 87.8 97.5 ∗ 99.3 92.3 † 98.9
lung cancer 63.6 ∗ 74.0 65.1 † 66.6 62.9 48.1 61.1 62.9
lymph † 85.3 75.6 85.1 83.7 80.4 76.3 ∗ 86.8 80.4
monks † 65.3 64.1 63.4 63.1 63.7 ∗ 66.3 63.7 58.2
nursery 90.2 94.7 90.2 90.3 92.5 ∗ 99.5 96.4 † 98.2
post 68.6 65.5 ∗ 69.2 † 68.9 65.5 † 68.9 66.5 56.3
promoters ∗ 91.1 90.5 † 91.0 90.5 86.7 72.6 81.3 83.9
sonar 60.3 ∗ 67.7 59.5 60.5 66.3 62.5 † 66.5 64.4
spambase 90.2 92.6 90.2 90.2 ∗ 94.3 92.4 93.1 † 94.1
spect 73.6 68.7 † 73.8 ∗ 75.0 67.5 67.5 72.5 70.0
splice † 95.4 95.1 ∗ 95.4 95.3 90.9 94.7 95.0 89.1
tae † 52.2 47.0 51.2 50.9 ∗ 52.3 50.3 49.5 50.9
vehicle 62.2 68.0 62.0 62.6 73.2 72.1 † 73.3 ∗ 74.7
wine † 97.7 94.3 97.6 ∗ 97.7 96.0 88.2 95.8 94.9
zoo 93.5 93.6 93.0 93.0 94.0 88.1 ∗ 96.6 † 96.0

Table 3: Comparison with feature weighting method
dataset VWNB DTNB W(DTNB) FWNB W(FWNB)
balance ∗ 71.28 70.72 10 71.4 4
cmc 52.17 ∗ 54.04 16 52.2 3
credit ∗ 75.14 69.90 43 75.3 5
crx 86.00 85.90 3 85.9 3
dermatology ∗ 98.05 96.93 12 97.9 4
diabetes 77.96 77.60 6 77.9 2
flare 74.29 ∗ 75.14 9 74.5 6
glass 74.25 ∗ 76.17 24 74.2 3
haberman 72.41 ∗ 73.69 16 72.1 7
hayes roth ∗ 80.76 80.30 11 80.7 2
heart ∗ 64.44 60.37 38 64.1 6
ionosphere 88.92 ∗ 90.31 16 88.8 3
iris 94.53 94.60 3 94.4 2
kr ∗ † 97.75 96.62 15 87.8 79
lung cancer 63.67 ∗ 74.07 82 † 65.1 18
lymph ∗ 85.30 75.68 86 85.1 4
monks ∗ † 65.31 64.13 12 63.4 17
nursery 90.28 ∗ 94.75 38 90.2 2
post operative ∗ 68.68 65.52 26 † 69.2 10
promoters ∗ 91.17 90.57 9 91.0 5
sonar † 60.36 ∗ 67.79 54 59.5 9
spambase 90.21 ∗ 92.60 33 90.2 3
spect ∗ 73.63 68.75 42 73.8 6
splice 95.42 95.14 4 95.4 3
tae ∗ † 52.20 47.00 44 51.2 11
vehicle 62.26 ∗ 68.09 38 62.0 4
wine ∗ 97.70 94.30 46 97.6 3
zoo † 93.57 93.65 4 93.0 10

results of the accuracies of these methods. Numbers marked with ∗ are the best accuracies among the methods, and numbers marked with † are the second best.
As we can see in Table 2, the performance of the value weighting method is quite impressive. The proposed method (VWNB) shows the best or second-best performance on 11 of the 28 datasets. Its performance is quite competitive with that of other algorithms such as Logistic, CART, TAN, and Random Forests. These results indicate that assigning weights to feature values can improve the classification performance of the naive Bayesian method in many cases.
Table 3 compares the performance of VWNB with that of DTNB (feature weighting NB) in a pairwise way. We cannot apply Student's t-test since there is no guarantee that the accuracy values follow a normal distribution. Furthermore, it is known that a t-test does not perform well when the data sets are not completely independent of each other. Due to these limitations, we use a Wilcoxon signed rank test, instead of a t-test, to compare the differences between these two algorithms. We use a two-tailed Wilcoxon test with α=0.05, for which the critical value of the Wilcoxon statistic is 8. The ∗ symbol indicates that the method is significantly better than the other.
VWNB performs better than DTNB on 13 of the 28 datasets, while DTNB outperforms VWNB on 10. For the remaining five datasets, we cannot reject the null hypothesis, meaning that there is no significant difference between the mean accuracies of the two algorithms. Based on the results of these experiments, we can posit that the value weighting method can improve the performance of the naive Bayes algorithm.
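For illustration, a self-contained sketch of the Wilcoxon signed-rank statistic used in this comparison (in practice one would use a statistics package such as scipy.stats.wilcoxon); the tie handling and function name are our own choices:

```python
def wilcoxon_signed_rank(x, y):
    """Two-tailed Wilcoxon signed-rank statistic for paired accuracies:
    rank the |x_i - y_i| (zero differences dropped, ties get average
    ranks) and return the smaller of the two signed rank sums.  A value
    at or below the critical value (the paper quotes 8 for its setting)
    indicates a significant difference between the paired methods."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and \
                abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):  # average rank over the tie group
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)
```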
Figure 2 shows the weights of feature values in various datasets. As shown in Figure 2, each feature value has a different weight, which means some values are more important than others within a given feature. For the left weight feature and the right distance feature of the Balance data, value 1 has the highest weight among the feature values in both cases, as shown in Figures 2(a) and 2(b). Among the datasets we tested in this paper, we found that virtually all feature values have different weights within a given feature. Figure 2(c) shows the value weights of a binary feature, and Figure 2(d) shows the value weights of a discretized numeric feature.
The second experiment concerns the robustness of the value weighted learning method. The value weighting method increases the number of parameters in the naive Bayes model, and, as we mentioned in Section 5, a large number of parameters sometimes makes a model more sensitive to noise in the data. To see the effect of noise on the proposed algorithm, we intentionally inserted noise values into the vehicle data and ran the VWNB algorithm along with DTNB, SMO (SVM), Logistic, RF, and NB.
Figure 3 shows the results of this experiment. The performance of VWNB does not drop as significantly as that of RF and DTNB. The degree to which the performance of the algorithm drops is quite similar to that of NB, Logistic, and SMO. This result shows that the algorithm is quite robust to noise in the data while maintaining its expressive power.

Figure 2: The weights of feature values. (a) Left weight feature values in Balance; (b) Right distance feature values in Balance; (c) F1 feature values in Spect; (d) Plasma glucose concentration feature values in Diabetes.

Figure 3: Accuracies due to noise level

9. Conclusions

In this paper, a new paradigm of weighting, called the value weighting method, is proposed. Unlike current feature weighting methods, we propose a more fine-grained weighting method in which a weight is assigned to each feature value. An information-theoretic filter method for calculating the value weights was developed.
The proposed weighting method is implemented and tested in the context of naive Bayes. The experimental results show that the value weighting method is successful and in most cases performs better than its counterpart algorithms. In light of this evidence, a value weighting approach could improve the performance of classification learning even further.
As future work, we will combine the value weighting method with other classification algorithms and study whether the value weighting method shows better performance regardless of the underlying algorithm.

Acknowledgements

This work was supported by the Korea Research Foundation (KRF) grant funded by the Korea government (MEST) (No. 2017R1A2A2A05069662) and by the Technology Innovation Program: Industrial Strategic Technology Development Program (No. 11073162) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

References

[1] Z. Zheng, G. I. Webb, Lazy learning of bayesian rules, in: Machine Learn-
ing, Kluwer Academic Publishers, 2000, pp. 53–84.

[2] R. Kohavi, Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid, in: Second International Conference on Knowledge Discovery and Data Mining, 1996.

[3] D. Wettschereck, D. W. Aha, T. Mohri, A review and empirical evaluation


of feature weighting methods for a class of lazy learning algorithms, AI
Review 11 (1997) 273–314.

[4] P. Domingos, M. Pazzani, On the optimality of the simple bayesian classifier


under zero-one loss, Machine Learning 29 (2-3).

[5] T. Gärtner, P. A. Flach, Wbcsvm: Weighted bayesian classification based


on support vector machines, in: the Eighteenth International Conference
on Machine Learning, 2001.

[6] M. Hall, A decision tree-based attribute weighting filter for naive bayes,
Knowledge-Based Systems 20 (2) (2007) 120–126.

20
[7] H. Zhang, S. Sheng, Learning weighted naive bayes with accurate ranking,
in: ICDM ’04: Proceedings of the Fourth IEEE International Conference
on Data Mining, 2004.

[8] C. A. Ratanamahatana, D. Gunopulos, Feature selection for the naive


bayesian classifier using decision trees, Applied Artificial Intelligence 17 (5-
6) (2003) 475–487.
[9] P. Langley, S. Sage, Induction of selective bayesian classifiers, in: in Pro-
ceedings of the Tenth Conference on Uncertainty in Artificial Intelligence,
1994, pp. 399–406.

[10] N. Friedman, D. Geiger, M. Goldszmidt, G. Provan, P. Langley, P. Smyth,


Bayesian network classifiers, in: Machine Learning, 1997, pp. 131–163.

[11] R. Kohavi, G. H. John, Wrappers for feature subset selection, Artificial


Intelligence 97 (1-2) (1997) 273–324.
[12] C.-H. Lee, F. Gutierrez, D. Dou, Calculating feature weights in naive bayes
with kullback-leibler measure, in: 11th IEEE International Conference on
Data Mining, 2011.
[13] C. Cardie, N. Howe, Improving minority class prediction using case-specific
feature weights, in: Proceedings of the Fourteenth International Conference
on Machine Learning, 1997, pp. 57–65.

[14] N. A. Zaidi, J. Cerquides, M. J. Carman, G. I.Webb, Alleviating naive


bayes attribute independence assumption by attribute weighting, Journal
of Machine Learning Research 14.

[15] S. Taheri, J. Yearwood, M. Mammadov, S. Seifollahi, Attribute weighted naive bayes classifier using a local optimization, Neural Computing & Applications 24 (5) (2014) 995–1002.

[16] J. R. Quinlan, C4.5: programs for machine learning, Morgan Kaufmann


Publishers Inc., San Francisco, CA, USA, 1993.

[17] S. Kullback, R. A. Leibler, On information and sufficiency, The Annals of


Mathematical Statistics 22 (1) (1951) 79–86.
[18] H. He, E. A. Garcia, Learning from imbalanced data, IEEE Transactions
On Knowledge And Data Engineering 21 (9) (2009) 1263–1284.

[19] W. Hoeffding, Probability inequalities for sums of bounded random vari-


ables, Journal of the American Statistical Association 58 (301) 13–30.

[20] C. G. Atkeson, A. W. Moore, S. Schaal, Locally weighted learning, Artificial


Intelligence Review 11 (1-5) (1997) 11–73.

21
[21] A. Quinten, W. Raaijmakers, Effectiveness of different missing data treat-
ments in surveys with likert-type data: introducing the relative mean sub-
stitution approach, Educational and Psychological Measurement 59 (5)
(1999) 725–748.
[22] A. Frank, A. Asuncion, UCI machine learning repository (2010).
URL http://archive.ics.uci.edu/ml
[23] U. M. Fayyad, K. B. Irani, Multi-interval discretization of continuous-
valued attributes for classification learning, in: Int’l Joint Conference on
Artificial Intelligence, 1993, pp. 1022–1029.
[24] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten,
The weka data mining software: An update, SIGKDD Explorations 11.
[25] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regres-
sion Trees, Wadsworth and Brooks, Monterey, CA, 1984.
[26] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[27] T. M. Cover, J. A. Thomas, Elements of Information Theory, Wiley-
Interscience, New York, NY, USA, 1991.
[28] P. Henrici, Two remarks of the kantorovich inequality, American Mathe-
matical Monthly 68 (1961) 904–906.

Appendices
Proof of Lemma 1:
(sketch) Applying the logarithm to the definition of feature weighted naive Bayes, we get

$$ \hat{y} = \operatorname{argmax}_y \left[ \sum_i w_i \log p(x_i|y) + \log p(y) \right]. \tag{11} $$

Therefore,

$$ \hat{y} = \operatorname{sign}\left( \sum_i w_i \log p(x_i|y=1) + \log p(y=1) - \sum_i w_i \log p(x_i|y=-1) - \log p(y=-1) \right) = \operatorname{sign}\left( \sum_i w_i \log \frac{p(x_i|y=1)}{p(x_i|y=-1)} + \log \frac{p(y=1)}{p(y=-1)} \right) \tag{12} $$

We can represent

$$ \log \frac{p(x_i|y=1)}{p(x_i|y=-1)} = x_i(u_i - v_i) + v_i \tag{13} $$

From Equations (12) and (13),

$$ \hat{y} = \operatorname{sign}\left( \sum_i w_i(u_i - v_i)x_i + \sum_i w_i v_i + \log \frac{p(y=1)}{p(y=-1)} \right) \tag{14} $$

Proof of Theorem 1:
The proof of the first inequality is given in [27].
For the second inequality: since f(x) = -log(x) is a convex function and log(x) ≤ x - 1,

$$ \sum_c P(c|a_{ij}) \log \frac{P(c|a_{ij})}{P(c)} \le \sum_c P(c|a_{ij}) \left( \frac{P(c|a_{ij})}{P(c)} - 1 \right) = \sum_c \frac{P(c|a_{ij})^2}{P(c)} - 1 \tag{15} $$

Suppose

$$ p_c = P(c|a_{ij}) \quad \text{and} \quad x_c = \frac{P(c|a_{ij})}{P(c)} \tag{16} $$

for a certain target value c. Let $\hat{b} = \max_c \frac{P(c|a_{ij})}{P(c)} + \epsilon$, where $\epsilon$ represents a very tiny constant. Then it is clear that $0 \le p_c$ and $0 < a \le x_c < \hat{b}$. From the Kantorovich inequality [28], we have the following inequality:

$$ \left( \sum_c p_c x_c \right) \left( \sum_c \frac{p_c}{x_c} \right) \le \frac{(a+\hat{b})^2}{4a\hat{b}} \left( \sum_c p_c \right)^2 = \frac{(a+\hat{b})^2}{4a\hat{b}} \tag{17} $$

From Equations (16) and (17), and noting that $\sum_c p_c / x_c = \sum_c P(c) = 1$, we have

$$ \left( \sum_c P(c|a_{ij}) \frac{P(c|a_{ij})}{P(c)} \right) \left( \sum_c P(c) \right) = \sum_c \frac{P(c|a_{ij})^2}{P(c)} \le \frac{(a+\hat{b})^2}{4a\hat{b}} \tag{18} $$

From Equations (15) and (18),

$$ \sum_c P(c|a_{ij}) \log \frac{P(c|a_{ij})}{P(c)} \le \frac{(a+\hat{b})^2}{4a\hat{b}} - 1 = \frac{(a-\hat{b})^2}{4a\hat{b}} \le \frac{(a-b)^2}{4ab} \quad \square $$
Proof of Proposition 1:
From IG(C|a_{ij}) = IG(C|a_{ik}),

$$ \sum_c P(c|a_{ij}) \log P(c|a_{ij}) = \sum_c P(c|a_{ik}) \log P(c|a_{ik}) \tag{19} $$

From $E_{P(c|a_{ij})}[\log P(C)] = E_{P(c|a_{ik})}[\log P(C)]$,

$$ \sum_c P(c|a_{ij}) \log P(c) = \sum_c P(c|a_{ik}) \log P(c) \tag{20} $$

Subtracting (20) from (19), we get $w_{ij} = w_{ik}$. $\square$

