
Research Article

Improved Naive Bayes with optimal correlation factor for text classification

Jiangning Chen¹ · Zhibo Dai¹ · Juntao Duan¹ · Heinrich Matzinger¹ · Ionel Popescu¹

© Springer Nature Switzerland AG 2019

Abstract
The Naive Bayes (NB) estimator is widely used in text classification problems. However, it does not perform well with small training datasets. Most previous literature focuses on either creating and modifying features or combining clustering to improve the performance of NB. We tackle the problem directly by constructing a new estimator, called Naive Bayes with correlation factor. We introduce a correlation factor into the NB estimator that incorporates the overall correlation among the different classes. This effectively exploits the idea of bootstrapping, reusing every document for all classes even though it belongs to only one class. Moreover, we obtain a formula for the optimal correlation factor by balancing the bias and variance of the estimator. Experimental results on real-world data show that our estimator achieves better accuracy than traditional Naive Bayes while maintaining the simplicity of NB.

Keywords Naive Bayes · Correlation factor · Text classification · Insufficient training set

* Zhibo Dai, zdai37@gatech.edu; Jiangning Chen, jchen444@math.gatech.edu; Juntao Duan, jt.duan@gatech.edu; Heinrich Matzinger, matzi@math.gatech.edu; Ionel Popescu, ipopescu@math.gatech.edu | ¹School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30313, USA

SN Applied Sciences (2019) 1:1129 | https://doi.org/10.1007/s42452-019-1153-5

Received: 31 May 2019 / Accepted: 23 August 2019 / Published online: 31 August 2019

1 Introduction

In recent years, the rapid growth of text documents on the Internet and in digital libraries has increased the importance of text classification, whose goal is to find the category of each document given its contents. Text classification has many applications in natural language processing, such as topic detection [24], spam filtering [8, 11, 29], author identification [6], web page classification [25] and sentiment analysis [19]. Despite intensive research, it still remains an open problem today.

Although text classification can be realized with schemes having different settings, the fundamental scheme usually consists of two stages: feature generation and classification. In the classification step, there are two optional steps that can benefit the model: feature extraction and feature selection. Much research has been done on feature extraction and selection, such as the novel feature selection methods proposed by [21, 27, 30]. Other research projects [22] propose a simple heuristic solution of applying a hierarchical tree to assign components to classes, which performs better on large data sets.

The second stage, classification, has been studied from the viewpoints of both supervised classification and unsupervised clustering. For supervised classification, if we assume all the categories follow independent multinomial distributions and each document is a sample generated by its distribution, a straightforward idea is to apply a linear model for classification, such as the Support Vector Machine [4, 13], which finds the maximum-margin hyperplane that divides the documents with different labels. Under these assumptions, another important method is Naive Bayes (NB) [7, 15, 26, 31], which uses scores based on the 'probabilities' of each document conditioned on the categories. The NB classifier learns from training data to estimate the distribution of each category, then computes the conditional probability of each document
given the class label by applying Bayes rule. The prediction of the class is done by choosing the highest posterior probability. When we take more factors into account, such as the order of the word sequence and the meaning of words, and a large enough data set is available, we can use deep learning models such as recurrent neural networks [18, 28].

For unsupervised problems, [1] proposed SVD (Singular Value Decomposition) for dimension reduction, followed by a general clustering algorithm such as K-means. There also exist algorithms based on the EM algorithm, such as pLSA (probabilistic latent semantic analysis) [10], which models the probability of each co-occurrence of words and documents as a mixture of conditionally independent multinomial distributions. The parameters in pLSA cannot be derived in closed form, so the standard EM algorithm is used for estimation. Using the same idea, but assuming that the topic distribution has a sparse Dirichlet prior, [2] proposed LDA (Latent Dirichlet Allocation). The sparse Dirichlet priors encode the intuition that documents cover only a small set of topics and topics frequently use only a small set of words. In practice, this results in a better disambiguation of words and a more precise assignment of documents to topics.

This paper focuses on the performance of the Naive Bayes approach for text classification problems. Although it is a widely used method, it requires plenty of well-labelled data for training. Moreover, its conditional independence assumption rarely holds in reality. There is some research on how to relax this restriction, such as the feature-weighting approach [12, 33] and the instance-weighting approach [32]. [3] proposed a method that finds a better estimate of the centroid, which helps improve the accuracy of Naive Bayes estimation. In order to tackle the situation where there is not enough labelled data for each class, we propose a novel estimation method. Different from other feature-weighting approaches, our method not only relaxes the conditional independence assumption, but also improves the performance compared to traditional Naive Bayes when the labelled training data for each class is insufficient. The key part of our method is the introduction of a correlation factor, which combines feature information taken from different classes.

The remainder of this paper is organized as follows. Section 2 details the general setting for the whole paper. In Sect. 3, we give a detailed review of the Naive Bayes estimator; the error of the Naive Bayes estimator is discussed in Theorem 3.1. In Sect. 4, we derive our proposed method, Naive Bayes with correlation factor (NBCF) (see (4.5)). It improves traditional Naive Bayes in the situation where only limited training data is available, which is a common situation in many real-world text classification applications. Furthermore, in Theorem 4.1, we show that the NBCF error is controlled by the correlation factor and that the variance has a smaller order compared with the Naive Bayes estimator. Section 5 is devoted to finding the optimal correlation factor; we also show that the order of the error is smaller than for traditional Naive Bayes. In Sect. 6, we show the results of our simulations, which demonstrate the performance of the method presented in Sect. 4. Finally, Sect. 7 concludes our work and mentions possible future work.

2 General setting

Consider a classification problem with the sample (document) set S, and the class set C with k different classes:

  C = {C_1, C_2, …, C_k}.

Assume we have in total v different words; thus for each document d ∈ S, we have:

  d = {x_1, x_2, …, x_v},

where x_i represents the number of occurrences of the i-th word. Define y = (y_1, y_2, …, y_k) as our label vector. For a document d in class C_i, we have y_i(d) = 1. Notice that for a single-label problem, we have ∑_{i=1}^{k} y_i = 1.

For a test document d, our target is to predict:

  ŷ(d) = f(d; θ) = (f_1(d; θ), f_2(d; θ), …, f_k(d; θ))

given the training sample set S, where θ is the parameter matrix and f_i(d; θ) is the likelihood function of document d in class C_i. The detailed proofs of the theorems can be found in the "Appendix".

3 Naive Bayes classifier in text classification problem

In this section we discuss the properties of the Naive Bayes estimator. The traditional Naive Bayes uses a simple probabilistic model to infer the most likely class of an unknown document. Let class C_i (1 ≤ i ≤ k) have centroid θ_i = (θ_i1, θ_i2, …, θ_iv), where θ_i satisfies ∑_{j=1}^{v} θ_ij = 1. Assuming independence of the words, the most likely class for a document d is computed as:

  label(d) = argmax_i P(C_i) P(d | C_i)
           = argmax_i P(C_i) ∏_{j=1}^{v} (θ_ij)^{x_j}   (3.1)
           = argmax_i [ log P(C_i) + ∑_{j=1}^{v} x_j log θ_ij ].

This gives the classification criterion once θ is estimated, namely finding the largest among

  log f_i(d; θ) = log P(C_i) + ∑_{j=1}^{v} x_j log θ_ij,   1 ≤ i ≤ k.

Now we shall derive a maximum likelihood estimator for θ. For a class C_i, we have the standard likelihood function:

  L(C_i, θ) = ∏_{d∈S} f_i(d; θ)^{y_i(d)} = ∏_{d∈C_i} ∏_{j=1}^{v} θ_ij^{x_j}.   (3.2)

Taking the logarithm of both sides, we obtain the log-likelihood function:

  log L(C_i, θ) = ∑_{d∈C_i} ∑_{j=1}^{v} x_j log θ_ij.   (3.3)

We would like to solve the optimization problem:

  max log L(C_i, θ)
  subject to: ∑_{j=1}^{v} θ_ij = 1,   θ_ij ≥ 0.   (3.4)

Problem (3.4) can be solved explicitly with Lagrange multipliers; for class C_i, we obtain θ_i = {θ_i1, θ_i2, …, θ_iv}, where:

  θ̂_ij = ∑_{d∈C_i} x_j / ( ∑_{d∈C_i} ∑_{j=1}^{v} x_j ).   (3.5)

For the estimator θ̂, we have the following theorem.

Theorem 3.1 Assume we have normalized the length of each document, that is, ∑_{j=1}^{v} x_j = m for all documents d ∈ S. Then the estimator (3.5) satisfies the following properties:

(1) θ̂_ij is unbiased.
(2) E[|θ̂_ij − θ_ij|²] = θ_ij(1 − θ_ij) / (|C_i| m).

Naive Bayes with a multinomial prior distribution makes a strong assumption about the data: it assumes that the words in a document are independent. However, this assumption clearly does not hold in real-world text; there are many different kinds of dependence between words induced by the semantic, pragmatic, and conversational structure of a text. Although Naive Bayes has its advantages in practice compared to some more sophisticated models, we propose a new method based on the Naive Bayes model that achieves better performance by introducing a correlation factor, especially for the situation where there is insufficient data compared with a large number of classes.

4 Naive Bayes with correlation factor

From Theorem 3.1, we can see that the traditional Naive Bayes estimator θ̂ is an unbiased estimator with variance O(θ_ij(1 − θ_ij) / (|C_i| m)). Now we will try to find an estimator and prove that it can perform better than the traditional Naive Bayes estimator.

There are two main approaches to improving the Naive Bayes model: modifying the features and modifying the model. Many researchers have proposed approaches that modify the document representation in order to better fit the assumption made by Naive Bayes. These include extracting more complex features such as syntactic or statistical phrases [20], extracting features using word clustering [5] and exploiting relations using lexical resources [9]. We propose an approach that modifies the probabilistic model; therefore, our model should work well alongside such document representation modifications to achieve a better result.

Our basic idea is that, even for a single-labeling problem, a document d usually contains words appearing in different classes, and thus it should include some information from different classes. However, the label y in the training set does not reflect that information, because only one component of y is 1 and all others are 0. We would like to replace y by y + t in the Naive Bayes likelihood function (3.2), with some optimized t, to get our new likelihood function L_1:

  L_1(C_i, θ) = ∏_{d∈S} f_i(d; θ)^{y_i(d)+t} = ∏_{d∈S} ( ∏_{j=1}^{v} θ_ij^{x_j} )^{y_i(d)+t}.   (4.1)

By introducing the correlation factor t, we include more information between the document and the classes, which improves the classification accuracy.

Notice that to compute L_1 for a given class C_i, our estimator uses every d ∈ S, instead of only the documents in C_i as the Naive Bayes estimator does.

Taking the logarithm of both sides of (4.1), we obtain the log-likelihood function:

  log L_1(C_i, θ) = ∑_{d∈S} (y_i(d) + t) [ ∑_{j=1}^{v} x_j log θ_ij ].   (4.2)

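To make the weighting in (4.1)–(4.2) concrete, here is a minimal sketch (not the authors' code) that evaluates the modified log-likelihood for one class. It assumes a dense document-term count matrix X and a one-hot label matrix Y, which are illustrative conventions rather than anything fixed by the paper; the key point is that every document in S contributes, with weight y_i(d) + t.

```python
import numpy as np

def nbcf_log_likelihood(X, Y, theta, t, i):
    """Modified log-likelihood (4.2) for class i.

    X     : (n_docs, v) word-count matrix, one row per document d
    Y     : (n_docs, k) one-hot labels, Y[d, i] = y_i(d)
    theta : (k, v) row-stochastic parameter matrix
    t     : correlation factor
    """
    weights = Y[:, i] + t            # y_i(d) + t, one weight for every document in S
    per_doc = X @ np.log(theta[i])   # sum_j x_j * log(theta_ij) for each document
    return float(weights @ per_doc)  # weighted sum over all d in S
```

Setting t = 0 recovers the standard log-likelihood (3.3), in which only the documents of C_i contribute.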

Similar to the Naive Bayes estimator, we would like to solve the optimization problem:

  max log L_1(C_i, θ)
  subject to: ∑_{j=1}^{v} θ_ij = 1,   θ_ij ≥ 0.   (4.3)

Let

  G_i = 1 − ∑_{j=1}^{v} θ_ij;

by Lagrange multipliers, we have:

  ∂ log(L_1)/∂θ_ij + λ_i ∂G_i/∂θ_ij = 0,   ∀ 1 ≤ i ≤ k and ∀ 1 ≤ j ≤ v,
  ∑_{j=1}^{v} θ_ij = 1,   ∀ 1 ≤ i ≤ k.

Plugging in, we obtain:

  ∑_{d∈S} (y_i(d) + t) x_j / θ_ij − λ_i = 0,   ∀ 1 ≤ i ≤ k and ∀ 1 ≤ j ≤ v,
  ∑_{j=1}^{v} θ_ij = 1,   ∀ 1 ≤ i ≤ k.   (4.4)

Solving (4.4), we obtain the solution of the optimization problem (4.3):

  θ̂^{L1}_ij = ∑_{d∈S} (y_i(d) + t) x_j / ( ∑_{j=1}^{v} ∑_{d∈S} (y_i(d) + t) x_j )
            = ∑_{d∈S} (y_i(d) + t) x_j / ( m (|C_i| + t|S|) ).   (4.5)

For the estimator θ̂^{L1}_ij, we have the following result:

Theorem 4.1 Assume for each class we have prior distributions p_1, p_2, …, p_k with p_i = |C_i| / |S|, and we have normalized the length of each document, that is, ∑_{j=1}^{v} x_j = m. The estimator (4.5) satisfies the following properties:

(1) θ̂^{L1}_ij is biased, with |E[θ̂^{L1}_ij] − θ_ij| = O(t).
(2) E[|θ̂^{L1}_ij − E[θ̂^{L1}_ij]|²] = O(1/(m|S|)).

We can see that E[|θ̂^{L1}_ij − E[θ̂^{L1}_ij]|²] is in O(1/|S|), which means it converges faster than the standard Naive Bayes rate O(1/|C_i|); however, since E[θ̂^{L1}_ij] − θ_ij ≠ 0, it is not an unbiased estimator.

5 Determine the correlation factor

In general statistical estimation theory, a biased estimator is acceptable, and sometimes even outperforms an unbiased estimator. A more important perspective is to find a suitable loss function to determine the parameters. Li and Yang [17] introduce ways of choosing loss functions for many well-known models; the common idea is to sum a bias term and a complexity penalty term for the model parameters. Nigam et al. [23] use maximum entropy as the loss function for text classification problems.

In our problem, from (3.1) and (4.1), we know that the traditional Naive Bayes estimator is unbiased. Our estimator is biased, but we want to find an optimal t that gives a smaller variance. In order to balance the trade-off between bias and variance, we would like to select a loss function that takes both bias and variance into account.

For this task, we can use the mean squared error as the loss function. There is a well-known bias-variance decomposition for the mean squared error:

  E[|θ̂^{L1}_ij − θ_ij|²] = E[|θ̂^{L1}_ij − E[θ̂^{L1}_ij] + E[θ̂^{L1}_ij] − θ_ij|²]
                        = E[|θ̂^{L1}_ij − E[θ̂^{L1}_ij]|²] + (E[θ̂^{L1}_ij] − θ_ij)²
                        := Var(θ̂^{L1}_ij) + Bias(θ̂^{L1}_ij)².

In practice, we would like to minimize a general linear combination of bias and variance, namely,

  L(θ_ij, c_1, c_2) = c_1 Bias(θ̂^{L1}_ij)² + c_2 Var(θ̂^{L1}_ij).   (5.1)

Theorem 5.1 The minimum of the loss function (5.1) is achieved at

  t* = c_2 (1 − p_i) θ_ij (1 − θ_ij) / [ c_2 ∑_{l=1}^{k} p_l θ_lj (1 − θ_lj) − c_2 θ_ij (1 − θ_ij) + c_1 m|S| ( ∑_{l=1}^{k} p_l θ_lj − θ_ij )² ].   (5.2)

We can see from (5.2) that the optimal correlation factor t* should be a very small number, of order O(1/(m|S|)). Therefore, by Eq. (A.2), we know the squared bias is

  Bias(θ̂^{L1}_ij)² = O(t²) = O(1/(m²|S|²)).

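The closed-form solutions above translate directly into weighted count ratios. Below is a minimal sketch (not the authors' code), again assuming a dense count matrix X and one-hot labels Y; the optimal_t helper evaluates (5.2), which requires the true θ and the class priors, so it is mainly useful on synthetic data or with plug-in estimates.

```python
import numpy as np

def nb_mle(X, Y):
    """Traditional Naive Bayes MLE (3.5): per-class word frequencies."""
    counts = Y.T @ X                             # (k, v) word counts per class
    return counts / counts.sum(axis=1, keepdims=True)

def nbcf_estimate(X, Y, t):
    """NBCF estimator (4.5): every document contributes to every class
    with weight y_i(d) + t."""
    counts = (Y + t).T @ X                       # (k, v) weighted counts
    return counts / counts.sum(axis=1, keepdims=True)

def optimal_t(theta, p, i, j, m, S, c1=1.0, c2=1.0):
    """Optimal correlation factor t* from (5.2) for the entry (i, j)."""
    A = np.sum(p * theta[:, j] * (1 - theta[:, j]))
    B = (np.sum(p * theta[:, j]) - theta[i, j]) ** 2
    num = c2 * (1 - p[i]) * theta[i, j] * (1 - theta[i, j])
    den = c2 * A - c2 * theta[i, j] * (1 - theta[i, j]) + c1 * m * S * B
    return num / den
```

With t = 0, nbcf_estimate reduces to nb_mle; as t grows, counts are shared across classes, which is exactly the mechanism that trades a small bias for a lower variance.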

We have already shown in (A.3) that the variance is of order

  Var(θ̂^{L1}_ij) = O(1/(m|S|)).

Therefore, in the case c_1 = c_2 = 1, the expected square error E[|θ̂^{L1}_ij − θ_ij|²] is dominated by the variance. Thus we have the following corollary:

Theorem 5.2 With any selection of t = O(1/(m|S|)), we have

  E[|θ̂^{L1}_ij − θ_ij|²] = O(1/(m|S|)).   (5.3)

By Theorem 3.1, we know that for Naive Bayes E[|θ̂_ij − θ_ij|²] = O(1/|C_i|); thus we can see that our estimator actually works better.

6 Experiment

6.1 Simulation with different correlation factor

In the previous section we obtained that the order of t must be O(1/|S|). However, we still need to determine how to choose the best correlation factor t, so we tune this parameter by running t over a predetermined interval.

We applied our method to single-labeled documents of 10 topics, which have almost the same sample size, in the Reuters-21578 data [16]; there are approximately 3000 documents in this sample set. The 20 news group data [14] includes 20 groups and approximately 20,000 documents.

We take t ∈ (0, 2), use 10% of the data for training, and assess the trained model on the remaining test data. In our simulation, we notice that when we choose the correlation factor to be around 0.1, we get the best accuracy for our estimation. See Fig. 1a, b.

[Fig. 1 panels: (a) accuracy behavior with respect to different t in Reuters; (b) accuracy behavior with respect to different t in 20News]

Fig. 1  We test accuracy behavior with respect to different correlation factors in Reuter-21578 (a) and 20 News group dataset (b). We take
10% of the data as training set. The y-axis is the accuracy and the x-axis is the correlation factor t
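The sweep behind Fig. 1 can be reproduced in a few lines. The following sketch is ours, not the authors' experiment code: it uses a synthetic multinomial corpus as a stand-in for Reuters-21578 / 20 Newsgroups, a 10% training split, the NBCF estimator (4.5) and the decision rule (3.1); all constants (vocabulary size, document length, the epsilon guard inside the log) are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for a bag-of-words corpus: k classes, v words, m words/doc.
rng = np.random.default_rng(0)
k, v, m, n_docs = 10, 200, 50, 400
true_theta = rng.dirichlet(np.ones(v) * 0.1, size=k)        # class word distributions
labels = rng.integers(0, k, size=n_docs)
X = np.stack([rng.multinomial(m, true_theta[c]) for c in labels])
Y = np.eye(k)[labels]

n_train = n_docs // 10                                        # 10% training split
X_tr, Y_tr, y_tr = X[:n_train], Y[:n_train], labels[:n_train]
X_te, y_te = X[n_train:], labels[n_train:]

def accuracy_for_t(t):
    theta = (Y_tr + t).T @ X_tr                               # NBCF counts, eq. (4.5)
    theta = theta / theta.sum(axis=1, keepdims=True)
    priors = Y_tr.mean(axis=0)                                # P(C_i) = |C_i| / |S|
    scores = X_te @ np.log(theta.T + 1e-12) + np.log(priors + 1e-12)   # rule (3.1)
    return float((scores.argmax(axis=1) == y_te).mean())

for t in np.linspace(0.05, 2.0, 10):                          # sweep t over (0, 2)
    print(f"t = {t:.2f}  accuracy = {accuracy_for_t(t):.3f}")
```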


6.2 Compare with Naive Bayes

Next, we compare the results of the traditional Naive Bayes estimator (3.5), θ̂_ij, and our estimator (4.5), θ̂^{L1}_ij. In this simulation, the correlation factor t is chosen to be 0.1 for Figs. 2, 3 and 4.

First, we run both algorithms on the two sample datasets. We know that when the sample size becomes large enough, our estimator is biased; but when the training set is small, our estimator should converge faster. Thus we first take the training size relatively small (10%). See Fig. 2a, b. According to this simulation, our method is more accurate for most of the classes, and more accurate on average.

[Fig. 2 panels: (a) training set = 10%, behavior in reuter; (b) training set = 10%, behavior in 20 news group]
Fig. 2  We take the 10 largest groups in the Reuter-21578 dataset (a) and the 20 news group dataset (b), and take 10% of the data as the training set. The y-axis is the accuracy, and the x-axis is the class index

Then we test our estimator θ̂^{L1} with a larger training set (90%). From the analysis above, we know that as the dataset becomes large enough, our estimator converges to a biased estimator, so we expect a better result from the traditional Naive Bayes estimator. See Fig. 3a, b.

[Fig. 3 panels: (a) training set = 90%, behavior in reuter; (b) training set = 90%, behavior in 20 news group]
Fig. 3  We take the 10 largest groups in the Reuter-21578 dataset (a) and the 20 news group dataset (b), and take 90% of the data as the training set. The y-axis is the accuracy, and the x-axis is the class index

According to this simulation, for the 20 news group dataset traditional Naive Bayes performs better than our method, but our method is still more accurate than Naive Bayes on the Reuters data. The reason might be that the sample sizes in the Reuters data are highly unbalanced, so even a 90% training split is still not large enough for many classes.

Finally, we use the same training set with training size 10% and test the accuracy on the training set instead of the test set. We find that the traditional Naive Bayes estimator actually achieves better results, which means it might suffer more from over-fitting. This might be the reason why our method works better when the dataset is not too large: adding the correlation factor t brings some uncertainty into the training process, which helps avoid over-fitting. See Fig. 4a, b.
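Continuing with the synthetic arrays from the previous sketch, the per-class comparison of Figs. 2–4 (plain NB versus NBCF with t = 0.1, evaluated on both the test and the training split) can be approximated as follows. This is again our hedged illustration, not the original experiment code; a tiny t stands in for plain Naive Bayes.

```python
def per_class_accuracy(y_true, y_pred, k):
    """Accuracy within each class, i.e. the quantity on the y-axis of Figs. 2-4."""
    return np.array([(y_pred[y_true == c] == c).mean() if (y_true == c).any() else np.nan
                     for c in range(k)])

for t in (1e-8, 0.1):                          # ~plain NB vs. NBCF with t = 0.1
    theta = (Y_tr + t).T @ X_tr                # NBCF counts, eq. (4.5)
    theta = theta / theta.sum(axis=1, keepdims=True)
    priors = Y_tr.mean(axis=0)
    for name, Xs, ys in (("test", X_te, y_te), ("train", X_tr, y_tr)):
        pred = (Xs @ np.log(theta.T + 1e-12) + np.log(priors + 1e-12)).argmax(axis=1)
        acc = per_class_accuracy(ys, pred, k)
        print(f"t={t:g} [{name}] mean per-class accuracy: {np.nanmean(acc):.3f}")
```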


[Fig. 4 panels: (a) training set = 10%, training-set behavior in reuter; (b) training set = 10%, training-set behavior in 20 news group]

Fig. 4  We take 10 largest groups in Reuter-21578 dataset (a), and 20 news group dataset (b), and take 10% of the data as training set. We
test the result on training set. The y-axis is the accuracy, and the x-axis is the class index

6.3 Robustness of t for prediction

For estimation purposes, t must satisfy t = O(1/|S|) by Theorem 5.2 in order to find the most accurate parameters. However, it turns out that for prediction purposes there is a phase transition phenomenon. As long as t ≥ O(1/|S|), the prediction power is not reduced even when we increase t to a very large value (for example t = 10^5). In the simulation for finding the best t in Fig. 1a, b, we see that the testing error changes only slightly as t increases from 0.1 to 2. We summarize this fact as follows.

Proposition 6.1 For prediction purposes, the correlation factor t can take values in the interval

  1/|S| ≤ t ≤ 1.

The reason we restrict the upper bound to 1 is that the effect of the correlation factor should not exceed the effect of the original class label y_i = 1.

7 Conclusion

In this paper, we modified the traditional Naive Bayes estimator with a correlation factor to obtain a new estimator, NBCF, which is biased but has a smaller variance. We justified that our estimator has a significant advantage by analyzing the mean square error. In simulations, we applied our method to real-world text classification problems and showed that it works better when the training data set is insufficient.

There are several important open problems related to our estimator:

(1) Is our proposed estimator admissible for the square error loss? Even though we know it outperforms the Naive Bayes estimator, it might not be the optimal one.
(2) Will our estimator still work on other, more independent datasets? We have only tested our result on the Reuters data [16] and the 20 news group data [14]; these datasets consist of news articles, which means the documents are highly correlated with each other.
(3) So far we can only use our method on single-labeled datasets; it would be interesting to see whether we can extend our result to partially labeled or multi-labeled datasets.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Appendix A: Proof of the theorems

A.1. Proof of Theorem 3.1

Proof With the assumption ∑_{j=1}^{v} x_j = m, we can rewrite (3.5) as:

  θ̂_ij = ∑_{d∈C_i} x_j / ( ∑_{d∈C_i} m ) = ∑_{d∈C_i} x_j / ( |C_i| m ).

Since d = (x_1, x_2, …, x_v) follows a multinomial distribution in class C_i, we have E[x_j] = m θ_ij and E[x_j²] = m θ_ij (1 − θ_ij + m θ_ij).

(1)  E[θ̂_ij] = E[ ∑_{d∈C_i} x_j / (|C_i| m) ] = ∑_{d∈C_i} E[x_j] / (|C_i| m) = ∑_{d∈C_i} m θ_ij / (|C_i| m) = θ_ij.

Thus θ̂_ij is unbiased.

(2) By (1), we have:
  E[|θ̂_ij − θ_ij|²] = E[θ̂_ij²] − 2θ_ij E[θ̂_ij] + θ_ij² = E[θ̂_ij²] − θ_ij².

Then notice

  θ̂_ij² = ( ∑_{d∈C_i} x_j )² / ( |C_i|² m² ) = ( ∑_{d∈C_i} (x_j^d)² + ∑_{d≠d'∈C_i} x_j^d x_j^{d'} ) / ( |C_i|² m² ),   (A.1)

where d = (x_1^d, x_2^d, …, x_v^d). Since

  E[ ∑_{d∈C_i} (x_j^d)² / (|C_i|² m²) ] = |C_i| m θ_ij (1 − θ_ij + m θ_ij) / (|C_i|² m²) = θ_ij (1 − θ_ij + m θ_ij) / (|C_i| m),

and

  E[ ∑_{d≠d'∈C_i} x_j^d x_j^{d'} / (|C_i|² m²) ] = |C_i| (|C_i| − 1) m² θ_ij² / (|C_i|² m²) = (|C_i| − 1) θ_ij² / |C_i|,

plugging these into (A.1) gives:

  E[θ̂_ij²] = θ_ij (1 − θ_ij) / (|C_i| m) + θ_ij²,

thus E[|θ̂_ij − θ_ij|²] = θ_ij (1 − θ_ij) / (|C_i| m). ◻

A.2. Proof of Theorem 4.1

Proof

(1) With the assumption ∑_{j=1}^{v} x_j = m, we have:

  E[θ̂^{L1}_ij] = ∑_{d∈S} (y_i(d) + t) E[x_j] / ( m (t|S| + |C_i|) )
              = ( ∑_{d∈S} t E[x_j] + ∑_{d∈C_i} E[x_j] ) / ( m (t|S| + |C_i|) )
              = ( t ∑_{l=1}^{k} |C_l| θ_lj + θ_ij |C_i| ) / ( t|S| + |C_i| )
              = ( t|S| ∑_{l=1}^{k} p_l θ_lj + θ_ij |C_i| ) / ( t|S| + |C_i| ).

Thus:

  |E[θ̂^{L1}_ij] − θ_ij| = t|S| | ∑_{l=1}^{k} p_l θ_lj − θ_ij | / ( t|S| + |C_i| )
                       = | ∑_{l=1}^{k} p_l θ_lj − θ_ij | / ( 1 + p_i/t )
                       = O(t).   (A.2)

This shows our estimator is biased, and the error is controlled by t. When t converges to 0, our estimator converges to the unbiased Naive Bayes estimator. We can also derive a lower bound for the square error:

  E[|θ̂^{L1}_ij − θ_ij|²] ≥ ( E[θ̂^{L1}_ij] − θ_ij )² = | ∑_{l=1}^{k} p_l θ_lj − θ_ij |² / ( 1 + p_i/t )².

(2) For the variance part, since

  θ̂^{L1}_ij = ∑_{d∈S} (y_i(d) + t) x_j / ( m (|C_i| + t|S|) ),

we have:

  E[|θ̂^{L1}_ij − E[θ̂^{L1}_ij]|²] = E[ | ∑_{d∈S} (y_i(d) + t)(x_j − E[x_j]) |² / ( m² (|C_i| + t|S|)² ) ]
    = ∑_{d∈S} (y_i(d) + t)² E[|x_j − E[x_j]|²] / ( m² (|C_i| + t|S|)² )
    = ( ∑_{d∈C_i} (1 + t)² m θ_ij (1 − θ_ij) + ∑_{d∈C_l, l≠i} t² m θ_lj (1 − θ_lj) ) / ( m² (|C_i| + t|S|)² )
    = ( |C_i| (1 + 2t) θ_ij (1 − θ_ij) + ∑_{l=1}^{k} |C_l| t² θ_lj (1 − θ_lj) ) / ( m (|C_i| + t|S|)² )
    = ( |S| p_i (1 + 2t) θ_ij (1 − θ_ij) + |S| ∑_{l=1}^{k} p_l t² θ_lj (1 − θ_lj) ) / ( m (|S| p_i + t|S|)² )
    = ( p_i (1 + 2t) θ_ij (1 − θ_ij) + ∑_{l=1}^{k} p_l t² θ_lj (1 − θ_lj) ) / ( m|S| (p_i + t)² )
    = O( 1/(m|S|) ).   (A.3)
◻

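Before the last proof, a small Monte Carlo sanity check of Theorems 3.1 and 4.1 (our addition, not part of the paper): over repeated synthetic samples, the NB estimate (3.5) of a fixed entry θ_ij should show negligible bias, while the NBCF estimate (4.5) with t = 0.1 shows a small bias of order t but a visibly smaller variance. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, v, m, n_docs, t, reps = 3, 20, 30, 60, 0.1, 2000
theta = rng.dirichlet(np.ones(v), size=k)          # true class word distributions
i, j = 0, 0                                        # parameter entry under study

nb, nbcf = [], []
for _ in range(reps):
    labels = rng.integers(0, k, size=n_docs)
    X = np.stack([rng.multinomial(m, theta[c]) for c in labels])
    in_class = labels == i
    nb.append(X[in_class, j].sum() / (m * in_class.sum()))                # eq. (3.5)
    w = in_class.astype(float) + t                                        # y_i(d) + t
    nbcf.append((w @ X[:, j]) / (m * (in_class.sum() + t * n_docs)))      # eq. (4.5)

for name, est in (("NB (3.5)", np.array(nb)), ("NBCF (4.5)", np.array(nbcf))):
    print(f"{name:11s} bias = {est.mean() - theta[i, j]:+.5f}  variance = {est.var():.2e}")
```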

A.3. Proof of Theorem 5.1

Proof First let us fix some notation for constants that do not involve t, to simplify the derivation. Let Θ_ij := θ_ij (1 − θ_ij), A = ∑_{l=1}^{k} p_l θ_lj (1 − θ_lj) and B = ( ∑_{l=1}^{k} p_l θ_lj − θ_ij )². As shown in Eq. (A.2), the squared bias is

  Bias(θ̂^{L1}_ij)² = ( E[θ̂^{L1}_ij] − θ_ij )² = m|S| t² ( ∑_{l=1}^{k} p_l θ_lj − θ_ij )² / ( m|S| (p_i + t)² ) = t² m|S| B / ( m|S| (p_i + t)² ).

From (A.3), the variance is

  Var(θ̂^{L1}_ij) = E[|θ̂^{L1}_ij − E[θ̂^{L1}_ij]|²] = ( p_i (1 + 2t) θ_ij (1 − θ_ij) + t² ∑_{l=1}^{k} p_l θ_lj (1 − θ_lj) ) / ( m|S| (p_i + t)² ) = ( p_i (1 + 2t) Θ_ij + t² A ) / ( m|S| (p_i + t)² ).

Therefore,

  L(θ_ij, c_1, c_2) = ( c_2 p_i (1 + 2t) Θ_ij + t² (c_2 A + c_1 m|S| B) ) / ( m|S| (p_i + t)² ).

We then optimize t to minimize the loss L(θ_ij, c_1, c_2). Taking the derivative with respect to t and setting it to 0, we find

  [ c_2 p_i Θ_ij + t (c_2 A + c_1 m|S| B) ] (p_i + t) − [ c_2 p_i (1 + 2t) Θ_ij + t² (c_2 A + c_1 m|S| B) ] = 0.

That simplifies to

  c_2 (p_i − 1 − t) Θ_ij + t (c_2 A + c_1 m|S| B) = 0,

which shows

  t = c_2 (1 − p_i) Θ_ij / ( c_2 (A − Θ_ij) + c_1 m|S| B ).

Plugging in the original parameters, we obtain

  t = c_2 (1 − p_i) θ_ij (1 − θ_ij) / ( c_2 ∑_{l=1}^{k} p_l θ_lj (1 − θ_lj) − c_2 θ_ij (1 − θ_ij) + c_1 m|S| ( ∑_{l=1}^{k} p_l θ_lj − θ_ij )² ).
◻

References

1. Albright R (2004) Taming text with the SVD. SAS Institute Inc, Cary
2. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
3. Chen J, Matzinger H, Zhai H, Zhou M (2018) Centroid estimation based on symmetric KL divergence for multinomial text classification problem. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 1174–1177
4. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
5. Dhillon IS, Mallela S, Kumar R (2003) A divisive information-theoretic feature clustering algorithm for text classification. J Mach Learn Res 3(Mar):1265–1287
6. Feng X, Qin H, Shi Q, Zhang Y, Zhou F, Haochen W, Ding S, Niu Z, Yan L, Shen P (2014) Chrysin attenuates inflammation by regulating M1/M2 status via activating PPARγ. Biochem Pharmacol 89(4):503–514
7. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29(2–3):131–163
8. Günal S, Ergin S, Bilginer GM, Gerek Ö (2006) On feature extraction for spam e-mail detection. In: International workshop on multimedia content representation, classification and security. Springer, pp 635–642
9. Hidalgo JMG, Rodriguez MdB (1997) Integrating a lexical database and a training collection for text categorization. arXiv preprint arXiv:cmp-lg/9709004
10. Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp 289–296
11. Idris I, Selamat A, Omatu S (2014) Hybrid email spam detection model with negative selection algorithm and differential evolution. Eng Appl Artif Intell 28:97–110
12. Jiang L, Li C, Wang S, Zhang L (2016) Deep feature weighting for Naive Bayes and its application to text classification. Eng Appl Artif Intell 52:26–39
13. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: European conference on machine learning. Springer, pp 137–142
14. Lang K. 20 newsgroups data set. http://qwone.com/~jason/20Newsgroups/
15. Langley P, Iba W, Thompson K et al (1992) An analysis of Bayesian classifiers. AAAI 90:223–228
16. Lewis DD. Reuters-21578. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
17. Li F, Yang Y (2003) A loss function analysis for classification methods in text categorization. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 472–479
18. Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101
19. Medhat W, Hassan A, Korashy H (2014) Sentiment analysis algorithms and applications: a survey. Ain Shams Eng J 5(4):1093–1113

20. Mladenic D, Grobelnik M (1998) Word sequences as features in text-learning. In: Proceedings of the 17th electrotechnical and computer science conference (ERK98). Citeseer
21. Narayanan V, Arora I, Bhatia A (2013) Fast and accurate sentiment classification using an enhanced naive bayes model. In: International conference on intelligent data engineering and automated learning. Springer, pp 194–201
22. Nguyen N, Yamada K, Suzuki I, Unehara M (2018) Hierarchical scheme for assigning components in multinomial naive bayes text classifier. In: 2018 Joint 10th international conference on soft computing and intelligent systems (SCIS) and 19th international symposium on advanced intelligent systems (ISIS), pp 335–340
23. Nigam K, Lafferty J, McCallum A (1999) Using maximum entropy for text classification. In: IJCAI-99 workshop on machine learning for information filtering, vol 1, pp 61–67
24. Rill S, Reinel D, Scheidt J, Zicari RV (2014) Politwi: early detection of emerging political topics on twitter and the impact on concept-level sentiment analysis. Knowl-Based Syst 69:24–33
25. Saraç E, Özel SA (2014) An ant colony optimization based feature selection for web page classification. Sci World J. https://doi.org/10.1155/2014/649260
26. Su J, Shirab JS, Matwin S (2011) Large scale text classification using semi-supervised multinomial Naive Bayes. In: Proceedings of the 28th international conference on machine learning (ICML-11). Citeseer, pp 97–104
27. Tang B, Kay S, He H (2016) Toward optimal feature selection in Naive Bayes for text categorization. IEEE Trans Knowl Data Eng 28(9):2508–2521
28. Tang D, Qin B, Liu T (2015) Document modeling with gated recurrent neural network for sentiment classification. In: Proceedings of the 2015 conference on empirical methods in natural language processing, pp 1422–1432
29. Uysal AK, Gunal S, Ergin S, Sora Gunal E (2013) The impact of feature extraction and selection on SMS spam filtering. Elektron Elektrotech 19(5):67–72
30. Uysal AK (2016) An improved global feature selection scheme for text classification. Expert Syst Appl 43:82–92
31. Venkatesh RKV (2018) Classification and optimization scheme for text data using machine learning naive bayes classifier. In: 2018 IEEE world symposium on communication engineering (WSCE), pp 33–36
32. Wu Y (2018) A new instance-weighting Naive Bayes text classifiers. In: 2018 IEEE international conference of intelligent robotic and control engineering (IRCE), pp 198–202
33. Zhang L, Jiang L, Li C, Kong G (2016) Two feature weighting approaches for Naive Bayes text classifiers. Knowl-Based Syst 100:137–144

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
