Abstract
The Naive Bayes (NB) estimator is widely used in text classification problems. However, it does not perform well with small
training datasets. Most previous literature focuses on either creating and modifying features or combining clustering
to improve the performance of NB. We tackle the problem directly by constructing a new estimator, called Naive Bayes
with correlation factor. We introduce a correlation factor to the NB estimator that incorporates the overall correlation among
the different classes. This effectively exploits the idea of bootstrapping, which reuses data for all classes even though each
document belongs to only one class. Moreover, we obtain a formula for the optimal correlation factor by balancing the bias and
variance of the estimator. Experimental results on real-world data show that our estimator achieves better accuracy than
traditional Naive Bayes while maintaining the simplicity of NB.
Keywords Naive Bayes · Correlation factor · Text classification · Insufficient training set
* Zhibo Dai, zdai37@gatech.edu; Jiangning Chen, jchen444@math.gatech.edu; Juntao Duan, jt.duan@gatech.edu; Heinrich Matzinger,
matzi@math.gatech.edu; Ionel Popescu, ipopescu@math.gatech.edu | 1School of Mathematics, Georgia Institute of Technology, Atlanta,
GA 30313, USA.
Received: 31 May 2019 / Accepted: 23 August 2019 / Published online: 31 August 2019
Research Article SN Applied Sciences (2019) 1:1129 | https://doi.org/10.1007/s42452-019-1153-5
given the class label by applying Bayes' rule. The prediction of the class is done by choosing the highest posterior probability. When we take into account more factors, such as the order of the sequence and the meaning of words, and are given a large enough data set, we can use deep learning models such as recurrent neural networks [18, 28].

For unsupervised problems, [1] proposed SVD (singular value decomposition) for dimension reduction, followed by a general clustering algorithm such as K-means. There also exist algorithms based on the EM algorithm, such as pLSA (probabilistic latent semantic analysis) [10], which treats the probability of each co-occurrence of words and documents as a mixture of conditionally independent multinomial distributions. The parameters in pLSA cannot be derived in closed form, so the standard EM algorithm is used for estimation. Using the same idea, but assuming that the topic distribution has a sparse Dirichlet prior, [2] proposed LDA (latent Dirichlet allocation). The sparse Dirichlet priors encode the intuition that documents cover only a small set of topics and that topics frequently use only a small set of words. In practice, this results in better disambiguation of words and a more precise assignment of documents to topics.

This paper focuses on the performance of the Naive Bayes approach for text classification problems. Although it is a widely used method, it requires plenty of well-labelled data for training. Moreover, its conditional independence assumption rarely holds in reality. There is some research on how to relax this restriction, such as the feature-weighting approach [12, 33] and the instance-weighting approach [32]. [3] proposed a method that finds a better estimate of the centroid, which helps improve the accuracy of Naive Bayes estimation. To tackle the situation where there is not enough labelled data for each class, we propose a novel estimation method. Different from other feature-weighting approaches, our method not only relaxes the conditional independence assumption, but also improves performance compared to traditional Naive Bayes when the labelled training data for each class is insufficient. The key part of our method is the introduction of a correlation factor, which combines feature information taken from different classes.

The remainder of this paper is organized as follows. Section 2 details the general setting for the whole paper. In Sect. 3, we give a detailed review of the Naive Bayes estimator; its error is discussed in Theorem 3.1. In Sect. 4, we derive our proposed method, Naive Bayes with correlation factor (NBCF) (see (4.5)). It improves on traditional Naive Bayes in the situation where only limited training data is available, which is common in many real-world text classification applications. Furthermore, in Theorem 4.1, we show that the NBCF error is controlled by the correlation factor and that the variance has a smaller order compared with the Naive Bayes estimator. Section 5 is devoted to finding the optimal correlation factor; we also show that the order of the error is smaller than for traditional Naive Bayes. In Sect. 6, we show the results of our simulations, which demonstrate the performance of the method presented in Sect. 4. Finally, Sect. 7 concludes our work and mentions possible future work.

2 General setting

Consider a classification problem with the sample (document) set $S$, and the class set $C$ with $k$ different classes:

$$C = \{C_1, C_2, \ldots, C_k\}.$$

Assume we have in total $v$ different words; thus for each document $d \in S$, we have:

$$d = \{x_1, x_2, \ldots, x_v\},$$

where $x_j$ represents the number of occurrences of the $j$-th word. Define $y = (y_1, y_2, \ldots, y_k)$ as our label vector. For a document $d$ in class $C_i$, we have $y_i(d) = 1$. Notice that for a single-label problem, we have $\sum_{i=1}^{k} y_i = 1$.

For a test document $d$, our target is to predict:

$$\hat{y}(d) = f(d;\theta) = (f_1(d;\theta), f_2(d;\theta), \ldots, f_k(d;\theta))$$

given the training sample set $S$, where $\theta$ is the parameter matrix and $f_i(d;\theta)$ is the likelihood function of document $d$ in class $C_i$. Detailed proofs of the theorems can be found in the "Appendix".

3 Naive Bayes classifier in text classification problem

In this section we discuss the properties of the Naive Bayes estimator. Traditional Naive Bayes uses a simple probabilistic model to infer the most likely class of an unknown document. Let class $C_i$ ($1 \le i \le k$) have centroid $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{iv})$, where $\theta_i$ satisfies $\sum_{j=1}^{v} \theta_{ij} = 1$. Assuming independence of the words, the most likely class for a document $d$ is computed as:

$$\begin{aligned}
\mathrm{label}(d) &= \arg\max_i P(C_i)\,P(d \mid C_i) \\
&= \arg\max_i P(C_i) \prod_{j=1}^{v} \theta_{ij}^{x_j} \\
&= \arg\max_i \; \log P(C_i) + \sum_{j=1}^{v} x_j \log \theta_{ij}.
\end{aligned} \tag{3.1}$$

This gives the classification criterion once $\theta$ is estimated, namely finding the largest among the $k$ quantities in (3.1).
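The setting of Sect. 2, with each document stored as a word-count vector $x$ and a one-hot label vector $y$, can be sketched as follows; the toy corpus and its counts are hypothetical, chosen only to satisfy the normalized-length assumption:

```python
import numpy as np

# Hypothetical toy corpus: v = 4 words, k = 2 classes.
# Each document d is its word-count vector (x_1, ..., x_v);
# each label y is a one-hot vector over the k classes.
docs = np.array([
    [3, 1, 0, 1],   # document in class C_1
    [2, 2, 1, 0],   # document in class C_1
    [0, 1, 3, 1],   # document in class C_2
])
labels = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
])

# Single-label constraint: each y has exactly one component equal to 1.
assert np.all(labels.sum(axis=1) == 1)

# Normalized document length: sum_j x_j = m for every document,
# as assumed in Theorems 3.1 and 4.1 (here m = 5).
m = docs.sum(axis=1)
assert np.all(m == 5)
```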
Now we shall derive a maximum likelihood estimator for $\theta$. For a class $C_i$, we have the standard likelihood function:

$$L(C_i, \theta) = \prod_{d \in S} f_i(d;\theta)^{y_i(d)} = \prod_{d \in C_i} \prod_{j=1}^{v} \theta_{ij}^{x_j}. \tag{3.2}$$

Taking the logarithm of both sides, we obtain the log-likelihood function:

$$\log L(C_i, \theta) = \sum_{d \in C_i} \sum_{j=1}^{v} x_j \log \theta_{ij}. \tag{3.3}$$

We would like to solve the optimization problem:

$$\begin{aligned}
\max \quad & \log L(C_i, \theta) \\
\text{subject to:} \quad & \sum_{j=1}^{v} \theta_{ij} = 1, \\
& \theta_{ij} \ge 0.
\end{aligned} \tag{3.4}$$

The problem (3.4) can be solved explicitly by Lagrange multipliers; for class $C_i$, we have $\theta_i = \{\theta_{i1}, \theta_{i2}, \ldots, \theta_{iv}\}$, where:

$$\hat{\theta}_{ij} = \frac{\sum_{d \in C_i} x_j}{\sum_{d \in C_i} \sum_{j=1}^{v} x_j}. \tag{3.5}$$

For the estimator $\hat{\theta}$, we have the following theorem.

Theorem 3.1 Assume we have normalized the length of each document, that is, $\sum_{j=1}^{v} x_j = m$ for all documents $d \in S$. Then the estimator (3.5) satisfies the following properties:

(1) $\hat{\theta}_{ij}$ is unbiased.
(2) $E[|\hat{\theta}_{ij} - \theta_{ij}|^2] = \frac{\theta_{ij}(1-\theta_{ij})}{|C_i| m}$.

Naive Bayes with a multinomial prior distribution makes a strong assumption about the data: it assumes that words in documents are independent. However, this assumption clearly does not hold in real-world text. There are many dif- […]

4 Naive Bayes with correlation factor

From Theorem 3.1, we can see that the traditional Naive Bayes estimator $\hat{\theta}$ is an unbiased estimator with variance $O\!\left(\frac{\theta_{ij}(1-\theta_{ij})}{|C_i| m}\right)$. Now we will try to find an estimator, and prove that it can perform better than the traditional Naive Bayes estimator.

There are two main approaches to improving the Naive Bayes model: modifying the features and modifying the model. Many researchers have proposed approaches that modify the document representation in order to better fit the assumptions made by Naive Bayes. These include extracting more complex features such as syntactic or statistical phrases [20], extracting features using word clustering [5], and exploiting relations using lexical resources [9]. We propose an approach that modifies the probabilistic model. Therefore, our model should work well together with other document representation modifications to achieve a better result.

Our basic idea is that, even for a single-label problem, a document $d$ usually contains words appearing in different classes, and thus it should include some information from different classes. However, our label $y$ in the training set does not reflect that information, because only one component of $y$ is 1 and all others are 0. We would like to replace $y$ by $y + t$ in the Naive Bayes likelihood function (3.2), with some optimized $t$, to get our new likelihood function $L_1$:

$$L_1(C_i, \theta) = \prod_{d \in S} f_i(d;\theta)^{y_i(d)+t} = \prod_{d \in S} \left( \prod_{j=1}^{v} \theta_{ij}^{x_j} \right)^{y_i(d)+t}. \tag{4.1}$$

By introducing the correlation factor $t$, we include more information between the document and the classes, which improves the classification accuracy.

Notice that to compute $L_1$ of a given class $C_i$ in our estimator, instead of using only the documents in $C_i$ as the Naive Bayes estimator does, we will use every $d \in S$.

Taking the logarithm of both sides of (4.1), we obtain the log-likelihood function:

$$\log L_1(C_i, \theta) = \sum_{d \in S} \left[ (y_i(d)+t) \sum_{j=1}^{v} x_j \log \theta_{ij} \right]. \tag{4.2}$$
We would like to solve the optimization problem:

$$\begin{aligned}
\max \quad & \log L_1(C_i, \theta) \\
\text{subject to:} \quad & \sum_{j=1}^{v} \theta_{ij} = 1, \\
& \theta_{ij} \ge 0.
\end{aligned} \tag{4.3}$$

By Lagrange multipliers, the first-order conditions are:

$$\begin{cases}
\displaystyle \sum_{d \in S} (y_i(d)+t)\,\frac{x_j}{\theta_{ij}} = \lambda_i, & 1 \le j \le v, \\[4pt]
\displaystyle \sum_{j=1}^{v} \theta_{ij} = 1, & \forall\, 1 \le i \le k.
\end{cases} \tag{4.4}$$

Solving (4.4), we get the solution of the optimization problem (4.3):

$$\hat{\theta}_{ij}^{L_1} = \frac{\sum_{d \in S} (y_i(d)+t)\,x_j}{\sum_{j=1}^{v} \sum_{d \in S} (y_i(d)+t)\,x_j} = \frac{\sum_{d \in S} (y_i(d)+t)\,x_j}{m(|C_i| + t|S|)}. \tag{4.5}$$

For the estimator $\hat{\theta}_{ij}^{L_1}$, we have the following result:

Theorem 4.1 Assume for each class we have prior distributions $p_1, p_2, \ldots, p_k$ with $p_i = |C_i|/|S|$, and we have normalized the length of each document, that is, $\sum_{j=1}^{v} x_j = m$. The estimator (4.5) satisfies the following properties:

(1) $\hat{\theta}_{ij}^{L_1}$ is biased, with $\left|E[\hat{\theta}_{ij}^{L_1}] - \theta_{ij}\right| = O(t)$.
(2) $E\!\left[\left|\hat{\theta}_{ij}^{L_1} - E[\hat{\theta}_{ij}^{L_1}]\right|^2\right] = O\!\left(\frac{1}{m|S|}\right)$.

5 Optimal correlation factor

To find the optimal $t$, we decompose the mean squared error of the estimator:

$$E\left[\left|\hat{\theta}_{ij}^{L_1} - \theta_{ij}\right|^2\right] = E\left[\left|\hat{\theta}_{ij}^{L_1} - E\!\left[\hat{\theta}_{ij}^{L_1}\right]\right|^2\right] + \left(E\!\left[\hat{\theta}_{ij}^{L_1}\right] - \theta_{ij}\right)^2 := \mathrm{Var}\!\left(\hat{\theta}_{ij}^{L_1}\right) + \mathrm{Bias}\!\left(\hat{\theta}_{ij}^{L_1}\right)^2.$$

In practice, we would like to minimize a general linear combination of bias and variance, namely,

$$L(\theta_{ij}, c_1, c_2) = c_1\, \mathrm{Bias}\!\left(\hat{\theta}_{ij}^{L_1}\right)^2 + c_2\, \mathrm{Var}\!\left(\hat{\theta}_{ij}^{L_1}\right). \tag{5.1}$$

Theorem 5.1 The minimum of the loss function (5.1) is achieved at

$$t^* = \frac{c_2 (1-p_i)\,\theta_{ij}(1-\theta_{ij})}{c_2 \sum_{l=1}^{k} p_l \theta_{lj}(1-\theta_{lj}) - c_2\, \theta_{ij}(1-\theta_{ij}) + c_1 m|S| \left(\sum_{l=1}^{k} p_l \theta_{lj} - \theta_{ij}\right)^2}. \tag{5.2}$$

We can see from (5.2) that the optimal correlation factor $t^*$ should be a very small number of order $O\!\left(\frac{1}{m|S|}\right)$. Therefore, by Eq. (A.2), we know the squared bias is

$$\mathrm{Bias}\!\left(\hat{\theta}_{ij}^{L_1}\right)^2 = O(t^2) = O\!\left(\frac{1}{m^2|S|^2}\right).$$
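The NBCF estimate (4.5) and the closed form (5.2) for $t^*$ can be sketched as follows; this is an illustration, not the authors' implementation, and `optimal_t` takes the true centroids and priors as inputs (unknown in practice), so it serves only as a diagnostic:

```python
import numpy as np

def nbcf_fit(docs, labels, t):
    """NBCF estimate (4.5): every document d contributes to every class
    C_i with weight y_i(d) + t, not only the documents inside C_i."""
    weights = labels + t                   # |S| x k matrix of y_i(d) + t
    counts = weights.T @ docs              # k x v weighted word totals
    return counts / counts.sum(axis=1, keepdims=True)

def optimal_t(theta, p, i, j, m, S, c1=1.0, c2=1.0):
    """Evaluate the closed form (5.2) for t*, given the true centroid
    matrix theta (k x v) and the class priors p."""
    mix = p @ theta[:, j]                              # sum_l p_l theta_lj
    var_mix = p @ (theta[:, j] * (1 - theta[:, j]))    # sum_l p_l theta(1-theta)
    num = c2 * (1 - p[i]) * theta[i, j] * (1 - theta[i, j])
    den = (c2 * var_mix - c2 * theta[i, j] * (1 - theta[i, j])
           + c1 * m * S * (mix - theta[i, j]) ** 2)
    return num / den

# With t = 0, (4.5) reduces to the unsmoothed Naive Bayes estimate (3.5).
docs = np.array([[3, 1, 0, 1], [2, 2, 1, 0], [0, 1, 3, 1]], dtype=float)
labels = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
theta_nb = nbcf_fit(docs, labels, 0.0)

# t* is a small positive number of order 1/(m|S|) for these toy values.
t_star = optimal_t(np.array([[0.6, 0.4], [0.2, 0.8]]),
                   np.array([0.5, 0.5]), i=0, j=0, m=50, S=1000)
```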
We have already shown in (A.3) that the variance is of order

$$\mathrm{Var}\!\left(\hat{\theta}_{ij}^{L_1}\right) = O\!\left(\frac{1}{m|S|}\right).$$

Therefore, in the case $c_1 = c_2 = 1$, the expected squared error $E[|\hat{\theta}_{ij}^{L_1} - \theta_{ij}|^2]$ is dominated by the variance. Thus we have the following corollary:

Theorem 5.2 With any selection of $t = O\!\left(\frac{1}{m|S|}\right)$, we have

$$E\left[\left|\hat{\theta}_{ij}^{L_1} - \theta_{ij}\right|^2\right] = O\!\left(\frac{1}{m|S|}\right). \tag{5.3}$$

By Theorem 3.1, we know that for Naive Bayes, […]

For the Reuters-21578 data [16], there are approximately 3000 documents in this sample set. For the 20 Newsgroups data [14], there are 20 groups and approximately 20,000 documents.

We take $t \in (0, 2)$, use 10% of the data for training, and assess the trained model on the remaining test data. In our simulations, we notice that when we choose the correlation factor to be around 0.1, we get the best accuracy for our estimator. See Fig. 1a, b.

Fig. 1 We test the accuracy behavior with respect to different correlation factors on Reuters-21578 (a) and the 20 Newsgroups dataset (b). We take 10% of the data as the training set. The y-axis is the accuracy and the x-axis is the correlation factor t

6.2 Comparison with Naive Bayes

Next, we compare the results of the traditional Naive Bayes estimator (3.5), $\hat{\theta}_{ij}$, and our estimator (4.5), $\hat{\theta}_{ij}^{L_1}$. In this simulation, […]
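The sweep over $t$ behind Fig. 1 can be sketched as follows on an already vectorized corpus. The helper and the tiny corpus are hypothetical; the real experiments use Reuters-21578 and 20 Newsgroups, and no results are implied here:

```python
import numpy as np

def accuracy_for_t(train_docs, train_labels, test_docs, test_y, t):
    """Fit NBCF (4.5) with correlation factor t and score accuracy on a
    held-out set: a sketch of the sweep behind Fig. 1."""
    theta = (train_labels + t).T @ train_docs      # weighted word totals
    theta = theta / theta.sum(axis=1, keepdims=True)
    priors = train_labels.mean(axis=0)
    scores = np.log(priors) + test_docs @ np.log(theta.T)
    return float((scores.argmax(axis=1) == test_y).mean())

# Hypothetical tiny corpus, just to exercise the sweep over t in (0, 2).
Xtr = np.array([[4, 1], [3, 2], [1, 4], [2, 3]], dtype=float)
Ytr = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Xte = np.array([[5, 0], [0, 5]], dtype=float)
yte = np.array([0, 1])

ts = np.arange(0.05, 2.0, 0.05)
accs = [accuracy_for_t(Xtr, Ytr, Xte, yte, t) for t in ts]
```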
Fig. 2 We take the 10 largest groups in the Reuters-21578 dataset (a) and the 20 Newsgroups dataset (b), and take 10% of the data as the training set. The y-axis is the accuracy, and the x-axis is the class index

(a) training set = 90%, behavior on Reuters (b) training set = 90%, behavior on 20 Newsgroups
Fig. 3 We take the 10 largest groups in the Reuters-21578 dataset (a) and the 20 Newsgroups dataset (b), and take 90% of the data as the training set. The y-axis is the accuracy, and the x-axis is the class index

[…] is still more accurate than Naive Bayes on the Reuters data. The reason might be that the sample sizes in the Reuters data are highly unbalanced: even 90% of the data as a training set is still not large enough for many classes.

Finally, we use the same training set with training size 10% and test the accuracy on the training set instead of the test set. We find that the traditional Naive Bayes estimator actually achieves better results, which means it might suffer more from over-fitting. This might be the reason why our method works better when the dataset is not too large: adding the correlation factor $t$ brings some uncertainty into the training process, which helps avoid over-fitting. See Fig. 4a, b.

6.3 Robustness of t for prediction

For estimation purposes, $t$ must satisfy $t = O\!\left(\frac{1}{|S|}\right)$ by Theorem 5.2 in order to find the most accurate parameters. However, it turns out that for prediction purposes, there is a phase-transition phenomenon. As long as $t \ge O\!\left(\frac{1}{|S|}\right)$, the prediction power is not reduced even when we increase $t$ to a very large value (for example, $t = 10^5$). In the simulation finding the best $t$ in Fig. 1a, b, we see the testing accuracy decreases only slightly as $t$ increases from 0.1 to 2. We summarize this fact as follows.
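This robustness can be illustrated on a hypothetical toy corpus (a sketch, not the paper's experiment): as $t$ grows, the NBCF centroids (4.5) flatten toward the global centroid, yet the argmax ordering of the class scores on this example is unchanged even at $t = 10^5$:

```python
import numpy as np

def nbcf_theta(docs, labels, t):
    # NBCF centroids (4.5): weight each document by y_i(d) + t.
    counts = (labels + t).T @ docs
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical two-word, two-class corpus.
docs = np.array([[4, 1], [3, 2], [1, 4], [2, 3]], dtype=float)
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
test = np.array([[5, 0], [0, 5]], dtype=float)

preds = {}
for t in (0.1, 10.0, 1e5):
    scores = test @ np.log(nbcf_theta(docs, labels, t).T)
    preds[t] = scores.argmax(axis=1).tolist()
# At t = 1e5 the two centroids are nearly identical, but the small
# remaining difference still orders the classes the same way here.
```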
Proposition 6.1 For prediction purposes, the correlation factor $t$ can take values in the interval

$$\frac{1}{|S|} \le t \le 1.$$

The reason we restrict the upper bound to 1 is that the effect of the correlation factor should not exceed the effect of the original class label $y_i = 1$.

(a) training set = 10%, training set behavior on Reuters (b) training set = 10%, training set behavior on 20 Newsgroups
Fig. 4 We take the 10 largest groups in the Reuters-21578 dataset (a) and the 20 Newsgroups dataset (b), and take 10% of the data as the training set. We test the result on the training set. The y-axis is the accuracy, and the x-axis is the class index

7 Conclusion

In this paper, we modified the traditional Naive Bayes estimator with a correlation factor to obtain a new estimator, NBCF, which is biased but has a smaller variance. We justified that our estimator has a significant advantage by analyzing the mean squared error. In simulations, we applied our method to real-world text classification problems, and showed that it works better when the training data set is insufficient.

There are several important open problems related to our estimator:

(1) Is our proposed estimator admissible for the squared error loss? Even though we know it outperforms the Naive Bayes estimator, it might not be the optimal one.
(2) Will our estimator still work on other, more independent datasets? We only tested our result on the Reuters data [16] and 20 Newsgroups [14]; these datasets consist of news articles, which means the documents are highly correlated with each other.
(3) We can only use our method on single-label datasets so far; it would be interesting to see if we can extend our result to partially labeled or multi-label datasets.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Appendix A: Proof of the theorems

A.1. Proof of Theorem 3.1

Proof With the assumption $\sum_{j=1}^{v} x_j = m$, we can rewrite (3.5) as:

$$\hat{\theta}_{ij} = \frac{\sum_{d \in C_i} x_j}{\sum_{d \in C_i} m} = \frac{\sum_{d \in C_i} x_j}{|C_i| m}.$$

Since $d = (x_1, x_2, \ldots, x_v)$ follows a multinomial distribution in class $C_i$, we have $E[x_j] = m\theta_{ij}$ and $E[x_j^2] = m\theta_{ij}(1 - \theta_{ij} + m\theta_{ij})$.

(1)

$$E\left[\hat{\theta}_{ij}\right] = E\left[\frac{\sum_{d \in C_i} x_j}{|C_i| m}\right] = \frac{\sum_{d \in C_i} E[x_j]}{|C_i| m} = \frac{\sum_{d \in C_i} m\theta_{ij}}{|C_i| m} = \theta_{ij}.$$

Thus $\hat{\theta}_{ij}$ is unbiased.

(2) By (1), we have:
$$E\left[|\hat{\theta}_{ij} - \theta_{ij}|^2\right] = E\left[\hat{\theta}_{ij}^2\right] - 2\theta_{ij} E\left[\hat{\theta}_{ij}\right] + \theta_{ij}^2 = E\left[\hat{\theta}_{ij}^2\right] - \theta_{ij}^2.$$

Then notice

$$\hat{\theta}_{ij}^2 = \frac{\left(\sum_{d \in C_i} x_j\right)^2}{|C_i|^2 m^2} = \frac{\sum_{d \in C_i} (x_j^d)^2 + \sum_{d \ne d' \in C_i} x_j^d x_j^{d'}}{|C_i|^2 m^2}, \tag{A.1}$$

where $d = (x_1^d, x_2^d, \ldots, x_v^d)$.

Since:

$$E\left[\frac{\sum_{d \in C_i} (x_j^d)^2}{|C_i|^2 m^2}\right] = \frac{|C_i|\, m \theta_{ij}\left(1 - \theta_{ij} + m\theta_{ij}\right)}{|C_i|^2 m^2} = \frac{\theta_{ij}\left(1 - \theta_{ij} + m\theta_{ij}\right)}{|C_i| m},$$

and, by independence of distinct documents, $E[x_j^d x_j^{d'}] = m^2 \theta_{ij}^2$ for $d \ne d'$, thus:

$$E\left[|\hat{\theta}_{ij} - \theta_{ij}|^2\right] = \frac{\theta_{ij}(1-\theta_{ij})}{|C_i| m}. \qquad \square$$

A.2. Proof of Theorem 4.1

Proof

(1) With the assumption $\sum_{j=1}^{v} x_j = m$, we have:

$$\begin{aligned}
E\left[\hat{\theta}_{ij}^{L_1}\right] &= \frac{\sum_{d \in S} (y_i(d)+t)\, E[x_j]}{m(t|S| + |C_i|)} \\
&= \frac{\sum_{d \in S} t\, E[x_j] + \sum_{d \in C_i} E[x_j]}{m(t|S| + |C_i|)} \\
&= \frac{t \sum_{l=1}^{k} |C_l|\, \theta_{lj} + \theta_{ij} |C_i|}{t|S| + |C_i|} \\
&= \frac{t|S| \sum_{l=1}^{k} p_l \theta_{lj} + \theta_{ij} |C_i|}{t|S| + |C_i|}.
\end{aligned}$$

Thus:

$$\left|E\left[\hat{\theta}_{ij}^{L_1}\right] - \theta_{ij}\right| = \frac{t|S| \left|\sum_{l=1}^{k} p_l \theta_{lj} - \theta_{ij}\right|}{t|S| + |C_i|} = \frac{\left|\sum_{l=1}^{k} p_l \theta_{lj} - \theta_{ij}\right|}{1 + p_i/t} = O(t). \tag{A.2}$$

This shows our estimator is biased. The error is controlled by $t$: when $t$ converges to 0, our estimator converges to the unbiased Naive Bayes estimator. We can also derive a lower bound for the squared error:

$$E\left[\left|\hat{\theta}_{ij}^{L_1} - \theta_{ij}\right|^2\right] \ge \left(E\left[\hat{\theta}_{ij}^{L_1}\right] - \theta_{ij}\right)^2 = \frac{\left|\sum_{l=1}^{k} p_l \theta_{lj} - \theta_{ij}\right|^2}{(1 + p_i/t)^2}.$$

(2)

$$\begin{aligned}
E\left[\left|\hat{\theta}_{ij}^{L_1} - E\left[\hat{\theta}_{ij}^{L_1}\right]\right|^2\right]
&= E\left[\left|\frac{\sum_{d \in S} (y_i(d)+t)(x_j - E[x_j])}{m(|C_i| + t|S|)}\right|^2\right] \\
&= \frac{\sum_{d \in S} (y_i(d)+t)^2\, E\!\left(|x_j - E[x_j]|^2\right)}{m^2 (|C_i| + t|S|)^2} \\
&= \frac{\sum_{d \in C_i} (1+t)^2\, m\theta_{ij}(1-\theta_{ij}) + \sum_{d \in C_l,\, l \ne i} t^2\, m\theta_{lj}(1-\theta_{lj})}{m^2 (|C_i| + t|S|)^2} \\
&= \frac{|C_i|(1+2t)\theta_{ij}(1-\theta_{ij}) + \sum_{l=1}^{k} |C_l|\, t^2 \theta_{lj}(1-\theta_{lj})}{m (|C_i| + t|S|)^2} \\
&= \frac{|S| p_i (1+2t)\theta_{ij}(1-\theta_{ij}) + |S| \sum_{l=1}^{k} p_l t^2 \theta_{lj}(1-\theta_{lj})}{m (|S| p_i + t|S|)^2} \\
&= \frac{p_i (1+2t)\theta_{ij}(1-\theta_{ij}) + \sum_{l=1}^{k} p_l t^2 \theta_{lj}(1-\theta_{lj})}{m|S| (p_i + t)^2} \\
&= O\!\left(\frac{1}{m|S|}\right). \qquad \square
\end{aligned} \tag{A.3}$$
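As a sanity check (synthetic data with hypothetical parameters, not taken from the paper), Theorem 3.1(2) can be verified by a quick Monte Carlo simulation:

```python
import numpy as np

# Draw |C_i| documents of length m from Multinomial(m, theta_i), form
# the estimate (3.5), and compare the empirical squared error of one
# coordinate against theta_ij (1 - theta_ij) / (|C_i| m).
rng = np.random.default_rng(0)
theta_i = np.array([0.5, 0.3, 0.2])   # hypothetical class centroid
n_docs, m, trials = 20, 50, 4000      # |C_i|, document length, repetitions

sq_errs = []
for _ in range(trials):
    docs = rng.multinomial(m, theta_i, size=n_docs)
    theta_hat = docs.sum(axis=0) / (n_docs * m)   # estimator (3.5)
    sq_errs.append((theta_hat[0] - theta_i[0]) ** 2)

empirical = float(np.mean(sq_errs))
predicted = theta_i[0] * (1 - theta_i[0]) / (n_docs * m)
# empirical and predicted should agree up to Monte Carlo noise.
```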
20. Mladenic D, Grobelnik M (1998) Word sequences as features in text-learning. In: Proceedings of the 17th electrotechnical and computer science conference (ERK98). Citeseer
21. Narayanan V, Arora I, Bhatia A (2013) Fast and accurate sentiment classification using an enhanced naive bayes model. In: International conference on intelligent data engineering and automated learning. Springer, pp 194–201
22. Nguyen N, Yamada K, Suzuki I, Unehara M (2018) Hierarchical scheme for assigning components in multinomial naive bayes text classifier. In: 2018 Joint 10th international conference on soft computing and intelligent systems (SCIS) and 19th international symposium on advanced intelligent systems (ISIS), pp 335–340
23. Nigam K, Lafferty J, McCallum A (1999) Using maximum entropy for text classification. In: IJCAI-99 workshop on machine learning for information filtering, vol 1, pp 61–67
24. Rill S, Reinel D, Scheidt J, Zicari RV (2014) Politwi: early detection of emerging political topics on twitter and the impact on concept-level sentiment analysis. Knowl-Based Syst 69:24–33
25. Saraç E, Özel SA (2014) An ant colony optimization based feature selection for web page classification. Sci World J. https://doi.org/10.1155/2014/649260
26. Su J, Shirab JS, Matwin S (2011) Large scale text classification using semi-supervised multinomial Naive Bayes. In: Proceedings of the 28th international conference on machine learning (ICML-11). Citeseer, pp 97–104
27. Tang B, Kay S, He H (2016) Toward optimal feature selection in Naive Bayes for text categorization. IEEE Trans Knowl Data Eng 28(9):2508–2521
28. Tang D, Qin B, Liu T (2015) Document modeling with gated recurrent neural network for sentiment classification. In: Proceedings of the 2015 conference on empirical methods in natural language processing, pp 1422–1432
29. Uysal AK, Gunal S, Ergin S, Sora Gunal E (2013) The impact of feature extraction and selection on sms spam filtering. Elektron Elektrotech 19(5):67–72
30. Uysal AK (2016) An improved global feature selection scheme for text classification. Expert Syst Appl 43:82–92
31. Venkatesh, RKV (2018) Classification and optimization scheme for text data using machine learning nave bayes classifier. In: 2018 IEEE world symposium on communication engineering (WSCE), pp 33–36
32. Wu Y (2018) A new instance-weighting Naive Bayes text classifiers. In: 2018 IEEE international conference of intelligent robotic and control engineering (IRCE), pp 198–202
33. Zhang L, Jiang L, Li C, Kong G (2016) Two feature weighting approaches for Naive Bayes text classifiers. Knowl-Based Syst 100:137–144

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.