
2015 7th International Conference on Intelligent Human-Machine Systems and Cybernetics

Large Scale Text Classification using Map Reduce and Naive Bayes Algorithm for
Domain Specified Ontology Building

Joan Santoso∗†, Eko Mulyanto Yuniarno∗, and Mochamad Hariadi∗


∗ Department of Electrical Engineering, Faculty of Industrial Technology
Institut Teknologi Sepuluh Nopember
Surabaya, Indonesia
† Department of Informatics Engineering
Sekolah Tinggi Teknik Surabaya
Surabaya, Indonesia
Email: joan.santoso@gmail.com, ekomulyanto@ee.its.ac.id, mochar@ee.its.ac.id

Abstract—The Internet, which covers a large amount of information, gives us an opportunity to obtain knowledge from it. The Internet contains large amounts of unstructured and unorganized data such as text, video, and images. Problems arise in how to organize such a large amount of data and obtain useful information from it. This information can be used as knowledge in an intelligent computer system. Ontology, as one form of knowledge representation, covers a large range of topics. To construct a domain-specific ontology, we use a large text dataset from the Internet and organize it into a specified domain before the ontology building process is done. In our research we implement a naive bayes text classifier using the map reduce programming model to organize our large text dataset. In this experiment, we use animal and plant domain articles from the Wikipedia online encyclopedia as our dataset. Our proposed method achieves a highest accuracy of about 98.8%. The experiment shows that our proposed method provides a robust system and good accuracy for classifying documents into a specified domain in the preprocessing phase of domain specified ontology building.

Keywords—Naive Bayes; Map Reduce; Big Data; Text Classification; Domain Specified Text Classification

I. INTRODUCTION

The Internet provides a large resource of data and information. In everyday life, people use the Internet to share information on personal blogs and websites, to upload and watch videos, and to download or upload pictures on social media. Because of this user activity, the amount of data on the Internet keeps growing, and much of it is unstructured and unorganized data such as text, video, and images.

Big Data delivers a new paradigm for processing large, unstructured, and complex data efficiently. Big data problems are defined in [1] as how to process a large amount of data that is complex and difficult for traditional systems to handle. This paradigm challenges us to organize the large data on the Internet, especially text data, and to obtain information that can be represented as knowledge in an intelligent computer system.

Many types of knowledge representation have been developed recently for representing domain-specific information or topics. An ontology is defined as a representation of concepts and of each relation between them. To build a domain-specific ontology, we need to obtain data that belong to that domain. The Internet provides large data with information from multiple domains. We need to manage those data by classifying them into a specified domain before we extract and build the ontology. We therefore use a text classification algorithm to organize text data from the Internet before our domain-specific ontology building process is done.

Text classification is the process of categorizing text into a specific label or class. Much research has been carried out on automatic classification using machine learning algorithms. Zheng et al. [2] use it for classifying Chinese news documents, and Rennie et al. [3] provide optimizations of Naive Bayes for text classification. The main objective of our research is to develop a robust text classification system that can handle large data by implementing the naive bayes text classification algorithm using the map reduce programming model.

Map Reduce, proposed in [13] and also described in [12], is a technique used to perform Big Data processing. Map Reduce programs are written in a functional programming style, which makes it possible to run them in parallel. The Map Reduce method divides a job into two steps, i.e. map and reduce. The map function divides the existing work into smaller processes by generating intermediate key and value pairs. The reduce function combines the intermediate values that share the same intermediate key.
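
As an illustration of this programming model (our sketch, not code from the paper), the following plain Python fragment simulates the two steps for a simple word-count job; the names map_fn, reduce_fn, and run_mapreduce are our own.

from collections import defaultdict

def map_fn(filename, document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: combine all intermediate values that share one key.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for filename, document in inputs:
        for key, value in map_fn(filename, document):
            groups[key].append(value)  # the shuffle/sort phase
    return dict(pair for key, values in groups.items()
                for pair in reduce_fn(key, values))

print(run_mapreduce([("d1.txt", "map reduce map")], map_fn, reduce_fn))
# prints {'map': 2, 'reduce': 1}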

We divide this paper into five sections. The first section presents the background and objective of our research. Related work is reviewed in the second section. In the third section we define our proposed method, and the fourth section describes our experiment dataset and experiment scenarios. The last section provides our conclusions and plans for further research.

II. RELATED WORKS

Text classification is used in many applications, such as sentiment analysis [4], emotion identification from Twitter [5], and spam filtering [6], [7]. Other research on text classification for the Indonesian language has been developed in [8] and [9].
Rachmania et al. [8] provide classification for Indonesian news documents, and Laksana et al. [9] provide classification for Indonesian tweets on Twitter. Research that uses Naive Bayes as the classifier can be found in [10], [7], and [11]. Following these works, we also use Naive Bayes as the classifier algorithm in our research. We use this algorithm because Naive Bayes is widely used in text classification and has two advantages: it is simple to implement, and it performs well on text classification tasks.

Several Map Reduce performance results for handling big data problems are reported in [1]; in 2009, for example, Yahoo successfully sorted 500GB of data in only 59 seconds using Map Reduce. Because of this advantage, Map Reduce has been adopted by much research on implementing machine learning algorithms in the map reduce style. The research in [14] shows that many machine learning algorithms can be implemented using Map Reduce, such as Naive Bayes, k-Means, Logistic Regression, Principal Components Analysis, Neural Networks, etc. We design our Naive Bayes algorithm in the map reduce programming style because we believe this classifier can give us both good speed and good accuracy in the classification process.

III. PROPOSED METHOD

We present our proposed method in this section. It is divided into two parts: the first part is the preprocessing of our data, and the second part is the text classification that we use in this research. The architecture of our system is shown in figure 1.

[Figure 1. Architecture System]

Our classification process is divided into two phases. The first phase learns from annotated documents to build the classifier knowledge, and the second phase classifies new data into a specific class. Preprocessing is applied in both the learning and the classification phase.

A. Preprocessing

Our preprocessing in this research consists of four processes, shown in detail in figure 2.

[Figure 2. Preprocessing Process]

The first process is content extraction: this step extracts the content of a website by removing the unimportant parts of the web page, using some handmade rules. The second process is tokenization, which breaks a document up into tokens such as words, numbers, or punctuation; we developed some regular expressions to support this step. The third process is stopword removal, in which common words that usually appear in documents are removed using the dictionary list from [15]. The fourth and last process in our preprocessing phase is stemming, which maps the different morphological variants of a word to its base word; we use the Indonesian language stemmer from [15]. We implement this preprocessing using map reduce in algorithms 1 and 2.

Algorithm 1 Map Function for Preprocessing Algorithm
1: Input : filename, document
2: Output : Intermediate key, Intermediate value
3: token_list ← getToken(document)
4: for each token t in token_list do
5:   if t not in stopword then
6:     content ← content + getStemmingResult(t)
7:   end if
8: end for
9: emitIntermediate(filename, content)

Algorithm 2 Reduce Function for Preprocessing Algorithm
1: Input : key, list(value)
2: Output : key, value
3: for each value v in list(value) do
4:   content ← content + v
5: end for
6: emit(key, content)
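
The following is a minimal sketch of algorithms 1 and 2 in plain Python (our illustration, not the authors' Hadoop code); the stopword set and the stem() stub are placeholders for the Indonesian resources from [15].

import re

STOPWORDS = {"dan", "yang", "di"}  # placeholder stopword list

def stem(token):
    # Placeholder for the Indonesian stemmer from [15].
    return token.lower()

def preprocess_map(filename, document):
    # Map: tokenize into words, numbers, and punctuation, drop
    # stopwords, stem, and emit one (filename, content) pair.
    tokens = re.findall(r"\w+|[^\w\s]", document)
    content = " ".join(stem(t) for t in tokens
                       if t.lower() not in STOPWORDS)
    yield (filename, content)

def preprocess_reduce(key, values):
    # Reduce: concatenate the partial contents of one file.
    yield (key, " ".join(values))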

B. Naive Bayes Text Classification

Our text classification phase is divided into two phases. The first is the learning process that builds the classifier knowledge, and the second is document classification. Feature extraction is needed to help the classifier learn from or classify a document; we use the standard bag-of-words representation as the feature in our classification algorithm. The details of our text classification process are shown in figure 3.

[Figure 3. Text Classification Process]

Naive Bayes is widely used as a text classification algorithm. The classifier applies Bayes' theorem with a naive assumption of class-conditional independence between the features, and it uses probabilities to find the output target class with the maximum posterior probability. We define the naive bayes classifier model in (1), and we can compute P(c_i|D) using Bayes' theorem as in (2).

\hat{y} = \arg\max_{c_i \in C} P(c_i \mid D)    (1)

Where:
• D : document
• c_i : target class c_i in C
• C : all target classes in the dataset
• P(c_i|D) : posterior probability of class c_i given document D

P(c \mid D) = \frac{P(D \mid c)\, P(c)}{P(D)}    (2)

Research on naive bayes classifiers usually drops the denominator, since it is constant across all classes, so equation (2) reduces to (3).

P(c \mid D) = P(D \mid c)\, P(c)    (3)

Where:
• D : document
• c : target class
• P(D|c) : posterior probability of document D given target class c
• P(c) : prior probability of target class c in the dataset

The prior probability P(c) is computed with (4).

P(c) = \frac{N_c}{N}    (4)

Where:
• N_c : total number of instances of target class c in the dataset
• N : total number of instances in the dataset

A document D is represented by bag-of-words features, so we can define D = {w_1, w_2, w_3, ..., w_n}, where n is the total number of features in the document. The posterior probability P(D|c) is computed with (5) and (6). In (6) we apply Laplace smoothing, a smoothing algorithm that is widely used in text classification.

P(D \mid c) = \prod_{w_i \in D} P(w_i \mid c)    (5)

Where:
• D : document D
• c : target class
• w_i : word in document D
• P(w_i|c) : posterior probability of word w_i given class c

P(w_i \mid c) = \frac{count(w_i, c) + 1}{count(c) + |V|}    (6)

Where:
• |V| : vocabulary size
• count(w_i, c) : total count of term w_i in class c
• count(c) : total number of terms in class c
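
As a quick numeric illustration of equations (3)-(6) (our toy example, not from the paper), the fragment below trains on one document per class and scores a new document.

from collections import Counter

train = {"animal": "cat hunts mouse".split(),
         "plant":  "fern grows leaf".split()}

vocab = {w for words in train.values() for w in words}
n_docs = len(train)                      # N in equation (4)

def score(doc, c):
    prob = 1 / n_docs                    # prior P(c) = N_c / N, N_c = 1 here
    counts, total = Counter(train[c]), len(train[c])
    for w in doc:                        # equations (5) and (6)
        prob *= (counts[w] + 1) / (total + len(vocab))
    return prob                          # equation (3), unnormalized

doc = "cat hunts leaf".split()
print(max(train, key=lambda c: score(doc, c)))  # prints 'animal'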

Our naive bayes classifier implementation in map reduce is based on the research in [14]. The implementation is divided into two phases: the first phase produces the classifier knowledge from the training data, and the second phase performs the classification itself. We implement our training phase using algorithms 3 and 4.

Algorithm 3 Map Function for Naive Bayes Learning Algorithm
1: Input : label, document
2: Output : Intermediate key, Intermediate value
3: emitIntermediate(tclass + " " + label, 1)
4: for each term in document do
5:   emitIntermediate(label + " " + term, 1)
6: end for

Algorithm 4 Reduce Function for Naive Bayes Learning Algorithm
1: Input : key, list(value)
2: Output : key, value
3: for v in list(value) do
4:   result ← result + v
5: end for
6: emit(key, result)
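
A plain Python sketch of this training phase follows (our illustration, not the authors' Hadoop code); "tclass" is the marker key under which the map function counts training documents per class label.

from collections import defaultdict

def train_map(label, document):
    yield ("tclass " + label, 1)       # one count per training document
    for term in document.split():
        yield (label + " " + term, 1)  # one count per (class, term) pair

def train_reduce(key, values):
    yield (key, sum(values))           # total count for each key

def train(corpus):
    groups = defaultdict(list)
    for label, document in corpus:
        for key, value in train_map(label, document):
            groups[key].append(value)
    return dict(pair for key, values in groups.items()
                for pair in train_reduce(key, values))

model = train([("animal", "cat hunts mouse"),
               ("plant", "fern grows leaf")])
# model maps e.g. "tclass animal" -> 1 and "animal cat" -> 1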

For the classification process, we also use the map reduce programming style to label our large text dataset. The details of our text classification implementation are shown in algorithms 5 and 6.

Algorithm 5 Map Function for Naive Bayes Classification Algorithm
1: Input : document, learning knowledge, list of class
2: Output : Intermediate key, Intermediate value
3: nb_model ← buildClassifier(learning knowledge)
4: for each c in list of class do
5:   prior ← nb_model.getPriorProbability(c)
6:   posterior ← 1
7:   for each token in document do
8:     temp ← nb_model.getPosterior(c, token)
9:     posterior ← posterior × temp
10:  end for
11:  probability ← prior × posterior
12:  emitIntermediate(document, (c, probability))
13: end for

Algorithm 6 Reduce Function for Naive Bayes Classification Algorithm
1: Input : key, list(value)
2: Output : key, value
3: maxValue ← −1
4: for each value in list(value) do
5:   if value.probability ≥ maxValue then
6:     maxValue ← value.probability
7:     tclass ← value.class
8:   end if
9: end for
10: emit(key, tclass)
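
The sketch below mirrors algorithms 5 and 6 in plain Python (our illustration, using the count layout from the training sketch above; all helper names are ours).

model = {"tclass animal": 1, "animal cat": 1, "animal hunts": 1,
         "animal mouse": 1, "tclass plant": 1, "plant fern": 1,
         "plant grows": 1, "plant leaf": 1}

def classify_map(document, model, classes):
    # Map: emit (document, (class, probability)) for every class.
    n_docs = sum(model["tclass " + c] for c in classes)
    vocab = {k.split()[1] for k in model if not k.startswith("tclass ")}
    for c in classes:
        prob = model["tclass " + c] / n_docs      # prior, equation (4)
        total = sum(v for k, v in model.items() if k.startswith(c + " "))
        for token in document.split():            # equations (5) and (6)
            prob *= (model.get(c + " " + token, 0) + 1) / (total + len(vocab))
        yield (document, (c, prob))

def classify_reduce(key, values):
    # Reduce: keep the class with the highest probability (algorithm 6).
    yield (key, max(values, key=lambda cv: cv[1])[0])

pairs = classify_map("cat hunts leaf", model, ["animal", "plant"])
print(next(classify_reduce("cat hunts leaf", [cv for _, cv in pairs])))
# prints ('cat hunts leaf', 'animal')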

IV. EXPERIMENTS

We present our experiments in this section. We use a single-node machine with hadoop configured in pseudo-distributed mode. The section is divided into two parts: the first describes our dataset and the second our experiment results.

A. Dataset

Our dataset consists of articles from the Wikipedia online encyclopedia. We use 2000 Wikipedia articles in this experiment, drawn from two domains, i.e. animal and plant. The statistics of the training dataset are given in table I.

Table I
TEXT DATASET FOR TRAINING

No  Dataset         Articles
1   Plant Domain    1000
2   Animal Domain   1000
3   Total           2000

We use the data in table I as the training dataset for our classifier. For our experiments, we use three testing datasets consisting of documents belonging to the animal or plant domain; these are used in our three experiment scenarios. The statistics of the testing datasets are given in table II.

Table II
TEXT DATASET FOR EXPERIMENTS

No  Experiment        Articles
1   First Scenario    250
2   Second Scenario   500
3   Third Scenario    1000

B. Experiment Result

Our learning process is fast: it takes 31 seconds to learn from the 2000 training articles. We divide our experiments into three scenarios: in the first scenario we classify 250 documents, in the second 500 documents, and in the third 1000 documents. The accuracy and running time of each experiment are given in table III.

Table III
EXPERIMENTS RESULT

No  Experiment        Accuracy   Running Time
1   First Scenario    98.4%      1 min 31 sec
2   Second Scenario   98.8%      2 min 18 sec
3   Third Scenario    98.1%      3 min 35 sec

The first experiment shows that our classifier achieves 98.4% accuracy for classifying documents into the animal and plant domains. The second experiment achieves 98.8%, and the third experiment achieves 98.1%. Our proposed system thus reaches its highest accuracy, 98.8%, in the second experiment. These results show that our proposed method provides a robust text classification system for assigning articles to a specified domain in the preprocessing phase of domain-specific ontology building.

V. CONCLUSION AND FURTHER RESEARCH

The Internet provides a lot of data, and many data types on the Internet are unstructured, such as image, video, text, and sound. We believe that text classification can be used to organize large unstructured data on the Internet into structured data in a specified domain. Our proposed method, which uses the Naive Bayes algorithm in the Map Reduce programming model as the classifier, achieves a highest accuracy of about 98.8%. The experiments show that our proposed method provides a robust system with good accuracy for classifying documents into a specified domain as the preprocessing phase of domain-specific ontology building.

Articles usually do not belong to only one topic; sometimes they also belong to other topics. In further research we will therefore improve our algorithm so that it can classify documents with multi-label classes. We will also try to implement a real-time text classification algorithm that classifies documents into a specified domain; this process will serve as the preprocessing step in our automatic domain-specific ontology building process.

ACKNOWLEDGMENT

We would like to thank our partners at Institut Teknologi Sepuluh Nopember Surabaya for their helpful comments. We also thank our partners from the computational linguistics research group at Sekolah Tinggi Teknik Surabaya for providing us with some tools and several datasets for this research.

REFERENCES

[1] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.

[2] G. Zheng and Y. Tian, "Chinese web text classification system model based on naive bayes," in E-Product E-Service and E-Entertainment (ICEEE), 2010 International Conference on. IEEE, 2010, pp. 1-4.

[3] J. D. Rennie, L. Shih, J. Teevan, D. R. Karger et al., "Tackling the poor assumptions of naive bayes text classifiers," in ICML, vol. 3, Washington DC, 2003, pp. 616-623.

[4] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up?: sentiment classification using machine learning techniques," in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Association for Computational Linguistics, 2002, pp. 79-86.

[5] W. Wang, L. Chen, K. Thirunarayan, and A. P. Sheth, "Harnessing twitter 'big data' for automatic emotion identification," in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom). IEEE, 2012, pp. 587-592.

[6] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An evaluation of naive bayesian anti-spam filtering," arXiv preprint cs/0006013, 2000.

[7] V. Metsis, I. Androutsopoulos, and G. Paliouras, "Spam filtering with naive bayes - which naive bayes?" in CEAS, 2006, pp. 27-28.

[8] A. Rachmania, J. Jaafar, and N. Zamin, "Likelihood calculation classification for indonesian language news documents," in Information Technology and Electrical Engineering (ICITEE), 2013 International Conference on, Oct 2013, pp. 149-154.

[9] J. Laksana and A. Purwarianti, "Indonesian twitter text authority classification for government in bandung," in Advanced Informatics: Concept, Theory and Application (ICAICTA), 2014 International Conference of, Aug 2014, pp. 129-134.

[10] I. Ahmed, D. Guan, and T. C. Chung, "Sms classification based on naive bayes classifier and apriori algorithm frequent itemset," International Journal of Machine Learning and Computing, vol. 4, pp. 183-187, 2014.

[11] N. M. A. Lestari, I. K. G. D. Putra, and A. K. A. Cahyawan, "Personality types classification for indonesian text in partners searching website using naive bayes methods," International Journal of Computer Science Issues (IJCSI), vol. 10, no. 1, 2013.

[12] A. Cuzzocrea, I.-Y. Song, and K. C. Davis, "Analytics over large-scale multidimensional data: the big data revolution!" in Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP. ACM, 2011, pp. 101-104.

[13] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," in OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.

[14] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, "Map-reduce for machine learning on multicore," Advances in Neural Information Processing Systems, vol. 19, p. 281, 2007.

[15] F. Tala, "A study of stemming effects on information retrieval in bahasa indonesia," 2003.
