1 Introduction
Document clustering and summarization have many applications. Document clustering, for example, is used to group web documents for search engines. Summarization represents a document's information in fewer words, accurately and completely, so that a reader can grasp the main idea without going through the entire document. Although much work has been done for English and other major languages, the work done for Bangla is still limited and can be improved.
Bangla is spoken as a first language by almost 210 million people all over the world, and the number of Bangla documents on the internet is growing rapidly. The objective of this paper is to find a way to efficiently stem Bangla news documents collected from popular news portals, in order to calculate the similarity between the documents, and to summarize Bangla news in such a way that the summary represents the original document as accurately and completely as possible.
The main challenge in text similarity calculation and summarization is preprocessing the documents. Since Bangla grammar is highly complex, the techniques used to stem English words do not apply accurately to Bangla words. In this paper, we stem Bangla documents by first removing common stop words and then finding, for each remaining word, the longest common substring available in the dictionary.
© Springer International Publishing AG 2017
N.T. Nguyen et al. (Eds.): ICCCI 2017, Part II, LNAI 10449, pp. 479–488, 2017.
DOI: 10.1007/978-3-319-67077-5_46
480 A. Paul et al.
2 Related Work
Uddin and Khan [1] surveyed techniques that have been applied to English and other languages for summarization and selected some of them to apply to Bangla. Their summarization model obtained its best result when compressing the document down to 40%. Saharia et al. [2] worked on stemming resource-poor languages spoken mostly in the east of India, including Bangla, Assamese, and Bodo. They introduced a rule-based approach that uses a dictionary of frequent Bangla words for stemming, obtaining 94% accuracy for Bangla and Assamese, and 87% and 82% for Bishnupriya and Bodo respectively. Urmi et al. [3] designed a corpus-based Bangla stemmer that uses a 6-gram model and a simple thresholding technique to identify roots. Dave and Jaswal [4] used WordNet ontology for abstractive summary generation; although their results were preliminary, the generated summaries were well compressed, easily understandable to humans, and free of grammatical errors.
Baralis et al. [5] proposed an approach that summarizes a financial document collection even when its documents are written in different languages. Their mining system handles news written in multiple languages, analyzes the skill level of users to decide which parts to cover in the summary, and ranks the generated summaries. Dsouza and Ansari [6] used the Weka LibSVM classifier for multi-domain document classification, representing the data with a term-document matrix.
3 System Overview
In this paper, Bangla news documents are collected from popular websites. The system is divided into two major parts. Part one finds similar documents using DBSCAN [9] after preprocessing. In part two, the system summarizes a selected document based on priority values assigned to its preprocessed sentences. Both parts share preprocessing as a common step; it removes unnecessary words and stems the remaining words using a dictionary and custom word lists. To find similarity among the documents, cosine similarity [9] is applied after preprocessing. In part two, the similar documents are taken and the target document to be summarized is selected. A term frequency matrix is generated for the words occurring in the documents, and finally sentences are ranked in order to choose the top ones.
We obtained a word list of 98,525 words from the West Bengal Bangla Academy, Kolkata [7]. Some entries in the complete list contained additional redundant characters; examples are shown in Table 1. After removing entries containing numbers and extraneous characters using regular expressions [8], we were left with a list of 61,790 words. This list was used during stemming to find the longest common substrings for words in the news documents.
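The cleaning step can be sketched as follows; the specific regular expressions and the sample entries are illustrative assumptions, since the exact pattern used is not given here.

```python
import re

# Keep only entries consisting purely of characters from the Bengali
# Unicode block, and reject entries containing any digit (Latin or Bangla).
BANGLA_WORD = re.compile(r"^[\u0980-\u09FF]+$")
ANY_DIGIT = re.compile(r"[0-9\u09E6-\u09EF]")

def clean_word_list(words):
    """Drop entries containing numbers or extraneous characters, as in
    the reduction of the word list from 98,525 to 61,790 entries."""
    return [w for w in words if BANGLA_WORD.match(w) and not ANY_DIGIT.search(w)]

# Illustrative entries: the second contains a digit, the third stray punctuation.
print(clean_word_list(["আকাশ", "বই1", "মাঠ,", "নদী"]))  # ['আকাশ', 'নদী']
```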
3.2 Preprocessing
In preprocessing, the unnecessary words (stop words) are deleted first, and then the remaining words are stemmed using dictionary lookup [2].
Stop words must be deleted before clustering because they occur very frequently and are not key indicators of the similarity between two documents. Stemming is needed to identify different representations of the same word. If two words share a meaning but carry different prefixes or suffixes depending on tense and context, a plain string-matching algorithm will treat them as different words, so their term frequency and inverse document frequency will be calculated incorrectly. Stemming therefore eliminates these prefixes and suffixes to recover the root of each word.
In Bangla documents, we observe that most prepositions, conjunctions, and some verbs and their variations are unnecessary when prioritizing words or sentences for finding similar documents or for summarization, since these words are common and used everywhere. We generated a list of 49 prepositions and conjunctions and a list of 173 verbs and their variations, and removed those words from the documents before further processing. Table 2 shows a few such words.
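In code, the removal step is a straightforward set lookup. The handful of stop words below is a hypothetical stand-in for the full lists of 49 prepositions/conjunctions and 173 verb forms.

```python
# Hypothetical mini stop-word list standing in for the paper's two lists.
STOP_WORDS = {"এবং", "কিন্তু", "অথবা", "করে", "হয়"}

def remove_stop_words(tokens):
    """Filter stop words from a tokenized document before stemming."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["বাংলাদেশ", "এবং", "ভারত"]))  # ['বাংলাদেশ', 'ভারত']
```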
In Bangla, stemming a word by cutting its suffix or prefix does not always give the correct result. Table 3 compares prefix-based results, suffix-based results, and results from word-list lookup.
The first example contains no additional suffix or prefix and is handled correctly by all three methods. The second and last examples show the drawbacks of suffix and prefix stemming respectively: the letters "র" and "অ" act as a common suffix and prefix in Bangla and get removed, although in these cases they are part of the original words. Both situations are handled correctly by dictionary-lookup stemming, which achieves 81% accuracy [2].
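The lookup can be sketched as below: for each word, we take the longest dictionary entry contained in it as the stem. The matching rule (substring containment, longest match wins) is our reading of the description above, and the two dictionary entries are illustrative.

```python
def dictionary_stem(word, dictionary):
    """Return the longest dictionary entry that occurs as a substring
    of `word`; fall back to the word itself when nothing matches."""
    best = ""
    for entry in dictionary:
        if entry in word and len(entry) > len(best):
            best = entry
    return best or word

dictionary = {"বাংলাদেশ", "দেশ"}
# "বাংলাদেশের" contains both entries; the longer one is chosen as the stem.
print(dictionary_stem("বাংলাদেশের", dictionary))  # বাংলাদেশ
```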
3.3 Clustering
After stemming, we computed the cosine similarities [9] between the news documents. In DBSCAN [9], we set the minimum number of news articles required to form a cluster to 3, which helps us avoid fake or unreliable news, and we set eps to 0.5, meaning that news articles whose similarity lies between 0.5 and 1 fall into the same cluster. Figure 1 shows the correlation matrix representing the similarity distance between the documents; both the x-axis and y-axis index the news articles.
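A minimal version of the clustering step might look like the sketch below, run over a precomputed matrix of cosine distances (1 − similarity) with the parameters above. A production system would typically use a library implementation such as scikit-learn's.

```python
def dbscan(dist, eps=0.5, min_pts=3):
    """Minimal DBSCAN over a precomputed distance matrix (list of lists).
    Returns one label per point: cluster ids 0, 1, ... or -1 for noise.
    A point counts itself among its neighbours, so min_pts=3 means a
    cluster needs at least three mutually close articles."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neigh = [q for q in range(n) if dist[p][q] <= eps]
        if len(neigh) < min_pts:
            labels[p] = -1            # too few neighbours: noise (for now)
            continue
        cluster += 1
        labels[p] = cluster
        seeds = [q for q in neigh if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster   # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neigh = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neigh) >= min_pts:  # q is a core point: expand from it
                seeds.extend(r for r in q_neigh if labels[r] is None)
    return labels

# Four mutually close articles and one outlier (distances are illustrative).
d = [[0.0, 0.1, 0.2, 0.3, 0.9],
     [0.1, 0.0, 0.1, 0.2, 0.9],
     [0.2, 0.1, 0.0, 0.1, 0.9],
     [0.3, 0.2, 0.1, 0.0, 0.9],
     [0.9, 0.9, 0.9, 0.9, 0.0]]
print(dbscan(d, eps=0.5, min_pts=3))  # [0, 0, 0, 0, -1]
```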
Cosine similarity measures, in terms of an inner product space, how close two vectors are to each other. If two vectors are similar, their cosine similarity is close to +1; if they are dissimilar, it is close to −1. However, cosine similarity operates on numeric values; it cannot directly compare two strings or two documents. Therefore, each word in every news article is replaced with its TF-IDF value, with each distinct word treated as a separate axis in the vector space. Two news articles whose cosine similarity is close to +1 can then be considered similar.
We tested various cosine similarity thresholds on several sample sets to find the optimal value. Documents whose cosine similarity exceeds the threshold are considered similar. Table 4 reports the accuracy for different thresholds when calculating the similarity between documents about a particular topic; 81 documents are considered, of which 16 are about the selected topic. The goal is to avoid false positives completely, since they would mix up unrelated news as the same story. We use the threshold value 0.5 for further analysis, since the results show that it is optimal.
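The similarity computation can be sketched as follows. The exact TF-IDF variant used for this step is not spelled out, so we assume TF = count/length together with the IDF form 1 + loge(N/df) given in (5); the sample tokens are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict
    per document, with weight = (count / len) * (1 + ln(N / df))."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vecs = []
    for doc in docs:
        counts = Counter(doc)
        vecs.append({t: (c / len(doc)) * (1 + math.log(n / df[t]))
                     for t, c in counts.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["খেলা", "দল"], ["খেলা", "দল"], ["নদী", "পানি"]]
v = tfidf_vectors(docs)
print(round(cosine(v[0], v[1]), 2), round(cosine(v[0], v[2]), 2))  # 1.0 0.0
```

With a 0.5 threshold, the first two documents would land in the same cluster and the third would not.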
Here,
Sik = score or priority value of the k-th line in the i-th document
tik = number of words in the k-th line of the i-th document
wi = score of the i-th word.

Method 1
In this method, we use term frequency [12] to calculate wi. Term frequency (TF) is calculated using (3), and Sik is then calculated from the wi obtained in (2).

wi = TF    (2)

TF = x / n    (3)

Here,
x = number of occurrences of a word in all documents
n = number of words in the current document.
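A sketch of Method 1 under the definitions above; combining the word scores into Sik by averaging over the tik words of the line is our assumption.

```python
from collections import Counter

def method1_scores(sentences, all_docs):
    """sentences: token lists for the target document's lines.
    all_docs: token lists for every document under consideration.
    w_i = TF = x / n, with x the word's count across all documents and
    n the word count of the current document; S_ik averages w_i over
    the words of the k-th line."""
    x = Counter(t for doc in all_docs for t in doc)
    n = sum(len(s) for s in sentences)
    return [sum(x[t] / n for t in s) / len(s) if s else 0.0
            for s in sentences]
```

Lines whose words recur across the collection score higher and are more likely to survive the summarization cut.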
Method 2
In this method, we summarize the article that obtained the highest total score under the sentence scoring of Method 1 and present it as the general summary for all articles in the cluster.
Method 3
In this method, we use the product of term frequency and inverse document frequency. Inverse document frequency (IDF) [12] is calculated using (5), and Sik is then calculated from the wi obtained in (4).

wi = TF × IDF    (4)

IDF = 1 + loge(p / q)    (5)
Bangla News Summarization 485
Here,
p = number of documents in the cluster
q = number of documents in which this word exists.
Method 4
In this method, we use the same process as Method 2, but base the sentence scoring on Method 3.
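Method 3 can be sketched the same way, now weighting each word by TF × IDF with IDF = 1 + loge(p/q) as in (5); averaging over the sentence length is again our assumption.

```python
import math
from collections import Counter

def method3_scores(sentences, cluster_docs):
    """w_i = TF * IDF, with TF = x / n and IDF = 1 + ln(p / q), where p
    is the number of documents in the cluster and q the number of
    documents containing the word. Assumes the target document's tokens
    occur in cluster_docs (q is clamped to 1 as a safety net)."""
    p = len(cluster_docs)
    x = Counter(t for doc in cluster_docs for t in doc)
    q = Counter(t for doc in cluster_docs for t in set(doc))
    n = sum(len(s) for s in sentences)

    def w(t):
        return (x[t] / n) * (1 + math.log(p / max(q[t], 1)))

    return [sum(w(t) for t in s) / len(s) if s else 0.0 for s in sentences]
```

Compared with Method 1, words confined to few documents of the cluster are boosted, so sentences carrying topic-specific vocabulary rank higher.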
3.6 Summarization
After every sentence receives a priority value, a threshold can be defined for the number of lines kept in the summarized document. We use a threshold of 40%, following Uddin and Khan [1].
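The final selection step can be sketched as below; restoring the original sentence order after taking the top 40% is our assumption, made so that the summary reads coherently.

```python
def summarize(sentences, scores, ratio=0.4):
    """Keep the top `ratio` fraction of sentences by priority value,
    then restore document order."""
    k = max(1, round(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

sents = ["s1", "s2", "s3", "s4", "s5"]
print(summarize(sents, [0.1, 0.9, 0.3, 0.8, 0.2]))  # ['s2', 's4']
```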
The following news is taken from The Daily Ittefaq [10].
For this particular document, Table 6 shows the score of each sentence calculated using Method 1.
The results of summarizing the preceding news article with each of the four methods follow.
Method 1: Result
Saturday. It is noteworthy that Bangladesh team left Dhaka on the eve of 26th April to prepare
for England camp. The first match of the tri nation series is on 12th May in Ireland.”
Method 2: Result
Method 3: Result
Method 4: Result
4 Method Evaluation
From this table, we can see that Method 1 achieves the best precision score, meaning that it retrieves the lowest proportion of irrelevant data relative to relevant data. Method 3 achieves the best recall score, meaning it is the most likely to retrieve correct and relevant data while summarizing Bangla news articles; it also achieves the highest F-measure, indicating that when both recall and precision are considered, it performs better than the other methods.
5 Conclusion
In this paper, we have focused on avoiding incorrect stemming and the mixing up of unrelated news. The system's results were also compared with human-produced summaries.
In Bangla, rule-based stemming by prefix or suffix elimination does not work well, since words in this language can vary in many ways. Finding the longest common substring in the dictionary for each word in the document, on the other hand, has a higher probability of selecting the right stem.
The aggressive threshold, chosen to avoid false positives and unverified news, makes the summary reliable.
The evaluation against human-made summaries shows that TF-IDF based single-document summarization, as in Method 3, performs better overall than the other methods. However, when a single summary must be generated for all the news in a cluster, Method 4 also performs well without producing a redundant summary.
References
1. Uddin, M.N., Khan, S.A.: A study on text summarization techniques and implement few of
them for Bangla language. In: 2007 10th International Conference on Computer and
Information Technology (2007). doi:10.1109/ICCITECHN.2007.4579374
2. Saharia, N., Sharma, U., Kalita, J.: Stemming resource-poor Indian languages. ACM Trans.
Asian Lang. Inf. Process. 13(3), 1–26 (2014)
3. Urmi, T.T., Jammy, J.J., Ismail, S.: A corpus based unsupervised Bangla word stemming
using N-gram language model. In: 2016 5th International Conference on Informatics,
Electronics and Vision (ICIEV) (2016). doi:10.1109/ICIEV.2016.7760117
4. Dave, H., Jaswal, S.: Multiple text document summarization system using hybrid
summarization technique. In: 2015 1st International Conference on Next Generation
Computing Technologies (NGCT) (2015). doi:10.1109/NGCT.2015.7375231
5. Baralis, E., Cagliero, L., Cerquitelli, T.: Supporting stock trading in multiple foreign
markets. In: Proceedings of 2nd International Workshop on Data Science for
Macro-Modeling – DSMM 2016 (2016). doi:10.1145/2951894.2951897
6. Dsouza, K.J., Ansari, Z.A.: A novel data mining approach for multi variant text
classification. In: 2015 IEEE International Conference on Cloud Computing in Emerging
Markets (CCEM) (2015)
7. Bangla Word List [PDF] West Bengal Bangla Academy, Kolkata (n.d.)
8. List of Regular Expressions. LibreOffice Help, help.libreoffice.org (2017). Accessed 5 May 2017
9. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Dorling Kindersley,
Pearson, London (2015)
10. “আকস্মিক দেশের পথে মাশরাফি” [Mashrafe suddenly on his way home]. Sports. The Daily Ittefaq, ittefaq.com.bd (2017). Accessed 3 May 2017
11. Ferreira, R., et al.: Assessing sentence scoring techniques for extractive text summarization.
Expert Syst. Appl. 40(14), 5755–5764 (2013)
12. Ramos, J.: Using TF-IDF to determine word relevance in document queries. Technical
report, Department of Computer Science, Rutgers University (2003)