
Bangla News Summarization

Anirudha Paul, Mir Tahsin Imtiaz, Asiful Haque Latif, Muyeed Ahmed, Foysal Amin Adnan, Raiyan Khan, Ivan Kadery, and Rashedur M. Rahman

Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh
anirudhaprasun@gmail.com, asif.nobel@gmail.com, akib100095@gmail.com, adnanbd769@gmail.com, raiyan.khan.106@gmail.com, ikadery@gmail.com, {tahsin.imtiaz,rashedur.rahman}@northsouth.edu

Abstract. Document similarity calculation and summarization are challenging tasks, and little work has been done in this field for the Bangla language. Similarity calculation and summarization are more challenging for Bangla because Bangla grammar works differently from English grammar. This paper proposes a way to calculate similarity between Bangla news documents and to summarize Bangla news documents taken from popular news portals, applying various data mining techniques as accurately as possible.

Keywords: Text summarization · Text mining · Document clustering · Data mining · Bangla news summarization · DBSCAN

1 Introduction

Document clustering and summarization have many applications. For example, document clustering is used to group web documents for search engines. Summarization represents a document's information in fewer words, accurately and completely, so that a reader does not have to go through the entire document to get its main idea. Although a great deal of work has been done for English and other major languages, the work done for Bangla is still limited and can be improved.
Bangla is spoken as a first language by almost 210 million people around the world, and the amount of Bangla content on the internet is increasing rapidly. The objective of this paper is to find a way to efficiently perform stemming on Bangla news documents collected from popular news portals, to calculate the similarity between the documents, and to summarize Bangla news in such a way that the summarized document represents the main document as accurately and completely as possible.
The main challenge in text similarity calculation and summarization is the preprocessing of the documents. Since Bangla grammar is complex, the techniques used to stem English words do not apply accurately to Bangla words. In this paper, we stem Bangla documents by first removing common stop words from each document and then finding, for each remaining word, the longest common substring available in the dictionary.

2 Related Work

Uddin and Khan [1] surveyed different techniques that have been applied to English and other languages for summarization and selected some of them to apply to Bangla. The summarization model they designed obtained the best result when they compressed a document down to 40%. Saharia et al. [2] worked on stemming for resource-poor languages spoken mostly in the east of India, including Bangla, Assamese, and Bodo. They introduced a rule-based approach and used a dictionary of frequent Bangla words for the stemming task, obtaining 94% accuracy for Bangla and Assamese, and 87% and 82% accuracy for Bishnupriya and Bodo, respectively. Urmi et al. [3] designed a corpus-based Bangla stemmer that uses a 6-gram model and a simple thresholding technique to identify the root. Dave and Jaswal [4] used the WordNet ontology for abstractive summary generation. Although their results were experimental, the generated summary was well compressed, easily understandable for humans, and free of grammatical errors.
Baralis et al. [5] proposed an approach that summarizes a collection of financial documents even when the documents are written in different languages. The mining system they proposed can handle news written in different languages, can take users' skill levels into account to decide which parts to cover in the summary, and can rank the generated summaries. Dsouza and Ansari [6] used the Weka LibSVM classifier for multi-domain document classification, with a term-document matrix used to represent the data.

3 System Overview

In this paper, Bangla news documents are collected from popular websites. The system is divided into two major parts. Part one finds similar documents using DBSCAN [9] after preprocessing. In part two, the system summarizes a selected document based on priority values assigned to the preprocessed sentences of that document. Both parts share preprocessing as a common step, which deletes unnecessary words and stems the remaining words using a dictionary and custom word lists. To find similarity among the documents, cosine similarity [9] is applied after preprocessing. In part two, the similar documents are taken and the target document to be summarized is selected. A term frequency matrix is generated for all words occurring in the documents, and sentences are then ranked so that the top sentences can be chosen.

3.1 Data Collection


For newspaper articles, we designed a crawler in Python that collects data from popular newspaper websites in Bangladesh. We took all articles from the last 7 days, 993 articles in total. The crawled news articles were saved in an XLS file whose columns contained the date of the news, the source, the category, the headline, and the news content.
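A minimal sketch of such a crawler is shown below. The portal URL, CSS selectors, and output filename are placeholders rather than the values actually used, and the sketch writes an XLSX file via pandas where the original system produced XLS.

```python
# Crawler sketch (assumed portal structure; selectors and URLs are placeholders).
import requests
from bs4 import BeautifulSoup
import pandas as pd

def crawl_portal(base_url, category):
    """Fetch article links from a category page and return one row per article."""
    rows = []
    soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
    for link in soup.select("a.article-link"):                     # placeholder selector
        article = BeautifulSoup(requests.get(link["href"]).text, "html.parser")
        rows.append({
            "date": article.select_one(".pub-date").get_text(strip=True),
            "source": base_url,
            "category": category,
            "headline": article.select_one("h1").get_text(strip=True),
            "content": " ".join(p.get_text(strip=True)
                                for p in article.select(".article-body p")),
        })
    return rows

rows = crawl_portal("https://example-news-portal.com/sports", "sports")  # hypothetical URL
pd.DataFrame(rows).to_excel("bangla_news.xlsx", index=False)             # columns as in Sect. 3.1
```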

Table 1. Cleaning dictionary word list.

We obtained a word list of 98,525 words from the West Bengal Bangla Academy, Kolkata [7]. Some words in the complete list contained additional redundant characters; examples are shown in Table 1. After cleaning the words containing numbers and stray characters using regular expressions [8], we obtained a list of 61,790 words. This word list was used during stemming to find the longest common substring for each word in the news documents.
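As an illustration of this cleaning step, the filter below keeps only entries made entirely of characters from the Bangla Unicode block; the exact pattern used in the paper is not specified, so this rule and the file name are assumptions.

```python
import re

# Assumed cleaning rule: keep only entries composed entirely of characters in the
# Bangla Unicode block (U+0980-U+09FF); drop entries containing digits, Latin
# letters, or other stray symbols picked up during extraction.
BANGLA_ONLY = re.compile(r"^[\u0980-\u09FF]+$")

def clean_word_list(raw_words):
    return sorted({w.strip() for w in raw_words if BANGLA_ONLY.match(w.strip())})

# with open("bangla_academy_words.txt", encoding="utf-8") as f:   # hypothetical filename
#     dictionary = set(clean_word_list(f.read().split()))
```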

3.2 Preprocessing
In preprocessing, the unnecessary words (stop words) are deleted first and the remaining words are then stemmed using dictionary lookup [2].
Deleting stop words before clustering is necessary because they occur very frequently yet carry little information for determining the similarity of two documents. Stemming is needed to identify different representations of the same word as one: if two words have the same meaning but different prefixes or suffixes depending on tense and context, a plain string matching algorithm will treat them as different words even though they carry the same meaning, and their term frequency and inverse document frequency will be calculated incorrectly. Stemming is therefore used to remove these additional prefixes and suffixes and find the root form of each word.
In Bangla documents, most prepositions, conjunctions, and some verbs and their variations are not useful when prioritizing words or sentences for finding similar documents or for summarization, since these words are common and used everywhere. We generated a list of 49 prepositions and conjunctions and a list of 173 verbs and their variations, and removed those words from the documents before further processing. Table 2 shows a few such words that are removed before further processing.
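A minimal sketch of this filtering step, assuming the two custom lists are stored one word per line in plain text files (the file names are hypothetical):

```python
def load_word_list(path):
    """Read one word per line from a UTF-8 text file (hypothetical file layout)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stop_words(tokens, stop_words):
    """Drop the listed prepositions, conjunctions, and common verb variations."""
    return [t for t in tokens if t not in stop_words]

# stop_words = load_word_list("prepositions_conjunctions.txt") | load_word_list("verb_variations.txt")
```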
In the Bangla language, stemming words by cutting off a suffix or a prefix does not always give the correct result. Table 3 compares prefix-based stemming, suffix-based stemming, and word-list (dictionary) lookup.
The first example contains no additional suffix or prefix and is handled correctly by all three methods. The second and last examples show the drawbacks of suffix and prefix stemming, respectively: the letters "র" and "অ" act as a common suffix and prefix in Bangla and are removed even though they are part of the original words in these cases. Both situations are handled correctly by the dictionary-lookup-based stemming, which gives 81% accuracy [2].
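One plausible reading of the dictionary-lookup stemming is sketched below: the stem is taken to be the longest dictionary entry that occurs as a substring of the word, falling back to the word itself when nothing matches. Whether the actual system restricts matches to word prefixes or handles ties differently is not stated, so those details are assumptions.

```python
def stem(word, dictionary):
    """Return the longest dictionary entry occurring as a substring of `word`.

    `dictionary` is the cleaned word set from Sect. 3.1; if no entry matches,
    the word is returned unchanged (assumed fallback behaviour).
    """
    for length in range(len(word), 0, -1):           # try the longest candidates first
        for start in range(len(word) - length + 1):
            candidate = word[start:start + length]
            if candidate in dictionary:
                return candidate
    return word
```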

Table 2. Examples of common removable verbs, prepositions, and conjunctions.

Table 3. Comparison of stemming from different techniques.

3.3 Clustering
After stemming was done, we observed the cosine similarities [9] between the news
documents. In DBSCAN [9], we set the minimum number of news articles to 3 in order
to consider a group of articles as a cluster. By this we have tried to avoid fake or
unreliable news. We also set the eps to 0.5. Which means news article whose similarity
is between 0.5–1 will be in the same cluster. Figure 1 shows the correlation matrix
which represents the similarity distance between the documents. Here, x-axis and
y-axis represents news articles.

Fig. 1. Correlation map.

Cosine similarity measures, in terms of the inner product space, how close two vectors are to each other. If two vectors are similar, the cosine similarity value between them will be close to +1; if they are not similar, the value will be close to −1. However, cosine similarity works only on numeric values; it cannot directly calculate the similarity between two strings or two documents. Therefore, each word in every news article is replaced with its TF-IDF value, with each word in a news article treated as a separate axis in the vector space. After this transformation, two news articles whose cosine similarity is close to +1 can be considered similar. We tested various cosine similarity thresholds on several sample tests to find the optimum value; documents with cosine similarity above the threshold are considered a similar match. Table 4 reports the results for different thresholds when calculating similarity between documents about a particular topic. In this test, 81 documents are considered, of which 16 are about the selected topic. The goal is to avoid false positives completely, since they would mix up different news stories as the same one. Because the results show that a threshold of 0.5 gives the optimal outcome, we use 0.5 for further analysis.

Table 4. Similarity calculation results at different thresholds.

Threshold  True positive  False positive  True negative  False negative
0.1        16             21              44             0
0.2        16             2               63             0
0.3        16             2               63             0
0.4        15             2               63             1
0.5        13             0               65             3
0.6        10             0               65             6
0.7        5              0               65             11
0.8        3              0               65             13
0.9        3              0               65             13
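The paper does not name a specific toolkit; the following is a minimal sketch of this clustering step using scikit-learn, under the assumption that the preprocessed articles are plain whitespace-separated token strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

def cluster_articles(preprocessed_docs):
    """Cluster stemmed, stop-word-free article texts with DBSCAN over cosine distance."""
    # Documents are already tokenised and stemmed, so a plain whitespace split suffices.
    tfidf = TfidfVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
    vectors = tfidf.fit_transform(preprocessed_docs)
    # eps=0.5 with cosine *distance* keeps pairs whose cosine *similarity* is >= 0.5
    # in the same neighbourhood; min_samples=3 enforces at least three articles per cluster.
    labels = DBSCAN(eps=0.5, min_samples=3, metric="cosine").fit_predict(vectors)
    return labels  # label -1 marks articles that fall outside every cluster
```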

3.4 Term Frequency Matrix for Document Summarization Process


After preprocessing and finding similar documents, we select a particular list of similar news documents and generate a term frequency matrix for all the words appearing in that list. This matrix records how many times each word appears in each document of the list. Table 5 shows what the term frequency matrix looks like after the system runs on similar documents.
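A minimal sketch of how such a matrix might be assembled is given below; the exact data structure used by the system is not specified, so the dictionary-of-lists layout is an assumption.

```python
from collections import Counter

def term_frequency_matrix(docs):
    """Count how many times each word appears in each document of the cluster.

    `docs` is a list of preprocessed (stemmed, stop-word-free) documents given as
    token lists; the result maps word -> [count in doc 0, count in doc 1, ...].
    """
    counts = [Counter(tokens) for tokens in docs]
    vocabulary = sorted(set().union(*counts))
    return {word: [c[word] for c in counts] for word in vocabulary}
```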

3.5 Giving Priority Values to Sentences


To summarize the selected document, we assign priority values to its sentences and select the sentences with the highest values. To do this, we use the formula in (1).
$$S_{ik} = \frac{\sum w_i}{t_{ik}} \qquad (1)$$

Table 5. Term frequency matrix.

Here,
S_{ik} = score or priority value of the k-th line in the i-th document
t_{ik} = number of words in the k-th line of the i-th document
w_i = score of the i-th word.

Method 1
In this method, we use term frequency (TF) [12] to calculate w_i. TF is calculated using (3), and S_{ik} is then calculated using the w_i given in (2).

$$w_i = \mathrm{TF} \qquad (2)$$
$$\mathrm{TF} = \frac{x}{n} \qquad (3)$$

Here,
x = number of occurrences of the word in all documents
n = number of words in the current document.
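The sketch below implements equations (1)–(3) under two assumptions not fixed by the text: sentences are given as token lists, and x counts a word's occurrences across all documents in the cluster.

```python
from collections import Counter

def score_sentences_tf(doc_sentences, cluster_docs):
    """Method 1 sketch: w_i = TF = x / n, S_ik = sum(w_i) / t_ik.

    `doc_sentences` is the target document as a list of preprocessed token lists
    (one per sentence); `cluster_docs` is the list of all preprocessed documents
    in the cluster, each given as a flat token list.
    """
    x = Counter(token for doc in cluster_docs for token in doc)   # occurrences in all documents
    n = sum(len(sent) for sent in doc_sentences)                  # words in the current document
    scores = []
    for sentence in doc_sentences:
        if not sentence:
            scores.append(0.0)
            continue
        w = [x[token] / n for token in sentence]                  # eq. (2)-(3)
        scores.append(sum(w) / len(sentence))                     # eq. (1)
    return scores
```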

Method 2
In this method, we summarize the article that obtains the highest total score according to the sentence scoring of Method 1 and present its summary as the general summary for all articles in the cluster.
Method 3
In this method, we use the product of term frequency and inverse document frequency. Inverse document frequency (IDF) [12] is calculated using (5), and S_{ik} is then calculated using the w_i given in (4).

$$w_i = \mathrm{TF} \times \mathrm{IDF} \qquad (4)$$
$$\mathrm{IDF} = 1 + \log_e \frac{p}{q} \qquad (5)$$

Here,
p = number of documents in the cluster
q = number of documents in which the word exists.
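A corresponding sketch for equations (4)–(5), again assuming token-list input; the guard against q = 0 is an added safety check, not something stated in the paper.

```python
import math
from collections import Counter

def score_sentences_tfidf(doc_sentences, cluster_docs):
    """Method 3 sketch: w_i = TF * IDF with IDF = 1 + ln(p / q)."""
    x = Counter(token for doc in cluster_docs for token in doc)   # occurrences in all documents
    n = sum(len(sent) for sent in doc_sentences)                  # words in the current document
    p = len(cluster_docs)                                         # documents in the cluster
    doc_sets = [set(doc) for doc in cluster_docs]
    scores = []
    for sentence in doc_sentences:
        if not sentence:
            scores.append(0.0)
            continue
        w = []
        for token in sentence:
            q = sum(token in doc for doc in doc_sets) or 1        # documents containing the word
            w.append((x[token] / n) * (1 + math.log(p / q)))      # eq. (4)-(5)
        scores.append(sum(w) / len(sentence))                     # eq. (1)
    return scores
```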

Method 4
In this method, we use the same process as Method 2, but with the sentence scoring of Method 3.

3.6 Summarization
After all sentences have received a priority value, a threshold can be set for the number of lines included in the summarized document. We use a threshold of 40%, following Uddin and Khan [1].
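A minimal sketch of this selection step; emitting the chosen sentences in their original document order is an assumption, since the paper does not state how the selected sentences are ordered.

```python
import math

def summarize(sentences, scores, ratio=0.4):
    """Keep the top-scoring sentences up to `ratio` of the document length."""
    k = max(1, math.ceil(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]   # original order preserved (assumed)
```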
The following news is taken from The Daily Ittefaq [10].

The English translation of this article is as follows:


“Mashrafe is returning to his homeland due to sudden sickness of his wife. Just after two days
of practice at Sussex, England, the captain is coming back. It has been informed that Sumona
Haque, the wife of Bangladesh’s one day captain has suddenly fallen sick on Saturday. She has
been admitted to a hospital. Mashrafe was supposed to land on Sunday afternoon, but due to a flight delay he had to reach Dhaka at night. It is noteworthy that the Bangladesh team left
Dhaka on the eve of 26th April to prepare for England camp. The first practice match is on
Monday against Duke of Norfolk. After playing another practice match on 5th, the team will
leave for Ireland on 7th. The first match of the tri nation series is on 12th May in Ireland. But
due to a restriction, Mashrafe will not be able to play.”

For this particular document, Table 6 shows the score of each sentence calculated using Method 1.
The results of summarizing this news article with the four methods are given below.
Method 1: Result

The English translation of this summarization is as follows:


“Mashrafe is returning to his homeland due to sudden sickness of his wife. It has been informed
that Sumona Haque, the wife of Bangladesh’s one day captain has suddenly fallen sick on
Saturday. It is noteworthy that Bangladesh team left Dhaka on the eve of 26th April to prepare
for England camp. The first match of the tri nation series is on 12th May in Ireland.”

Table 6. Priority values of sentences.

Method 2: Result

The English translation of this summarization is as follows:


“Mashrafe is returning to his homeland due to sudden sickness of his wife. But not only his wife
Sumona, his son Sahil Mortaza is sick as well. He is returning to the country because of the
sickness of his wife, not because of his injury. It has been informed that wife of Mashrafe,
Sumona is sick.”

Method 3: Result

The English translation of this summarization is as follows:


“Mashrafe is returning to his homeland due to sudden sickness of his wife. But due to a
restriction, Mashrafe will not be able to play. It has been informed that Sumona Haque, the wife
of Bangladesh’s one day captain has suddenly fallen sick on Saturday. It is noteworthy that
Bangladesh team left Dhaka on the eve of 26th April to prepare for England camp.”

Method 4: Result

The English translation of this summarization is as follows:


“But not only his wife Sumona, his son Sahil Mortaza is sick as well. The Bangladesh team left Dhaka for England late on Wednesday night in order to take part in the Ireland tri-series on 12–24 May and the Champions Trophy tournament on 1–18 June. He is returning to the country because of the sickness of his wife, not because of his injury. This has not been confirmed by the BCB yet, although this news is from a reliable source.”

4 Method Evaluation

We used ROUGE-N to evaluate the performance of the methods. ROUGE-N measures the overlap of N-gram units between a human-made summary and an algorithm-produced summary [11]. For the human-made summaries, 10 journalists summarized the news articles of our dataset. We checked our algorithm-produced summaries against these human-made summaries using ROUGE-1, ROUGE-2, and ROUGE-3. Table 7 shows the average performance of this comparison.

Table 7. Performance evaluation.


ROUGE-N Metric Method 1 Method 2 Method 3 Method 4
ROUGE-1 Avg_Recall 0.4219 0.3132 0.5354 0.3851
Avg_Precision 0.4503 0.3145 0.4251 0.2976
Avg_F-Measure 0.4313 0.3089 0.4689 0.3292
ROUGE-2 Avg_Recall 0.3515 0.2071 0.4391 0.2321
Avg_Precision 0.3761 0.2077 0.3602 0.1897
Avg_F-Measure 0.3597 0.2047 0.3929 0.2063
ROUGE-3 Avg_Recall 0.3202 0.1609 0.4079 0.1761
Avg_Precision 0.3427 0.1638 0.3361 0.1491
Avg_F-Measure 0.3277 0.1603 0.3662 0.1601

From this table, we can see that Method 1 obtains the best precision score, meaning it retrieves the lowest proportion of irrelevant content relative to relevant content. Method 3 obtains the best recall score, meaning it is more likely to retrieve the correct and relevant content while summarizing Bangla news articles. Method 3 also obtains the highest F-measure, which indicates that, considering both recall and precision, it performs better than the other methods.
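For reference, a minimal sketch of ROUGE-N recall, precision, and F-measure computed over word n-grams is shown below; this is a plain clipped-count implementation and not necessarily the exact evaluation script used.

```python
from collections import Counter

def rouge_n(candidate_tokens, reference_tokens, n=1):
    """Return (recall, precision, F-measure) for word n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    overlap = sum((cand & ref).values())                 # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return recall, precision, f
```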

5 Conclusion

In this paper, we have focused on avoiding incorrect stemming and the mixing up of different news stories. The system also compared the obtained results with human-produced summaries.
In the Bangla language, rule-based stemming based on prefix or suffix elimination does not work well because words can have many types of variation. Finding the longest common substring in the dictionary for each word in the document, on the other hand, has a higher probability of selecting the right stem.
The aggressive threshold selection used to avoid false positives and unverified news keeps the summary reliable.
The performance evaluation against the human-made summaries shows that TF-IDF-based single-document summarization, as used in Method 3, performs better overall than the other methods. However, if we want to generate a single summary for all the news in a cluster, Method 4 also performs well without producing a redundant summary.

References
1. Uddin, M.N., Khan, S.A.: A study on text summarization techniques and implement few of
them for Bangla language. In: 2007 10th International Conference on Computer and
Information Technology (2007). doi:10.1109/ICCITECHN.2007.4579374
2. Saharia, N., Sharma, U., Kalita, J.: Stemming resource-poor Indian languages. ACM Trans.
Asian Lang. Inf. Process. 13(3), 1–26 (2014)
3. Urmi, T.T., Jammy, J.J., Ismail, S.: A corpus based unsupervised Bangla word stemming
using N-gram language model. In: 2016 5th International Conference on Informatics,
Electronics and Vision (ICIEV) (2016). doi:10.1109/ICIEV.2016.7760117
4. Dave, H., Jaswal, S.: Multiple text document summarization system using hybrid
summarization technique. In: 2015 1st International Conference on Next Generation
Computing Technologies (NGCT) (2015). doi:10.1109/NGCT.2015.7375231
5. Baralis, E., Cagliero, L., Cerquitelli, T.: Supporting stock trading in multiple foreign
markets. In: Proceedings of 2nd International Workshop on Data Science for
Macro-Modeling – DSMM 2016 (2016). doi:10.1145/2951894.2951897
6. Dsouza, K.J., Ansari, Z.A.: A novel data mining approach for multi variant text
classification. In: 2015 IEEE International Conference on Cloud Computing in Emerging
Markets (CCEM) (2015)
7. Bangla Word List. West Bengal Bangla Academy, Kolkata (n.d.)
8. List of Regular Expressions. LibreOffice Help, help.libreoffice.org (2017). Accessed 5 May 2017
9. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Dorling Kindersley,
Pearson, London (2015)
10. আকস্মিক দেশের পথে মাশরাফি (Mashrafe suddenly heading home), খেলাধুলা (Sports). The Daily Ittefaq, ittefaq.com.bd (2017). Accessed 3 May 2017
11. Ferreira, R., et al.: Assessing sentence scoring techniques for extractive text summarization.
Expert Syst. Appl. 40(14), 5755–5764 (2013)
12. Ramos, J.: Using TF-IDF to determine word relevance in document queries. Technical
report, Department of Computer Science, Rutgers University (2003)
