Conference Paper · January 2017
DOI: 10.1109/ICIVPR.2017.7890883


An Extractive Text Summarization Technique for
Bengali Document(s) using K-means Clustering
Algorithm
Sumya Akter1, Aysa Siddika Asa2, Md. Palash Uddin3, Md. Delowar Hossain4, Shikhor Kumer Roy5, and Masud Ibn Afjal6
Faculty of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University (HSTU),
Dinajpur-5200, Bangladesh
1sumya.hstu@gmail.com, 2asha.cse12@gmail.com, 3palash_cse@hstu.ac.bd, 4delowar.cit@gmail.com, 5shikhorroy.cse12@gmail.com and 6masud@hstu.ac.bd

Abstract— Text summarization, a field of data mining, is very important for developing various real-life applications. Many techniques have been developed for summarizing English texts, but few attempts have been made for Bengali text because of its multifaceted structure. This paper presents a method for text summarization which extracts important sentences from a single or multiple Bengali documents. The input document(s) are pre-processed by tokenization, stemming, etc. Then, the word score is calculated by Term Frequency/Inverse Document Frequency (TF/IDF), and the sentence score is determined by summing up its constituent words' scores together with its position. Cue and skeleton words have also been considered in calculating the sentence score. For single or multiple documents, the K-means clustering algorithm has been applied to produce the final summary. The experimental results show satisfactory outputs in comparison to the existing approaches while possessing linear run-time complexity.

Keywords— data mining; text summarization; extractive summarization; Bengali document(s) summarization; TF*IDF; K-means clustering algorithm

I. INTRODUCTION

Nowadays, the use of the Internet has caused a rapid growth of electronic data which needs to be processed, stored, and managed. Sometimes it is difficult to find the exact information in such a large amount of data, or big data. Big data [1] has the potential to be mined for information, and data mining is essential for finding the proper information that we need. When data are accessed from such a huge repository of e-documents, hundreds of thousands of documents are retrieved through data mining. Data mining also finds the correlations or patterns among dozens of fields in large relational databases [2]-[3]. Its roots are traced back along three family lines: classical statistics, artificial intelligence, and machine learning [4]. Data mining is thus the process of discovering knowledge in databases, which is very useful for extracting and identifying useful information and subsequent knowledge. The patterns extracted from the database are then used to build data mining models, which can be used to predict performance and behavior with high accuracy. It utilizes descriptive (e.g. summarization, clustering, sequence discovery) and predictive (e.g. classification, regression, time series analysis) data mining approaches in order to discover hidden information [5]. As a field of data mining, text summarization is one of the most popular research areas for extracting the main theme from a large volume of data. It is generally used to denote any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract probably useful information. Essentially, text summarization techniques are classified as extractive and abstractive. Extractive techniques perform summarization by selecting sentences of the documents according to some criteria, while abstractive techniques attempt to improve the coherence among sentences by eliminating redundancies and clarifying the context of sentences. Sentence scoring is the most used technique for extractive text summarization: extractive summarization involves assigning a saliency measure to some units (e.g. sentences, paragraphs) of the documents and extracting those with the highest scores to include in the summary [6]. Moreover, people want to obtain information in a precise way and do not like to read large documents full of redundant data. Thus, the technique of summarizing a text document helps to find its informative sentences and saves precious time.

A. General Procedure of Text Summarization

A general procedure for extractive methods, usually performed in three steps, is discussed below [7]:

Step 1: The first step creates a representation of the document. Preprocessing such as tokenization, stop word removal, noise removal, stemming, sentence splitting, frequency computation, etc. is applied here.

Step 2: In this step, sentence scoring is performed. In general, three approaches are followed:
• Word scoring – assigning scores to the most important words;
• Sentence scoring – verifying sentence features such as its position in the document, similarity to the title, etc.;
• Graph scoring – analyzing the relationships between sentences.

The general methods for calculating the score of any word are word frequency, TF/IDF, upper case, proper noun, word co-occurrence, lexical similarity, etc.
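The three-step procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it scores words by plain frequency (the simplest of the word-scoring methods just listed) and keeps the top sentences in document order. The toy stop-word list and the sentence-splitting regex are assumptions made only for the sketch.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}  # toy list

def summarize(text, n_sentences=2):
    # Step 1: representation -- split into sentences and tokens, drop stop words
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [[w for w in re.findall(r"\w+", s.lower()) if w not in STOP_WORDS]
              for s in sentences]
    # Step 2: word scoring by frequency, then sentence scoring as the sum
    freq = Counter(w for ws in tokens for w in ws)
    scores = [sum(freq[w] for w in ws) for ws in tokens]
    # Step 3: select the highest-scoring sentences, kept in document order
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n_sentences])
    return " ".join(sentences[i] for i in top)
```

Sentences sharing many frequent words score highest, so off-topic sentences are naturally dropped from the extract.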

978-1-5090-6004-7/17/$31.00 ©2017 IEEE


The common phenomena used for scoring any sentence are cue-phrases ("in summary", "in conclusion", "our investigation", "the paper describes") and emphasis terms such as "the best", "the most important", "according to the study", "significantly", "important", "in particular", "hardly", "impossible"; other features include sentence inclusion of numerical data, sentence length, sentence centrality, sentence resemblance to the title, etc. The popular graph scoring methods are TextRank, bushy path of the node, aggregate similarity, etc.

Step 3: In this step, the high-score sentences are selected using a specific sorting order for extracting the contents, and the final summary is generated if it is single-document summarization. For multi-document summarization, the process is extended: each document produces one summary, and then a clustering algorithm is applied to cluster the relevant sentences of the summaries to generate the final summary.

B. General Approaches for Extractive Text Summarization

Extractive summarizers [8] find the most relevant sentences in the document and also remove the redundant data. Extractive summarization is easier than abstractive summarization for bringing out a summary. The common extractive methods are the TF/IDF method, the cluster-based method, the graph-theoretic approach, the machine learning approach, the LSA (Latent Semantic Analysis) method, text summarization with neural networks, automatic text summarization based on fuzzy logic, query-based extractive text summarization, concept-obtained text summarization, text summarization using regression for estimating feature weights, multilingual extractive text summarization, topic-driven summarization, MMR (Maximal Marginal Relevance) and centroid-based summarization, etc.

II. RELATED WORK

Previous works on single-document or multi-document summarization have tried different directions to obtain the best result. Till now, various generic multi-document extraction-based summarization techniques already exist; most of them are for English rather than other natural languages like Bengali. In this section, we discuss some previous works on extractive text summarization. J. Zhang [9] presented an approach for multi-document text summarization using Cue-based hub-authority. It is a graph-based summarization that detects sub-topics by sentence clustering using K-nearest neighbor (KNN). Y. Ouyang [10] presented an integrated multi-document summarization approach based on hierarchical representation, in which query relevancy and topic specificity are used for filtering. It also calculates point-wise mutual information (PMI) for identifying the subsumption between words, a high PMI being regarded as co-related; a hierarchical tree is then constructed to produce the summary. X. Li, J. Zhang and M. Xing [11] proposed an automatic summarization technique for Chinese text based on sub-topic partition and sentence features. In this process, the sentence weight is calculated by the LexRank algorithm combined with the score of the sentence's own features such as its length, position, cue words and structure; since some problems were found when applying LexRank, automatic summarization based on a maximum spanning tree and sentence features was proposed to overcome them. P. Hu, T. He and H. Wang [12] proposed a multi-view sentence ranking for query-biased summarization. The proposed approach first constructs two base rankers to rank all the sentences in a document set from two independent but complementary views (the query-dependent view and the query-independent view), and then aggregates them into a consensus ranking. K. Sarkar [13] presented a summarization approach based on sentence clustering for multi-document text, in which sentences are clustered using a similarity-histogram-based sentence-clustering algorithm to identify multiple sub-topics from the input set of related documents, and the representative sentences are selected from the appropriate clusters to form the final summary. A. Kogilavani [14] presented multi-document summarization using clustering and feature-specific sentence extraction. V. K. Gupta [15] proposed a query-focused extractive summarization approach for English text where single-document summaries are combined using a sentence clustering method to generate the multi-document summary; semantic and syntactic similarity between sentences are used for the clustering. A. R. Deshpande [16] presented another multi-document English text summarization technique using a clustering method where documents are clustered using cosine similarity. A. Agrawal and U. Gupta [17] proposed an extractive clustering-based technique for single-document summarization; this approach summarizes English text using the K-means clustering algorithm. M. A. Uddin [18] presented a multi-document text summarization for Bengali text where a TF-based technique is used for extracting the most significant contents from a set of Bengali documents. Another work, proposed by M. Ibrahim [19] for Bengali document summarization, is based on sentence scoring and ranking.

III. PROPOSED TECHNIQUE

The Bengali document summarizer is a natural language processing (NLP) application of data mining which is proposed to extract the most important information from the document(s). We use a sentence clustering approach to generate the summary from both single and multiple documents. In this technique, the common preprocessing steps including noise removal, tokenization, stop word removal and stemming [20] are used. TF*IDF [21] is used for calculating each word's score. Then, the score of each sentence is calculated as the total sum of its words' scores together with its position [19]. If the sentence contains any cue word or skeleton word, the score is increased by 1 [19]. Next, the document is stored in a separate file with its corresponding sentence scores. For multiple documents, the preprocessing, word scoring and sentence scoring operations are repeated for each document, one by one, and the documents are stored in the same file, i.e. all the documents are merged. After that, the sentences are sorted in descending order of score, and the highest score is taken as centroid 1 and the lowest score as centroid 2 to apply the K-means clustering algorithm [1], [22]. Then, the top K sentences are extracted from each cluster and the final summary is generated. Here, K can be measured as 30% of the sentences of the original merged document.
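The scoring stage of the proposed technique can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: tokenization is assumed already done, the cue/skeleton word lists are supplied by the caller, and the IDF logarithm is taken as base 10 (which the worked example in Section IV suggests).

```python
import math
from collections import Counter

def word_scores(sent_tokens):
    """Per-sentence TF/IDF as described: TF = tf(w, S) * log10(N / n_w + 1)."""
    n = len(sent_tokens)
    df = Counter(w for ws in sent_tokens for w in set(ws))  # n_w per word
    return [{w: c * math.log10(n / df[w] + 1) for w, c in Counter(ws).items()}
            for ws in sent_tokens]

def sentence_scores(sent_tokens, cue_words=(), skeleton_words=()):
    scores = []
    for pos, (ws, tfidf) in enumerate(zip(sent_tokens, word_scores(sent_tokens)),
                                      start=1):
        sc = sum(tfidf.values()) + 1 / math.sqrt(pos)  # word scores + position value
        if any(w in cue_words for w in ws):
            sc += 1                                    # cue-word bonus
        if any(w in skeleton_words for w in ws):
            sc += 1                                    # skeleton (headline) word bonus
        scores.append(sc)
    return scores
```

Sorting the returned scores in descending order yields the merged list to which the two-centroid K-means step is then applied.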
A. Flow Chart of the Proposed Technique

The proposed extractive technique for summarizing Bengali document(s) is illustrated with the flow chart representation in Fig. 1. (Abbreviations used in the figure: TF/IDF – Term Frequency/Inverse Document Frequency; SC – sentence score; PV – position value.)

Figure 1. Extractive Bengali document(s) summarization technique

B. Pseudo-code of the Proposed Technique

TEXTSUM ( ) is the caller function that calls the two procedures Stemming ( ) and k-means_algorithm ( ) to generate the final summary.

Procedure: TEXTSUM (SC, COUNT, K, N, TotS, CHECK)
i. [Start with COUNT] For single-document summarization, set COUNT := 1; for multi-document summarization, set COUNT := N, where N is the number of documents.
ii. [For the current document] Count the total number of sentences, TotS.
iii. Set CHECK := 1.
iv. Repeat steps (v) to (xvi) while CHECK <= TotS.
v. Remove noise from sentence S.
vi. Tokenize each sentence S.
vii. Optionally remove stop words from S.
viii. Call procedure Stemming ( ).
ix. [Calculate the score TF of each word using TF/IDF]
    TF := tf(w, S) × idf, where idf := log(N/n_w + 1).
x. [Calculate the score (SC) of each sentence, where P = position of the sentence]
    SC_S := Σ TF + PV, where PV := 1/√P.
xi. [Check S for cue words] If S contains a cue word, then increase SC := SC + 1.
xii. [Check S for skeleton words] If S contains a skeleton word, then increase SC := SC + 1.
xiii. CHECK := CHECK + 1, go to step (iv).
xiv. Store the document in a file with the scores.
xv. COUNT := COUNT - 1.
xvi. [Check for another document to calculate sentence scores] If COUNT != 0, then go to step (ii); else, go to step (xvii).
xvii. Sort the stored sentence scores in decreasing order.
xviii. [Cluster the document using the K-means algorithm] Call procedure k-means_algorithm ( ).
xix. Extract the top K sentences from each cluster to get the final summary of the document(s).
xx. END.

Procedure: Stemming ( )
i. Read a token from the line.
ii. Load the suffix lists from the stored file.
iii. [Check the suffix list against the input token] If the token matches any suffix, then discard the suffix and mark the token as a root word.
iv. Repeat step (i) until all tokens are processed.

Procedure: k-means_algorithm (m1 & m2, C1 & C2, d1 & d2, av1, av2)
i. [Initialize centroids m1 & m2] m1 := highest score & m2 := lowest score.
ii. [Measure the distance from the centroids to each sentence score SC_S] d1 := m1 - SC_S, d2 := m2 - SC_S.
iii. [Check whether the distances are negative] If d1 < 0 then d1 := -d1; if d2 < 0 then d2 := -d2.
iv. [Create clusters] If d1 < d2 then add the sentence to C1, else add it to C2.
v. [Calculate the new means] Find the average values (av1, av2) of clusters C1 & C2.
vi. [Assign the average values to the means] m1 := av1 & m2 := av2.
vii. Repeat steps (ii) to (vi) until the values of m1 and m2 remain unchanged in two consecutive iterations.
viii. Return.

1) Explanation of the Pseudo-code

The steps of the proposed method are discussed here in detail:

i. Preprocessing
The actions performed in this step are:
• Noise removal is concerned with removing the header, footer, etc. from the document.
• Tokenization separates each word into lexical form. Words are separated by কমা, দাঁিড় (the comma and the Bengali full stop) etc.
• The stop words are function words like eবং, aথবা, িকn, aনয্থায়, িকংবা, মাt etc., and they may be removed.
• Stemming – words appearing in different forms in the same document need to be converted to their original form for simplicity; for example, বাংলােদেশ, বাংলােদেশর, বাংলােদশেক, বাংলােদেশo etc. should be converted to their original form বাংলােদশ. In the proposed technique, we use the stemming rules illustrated in [20]. Consider the example কিরম কাজিট করেছ; after stemming it becomes কিরম কাজ কর. Some examples of word stemming are shown in Table I.
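The Stemming ( ) procedure above amounts to longest-match suffix stripping. A small sketch follows, with a hypothetical romanized suffix list standing in for the stored suffix file; the real rules and suffixes come from the stemmer in [20].

```python
# Hypothetical romanized suffixes for illustration only; the proposed
# technique loads its Bengali suffix list from a stored file per [20].
SUFFIXES = ["gulo", "der", "ta", "ra", "e"]

def stem(token):
    # Try the longest suffixes first, strip the first match, and treat
    # the remainder as the root word; otherwise return the token unchanged.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token
```

Checking longer suffixes first matters: otherwise a short suffix could pre-empt a longer one that also matches, leaving a wrong root.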
TABLE I. SOME WORD STEMMING EXAMPLES

Suffix | Original words | After stemming
i | eটাi, েসটাi | eটা, েসটা
েতা | হয়েতা, করলেতা | হয়, করল
েক | eটােক, আমােক | eটা, আমা
ে◌.ে◌ -> ◌া. | েহেস, েনেচ | হাসা, নাচা
ে◌.ে◌িছেলন -> ◌া. | েহেসিছেলন, েনেচিছেলন | হাসা, নাচা

ii. Scoring Process
• Word scoring technique (TF/IDF): This approach is used for scoring the words. If there are more unique words in a given sentence, then the sentence is relatively more important [23]. The TF/IDF score is calculated as follows:

TF = tf(w, S) × idf
idf = log(N/n_w + 1)

where
TF = term frequency score of the word,
tf(w, S) = number of occurrences of the word w in the sentence S,
idf = inverse document frequency,
N = total number of sentences in the text,
n_w = number of sentences in which the word w occurs.

• Sentence scoring (SC):

SC_S = Σ TF + PV
PV = 1/√P

Here, P = position of the sentence and PV = position value. For example, PV = 1 if it is the first sentence of a document.

• Cue words – If any cue word (e.g., েমাটকথা, aবেশেষ, iিতমেধয্, েযেহতু etc.) is found in a sentence, then the score of the sentence is incremented by 1.
• Skeleton words – If any skeleton word (e.g., a word from the headline of the document) is found, then the score of the sentence is again incremented by 1.
• Store the document in a separate file with the corresponding sentence scores for further processing. For multiple documents, all the previous processes are applied to each document and the results are stored in the same file, where they are then merged. For further processing, the scores are sorted in decreasing order.

iii. Applying the K-means clustering algorithm
After sorting the scores, the lowest and the highest scores are assigned as the two centroids for the K-means algorithm, and the distance from each centroid to each sentence is measured; the nearest centroid defines the cluster of a sentence. Thus, two clusters are created, and for the next iteration the centroid values are updated: the average value of each cluster is calculated and assigned as the new centroid. This process is repeated until two consecutive iterations produce the same result. At last, the top K sentences are extracted from each cluster to produce the final summary.

IV. EXPLANATION WITH AN EXAMPLE

Let us consider two Bengali documents to be summarized. The first document, which contains 10 sentences and 131 words, is [24]:

বাংলােদেশর ত ণ েpাgামার o pযু িkগত িবষেয় ঈষর্নীয় সাফলয্লাভকারীেদর িনেয় পৃ িথবী জু েড়i pশংসা চলেছ। বাংলােদশ ধীের ধীের eিগেয় যােc pযু িkগত uৎকষর্তার িদেক। eখনকার িশkাথ o ত ণ pজn িবjান o িবjান িনভর্ র পড়ােশানা িনেয় aেনক েবিশ সেচতন। সরকার তাi ei েktিটেক আেরা বড় eকিট pয্াটফমর্ িহেসেব দাঁড়া করােত চান। িবjান o pযু িk িনেয় যারা খবর রােখন, তারা িন য়i সামািজক মাধয্ম িকংবা নানা ধরেণর খবের েনেছন িডিজটাল oয়াlর্ 2016 eর। আগামী 19-21 aেkাবর 2016 বসু nরা কনেভনশন িসিটেত aনু ি ত হেব েদেশর সবেচেয় বড় আiিসিট iেভn িডিজটাল oয়াlর্ 2016। eখােন থাকেব নতু ন নতু ন udাবন o pযু িkগত নানা আেলাচনাo থাকেছ ei আেয়াজেন। ei aনু ােন নতু ন udাবন কয্াটাগরীেত sু ল, কেলজ o িব িবদয্ালেয়র িবjানমন িশkাথ রা তােদর pকl uপsাপেনর সু েযাগ পােব। তারা েযখােন ei pদশর্নীিট করেব তার নাম হেc, “iেনােভশন েজান”। ei দািয়t o তttাবধােন থাকেছ গল েডেভলপার gপস বাংলা।

The second document on the same topic, which contains 14 sentences and 219 words, is [25]:

‘ননsপ বাংলােদশ’ েsাগানেক সামেন েরেখ বাংলােদেশ হল িতন িদনবয্াপী তথয্ o েযাগােযাগ pযু িk িবষয়ক েদেশর সবেচেয় বড় েমলা ‘িডিজটাল oয়াlর্ -2016’। বু ধবার রাজধানী ঢাকার inারনয্াশনাল কনেভনশন িসিট বসু nরায় (আiিসিসিব) e েমলার uেdাধন কেরন pধানমntী েশখ হািসনা। িতন িদনবয্াপী ei েমলা আগামী kবার পযর্n pিত িদন সকাল 10টা েথেক রাত 8টা পযর্n সকেলর জনয্ েখালা থাকেব। uেdাধনী aনু ােন pধান aিতিথর ভাষেণ pধানমntী েশখ হািসনা বেলেছন, “আiিসিট বয্বহাের ত ণ জনেগা ী িনেয় আমরা ‘লািনর্ং aয্াn আিনর্ং’ pকl চালু কেরিছ। pকেlর আoতায় 50 হাজার ত ণ-ত ণীর pিশkেণর বয্বsা করা হেয়েছ।” িডিজটাল িসিকuিরিট aয্াk-2016 করা হেc জািনেয় েশখ হািসনা বেলন, “আoয়ািম িলগ সরকার েদেশর sােথর্ aথর্ বয্য় কের সাবেমিরন েকবেলর মাধয্েম যু k হেয়েছ। আমরা িশkা বয্বsা unত করেত সারা েদেশ 30 হাজার মািlিমিডয়া kাস চালু কেরিছ। 2018 সােলর মেধয্ আরo দশ হাজার েশখ রােসল িডিজটাল লয্াব চালু করা হেব।” iেতােমেধয্ েদেশর pায় সব uপ-েজলােতi ি -িজ েপৗঁেছ িগেয়েছ। আগামী 2017 সােলর মেধয্ েফার-িজ চালু হেয় যােব বেলo জানান pধানমntী। বাংলােদশ সংবাদ সংsা (বাসস) জািনেয়েছ, িডিজটাল িবষয়ক নয়া pযু িk o aিভনবt িবষেয় ধারণা o তথয্ আদানpদােনর জনয্ ei েমলা। 40িট মntণালয় িডিজটাল বাংলােদশ িহেসেব কী কী পিরেষবা িদেc তার খুঁ িটনািট তু েল ধরা হেব ei েমলায়। eেত শতািধক েবসরকাির pিত ান তােদর িডিজটাল কাযর্kম তু েল ধরেব। িতন িদনবয্াপী েমলায় মাiেkাসফ্ট, েফসবু ক, eকেসন্চার, িব বয্া , েজডিটi, য়াoেয়-সহ িব pিত ােনর 43 জন িবেদিশ বkা-সহ i শতািধক বkা 18িট েসশেন aংশ েনেবন।

The score calculation of the words for the first sentence of the first document is shown in Table II.

TABLE II. WORD SCORE OF THE FIRST SENTENCE

Word | After stemming | Occurrences in the sentence, tf(w, S) | Sentences in which the word occurs, n_w | Word score, TF = tf(w, S) × log(N/n_w + 1)
বাংলােদেশর | বাংলােদশ | 1 | 1 | 1.04
ত ণ | ত ণ | 1 | 2 | 0.78
েpাgামার | েpাgাম | 1 | 1 | 1.04
o | o | 1 | 6 | 0.43
pযু িkগত | pযু িk | 1 | 3 | 0.64
িবষেয় | িবষয় | 1 | 1 | 1.04
ঈষর্নীয় | ঈষর্া | 1 | 1 | 1.04
সাফলয্লাভকারীেদর | সাফলয্ | 1 | 1 | 1.04
িনেয় | িনেয় | 1 | 3 | 0.64
পৃ িথবী | পৃ িথবী | 1 | 1 | 1.04
জু েড়i | জু েড় | 1 | 1 | 1.04
pশংসা | pশংসা | 1 | 1 | 1.04
চলেছ | চল | 1 | 1 | 1.04

Similarly, using TF/IDF, all the word scores are calculated. Then, the score of every sentence is calculated by summing up the constituent words' scores with their position using the mentioned formula. Also, the score of a sentence is increased when it contains any cue word or skeleton word or both. After obtaining the sentence scores of all the documents, they have been sorted in a merged file, which is shown in Table III for the considered example of two documents having 24 sentences in total.
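The printed scores in Table II can be reproduced directly from the TF/IDF formula, and doing so shows that the logarithm is base 10: with N = 10 sentences in the first document and tf(w, S) = 1 for every listed word, log10(N/n_w + 1) matches each rounded value. A quick check:

```python
import math

def tfidf(tf, n_w, n_sentences):
    # TF = tf(w, S) * log10(N / n_w + 1), as used in Table II
    return tf * math.log10(n_sentences / n_w + 1)

# Reproducing the distinct rows of Table II (tf = 1 in each case):
printed = {1: 1.04, 2: 0.78, 3: 0.64, 6: 0.43}  # n_w -> printed score
for n_w, score in printed.items():
    assert round(tfidf(1, n_w, 10), 2) == score
```

For instance, বাংলােদেশর occurs in only one of the ten sentences, giving 1 × log10(10/1 + 1) = log10(11) ≈ 1.04, exactly as printed.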
TABLE III. MERGED FILE WITH SENTENCE SCORES

Sentence notation | Sentence score
SC(1) | 28.36
SC(2) | 26.14
SC(3) | 24.77
SC(4) | 24.68
SC(5) | 24.50
SC(6) | 21.52
SC(7) | 20.59
SC(8) | 20.26
SC(9) | 19.39
SC(10) | 17.75
SC(11) | 15.67
SC(12) | 15.37
SC(13) | 14.87
SC(14) | 14.55
SC(15) | 13.25
SC(16) | 12.85
SC(17) | 11.67
SC(18) | 11.53
SC(19) | 11.03
SC(20) | 10.45
SC(21) | 9.74
SC(22) | 9.68
SC(23) | 9.47
SC(24) | 8.25

(Each row of the original Table III also lists the corresponding Bengali sentence; the sentences are those of the two example documents of this section.)

Then, the K-means clustering algorithm has been applied with m1 = 28.36 and m2 = 8.25 as the initial centroids. As discussed above, the final iteration is shown in Table IV.

TABLE IV. FINAL ITERATION

Cluster-1 (m1 = 23.25), sentence scores: 29.03, 26.78, 25.35, 25.26, 25.09, 22.07, 21.08, 20.26, 19.39, 18.16, 16.03, 15.73, 14.87, 14.55, 13.56.
Cluster-2 (m2 = 11.36), sentence scores: 12.85, 11.67, 11.53, 11.28, 10.45, 9.97, 9.71, 9.69, 8.25.

(Each score in the original Table IV is likewise paired with its Bengali sentence.)

After extracting the top 5 (K) sentences from each cluster, the finally produced summary, which contains 10 sentences and 173 words, is shown below:

িডিজটাল িসিকuিরিট aয্াk-2016 করা হেc জািনেয় েশখ হািসনা বেলন, “আoয়ািম িলগ সরকার েদেশর sােথর্ aথর্ বয্য় কের সাবেমিরন েকবেলর মাধয্েম যু k হেয়েছ। িতন িদনবয্াপী েমলায় মাiেkাসফ্ট, েফসবু ক, eকেসন্চার, িব বয্া , েজডিটi, য়াoেয়-সহ িব pিত ােনর 43 জন িবেদিশ বkা-সহ i শতািধক বkা 18িট েসশেন aংশ েনেবন। ‘ননsপ বাংলােদশ’ েsাগানেক সামেন েরেখ বাংলােদেশ হল িতন িদনবয্াপী তথয্ o েযাগােযাগ pযু িk িবষয়ক েদেশর সবেচেয় বড় েমলা ‘িডিজটাল oয়াlর্ -2016’। িতন িদনবয্াপী ei েমলা আগামী kবার পযর্n pিত িদন সকাল 10টা েথেক রাত 8টা পযর্n সকেলর জনয্ েখালা থাকেব। uেdাধনী aনু ােন pধান aিতিথর ভাষেণ pধানমntী েশখ হািসনা বেলেছন, “আiিসিট বয্বহাের ত ণ জনেগা ী িনেয় আমরা ‘লািনর্ং aয্াn আিনর্ং’ pকl চালু কেরিছ। আমরা িশkা বয্বsা unত করেত সারা েদেশ 30 হাজার মািlিমিডয়া kাস চালু কেরিছ। 2018 সােলর মেধয্ আরo দশ হাজার েশখ রােসল িডিজটাল লয্াব চালু করা হেব। eখনকার িশkাথ o ত ণ pজn িবjান o িবjান িনভর্ র পড়ােশানা িনেয় aেনক েবিশ সেচতন। ei aনু ােন নতু ন udাবন কয্াটাগরীেত sু ল, কেলজ o িব িবদয্ালেয়র িবjানমন িশkাথ রা তােদর pকl uপsাপেনর সু েযাগ পােব। আগামী 2017 সােলর মেধয্ েফার-িজ চালু হেয় যােব বেলo জানান pধানমntী।
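The two-centroid K-means pass can be re-run directly on the 24 scores of Table III. The sketch below follows the pseudo-code (absolute-distance assignment, mean update, stop when the centroids repeat); starting from the highest and lowest scores, it converges in two passes. The resulting split is this recomputation's own output under those assumptions, not a transcription of the printed table.

```python
def kmeans_1d(scores, m1, m2):
    """Two-centroid K-means on sentence scores, per the pseudo-code:
    assign by absolute distance, recompute means, stop when centroids repeat."""
    while True:
        c1 = [s for s in scores if abs(s - m1) <= abs(s - m2)]
        c2 = [s for s in scores if abs(s - m1) > abs(s - m2)]
        new1, new2 = sum(c1) / len(c1), sum(c2) / len(c2)
        if (new1, new2) == (m1, m2):   # unchanged in consecutive iterations
            return c1, c2
        m1, m2 = new1, new2

# The 24 sorted sentence scores of Table III
scores = [28.36, 26.14, 24.77, 24.68, 24.50, 21.52, 20.59, 20.26, 19.39,
          17.75, 15.67, 15.37, 14.87, 14.55, 13.25, 12.85, 11.67, 11.53,
          11.03, 10.45, 9.74, 9.68, 9.47, 8.25]
c1, c2 = kmeans_1d(scores, max(scores), min(scores))  # m1 = 28.36, m2 = 8.25
```

On these scores the run converges with 9 sentences in the high-score cluster (mean ≈ 23.36) and 15 in the low-score cluster (mean ≈ 12.41).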
V. EXPERIMENTAL RESULT AND DISCUSSION

The proposed extractive Bengali document(s) summarization technique is implemented using the IDE for Java applications, NetBeans IDE 8.0. The performance analysis is performed on a 2.50 GHz Intel® Core™ i5 CPU with 4 GB RAM running the Windows 7 Ultimate operating system. The comparison of the proposed technique with the existing approaches is shown in Table V. The proposed technique summarizes both single and multiple Bengali documents through noise removal, tokenization and stemming, scoring each word by TF/IDF and then each sentence, and finally applying the K-means clustering algorithm.

TABLE V. COMPARISON WITH EXISTING METHODS

Author name / Technique | Language type | Document type | Major Operations
K. Sarkar [13] | English | Multiple | Preprocessing; clustering using cosine similarity; cluster ordering
T. J. Siddiqui [15] | English | Multiple | Preprocessing; sentence scoring (feature based); clustering using syntactic & semantic similarity
A. R. Deshpande [21] | English (query based) | Multiple | Preprocessing (TF*IDF); sentence scoring (feature based); clustering using cosine similarity
A. Agrawal [17] | English | Single | Word scoring (TF*IDF); sentence scoring; K-means clustering
M. A. Uddin [18] | Bengali | Multiple | Preprocessing; sentence scoring (TTF); cosine similarity measure; A* algorithm
M. I. A. Efat [19] | Bengali | Single | Preprocessing; sentence scoring (feature based); sentence ranking
Proposed Technique | Bengali | Single or Multiple | Preprocessing (noise removal, tokenization, stemming); word scoring (TF/IDF); sentence scoring; K-means clustering algorithm

Several experiments have been conducted to evaluate the proposed technique, some on single-document and some on multi-document summarization. The proposed technique produces a final summary from single or multiple documents which contains 30% of the sentences of the original merged document. The proposed technique simply gives the expected performance in comparison to the existing approaches. The time complexity of the proposed technique is O(n), which offers linearity. The main pitfall is that, sometimes, the sequence of the summarized sentences is not synchronized. However, the technique can be applied in various summarizing fields such as summarizing similar articles from different newspapers, blogs, books, etc.

VI. CONCLUSION

In this paper, an extractive-based Bengali text summarization technique has been proposed both for single or multiple

REFERENCES

[2] R. M. Chezian and G. Ahilandeeswari, "A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop", International Journal for Trends in Engineering & Technology, Vol. 3, Issue 3, India, March 2015.
[3] A. Totewar, Data Mining: Concepts and Techniques. [Online]. Available: http://www.slideshare.net/akannshat/data-mining-15329899, India, 2016.
[4] R. Pradheepa and K. Pavithra, "A Survey on Overview of Data Mining", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 8, India, 2016.
[5] Unnamed, Data Mining – Applications & Trends. [Online]. Available: https://www.tutorialspoint.com/data_mining/dm_applications_trends.htm
[6] F. El-Ghannam and T. El-Shishtawy, "Multi-Topic Multi-Document Summarizer", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 5, No. 6, India, 2013.
[7] A. Nenkova and K. McKeown, "Automatic Summarization", Foundations and Trends® in Information Retrieval, Vol. 5, Nos. 2–3, pp. 103–233, Boston – Delft, 2011.
[8] V. Gupta and G. S. Lehal, "A Survey of Text Summarization Extractive Techniques", Journal of Emerging Technologies in Web Intelligence, Vol. 2, No. 3, India, 2010.
[9] J. Zhang, L. Sun and Q. Zhou, "Cue-based Hub-Authority Approach for Multi-document Text Summarization", IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 642–645, China, 2005.
[10] Y. Ouyang, W. Li and Q. Lu, "An Integrated Multi-document Summarization Approach based on Word Hierarchical Representation", Proceedings of the ACL-IJCNLP, pp. 113–116, China, 2009.
[11] X. Li, J. Zhang and M. Xing, "Automatic Summarization for Chinese Text based on Sub Topic Partition and Sentence Features", IEEE 2nd International Symposium on Intelligence Information Processing and Trusted Computing (IPTC), China, 2011.
[12] P. Hu, T. He and H. Wang, "Multi-View Sentence Ranking for Query-Biased Summarization", IEEE International Conference on Computational Intelligence and Software Engineering (CiSE), China, Dec. 2010.
[13] K. Sarkar, "Sentence Clustering-based Summarization of Multiple Text Documents", TECHNIA – International Journal of Computing Science and Communication Technologies, Vol. 2, No. 1, India, 2009.
[14] A. Kogilavani and P. Balasubramani, "Clustering and Feature Specific Sentence Extraction Based Summarization of Multi-Documents", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 2, No. 4, India, August 2010.
[15] T. J. Siddiqui and V. K. Gupta, "Multi-document Summarization using Sentence Clustering", IEEE Proceedings of the 4th International Conference on Intelligent Human Computer Interaction, India, 2012.
[16] R. Kamal, Bangla-Stemmer. [Online]. Available: https://github.com/rafikamal/Bangla-Stemmer
[17] A. Agrawal and U. Gupta, "Extraction based approach for text summarization using k-means clustering", International Journal of Scientific and Research Publications, Vol. 4, Issue 11, India, 2014.
[18] M. A. Uddin, K. Z. Sultana and M. A. Alom, "A Multi-Document Text Summarization for Bengali Text", IEEE International Forum on Strategic Technology (IFOST), Bangladesh, 2014.
[19] M. I. A. Efat, M. Ibrahim and H. Kayesh, "Automated Bangla Text Summarization by Sentence Scoring and Ranking", IEEE International Conference on Informatics, Electronics & Vision (ICIEV), Bangladesh,
2013.
documents. In this summarization, some important sentences are [20] A. Mhatre, Implementation of k-means algorithm in C++. [Online].
extracted from the original document(s). We have compared the Available: http://ankurm.com/implementation-of-k-means-algorithm-in-c/
results with different extractive techniques and also measured the [21] A. R. Deshpande, Lobo L. M. R. J, “Text Summarization using
run-time complexity that shows the performance of the proposed Clustering Technique”, International Journal of Engineering Trends and
technique is improved. According to the result of the proposed Technology (IJETT) – Vol. 4 Issue 8, India, 2013.
[22] A. H. Witten, Text mining. [Online]. Available:
technique, we can conclude that it reduces the redundancy and http://www.cos.ufrj.br/~jano/LinkedDocuments/_papers/aula13/04-IHW
provides better summarization. How to measure similarities is Textmining.pdf
also a crucial issue in sentence clustering based summarization [23] R. Ferreira, L. de S. Cabral, R. D. Lins , G. P. e Silva, F. Freitas , G. D.
approach. The better similarity measure will improve the C. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, “Assessing sentence
clustering performance and this may improve the summarization scoring techniques for extractive text summarization”, ELSIVIER
International Journal of Expert systems with Applications, Netherlands,
performance. We can measure the relevancy of the sentences 2013.
using syntactic and semantic similarity in future. [24] আহনাফ রাতু ল (2016), হেয়েছ েদেশর সবচাiেত বড় আiিসিট iেভn, [Online].
Available: http://www.bigganprojukti.com/?p=76344
REFERENCES [25] Anondobazar (2016), বাংলােদেশ সবেচেয় বড় pযু িk েমলা‘ িডিজটাল oয়াlর্ -2016.
[1] Unnamed, Big Data. [Online]. Available: [Online]. Available: http://www.anandabazar.com/bangladesh-
http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data news/bangladesh-s-biggest-technology-fair-digital-world-2016-kick-off-bng-
dgtl-1.498224
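As an illustrative sketch, the pipeline compared in the table above (preprocessing, TF*IDF word scoring, sentence scoring, K-means clustering, and a 30% extraction ratio) can be written as a small self-contained Python program. This is not the authors' implementation: the period-based sentence splitter and whitespace tokenizer are naive stand-ins, and a real Bengali system would plug in a proper segmenter and a stemmer such as Bangla-Stemmer [16].

```python
import math
import random

def preprocess(text):
    """Very naive preprocessing: split into sentences, then tokenize.
    A real Bengali pipeline would also remove noise, drop stop words,
    and stem each token."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [(s, s.lower().split()) for s in sentences]

def tfidf_vectors(sents):
    """One TF*IDF vector per sentence over the shared vocabulary."""
    vocab = sorted({w for _, toks in sents for w in toks})
    n = len(sents)
    df = {w: sum(1 for _, toks in sents if w in toks) for w in vocab}
    return [[toks.count(w) / len(toks) * math.log(1 + n / df[w]) for w in vocab]
            for _, toks in sents]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=50):
    """Plain k-means: assign each vector to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    random.seed(0)                       # deterministic for the sketch
    centroids = random.sample(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iters):
        assignment = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def summarize(text, ratio=0.30):
    """Keep ~`ratio` of the sentences: one cluster per summary sentence,
    picking from each cluster the sentence with the highest total TF*IDF."""
    sents = preprocess(text)
    vectors = tfidf_vectors(sents)
    k = max(1, round(len(sents) * ratio))
    assignment = kmeans(vectors, k)
    picked = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            picked.append(max(members, key=lambda i: sum(vectors[i])))
    return [sents[i][0] for i in sorted(picked)]   # restore original order
```

Emitting the picked indices in sorted order is a simple way to mitigate the sequencing pitfall noted above, since the selected sentences then follow their order in the source document.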