Abstract-Research on text summarization has been an interesting topic over the years, as shown by the numerous papers discussing approaches, challenges and trends. The goal of this paper is to define a measurement for text summarization using a semantic analysis approach for documents in the Indonesian language. The applied measurement requires an Indonesian version of WordNet, which has been implemented roughly. The main idea of semantic analysis is to obtain the similarity between sentences by calculating the vector values of each sentence with the title. WordNet is needed to define the depth of each word, which is used to compute word similarity. Combining all required formulas and calculations, a compact and precise summarization is produced without losing the gist of the document.

Index Terms-Text summarization, semantic analysis, sentence similarity, word similarity, Indonesian document

approach for Indonesian language-based documents. The problems addressed in this research are the limited research on text summarization for Indonesian documents and achieving an acceptable quality for the generated summary. Therefore, semantic analysis is chosen as the approach, with high expectations of achieving the best summary quality.

The rest of this paper is organized as follows: Section 2 explains related works, significant features are presented in Section 3, word similarity measurement is explained in Section 4, Section 5 presents sentence similarity and ranking, the results and measurement comparison are given in Section 6, Section 7 gives the result, and Section 8 presents the conclusion and future work.
Figure 1. The Proposed Measurement Architecture

The purpose of the Feature Computation phase is to extract and calculate the sentence similarity value between each sentence and the title. The basic idea is to rank sentences by their relevance to the whole document, whose main meaning is implied in the title. This phase consists of two sub-measurements: Word Similarity Measurement and Sentence Similarity Measurement. The details of those sub-measurements are described in Sections 5 and 6.

In the Feature Ranking phase, the sentence similarity values are ranked in descending order, from the highest to the lowest value. Afterwards, sentences that are considered irrelevant are eliminated, since a general summary extracts only the gist of the document.

To make the summary more readable, another process is required. In the Generating Summary phase, the ranked sentences are sorted by their position in the document. This phase is required to generate a readable summary with a good flow of information. In conclusion, the overall phases of the proposed measurement produce the relevant sentences in a readable summary format.

IV. TEXT SUMMARIZATION PRE-PROCESSING FEATURES

Regarding the scope, topic and proposed method of this research, the proposed text summarization pre-processing features are defined as follows: WordNet, N-Gram, Stemming and Stop Words.

n-gram is from 1, referred to as a unigram, 2, referred to as a bigram, and 3, referred to as a trigram. Brown and deSouza [20] discussed a classification of certain n-grams from a stream of text. They extract the valid n-grams based on the frequency of words or phrases in the document plus their relations with adjacent words. However, within the scope of this paper, an n-gram is considered valid if it is found in WordNet.

C. Stemming

Stemming is the process of reducing affixed words to their root form, the so-called stem, using certain rules. In this paper, a stemming algorithm is used in the word similarity measurement process to normalize affixed words that share a stem, so that similarity is calculated on their base form rather than their actual form. Many approaches to Indonesian stemming have been proposed [21], [22]. A survey paper on Indonesian stemming compiled statistics comparing performance and stemming correction percentage. It concluded that the stemming algorithm proposed by Nazief and Adriani (1996) was the most efficient, with a correction percentage of 93%. Therefore, this paper uses the stemming algorithm proposed by Nazief and Adriani [22].

D. Stop Words

Stop words are a list of overly common words in corpora that are filtered out because they are not relevant in natural language data or text. Including stop words in text could bias the tangible value of sentences or a corpus. The results of ignoring and including stop words in the pre-processing phase proved significantly different, and ignoring stop words gives a better subjectivity measurement result.

V. WORD SIMILARITY MEASUREMENT

Semantic similarity is the idea of assigning metric-based values to documents or terms based on their meaning. To perceive the semantic value of documents or
terms, a word similarity measurement process should be adopted first. There are two main features for word similarity: depth of words and the Wu & Palmer measurement.

A. Depth of Words

To obtain the semantic similarity of two words using a synthetical WordNet, defining the depth of those words is prioritized before further calculation. The depth of words is acquired by traversing the shortest-distance path from one word to another using the synonym set (synset) graph. Essentially, the more paths that are traversed from one word to another, the greater the depth value. The calculation is defined in formula (1):

$$\mathrm{depth}(w_1, w_2) = \min(\mathrm{paths}(w_1, w_2)) \qquad (1)$$

where w1 and w2 are the words; paths(w1, w2) is the collection of path values between w1 and w2; and min(paths(w1, w2)) is the minimum value of paths(w1, w2). For example, the paths between the words makan and minum are four (4), three (3) and six (6). Since the depth of words takes the shortest-distance path between two words, the value of depth(makan, minum) is three.

Title: Prediksi Squad Liverpool Kontra Indonesia XI

Sentence | Sentence Similarity Value
Liverpool akan melawan Indonesia XI dalam laga persahabatan di Stadion Gelora Bung Karno | 0.7364389408902138
Pelatih "The Reds" Brendan Rodgers kemungkinan besar akan melakukan banyak eksperimen pada pertandingan nanti | 0.863982746507212
Bleacherreport memiliki prediksi skuad Liverpool yang akan diturunkan | 0.9057639524245196
Rodgers sepertinya akan menggunakan formasi 4-2-2-1-1 untuk membungkam Boaz Solossa dan kawan-kawan | 0.8022543177694139

Figure 2. Example Result of Sentence Similarity Measurement using Cosine Similarity method

A. Semantic Vector Derivatization

To illustrate the details of sentence similarity computation, we provide two sentences as follows:
• S1 = Spesies lemur tikus baru ditemukan
• S2 = Lemur adalah anggota keluarga primata seperti halnya monyet

For those sets, tokenization and stop word removal should be performed. To equate the dimensions of both semantic vectors for further calculation, their lengths must follow formula (3):
Table I
SEMANTIC VECTOR DERIVATIZATION PROCESS

          | Spesies | lemur | tikus | ditemukan | Lemur | anggota | keluarga | primata | halnya | monyet
Spesies   | 1       | 0.13  | 0.09  | 0.18      | 0.13  | 0.33    | 0.5      | 0.13    | 0.22   | 0.1
lemur     | 0.13    | 1     | 0.13  | 0.13      | 1     | 0.13    | 0.13     | 1       | 0.13   | 0.13
tikus     | 0.09    | 0.13  | 1     | 0.09      | 0.13  | 0.09    | 0.09     | 0.13    | 0.09   | 0.07
ditemukan | 0.18    | 0.13  | 0.09  | 1         | 0.13  | 0.18    | 0.2      | 0.13    | 0.18   | 0.09
SV1,2     | 1       | 1     | 1     | 1         | 1     | 0.33    | 0.5      | 1       | 0.22   | 0.13
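The last row of Table I matches the largest value in each column of the four rows above it. The following is a minimal sketch under that assumption; the function name `semantic_vector` and the dictionary layout are hypothetical, and the numbers are copied from Table I rather than computed from WordNet.

```python
# Word-to-word similarity values copied from Table I.
# Rows: words of S1 (stop words removed).
# Columns: words of S1 followed by words of S2 (stop words removed).
columns = ["Spesies", "lemur", "tikus", "ditemukan", "Lemur",
           "anggota", "keluarga", "primata", "halnya", "monyet"]

table_1 = {
    "Spesies":   [1, 0.13, 0.09, 0.18, 0.13, 0.33, 0.5, 0.13, 0.22, 0.1],
    "lemur":     [0.13, 1, 0.13, 0.13, 1, 0.13, 0.13, 1, 0.13, 0.13],
    "tikus":     [0.09, 0.13, 1, 0.09, 0.13, 0.09, 0.09, 0.13, 0.09, 0.07],
    "ditemukan": [0.18, 0.13, 0.09, 1, 0.13, 0.18, 0.2, 0.13, 0.18, 0.09],
}


def semantic_vector(similarity_rows):
    """Assumed rule: each entry is the column-wise maximum of the rows."""
    rows = list(similarity_rows.values())
    return [max(column) for column in zip(*rows)]


sv = semantic_vector(table_1)
for word, value in zip(columns, sv):
    print(f"{word}: {value}")
```

Under this reading, each dimension of the semantic vector records how closely the best-matching word of the sentence resembles the corresponding word of the joint word list.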
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (4)$$

where the result is the distance between two points or vectors. The conversion formula from distance to similarity is given as follows:

$$\mathrm{sim}(p, q) = \frac{1}{1 + d(p, q)} \qquad (5)$$

where the constant of one in the denominator is used to eliminate possible zero values. The similarity value of s1 and s2 is 0.3721760683269597 using this Euclidean similarity.

2) Cosine Similarity: Unlike Euclidean distance, cosine similarity is a similarity measurement between two vectors that considers the cosine of the angle between them. The maximum value of this similarity is 1, since the cosine of the smallest angle, 0°, is 1. The formula is given as follows:

$$\mathrm{sim}(p, q) = \cos(\theta) = \frac{p \cdot q}{\|p\| \, \|q\|} \qquad (6)$$

By using this formula, the similarity value of s1 and s2 is 0.7261975526051178.

3) Jaccard Similarity: Jaccard similarity is a similarity measurement for evaluating the diversity of sets. In general, this similarity computes the distance between two sets by counting both matching and mismatching elements. This computation does not require WordNet as a lexical database, since the method is a statistical approach rather than semantic analysis. The formula is given as follows:

$$J(p, q) = \frac{|p \cap q|}{|p \cup q|} \qquad (7)$$

which can be derived into:

$$J(p, q) = \frac{M_{11}}{M_{10} + M_{01} + M_{11}} \qquad (8)$$

The Jaccard similarity value between s1 and s2 is 0.2.

C. Sentence Position

The gist of a document is also determined by the position of sentences in the document. A well-written document should place the main idea in one of these locations:
1) At the beginning of the paragraph (Deductive Paragraph)
2) In the middle of the paragraph
3) At the end of the paragraph (Inductive Paragraph)

Therefore, considering that the deductive paragraph is the most common of the three types, the first sentence of every paragraph or article should be included in the text summarization. These main sentences also help make the result more readable.

D. Sentence Ranking

The basic idea of text summarization is extracting the relevant information from a text or document. The extracted information or sentences should be ranked by their relevance to the essence of the text. To obtain a sentence's relevance value, the sentence similarity between each sentence in the content and the title is calculated. From that list of similarities, we eliminate sentences that are considered irrelevant by first calculating their mean using this equation:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} a_i \qquad (9)$$

where n is the size of the sentence list and a_i is the value of the ith sentence. Afterwards, sentences with values below the mean are considered irrelevant to the text and should not be included in the summarization. The relevant sentences are then sorted by their position in the text. Theoretically, these processes would produce a text summarization in a human-friendly readable format. Figure 3 illustrates an example of the sentence ranking result based on the sentence similarity value, whereas Figure 4 shows the list of sentences that are considered relevant to the document.

Sentence | Sentence Similarity Value
Bleacherreport memiliki prediksi skuad Liverpool yang akan diturunkan | 0.9057639524245196
Pelatih "The Reds" Brendan Rodgers kemungkinan besar akan melakukan banyak eksperimen pada pertandingan nanti | 0.863982746507212
Rodgers sepertinya akan menggunakan formasi 4-2-2-1-1 untuk membungkam Boaz Solossa dan kawan-kawan | 0.8022543177694139

Figure 3. Example Result of Sentence Ranking

VII. RESULT

Tests were performed by implementing text summarization based on the proposed measurements above and comparing the results. The comparison of results will be based
Title: Prediksi Squad Liverpool Kontra Indonesia XI

Sentence | Sentence Similarity Value
Bleacherreport memiliki prediksi skuad Liverpool yang akan diturunkan | 0.9057639524245196
Pelatih "The Reds" Brendan Rodgers kemungkinan besar akan melakukan banyak eksperimen pada pertandingan nanti | 0.863982746507212

Figure 4. Example Result of Sentence Ranking after eliminating Irrelevant Sentences

MK: DPD Berwenang Bahas UU Terkait Daerah

Mahkamah Konstitusi memutuskan bahwa Dewan Perwakilan Daerah (DPD) berwenang untuk ikut serta mengajukan dan membahas Rancangan Undang-Undang yang terkait daerah.

Dalam putusan yang dibacakan Rabu (27/3/2013), MK mengabulkan sebagian permohonan uji materi atas UU 27/2009 dan UU 12/2011.

Menurut MK, DPD juga memiliki hak menyusun program legislasi nasional (Prolegnas) sebab kedudukan DPD setara dengan Presiden dan DPR.

UU 27/2009 adalah tentang Majelis Permusyawaratan Rakyat, Dewan Perwakilan Rakyat, Dewan Perwakilan Daerah, dan Dewan Perwakilan Rakyat Daerah.

DPD hanya memiliki wewenang mengajukan RUU terkait daerah, yang mencakup otonomi, perimbangan keuangan antara pusat dan daerah, hubungan pemerintah pusat dan daerah, pembentukan dan pemekaran serta penggabungan daerah, serta pengelolaan sumber daya alam," ucap Akil. Menanggapi putusan ini, Ketua DPD Irman Gusman mengaku gembira dengan putusan MK yang revolusioner. "Ini hari bersejarah, sehingga pelaksanaan tupoksi DPD mendapat tempat sebagaimana seharusnya," kata Irman. Kuasa Hukum Pemohon, Todung Mulya Lubis, mengatakan bahwa putusan MK meluruskan kembali pasal 22 D UUD 1945, setelah memberikan hak DPD ikut mengusulkan, dan merancang UU. "MK memberikan hak kepada DPD, bersama DPR dan Presiden membahas Prolegnas (Program Legislasi Nasional) meskipun DPD tidak ikut dalam persetujuan," katanya.

Figure 5. Original document entitled "DPD Sambut Baik Putusan MK Soal Kewenangan Legislasi"

MK: DPD Berwenang Bahas UU Terkait Daerah

Mahkamah Konstitusi memutuskan bahwa Dewan Perwakilan Daerah (DPD) berwenang untuk ikut serta mengajukan dan membahas Rancangan Undang-Undang yang terkait daerah.

Dalam putusan yang dibacakan Rabu (27/3/2013), MK mengabulkan sebagian permohonan uji materi atas UU 27/2009 dan UU 12/2011.

Menurut MK, DPD juga memiliki hak menyusun program legislasi nasional (Prolegnas) sebab kedudukan DPD setara dengan Presiden dan DPR.

Sedangkan UU 12/2011 merupakan UU tentang Pembentukan Peraturan Perundang-Undangan.

Menanggapi putusan ini, Ketua DPD Irman Gusman mengaku gembira dengan putusan MK yang revolusioner.

Ini hari bersejarah, sehingga pelaksanaan tupoksi DPD mendapat tempat sebagaimana seharusnya, kata Irman.
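The three sentence similarity measures and the mean-based elimination described under Sentence Ranking can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the distance-to-similarity conversion is assumed to be 1 / (1 + d), and the scores are the sentence similarity values listed in Figure 2.

```python
import math


def euclidean_similarity(p, q):
    """Euclidean distance converted to a similarity (assumed 1 / (1 + d))."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return 1.0 / (1.0 + d)


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def jaccard_similarity(p, q):
    """Size of the intersection over size of the union of two token sets."""
    p, q = set(p), set(q)
    union = p | q
    return len(p & q) / len(union) if union else 0.0


def select_relevant(scores):
    """Drop sentences scoring below the mean; return indices in document order."""
    mean = sum(scores) / len(scores)
    return [i for i, score in enumerate(scores) if score >= mean]


# Sentence similarity values from Figure 2, in document order.
scores = [0.7364389408902138, 0.863982746507212,
          0.9057639524245196, 0.8022543177694139]

# Ranking: sentence indices from the highest to the lowest value.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Elimination and re-sorting by position in the document.
kept = select_relevant(scores)
print(ranked, kept)
```

With these four values the mean is about 0.827, so only the second and third sentences remain, which agrees with the two sentences shown in Figure 4.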