Abstract— Text summarization is the process of distilling important information from various sources to produce a summary for personal use or for specific tasks. Summarizing is usually taught in language lessons; unfortunately, the foreign language proficiency of Indonesians, particularly in English, is very low. Summarizing English texts is therefore difficult, and an automatic summarization tool is needed. The application summarizes a text supplied as a file in PDF format. To determine how important each sentence in a document is, every sentence is weighted using the TF-IDF method. The Fuzzy C-Means method is then applied to the sentence weights to divide the sentences into two groups: important (High Sentence Weight) and unimportant (Low Sentence Weight).

Keyword—Automatic Text Summary Tool, Term Frequency–Inverse Document Frequency, Fuzzy C-means.

I. INTRODUCTION

Archived files such as text, pictures, and sounds each carry important information. Text documents are the most commonly used. University students mostly use them as the theoretical basis for the research in their final assignments. Technology journals are the ones they read most, and these are mostly written in English.

Data show that the English proficiency level of Indonesians is still low. Tribunnews.com reported that EF announced the results of its global survey, the English Proficiency Index (EF EPI), in Jakarta. Indonesia ranked 32nd of 72 surveyed countries with a score of 52.91. The survey showed Singapore as the Asian country with the highest English proficiency score, followed by Malaysia and the Philippines in the top 15. Moreover, Indonesia scored lower than some countries in the region, such as Vietnam, which ranked 31st ('middle level') [1].

The log-tf.idf weighting method is used because it is considered the best-performing weighting method in information retrieval. The weight of a term expresses how important the term is in representing a document. In log-tf.idf weighting, the weight increases as the term occurs more frequently in a document, but decreases as the term occurs more frequently in other documents [2].

Fuzzy c-means (FCM) clustering is a fuzzy grouping model in which each data point is a member of every class or cluster, with a membership level between 0 and 1. How strongly a data point belongs to a class or cluster is determined by its membership level [3]. Based on this background, the researchers intended to test whether the algorithm can be applied and combined with the TF-IDF algorithm in a text mining program focused on summarizing journal articles, so that the main points of the text can be extracted. These main points are the issue raised in the text, the methods or solutions used to address it, and the results of applying the solution to the problem. The aim is to let readers quickly understand the gist of the documents.

II. DOCUMENT CLUSTERING

A. Automatic Text Summary

Automatic text summarization is a field of research that began in 1958. Its main purpose is to condense text documents into a shorter form that does not exceed a given limit while preserving vital information and overall meaning [4].

Automatic summarization can be applied to a single document (single document summarization) or to several documents (multiple document summarization), and to a
single language (monolingual) or several languages (translingual/multilingual).

The output of automatic text summarization falls into two types. An extract is produced by selecting significant sentences from a document. An abstract is a newly written summary that substitutes for the original document [5]. Fig. 1 illustrates an automatic text summarization machine.

Fig. 1. Automatic Text Summarization Machine [5]

The machine consists of several modules that together produce the summary. Fig. 2 shows the structure of the modules in the automatic text summarization machine.

Fig. 2. Summarization Modules [5]

The soft clustering algorithm uses membership values (inspired by fuzzy set theory) to express the degree to which a document belongs to a particular cluster. The membership values lie in the interval [0,1]. Fuzzy grouping generates clusters but not partitions: a document in fuzzy clustering can belong to more than one cluster. The membership degree of a document in a particular cluster indicates how close the document is to that cluster.

Hard groupings can also be obtained from fuzzy groupings by thresholding the membership values: a document is treated as part of a cluster when its membership value is greater than a threshold (say 0.7). Fuzzy C-means is a popular flat clustering algorithm. It generally performs better than hard K-means, but it is sensitive to the chosen initial values and tends to converge to a local minimum [6].

B. Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF weighting is a statistical measure of how important a word is in a document. The importance increases when a word appears several times in the document, and is offset by how frequently the word occurs across all documents. TF-IDF can be formulated as follows [7]:

\[ \mathrm{TF\text{-}IDF}(t_k, d_j) = \mathrm{TF}(t_k, d_j) \cdot \mathrm{IDF}(t_k) \tag{2.1} \]

Notes:
$d_j$ = the j-th document
$t_k$ = the k-th term

Term frequency (TF) is calculated first; it is the frequency of occurrence of a term in each document. IDF is then calculated; it is the weight of a term derived from the number of documents in which the term appears. The more documents a term appears in, the smaller its IDF value. TF and IDF are calculated as follows [7]:

\[ \mathrm{TF}(t_k, d_j) = f(t_k, d_j) \tag{2.2} \]

Notes:
TF = term frequency
$f$ = occurrence frequency
$d_j$ = the j-th document
$t_k$ = the k-th term

The IDF value can be calculated by the following formulas [7]:

\[ \mathrm{IDF}(t_k) = \frac{1}{\mathrm{df}(t_k)} \tag{2.3} \]

or

\[ \mathrm{IDF}(t_k) = \frac{N}{\mathrm{df}(t_k)} \tag{2.4} \]

Notes:
IDF = term weight
$N$ = number of documents
df = document occurrence (the number of documents in which the term appears)
$d_j$ = the j-th document
$t_k$ = the k-th term

Equation (2.3) can be used only when there is a single document, while equation (2.4) is used for multiple documents. TF-IDF is a core part of text preprocessing and an efficient, simple method to implement on several kinds of text [8][9][10]. Two processes are used to refine the patterns discovered in text documents: pattern deploying and pattern evolving [11].
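As an illustration of Eqs. (2.1), (2.2) and (2.4), the following minimal Python sketch computes TF-IDF weights for a small tokenized corpus. The function name `tf_idf`, the toy documents, and the representation of each document as a list of tokens are assumptions made for this example; they are not taken from the application described in the paper.

```python
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumes TF is the raw term count (Eq. 2.2) and IDF = N / df (Eq. 2.4).
    """
    n_docs = len(documents)

    # df[t]: number of documents in which term t appears at least once
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # Eq. (2.2): occurrence frequency of each term
        # Eq. (2.1): TF-IDF = TF * IDF, with IDF from Eq. (2.4)
        weights.append({t: tf[t] * (n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus, already tokenized
docs = [
    ["fuzzy", "clustering", "text"],
    ["text", "summary", "text"],
]
weights = tf_idf(docs)
print(weights[0]["fuzzy"], weights[1]["text"])
```

In this toy corpus, "fuzzy" occurs in only one of the two documents and so receives a higher IDF than "text", which occurs in both.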
C. Fuzzy C-Means (FCM)

The fuzzy c-means method, developed by Dunn in 1973 and enhanced by Bezdek in 1981, is often used in pattern recognition [12]. Data points are placed into different groups according to the similarity of their values to the cluster centers. FCM works by minimizing an objective function. It is an iterative process that repeatedly updates the matrix of membership degrees and the cluster centers until it reaches a locally optimal classification [13].

Fuzzy c-means (FCM) grouping is based on fuzzy logic theory. In fuzzy theory, the membership of a data point is not assigned a crisp value of 1 (member) or 0 (not a member), but a membership degree in the range 0 to 1. The membership value of a data point in a set is 0 if it is not a member and 1 if it is a full member of the set. The higher the membership value, the higher the degree of membership, and vice versa [14].

The membership degrees of each data point $x_i$ always sum to 1:

\[ \sum_{j=1}^{k} u_{ij} = 1 \tag{2.5} \]

Each group $c_j$ contains at least one data point with a non-zero membership value, but does not have a membership degree of one for all data. This is expressed as

\[ 0 < \sum_{i=1}^{M} u_{ij} < M \tag{2.6} \]

As in fuzzy set theory, where a data point can be a member of several sets with a degree of membership in each, in FCM each data point can be a member of every group with membership degree $u_{ij}$.

The membership value of data point $x_i$ in group $c_j$ can be calculated by

\[ u_{ij} = \frac{D(x_i, c_j)^{-\frac{2}{w-1}}}{\sum_{l=1}^{k} D(x_i, c_l)^{-\frac{2}{w-1}}} \tag{2.7} \]

Parameter $c_j$ is the centroid of the j-th group, $k$ is the number of groups, and $D(\cdot)$ is the distance between a data point and a centroid. $w$ is the weighting exponent parameter introduced in FCM. There is no fixed value for $w$; it is usually greater than 1 and is commonly set to 2.

To calculate the centroid of group $c_l$ on feature $j$, use the following formula:

\[ c_{lj} = \frac{\sum_{i=1}^{M} (u_{il})^{w} \, x_{ij}}{\sum_{i=1}^{M} (u_{il})^{w}} \tag{2.8} \]

Parameter $M$ is the number of data points, $w$ is the weighting exponent, and $u_{il}$ is the membership degree of data point $x_i$ in group $c_l$. Meanwhile, the objective function used is

\[ J = \sum_{i=1}^{M} \sum_{j=1}^{k} (u_{ij})^{w} \, D(x_i, c_j)^{2} \tag{2.9} \]

Essentially, the FCM algorithm has many similarities with K-Means. The procedure of fuzzy c-means grouping is as follows [8]:
1. Determine the number of groups.
2. Select one of the data points as the centroid of each group.
3. Calculate the centroid values of each group.
4. Calculate the membership degree of each data point in each group.
5. Return to step 3 if the change in the membership degrees, the change in the centroid values, or the change in the value of the objective function is still above the specified threshold.

III. EXPERIMENTAL SETUP

When fuzzy c-means is applied to summarize a document, the first step is to extract the text from the document. After extraction, the next step is the preprocessing stage, which includes:
1. Case folding converts all letters in the document (sentences) to lowercase. In addition, characters other than letters are removed.
2. Tokenizing cuts each sentence into words (parsing), using a space as the delimiter, to generate word tokens.
3. Filtering removes stopwords (unimportant words) from the tokens produced by the tokenizing stage, such as "what", "who", "the", "when", and "where".
4. Stemming returns the words obtained from filtering to their base forms by removing prefixes and suffixes.
5. Analyzing calculates the term frequency of the document to determine the interconnection between words. This stage is also known as the weighting stage.

In the fifth stage, the TF-IDF method is applied. The application begins by counting the occurrences of each term (word) in each document. When the TF calculation is complete, the next step is to calculate the IDF with the formula $\mathrm{IDF}(t_k) = N / \mathrm{df}(t_k)$. After the TF and IDF values have been found, the TF-IDF value is calculated using the equation $\mathrm{TF\text{-}IDF}(t_k, d_j) = \mathrm{TF}(t_k, d_j) \cdot \mathrm{IDF}(t_k)$, that is, by multiplying the TF value by the IDF value. Once the TF-IDF values of each document have been found, the TF-IDF values in the column of each document are summed to obtain the sentence weight, and the sentence weights are sorted in descending order.

This research aims to find and classify the sentences of a document into two groups, namely important and unimportant sentence groups. Up to this stage, the weight of each sentence has been obtained, so the grouping process can be carried out.

\[ \text{f-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \cdot 100\% \]
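To make the grouping step more concrete, the sketch below applies the fuzzy c-means updates (membership degrees from inverse distances, centroids as membership-weighted means) to one-dimensional sentence weights with two groups. It is a minimal illustration under several assumptions that do not come from the paper: the function name `fuzzy_c_means`, the deterministic choice of initial centroids (evenly spaced among the sorted distinct values rather than randomly selected), stopping on the centroid change only, the 0.5 membership cut-off, and the sample weights are all hypothetical.

```python
def fuzzy_c_means(data, k=2, w=2.0, tol=1e-6, max_iter=100):
    """Cluster 1-D values with fuzzy c-means; returns (centroids, memberships)."""
    distinct = sorted(set(data))
    if k < 2 or len(distinct) < k or w <= 1.0:
        raise ValueError("need k >= 2, at least k distinct values, and w > 1")

    # Initial centroids: data points evenly spaced over the sorted values
    # (an assumption made here so that the sketch is deterministic).
    centroids = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]

    memberships = []
    for _ in range(max_iter):
        # Membership update: normalized inverse distances with exponent 2/(w-1)
        memberships = []
        for x in data:
            dists = [abs(x - c) for c in centroids]
            if 0.0 in dists:
                # The point coincides with a centroid: full membership there
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
            else:
                row = [d ** (-2.0 / (w - 1.0)) for d in dists]
            total = sum(row)
            memberships.append([v / total for v in row])

        # Centroid update: mean of the data weighted by membership ** w
        new_centroids = []
        for l in range(k):
            num = sum((u[l] ** w) * x for u, x in zip(memberships, data))
            den = sum(u[l] ** w for u in memberships)
            new_centroids.append(num / den)

        # Stop when the largest centroid change falls below the threshold
        shift = max(abs(a - b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, memberships


# Hypothetical sentence weights (e.g. summed TF-IDF values per sentence)
sentence_weights = [9.0, 8.5, 8.0, 1.0, 1.5, 2.0]
centroids, memberships = fuzzy_c_means(sentence_weights, k=2)

# Treat the group with the larger centroid as the "important" group
high = centroids.index(max(centroids))
important = [x for x, u in zip(sentence_weights, memberships) if u[high] > 0.5]
print(important)
```

Each row of `memberships` sums to 1, and thresholding the membership in the high-weight group turns the fuzzy grouping into the two hard groups (important and unimportant) that the summary needs.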