
Implementation of Fuzzy C-Means Algorithm and TF-IDF on English Journal Summary
Mohamad Irfan1, Jumadi2, Wildan Budiawan Zulfikar3, Erik4
1,2,3,4 Department of Informatics, UIN Sunan Gunung Djati Bandung
Jl. AH Nasution No. 105, Bandung, West Java, Indonesia
1 irfan.bahaf@uinsgd.ac.id, 2 jumadi@uinsgd.ac.id, 3 wildan.b@uinsgd.ac.id, 4 erikfb@gmail.com

Abstract— Text summarization is the process of distilling important information from various sources to produce a summary for personal users or for specific tasks. Summarization methods are usually taught in language lessons; unfortunately, the foreign-language proficiency of Indonesians, particularly in English, is very low. Summarizing English texts is therefore difficult, and an automatic summarization tool is needed. The application is able to summarize a text given an input file in PDF format. To determine how important the sentences in a document are, weighting every single sentence is the best approach. Weighting is performed with the TF-IDF method. The Fuzzy C-Means method is then used to divide the weighted sentences into two groups: important sentences (high sentence weight) and unimportant sentences (low sentence weight).

Keywords— Automatic Text Summary Tool, Term Frequency-Inverse Document Frequency, Fuzzy C-Means.

I. INTRODUCTION

Archives of files (text, pictures, sounds, etc.) each carry important information, and text documents are the most commonly used. University students mostly use them for final assignments as the theoretical basis of their research. Technology journals are the most widely read among them, and most are written in English.

Data show that the English proficiency level of Indonesians is still low. Tribunnews.com noted the following: EF announced the results of its global English Proficiency Index survey (EF EPI) in Jakarta. Indonesia is in 32nd position out of 72 surveyed countries, with a score of 52.91. The survey showed that Singapore is the Asian country with the highest English proficiency score, followed by Malaysia and the Philippines in the top 15. Moreover, Indonesia scored lower than some countries in the region, such as Vietnam, which is in 31st position at the 'middle level' [1].

The log-tf.idf weighting method is used because it is the best-performing method in information retrieval. The weight score of a term states how important that term is in representing a document. In log-tf.idf weighting, the weight increases as the frequency of occurrence of a term in a document increases, but decreases as the term occurs more frequently in other documents [2].

Fuzzy c-means (FCM) clustering is a fuzzy grouping model in which each data point is a member of all classes or clusters, with different membership levels between 0 and 1. The degree to which a data point belongs to a class or cluster is decided by its membership level [3]. Based on this background, the researchers intended to apply the algorithm to test whether it can be combined with the TF-IDF algorithm in a text mining program that focuses on journal summarization, so that the main points of a text can be extracted.

The main points referred to above are the issue raised in the text, the methods or solutions used to address the issue, and the results of applying the solution to the problem. The goal is to enable readers to quickly understand the gist of the documents.

II. DOCUMENT CLUSTERING

A. Automatic Text Summary

Automatic text summarization is a field of research that started in 1958. Its main purpose is to summarize text documents into shorter versions that do not exceed a given limit while maintaining the vital information and overall meaning [4].

Automatic summarization can be applied to one document (single-document summarization) or several documents (multi-document summarization), and to one language (monolingual) or several languages (translingual/multilingual).

The output of automatic text summarization is divided into two types. First, an extract is produced by choosing the most significant sentences of a document. Second, an abstract is produced as a substitute for the original document [5]. Fig. 1 is an illustration of an automatic text summarization machine.

Fig. 1. Automatic Text Summarization Machine [5]

The machine has several modules to perform the summarization completely. Fig. 2 shows the structure of the modules in the automatic text summarization machine.

Fig. 2. Summarization Modules [5]

The soft clustering algorithm uses membership values (inspired by fuzzy set theory) to express the degree of membership of a document in a particular cluster. The membership values lie in the interval [0, 1]. Fuzzy grouping generates clusters but not partitions: in fuzzy clustering a document can belong to more than one cluster, and the membership degree of a document in a particular cluster determines how close the document is to that cluster.

Hard groupings can also be obtained from fuzzy groupings by thresholding the membership values: a document is made part of a cluster when its membership value is greater than the threshold value (say 0.7). Fuzzy C-means is a popular flat clustering algorithm. It performs relatively better than hard K-means, although it is sensitive to the selected initial values and tends to converge to a local minimum [6].

B. Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF weighting is a statistical measure of how important a word is in a document. The importance increases when a word appears several times in the document, offset by the frequency of occurrence of the word across all documents. TF-IDF can be formulated as follows [7]:

TF-IDF(tk, dj) = TF(tk, dj) * IDF(tk)    (2.1)

Notes:
dj = the j-th document
tk = the k-th term

Term frequency (TF) is calculated first; it is the frequency of occurrence of a term in each document. IDF is then calculated; it is the weight value of a term computed from the number of documents in which the term appears. The more often a term appears across several documents, the smaller its IDF value. TF is calculated as follows [7]:

TF(tk, dj) = f(tk, dj)    (2.2)

Notes:
TF = term frequency
f = occurrence frequency
dj = the j-th document
tk = the k-th term

The IDF value can be calculated by the following formulas [7]:

IDF(tk) = N / df(tk)    (2.3)

or

IDF(tk) = log(N / df(tk))    (2.4)

Notes:
IDF = term weight
N = number of documents
df = document frequency (number of documents containing the term)
dj = the j-th document
tk = the k-th term

Equation (2.3) can only be used if there is a single document, while equation (2.4) is used for multiple documents. TF-IDF is a core part of text preprocessing and an efficient, simple method to implement on several kinds of text [8][9][10]. There are two processes for refining pattern discovery in text documents: pattern deploying and pattern evolving [11].
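As a concrete illustration, equations (2.1), (2.2), and (2.4) can be sketched in a few lines of Python. The function names and the toy corpus are illustrative assumptions, not part of the paper; in this research the "documents" being weighted are the individual sentences of the journal text.

```python
import math

def tf(term, doc_tokens):
    # TF(tk, dj): raw frequency of term tk in document dj (equation 2.2)
    return doc_tokens.count(term)

def idf(term, all_docs):
    # IDF(tk) = log(N / df(tk)) for a multi-document corpus (equation 2.4);
    # df(tk) is the number of documents that contain the term
    n = len(all_docs)
    df = sum(1 for d in all_docs if term in d)
    return math.log(n / df) if df else 0.0

def tf_idf(term, doc_tokens, all_docs):
    # TF-IDF(tk, dj) = TF(tk, dj) * IDF(tk) (equation 2.1)
    return tf(term, doc_tokens) * idf(term, all_docs)

# toy corpus: three tokenized "documents" (sentences)
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "cats and dogs".split(),
]
print(tf_idf("cat", docs[0], docs))
```

A term such as "the", which occurs in several documents, receives a low IDF and hence a low TF-IDF, while a term confined to one document is weighted highly, matching the behaviour described above.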
C. Fuzzy C-Means (FCM)

The fuzzy c-means method, developed by Dunn in 1973 and enhanced by Bezdek in 1981, is often used in pattern recognition [12]. Data points are placed into different groups by considering the similarity of the data values to the cluster centers. FCM's job is to minimize an objective function. It is an iterative process that repeatedly updates the matrix of membership degrees and the cluster centers until it reaches a locally optimal classification [13].

Fuzzy c-means (FCM) grouping is based on fuzzy logic theory. In fuzzy theory, the membership of data is not assigned a firm value of 1 (a member) or 0 (not a member), but a membership degree value in the range 0 to 1. The membership value of a data point in a set is 0 if it is not a member and 1 if it is a full member. The higher the membership value, the higher the degree of membership, and vice versa [14].

The sum of the membership degrees of each data point xi is always 1, which can be expressed as:

sum(j=1..k) uij = 1    (2.5)

Each group cj contains at least one data point with a non-zero membership value, but no group holds a membership degree of one for all data points. The formula is:

0 < sum(i=1..m) uij < m    (2.6)

As in fuzzy set theory, where data can be a member of several sets as expressed by its degree of membership in each set, in FCM each data point can be a member of every group with membership degree uij.

The membership value of data xi in group vj can be calculated by:

uij = 1 / sum(l=1..k) [ D(xi, cj) / D(xi, cl) ]^(2/(w-1))    (2.7)

Parameter cj is the centroid of the j-th group and D() is the distance between a data point and a centroid. w is the weighting exponent parameter introduced in FCM. There is no precise rule for its value; w is usually greater than 1 and is generally set to 2.

To calculate the centroid of group ci in feature j, the following formula is used:

cij = sum(l=1..M) (uil)^w * xlj / sum(l=1..M) (uil)^w    (2.8)

Parameter M is the amount of data, w is the weighting exponent, and uil is the membership degree value of data xl in group ci. Meanwhile, the objective function used is:

J = sum(i=1..M) sum(j=1..k) (uij)^w * D(xi, cj)    (2.9)

Essentially the FCM algorithm has many similarities with K-Means. The procedure of fuzzy c-means grouping is as follows [8]:
1. Determine the number of groups.
2. Select one of the data points as a centroid for each group.
3. Calculate the centroid values of each group.
4. Calculate the degree of membership value of each data point in each group.
5. Return to step 3 if the change in membership degree value, the change in centroid value, or the change in the value of the objective function is still above the specified threshold value.

III. EXPERIMENTAL SETUP

When fuzzy c-means is applied to summarize a document, the first step is the extraction of the text from the document. After the extraction is done, the next step is the preprocessing stage. This stage includes:
1. Case folding: converting all letters in the document (sentences) into lowercase. In addition, characters other than letters are omitted.
2. Tokenizing: cutting each sentence into words, parsing with a space as the delimiter to generate word tokens.
3. Filtering: removing the unimportant stopwords obtained from the tokenizing stage, such as "what", "who", "the", "when", and "where".
4. Stemming: returning the words obtained from filtering to their basic forms by eliminating prefixes and suffixes.
5. Analyzing: calculating the term frequencies of the document to determine the interconnection between words. This stage is also known as the weighting stage.

In the fifth stage, the TF-IDF method is applied. The application begins with counting the appearances of each term (word) in each document. When the TF calculation is complete, the next step is to calculate the IDF with the formula IDF(tk) = log(N / df(tk)). After the values of TF and IDF have been found, the TF-IDF value is calculated using the equation TF-IDF(tk, dj) = TF(tk, dj) * IDF(tk), i.e., simply multiplying the value of TF by the value of IDF. Once the TF-IDF value of each document is found, the TF-IDF values in the column of each document are accumulated to obtain the sentence weights, which are then sorted in descending order.

This research aims to find and classify the sentences of a document into two groups, namely important and unimportant sentence groups. Up to this stage, the weight of each sentence has been obtained, so the grouping process can be done.
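The preprocessing stages described above can be sketched as a small Python pipeline. This is a minimal sketch under stated assumptions: the stopword list is a tiny illustrative sample, and the suffix-stripping rule is a crude stand-in for a real English stemmer, which the paper does not name.

```python
import re

# illustrative stopword sample; a real system would use a full list
STOPWORDS = {"what", "who", "the", "when", "where", "a", "an", "is", "of"}

def case_fold(sentence):
    # stage 1: lowercase and drop characters other than letters
    return re.sub(r"[^a-z\s]", " ", sentence.lower())

def tokenize(sentence):
    # stage 2: split on whitespace into word tokens
    return sentence.split()

def filter_tokens(tokens):
    # stage 3: remove stopwords
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # stage 4: crude suffix stripping, standing in for a real stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(sentence):
    # full pipeline: case folding -> tokenizing -> filtering -> stemming
    return [stem(t) for t in filter_tokens(tokenize(case_fold(sentence)))]

print(preprocess("The Summarized documents, when filtered?"))
```

The resulting token lists then feed the analyzing (weighting) stage, where TF-IDF is computed per sentence.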

The process of this sentence grouping is the step where the fuzzy c-means method is applied. The steps in the process of sentence-weight grouping are as follows:
1. Determine the number of groups. In this case we determine two groups, namely the important and unimportant sentence groups.
2. Select one of the data points as a centroid for each group. At this stage the data is selected randomly, because in each iteration the group center is refined until the center value of the group no longer changes.
3. Calculate the centroid values of each group.
4. Calculate the degree of membership value of each data point in each group. At this stage each data point gravitates toward a group center according to the membership degree that has been determined.
5. Return to step 3 if the change in membership degree value, the change in centroid value, or the change in the value of the objective function is still above the specified threshold value. The value of w is 2.

Table 3. Accuracy
No   Recall (%)   Precision (%)   f-measure (%)
1    27           21              23
2    46           20              27.67
3    62           45              52.14
4    65           30              41.05
5    39           44              41.34

Notes [16]:
Recall measures the retrieval of documents relevant to the statement (query) entered by the user into the system.
Precision is the number of relevant documents out of the total number of documents found by the system.
f-measure is the harmonic mean of recall and precision.
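To make the grouping concrete, the procedure above can be sketched in Python on hypothetical sentence weights. One deliberate deviation from step 2: the initial centroids are fixed to the smallest and largest weights instead of a random pick, so the example is deterministic.

```python
def fcm(weights, k=2, w=2.0, tol=1e-4, max_iter=100):
    """Fuzzy c-means on 1-D sentence weights, following steps 1-5 above."""
    # step 2 (deterministic variant): extremes of the data as initial centroids
    centroids = [min(weights), max(weights)]
    u = [[0.0] * k for _ in weights]
    for _ in range(max_iter):
        # step 4, eq. (2.7): u_ij = 1 / sum_l (D(xi,cj)/D(xi,cl))^(2/(w-1))
        for i, x in enumerate(weights):
            d = [abs(x - c) for c in centroids]
            if 0.0 in d:
                # point coincides with a centroid: hard membership
                u[i] = [1.0 if dist == 0.0 else 0.0 for dist in d]
                continue
            for j in range(k):
                u[i][j] = 1.0 / sum((d[j] / d[l]) ** (2.0 / (w - 1.0)) for l in range(k))
        # step 3, eq. (2.8): c_j = sum_i (u_ij)^w * x_i / sum_i (u_ij)^w
        new_centroids = []
        for j in range(k):
            den = sum(u[i][j] ** w for i in range(len(weights)))
            num = sum((u[i][j] ** w) * x for i, x in enumerate(weights))
            new_centroids.append(num / den)
        # step 5: stop once the centroid shift falls below the threshold
        shift = max(abs(a - b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, u

# hypothetical sentence weights: a clearly low group and a clearly high group
weights = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
centroids, u = fcm(weights)
labels = [max(range(2), key=lambda j: u[i][j]) for i in range(len(weights))]
print(centroids, labels)
```

Sentences whose highest membership falls in the high-centroid group are kept as the important sentences of the summary; the rest are discarded.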
IV. RESULT

Table 2 and Table 3 show the results of the test and the accuracy of the summarization system compared to the results of human manual summaries. The test results of the application are as follows.

Table 2. Summarization Result of System with Manual Summarization Result
No.  Number of sentences of      Number of sentences of      Number of the same
     manual summarization (A)    summarization system (B)    sentences/Similarity (A∩B)
1    66                          51                          14
2    110                         49                          23
3    140                         101                         63
4    134                         63                          41
5    72                          82                          32

The sentences of the system summary are then matched against those of the man-made summary to determine the accuracy of the summary. Recall, precision, and f-measure are calculated by the following equations [15]:

recall = correct / (correct + missed)
precision = correct / (correct + wrong)
f-measure = 2 * recall * precision / (recall + precision) * 100%

V. CONCLUSION

After comparing the results of manual summaries with system summaries, the average values for recall, precision, and f-measure are 47.8%, 32%, and 37.04%, respectively. At the evaluation stage, the highest recall value is 65%, with corresponding precision and f-measure of 30% and 41.05%, respectively. Meanwhile, the lowest recall value is 27%, with precision 21% and f-measure 23%. These values indicate that the resulting summary is not effective enough to be accurate or feasible to use. This may be caused by a summarization process that relies only on sentence weights, which depend heavily on the appearance of terms in the document. In subsequent work, the sentence extraction process needs to be optimized to improve accuracy and performance. In addition, the handling of source documents needs to be developed so that various file types can be processed in full, in the sense of not being limited by the number of characters. Special handling is also needed for article content in the form of source code, tables, or URLs.

REFERENCE

[1] “Kemampuan Bahasa Inggris Masyarakat Indonesia Masih Rendah,” 2016.
[2] D. Nugraha, P. Putra, and A. Suharsono, “Peringkas Dokumen Tunggal Berbahasa Indonesia Menggunakan Metode Sentences Clustering dan Frequent Term,” pp. 1–10, 2008.
[3] I. M. B. Adnyana, “Implementasi Algoritma Fuzzy C Means Dan Statistical Region Merging Pada Segmentasi Citra,” pp. 9–10, 2015.
[4] D. Cao and L. Xu, “Analysis of Complex Network Methods for Extractive Automatic Text Summarization,” pp. 2749–2756, 2016.
[5] B. Susanto, “Text Summarization,” pp. 1–12.
[6] V. K. Singh, “Document Clustering using K-means, Heuristic K-means and Fuzzy C-means,” 2011.
[7] B. Dwijawisnu and A. Hetami, “Perancangan Information Retrieval (IR) untuk Pencarian Ide Pokok Teks Artikel Berbahasa Inggris dengan Pembobotan Vector Space Model,” Jurnal Ilmiah Teknologi Informasi Asia, vol. 9, no. 1, 2015.
[8] P. Bafna, D. Pramod, and A. Vaidya, “Document clustering: TF-IDF approach,” in 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), 2016, pp. 61–66.
[9] B. A. Kuncoro and B. H. Iswanto, “TF-IDF method in ranking keywords of Instagram users’ image captions,” in 2015 International Conference on Information Technology Systems and Innovation (ICITSI), 2015, pp. 1–5.
[10] L.-P. Jing, H.-K. Huang, and H.-B. Shi, “Improved feature selection approach TFIDF in text mining,” in Proceedings. International Conference on Machine Learning and Cybernetics, 2002, vol. 2, pp. 944–946.
[11] N. Zhong, Y. Li, and S.-T. Wu, “Effective Pattern Discovery for Text Mining,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 1, pp. 30–44, Jan. 2012.
[12] R. Bharathi, “A distributed, scalable parallelization of Fuzzy c-Means Algorithm,” 2016.
[13] D. Zhu, Y. Li, and C. Zhang, “Automatic Time Picking for Microseismic Data Based on a Fuzzy C-Means Clustering Algorithm,” no. 1, pp. 1–5, 2016.
[14] E. Prasetyo, Data Mining Konsep dan Aplikasi Menggunakan MATLAB. Yogyakarta: ANDI, 2012.
[15] A. Ridok, “Peringkasan Dokumen Bahasa Indonesia Berbasis Non-Negative Matrix Factorization (NMF),” vol. 1, no. 1, pp. 39–44, 2014.
[16] M. A. Setiawan, “Pengertian Recall, Precision, F-Measure,” 2013.
