Review Article
ABSTRACT
In research problems associated with text mining and classification, many factors have to be considered when deciding on what basis the classification should be done. These factor variables are termed features. The difficulty of visualizing the training data depends directly on the number of features, and most of the time the features are found to be highly correlated and redundant. Dimensionality reduction helps to reduce the number of these features by accumulating a group of principal variables. In previous work, an automated feature extraction technique using weighted TF-IDF was proposed. Although that method performed well, it had the drawback that some of the generated features were correlated with one another, which led to high dimensionality and therefore greater time complexity and memory usage. This paper proposes an automatic text summarization method that uses the weighted TF-IDF model and k-means clustering to reduce the dimensionality of the extracted features. Various similarity measures are used to identify the similarity between the sentences of a document, and the sentences are then grouped into clusters on the basis of the term frequency and inverse document frequency (TF-IDF) values of their words. The experiments were carried out on student text data from the US educational data hub, and the results were compared with other dimensionality reduction methods in terms of co-selection, content-based, weight-based and term significance parameters. The proposed method was found to be efficient in terms of memory usage and time complexity.
Keywords: Text Mining, Classification, Dimension Reduction, Text Summarization, Weighted TF-IDF, K-Means Clustering.
© 2019 by Advance Scientific Research. This is an open-access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
DOI: http://dx.doi.org/10.22159/jcr.07.01.24
weighted TF-IDF and k-means clustering, which is based on centroids. The rest of the paper is organized as follows: Section II highlights the important research works carried out in the domain, and Section III describes the problem of high dimensionality. Section IV depicts the proposed methodology, and Section V gives the details of the experimental environment. Section VI discusses the results obtained in comparison with other methods, and the paper ends with the conclusion in Section VII.

RELATED WORKS

From the available literature, text summarization methods are broadly classified based on the number of documents and on the techniques deployed. The following sections highlight the research carried out in these categories.

Single Document Text Summarization
In this case, only one document is taken as input and summarization is performed to produce a single output [10][11]. Thomas et al. [12] proposed a system for automatic keyword generation for the summarization of a single e-newspaper document. Marcus et al. [13] developed a discourse-dependent summarization method which predicts the adequacy of the summary text based on discourse in a single e-newspaper.

Multiple Document Text Summarization
When multiple documents have to be summarized, the input is taken in the form of many documents and a single summary document is delivered as output [14][15]. Mirroshandal et al. [16] presented two different algorithms for extracting the temporal relations between keywords and for summarization of text from a multi-document input. The first algorithm was a weakly supervised ML approach for classifying the temporal relationships between events, and the second was an unsupervised EM-based technique for the same temporal relationship extraction. Min. Z et al. [17] made use of the information common to a document set belonging to a particular category to improve the quality of automatic summary generation over multiple documents. For the same purpose, the SVM classifier is used to distinguish negative events from other events.

Abstractive Text Summarization
In this method, the machine is intended to comprehend all the documents like a human, and the summary is delivered with key sentences [24]. Linguistic methods are used to examine and interpret the text, finding fresh concepts and expressions with the closest meaning and generating a shorter version of the text that conveys the main and important information from the source text [25]. Brandow et al. [26] introduced an abstractive system in which the analysis is done through statistical text data and the key signatures are extracted from the corpus; weights are then assigned to all such words. Based on the extracted signatures, weights are assigned to the sentences, and the top-weighted sentences are selected for the summary. Daume et al. [27] proposed an abstractive summarization mechanism which maps every document into a kind of database representation. Classification is also done into four important categories: one-person, one-event, multiple-event and natural calamity. The system also generates a small headline using a set of pre-defined templates, and summaries are finally generated from the databases.

PROPOSED METHODOLOGY

The proposed method is based on unsupervised learning and is accomplished in three steps: (i) preprocessing of the generated features, (ii) calculation of sentence scores, and (iii) centroid-based k-means clustering of the sentences to extract only the important sentences as part of the summary. The proposed architecture is depicted in Figure 3.1.

[Figure 3.1: Architecture of the proposed method, taking the documents as input]
Figure 3.2 shows the code snippet used to remove the features that have a high correlation; the threshold limit is fixed at 0.5 in our research.

Figure 3.2: Code Snippet for Removal of Features with High Correlation
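The snippet in Figure 3.2 is not reproduced in this extract. Below is a minimal sketch of such a correlation filter, assuming the extracted features sit in a pandas DataFrame; the function and variable names are illustrative, and only the 0.5 threshold comes from the text.

import numpy as np
import pandas as pd

def drop_correlated_features(features, threshold=0.5):
    # Absolute pairwise correlation between feature columns
    corr = features.corr().abs()
    # Look at each pair only once via the upper triangle of the matrix
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    # Drop every column whose correlation with an earlier column exceeds the threshold
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Toy usage: f2 is perfectly correlated with f1 and gets removed
toy = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8], "f3": [4, 1, 3, 2]})
print(drop_correlated_features(toy, threshold=0.5).columns.tolist())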
B. Recursive Feature Elimination
Recursive feature elimination works by removing the least important features and continues recursively until the specified number of features is reached. Recursive elimination can be used with any model that assigns weights to features, either through coef_ or feature_importances_. Here we use a Random Forest to select the 100 best features. Figure 3.3 shows the code snippet for the removal of features through the recursive method.

Figure 3.3: Code Snippet for Recursive Feature Elimination
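The snippet in Figure 3.3 is likewise not reproduced here. The following is a minimal scikit-learn sketch of recursive feature elimination driven by a Random Forest, as described above; the synthetic data and all parameters other than the 100 retained features are assumptions made for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the extracted text features
X, y = make_classification(n_samples=200, n_features=300, n_informative=50, random_state=42)

# RFE repeatedly drops the least important features (via feature_importances_)
# until only the 100 best remain, as stated in the text.
selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42),
               n_features_to_select=100, step=10)
selector.fit(X, y)

selected_columns = selector.get_support(indices=True)
print(f"{len(selected_columns)} features retained")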
Sentence Score Calculation
Every sentence is given an importance score, which reflects the measure of goodness of that sentence. These scores can be used to order the sentences and to pick out the most important ones: the probability that a sentence will appear in the resulting summary is directly proportional to its importance score. Every sentence is represented by a set of features, and its score is calculated as a weighted sum of the independent feature values. The features we have used are:

A. TF-IDF Sum: The goodness of a sentence is normally represented by the importance of its words. We use TF-IDF, a powerful heuristic that is well suited to ranking words by their importance. This feature is determined by the sum of the TF-IDF scores of every individual word in the sentence.

B. TF: Term frequency measures how frequently a term occurs in a particular document. Since documents are not of uniform length, a term may appear many more times in a long document than in a short one, so we normalize by dividing the term frequency by the length of the document:

$TF(t) = \dfrac{\text{Number of times term } t \text{ appears in the document}}{\text{Total number of terms in the document}}$   (1)

C. IDF: The inverse document frequency is a measure of the importance of a term. When computing TF, all terms are considered equally important, yet prepositions and conjunctions that occur many times are of least importance. Weighing down the frequent terms and scaling up the rare ones is therefore done by calculating

$IDF(t) = \log_e \dfrac{\text{Total number of documents}}{\text{Number of documents containing term } t}$   (2)

D. TF-IDF Weighting: The definitions of term frequency and IDF are combined to generate a composite weight for every term in each document. The TF-IDF weight assigned to a term in a document is calculated as

$TF\text{-}IDF = tf \times idf$   (3)

Consequently, the TF-IDF weight assigned to a term in a document is
1. highest when the term occurs many times within a small number of documents;
2. lower when the term occurs few times in a document, or occurs in many documents;
3. lowest when the term occurs in virtually all the documents.

Sentence Length: This is calculated as the number of words present in the particular sentence; longer sentences tend to contain more information about the document. This feature is calculated using the cal_score("Corpus") predefined function.
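The extract does not show how these features are computed (only the predefined cal_score("Corpus") helper is mentioned). A small self-contained sketch of equations (1)-(3), the TF-IDF Sum feature and the sentence-length feature is given below; the toy corpus and all names are illustrative, not the authors' code.

import math

# Treat each tokenised sentence as a "document" for the purposes of eqs. (1)-(2)
sentences = [
    "students performed well in the mathematics test".split(),
    "the test results of the students were analysed".split(),
    "mathematics results improved after the intervention".split(),
]

def tf(term, sentence):
    # eq. (1): occurrences of the term divided by the total number of terms
    return sentence.count(term) / len(sentence)

def idf(term, corpus):
    # eq. (2): log_e of the total document count over documents containing the term
    containing = sum(1 for s in corpus if term in s)
    return math.log(len(corpus) / containing) if containing else 0.0

def sentence_features(sentence, corpus):
    # TF-IDF Sum feature (A) plus the sentence-length feature
    tfidf_sum = sum(tf(t, sentence) * idf(t, corpus) for t in set(sentence))
    return {"tfidf_sum": tfidf_sum, "length": len(sentence)}

for s in sentences:
    print(sentence_features(s, sentences))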
k-Means Centroid Based Clustering
The next and important phase is clustering. K-means clustering is an unsupervised learning method that provides many clustering solutions; it classifies the given data into a number of clusters fixed a priori. The intention of the proposed method is to define k centroids, one for every cluster. The centroids are chosen so that they are placed as far as possible from one another. The next step is to take every data point and associate it with the nearest centroid. When all the points have been assigned, the k centroids are recalculated as the new centres of the clusters that resulted from the prior step. After obtaining the k new centroids, a new association is generated between the data points and their nearest centroids, and the locations of the k centroids change at each step until there are no more changes. Although the algorithm always terminates, it does not necessarily find the optimum configuration corresponding to the global objective function; the algorithm can be run many times to reduce this effect. The problem is hard in terms of computational complexity, but it is generally solved using heuristic techniques that converge quickly to a local optimum. The objective function is defined as

$K = \sum_{k=1}^{m} \sum_{n=1}^{p} \left\| x_n^{(k)} - C_k \right\|^2$   (4)

where $\left\| x_n^{(k)} - C_k \right\|$ is the chosen distance measure between a data point $x_n^{(k)}$ and the cluster centre $C_k$, and the double sum runs over the distances of the n data points from their respective cluster centres. This method is intended to produce a separation of the objects into small groups for which this metric is minimized. The algorithm is shown below.

Step 1: Start
Step 2: Fix K points in the space to represent the initial group of centroids
Step 3: IF all objects have been assigned, recalculate the positions of the K centroids; ELSE go to Step 2
Step 4: Repeat Steps 2 and 3 until centroid movement = 0
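As a concrete illustration of Steps 1-4 and of the objective in equation (4), here is a minimal one-dimensional k-means sketch over sentence scores. It is our own reconstruction under the assumptions stated in the comments, not the authors' implementation, and the sample scores are made up.

import random

def kmeans_1d(scores, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(scores, k)                 # Step 2: k initial centroids
    for _ in range(max_iter):                         # Step 4: repeat Steps 2-3
        clusters = [[] for _ in range(k)]
        for s in scores:                              # Step 3: assign to nearest centroid
            nearest = min(range(k), key=lambda i: abs(s - centroids[i]))
            clusters[nearest].append(s)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                # stop when centroid movement = 0
            break
        centroids = new_centroids
    # Objective of eq. (4): summed squared distance of each point to its centroid
    objective = sum(min((s - c) ** 2 for c in centroids) for s in scores)
    return centroids, clusters, objective

sample_scores = [0.12, 0.15, 0.31, 0.33, 0.35, 0.78, 0.80]   # made-up sentence scores
print(kmeans_1d(sample_scores, k=3))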
The primary motive is to divide the complete document into sentences. Every sentence is then considered as a point in a Cartesian plane. All the sentences are thereafter broken into individual tokens, and the TF-IDF score is calculated for every token in a sentence:

$tf_t = \dfrac{F(t,d)}{F(d)}$   (5)

where t represents a token, d is the document and F(t,d) denotes the frequency of the term t in the document d. The inverse document frequency is calculated as

$idf_t = \log_{10} \dfrac{N}{f(t,d)}$   (6)

where N gives the total count of the sentences in the document. The $tf{:}idf_t$ score ($tf{:}idf_t = tf_t \times idf_t$) for each sentence is calculated by summing the TF-IDF scores of each token in the sentence and normalizing by the length of the sentence, as discussed earlier. The calculation is done as

$Score(X) = \dfrac{1}{|X|} \sum_{t} \left( tf{:}idf_t \right)$   (7)

where X is a sentence in the document, t represents a term in X and |X| is the length of the sentence X.
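A short sketch of equations (5)-(7) follows. It reads f(t,d) in equation (6) as the number of sentences containing token t and N as the total number of sentences, which is an interpretation of the surrounding text rather than something the extract states explicitly; the toy sentences and names are illustrative.

import math

sentences = [
    "the students scored high marks in the test".split(),
    "the test was conducted across several schools".split(),
    "marks improved after the new teaching method".split(),
]
N = len(sentences)   # total number of sentences in the "document"

def score(sentence):
    # eq. (7): length-normalised sum of tf*idf over the tokens of the sentence
    total = 0.0
    for t in sentence:
        tf_t = sentence.count(t) / len(sentence)        # eq. (5)
        containing = sum(1 for s in sentences if t in s)
        idf_t = math.log10(N / containing)              # eq. (6)
        total += tf_t * idf_t
    return total / len(sentence)

for s in sentences:
    print(round(score(s), 4), " ".join(s))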
These sentence scores are used to represent the sentences as unique co-ordinates on a single-dimensional Cartesian plane, and they are also used as the input to the clustering algorithm. The clustering produces K cluster centres, and each sentence is then assigned to a cluster depending on its computed score. Finally, we pick the cluster that contains the maximum number of sentences, and the summary is generated by reproducing those sentences in the same order as they appear in the source document. This approach gives a concise summary because the densest cluster returned by the clustering algorithm contains the sentences with the highest scores in the complete document. Since the score of a sentence is computed by summing the TF-IDF scores of its individual words and normalizing by the sentence length, the sentences present in the densest cluster are the ones that are contextually most similar to the abstract of the document. The length of the summary changes with the length of the document. When a large value of K is chosen, the clusters will be sparse, resulting in an incoherent summary; on the other hand, if a very low value of k is chosen, the clusters will be very dense and the summary will not be concise. Thus, the choice of the k value has to be made dynamically based on the length of the document. After a number of experimental simulations with multiple values of k, the following formulation was arrived at to choose the best value of K: if the number of sentences N is less than or equal to 20, choose k as N-4; otherwise, choose K as N-20.
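Putting these last steps together, the sketch below applies the stated rule for k (N-4 when the sentence count N is at most 20, otherwise N-20), clusters the one-dimensional sentence scores with scikit-learn's KMeans, keeps the densest cluster and emits its sentences in their original order. It is a reconstruction under those assumptions, not the authors' code.

import numpy as np
from sklearn.cluster import KMeans

def choose_k(n_sentences):
    # Rule from the text: k = N - 4 when N <= 20, otherwise k = N - 20
    return n_sentences - 4 if n_sentences <= 20 else n_sentences - 20

def summarize(sentences, scores):
    k = max(1, choose_k(len(sentences)))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        np.array(scores).reshape(-1, 1))
    densest = np.bincount(labels).argmax()              # cluster with the most sentences
    # Reproduce the selected sentences in their original document order
    return [s for s, lab in zip(sentences, labels) if lab == densest]

# Toy example: six sentences with made-up importance scores
sentences = ["s1", "s2", "s3", "s4", "s5", "s6"]
scores = [0.10, 0.82, 0.79, 0.15, 0.80, 0.12]
print(summarize(sentences, scores))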
Figure 5.1: Proposed Methodology

Table 5.1 shows the comparative analysis of the proposed methodology in terms of co-selection. In the table, RU stands for relative utility.

Table 5.1: Comparison based on Co-Selection Methods (Evaluation Metrics – Co-Selection)
Methods            Precision  Recall  F-Measure  RU
SWING              0.84       0.82    0.81       0.80
AUTO-SUMMARIZER    0.89       0.87    0.84       0.79
MULTINODAL-RNN     0.76       0.75    0.74       0.72
WVE                0.91       0.90    0.89       0.91
GRAPH ENTROPY      0.94       0.91    0.89       0.87
TF-IDF:CBC         0.97       0.94    0.98       0.93

Fig 5.2 gives the graphical comparison of the co-selection based metrics of the different methods. It is seen from the graph that the proposed methodology has a high precision of 0.97, a high recall of 0.94 and an F-measure of 0.98. The relative utility (RU) of the proposed TF-IDF:CBC, at 0.93, is also much higher than that of the other methods.

Fig 5.3 gives the graphical comparison of the content based metrics of the different methods. It is seen from the graph that the proposed methodology has a cosine similarity of 0.98, a high unit overlap value of 0.94 and a high longest common subsequence value of 0.94, which proves that the proposed TF-IDF:CBC outperforms the other available methods.

Fig 5.4 gives the graphical comparison of the content based metrics of the different methods. It is seen from the graph that the proposed methodology has an N-Granularity of 0.98, a high Pyramid value of 0.96 and a high LSA measure value of 0.94, which proves that the proposed TF-IDF:CBC outperforms the other available methods.

Table 5.4 shows the comparative analysis of the weight based dimension reduction evaluation parameters. In the table, F denotes frequency, B denotes binary and A denotes augmented.

Table 5.4: Comparison of Weight Based Methods (Evaluation Metrics – Weight Based)
Methods      F-Weight  B-Weight  A-Weight  Entropy
NB-SVM       0.78      0.80      0.87      0.34
Hybrid SVM   0.79      0.81      0.82      0.41
GA-IG        0.78      0.76      0.79      0.62
SWN-SG       0.61      0.63      0.68      0.54
TF-IDF:CBC   0.99      0.98      0.96      0.21

Fig 5.5 gives the graphical comparison of the weight based metrics of the different methods. It is seen from the graph that the proposed methodology has a frequency weight of 0.99, a high binary weight of 0.98 and a high augmented weight of 0.96. The proposed method also produces the lowest entropy value, 0.21, which indicates its better accuracy. The overall comparison shows that the proposed TF-IDF:CBC outperforms the other available methods.

Table 5.5 shows the comparative term significance values obtained from the various methods.

Fig 5.6 gives the graphical comparison of the term significance of the different methods. It is seen from the graph that the proposed methodology has a high term significance of 98.08, a measure of similarity that is directly proportional to dimension reduction, which proves that the proposed TF-IDF:CBC outperforms the other available methods.