Journal of Critical Reviews

ISSN- 2394-5125 Vol 7, Issue 1, 2020

Review Article

AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA
JEBAMALAI ROBINSON1*, Dr. V. SARAVANAN2
1*Research Scholar, Bharathiar University, Coimbatore. jebamalai.robinson@gmail.com
2Dean, Computer Studies, Dr. SNS Rajalakshmi College of Arts and Science, Coimbatore.

Received: 05.11.2019 Revised: 12.12.2019 Accepted: 13.12.2019

ABSTRACT
In research problems associated with text mining and classification, many factors must be considered to decide the basis on which classification is performed. These factor variables are termed features. The difficulty of visualizing the training data grows directly with the number of features, and the features are often found to be highly correlated and redundant. Dimensionality reduction reduces the number of features in the task by deriving a set of principal variables. In previous work, an automated feature extraction technique using weighted TF-IDF was proposed. Although that method performed well, some of the generated features were correlated with one another, which led to high dimensionality and, in turn, higher time complexity and memory usage. This paper proposes an automatic text summarization method that uses the weighted TF-IDF model and K-means clustering to reduce the dimensionality of the extracted features. Several similarity measures are used to identify the similarity between the sentences of a document, and the sentences are then grouped into clusters on the basis of the term frequency-inverse document frequency (TF-IDF) values of their words. The experiments were carried out on student text data from the US educational data hub, and the results were compared with other dimensionality reduction methods in terms of co-selection, content-based, weight-based and term-significance parameters. The proposed method is found to be efficient in terms of memory usage and time complexity.
Keywords: Text Mining, Classification, Dimension Reduction, Text Summarization, Weighted TF-IDF, K-Means Clustering.

© 2019 by Advance Scientific Research. This is an open-access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
DOI: http://dx.doi.org/10.22159/jcr.07.01.24

INTRODUCTION

In the digital era, data has exploded and is circulated at an unimaginable speed. These data are mostly in unstructured formats, particularly text. There is an urgent need for text summarization tools that reduce the dimensions of such data and allow people to get the necessary information with ease [1]. At present, access to large information sources is easier and quicker than ever; however, much of the retrieved information is redundant or inappropriate and does not convey the intended information. For instance, digging through the documents of a university or a school to get the required information about a student is very time consuming.

Automatic text summarization solves the problem of extracting only the useful information from large texts, leaving behind what is redundant or irrelevant. An effective summarization method provides ease of access, saves time, and allows more information to fit in a given area, which saves memory. Two kinds of summarization can be performed [2]. In extractive summarization, the subset of words that represents the major points is pulled out of a given piece of text and combined to build a summary. Abstractive summarization is more advanced: deep learning methods are deployed to make the machine summarize the way a human does. The grammatical inaccuracy of extractive methods can be overcome by abstractive methods, since the latter are capable of generating new sentences that convey only the important information from the source. Although abstractive techniques perform better than extractive methods, they require complicated deep learning and complex models to implement. Hence, extractive summarization is more popular and widely adopted.

Text summarization has been important for many years. To generate a summarized document, a reader and an identifier are required: a summary is assembled by collecting equivalent information files and extracting only the important points to be inserted in the summary. When a user searches for information by entering a query, the internet provides a huge number of files that match the query, and the user's time is wasted in searching for the pertinent data, since it is not possible for the user to decide at once which file is required. This issue grows rapidly as the flow of information into the World Wide Web increases [3]. Text summarization can also be seen as a technique in which a summary or abstract is automatically generated from one or more texts by computational means [4]. According to Babar [5, 6], a summary is a text that can be assembled from one or more texts, delivers the significant information of the actual text, and is in condensed form. The main aim of automatic text summarization is to turn the source text into a condensed version while preserving its semantics. The most significant advantage of a summary is that it reduces reading time; technologies that produce a coherent summary must take into account variables such as writing style, length and syntax [7].

The content of a text summary depends purely on the needs of the user. Topic-specific summaries address the needs of users with a particular interest; at the other end, generic summaries try to cover as much information from the source as possible while maintaining the original topic organization of the source text [8]. Summaries produced by humans are non-extractive, yet much of the research on summarization concerns extractive methods. In practical scenarios, extractive summaries meet the needs of users more effectively than abstractive summaries, basically because the semantic representation and NLP problems involved in abstractive methods are harder than data-driven approaches such as sentence extraction; abstractive summarization is still at an embryonic stage of research [9]. In our previous research, automated feature extraction was done using the weighted TF-IDF model, which suffered from high-dimensional features and the resulting performance issues of memory usage and time complexity. In this paper, extractive summarization is used for the automatic generation of summaries from multiple documents using weighted TF-IDF and centroid-based k-means clustering.

The rest of the paper is organized as follows: Section II highlights the important research works carried out in the domain, Section III describes the problem of high dimensionality, Section IV presents the proposed methodology, Section V gives the details of the experimental environment, Section VI discusses the results obtained in comparison with other methods, and the paper ends with the conclusion in Section VII.

RELATED WORKS

In the available literature, text summarization methods are broadly classified by the number of documents handled and by the techniques deployed. The following subsections highlight the research carried out in these categories.

Single Document Text Summarization
Only one document is taken as input, and summarization produces a single output [10][11]. Thomas et al. [12] proposed a system for automatic keyword generation for the summarization of single e-newspaper documents. Marcus et al. [13] developed a discourse-dependent summarization method which predicts the adequacy of summary text based on the discourse of a single e-newspaper.

Multiple Document Text Summarization
When multiple documents have to be summarized, the input is a set of documents and a single summary document is delivered as output [14][15]. Mirroshandel et al. [16] presented two different algorithms for extracting temporal relations between keywords and for summarizing text from a multi-document input: the first was a weakly supervised ML approach for classifying the temporal relationships between events, and the second was an unsupervised, EM-based technique for the same temporal relation extraction. Min et al. [17] made use of the information common to a document set belonging to a particular category to improve the quality of automatic summary generation for multiple documents.

Extractive Text Summarization
Here, the summarizing mechanism discovers the most crucial information in the input document in order to build its summary [18][19]. In these methods, both the statistical and the linguistic features of the sentences contribute to deciding the most relevant topic words in the input document. Thomas et al. [12] developed a hybrid model for extractive summarization of the important features using an ML algorithm on e-newspapers. Minz et al. [20] used an open-source extractive summarization system called SWING for multi-document summarization; the information common to the input documents of a specific category is treated as a feature and encapsulated in the concept of CSI ("Category-Specific Importance"), and CSI is shown to be an important metric for selecting sentences in extractive summarization tasks. Marcus et al. [13] proposed a discourse-dependent extractive summarizing method which uses the RPA ("Rhetorical Parsing Algorithm") to determine the discourse structure of the input text and computes a partial ordering over the elementary and parenthetical units of the text. Erkan et al. [21] proposed an extractive summarizer with three steps, namely feature extraction, feature vector creation and re-ranking; the main features considered are centroids, length cut-off and page rank. Alguliev et al. [22] developed an extractive summarizer based on unsupervised learning which optimizes three important properties, namely relevance, length and redundancy; the documents are split into sentences and only the important sentences are selected. Aramaki et al. [23] proposed a supervised extractive summarization method which detects negative events and investigated which types of information are useful for doing so; an SVM classifier is used to distinguish negative events from other events.

Abstractive Text Summarization
In this method, the machine is intended to comprehend the documents like a human, and the summary is delivered with key sentences [24]. Linguistic methods are used to examine and interpret the text, finding fresh concepts and expressions with the closest meaning, and generating a shorter version of the text that conveys the main and important information of the source [25]. Brandow et al. [26] introduced an abstractive system in which statistical analysis of the text extracts key signatures from the corpus; weights are assigned to these words, sentence weights are then derived from the extracted signatures, and the top-weighted sentences are selected for the summary. Daume et al. [27] proposed an abstractive summarization mechanism which maps every document into a kind of database representation; documents are classified into four important categories (one-person, one-event, multiple-event and natural calamity), a small headline is generated using a set of pre-defined templates, and finally summaries are generated from the databases.

PROPOSED METHODOLOGY
The proposed method is based on unsupervised learning and is accomplished in three steps: (i) pre-processing of the generated features, (ii) calculation of sentence scores, and (iii) centroid-based k-means clustering of the sentences to extract only the important sentences as the summary. The proposed architecture is depicted in Figure 3.1.

Figure 3.1: Proposed Methodology (documents are pre-processed, sentence scores are calculated from the TF-IDF sum, term frequency and weighted IDF features, and centroid-based k-means clustering produces the summary)

Pre-Processing of Features
A. Remove Highly Correlated Features
Features that are highly correlated or collinear can cause overfitting. When a pair of variables is highly correlated, one of them can be removed to reduce dimensionality without much loss of information. Figure 3.2 shows the code snippet used to remove features with high correlation; the threshold is fixed at 0.5 in our research.

Figure 3.2: Code Snippet for Removal of Features with High Correlation
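The snippet in Figure 3.2 is not reproduced in this copy. A minimal sketch of the same idea, assuming the features are held in a pandas DataFrame and using the 0.5 threshold mentioned above (the function name and data are illustrative, not the paper's implementation):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds the threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so that each pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Usage (illustrative): reduced = drop_highly_correlated(feature_df, threshold=0.5)
```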
B. Recursive Feature Elimination
Recursive feature elimination works by repeatedly removing the least important features, and it continues until the specified number of features is reached. It can be used with any model that assigns weights to features, either through coef_ or feature_importances_. Here we use a Random Forest to select the 100 best features. Figure 3.3 shows the code snippet for the removal of features through the recursive method.

Figure 3.3: Code Snippet for Recursive Feature Elimination
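As with Figure 3.2, the snippet itself is not reproduced here. A hedged sketch of recursive feature elimination with a Random Forest in scikit-learn, keeping the 100 features mentioned above (X and y are placeholders for the feature matrix and labels, not the paper's data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Recursive feature elimination driven by Random Forest feature importances.
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector = RFE(estimator, n_features_to_select=100, step=10)

# X: feature matrix, y: labels (illustrative placeholders).
# selector.fit(X, y)
# X_reduced = selector.transform(X)   # keeps the 100 best features
# kept_mask = selector.support_
```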
Sentence Score Calculation
Every sentence is given an importance score, which reflects the goodness of that sentence. These scores are used to order the sentences and to pick out the most important ones: the probability that a sentence appears in the resulting summary is directly proportional to its importance score. Every sentence is represented by a set of features, and its score is calculated as a weighted sum of the independent feature values.
The features we have used are:
A. TF-IDF Sum: The goodness of a sentence is normally represented by the importance of its words. We use TF-IDF, a powerful heuristic that is well suited to ranking words by their importance. This feature is the sum of the TF-IDF scores of all the individual words in the sentence.
B. TF: Term frequency measures how frequently a term occurs in a particular document. Since documents are not of uniform length, a term may appear many more times in a longer document than in a shorter one, so we normalize by dividing the term frequency by the length of the document:

$TF(t) = \dfrac{\text{number of times term } t \text{ appears in the document}}{\text{total number of terms in the document}}$   (1)

C. IDF: Inverse document frequency measures the importance of a term. When computing TF, all terms are considered equally important, but prepositions and conjunctions that occur many times carry little information. Frequent terms are therefore weighed down and rare words are scaled up by calculating

$IDF(t) = \log_e \dfrac{\text{total number of documents}}{\text{number of documents that contain term } t}$   (2)

D. TF-IDF Weighting: The term frequency and IDF definitions are combined to produce a composite weight for every term in each document. The TF-IDF weight assigned to a term in a document is

$TF\text{-}IDF(t) = TF(t) \times IDF(t)$   (3)

The TF-IDF weight of a term in a document is therefore: (1) highest when the term occurs many times within a small number of documents; (2) lower when the term occurs only a few times in a document or occurs in many documents; and (3) lowest when the term occurs in virtually all documents.

Sentence Length
This is the number of words in the sentence. Longer sentences tend to contain more information about the document. This feature is calculated using the cal_score("Corpus") predefined function.
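To make equations (1) to (3) concrete, a small sketch of how the TF-IDF sum feature could be computed per sentence, treating each sentence as a document; the tokenizer and the toy corpus are illustrative simplifications rather than the paper's implementation:

```python
import math
from collections import Counter

def tfidf_sum_per_sentence(sentences):
    """For each sentence, return the sum of the TF-IDF scores of its words (eqs. 1-3)."""
    tokenized = [s.lower().split() for s in sentences]   # naive whitespace tokenizer
    n_docs = len(tokenized)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total_terms = len(tokens)
        score = 0.0
        for term, count in counts.items():
            tf = count / total_terms            # eq. (1)
            idf = math.log(n_docs / df[term])   # eq. (2)
            score += tf * idf                   # eq. (3), summed over the sentence
        scores.append(score)
    return scores

# Toy example (illustrative data only).
print(tfidf_sum_per_sentence(["the cat sat on the mat", "the dog barked", "cats chase dogs"]))
```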
k-Means Centroid Based Clustering
The next and most important phase is clustering. K-means clustering is an unsupervised learning method that provides many clustering solutions: it classifies the given data into a number of clusters fixed a priori. The intention is to define k centroids, one for each cluster, chosen so as to be placed as far as possible from one another. Every data point is then associated with the nearest centroid. When all the points have been assigned, the k centroids are recalculated as the new centres of the clusters resulting from the previous step, and a new association is generated between the data points and the nearest centroids. The locations of the k centroids change at each step until no more changes occur. Although the algorithm terminates automatically, it does not necessarily find the optimal configuration with respect to the global objective function; the algorithm can be run several times to reduce this effect. The problem is hard in terms of computational complexity, but it is generally solved with heuristic techniques that converge quickly to a local optimum. The objective function is defined as

$K = \sum_{m=1}^{p} \sum_{k=1}^{n} \left\| x_k^{(m)} - C_m \right\|^2$   (4)

where $\left\| x_k^{(m)} - C_m \right\|$ is the chosen distance measure between a data point $x_k^{(m)}$ and its cluster centre $C_m$, summed over the n data points and the p cluster centres.

The algorithm is summarized as follows.
Step 1: Start.
Step 2: Fix K points in the space as the first group of centroids.
Step 3: If all objects have been assigned, recalculate the positions of the K centroids; else go to Step 2.
Step 4: Repeat Steps 2 and 3 until the centroid movement is 0.

This method is intended to separate the objects into small groups for which the above metric is minimized.
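A minimal sketch of the clustering loop described above, written as plain Lloyd-style iterations; the initialization and the toy data are illustrative, and the paper's own implementation is not reproduced here:

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain k-means: assign points to the nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k initial centroids from the data points (random initialization for simplicity).
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignment (and hence centroid movement) no longer changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)   # recompute centre of cluster j
    return labels, centroids

# Toy example: cluster one-dimensional sentence scores into 2 groups (illustrative data).
labels, centres = kmeans(np.array([[0.10], [0.12], [0.15], [0.90], [0.95]]), k=2)
```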


The primary motive is to divide the complete set of documents into sentences. Every sentence is then considered as a point in a Cartesian plane. All the sentences are thereafter broken into individual tokens, and a TF-IDF score is calculated for every token in a sentence:

$tf_t = \dfrac{F(t,d)}{F(d)}$   (5)

where t represents a token, d is the document, F(t,d) is the frequency of the term t in the document d, and F(d) is the total number of terms in d. The inverse document frequency is calculated as

$idf_t = \log_{10} \dfrac{N}{n_t}$   (6)

where N is the total number of sentences in the document and $n_t$ is the number of sentences that contain the term t, and $tf\text{-}idf_t = tf_t \times idf_t$. The score of a sentence is calculated by summing the TF-IDF scores of all the tokens in the sentence and normalizing by the length of the sentence, as discussed earlier:

$Score(X) = \dfrac{1}{|X|} \sum_{t \in X} tf\text{-}idf_t$   (7)

where X is a sentence of the document, t ranges over the terms in X, and |X| is the length of the sentence X.

These sentence scores represent each sentence as a unique coordinate on a one-dimensional Cartesian plane and are used as the input to the clustering algorithm, which produces K cluster centres. Each sentence is then assigned to a cluster according to its computed score. Finally, we pick the cluster that contains the maximum number of sentences, and the summary is generated by reproducing those sentences in the same order in which they appear in the source document. This approach gives a concise summary because the densest cluster returned by the clustering algorithm contains the sentences with the highest scores in the complete document; since the sentence scores are computed by summing the individual TF-IDF scores of the words and normalizing by the sentence length, the sentences in the densest cluster are the ones that are contextually closest to the abstract of the document. The length of the summary changes with the length of the document. When a large value of K is chosen, the clusters are sparse, which makes the summary incoherent; when a very low value of K is chosen, the clusters are very dense and the summary is not concise. The value of K is therefore chosen dynamically based on the length of the document. After a number of experimental simulations with multiple values of K, the following formulation was arrived at: if the number of sentences N is less than or equal to 20, choose K = N - 4; otherwise choose K = N - 20.
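Putting the pieces together, a hedged end-to-end sketch of the procedure described above: score sentences with equations (5) to (7), choose K by the rule just stated, cluster the scores, and emit the densest cluster in source order. scikit-learn's KMeans is used for brevity, the sentence splitter is a naive simplification, and tfidf_sum_per_sentence is the helper sketched earlier after equation (3):

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(document: str) -> str:
    """Extractive summary: sentences of the densest score cluster, in source order."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]   # naive splitter
    n = len(sentences)
    if n == 0:
        return ""
    # Eq. (7): length-normalized sum of per-token TF-IDF scores.
    raw = np.array(tfidf_sum_per_sentence(sentences))
    lengths = np.array([max(len(s.split()), 1) for s in sentences])
    scores = (raw / lengths).reshape(-1, 1)
    # Dynamic choice of K: N - 4 for short documents (N <= 20), N - 20 otherwise.
    k = max(1, n - 4) if n <= 20 else max(1, n - 20)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
    # Reproduce the sentences of the densest cluster in their original order.
    densest = np.bincount(labels).argmax()
    return ". ".join(sentences[i] for i in range(n) if labels[i] == densest) + "."
```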

EXPERIMENTAL SETUP AND EVALUATION

The experiment is carried out on a minimum system configuration. The NVivo tool is used for the implementation, and the pre-processing of features is done using its appropriate built-in functions. The data considered for the experiment is taken from the US educational data hub; it is 256 MB in size and contains 4560 documents. Figure 4.1 shows a sample summary generated by the proposed method together with the original corpus. As the method used in this research is extraction based, it is important that the size of the resulting summary be 35% to 50% of the original text: if it is smaller than this range, the summary will not be concise and coherent when compared with a human-written summary. For evaluation, the simulation is run on different sample texts and compared with other extractive summarizers.

RESULTS AND DISCUSSION
The proposed method produces good results when compared with human-written abstractive summaries. The metrics used for the evaluation are calculated using the built-in functions of the NVivo tool. Figure 5.1 shows a sample summary extracted using the proposed methodology.

Figure 5.1: Sample summary extracted using the proposed methodology

Table 5.1 shows the comparative analysis of the proposed methodology in terms of co-selection. In the table, RU stands for relative utility.

Table 5.1: Comparison Based on Co-Selection Metrics
Methods | Precision | Recall | F-Measure | RU
SWING | 0.84 | 0.82 | 0.81 | 0.80
AUTO-SUMMARIZER | 0.89 | 0.87 | 0.84 | 0.79
MULTINODAL-RNN | 0.76 | 0.75 | 0.74 | 0.72
WVE | 0.91 | 0.90 | 0.89 | 0.91
GRAPH ENTROPY | 0.94 | 0.91 | 0.89 | 0.87
TF-IDF: CBC | 0.97 | 0.94 | 0.98 | 0.93

Figure 5.2 gives the graphical comparison of the co-selection based metrics of the different methods. It can be seen from the graph that the proposed methodology has a high precision of 0.97, a high recall of 0.94 and an F-measure of 0.98. The relative utility (RU) of the proposed TF-IDF: CBC, 0.93, is also higher than that of the other methods.

Figure 5.2: Comparative Analysis based on Co-Selection

Table 5.2 shows the comparative analysis of the content-based dimension reduction evaluation parameters. In the table, LCS stands for Longest Common Subsequence.

Table 5.2: Comparison of Content-Based Methods
Methods | Cosine Similarity | Unit Overlap | LCS
HLO | 0.82 | 0.84 | 0.86
FIS | 0.89 | 0.91 | 0.90
Spectral Clustering | 0.91 | 0.81 | 0.87
LexRank | 0.84 | 0.82 | 0.83
TF-IDF: CBC | 0.98 | 0.96 | 0.94


Figure 5.3 gives the graphical comparison of the content-based metrics of the different methods. It can be seen from the graph that the proposed methodology has a cosine similarity of 0.98, a high unit overlap value of 0.96 and a high longest common subsequence value of 0.94, which shows that the proposed TF-IDF: CBC outperforms the other available methods.

Figure 5.3: Comparison of Content-Based Metrics

Table 5.3 shows the comparative analysis of further content-based dimension reduction evaluation parameters. In the table, LSA stands for Latent Semantic Analysis.

Table 5.3: Comparison Based on Content-Based Methods
Methods | N-Granularity | Pyramid | LSA Measure
GFLES | 0.89 | 0.91 | 0.90
GGSDS | 0.87 | 0.86 | 0.81
TF-IDF: CBC | 0.98 | 0.96 | 0.94

Figure 5.4 gives the graphical comparison of these content-based metrics of the different methods. It can be seen from the graph that the proposed methodology has an N-granularity of 0.98, a high pyramid value of 0.96 and a high LSA measure of 0.94, which shows that the proposed TF-IDF: CBC outperforms the other available methods.

Figure 5.4: Comparison of Content-Based Measures (2)

Table 5.4 shows the comparative analysis of the weight-based dimension reduction evaluation parameters. In the table, F denotes frequency, B denotes binary and A denotes augmented.

Table 5.4: Comparison of Weight-Based Methods
Methods | F-Weight | B-Weight | A-Weight | Entropy
NB-SVM | 0.78 | 0.80 | 0.87 | 0.34
Hybrid SVM | 0.79 | 0.81 | 0.82 | 0.41
GA-IG | 0.78 | 0.76 | 0.79 | 0.62
SWN-SG | 0.61 | 0.63 | 0.68 | 0.54
TF-IDF: CBC | 0.99 | 0.98 | 0.96 | 0.21

Figure 5.5: Comparison based on Weight-Based Measures

Figure 5.5 gives the graphical comparison of the weight-based metrics of the different methods. It can be seen from the graph that the proposed methodology has a frequency weight of 0.99, a high binary weight of 0.98 and a high augmented weight of 0.96. The proposed method also produces the lowest entropy value, 0.21, which indicates its better accuracy. The overall comparison shows that the proposed TF-IDF: CBC outperforms the other available methods.

Table 5.5 shows the comparative term significance values obtained from various methods.

Table 5.5: Comparison of Term Significance
Method | Term Significance
SentiWordNet | 91.82
Auto-Summarizer | 93.15
Dictionary-tagger | 90.18
Text-STAT | 96.18
TF-IDF: CBC | 98.08

Figure 5.6 gives the graphical comparison of the term significance of the different methods. It can be seen from the graph that the proposed methodology has a high term significance of 98.08, a measure of similarity that is directly proportional to dimension reduction, which shows that the proposed TF-IDF: CBC outperforms the other available methods.


Figure 5.6: Comparison of Term Significance

CONCLUSION
In this paper, we proposed an automatic text summarization approach using weighted TF-IDF combined with centroid-based clustering (TF-IDF: CBC) for document summarization, in order to reduce the dimensionality of the extracted features. The k-means method is used to create groups of similar sentences, and the most important sentences are then selected for summary generation. In contrast to supervised methods, which need large numbers of samples for training, the proposed approach makes TF-IDF: CBC largely independent of both language and domain. The experimental results demonstrate that the proposed method produces better and more favorable results than other state-of-the-art approaches. The proposed method can further be enhanced by applying suitable optimization techniques to obtain optimized, dimension-reduced features.

REFERENCES
1. J. N. Madhuri and R. Ganesh Kumar, "Extractive Text Summarization Using Sentence Ranking," 2019 International Conference on Data Science and Communication (IconDSC), Bangalore, India, 2019, pp. 1-3.
2. S. R. Rahimi, A. T. Mozhdehi and M. Abdolahi, "An overview on extractive text summarization," 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, 2017, pp. 0054-0062.
3. Alzuhair and M. Al-Dhelaan, "An Approach for Combining Multiple Weighting Schemes and Ranking Methods in Graph-Based Multi-Document Summarization," IEEE Access, vol. 7, pp. 120375-120386, 2019.
4. Chen and M. C. Chen, "TSCAN: A Content Anatomy Approach to Temporal Topic Summarization," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 170-183, Jan. 2012.
5. S. A. Babar and P. D. Patil, "Improving performance of text summarization," International Conference on Information and Communication Technologies (ICICT 2014), Procedia Computer Science, vol. 46, 2015, pp. 354-363.
6. S. A. Babar and Liu, "Automatic text summarization using fuzzy inference," 22nd International Conference on Automation and Computing, Sep. 2016.
7. Ren J., Li G., Ross K., et al., "iTextMine: integrated text-mining system for large-scale knowledge extraction from the literature," Database (Oxford), 2018.
8. Kundan Krishna and Balaji Vasan Srinivasan, "Generating Topic-Oriented Summaries Using Neural Attention," Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1.
9. Liu and Y. Liu, "Towards Abstractive Speech Summarization: Exploring Unsupervised and Supervised Approaches for Spoken Utterance Compression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1469-1480, July 2013.
10. M. Gambhir and V. Gupta, "Recent automatic text summarization techniques: a survey," Artificial Intelligence Review, vol. 47, 2016. doi:10.1007/s10462-016-9475-9.
11. Alzuhair and M. Al-Dhelaan, "An Approach for Combining Multiple Weighting Schemes and Ranking Methods in Graph-Based Multi-Document Summarization," IEEE Access, vol. 7, pp. 120375-120386, 2019.
12. J. R. Thomas, S. K. Bharti and K. S. Babu, "Automatic keyword extraction for text summarization in e-newspapers," Proceedings of the International Conference on Informatics and Analytics, ACM, 2016, pp. 86-93.
13. M. P. Marcus, M. A. Marcinkiewicz and B. Santorini, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, 1993, pp. 313-330.
14. L. Shi, F. Wei, S. Liu, L. Tan, X. Lian and M. X. Zhou, "Understanding text corpora with multiple facets," Visual Analytics Science and Technology (VAST), 2010 IEEE Symposium on, IEEE, 2010, pp. 99-106.
15. R. M. Alguliev, R. M. Aliguliyev, M. S. Hajirahimova and C. A. Mehdiyev, "MCMR: Maximum coverage and minimum redundant text summarization model," Expert Systems with Applications, vol. 38, no. 12, 2011, pp. 14514-14522.
16. S. A. Mirroshandel and G. Ghassem-Sani, "Towards unsupervised learning of temporal relations between events," Journal of Artificial Intelligence Research.
17. Z. L. Min, Y. K. Chew and L. Tan, "Exploiting category-specific information for multi-document summarization," Proceedings of COLING, ACL, 2012, pp. 2093-2108.
18. Dong, Y. Chang, Z. Zheng, G. Mishne, J. Bai, R. Zhang, K. Buchner, Liao and F. Diaz, "Towards recency ranking in web search," Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, 2017, pp. 11-20.
19. R. Radev and K. R. McKeown, "Generating natural language summaries from multiple on-line sources," Computational Linguistics, vol. 24, no. 3, 2017, pp. 470-500.
20. L. N. Minh, A. Shimazu, H. P. Xuan, B. H. Tu and S. Horiguchi, "Sentence extraction with support vector machine ensemble," Proceedings of the First World Congress of the International Federation for Systems Research, 2015, pp. 14-17.
21. Erkan and D. R. Radev, "The University of Michigan at DUC 2004," Proceedings of the Document Understanding Conferences, Boston, MA, 2013.
22. R. M. Alguliev, R. M. Aliguliyev, M. S. Hajirahimova and C. A. Mehdiyev, "MCMR: Maximum coverage and minimum redundant text summarization model," Expert Systems with Applications, vol. 38, no. 12, 2011, pp. 14514-14522.
23. Aramaki, Y. Miura, M. Tonoike, T. Ohkuma, H. Mashuichi and K. Ohe, "Text2Table: Medical text summarization system based on named entity recognition and modality identification," Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, ACL, 2009, pp. 185-192.
24. Baralis and L. Cagliero, "Learning From Summaries: Supporting e-Learning Activities by Means of Document Summarization," IEEE Transactions on Emerging Topics in Computing, vol. 4, no. 3, pp. 416-428, July-Sept. 2016.
25. Y. Ma, P. Zhang and J. Ma, "An Ontology Driven Knowledge Block Summarization Approach for Chinese Judgment Document Classification," IEEE Access, vol. 6, pp. 71327-71338, 2018.
26. R. Brandow, K. Mitze and L. F. Rau, "Automatic condensation of electronic publications by sentence selection," Information Processing & Management, vol. 31, no. 5, 1995, pp. 675-685.
27. Saggion, K. Bontcheva and H. Cunningham, "Robust generic and query based summarization," Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, vol. 2, ACL, 2013, pp. 235-238.