ABSTRACT
Text summarization extracts the most precise and useful information from a large document while eliminating irrelevant or less important content. It can be carried out in two main ways: abstractive text summarization and extractive text summarization. Automatic text summarization can be applied to a single document or to multiple documents. A web page can also be summarized by scraping its content and summarizing it. Summarization reduces redundancy across files and saves the time needed to understand large amounts of information. The task is challenging despite its wide applicability: a summary that is not produced properly is of little use. Natural Language Processing (NLP) therefore helps to understand the language and extract the sentences or information that are critical to understanding the topic. Here, the TextRank algorithm and cosine similarity are used to summarize the text. The input is given either as raw text or as the URL of a web page whose summary is required.
KEYWORDS
Deep Learning, Natural Language Processing (NLP), Text Rank Algorithm, Text
Summarization, Extractive Text Summarization.
INTRODUCTION
LITERATURE SURVEY
1. Song, S., Huang, H. & Ruan, T. Abstractive text summarization using LSTM-CNN
based deep learning. Multimedia Tools and Applications 78, 857–875 (2019), Springer.
In this paper, abstractive text summarization is performed with an LSTM-CNN model. The
datasets are built from daily news coverage on the CNN and DailyMail websites. The CNN
dataset has more than 92,000 texts with corresponding summaries, and the DailyMail dataset
has 219,000 texts with corresponding summaries. Preprocessing is done in three steps: word
segmentation, morphological reduction, and coreference resolution. The figure below shows
the design.
2. Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K. Reddy. 2020.
Neural Abstractive Text Summarization with Sequence-to-Sequence Models. ACM/IMS
Trans. Data Sci. 2, 1, Article 1 (December 2020).
In this paper, abstractive text summarization is performed with an RNN-based Seq2Seq model.
The datasets are the CNN/DailyMail dataset (300k news articles), the Newsroom dataset (1.3
million news articles), and the Bytecup dataset (1.3 million news articles). The paper also
discusses evaluation measures such as ROUGE, BERTScore, fluency, factual correctness, and
relevance.
Figure: Seq2Seq model
3. J. N. Madhuri and R. Ganesh Kumar, "Extractive Text Summarization Using
Sentence Ranking," 2019 International Conference on Data Science and
Communication 2019, pp. 1-3, IEEE.
In this paper, extractive text summarization is done using sentence ranking, applied to
single-document summarization. The input is a word document containing text. The main task
is to identify the important paragraphs and assign weights to sentences. After summarization,
the result is compared to a human-generated summary. The evaluation metric used is ROUGE.
4. Joshi, A., Fidalgo, E., Alegre, E., & Fernández-Robles, L. (2019). SummCoder: An
Unsupervised Framework for Extractive Text Summarization Based on Deep Auto-
encoders. Expert Systems with Applications, Elsevier.
In this paper, extractive text summarization combines three metrics: sentence content
relevance, sentence novelty, and sentence position relevance. Content relevance is measured
using deep auto-encoders. By combining these three metrics the authors perform extractive
summarization. The datasets used are the CNN/DailyMail dataset, the DUC dataset, the Tor
Illegal Documents summarization dataset, and a blog summarization dataset. The approach
performs better on some of the ROUGE evaluation metrics than traditional ML models.
5. Alguliyev, R. M., Aliguliyev, R. M., Isazade, N. R., Abdi, A., & Idris, N. (2018).
COSUM: Text summarization based on clustering and optimization. Expert Systems,
e12340, Elsevier.
In this paper, text summarization uses a clustering-and-optimization technique called
COSUM. The first step clusters sentences with k-means; the second step selects important
sentences from the clusters based on different features. The datasets used are DUC2001 (309
articles) and DUC2002 (567 documents). The preprocessing steps are sentence splitting,
removal of stop words and noisy words, lowercasing, and stemming. The evaluation metric is
ROUGE; the model performs better on ROUGE-1 and ROUGE-2.
8. El-Kassas, W. S., Salama, C. R., Rafea, A. A., & Mohamed, H. K. (2020). EdgeSumm:
Graph-based framework for automatic text summarization. Information Processing &
Management, 57(6), 102264, Elsevier.
In this paper, Automatic Text Summarization (ATS) is performed with the graph-based
framework "EdgeSumm". The datasets used are DUC2001 (308 English news documents and
616 model summaries) and DUC2002 (567 news report documents). The performance metric is
ROUGE; the average ROUGE score is better than that of other standard and state-of-the-art
systems.
9. Tsai, C.-F., Chen, K., Hu, Y.-H., & Chen, W.-K. (2020). Improving text
summarization of online hotel reviews with review helpfulness and sentiment. Tourism
Management, 80, 104122, Elsevier.
In this paper, the authors focus on summarizing online hotel reviews to obtain sentences that
reflect customers' sentiments about the services provided by hotels. The dataset is taken from
the online hotel booking platform TripAdvisor. Preprocessing includes spell checking, word
segmentation, stemming, and part-of-speech tagging. After summarization, sentiment analysis
identifies which services the hotels can improve.
Figure: Proposed approach
11. Al-Maleh, M., Desouki, S. Arabic text summarization using deep learning approach.
J Big Data 7, 109 (2020), Springer.
In this paper, Arabic text summarization is done with a sequence-to-sequence deep learning
model. Baseline models are also applied, and the ROUGE score is used as the performance
metric for comparison. The dataset comprises 300,000 articles and their headlines. The
proposed methodology flowchart is shown below.
12. R. S. Shini and V. D. A. Kumar, "Recurrent Neural Network based Text
Summarization Techniques by Word Sequence Generation," 2021 6th International
Conference on Inventive Computation Technologies (ICICT), 2021, pp. 1224-1229,
IEEE.
In this paper, the authors describe various deep-learning-based text summarization
techniques. The datasets used are the CNN/DailyMail dataset, the New York Times dataset,
the DUC2004 dataset, and an Amazon review dataset. The paper also focuses on the
preprocessing steps to be followed. The proposed architecture is shown below.
13. K. Merchant and Y. Pande, "NLP Based Latent Semantic Analysis for Legal Text
Summarization," 2018 International Conference on Advances in Computing,
Communications, and Informatics (ICACCI), 2018, pp. 1803-1807, IEEE.
In this paper, text summarization is done using latent semantic analysis (LSA), with both
single-document and multi-document approaches. The dataset is taken from legal judgements
issued by the Indian judiciary. The ROUGE-1 score is 0.58. The proposed approach is
shown.
14. Mutlu, B., Sezer, E. A., & Akcayol, M. A. (2020). Candidate sentence selection for
extractive text summarization. Information Processing and Management, Elsevier.
In this paper, text summarization is done using LSTM (Long Short-Term Memory) networks.
The datasets are DUC2001 and SIGIR2018. ROUGE-1 is 0.607, ROUGE-2 is 0.501, and
ROUGE-L is 0.569.
15. B. Mutlu, E.A. Sezer and M.A. Akcayol, Multi-document extractive text
summarization: A comparative assessment on features, Knowledge-Based Systems
(2019), Science Direct
In this paper, extractive text summarization is performed with fuzzy inference systems. The
dataset used is DUC2002. ROUGE-1, ROUGE-2, and ROUGE-L reach 0.66, 0.59, and 0.66
respectively. The method outperforms neural networks on ROUGE-1. Detailed preprocessing
steps for extractive text summarization are also discussed.
PROPOSED WORK
METHODOLOGY
The main goal of this project is to summarize text: the system takes the text as input and
displays the summary through a user interface.
It involves 3 steps:
1. Preprocessing
2. Sentence ranking using cosine similarity and the TextRank algorithm
3. Summary generation
Preprocessing
The text is entered in a textbox in the user interface. The user can instead provide a URL
from which the text is to be extracted; the paragraphs present in the web page are scraped and
taken as input. This input text must be preprocessed before the TextRank algorithm is
applied.
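The preprocessing step above could be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names and the tiny stop-word list are assumptions, and in a real run the input text could come from the textbox or be scraped from a user-supplied URL (e.g. collecting paragraph tags with requests and BeautifulSoup) before being passed in.

```python
import re

# Hypothetical minimal stop-word list; a real system would use a fuller
# list such as the one shipped with NLTK.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "it"}

def split_sentences(text):
    # Naive sentence split: break after ".", "!" or "?" followed by space.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def preprocess(sentence):
    # Lowercase, keep only word characters, and drop stop words.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

text = "Text summarization is useful. It saves time when reading documents."
sentences = split_sentences(text)
tokenized = [preprocess(s) for s in sentences]
```

Each sentence is thus reduced to a list of content-bearing tokens, which is the form the similarity computation below expects.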
Sentence Ranking
The preprocessed text is the input for this step. Extractive summarization selects the most
important sentences from the whole input, so this step identifies the sentences to be displayed
in the summary. The importance of a sentence is measured with cosine similarity: the cosine
of the angle between two non-zero vectors. Python libraries that provide cosine similarity
include scikit-learn and TensorFlow. The similarity increases as the distance between the two
vectors decreases, and vice versa.
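The cosine measure can be sketched with the standard library alone, treating each sentence as a bag-of-words term-frequency vector (scikit-learn's `sklearn.metrics.pairwise.cosine_similarity` computes the same quantity on vector arrays). The function name here is an illustrative choice, not a fixed API.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    # Build term-frequency vectors and return the cosine of the angle
    # between them; 0.0 when either vector is empty.
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Identical token lists give 1.0; disjoint ones give 0.0.
print(cosine_similarity(["text", "rank"], ["text", "rank"]))  # → 1.0
```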
The first step is to extract all the sentences from the text. This may be as simple as splitting
the text at "." or at newlines, or more elaborate if we want to fine-tune the definition of a
sentence. Once all the sentences are extracted, we build a graph in which each sentence is a
node, linked to its k most similar sentences with edges weighted by similarity.
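The graph construction described above can be sketched as follows, with sentences as nodes and edges to the k most similar neighbours; the dictionary-of-dictionaries representation and the parameter names are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words token lists.
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_graph(tokenized_sentences, k=2):
    # One node per sentence; each node links to its k most similar other
    # sentences, with the similarity as the edge weight.
    graph = {i: {} for i in range(len(tokenized_sentences))}
    for i, s_i in enumerate(tokenized_sentences):
        sims = [(cosine(s_i, s_j), j)
                for j, s_j in enumerate(tokenized_sentences) if j != i]
        for sim, j in sorted(sims, reverse=True)[:k]:
            if sim > 0:
                graph[i][j] = sim
    return graph

sentences = [["text", "rank", "graph"], ["graph", "node"], ["cat", "dog"]]
graph = build_graph(sentences, k=1)
```

A sentence with no similar neighbours (the third one here) simply ends up with no edges, and the ranking step gives it only the baseline score.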
This formulation lets us implement TextRank without any matrix arithmetic; all we need is
the graph and a function that computes sentence similarity. The algorithm determines how
similar each sentence is to the rest of the text. The similarity function should reflect the
meaning of the sentence, and the cosine similarity approach works well.
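An end-to-end sketch of the ranking step is given below, using a plain power-iteration PageRank over the weighted similarity graph (a library routine such as networkx's `pagerank` could replace the hand-written loop). Function names, the whitespace tokenizer, and the damping/iteration defaults are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words token lists.
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def text_rank_scores(sentences, damping=0.85, iters=50):
    # Power-iteration PageRank over the sentence-similarity graph.
    tokens = [s.lower().split() for s in sentences]
    n = len(sentences)
    w = [[cosine(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]          # total outgoing weight per node
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * w[j][i] / out[j]
                                  for j in range(n) if w[j][i] > 0)
                  for i in range(n)]
    return scores

def summarize(sentences, top_n=1):
    scores = text_rank_scores(sentences)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_n])          # restore original order
    return " ".join(sentences[i] for i in keep)

sentences = [
    "text rank scores sentences",
    "sentences form a graph",
    "graph edges carry weights",
]
```

In this toy input the middle sentence overlaps with both neighbours, so it accumulates the highest score and is chosen first.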
If we extract words instead of sentences and follow the same algorithm with a similarity
function between words, TextRank can be used to extract keywords from the text: the word
most similar to all the other words is the most important one. Filtering stop words is very
important here.
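The keyword variant can be sketched as below: words become nodes, and words that co-occur within a small window are linked. As a simplification, a word's weighted degree stands in here for the full PageRank iteration, and the stop-word list and function name are illustrative assumptions.

```python
import re
from collections import defaultdict

# Hypothetical minimal stop-word list, filtered before building the graph.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "for"}

def extract_keywords(text, window=2, top_n=3):
    # Tokenize, drop stop words, then link words that co-occur within
    # `window` positions; a word's score is its weighted degree.
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    score = defaultdict(float)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if words[j] != word:
                score[word] += 1.0
                score[words[j]] += 1.0
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:top_n]]
```

The most frequently co-occurring content word dominates the ranking, which matches the intuition that the word most connected to all the others is the most important.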