
TEXT SUMMARIZATION USING NLP

ABSTRACT

Text summarization extracts the most precise and useful information from a large document while eliminating irrelevant or less important content. It can be carried out in two main ways: abstractive text summarization and extractive text summarization. Automatic text summarization can be applied to a single document or to multiple documents. A web page can also be summarized by scraping its content and then summarizing it. Summarization reduces file redundancy and saves the time needed to understand large amounts of information. The task is challenging: despite its broad applicability, a summary that is not produced properly cannot be used. NLP therefore helps to understand the language and to extract the sentences or pieces of information that are critical to understanding the topic. Here, the Text Rank algorithm and cosine similarity are used to summarize the text, and the input is given either as text or as the URL of a web page to be summarized.

KEYWORDS

Deep Learning, Natural Language Processing (NLP), Text Rank Algorithm, Text
Summarization, Extractive Text Summarization.
INTRODUCTION

Automatic text summarization is useful in many fields such as education, research, news article summaries, and more. Extractive text summarization can be used to gain insight into a document or a long text; here, it is performed with NLP techniques, namely the Text Rank algorithm and cosine similarity. Existing approaches use Recurrent Neural Networks, Long Short-Term Memory, graph-based frameworks, sentence ranking, supervised approaches, and so on. Neural network models like these can significantly increase execution time, and neural networks and supervised models require knowledge of a corpus of the language before they can understand it. The proposed approach instead uses the Text Rank algorithm, a graph-based natural language processing algorithm. Text Rank works well because it does not rely only on the local context of a text unit (vertex); rather, it considers information recursively drawn from the entire text (graph). Text Rank performs better than most supervised learning approaches.

The text summarization application is designed for easy use by accepting either a URL or text directly. Summarization can be done in two different ways, extractive and abstractive; here the extractive method is used. Because extraction alone can be challenging, the cosine similarity of sentences is used to judge their importance. The text of a website is scraped using the BeautifulSoup module available in Python. The number of paragraphs wanted in the summary can be given; this is optional and defaults to 5 if not provided. The Text Rank algorithm is used to rank sentences and words according to their importance and usage in the given input. The result is displayed on the left side, in a box headed "Text summary". The front end is a Django application, which follows the MVT (Model, View, Template) pattern: all requests go to urls.py, views are selected based on the URL, and models are used accordingly.
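As an illustration of this routing, a minimal urls.py might look like the sketch below. The view names (index, summary) are assumptions for illustration, not the application's actual code.

from django.urls import path
from . import views

# Each incoming request is matched against these patterns;
# the matching view function is then called.
urlpatterns = [
    path('', views.index, name='index'),              # form for text or URL input (assumed view)
    path('summary/', views.summary, name='summary'),  # renders the generated summary (assumed view)
]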
RELATED WORK

LITERATURE SURVEY

1. Song, S., Huang, H., & Ruan, T. Abstractive text summarization using LSTM-CNN based deep learning. Multimedia Tools and Applications 78, 857–875 (2019), Springer.
In this paper, abstractive text summarization is performed with an LSTM-CNN model. The datasets are drawn from daily news coverage on the CNN and DailyMail websites. The CNN dataset has more than 92,000 texts and corresponding summaries; the DailyMail dataset has 219,000 texts and corresponding summaries. Preprocessing is done in three steps: word segmentation, morphological reduction, and coreference resolution.

2. Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K. Reddy. 2020. Neural Abstractive Text Summarization with Sequence-to-Sequence Models. ACM/IMS Trans. Data Sci. 2, 1, Article 1 (December 2020).
In this paper, abstractive text summarization is performed with an RNN-based Seq2Seq model. The datasets used are the CNN/DailyMail dataset (300k news articles), the Newsroom dataset (1.3 million news articles), and the Bytecup dataset (1.3 million news articles). The paper also discusses evaluation parameters such as ROUGE, BERTScore, fluency, factual correctness, and relevance.
3. J. N. Madhuri and R. Ganesh Kumar, "Extractive Text Summarization Using Sentence Ranking," 2019 International Conference on Data Science and Communication, 2019, pp. 1-3, IEEE.
In this paper, extractive text summarization is done using sentence ranking on a single document. The input is a word document containing text. The main task is to identify the important paragraphs and assign weights to the sentences. The generated summary is then compared to a human-generated summary; the evaluation metric used is ROUGE.

4. Joshi, A., Fidalgo, E., Alegre, E., & Fernández-Robles, L. (2019). SummCoder: An Unsupervised Framework for Extractive Text Summarization Based on Deep Auto-encoders. Expert Systems with Applications, Elsevier.
In this paper, extractive text summarization is based on three metrics: sentence content relevance, sentence novelty, and sentence position relevance. Content relevance is computed using deep auto-encoders. By combining these three metrics, the authors perform extractive text summarization. The datasets used are the CNN/DailyMail dataset, the DUC dataset, the Tor Illegal Documents summarization dataset, and a blog summarization dataset. The approach performed better on some ROUGE evaluation metrics than traditional ML models.
5. Alguliyev, R. M., Aliguliyev, R. M., Isazade, N. R., Abdi, A., & Idris, N. (2018). COSUM: Text summarization based on clustering and optimization. Expert Systems, e12340, Elsevier.
In this paper, text summarization is done with a clustering and optimization technique called COSUM. The first step clusters sentences using k-means; the second step selects important sentences from the clusters based on different features. The datasets used are DUC2001 (309 articles) and DUC2002 (567 documents). Pre-processing consists of splitting the text into sentences, removing stop words and noisy words, lowercasing, and stemming. The evaluation metric used is ROUGE; the model performs better on ROUGE-1 and ROUGE-2.

6. Mohamed, M., & Oussalah, M. (2019). SRL-ESA-TextSum: A text summarization approach based on semantic role labelling and explicit semantic analysis. Information Processing & Management, 56(4), 1356–1372, Elsevier.
This paper follows graph-based text summarization techniques for single and multiple documents. The dataset used is DUC2002, which is publicly available.
7. Goularte, F. B., Nassar, S. M., Fileto, R., & Saggion, H. (2019). A text summarization method based on fuzzy rules and applicable to automated assessment. Expert Systems with Applications, 115, 264–275, Elsevier.
In this paper, automatic text summarization is done using fuzzy rules over different features; the important text is extracted with these rules. The dataset is a Brazilian Portuguese dataset of texts written by students in a virtual learning environment. ROUGE is the metric used for comparison.

8. El-Kassas, W. S., Salama, C. R., Rafea, A. A., & Mohamed, H. K. (2020). EdgeSumm: Graph-based framework for automatic text summarization. Information Processing & Management, 57(6), 102264, Elsevier.
In this paper, automatic text summarization (ATS) is done with the graph-based framework "EdgeSumm". The datasets used are DUC2001 (308 English news documents and 616 model summaries) and DUC2002 (567 news report documents). The performance metric is ROUGE; the average ROUGE score is better than other standard and state-of-the-art systems.
9. Tsai, C.-F., Chen, K., Hu, Y.-H., & Chen, W.-K. (2020). Improving text summarization of online hotel reviews with review helpfulness and sentiment. Tourism Management, 80, 104122, Elsevier.
In this paper, the authors focus on text summarization to obtain sentences that depict customer sentiment about the services provided by hotels. The dataset is taken from the online hotel booking platform TripAdvisor. Pre-processing includes spell checking, word segmentation, stemming, and part-of-speech tagging. After summarization, sentiments are extracted to identify which services the hotels can improve.

10. Hernandez-Castaneda, A., Garcia-Hernandez, R. A., Ledeneva, Y., & Millan-Hernandez, C. E. (2020). Extractive Automatic Text Summarization based on Lexical-semantic Keywords. IEEE Access, 1–1.
In this paper, extractive text summarization is done in two steps. First, features are generated using LDA, one-hot encoding, TF-IDF, and Doc2Vec. Second, similar sentences are clustered, with proximity measured by cosine similarity and the silhouette index; important sentences are then selected from the clusters to generate the summary. On ROUGE-1, ROUGE-2, and ROUGE-SU, the method performed better than previous approaches that use only LDA or TF-IDF. The datasets used are DUC2002 and TAC2011.


11. Al-Maleh, M., & Desouki, S. Arabic text summarization using deep learning approach. J Big Data 7, 109 (2020), Springer.
In this paper, text summarization is done with a sequence-to-sequence deep learning model. Baseline models are also applied, and the ROUGE score is used as the performance metric for comparison. The dataset comprises 300,000 articles and their headlines.
12. R. S. Shini and V. D. A. Kumar, "Recurrent Neural Network based Text Summarization Techniques by Word Sequence Generation," 2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021, pp. 1224-1229, IEEE.
In this paper, the authors describe various deep-learning-based text summarization techniques. The datasets used are the CNN/DailyMail dataset, the New York Times dataset, the DUC2004 dataset, and an Amazon review dataset. The paper also focuses on the pre-processing steps to be followed.
13. K. Merchant and Y. Pande, "NLP Based Latent Semantic Analysis for Legal Text Summarization," 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018, pp. 1803-1807, IEEE.
In this paper, text summarization is done using latent semantic analysis (LSA). The authors use both single-document and multi-document approaches. The dataset is taken from legal judgements issued by the Indian judiciary system. The method achieves a ROUGE-1 score of 0.58.
14. Mutlu, B., Sezer, E. A., & Akcayol, M. A. (2020). Candidate sentence selection for extractive text summarization. Information Processing and Management, Elsevier.
In this paper, text summarization is done using LSTM (Long Short-Term Memory). The datasets are DUC2001 and SIGIR2018. The method achieves ROUGE-1 of 0.607, ROUGE-2 of 0.501, and ROUGE-L of 0.569.

15. B. Mutlu, E. A. Sezer, and M. A. Akcayol, Multi-document extractive text summarization: A comparative assessment on features, Knowledge-Based Systems (2019), Science Direct.
In this paper, extractive text summarization is done with fuzzy inference systems. The dataset used is DUC2002. ROUGE-1, ROUGE-2, and ROUGE-L reach 0.66, 0.59, and 0.66 respectively, and the method outperforms neural networks on ROUGE-1. Detailed pre-processing steps for extractive text summarization are also discussed.

PROPOSED WORK

METHODOLOGY

The main goal of this project is to summarize text: the application takes text as input and displays the summary through a user interface.

It involves three steps:

1. Preprocessing

2. Implementing the Text Rank algorithm

3. Displaying the result using the Django framework

Preprocessing

The text is taken from a textbox in the provided user interface. The user can also provide a URL from which the text is to be extracted; the paragraphs present in the web page given by the user are scraped and taken as input, for example as in the sketch below. This input text must be preprocessed before applying the Text Rank algorithm.
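A minimal sketch of the scraping step, assuming the article text lives in <p> tags; the function name and URL are illustrative, not taken from the paper.

import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url):
    # Download the page and parse its HTML
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Join the text of every paragraph tag into one input string
    return " ".join(p.get_text() for p in soup.find_all("p"))

text = scrape_paragraphs("https://example.com/article")  # hypothetical URL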

Tokenizing the text: The text is tokenized using an NLP library.

e.g., "This is a sample. This is a sentence."

After word tokenization, the text becomes the token sequence:

This | is | a | sample | . | This | is | a | sentence | .

After sentence tokenization, it becomes two sentences:

This is a sample. | This is a sentence.
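Using NLTK as the NLP library (one possible choice; the paper does not name a specific library), the two tokenization steps look like this:

import nltk
nltk.download("punkt")  # tokenizer models, downloaded once
from nltk.tokenize import word_tokenize, sent_tokenize

text = "This is a sample. This is a sentence."
print(word_tokenize(text))
# ['This', 'is', 'a', 'sample', '.', 'This', 'is', 'a', 'sentence', '.']
print(sent_tokenize(text))
# ['This is a sample.', 'This is a sentence.']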

Implementing the Text Rank algorithm:

The preprocessed text is the input to this step. Extractive summarization requires selecting the most important sentences from the whole input, so this step identifies the sentences to be displayed in the summary. The importance of a sentence is identified with the cosine similarity method: the similarity of two non-zero vectors A and B is measured as the cosine of the angle between them, cos(θ) = (A · B) / (‖A‖ ‖B‖). Python libraries in which cosine similarity is available include scikit-learn and TensorFlow. The similarity increases as the distance between the two vectors decreases, and vice versa.
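For example, with scikit-learn the similarity of two sentences can be computed from their TF-IDF vectors. TF-IDF is one possible vectorization, assumed here for illustration; the paper does not fix a specific one.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This is a sample.", "This is a sentence."]
vectors = TfidfVectorizer().fit_transform(sentences)   # one vector per sentence
print(cosine_similarity(vectors[0], vectors[1]))       # cosine of the angle, in [0, 1]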

The initial step is to extract all the sentences from the text. This might be as simple as splitting the text at "." or at newlines, or more complicated if we want to fine-tune the definition of a sentence. Once all the sentences of the text are available, a graph is created in which each sentence is a node and edges to the k most similar sentences are established, weighted by similarity.

This formulation allows us to program Text Rank without any matrix arithmetic; all that is needed is the graph and a function to compute sentence similarity. The algorithm determines how similar each sentence is to the rest of the text. The similarity function should be directed at the meaning of the sentences, and the cosine similarity approach works well here. A combined sketch of graph construction and ranking follows.
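The following sketch combines the two steps above, assuming TF-IDF vectors for the similarity function and using networkx for the graph and the PageRank computation; these library choices are assumptions for illustration, not mandated by the paper.

import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(text, num_sentences=5):
    sentences = sent_tokenize(text)
    vectors = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(vectors)        # pairwise sentence similarities
    graph = nx.from_numpy_array(similarity)        # one node per sentence, weighted edges
    scores = nx.pagerank(graph)                    # Text Rank score per sentence
    top = sorted(range(len(sentences)), key=scores.get, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))  # restore original order

For brevity this sketch links every pair of sentences; restricting edges to the k most similar sentences, as described above, only changes the graph-construction line.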

If we extract words instead of sentences and follow the same algorithm with a similarity function between words, Text Rank can be used to extract keywords from the text. The idea is that the word most similar to all the other words is the most important one. Filtering stop words is very important here.
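A keyword variant can be sketched the same way. In this sketch, co-occurrence within a small window stands in for the word-similarity function, a common simplification in Text Rank keyword extraction; the window size and function name are assumptions.

import networkx as nx
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# requires nltk.download("punkt") and nltk.download("stopwords") once

def keywords(text, top_k=5, window=2):
    stops = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalnum() and w.lower() not in stops]   # stop-word filtering first
    graph = nx.Graph()
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:         # link words co-occurring nearby
            if other != word:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]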
