
CSE4022 - Natural Language Processing

Review 2: Literature Review

Date: Sat, 4th Feb 2023

Team Members:

Harsh Chauhan: 20BCE1886

Nayan Khemka: 20BCE1882

Dolly Agarwala: 20BCE1863

Title: Comparative Study of Text Summarization Methods

Published in: International Journal of Computer Applications

Problem Statement: Text summarization is a natural language processing technique that is growing in popularity as a means of condensing information. It involves extracting the important content of the original document, condensing it, and presenting it as a summary. This paper compares various text summarization techniques across different kinds of applications. The bulk of the discussion covers the extractive and abstractive categories of summarization approaches in detail. Along with a taxonomy of summarization systems, the paper also discusses linguistic and statistical approaches to summarization.

Implementation Details: Summarization systems can be divided into several categories based on the types of summaries that are useful in different applications. Besides abstracts and extracts, there are several other kinds of summaries. The major dimensions of variation, and the types of reasoning needed to produce each kind of summary, are still not fully understood, which makes automatic text summarization a fascinating field to explore. Summarization techniques can be compared depending on the type of summary and the application. Summarization systems can be grouped into the following categories:

1. Based on Methods
There are two methods of summarization: extraction and abstraction. In extraction, sentences are copied verbatim from the source into the summary, whereas in abstraction, new sentences are generated specifically for the summary. Abstraction is particularly important when opinions are diverse. (A minimal code sketch of this distinction follows the list below.)

2. Based on the type of details
The informative or indicative nature of a summary depends on the type of detail[1]. An indicative summary gives only the main idea of the original text and is used to provide a short overview of lengthy material. Indicative summaries are usually brief and encourage the user to read the source document; for instance, a reader skims the synopsis on the back of a novel before buying it.
3. Based on the Content
This classification is based on the content type of the original document[1]. A generic summarization system can be used by any kind of user, and the summary is independent of the document's subject: every piece of information is treated as equally important and is not user-specific.
4. Based on limitation
Summaries can also be categorized according to restrictions on the input text[1]. Genre-specific systems accept only particular types of input, such as newspaper articles, stories, or manuals, and are therefore restricted in the kinds of documents they can summarize.
5. Based on number of input documents
Summarization systems may also be categorized by whether they take one or more documents as input[1]. Single-document summarization takes only one document as input; such summaries are typically simpler to produce because they cover a single document.
6. Based on language
A monolingual system accepts documents in only one language and produces output in that same language. Multilingual systems can accept documents written in several languages and produce summaries of their content.
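
To make the extraction/abstraction distinction in category 1 concrete, the following is a minimal, hypothetical sketch (not taken from the reviewed paper): it scores sentences by simple word frequency and copies the top-scoring ones verbatim into the summary, whereas an abstractive system would instead generate new sentences.

```python
# Minimal illustration of *extraction*: sentences are copied verbatim from the
# source document; an abstractive system would generate new sentences instead.
# Toy frequency-based scoring, for illustration only.
import re
from collections import Counter

def extractive_summary(text, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Score each sentence by the total corpus frequency of its words.
    scored = [(sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:k]
    # Present the selected sentences in their original document order.
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

doc = ("Text summarization condenses a document. Extractive methods copy "
       "source sentences verbatim. Abstractive methods write new sentences. "
       "Extractive systems are simpler to build.")
print(extractive_summary(doc, k=2))
```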

Result: Information overload is a result of the Internet's popularity and the rapid advancement of technology. This issue could be alleviated by powerful text summarizers that produce user-friendly summaries of documents. A system that allows a user to quickly find and obtain a summary of a document is therefore required. Summarizing a document with extractive or abstractive techniques is one option.

Extractive text summarization is simpler to build. Abstractive techniques are harder to implement but more effective, since they produce summaries that are semantically coherent rather than copied verbatim. This paper covered the pros and cons of each approach to summarizing a text, along with the various types of summarization techniques.

References:

[1] Gholamrezazadeh, S., Salehi, M. A., Gholamzadeh, B., 2009. A Comprehensive Survey on Text Summarization Systems. In: Proc. 2nd International Conference on Computer Science and its Applications.
[2] Goldstein, J., Kantrowitz, M., Mittal, V., Carbonell, J., 1999. Summarizing text documents: Sentence selection and evaluation metrics. In: Proc. ACM SIGIR '99, pp. 121–128.
[3] Saggion, H., Lapalme, G., 2002. Generating indicative-informative summaries with SumUM. Computational Linguistics, vol. 28, pp. 497–526.

Title: Neural Extractive Text Summarization with Syntactic Compression


Published by: Department of Computer Science, The University of Texas at Austin

Problem Statement: Recent neural network approaches to summarization rely mostly on selection-based extraction or generation-based abstraction. In this paper, the authors develop a neural model that performs joint extraction and syntactic compression for single-document summarization. To create the final summary, the model selects sentences from the document, identifies possible compressions using constituency parses, and scores those compressions with a neural model. For learning, the authors construct oracle extractive-compressive summaries and use this supervision to jointly train both components. Experimental results on the CNN/Daily Mail and New York Times datasets show that the model performs on par with state-of-the-art systems as measured by ROUGE. The method also outperforms a commercial compression module, and human inspection shows that the model's output generally maintains grammaticality.

Implementation Details: The paper presents a framework that combines the strong performance of neural extractive systems with the additional flexibility of compression and the interpretability that comes from having discrete compression options. The model first encodes the sentences of the original text and then sequentially selects a subset of them to compress further.

The compression options are derived from syntactic constituency parses and represent an expanded set of discrete choices compared with earlier work. They are chosen for each sentence so as to preserve meaning and grammaticality (Berg-Kirkpatrick et al., 2011; Wang et al., 2013).

In addition, the neural model considers the document context, the sentence itself, and the recurrent state of the decoder before deciding which compressions to apply.

Building the oracle summary used for supervision is a major challenge when training an extractive and compressive model. The authors use beam search to extract a set of high-quality sentences from the document and then further refine each sentence to obtain oracle compression labels. The extractive and compressive components are combined and learned jointly as part of the model's training objective.
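
As a loose illustration of how an extractive oracle might be constructed, the sketch below greedily adds the sentence that most improves unigram recall against the reference summary. This is a simplification: the reviewed paper uses beam search and ROUGE-based heuristics and also derives compression labels, none of which are reproduced here.

```python
# Greedy extractive-oracle sketch: repeatedly add the sentence that most improves
# unigram recall against the reference summary. The paper itself uses beam search
# and ROUGE; this simplified stand-in is for illustration only.
import re
from collections import Counter

def unigram_recall(selected_tokens, reference_counts):
    sel = Counter(w for toks in selected_tokens for w in toks)
    overlap = sum(min(c, sel[w]) for w, c in reference_counts.items())
    return overlap / max(1, sum(reference_counts.values()))

def greedy_oracle(sentences, reference, k=2):
    tok = lambda t: re.findall(r"[a-z]+", t.lower())
    sent_toks = [tok(s) for s in sentences]
    ref = Counter(tok(reference))
    chosen, chosen_toks = [], []
    for _ in range(k):
        best = max(range(len(sentences)),
                   key=lambda i: -1.0 if i in chosen
                   else unigram_recall(chosen_toks + [sent_toks[i]], ref))
        chosen.append(best)
        chosen_toks.append(sent_toks[best])
    return sorted(chosen)  # indices of the oracle extractive sentences

doc = ["The senate passed the budget bill on Friday.",
       "Lawmakers debated for several hours.",
       "The bill funds schools and hospitals."]
ref = "Senate passes budget bill funding schools and hospitals."
print(greedy_oracle(doc, ref, k=2))  # -> [0, 2]
```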

Models:

1. Extractive Sentence Selection:

A single document consists of n sentences, D = {s1, s2, ..., sn}. The i-th sentence is denoted si = {wi1, wi2, ..., wim}, where wij is the j-th word in si. The content selection module learns to pick a subset of D, denoted D̂ = {ŝ1, ŝ2, ..., ŝk | ŝi ∈ D}, where k sentences are selected.
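
As a rough, hypothetical sketch of this content selection step (not the authors' architecture), the snippet below represents each sentence as an averaged word embedding, scores it with a learned linear layer, and keeps the k highest-scoring sentences; names such as SentenceSelector are invented for illustration.

```python
# Hypothetical content-selection sketch in PyTorch: score sentences s_1..s_n of a
# document D and pick the k highest-scoring ones as the subset D_hat.
import torch
import torch.nn as nn

class SentenceSelector(nn.Module):
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.scorer = nn.Linear(emb_dim, 1)

    def forward(self, sentences, k):
        # sentences: list of LongTensors of word ids, one tensor per sentence s_i
        reps = torch.stack([self.emb(s).mean(dim=0) for s in sentences])  # (n, emb_dim)
        scores = self.scorer(reps).squeeze(-1)                            # (n,)
        topk = torch.topk(scores, k=min(k, len(sentences))).indices
        return sorted(topk.tolist())  # indices of the selected subset D_hat

# Toy usage with random word ids (untrained weights, illustration only)
selector = SentenceSelector(vocab_size=100)
doc = [torch.randint(0, 100, (8,)) for _ in range(5)]  # 5 sentences, 8 words each
print(selector(doc, k=2))
```

The real model uses richer sentence encoders and selects sentences sequentially while conditioning on previous choices; this sketch only mirrors the input/output structure described above.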

2. Text Compression:

The text compression module assesses these discrete compression options after the sentences have been selected and decides whether to keep particular phrases or words in those sentences. Figure 4 of the paper illustrates the decision of whether or not to remove a prepositional phrase (PP) from a sentence.

Based on the rules outlined in Section 2 of the paper, this PP is marked as deletable. After encoding the sentence and the compression, the network combines this information with the document context (VDOC) and the decoding context (HDEC) to decide whether or not to delete the span.
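
To illustrate how deletable spans can be derived from a constituency parse, the following is a minimal sketch (a simplification of the paper's rules, using a hand-written parse rather than an automatic parser): PP subtrees are collected as candidate deletions and removed from the sentence. In the actual model, a neural classifier decides for each candidate span whether the deletion should be applied.

```python
# Derive a candidate compression from a constituency parse: mark PP subtrees as
# deletable spans (a simplification of the paper's rules). Requires NLTK.
from nltk.tree import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (NN committee)) "
    "(VP (VBD approved) (NP (DT the) (NN plan)) "
    "(PP (IN after) (NP (DT a) (JJ long) (NN debate)))) (. .))")

sentence = " ".join(parse.leaves())
deletable = [" ".join(t.leaves()) for t in parse.subtrees() if t.label() == "PP"]

compressed = sentence
for span in deletable:
    compressed = compressed.replace(" " + span, "")

print("Original:  ", sentence)
print("Deletable: ", deletable)
print("Compressed:", compressed)
```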

Result: The paper introduces a neural network architecture for extractive and compressive summarization. In the model, syntax-derived compression options for each sentence are evaluated by a compression classifier attached to a sentence extraction model. A beam search procedure and heuristics are used to obtain an oracle set of high-scoring extraction and compression decisions for training. The approach improves significantly over the purely extractive model, surpasses earlier work on the CNN/Daily Mail corpus in terms of ROUGE, and, based on human evaluations, maintains acceptable grammaticality.

References:

[1] Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly Learning to Extract and Compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 481–490. Association for Computational Linguistics.

[2] Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the Original: Fact Aware Neural Abstractive Summarization. In AAAI Conference on Artificial Intelligence.

[3] Jaime Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 335–336, New York, NY, USA. ACM.

Title: Single Document Automatic Text Summarization using Term Frequency-Inverse Document Frequency (TF-IDF)

Published by: Computer Science Department, School of Computer Science, Bina Nusantara University, Jl. K.H. Syahdan No. 9, Jakarta Barat, DKI Jakarta, 11480, Indonesia

Problem Statement: The abundance of information online has led to an increase in research on automatic text summarization within the field of Natural Language Processing (NLP). By removing the less important information from a text, summarization makes the text shorter and makes it easier for the reader to find the information they need.

A wide variety of techniques can be used to summarize text, and TF-IDF is one of them. The goal of this study was to develop an automatic text summarizer using the TF-IDF algorithm and to compare it with other automatic text summarizers available online. The F-measure was used as the benchmark value to assess the summary output of each summarizer. With three data samples, the study achieved an accuracy rate of 67%.

Implementation Details: Several types of algorithms can be used to produce an automatic summary. The most commonly used is extractive text summarization based on Term Frequency-Inverse Document Frequency (TF-IDF). The experiment aims to help users read documents quickly by providing summaries generated by this software. Other tools perform automatic summarization in a similar way, but they only support summarizing a single document, whereas this application can summarize several documents.

However, the researchers in this experiment focus only on how well the software performs when summarizing a single document. The experiment also measures the accuracy of the TF-IDF-generated summary compared with a professional summary.
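
The general TF-IDF idea can be sketched as follows (a minimal, hypothetical illustration; the reviewed paper's exact pipeline and weighting differ): treat each sentence as a document, weight each word by its term frequency multiplied by its inverse document frequency, and extract the sentences with the highest total weight.

```python
# Minimal TF-IDF sentence scoring for extractive summarization (sketch of the
# general idea only; not the reviewed paper's implementation).
import math
import re
from collections import Counter

def tfidf_summary(text, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = sum((tf[w] / max(1, len(toks))) * math.log(n / df[w]) for w in tf)
        scores.append((score, i))
    # Take the k best sentences and restore their original order.
    chosen = sorted(sorted(scores, reverse=True)[:k], key=lambda t: t[1])
    return " ".join(sentences[i] for _, i in chosen)

doc = ("TF-IDF weighs how important a word is to a document. "
       "Words that occur often in one sentence but rarely elsewhere get high weight. "
       "The weather today is sunny. "
       "Sentences with many high-weight words are extracted into the summary.")
print(tfidf_summary(doc, k=2))
```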

The preprocessing function uses NLTK methods, including tokenization, stemming, part-of-speech (POS) tagging, and stopword removal, to process the document. After the document is loaded into the software, the preprocessing function uses tokenization to split the text into a list of words.

These tokenization operations include both sentence tokenization and word tokenization. Sentence tokenization splits the text into sentences, while word tokenization breaks a string of written text into individual words and punctuation.

The content is first converted to lowercase to normalize the text, erasing the difference between "News" and "news". The paragraphs are then tokenized into sentences, and the sentences are tokenized into a list of words. Every word in the list is tagged with the POS tagger to ensure that no extraneous words are included.

This POS tagger classifies the words as VERB (verbs), NOUN (nouns), PRON (pronouns), ADJ (adjectives), ADV (adverbs), ADP (adpositions), CONJ (conjunctions), DET (determiners), NUM (cardinal numbers), PRT (particles or other function words), X (other: foreign words, typos, abbreviations), or . (punctuation) (Petrov, Das, & McDonald, 2012). Only VERB and NOUN are counted in this experiment, because these word types are biased toward making a summary (Yohei, 2002). All stopwords and clitics are also removed to prevent ambiguities. Then the list of words is processed with a stemming function to normalize the words by removing affixes, ensuring that the result is a known word in the dictionary (Bird, Klein, & Loper, 2009).
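
A minimal sketch of this preprocessing pipeline using NLTK is shown below. It mirrors the description above rather than reproducing the paper's actual code, and it assumes the relevant NLTK data packages (punkt, averaged_perceptron_tagger, universal_tagset, stopwords) have already been downloaded.

```python
# Sketch of the described preprocessing pipeline using NLTK.
# Assumes nltk.download() has been run for: punkt, averaged_perceptron_tagger,
# universal_tagset, stopwords.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    text = text.lower()                                   # normalize "News" vs "news"
    sentences = nltk.sent_tokenize(text)                  # sentence tokenization
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    processed = []
    for sent in sentences:
        words = nltk.word_tokenize(sent)                  # word tokenization
        tagged = nltk.pos_tag(words, tagset="universal")  # tags: VERB, NOUN, ...
        # Keep only verbs and nouns, drop stopwords, then stem the survivors.
        kept = [stemmer.stem(w) for w, tag in tagged
                if tag in ("VERB", "NOUN") and w not in stops]
        processed.append(kept)
    return processed

print(preprocess("The news reported that researchers summarized the documents quickly."))
```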

Result: This study demonstrates how a software application for automatic text summarization uses the TF-IDF method. The experiment shows that the TF-IDF algorithm can be used as an efficient way to create an extractive summary: it produces summaries with a 67% accuracy rate, a better outcome than the other online summarizers achieve.
According to the statistical comparison between the program and the two online summarizers, the software provides a better summary. The extraction approach demonstrates the effectiveness of TF-IDF as a way of generating a value that indicates how significant a word is within a document.

References:

[1] Al-Hashemi, R. (2010). Text Summarization Extraction System (TSES) Using Extracted Keywords. International Arab Journal of e-Technology, 1(4), 164–168.

[2] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. United States: O'Reilly Media.

[3] Das, D., & Martins, A. F. (2007). A survey on automatic text summarization. Literature Survey for the Language and Statistics, 3(3), 1–12.
