
Text Summarization using NLP

A Full Semester Internship Report Submitted in Partial Fulfilment for the Award of the Degree

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted
by

Kesari Venkatesh Reddy (18341A0558)

Under the Esteemed Guidance of


Dr. K. Lakshmana Rao
Associate Professor
June 2022

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


Text Summarization using NLP
Internship carried out at
COGNIZANT TECHNOLOGY SOLUTIONS, BANGALORE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


GMR INSTITUTE OF TECHNOLOGY
(An Autonomous Institute, affiliated to J.N.T. University Kakinada)
NAAC “A” Graded, NBA Accredited, ISO 9001:2008 Certified Institution
G.M.R. Nagar, Rajam-532127, A.P.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BONAFIDE CERTIFICATE

Signature of Faculty Supervisor
Dr. K. Lakshmana Rao
Associate Professor
Department of CSE
GMRIT, Rajam.

Signature of Industry Supervisor
Ms. D. Anushree
Human Resource - Genc
Cognizant Technology Solutions

This is to certify that the report entitled “Text Summarization using NLP”, submitted by Kesari
Venkatesh Reddy (18341A0558), who completed the internship program under our guidance
and supervision at COGNIZANT TECHNOLOGY SOLUTIONS, is a record of bonafide work carried out
in partial fulfilment for the award of the B.Tech degree in the discipline of Computer Science
and Engineering from JNTUK. The results embodied in this report have not been submitted to
any other university or institution for the award of any degree or diploma.

Signature of the H.O.D

Dr. A. V. Ramana
Professor and HOD
Department of CSE
GMRIT, Rajam
Internship Certificate
ACKNOWLEDGEMENT

I would like to take this opportunity to thank Ms. D. Anushree, Human Resources - Genc,
CTS Bangalore, for providing all the necessary facilities that led to the successful completion
of my internship.

I would like to sincerely thank my internal supervisor, Dr. K. Lakshmana Rao, Associate
Professor, Department of Computer Science and Engineering, for wholehearted and valuable
guidance throughout the program.

I would like to thank our beloved Principal, Dr. C.L.V.R.S.V. Prasad, and the Head of the
Department, Dr. A. Venkata Ramana, Professor, Computer Science and Engineering, for providing
great support in completing the full semester internship.

It gives me immense pleasure to express my deep sense of gratitude to the central internship
team and to Dr. Surya Narayan Dash, Internship Head. I would also sincerely thank our department
coordinators, Dr. K. Sri Vidya, Associate Professor, Department of Computer Science and
Engineering, and Dr. V. Prasad, Associate Professor, Department of Computer Science and
Engineering, for their great support.

Kesari Venkatesh Reddy (18341A0558)


ABSTRACT
Text summarization is needed to obtain the most precise and useful information from a large
document and to eliminate irrelevant or less important content. It can be carried out in two
main ways: abstractive text summarization and extractive text summarization. Automatic text
summarization can be applied to a single document or to multiple documents. A web page can
also be summarized by scraping its content and summarizing the extracted text. This reduces
redundancy and saves time in understanding large amounts of information. The task is
challenging because of its wide range of uses, and a poorly produced summary is of little use.
NLP helps in understanding the language and extracting the sentences or information that are
critical to understanding the topic. Here, the Text Rank algorithm and cosine similarity are
used to summarize the text. The input is given either as text or as the URL of a web page to be
summarized.
Keywords:
Text Rank Algorithm, Natural Language Processing, Text Summarization, Extractive Text
Summarization


TABLE OF CONTENTS

CHAPTER TITLE

Certificate
Acknowledgement
Abstract
Index
List of Figures
List of Tables
1 Introduction
1.1 Introduction
1.2 Benefits of Internship
1.2.1 Benefits to the Students
1.2.2 Benefits to the Industry
1.2.3 Benefits to the Institution
1.3 Ethics
1.4 Values
2 Profile of the Company
2.1 About the Company
2.2 Services
3 Tasks Taken Up and Problem Definition
3.1 Introduction
3.2 Problem Statement
3.3 Existing System and its Disadvantages
3.4 Proposed System and its Advantages
3.5 Technologies Used
3.6 Literature Survey
4 Methodology and Learning
4.1 Data Pre-Processing
4.1.1 Data Cleaning
4.1.2 Tokenization
4.2 Implementing Text Rank Algorithm
4.3 Displaying Output through Django
4.4 Design
4.5 Requirements
5 Results
5.1 Results
6 Conclusion and Suggestions
6.1 Conclusions
6.2 Future Scope

Appendix

Bibliography
List of Figures

Fig. No Name of the Figure

3.1 Flow chart of Base Paper
3.2 Flow chart of Reference Paper
3.3 Flow chart of Reference Paper
3.4 Flow chart of Reference Paper
3.5 Flow chart of Reference Paper
3.6 Flow chart of Reference Paper
4.1 Django Framework Flow chart
4.2 Flow chart of Proposed System
5.1 The main page of our design
5.2 Input is given as URL of Sachin Tendulkar Wikipedia page as example
5.3 Output is shown like this with time taken
5.4 Input is given as text as example
5.5 Output

List of Tables

Tab. No Name of the Table

4.1 Word Tokenization of Sentence
4.2 Sentence Tokenization
5.1 Comparison of Results

Text Summarization using NLP 2022

1. INTRODUCTION

1.1 Introduction

An internship is a supervised training experience in a professional setting in which a
student learns and gains essential experience and expertise. An internship introduces
candidates, either full-time or part-time, to real-world experience related to their career
goals and interests. It is an excellent way to build the connections that are invaluable in
developing and maintaining a strong professional network for the future. Internships provide
real-world experience to those looking to explore a career field or to gain the knowledge and
skills required to enter it. An internship is relatively short-term in nature, with the primary
focus on getting on-the-job training and applying what is learned in the classroom to the real
world.

1.2 Benefits of Internship

Students learn how their course of study applies to the real world and build valuable
experience that makes them stronger candidates for jobs after graduation.
● Internship at a start-up will benefit in improving team spirit, adapting to flexible
working times and client services.
● You can get serious work experience, build a portfolio and establish a network of
professional contacts which can help you after you graduate.
● The main advantage is gaining practical knowledge. College provides theoretical
knowledge, which alone does not help much; working on a project gives practical
experience.
● Confidence increases when we are involved in solving problems and succeed in
solving them.

● Working on a project also improves communication and interpersonal skills, because
we need to talk to senior staff about the project.
● Having several internships while in college can be very impressive to potential
employers.
● Working in a team on a project teaches us how to interact with our colleagues and how
to handle disagreements without hurting anyone's feelings.

1.2.1 Benefits to the Students


● Learning by doing.

● All round development.

● Aid in career planning.

● Experience of professional working conditions.

● Smooth transition from campus to company.

1.2.2 Benefits to the Industry


● Steady stream of skilled manpower provides value addition and increased
productivity.

● Human Resource Development benefits.

● Conduit for Industrial Partnership.

● Employer Branding.

1.2.3 Benefits to the Institution


● Inputs to quickly adapt curriculum to match the needs of industry.
● Opportunities for research and consultancy.
● Access to industrial expertise and infrastructure.

1.3 Ethics
● Help develop an organizational environment favourable to acting ethically
● Improve their understanding of the software and related documents on which they
work and of the environment in which they will be used.

● Accept full responsibility for their own work.


● Improve their ability to produce accurate, informative, and well-written
documentation.
● Not promote their own interest at the expense of the profession, client or employer.
● Assist colleagues in professional development.
● Strive to fully understand the specifications for software on which they work.
● Improve their knowledge of the Code, its interpretation, and its application to their
work.

1.4 Values

● Professional communications.

● Be proactive, and when invited to work functions introduce oneself to people.

● Taking constructive criticism well.


● Being able to work independently with little guidance is very important in the
working world.
● Always work hard even if the task is small and seems unimportant.

● Make an effort during the course of the internship to build relationships with people
around the office.


2. PROFILE OF THE COMPANY

2.1 About the Company

Cognizant Technology Solutions Corporation is a multinational company and one of the major
business outsourcing companies in the world. The headquarters of the company are in Teaneck,
New Jersey. It was originally founded in Chennai, India, as a technology unit of Dun &
Bradstreet. In 1996, Cognizant began taking on work for international clients. The next year,
the company moved its headquarters from Chennai to Teaneck. Cognizant is listed on NASDAQ and
is a component of the NASDAQ-100 index. Having started with application maintenance work, it
moved into application development. The 2000s were a period of rapid growth for the company.
It became one of the Fortune 500 companies in 2011. It has also been named among the World's
Most Admired Companies. Its parent company was split into two new companies, Nielsen Media
Research and IMS Health, and Cognizant subsequently became a public subsidiary of IMS Health.
In 2003, IMS Health sold all its shares in Cognizant, and the CEO also resigned from his post.
The company expanded its work from IT services to outsourcing and business consulting as well.
The services provided include application development, business intelligence, supply chain
management, CRM, etc. The company has 318,400 employees globally, of which over 150,000 are in
India across 10 locations, with a plurality in Chennai. On 20 January 1994, Cognizant
registered its branch in Chennai, Tamil Nadu, India, with the legal name Cognizant Technology
Solutions India Private Limited. The company has local, regional and global delivery centres
in the UK, Australia, Hungary, Netherlands, Spain, China, Philippines, Canada, Brazil,
Argentina, Mexico, etc.

Business Units:

Cognizant is organized into several vertical and horizontal units. The vertical units focus
on specific industries such as Banking & Financial Services, Insurance, Healthcare,
Manufacturing and Retail. The horizontal units focus on specific technologies or process areas
such as Analytics, Mobile Computing, BPO and Testing. Both horizontal and vertical units
have business consultants, who together form the organization-wide Cognizant Consulting
team. Cognizant is among the largest recruiters of MBAs in the industry; they are involved
in business analysis for IT services projects.

2.2 Services Offered by Cognizant

Cognizant provides digital solutions that help businesses advance:

• Application Services & Modernization

• Artificial Intelligence

• Business Process Services

• Cloud Enablement

• Core Modernization

• Digital Experience

• Digital strategy

• Enterprise Services

• Industry & platform Solutions

• Infrastructure Services

• Quality Engineering & Assurance

• Security

• Software Product Engineering

• Sustainability Services

Quality engineering at speed and scale is a main principle of Cognizant. Cognizant
developers and architects employ agile practices to combine full-stack software
development with user-driven design. Cognizant focuses its software development on
platform-as-a-service (PaaS) environments to ensure quality, cloud portability and security
for every product. The end goal is to build the right software that meets one's needs
straightaway, with products that work smarter and faster.

Products:

• Cognizant Big Decisions

• Cognizant Data Insights

• Cognizant Insurance Intake Automation

• Cognizant Document Accelerator

• Risk Profile Gateway

• Digital Retirement Operations

• Shared Investigator Platform


Segments Cognizant Serves:

High-tech: Cognizant helps high-tech companies rethink their business models and plan and
implement transformational processes across the product lifecycle. Partnering with Cognizant
helps them get ahead of the demand curve, become nimbler and drive operational
efficiency for more profitable growth.

Platform: The race to the next billion users, monetizing content and supporting aggressive
growth in new channels has upended the industry. Cognizant partners with companies to
accelerate digital at scale, operate more efficiently and power growth.

Software: In the subscription economy, adapting to customer needs is the top priority.
Cognizant partners with software clients and keeps the emphasis on speed to market, helping
them to launch new revenue models and accelerate product development and release
cycles.


3. TASKS TAKEN UP AND PROBLEM DEFINITION

3.1 Introduction

Automatic text summarization is useful in many fields, such as education, research and news
article summarization. Extractive text summarization can be used to gain insights about a
document or long text. This work therefore performs extractive text summarization using the
Text Rank algorithm and cosine similarity. Existing approaches use Recurrent Neural Networks,
Long Short-Term Memory networks, graph-based frameworks, sentence ranking and supervised
learning. Neural network models like these can significantly increase execution time, and
neural networks and supervised models need a corpus of the target language in order to learn
it. The proposed approach uses the Text Rank algorithm, a graph-based ranking algorithm for
natural language text. Text Rank works well because it does not rely only on the local
context of a text unit (vertex), but considers information recursively drawn from the entire
text (graph). Text Rank has been reported to perform better than many supervised learning
approaches.

The text summarization application is designed for easy use: the user provides either a URL
or the text directly. Summarization can be done in two different ways, extractive and
abstractive; here the extractive method is used. Selecting the right sentences can be
challenging, so the cosine similarity of sentences is used to help. Text from a website is
scraped using the BeautifulSoup module available in Python. The desired number of lines in
the summary can be given; it is optional and defaults to 5 if not provided. The Text Rank
algorithm is used for ranking sentences and words according to their importance and usage in
the given input. The result is displayed on the left side in a box headed "Text Summary". The
web interface is built with Django, which follows the MVT (Model, View, Template) pattern.
All requests go to urls.py; from there, views are selected based on the URL and models are
used accordingly. This text summarizer application is used for summarizing the given text.
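
The scraping step described above can be illustrated with a minimal sketch. The report's application uses BeautifulSoup; the version below is only a stand-in built on Python's standard `html.parser` module so that it runs without extra packages. The names `ParagraphExtractor` and `extract_paragraph_text` are hypothetical and do not appear in the project code.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements (illustrative stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._in_p = False     # True while the parser is inside a <p> element
        self._chunks = []      # text pieces of the paragraph being read

    def _flush(self):
        # Join the collected pieces, collapse whitespace and store non-empty text.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_paragraph_text(html):
    """Return the text of all <p> elements in html, joined by single spaces."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()            # keep a trailing paragraph that was never closed
    return " ".join(parser.paragraphs)


page = "<html><body><h1>Title</h1><p>First <b>para</b>.</p><div>skip</div><p>Second\n para.</p></body></html>"
print(extract_paragraph_text(page))
```

Only paragraph text is kept, so headings, navigation and other page furniture do not reach the summarizer.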

3.2 Problem Statement

Text summarization can be carried out in two main ways: abstractive text summarization and
extractive text summarization. Automatic text summarization can be applied to a single
document or to multiple documents. A web page can also be summarized by scraping its content
and summarizing the extracted text. This reduces redundancy and saves time in understanding
large amounts of information. The task is challenging because of its wide range of uses, and
a poorly produced summary is of little use. NLP helps in understanding the language and
extracting the sentences or information that are critical to understanding the topic. Here,
the Text Rank algorithm and cosine similarity are used to summarize the text. The input is
given either as text or as the URL of a web page to be summarized.

3.3 Existing System and its Disadvantages:

Existing approaches use Recurrent Neural Networks, Long Short-Term Memory networks,
graph-based frameworks, sentence ranking and supervised learning. Neural network models like
these can significantly increase execution time. In addition, neural networks and supervised
models need a corpus of the target language in order to learn it.

3.4 Proposed System and its Advantages:


The proposed approach uses the Text Rank algorithm, a graph-based ranking algorithm for
natural language text. Text Rank works well because it does not rely only on the local context
of a text unit (vertex), but considers information recursively drawn from the entire text
(graph). Text Rank has been reported to perform better than many supervised learning
approaches. The user interface is built with Django and takes as input either a URL or the
text to be summarized.

3.5 Technologies Used


Google Colab, TensorFlow, NLTK, Django

3.6 Literature Survey

The authors in [1] proposed abstractive text summarization using an LSTM-CNN model. The
datasets are taken from the daily news coverage of the CNN and DailyMail websites. The CNN
dataset has more than 92,000 texts with corresponding summaries, and the DailyMail dataset has
219,000 texts with corresponding summaries. Preprocessing is done in three steps: word
segmentation, morphological reduction and coreference resolution. The figure below shows the
design.

3.1 Flowchart of Base Paper



Tian Shi et al. [2] proposed abstractive text summarization using an RNN-based Seq2Seq
model. The datasets are the CNN/DailyMail dataset (300k news articles), the Newsroom dataset
(1.3 million news articles) and the Bytecup dataset (1.3 million news articles). The paper
also discusses evaluation measures such as ROUGE, BERTScore, fluency, factual correctness and
relevance.

Seq2Seq model
3.2 Flow chart of the reference paper

R. Ganesh Kumar et al. [3] proposed extractive text summarization using sentence ranking.
The work addresses single-document text summarization. The input is a word document
containing text. The main task is to identify the important paragraphs and assign weights to
sentences. The generated summary is then compared with a human-generated summary. The
evaluation metric used is ROUGE.

The authors in [4] proposed extractive text summarization based on sentence content
relevance, sentence novelty and sentence position relevance. Content relevance is computed
using deep auto-encoders. The authors combine these three metrics to perform extractive text
summarization. The datasets used are the CNN and DailyMail dataset, the DUC dataset, the Tor
Illegal Documents summarization dataset and a blog summarization dataset. This approach
performed better on some of the ROUGE evaluation metrics than traditional ML models.

3.3 Flow chart of the reference paper

The authors in [5] proposed COSUM, a text summarization method based on clustering and
optimization techniques. The first step clusters sentences using k-means. The second step
selects important sentences from the clusters based on different features. The datasets used
are DUC2001 (309 articles) and DUC2002 (567 documents). The pre-processing steps are splitting
into sentences, removing stop words and noisy words, upper case removal and stemming. The
evaluation metric used is ROUGE. This model performs better on ROUGE-1 and ROUGE-2.

The authors in [6] follow graph-based text summarization techniques for single and multiple
documents. The dataset used is DUC2002, which is publicly available. Preprocessing is done
in three steps: word segmentation, morphological reduction and coreference resolution. The
method scored better on ROUGE-1, ROUGE-2 and ROUGE-SU than previous methods. The following
figure shows the architecture of the graph-based text summarization system.


3.4 Flow chart of the reference paper

The authors in [7] proposed automatic text summarization using fuzzy rules over different
features. The important text is extracted using these fuzzy rules. The dataset is a Brazilian
Portuguese dataset of texts written by students in a virtual learning environment. ROUGE is
used as the metric for comparison.

The authors in [8] proposed automatic text summarization (ATS) using the graph-based
framework “EdgeSumm”. The datasets used are DUC2001 (308 English news documents and 616
model summaries) and DUC2002 (567 news report documents). The performance metric is
ROUGE. The average ROUGE score is better than that of other standard and state-of-the-art
systems.

The authors in [9] focused on text summarization to obtain sentences that depict customers'
sentiments about the services provided by hotels. The dataset is taken from the online hotel
booking platform TripAdvisor. Pre-processing includes spell checking, word segmentation,
stemming and part-of-speech tagging. After summarization, the sentiments indicate which
services the hotels can improve.



The authors in [10] proposed extractive text summarization in two steps. The first step is
feature generation using LDA, one-hot encoding, TF-IDF and Doc2Vec. The second step clusters
similar sentences and measures proximity using cosine similarity and the silhouette index.
Important sentences are then selected from the clusters to generate the summary. The method
scored better on ROUGE-1, ROUGE-2 and ROUGE-SU than previous methods that use only LDA or
TF-IDF. The datasets used are DUC2002 and TAC2011.

Proposed approach
3.5 Flow chart of the reference paper

The authors in [11] proposed text summarization using a deep learning Sequence-to-Sequence
model. Baseline models are also applied, and the ROUGE score is used as the performance
metric for comparison. The dataset comprises 300,000 articles and their headlines. The
proposed methodology flowchart is shown below.


The authors in [12] describe the various types of text summarization techniques based on
deep learning. The datasets used are the CNN and DailyMail dataset, the New York Times
dataset, the DUC2004 dataset and the Amazon review dataset. The paper also focuses on the
pre-processing steps to be followed. The proposed architecture is shown below.

3.6 Flow chart of the reference paper

The authors in [13] proposed text summarization using latent semantic analysis (LSA), with
both single-document and multi-document approaches. The dataset is taken from legal
judgements issued by the Indian judiciary system. The ROUGE-1 score is 0.58. The proposed
approach is shown.

In [14], text summarization is done using LSTM (Long Short-Term Memory). The datasets are
DUC2001 and SIGIR2018. The ROUGE-1 score is 0.607, ROUGE-2 is 0.501 and ROUGE-L is 0.569.

In [15], extractive text summarization is done using fuzzy inference systems. The dataset
used is DUC2002. The ROUGE-1, ROUGE-2 and ROUGE-L scores are 0.66, 0.59 and 0.66
respectively. This method achieved better performance than neural networks on ROUGE-1. The
paper also discusses detailed pre-processing steps for extractive text summarization.

4. METHODOLOGY AND LEARNING

The main aim of this project is to build a summarizer that takes text as input and displays
the summary through a user interface.

It involves 3 steps:

1. Preprocessing

2. Implementing Text rank algorithm

3. Displaying the result using Django framework

4.1 Preprocessing

The text is taken from the textbox in the user interface. The user can instead provide
a URL from which the text is to be extracted; the paragraphs in that web page are scraped
and taken as input. This input text must be preprocessed before applying the Text Rank
algorithm.

Tokenizing the text: The text is tokenized using the NLTK library.

e.g., This is a sample. This is a sentence.

After word tokenizing the above sentences:

This | is | a | sample | This | is | a | sentence

Table 4.1 Word Tokenization of sentence

After sentence tokenizing the above sentences:

This is a sample | This is a sentence

Table 4.2 Sentence Tokenization of sentence
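
The two tokenization steps can be sketched as follows. The project itself relies on NLTK's tokenizers; the regular-expression versions below are only a simplified illustration that runs without any downloads, and the function names are chosen here for the example. Unlike the tables above, this sketch keeps punctuation marks as separate word tokens.

```python
import re


def sentence_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace (simplified rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_tokenize(text):
    """Return runs of letters, digits or apostrophes, and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)


sample = "This is a sample. This is a sentence."
print(sentence_tokenize(sample))
print(word_tokenize(sample))
```

A rule this simple will mis-split abbreviations such as "Dr." or "e.g.", which is one reason a trained tokenizer is preferred in practice.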


4.2 Implementing the Text Rank algorithm:

The preprocessed text is the input for this step. Extractive summarization requires the most
important sentences of the whole input, so this step identifies the sentences to be displayed
in the summary. The importance of a sentence is determined using cosine similarity, which is
the cosine of the angle between two non-zero vectors. Python libraries that provide cosine
similarity include scikit-learn and TensorFlow. The similarity increases as the angle between
the two vectors decreases, and vice versa.
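
As an illustration of the measure, the sketch below computes cosine similarity between two token lists using word counts as vector components. It uses only the standard library rather than scikit-learn or TensorFlow, and the function name and sample tokens are chosen for this example.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-count vectors of two token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both lists contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # the measure is undefined for zero vectors; treat as no similarity
    return dot / (norm_a * norm_b)


print(cosine_similarity(["cat", "sat"], ["cat", "ran"]))  # one shared word out of two
print(cosine_similarity(["cat"], ["dog"]))                # no shared words
```

Identical token lists give a value close to 1 and lists with no shared words give 0, so the score can be compared directly across sentence pairs of different lengths.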

The initial step is to extract all the sentences from the text. This might be as simple as
splitting the text at "." or newlines, or it can be more complicated if we want to fine-tune
the definition of a sentence. Sentence parsing is never truly finished; at some point a
good-enough definition is simply accepted. Once all the sentences are extracted, we create a
graph in which each sentence is a node, with edges to every other sentence, or only to its k
most similar sentences, weighted by similarity.

This formulation lets us program Text Rank without explicit matrix algebra: all we need is
the graph and a function to compute sentence similarity.

The similarity function determines how similar each sentence is to the rest of the text. It
should reflect the meaning of the sentence, and the cosine similarity approach can work
well.
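
The graph-based ranking described above can be sketched as a PageRank-style iteration over a sentence similarity graph. This is a minimal illustration, not the code of the application: whitespace tokenization, the damping factor of 0.85, the fixed 50 iterations and the names `rank_sentences` and `top_sentences` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score each sentence by iterating over the weighted similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [s.lower().split() for s in sentences]
    # weights[i][j] is the edge weight between sentences i and j; no self-loops.
    weights = [[cosine(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives a share of every neighbour's score,
        # proportional to the weight of the edge that links them.
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores


def top_sentences(sentences, k=2):
    """Return the k best-ranked sentences in their original order."""
    scores = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


docs = ["the cat sat on the mat", "the cat lay on the mat", "stock prices fell sharply today"]
print(top_sentences(docs, 2))
```

In this toy input the third sentence shares no words with the others, so it has no edges and keeps only the baseline score, while the two similar sentences reinforce each other.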

If we extract words instead of sentences and follow the same algorithm with a similarity
function between words, we can use Text Rank to extract keywords from the text. The idea is
that the word that is most like all the other words is the most important one. Filtering
stop-words is very important here.
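
A word-level variant can be sketched in the same way. As an assumption for this example, two words are linked when they occur within a small window of each other, which stands in for the word similarity function mentioned above. The stop-word list is a tiny illustrative set, not NLTK's, and `keyword_rank` is a hypothetical name.

```python
from collections import defaultdict

# Tiny illustrative stop-word list; a real system would use a fuller list such as NLTK's.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "on", "for", "it", "that", "this"}


def keyword_rank(text, window=2, damping=0.85, iterations=50):
    """Rank the non-stop words of text with a PageRank-style iteration."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    # Link every word to the words that follow it within the window.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    vocab = list(dict.fromkeys(words))  # unique words in first-seen order
    scores = {w: 1.0 for w in vocab}
    for _ in range(iterations):
        # A word's score is fed by its neighbours, each sharing its score equally.
        scores = {
            w: (1 - damping) + damping * sum(scores[v] / len(neighbours[v]) for v in neighbours[w])
            for w in vocab
        }
    return sorted(vocab, key=lambda w: scores[w], reverse=True)


sample_text = ("Text summarization extracts important sentences. "
               "Summarization ranks sentences. Sentences form a graph.")
print(keyword_rank(sample_text)[:3])
```

Without the stop-word filter, frequent function words such as "the" would collect the most links and crowd out the content words.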

4.3 Displaying result using Django Framework:

• Django is a high-level Python web framework that helps in writing software that is
complete, versatile, secure, scalable, maintainable and portable.

• A key feature of Django is that it follows the MVT (Model-View-Template) pattern.

4.1 Django Framework Flowchart


4.4 Design

4.2 Flowchart of Proposed System

4.5 Requirements
asgiref==3.2.10
beautifulsoup4==4.9.1
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
Django==3.1
gunicorn==20.0.4
idna==2.10
joblib==0.16.0
lxml==4.5.2
nltk==3.5
numpy==1.19.1
pandas==1.1.0
python-dateutil==2.8.1
pytz==2020.1
regex==2020.7.14
requests==2.24.0
six==1.15.0
soupsieve==2.0.1
sqlparse==0.3.1
tqdm==4.48.2
urllib3==1.25.10


5. RESULTS
GUI:

Figures 5.1 to 5.5 show screenshots of the GUI that was built.

5.1 The main page of design


5.2 Input is given as URL of Sachin Tendulkar Wikipedia page as example

5.3 Output is shown like this with time taken


5.4 Input is given as text(3 paragraphs on global warming) as example

5.5 Output is shown as below


Reference          ROUGE-1   ROUGE-2   ROUGE-L   ROUGE-W
[1]                0.349     0.178     -         -
[2]                0.3936    0.2786    0.3635    -
[3]                0.519     0.366     0.47      0.377
[10]               0.498     -         -         -
[13]               0.583     0.15      0.35      -
Proposed method    0.524     0.19      -         -
Table 5.1 Comparison of Results
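
For readers unfamiliar with the metric in Table 5.1, ROUGE-N measures the n-gram overlap between a generated summary and a reference summary. The sketch below is a simplified illustration with whitespace tokenization and no stemming; it is not the evaluation script used to produce the table, and the function name and sample sentences are assumptions.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall and F1 for two whitespace-tokenized texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


generated = "the cat sat on the mat"
human = "the cat lay on the mat"
print(rouge_n(generated, human, n=1))
print(rouge_n(generated, human, n=2))
```

ROUGE-1 uses single words (n=1) and ROUGE-2 uses word pairs (n=2), which is why ROUGE-2 scores are usually lower for the same pair of summaries.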

6. CONCLUSION AND SUGGESTIONS

6.1 Conclusions

This application summarizes text using a Natural Language Processing technique, the Text
Rank algorithm. A short summary is generated that keeps the important ideas of the original
text intact. The similarity between sentences is calculated using cosine similarity. Thus,
text summarization is done using the extractive approach. Generating a summary in new words,
rather than selecting existing sentences, would need deep learning models.

6.2 Future Scope


Abstractive text summarization is useful because it generates a summary in new words
learned during training. It is implemented using deep learning techniques. Building on the
work done in this report, the future scope is to develop deep learning models that can
generate summaries in their own words.

APPENDIX
Summarizer.py
from bs4 import BeautifulSoup
import requests
import nltk

nltk.download('punkt')
nltk.download("stopwords")

from nltk.corpus import stopwords


import re
import numpy as np
import pandas as pd

def cosine_similarity(X, Y_set):


25
Text Summarization using NLP 2022

X_list = nltk.word_tokenize(X)

sw = stopwords.words('english')
l1 =[];l2 =[]

X_set = {w for w in X_list if not w in sw}

rvector = X_set.union(Y_set)

for w in rvector:
if w in X_set: l1.append(1)
else: l1.append(0)
if w in Y_set: l2.append(1)
else: l2.append(0)
c=0

for i in range(len(rvector)):
c+= l1[i]*l2[i]
cosine = c / float((sum(l1)*sum(l2))**0.5)

return cosine

def summarizer(content, num_lines = 5):


content = re.sub(r"\[[^()]*\]",' ',content)

words = nltk.word_tokenize(content)
sentences = nltk.sent_tokenize(content)

Y_set = {w for w in words if not w in stopwords.words('english')}

26
Text Summarization using NLP 2022

word_count = {}

for word in words:


if word not in stopwords.words('english'):
if word not in word_count:
word_count[word] = 1
else:
word_count[word]+=1

scores = []

for sentence in sentences:


words = nltk.word_tokenize(sentence)
score = 0
for word in words:
if word in word_count:
score += word_count[word]
scores.append(score)

scores = np.asarray(scores) / max(scores)

df = pd.DataFrame({'Sentences': sentences, 'Scores':scores})

sorted_df = df.sort_values(by = "Scores", ascending=False).reset_index()

paras = []
similarity = []
for i in range(len(sorted_df)):
paras.append(' '.join(list(sorted_df.iloc[i:i + num_lines,1])))
similarity.append(cosine_similarity(' '.join(list(sorted_df.iloc[i:i + num_lines,1])), Y_set))
27
Text Summarization using NLP 2022

return paras[similarity.index(max(similarity))].split('. ')

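def url_summarizer(link, num_lines=5):
    # Download the page and parse it (the 'lxml' parser must be installed).
    source = requests.get(link).text
    soup = BeautifulSoup(source, 'lxml')
    paras = soup.find_all("p")

    # Collect the text of every <p> element and join it into one string.
    content = []
    for para in paras:
        content.append(para.text)

    content = ' '.join(content)

    return content, summarizer(content, num_lines)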
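
The paragraph-extraction step of url_summarizer can be tested offline on an HTML string. The sketch below deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages or network access; it is an illustration of the idea, not the project's method. ParagraphCollector, extract_paragraph_text and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def _flush(self):
        # Store the paragraph collected so far, if any.
        if self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Keep text only while inside a paragraph (nested tags included).
        if self._buffer is not None:
            self._buffer.append(data)


def extract_paragraph_text(html):
    """Join the text of all <p> elements with single spaces."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # handle a trailing unclosed <p>
    return " ".join(p for p in collector.paragraphs if p)


sample = "<h1>Title</h1><p>First <b>bold</b> paragraph.</p><div>skip</div><p>Second one.</p>"
text = extract_paragraph_text(sample)
```

Only the text inside the two paragraph elements is kept; the heading and the div are ignored, which mirrors what soup.find_all("p") selects.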

urls.py

from django.urls import path

from . import views

urlpatterns = [
    path('', views.index, name='index'),
]

views.py

from django.shortcuts import render, redirect


from django.contrib import messages


from .Summarizer import summarizer, url_summarizer


import re
import time
# Create your views here.

def index(request):
    context = {'flag': False, 'url_error': False, 'summarize_div': True}
    if request.method == 'POST':
        # Reject submissions that fill in both the text area and the URL field.
        if len(request.POST['textarea']) > 0 and len(request.POST['url_link']) > 0:
            messages.error(request, "Enter either URL or Text, not Both.")
            return redirect('index')

        # Case 1: summarize pasted text.
        if len(request.POST['textarea']) > 0:
            if request.POST['num_lines']:
                num_lines = request.POST['num_lines']
            else:
                num_lines = 5
            content = request.POST['textarea']
            start = time.time()
            summary = summarizer(content, int(num_lines))
            end = time.time()
            context['time_taken'] = round(end - start, 2)
            context['flag'] = True
            context["content"] = content
            context["summary"] = summary
            context['summarize_div'] = False
            return render(request, 'text_summarizer/index.html', context)

        # Case 2: summarize the page behind a URL.
        elif len(request.POST['url_link']) > 0:
            if re.search(r"(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?", request.POST['url_link']) is None:
                context['url_error'] = True
                return render(request, 'text_summarizer/index.html', context)

            else:
                if request.POST['num_lines']:
                    num_lines = request.POST['num_lines']
                else:
                    num_lines = 5
                try:
                    start = time.time()
                    content, summary = url_summarizer(request.POST['url_link'], int(num_lines))
                    end = time.time()
                    context['time_taken'] = round(end - start, 2)
                    context['flag'] = True
                    context["content"] = content
                    context["summary"] = summary
                    context['summarize_div'] = False
                    return render(request, 'text_summarizer/index.html', context)
                except Exception:
                    # Any download or parsing failure is reported as an empty page.
                    messages.error(request, "Entered URL doesn't contain any Data.")
                    return redirect('index')

        # Case 3: neither field was filled in.
        else:
            messages.error(request, "Enter URL or Text to summarize the content.")
            return redirect('index')
    return render(request, 'text_summarizer/index.html', context)
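
The index view converts num_lines with int() and checks the URL with a regular expression. As a hedged alternative, the two checks could be factored into small helpers like the ones sketched below. This is one possible approach, not the project's code: parse_num_lines, looks_like_url and the maximum of 50 lines are assumptions, and the URL check swaps the regular expression for urllib.parse.urlparse from the standard library.

```python
from urllib.parse import urlparse


def parse_num_lines(raw_value, default=5, maximum=50):
    """Convert a form value to a sentence count, falling back to a default."""
    try:
        value = int(raw_value)
    except (TypeError, ValueError):
        # Empty strings, missing values and non-numeric text use the default.
        return default
    if value < 1:
        return default
    return min(value, maximum)  # cap very large requests (assumed limit)


def looks_like_url(value):
    """True when the value has an accepted scheme and a network location."""
    parsed = urlparse(value.strip())
    return parsed.scheme in {"http", "https", "ftp"} and bool(parsed.netloc)


lines = parse_num_lines("3")
valid = looks_like_url("https://example.com/article")
```

With helpers of this kind, a non-numeric num_lines in the text branch would fall back to the default instead of raising a ValueError outside the try block.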

urls.py (TextSummarizer)


"""TextSummarizer URL Configuration

The `urlpatterns` list routes URLs to views. For more information please see:
    https://docs.djangoproject.com/en/3.0/topics/http/urls/
Examples:
Function views
    1. Add an import: from my_app import views
    2. Add a URL to urlpatterns: path('', views.home, name='home')
Class-based views
    1. Add an import: from other_app.views import Home
    2. Add a URL to urlpatterns: path('', Home.as_view(), name='home')
Including another URLconf
    1. Import the include() function: from django.urls import include, path
    2. Add a URL to urlpatterns: path('blog/', include('blog.urls'))
"""
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('text_summarizer.urls')),
]


