A Full Semester Internship Report Submitted in Partial Fulfilment for the Award of the Degree
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted
by
June 2022
BONAFIDE CERTIFICATE
This is to certify that the report entitled “Text Summarization using NLP”, submitted by Kesari
Venkatesh Reddy (18341A0558), who completed the internship program under our guidance
and supervision at COGNIZANT TECHNOLOGY SOLUTIONS, is a record of bonafide work
carried out in partial fulfilment for the award of the B.Tech degree in the discipline of
Computer Science and Engineering to JNTUK. The results embodied in this report have
not been submitted to any other university or institution for the award of any degree or diploma.
Dr. A. V. Ramana
Professor and HOD
Department of CSE
GMRIT, Rajam
Internship Certificate
ACKNOWLEDGEMENT
I would like to take this opportunity to thank Ms. D. Anushree, Human Resources - Genc,
CTS Bangalore, for providing all the necessary facilities that led to the successful completion
of my internship.
I would like to sincerely thank my internal supervisor, Dr. K. Lakshmana Rao, Associate
Professor, Department of Computer Science and Engineering, for wholehearted and valuable
guidance throughout the program.
I would like to thank our beloved Principal, Dr. C.L.V.R.S.V. Prasad, and the Head of the
Department, Dr. A. Venkata Ramana, Professor, Computer Science and Engineering, for
providing great support in completing the full semester internship.
It gives me immense pleasure to express my deep sense of gratitude to Dr. Surya Narayan
Dash, Internship Head, of the central internship team. I would also sincerely thank our
department coordinators, Dr. K. Sri Vidya, Associate Professor, Department of Computer
Science and Engineering, and Dr. V. Prasad, Associate Professor, Department of Computer
Science and Engineering, for their great support.
Text Summarization using NLP
TABLE OF CONTENTS
Certificate
Acknowledgement
Abstract
Index
List of Figures
List of Tables
1 Introduction
1.1 Introduction
1.2 Benefits of Internship
1.2.1 Benefits to the Students
1.2.2 Benefits to the Industry
1.2.3 Benefits to the Institution
1.3 Ethics
1.4 Values
2 Profile of the Company
2.1 About the Company
2.2 Services
3 Tasks Taken Up and Problem Definition
3.1 Introduction
3.2 Problem Statement
3.3 Existing System and its disadvantages
3.4 Proposed System and its advantages
3.5 Technologies used
3.6 Literature Survey
4 Methodology and Learning
4.1 Data Pre-Processing
4.1.1 Data Cleaning
4.1.2 Tokenization
4.2 Implementing Text Rank Algorithm
4.3 Displaying Output through Django
4.4 Design
4.5 Requirements
5 Results
5.1 Results
6 Conclusion and Suggestions
6.1 Conclusions
6.2 Future Scope
Appendix
Bibliography
List of Figures
5.5 Output
List of Tables
Text Summarization using NLP 2022
1. INTRODUCTION
1.1 Introduction
Students learn how their course of study applies to the real world and build valuable
experience that makes them stronger candidates for jobs after graduation.
● Internship at a start-up will benefit in improving team spirit, adapting to flexible
working times and client services.
● You can get serious work experience, build a portfolio and establish a network of
professional contacts which can help you after you graduate.
● The main advantage is gaining practical knowledge. College coursework mostly
provides theoretical knowledge, which on its own is of limited help. Working on a
project gives practical experience.
● Confidence increases when we are involved in solving problems and succeed in
solving them.
● Employer Branding.
1.3 Ethics
● Help develop an organizational environment favourable to acting ethically
● Improve their understanding of the software and related documents on which they
work and of the environment in which they will be used.
1.4 Values
● Professional communications.
● Make an effort during the course of the internship to build relationships with people
around the office.
About Company:
company. It is one of the major business outsourcing companies in the world. The
technology unit of Dun & Bradstreet, and the headquarters are in Chennai, India. In 1996,
Cognizant started exceeding performance with its international clients. The next year, the
company had its headquarters moved from Teaneck to Chennai in India. Cognizant was
the first company to be listed on NASDAQ 100. After accepting some of the work of
application maintenance, it went into application development. The 2000s were a
golden era for the company. It became one of the Fortune 500 companies in
2011. It has also been named among the World’s Most Admired Companies. The company is split
into two new major services, Nelson Media Research, and IMS Health. After some time, it
became the public subsidiary of the IMS Health. But in 2003, Cognizant sold all its shares
in the subsidiary and the CEO also resigned from his post. The company expanded its
work from IT services to Outsourcing and business consulting as well. There was a fast
development, business intelligence, supply chain management, CRM, etc. The company
has 318,400 employees globally, of which over 150,000 are in India across 10 locations
with a plurality in Chennai. On 20 Jan, 1994 Cognizant registered its branch in Chennai,
Tamil Nadu, India with the legal name Cognizant Technology Solutions India Private
Limited. The company has local, regional and global delivery centres in the UK, Australia,
Hungary, Netherlands, Spain, China, Philippines, Canada, Brazil, Argentina, Mexico etc
Business Units:
Cognizant is organized into several vertical and horizontal units. The vertical units focus
Manufacturing and Retail. The horizontals focus on specific technologies or process areas
such as Analytics, mobile computing, BPO and Testing. Both horizontal and vertical units
have business consultants, who together form the organization-wide Cognizant Consulting
team. Cognizant is among the largest recruiters of MBAs in the industry; they are involved
• Artificial Intelligence
• Cloud Enablement
• Core Modernization
• Digital Experience
• Digital strategy
• Enterprise Services
• Infrastructure Services
• Security
• Sustainability Services
for every product. The end goal is to build the right software that meets one's needs.
High-tech: They help high-tech companies rethink their business models and plan and
implement transformational processes across the product lifecycle. Companies partner with
Cognizant to get ahead of the demand curve, become nimbler and drive operational
efficiency for more profitable growth.
Platform: The race to the next billion users, monetizing content and supporting aggressive
growth in new channels has upended the industry. Cognizant partners with companies to
accelerate digital at scale, operate more efficiently and power growth.
3.1 Introduction
url and models are used accordingly. This Text summarizer application is used for
summarizing the given text.
Text summarization can be carried out mainly in two ways: abstractive text summarization
and extractive text summarization. Automatic text summarization can be applied to a single
document or to multiple documents. Web pages can also be summarized by scraping their
content and summarizing it. This reduces redundancy and saves time in understanding large
amounts of information. The text summarization task is challenging because of its wide range
of uses; a summary that is not done properly cannot be used. NLP helps in understanding the
language and extracting the sentences or information that are critical to understanding the
topic. Here the Text Rank algorithm and cosine similarity are used to summarize the text. The
data is given as text or as the URL of a web page for which a summary is required.
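The web page input described above can be illustrated with a minimal sketch. It assumes the page HTML has already been downloaded as a string and uses only the Python standard library as a stand-in for BeautifulSoup, which the project itself uses; the names ParagraphExtractor and extract_paragraphs are hypothetical and not taken from the project code.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the visible text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the text pieces and normalise whitespace
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p><p>Second.</p></body></html>"
    print(extract_paragraphs(page))
```

Only paragraph text is kept, so headings, menus and scripts do not reach the summarizer. A full HTML library handles malformed pages more robustly than this sketch.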
The existing approaches use Recurrent Neural Networks, Long Short-Term Memory,
graph-based frameworks, sentence ranking, supervised approaches … Neural
network models like these can significantly increase execution time. In order to
understand a language, neural networks and supervised models also need corpora
of that language to learn from.
The proposed approach uses the Text Rank algorithm, a graph-based ranking algorithm
for natural language text. Text Rank works well because it does not rely only on the local
context of a text unit (vertex), but rather considers information recursively drawn from the
entire text (graph). Being unsupervised, Text Rank needs no training data, and it has been
reported to perform competitively with supervised learning approaches. The user interface
is designed using Django and takes as input either a URL or the text to be summarized.
The authors in [1] proposed abstractive text summarization using an LSTM-CNN model.
The datasets are taken from the daily news coverage of the CNN and DailyMail websites. The
CNN dataset has more than 92,000 texts and corresponding summaries. The DailyMail dataset has
219,000 texts and corresponding summaries. The preprocessing is done in three steps: word
segmentation, morphological reduction and coreference resolution. The below figure shows the
design.
Tian Shi et al. [2] proposed abstractive text summarization using an RNN-based Seq2Seq
model. The datasets are taken from the CNN/DailyMail dataset (300k news articles), the Newsroom
dataset (1.3 million news articles) and the Bytecup dataset (1.3 million news articles). Also discussed
relevance.
Figure 3.2: Flow chart of the reference paper (Seq2Seq model)
R. Ganesh Kumar et al. [3] proposed extractive text summarization using sentence
ranking. The work addresses single-document text summarization. The input is a Word
document containing text. The main task is identifying the important paragraphs and
assigning weights to sentences. After the summarization, it is compared to the human generated
The authors in [4] proposed extractive text summarization based on sentence content
relevance, sentence novelty and sentence position relevance. Content relevance is computed using
deep auto-encoders. By combining these three metrics the authors perform extractive text
summarization. The datasets used are the CNN and DailyMail dataset, the DUC dataset, the Tor Illegal
Documents summarization dataset, Blog summarization dataset. This approach performed better
The authors in [5] proposed text summarization using clustering and optimization
techniques, called COSUM. The first step is clustering the sentences with k-means. The second step
is selecting important sentences from the clusters based on different features. The datasets used
are DUC2001 (309 articles) and DUC2002 (567 documents). The pre-processing steps are splitting
into sentences, removing stop words and noisy words, converting upper case to lower case, and
stemming. The evaluation metric used is ROUGE. This model performs better for ROUGE-1 and ROUGE-2.
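Since ROUGE is the evaluation metric used throughout the surveyed papers, a minimal sketch of ROUGE-N recall may help fix the idea: it is the fraction of the reference summary's n-grams that also appear in the generated summary. The function names are hypothetical, tokenization is simplified to whitespace splitting, and published results normally come from a standard ROUGE toolkit rather than code like this.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: an n-gram is matched at most as often as it occurs in the candidate
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n_recall(candidate, reference, n=1))
    print(rouge_n_recall(candidate, reference, n=2))
```

ROUGE-1 counts single words and ROUGE-2 counts word pairs, so ROUGE-2 is the stricter of the two because it also rewards correct word order.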
The authors in [6] follow graph-based text summarization techniques for single and multiple
documents. The dataset used is DUC2002, which is publicly available. The preprocessing is done
methods. The following shows the architecture of the graph-based text summarization system.
The authors in [7] proposed automatic text summarization using fuzzy rules over different
features. The important text can be extracted using these fuzzy rules. The dataset used is a Brazilian
Portuguese dataset produced by students in a virtual learning environment. The metric used is
framework “EdgeSumm”. The datasets used are DUC2001(308 English news documents and 616
model summaries) and DUC2002(567 news reports documents). The performance metric is
ROUGE. The average ROUGE score is better than other standard and state-of-the-art systems.
The authors in [9] focused on text summarization to extract useful sentences that depict customers'
sentiments about the services provided by hotels. The dataset is taken from the online hotel
booking platform TripAdvisor. The pre-processing includes spell checking, word segmentation,
stemming and part-of-speech tagging. After summarization, sentiments are found on which services
The authors in [10] proposed extractive text summarization in two steps. The first is feature
generation using LDA, one-hot encoding, TF-IDF and Doc2Vec. The second is clustering similar
sentences and measuring proximity using cosine similarity and the silhouette index. Important
sentences are then selected from the clusters to generate the summary. On the performance metrics
ROUGE-1, ROUGE-2 and ROUGE-SU, the approach performed better than previous methods that
used only LDA or TF-IDF.
Figure 3.5: Flow chart of the reference paper (proposed approach)
The authors in [11] proposed text summarization using a Sequence-to-Sequence model, a deep
learning approach. Baseline models are also applied, and the ROUGE score is used as the
performance metric for comparison. The dataset comprises 300,000 entries of articles
The authors in [12] describe the various types of text summarization techniques based on deep
learning. The datasets used are the CNN and DailyMail dataset, the New York Times dataset, the
DUC2004 dataset and the Amazon review dataset. Also, this paper focuses on the pre-processing steps to be
The authors in [13] proposed text summarization using latent semantic analysis (LSA).
The authors used single-document and multi-document approaches. The dataset is taken from legal
judgements issued by the Indian judiciary system. The ROUGE-1 score is 0.58. The proposed approach is
shown.
The authors in [14] perform text summarization using LSTM (Long Short-Term Memory). The
0.569.
The authors in [15] perform extractive text summarization using fuzzy inference systems. The
dataset used is DUC2002. The ROUGE-1, ROUGE-2 and ROUGE-L scores achieved are 0.66, 0.59 and
0.66 respectively. This method
achieved better performance than neural networks for ROUGE-1. Also discussed detailed pre-
The main aim of this project is to take text as input, summarize it and display the
summary through a user interface.
It involves 3 steps:
1. Preprocessing
2. Implementing the Text Rank algorithm
3. Displaying the output through Django
4.1 Preprocessing
The text is taken from the textbox in the user interface provided. The user can also provide
a URL from which the text is to be extracted. The paragraphs present in the web page provided
by the user are scraped and taken as input. This input text must be preprocessed before
applying the Text Rank algorithm.
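The preprocessing stage can be sketched as follows. This is a minimal illustration that assumes a regular-expression sentence splitter and a tiny hand-written stop-word list in place of the nltk tokenizers and stop-word corpus used in the project; the function names and the cleaning rules are hypothetical.

```python
import re

# Tiny illustrative stop-word list; the project code uses nltk's English list instead
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "it", "this", "that"}


def clean_text(text):
    """Drop bracketed citation markers such as [12] and collapse whitespace."""
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def content_words(sentence):
    """Lower-case word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOP_WORDS]


if __name__ == "__main__":
    raw = "Text  ranking [1] works.\nIt is simple."
    for sentence in split_sentences(clean_text(raw)):
        print(sentence, content_words(sentence))
```

The sentences are kept in their original form for display, while the filtered word lists are what the similarity computation compares.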
The input text after preprocessing is taken as input for this step. The process of extractive
summarization requires the most important sentences among the whole input. Thus,
identifying the sentences that are to be displayed in the summary is done in this step. For this,
measurement is a measure of the cosine of the angle between two non-zero vectors. Python
libraries that provide cosine similarity include scikit-learn and TensorFlow. The
similarity increases as the angle between the two vectors decreases, and vice versa.
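The measure can be written out directly. The sketch below assumes each sentence has already been reduced to a list of words and represents it as a word-count vector; it uses only the standard library rather than scikit-learn or TensorFlow, and this version of cosine_similarity is illustrative rather than the project's own.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    # Dot product over the words the two sentences share
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity(["text", "rank"], ["text", "summary"]))
    print(cosine_similarity(["text", "rank"], ["django", "view"]))
```

Sentences that share no words score 0, and sentences with identical word counts score 1 up to floating-point rounding.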
The initial step is to extract all the sentences from the text. This might be as simple
as splitting the text at "." or newlines, or it can be more complicated if we want to fine-tune
the definition of a sentence. Parsers are never removed from the system; they are simply
abandoned. Once we have all the sentences in the text, we create a graph in which each
sentence is a node, with edges to all other sentences or to the k most similar sentences, weighted by
This method allows us to program Text Rank without having to do any arithmetic or use
matrices. All we need is the graph and a function to compute sentence similarity.
The function determines how similar each sentence is to the rest of the text. The similarity
function should reflect the meaning of the sentence, and a cosine similarity approach can work
well.
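The ranking step on the sentence graph can be sketched as a PageRank-style iteration. The sketch assumes the pairwise similarities are already available as a nested list of edge weights and uses a damping factor of 0.85 with a fixed number of iterations; the function names and these parameter values are illustrative assumptions, not taken from the project code.

```python
def text_rank(similarity, damping=0.85, iterations=50):
    """Score the nodes of a weighted sentence graph.

    similarity[i][j] is the non-negative edge weight between sentences i and j;
    the diagonal (a sentence compared with itself) is ignored.
    """
    n = len(similarity)
    if n == 0:
        return []
    # Total edge weight leaving each node, ignoring self-loops
    out_weight = [sum(similarity[j][k] for k in range(n) if k != j) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on its score in proportion to the edge weight
            rank = sum(similarity[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if j != i and out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(sentences, similarity, num_lines):
    """Return the num_lines best-ranked sentences in their original order."""
    scores = text_rank(similarity)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_lines]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sentences = ["A.", "B.", "C."]
    similarity = [[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.1],
                  [0.1, 0.1, 0.0]]
    print(summarize(sentences, similarity, 2))
```

Returning the selected sentences in document order rather than score order keeps the summary readable.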
If we extract words instead of sentences and follow the same algorithm with a similarity
function between words, we can use Text Rank to extract keywords from the text. The
idea is that the word most similar to all the other words is the most important one. Filtering
• Django is a high-level python web framework that helps in writing software that is
• A key feature of Django is that it follows the MVT (Model-View-Template) pattern.
4.4 Design
4.5 Requirements
asgiref==3.2.10
beautifulsoup4==4.9.1
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
Django==3.1
gunicorn==20.0.4
idna==2.10
joblib==0.16.0
lxml==4.5.2
nltk==3.5
numpy==1.19.1
pandas==1.1.0
python-dateutil==2.8.1
pytz==2020.1
regex==2020.7.14
requests==2.24.0
six==1.15.0
soupsieve==2.0.1
sqlparse==0.3.1
tqdm==4.48.2
urllib3==1.25.10
5. RESULTS
GUI:
Figures 5.1-5.5 show screenshots of the built GUI.
6.0 Conclusion
This application summarizes text using the Text Rank algorithm, a Natural Language
Processing technique. A short summary is generated that keeps the important ideas of the
original text intact. The similarity between the sentences is calculated using
cosine similarity. Thus, text summarization is done using an extractive approach. This needs to use
formed during training. This implementation is done using deep learning techniques. Based on the
work done in this paper, future scope is to develop deep learning models that can generate
APPENDIX
Summarizer.py
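from bs4 import BeautifulSoup
import requests
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')


# The function header and the X_set line are missing from the extracted listing.
# They are reconstructed here (assumed) from the call cosine_similarity(text, Y_set) used later.
def cosine_similarity(X, Y_set):
    X_list = nltk.word_tokenize(X)
    sw = stopwords.words('english')
    X_set = {w for w in X_list if w not in sw}  # assumed: drop stop words
    l1 = []
    l2 = []
    # Binary bag-of-words vectors over the union of both word sets
    rvector = X_set.union(Y_set)
    for w in rvector:
        l1.append(1 if w in X_set else 0)
        l2.append(1 if w in Y_set else 0)
    # Dot product of the two vectors
    c = 0
    for i in range(len(rvector)):
        c += l1[i] * l2[i]
    # Guard against empty vectors to avoid division by zero
    denominator = float((sum(l1) * sum(l2)) ** 0.5)
    cosine = c / denominator if denominator else 0.0
    return cosine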
words = nltk.word_tokenize(content)
sentences = nltk.sent_tokenize(content)
word_count = {}
scores = []
paras = []
similarity = []
for i in range(len(sorted_df)):
    paras.append(' '.join(list(sorted_df.iloc[i:i + num_lines, 1])))
    similarity.append(cosine_similarity(' '.join(list(sorted_df.iloc[i:i + num_lines, 1])), Y_set))
content = []
for para in paras:
    content.append(para.text)
urls.py
from django.urls import path
from . import views

urlpatterns = [
    path('', views.index, name='index'),
]
Views.py
import re
import time

from django.contrib import messages
from django.shortcuts import redirect, render

# Assumed import path: summarizer and url_summarizer are defined in Summarizer.py
from .Summarizer import summarizer, url_summarizer


def index(request):
    context = {'flag': False, 'url_error': False, 'summarize_div': True}
    if request.method == 'POST':
        # Reject requests that fill in both the text box and the URL field
        if len(request.POST['textarea']) > 0 and len(request.POST['url_link']) > 0:
            messages.error(request, "Enter either URL or Text, not Both.")
            return redirect('index')
        if len(request.POST['textarea']) > 0:
            if request.POST['num_lines']:
                num_lines = request.POST['num_lines']
            else:
                num_lines = 5
            content = request.POST['textarea']
            start = time.time()
            summary = summarizer(content, int(num_lines))
            end = time.time()
            context['time_taken'] = round(end - start, 2)
            context['flag'] = True
            context["content"] = content
            context["summary"] = summary
            context['summarize_div'] = False
            return render(request, 'text_summarizer/index.html', context)
        # Assumed branch header: this line is missing from the extracted listing
        elif len(request.POST['url_link']) > 0:
            # Validate the URL before trying to fetch it
            if re.search(r"(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?", request.POST['url_link']) is None:
                context['url_error'] = True
                return render(request, 'text_summarizer/index.html', context)
            else:
                if request.POST['num_lines']:
                    num_lines = request.POST['num_lines']
                else:
                    num_lines = 5
                try:
                    start = time.time()
                    content, summary = url_summarizer(request.POST['url_link'], int(num_lines))
                    end = time.time()
                    context['time_taken'] = round(end - start, 2)
                    context['flag'] = True
                    context["content"] = content
                    context["summary"] = summary
                    context['summarize_div'] = False
                    return render(request, 'text_summarizer/index.html', context)
                except Exception:
                    messages.error(request, "Entered URL doesn’t contain any Data.")
                    return redirect('index')
        else:
            messages.error(request, "Enter URL or Text to summarize the content.")
            return redirect('index')
    return render(request, 'text_summarizer/index.html', context)
urls.py(TextSummarizer)
"""
The `urlpatterns` list routes URLs to views. For more information please see:
    https://docs.djangoproject.com/en/3.0/topics/http/urls/
Examples:
Function views
    1. Add an import: from my_app import views
    2. Add a URL to urlpatterns: path('', views.home, name='home')
Class-based views
    1. Add an import: from other_app.views import Home
    2. Add a URL to urlpatterns: path('', Home.as_view(), name='home')
Including another URLconf
    1. Import the include() function: from django.urls import include, path
    2. Add a URL to urlpatterns: path('blog/', include('blog.urls'))
"""
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('text_summarizer.urls')),
]
BIBLIOGRAPHY
[1] Song, S., Huang, H. & Ruan, T. Abstractive text summarization using LSTM-CNN based
[2] Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K. Reddy. 2020. Neural
Abstractive Text Summarization with Sequence-to-Sequence Models. ACM/IMS Trans. Data Sci.
[3] J. N. Madhuri and R. Ganesh Kumar, "Extractive Text Summarization Using Sentence
Ranking," 2019 International Conference on Data Science and Communication 2019, pp. 1-3,
IEEE.
[4] Joshi, A., Fidalgo, E., Alegre, E., & Fernández-Robles, L. (2019). SummCoder: An
[5] Tsai, C.-F., Chen, K., Hu, Y.-H., & Chen, W.-K. (2020). Improving text summarization of
[6] Mohamed, M., & Oussalah, M. (2019). SRL-ESA-Text Sum: A text summarization approach
based on semantic role labelling and explicit semantic analysis. Information Processing &
[7] Goularte, F. B., Nassar, S. M., Fileto, R., & Saggion, H. (2019). A text summarization method
based on fuzzy rules and applicable to automated assessment. Expert Systems with Applications,
[8] El-Kassas, W. S., Salama, C. R., Rafea, A. A., & Mohamed, H. K. (2020). EdgeSumm:
[9] Tsai, C.-F., Chen, K., Hu, Y.-H., & Chen, W.-K. (2020). Improving text summarization of
online hotel reviews with review helpfulness and sentiment. Tourism Management, 80, 104122,
Elsevier
Access, 1–1
[11] Al-Maleh, M., Desouki, S. Arabic text summarization using deep learning approach. J Big
[12] R. S. Shini and V. D. A. Kumar, "Recurrent Neural Network based Text Summarization
[13] K. Merchant and Y. Pande, "NLP Based Latent Semantic Analysis for Legal Text
[14] Candidate sentence selection for extractive text summarization Begum Mutlu, Ebru A. Sezer,
[15] B. Mutlu, E.A. Sezer and M.A. Akcayol, Multi-document extractive text summarization: A
[16] https://www.analyticsvidhya.com/blog/2022/02/text-summarisation/
[17] https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
[18] https://medium.com/luisfredgs/automatic-text-summarization-with-machine-learning-an-overview-68ded5717a25