VIVEKANAND EDUCATION SOCIETY’S

INSTITUTE OF TECHNOLOGY
Department of Computer Engineering

NLP Mini Project Report on

NEWS ARTICLE SCRAPING AND
SUMMARISATION SYSTEM

Submitted by
Sonal Misal, Roll No. 34
Bhavika Motiramani, Roll No. 35
Priyanka Patil, Roll No. 41
(2020-21)

Under the guidance of


Mrs. Priya R.L
Mrs. Sharmila Sengupta
Mrs. Ashvini Abhijit Gaikwad
INDEX

1. Introduction
1.1 Introduction of the Project
1.2 Aims and Objectives
1.3 Scope of the Project

2. Description
2.1 Description of the Project
2.2 Block Diagram
2.3 Use Case Diagram
2.4 Sequence Diagram
2.5 Activity Diagram

3. Implementation and Methodology
3.1 Details of Hardware and Software
3.1.1 Hardware Details
3.1.2 Software Details
3.2 Web Scraping
3.3 Text Pre-processing
3.4 Text Summarization
3.5 Pseudocode
3.6 Code and Results
3.6.1 Code
3.6.2 Results and Discussion
3.7 Applications

4. Conclusion and Future Scope
4.1 Conclusion
4.2 Future Scope
4.3 References
Chapter - 1: INTRODUCTION

1.1 Introduction of the project


In today’s digital world, people are bombarded with endless information. News in particular reaches
readers through newspapers, magazines and television channels, while the internet delivers it through
portals, blogs and social media. There is therefore far more news to stay aware of than anyone can
follow closely, and it needs to be digested quickly. This creates a need for automatic text
summarization tools that allow people to extract insights easily. Summarization can enhance the
readability of documents, reduce the time spent searching for information, and allow more information
to fit in a given space.
In this project, the requests and newspaper libraries are used to scrape news from websites, and
summarization is then carried out on the extracted text. The system is implemented for English as
well as the regional language Hindi. Hindi articles are collected from the news website jagran.com
and pre-processed using the nltk library, while English articles are collected from websites such as
CNN and NDTV and summarised with the nltk library.

1.2 Aims and Objectives


The main objective of our project is to develop a system that provides the user with a summarised
form of an article by first extracting the data from the website via scraping techniques and then
displaying it briefly. The procedure can be explained as follows:
i. The text of the website article is scraped into a variable.
ii. This text is filtered to remove stop words, the most important content is identified, and the
result is displayed to the user.

1.3 Scope of the project


Beyond the advantages the system already offers, several enhancements can be made to improve its
usefulness. They are as follows:

1.3.1 Various other regional languages


The system can be made available to a larger audience by supporting articles in other regional
languages, e.g. Marathi and Gujarati, and allowing summarisation of those articles.

1.3.2 Not only articles


The system can also be extended to extract data from informational websites and blogs and to
summarise it into brief paragraphs that people can easily consume.

1.3.3 Mobile application


A mobile application can be developed so that any mobile user can read summarised articles on the
go, gaining more information in less time.

1.3.4 Personalised interface


The project can be personalised to a user’s needs so that a user interested in only one genre of
news, such as politics, sports or Bollywood buzz, receives that information promptly.

Chapter - 2: DESCRIPTION

2.1 Description of the project


The News Article Extraction and Summarization System provides news on the go by condensing news
articles from websites into small briefs that can be easily digested. The project makes use of
text pre-processing techniques, the newspaper module for extracting or scraping data from websites,
and extractive summarization techniques. Extraction is done in two major languages, Hindi and
English, and the system is built so that it ignores other web content such as images and
advertisements and extracts only the main content of the webpage. The project is written in Python
using in-built libraries such as newspaper and nltk for the purposes described below. Stop words
are removed and each sentence is given a score based on how much it contributes to the news
content. The most important sentences are then shortlisted and displayed to the user. Tkinter is
used for the front-end development.

2.2 Block Diagram

Fig 2.2.1 Block diagram

2.3 Use Case Diagram

Fig 2.3.1 Use case diagram

2.4 Sequence Diagram

Fig 2.4.1 Sequence diagram

2.5 Activity Diagram

Fig 2.5.1 Activity diagram

Chapter - 3: IMPLEMENTATION AND METHODOLOGY

3.1 Details of Hardware & Software


3.1.1 Hardware Details

• Minimum 4 GB RAM
• Windows 7 or above (64-bit)

3.1.2 Software Details

• Python 3.8
• Anaconda3
• Jupyter

3.2 Web Scraping

To scrape data from websites we use a Python library called newspaper3k, an excellent module for
downloading and parsing newspaper articles. Newspaper3k combines web scraping with
article-extraction algorithms to pull all the useful text out of a page, and it works remarkably
well on online newspaper websites. All the text extracted from the Hindi and English news websites
is written to a file for further processing. The urllib module is Python’s URL-handling package: it
is used to fetch URLs (Uniform Resource Locators) through its urlopen function, supports a variety
of protocols, and collects several submodules for working with URLs, such as urllib.request for
opening and reading them. Data is thus scraped from a website in such a way that all other web
content is ignored and only the text content is used. The procedure is as follows, with a small
sketch after the list:

1. Document load
Request/response handlers make HTTP requests to a group of URLs and fetch the response objects as
HTML content, which is passed on to the next module. In Python, libraries such as urllib perform
this request/response handling.

2. Data parsing/data cleansing
In this module the fetched data is processed and cleaned; the unstructured HTML is transformed into
structured text during this processing.

3. Data extraction
The idea behind web scraping is to retrieve data that already exists on a website and convert it
into a format that is suitable for analysis. Web pages are rendered by the browser, and a set of
wrapper functions makes it simple to select common HTML/XML elements.
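
As a minimal sketch of these three steps using newspaper3k (the URL here is a hypothetical
placeholder; any live article page can be substituted):

```python
from newspaper import Article

# Hypothetical URL for illustration; substitute any live article page.
url = "https://www.ndtv.com/example-article"

article = Article(url)

# 1. Document load: fetch the raw HTML over HTTP.
article.download()

# 2. Data parsing/cleansing: strip markup, ads and images,
#    keeping only the structured article content.
article.parse()

# 3. Data extraction: only the main text content remains.
print(article.title)
print(article.text)
```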

3.3 Text Pre-processing

Before we can apply different summarization approaches to a dataset, we need to do certain
pre-processing to make the data ready for our summarizer. The importance of the pre-processing
procedure is evident from its use in almost every system related to text processing. The following
are some of the widely used pre-processing steps for text processing (a short sketch follows step 4):

1. Sentence tokenization
Also known as sentence boundary disambiguation, this is the process of identifying where sentences
start and end. It allows individual sentences to be treated as separate entities and makes
processing the text relatively easy. It is carried out using the nltk library for English; for
Hindi we have made use of the CLTK library to perform the same function.

2. Cleaning
Cleaning removes special characters from the text and replaces them with spaces, which simplifies
the text for analysis purposes.

3. Word tokenisation
This tokenizes each sentence into separate word entities. The step is especially important if you
want to calculate feature scores of individual words for deducing the important sentences in a
document.

4. Stop word removal
Stop words are words which do not convey any information on their own, such as “and”, “the” and
“it”, and are insignificant in feature-score calculation. Since they are deemed unnecessary, they
are removed to simplify the task of the summarizer. Hindi stop words are not available in the
library and hence were supplied to the system as a set of words in a file; English stop words are
already present in the nltk library, so these predefined stop words are compared with the text and
removed from the formatted content.
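
A minimal sketch of these four steps with nltk (variable names are ours; the Hindi path via CLTK
is omitted here):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("stopwords", quiet=True)   # English stop word list

text = "..."  # the scraped article text from the previous step

# 1. Sentence tokenization
sentences = sent_tokenize(text)

# 4. Predefined English stop words from nltk
stop_words = set(stopwords.words("english"))

# 2 + 3 + 4. Clean, tokenize into words, and drop stop words
words = [w for w in word_tokenize(text)
         if w.isalnum() and w.lower() not in stop_words]
```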

3.4 Text Summarization

Automatic text summarization is a technique concerning the creation of a compressed form of a
single document. Stemming, the process of reducing morphological variants to a root/base word, is
commonly used alongside it; stemming programs are referred to as stemming algorithms or stemmers.
The nltk library is used for the regional-language summarisation as well. The stop words commonly
present in Hindi are gathered into a text file, and the words of the article that do not fall into
this stop-word category are segregated. The frequency of each remaining word is found and
normalised by dividing it by the frequency of the most frequently occurring word. Sentence scores
are then computed from these normalised frequencies, and the highest-scoring sentences are
displayed to the user as the summary. This way of extracting a summary is called extractive
summarisation.
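
A minimal sketch of this frequency-based scoring, building on the pre-processing above (the
function name score_sentences is ours, not from the project code):

```python
from collections import defaultdict
from nltk.tokenize import sent_tokenize, word_tokenize

def score_sentences(text, stop_words):
    # Count the frequency of every non-stop word.
    freq = defaultdict(int)
    for w in word_tokenize(text):
        if w.isalnum() and w.lower() not in stop_words:
            freq[w.lower()] += 1

    # Normalise by the most frequent word's count.
    max_freq = max(freq.values(), default=1)
    for w in freq:
        freq[w] /= max_freq

    # A sentence's score is the sum of its words' normalised frequencies.
    scores = {}
    for sent in sent_tokenize(text):
        scores[sent] = sum(freq.get(w.lower(), 0.0)
                           for w in word_tokenize(sent))
    return scores
```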

3.5 Pseudocode

Algorithm:
Steps:
1. Import the necessary library, the Natural Language Toolkit (nltk).
1.1. Import sent_tokenize to tokenize a paragraph into sentences and word_tokenize to tokenize
sentences into words.
1.2. Import stop words from the nltk corpus to remove words like “the”, “he”, “have” etc. for
English sentences and अत, अपना, अभी etc. for Hindi sentences.
1.3. Import math to calculate the number of sentences from the sentence list obtained after
tokenizing the text into sentences.
2. Import Article from newspaper and import subprocess.
2.1. Take the article URL as input.
2.2. Download the text content and parse the website.
2.3. Write it to a file and display it to the user.
3. Import CLTK (for Hindi).
3.1. Tokenize the text into sentences.
3.2. Print the number of sentences using the predefined function len().
4. Tokenize each sentence into words using word_tokenize.
5. Open the stop words text file.
5.1. Create a stop words dictionary and append all the stop words from the text file to it.
5.2. If a word from the text is in the stop words dictionary, pop(word).
5.3. Create a dictionary of the remaining words and their frequencies.
6. Find the maximum word frequency and normalise all frequencies by dividing by the maximum
frequency.
7. Similarly, find the sentence scores.
8. Display the title of the news and the top image, and create the summary using heapq (which
provides a min-heap) from the calculated sentence scores; a brief sketch of this step follows.
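
A brief sketch of this final selection step, assuming sentence_scores is the dictionary of scores
from step 7 (the example scores below are placeholders):

```python
import heapq

def top_sentences(sentence_scores, n=5):
    # nlargest keeps a min-heap of size n internally, so the n
    # highest-scoring sentences are returned in descending order.
    return heapq.nlargest(n, sentence_scores, key=sentence_scores.get)

# Placeholder scores for illustration only.
sentence_scores = {"First sentence.": 0.9, "Second one.": 0.4, "Third.": 0.7}
summary = " ".join(top_sentences(sentence_scores, n=2))
print(summary)
```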

3.6 Code and Results


3.6.1. Code
I. News_Scrapper_ipynb

II. English_News_ipynb

III. Hindi_News_ipynb

3.6.2. Results and Discussion

1. Web Data

Fig 3.6.2.1 Screenshot of news website

Initially, the data is extracted from the news website as shown in figure 3.6.2.1. This data is
then summarized using the text summarization method.

2. Initial Window

Fig 3.6.2.2 Screenshot of Initial GUI window

The above figure 3.6.2.2 shows the initial GUI window of our system. We have used Tkinter in Python
for the GUI. Tkinter is the standard Python binding to the Tk GUI toolkit and is Python’s de facto
standard GUI library; its name comes from “Tk interface”. Tkinter is free software released under a
Python license.
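
As a hedged sketch of what such a window might look like (widget names and layout here are
illustrative, not the project’s actual GUI code):

```python
import tkinter as tk

# Minimal window sketch: an entry for the article URL and a text
# area for the summary. The real project layout may differ.
root = tk.Tk()
root.title("News Article Summarizer")

url_entry = tk.Entry(root, width=60)
url_entry.pack(padx=10, pady=5)

summary_box = tk.Text(root, height=15, width=80)
summary_box.pack(padx=10, pady=5)

tk.Button(root, text="Summarize").pack(pady=5)

root.mainloop()
```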

3. Extracted and Summarized Data

Fig 3.6.2.3a Screenshot of extracted and summarized data of English News

Fig 3.6.2.3b. Screenshot of extracted and summarized data of Hindi News

Figure 3.6.2.3a above shows the extracted content and its summarization for English news, and
figure 3.6.2.3b shows the extracted content and summarized data for Hindi news.

3.7 Applications

1. Media Monitoring
The problem of information overload and “content shock” has been widely discussed. Automatic
summarization presents an opportunity to condense the continuous torrent of information into
smaller, digestible pieces.

2. Internal document workflow


Large companies constantly produce internal knowledge, which frequently gets stored, and
under-used, in databases as unstructured data. These companies should embrace tools that let them
re-use already existing knowledge. Summarization can enable analysts to quickly understand
everything the company has already done on a given subject and quickly assemble reports that
incorporate different points of view.

3. E-learning and class assignments
Many teachers utilize case studies and news to frame their lectures. Summarization can
help teachers more quickly update their content by producing summarized reports on their
subject of interest.

4. Patent research
Researching patents can be a tedious process. Whether you are doing market intelligence
research or looking to file a new patent, a summarizer to extract the most salient claims
across patents could be a time saver.

5. Science and R&D


Academic papers typically include a human-made abstract that acts as a summary.
However, when you are tasked with monitoring trends and innovation in a given sector, it
can become overwhelming to read every abstract. Systems that can group papers and
further compress abstracts can become useful for this task.

Chapter - 4: CONCLUSION AND FUTURE SCOPE

4.1 Conclusion
We have used the Natural Language Toolkit and the newspaper Python module for this system. The
system is able to fetch only the required content from the huge amount of data on a page. The
study of text-processing techniques forms the major part of our project. We designed this system
so that it helps the user read just the summary of the news from a website instead of spending
time reading the whole article. The algorithm works well for most English news article websites
such as CNN and NDTV. For the regional language Hindi, the system is implemented mostly on Jagran
website articles and is able to summarise the text efficiently. News websites each offer different
perspectives on the news, their content is lengthy, and it is updated frequently; in today’s life
we do not have enough time to read every item from every source. Our system therefore provides a
convenient way to keep up with our fast-paced lives by providing summarised news content.

4.2 Future Scope

1. As this system currently supports only the Hindi and English languages, in future we will
implement other regional and foreign languages as well.
2. The system can also be modified to scrape content based on the type of news, and it can be
updated to display news based on the user’s location.
3. Trending news can be highlighted according to the number of views.

4.3 References

[1] Shreesha M, Srikara S B, Manjesh R, “A Novel Approach for News Extraction using Web
Scrapping”, IJERT, ISSN: 2278-0181.

[2] K. Sundaramoorthy, R. Durga, S. Nagadarshini, “NEWSONE - An Aggregation System for News Using
Web Scraping Method”, 2017 IEEE, DOI: 10.1109/ICTACC.2017.43.

[3] https://stackabuse.com/text-summarization-with-nltk-in-python/

[4] https://www.geeksforgeeks.org/newspaper-article-scraping-curation-python/
