INSTITUTE OF TECHNOLOGY
Department of Computer Engineering
Submitted by
Sonal Misal, Roll No. 34
Bhavika Motiramani, Roll No. 35
Priyanka Patil, Roll No. 41
(2020-21)
1. Introduction
1.1. Introduction of the Project
1.2. Aims and Objectives
1.3. Scope of the Project
2. Description
2.1. Description of the Project
2.2. Block Diagram
2.3. Use Case Diagram
2.4. Sequence Diagram
2.5. Activity Diagram
3. Implementation & Methodology
3.1. Details of Hardware and Software
3.1.1. Hardware Details
3.1.2. Software Details
3.2. Web Scraping
3.3. Text Pre-processing
3.4. Text Summarization
3.5. Pseudocode
3.6. Code and Results
3.6.1. Code
3.6.2. Results and Discussion
3.7. Applications
4. Conclusion and Future Scope
4.1. Conclusion
4.2. Future Scope
4.3. References
Chapter - 1: INTRODUCTION
Chapter - 2: DESCRIPTION
2.3 Use Case Diagram
2.4 Sequence Diagram
2.5 Activity Diagram
Chapter - 3: IMPLEMENTATION & METHODOLOGY
3.1 Details of Hardware and Software
3.1.2 Software details
• Python 3.8
• Anaconda3
• Jupyter
3.2 Web Scraping
To scrape data from websites we use a Python library called newspaper3k. Newspaper3k is a Python module for extracting and parsing newspaper articles; it applies advanced algorithms on top of web scraping to extract the useful text from a page, and it works particularly well on online newspaper websites. All the text data extracted from the Hindi and English news websites is then written to a file for further processing. The urllib module is Python's URL-handling package: it is used to fetch URLs (Uniform Resource Locators), and its urlopen function can fetch URLs over a variety of different protocols. Urllib collects several submodules for working with URLs, such as urllib.request for opening and reading them. Data from the website is therefore scraped in such a way that all other web content is ignored and only the text content is used. The procedure is as follows:
1. Document Load
Request/response handlers are managers that make HTTP requests to a group of URLs, fetch the response objects as HTML content, and pass this data to the next module. In Python, the urllib request/response machinery performs this step.
2. Data Parsing / Data Cleansing
In this module the fetched data is processed and cleaned: during this processing the unstructured HTML is transformed into structured data.
3. Data Extraction
The idea behind web scraping is to retrieve data that already exists on a website and convert it into a format suitable for analysis. Web pages are rendered by the browser, and a set of wrapper functions makes it simple to select the common HTML/XML elements that hold the article text.
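In the system itself, newspaper3k performs these three steps internally. The sketch below is a minimal stand-in using only the standard library, assuming the HTML has already been fetched (e.g. with urllib.request), so no network call is made; it keeps only the text inside <p> tags and ignores all other page content:

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects text found inside <p> tags, ignoring all other markup."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_article_text(html):
    """Data extraction step: return only the paragraph text of a page."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())


sample = "<html><body><nav>Menu</nav><p>First sentence.</p><p>Second one.</p></body></html>"
print(extract_article_text(sample))
```

Navigation text such as "Menu" is discarded, which mirrors how the system ignores all web content other than the article text.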
3.3 Text Pre-processing
Before we can apply the different summarization approaches to a dataset, we need to pre-process the data to make it “usage-ready” for our summarizer. The importance of pre-processing is evident from its use in almost every system related to text processing. The following are some widely used pre-processing steps:
1. Sentence Tokenization
Sentence tokenization, also known as sentence boundary disambiguation, is the process of identifying where sentences start and end. It lets us treat individual sentences as separate entities and makes processing the text relatively easy. It is carried out using the nltk library for English; for Hindi we make use of the CLTK library to perform the same function.
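nltk's sent_tokenize and CLTK handle this robustly; as a simplified stand-in, sentence boundaries can be sketched with a regular expression that also recognises the Devanagari danda (।) used in Hindi:

```python
import re


def split_sentences(text):
    """Naive sentence boundary disambiguation: break after '.', '!', '?'
    or the Devanagari danda '\u0964' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Rain is expected today. Carry an umbrella!"))
print(split_sentences("आज बारिश होगी। कल धूप रहेगी।"))
```

Unlike this sketch, the real tokenizers also handle abbreviations such as "Dr." that would otherwise produce false sentence breaks.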
2. Cleaning
Cleaning removes special characters from the text and replaces them with spaces. As a result, it simplifies the text for analysis purposes.
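A minimal sketch of this cleaning step, assuming "special characters" means anything other than word characters and whitespace (Python's \w is Unicode-aware, so Devanagari letters survive):

```python
import re


def clean_text(text):
    """Replace every character that is not a word character or whitespace
    with a space, then collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Breaking: PM's speech, live @ 9pm!"))
```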
3. Word Tokenisation
Word tokenisation splits each sentence into separate word tokens. This step is especially important if you want to calculate feature scores for the individual words of a sentence in order to deduce the important sentences in a document.
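The project uses nltk's word_tokenize for this; a simplified stand-in splits on runs of word characters:

```python
import re


def tokenize_words(sentence):
    """Split a sentence into word tokens (runs of word characters)."""
    return re.findall(r"\w+", sentence)


print(tokenize_words("The match starts at 7 pm."))
```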
4. Stop word removal
It is the process of removing stop words. (words which do not convey any information, such as
“and”, “the”, “it” etc. which are insignificant in feature score calculation). Since they are deemed
unnecessary and have no significance on their own, they must be removed to simplify the task of
the summarizer. The Hindi stop words are not available in the library and hence were given to the
system in a set of words in file. The English stop words are already present in the nltk library and
hence these predefined stop words are compared with the text and then removed from the necessary
formatted content.
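Both cases reduce to the same two operations: loading a stop-word set (from a file for Hindi, from nltk's built-in list for English) and filtering the token list against it. A sketch, where the file name and the tiny inline English set are stand-ins for the real lists:

```python
def load_stop_words(path):
    """Read one stop word per line from a UTF-8 text file into a set
    (used for the Hindi stop words supplied in a file)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


english_stop = {"the", "and", "it", "is"}  # tiny stand-in for nltk's list
print(remove_stop_words(["The", "verdict", "is", "out"], english_stop))
```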
3.4 Text Summarization
Automatic text summarization is a technique for creating a compressed form of a single document. Stemming is the process of reducing morphological variants of a word to its root/base form; stemming programs are commonly referred to as stemming algorithms or stemmers. The nltk library is used for the regional-language summarisation. The stop words commonly present in Hindi are collected in a text file, and the words of the article that are not in this stop-word list are segregated. The frequency of each remaining word is counted and then normalised by dividing it by the frequency of the most frequently occurring word. Sentence scores are computed from these normalised frequencies, and the highest-scoring sentences are displayed to the user as the summary. This way of extracting a summary is called extractive summarisation.
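The scoring pipeline just described can be sketched end to end as follows; the sentence splitter, the tiny stop-word set, and the sample article are simplified stand-ins for the real tokenizers and lists used by the system:

```python
import heapq
import re


def summarize(text, n_sentences=2, stop_words=frozenset()):
    """Extractive summarization: normalized word frequencies -> sentence
    scores -> top-n sentences, returned in their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?\u0964])\s+", text.strip()) if s]
    # Count frequencies of non-stop words
    freq = {}
    for word in re.findall(r"\w+", text.lower()):
        if word not in stop_words:
            freq[word] = freq.get(word, 0) + 1
    if not freq:
        return ""
    # Normalize by the most frequent word's count
    max_freq = max(freq.values())
    freq = {w: c / max_freq for w, c in freq.items()}
    # Sentence score = sum of the normalized frequencies of its words
    scores = {}
    for sent in sentences:
        for word in re.findall(r"\w+", sent.lower()):
            if word in freq:
                scores[sent] = scores.get(sent, 0.0) + freq[word]
    best = set(heapq.nlargest(n_sentences, scores, key=scores.get))
    return " ".join(s for s in sentences if s in best)


article = "Rain rain rain fell over the city. The city flooded. Cats sleep."
print(summarize(article, n_sentences=1, stop_words={"the", "over"}))
```

The first sentence scores highest because "rain" is the most frequent word (normalized frequency 1.0), so it alone is returned as the one-sentence summary.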
3.5 Pseudocode
Algorithm:
Steps:
1. Import the necessary libraries, i.e. the Natural Language Toolkit (NLTK).
1.1. Import sent_tokenize to tokenize a paragraph into sentences, and import word_tokenize to tokenize sentences into words, from NLTK.
1.2. Import stop words from the NLTK corpus to remove stop words like "the", "he", "have" etc. for English sentences and अत, अपना, अभी etc. for Hindi sentences.
1.3. Import math to calculate the number of sentences in the sentence list obtained after tokenizing the text into sentences.
2. Import Article from newspaper and import subprocess.
2.1. Take the article URL as input.
2.2. Download the text content and parse the website.
2.3. Write it to a file and display it to the user.
3. Import CLTK.
3.1. Tokenize the text into sentences.
3.2. Print the number of sentences using the predefined function len().
4. Tokenize the sentences into words using word_tokenize.
5. Open the stop-words text file.
5.1. Create a stop-words dictionary and append all the stop words from the text file to it.
5.2. If a word from the text is in the stop-words dictionary, then pop(word).
5.3. Create a dictionary of the remaining words and their frequencies.
6. Find the maximum word frequency and normalise all frequencies by dividing by the maximum frequency.
7. Similarly find the sentence scores.
8. Display the title of the news and the top image, and create the summary using heapq (which forms a minimum heap) from the calculated sentence scores.
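Step 8's heap selection can be illustrated in isolation; the sentence scores below are made-up example values:

```python
import heapq

sentence_scores = {
    "Heavy rain lashed the city on Monday.": 1.8,
    "Traffic slowed in several areas.": 0.9,
    "Officials issued an alert.": 1.2,
}
# heapq.nlargest maintains a small min-heap of size n internally, which is
# why step 8 describes heapq as forming a minimum heap even though it is
# used here to select the LARGEST scores.
top_two = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)
print(top_two)
```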
3.6.1. Code
II. English_News.ipynb
III. Hindi_News.ipynb
3.6.2. Results and Discussion
1. Web Data
Initially the data is extracted from the news website as shown in figure 3.6.2.1. This data is then summarized using the text summarization method.
2. Initial Window
Figure 3.6.2.2 above shows the initial GUI window of our system. We used Tkinter in Python for the GUI. Tkinter is the standard Python binding to the Tk GUI toolkit and is Python's de facto standard GUI library; the name comes from "Tk interface". Tkinter is free software released under a Python license.
Fig 3.6.2.3a Screenshot of extracted and summarized data of English news
Figure 3.6.2.3a above shows the extracted content and its summary for English news, and figure 3.6.2.3b shows the extracted content and its summary for Hindi news.
3.7 Applications
1. Media Monitoring
The problem of information overload and “content shock” has been widely discussed. Automatic summarization presents an opportunity to condense the continuous torrent of information into smaller, digestible pieces.
3. E-learning and class assignments
Many teachers utilize case studies and news to frame their lectures. Summarization can
help teachers more quickly update their content by producing summarized reports on their
subject of interest.
4. Patent research
Researching patents can be a tedious process. Whether you are doing market intelligence
research or looking to file a new patent, a summarizer to extract the most salient claims
across patents could be a time saver.
Chapter - 4: CONCLUSION AND FUTURE SCOPE
4.1 Conclusion
We have used the Natural Language Toolkit and the newspaper Python module for this system. The system is able to fetch only the required content from a huge amount of data. The study of text processing techniques is the major part of our project. We designed this system so that it helps the user read just a summary of the news from a website instead of wasting time reading the whole article. The algorithm works well for most English news websites, such as CNN and NDTV. For the regional language Hindi, the system is mostly applied to Jagran website articles and is able to summarise the text efficiently. Each news website covers the news from various perspectives, is lengthy, and is updated frequently, and in today's life we do not have enough time to read every article from every source. Our system therefore provides a convenient way to keep up with fast-paced lives by providing summarised news content.
4.2 Future Scope
1. As this system currently supports only the Hindi and English languages, in future we will implement other regional and foreign languages as well.
2. The system can be modified to scrape content based on the type of the news. It can also be updated to display news based on the user's location.
3. Trending news can be highlighted according to the number of views.
4.3 References
[1] Shreesha M, Srikara S B, Manjesh R, "A Novel Approach for News Extraction using Web Scraping", IJERT, ISSN: 2278-0181.
[3] https://stackabuse.com/text-summarization-with-nltk-in-python/
[4] https://www.geeksforgeeks.org/newspaper-article-scraping-curation-python/