
Synopsis on

(Development of Automatic Text Summarization Algorithm)

Submitted by

(Mansi Bhardwaj)

For the award of the degree of

M.Tech

Under the supervision of

External Guide
(Mrs. Pooja Gupta, Assistant Professor, Banasthali Vidyapith University)

Apaji Institute of Mathematics and Applied Computer Technology
Banasthali Vidyapith
Banasthali - 304022
Session: 2019-2021

Table of Contents (For Research Project)

1. Introduction
   1.1 About The Organization
   1.2 Problem Statement

2. Literature Review

3. Proposed Study
   3.1 Aims and Objectives

4. Research Methodology
   4.1 Methodology
   4.2 Work Plan
   4.3 Proposed Contents of the Thesis
   4.4 Tools and Techniques

5. References

CHAPTER 1
INTRODUCTION
1.1 About The Organization
Banasthali Vidyapith is India’s largest residential university for women. Its source of
inspiration is Shantabai, the daughter of the founder, freedom fighter and educationist
Shri Hiralal Shastri. To complete his daughter's unfinished work, the Shri Shantabai
Shiksha Kutir was started in 1935. The name “Banasthali Vidyapith” was adopted only in
1943, which was also the year when undergraduate courses were first introduced. The UGC
committee that recommended conferring university status on the institution kept the
following points in mind:

(i) Vidyapith’s definite and viable programme for restructuring courses at the
undergraduate level and its eagerness to carry out various measures to make
education more meaningful and practical.

(ii) Availability of opportunities for the students to develop their personalities.

(iii) Vidyapith’s initiative to inculcate spiritual and moral values in the students
through various activities, emphasizing character-building and simplicity. 

• Banasthali University is the largest fully residential women’s university in the
world, located in the Tonk district of the Indian state of Rajasthan. It offers
programmes at the secondary, senior secondary, undergraduate and postgraduate
levels. In 2020, NAAC accredited the university with an A++ grade.

• The campus sprawls over 850 acres, about 80 kilometres from Jaipur, the state
capital, in the Tonk district of Rajasthan, India. It is broadly divided into the
school division, the University division and the residential blocks. The
residential blocks comprise 29 hostels, each able to house up to 438 students.

• Banasthali follows a five-fold activity programme so that students can also grow
in fields such as dancing, singing, sports, self-defence and fitness, and become
well-rounded women who do not depend on others for their education, safety and
lifestyle. Sports include basketball, cricket, football, tennis and many others.
Dance instruction covers cultural forms such as Kathak, Manipuri and many other
dance forms. Fitness activities include yoga, aerobics, Zumba and swimming
classes and a gym.

1.2 Problem Statement


Automatic Text Summarization (ATS) is important because the amount of textual data on
the internet is huge and growing fast. Web sources such as websites, user reviews, news,
blogs and social media networks are vast sources of textual data. Because this content
increases day by day, summarizing all of it manually is expensive, difficult and very
time-consuming. Automatic text summarization addresses this problem.

Automatic Text Summarization is the process of computationally shortening a set of data
to create a summary that represents the most important or relevant information within the
original text. ATS systems can be classified by their input into single-document and
multi-document summarization systems: in the former the summary is generated from a
single document, in the latter from a cluster of documents. An ATS system is designed
using one of three approaches: extractive, abstractive or hybrid. The extractive approach
selects the most important sentences from the input text and uses them to generate the
summary. The abstractive approach represents the input text in an intermediate form and
then generates the summary with words and sentences that differ from those of the
original text. The hybrid approach combines the abstractive and extractive approaches.

Manual text summarization is a time-consuming and costly task that includes many steps.
For example, the following steps are done to manually summarize a single document
(Takeuchi, 2002):

1) Trying to understand what the document is about.

2) Trying to extract the "most important" parts from it.

3) Trying to compose a summary that satisfies the following requirements (Lloret et al.,
2017):

• The summary readability and linguistic quality.

• The summary consistency and content coverage.

• The non-redundancy of the produced summary.

Because manually summarizing the huge amount of textual content on the Internet and in
various archives is so difficult, ATS systems have emerged as the main technology for
addressing this pressing problem.

CHAPTER 2
LITERATURE REVIEW
Considerable effort has gone into achieving effective text summarization. Nagwani et
al. [1] proposed a frequent-term-based text summarization algorithm that first processes
the document to be summarized by eliminating stop words and applying stemmers. Next,
term-frequency data is calculated from the document and the frequent terms are selected;
for these selected terms, semantically equivalent terms are also generated. Finally, all
sentences in the document that contain the identified frequent terms or their semantic
equivalents are selected for the summary.
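
The frequent-term idea described above can be illustrated with a minimal Python sketch. It is not the algorithm of [1]: stemming and the generation of semantically equivalent terms are omitted, and the stop-word list, the sentence splitter and the names `frequent_term_summary` and `num_terms` are assumptions made only for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "it", "of", "on", "the", "to"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    # Lower-case alphabetic tokens with stop words removed.
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def frequent_term_summary(text, num_terms=3):
    sentences = split_sentences(text)
    # Term-frequency data over the whole document.
    counts = Counter(w for s in sentences for w in content_words(s))
    frequent = {w for w, _ in counts.most_common(num_terms)}
    # Keep, in original order, every sentence containing a frequent term.
    return [s for s in sentences if frequent & set(content_words(s))]
```

Calling `frequent_term_summary(text, num_terms=2)` on a short document keeps only the sentences that mention one of its two most frequent content words.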

Yang et al. [2] introduced a personalized text-based content summarizer to help mobile
users retrieve and process information more quickly, according to their interests and
preferences. It is based on probabilistic language modeling techniques adapted to build a
user model and an extractive text summarization system to generate a personalized and
automatic summary for mobile learning.

Aksoy et al. [3] proposed using Semantic Role Labeling (SRL) for generic
Multi-Document Summarization (MDS). Sentences are scored according to frequent
semantic phrases and the summary is formed from the top-scored sentences. The method
uses a term-based sentence scoring approach to investigate the effect of using semantic
units instead of single words for sentence scoring. The scoring metric is then integrated
as an auxiliary feature in order to examine its effect on performance.

Shams et al. [4] put forth a novel technique for summarizing domain-specific text from a
single web document, using statistical and linguistic analysis of the text in a reference
corpus and in the web document. The proposed summarizer used a combinational function of
Sentence Weight and Subject Weight to determine the rank of a sentence. It used the
number of terms and the number of words in a sentence, together with term frequency in
the corpus, and about 30% of the ranked sentences were taken as the summary of the web
document. Three web document summaries generated with the proposed technique were
compared with summaries written manually by 16 different human subjects.

Foong et al. [5] developed a hybrid Harmony Particle Swarm Optimization (PSO)
framework for an Extractive Text Summarizer to overcome high processing load. Their
objective was to find out whether the proposed PSO model could condense original
electronic documents into shorter summarized texts more efficiently and accurately than
the alternative models. Their empirical results showed that the proposed hybrid PSO
model improved the efficiency and accuracy of composing summarized text.

Already Implemented Systems: The following systems have already been implemented, each
using different algorithms.

MEAD: MEAD [11] is the most elaborate publicly available platform for multi-lingual
summarization and evaluation. The platform implements multiple summarization
algorithms such as position-based, centroid-based, largest common subsequence, and
keywords. The methods for evaluating the quality of the summaries are both intrinsic and
extrinsic. MEAD implements a battery of summarization algorithms, including baselines
(lead-based and random) as well as centroid-based and query-based methods.

• S. P. Yong et al. [6] used a neural network, together with a keyword extraction and
summary production system, to generate summaries.

• Li Chengcheng [7] used Rhetorical Structure Theory (RST) to analyze sentences and
discover rhetorical relations in order to generate a summary.

• In 2000, Hongyan Jing [8] worked with closely related sentences, drawing on the
concept of human abstraction.

• In 2011, Nitin Agarwal et al. [9] used an unsupervised query-oriented approach based
on a clustering method.

• In 2004, Jun’ichi Fukumoto [10] used TF/IDF to generate abstracts for single and
multiple documents.

Allahyari et al. [12] give a survey of text summarization techniques, which is a very
helpful source of background information on text summarization.

CHAPTER 3
PROPOSED STUDY
3.1 Aims and Objectives:
• The objective of this research is to develop an algorithm that differs from existing
ones, is easy for readers to understand and gives accurate results.

• The main objective of an ATS system is to generate a summary that conveys the
important ideas or sentences of the input document in less space and with minimal
repetition.

• The ATS system helps users grasp the main ideas of the input document without
reading the entire document, which saves considerable time and effort.

• The summary should be short compared with the input document, so that users can
easily understand the concept and aim of the whole document without reading it.

• The ATS system should work on web sources such as social media, news, blogs or
research papers, and summarize their contents by headings or paragraphs
according to the classification of the ATS.

These are the aims and objectives of the Automatic Text Summarization system. The
system is designed around them so that it can readily meet users' requirements,
since these are generally what users expect when they first hear about ATS
systems.

In this research we aim to develop an algorithm that readers can easily understand.
Its complexity may be high, but the essential requirement is that it gives
accurate results.

CHAPTER 4
RESEARCH METHODOLOGY
4.1 Methodology
This section explains the methodology that the research project will follow and briefly
describes each of its steps:

• Data Collection

• Pre-Processing

• Processing

• Post-Processing

Data Collection: In this step, data is collected from websites, social media networks or
user reviews on a particular topic, so that we can summarize what users want. A corpus
(a collection of data) can also be used. The data is collected with the help of data
collection tools. This is the first and most basic step in developing an algorithm for
Automatic Text Summarization.

Pre-Processing: The data collected in the previous step is unstructured and not ready for
use by an ATS algorithm, so it must be filtered using linguistic techniques such as word
tokenization, segmentation, stemming, stop-word removal and POS tagging. Applying these
techniques to the corpus yields filtered data that can be used in the next step.
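
A minimal Python sketch of three of the pre-processing techniques named above (word tokenization, stop-word removal and stemming) is given below. The stop-word list is a tiny illustrative assumption, and `crude_stem` is a deliberately simplified suffix stripper standing in for a real stemmer such as Porter's; neither is the project's final choice.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def tokenize(text):
    # Word tokenization: lower-case runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(word):
    # Simplified stemming: strip one common suffix if at least three characters remain.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Tokenize, drop stop words, then stem what is left.
    return [crude_stem(token) for token in tokenize(text) if token not in STOP_WORDS]
```

For instance, `preprocess("The users are reading the reviews on blogs")` reduces the sentence to its stemmed content words.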

Processing: In this step we work on the filtered textual data, applying one or more
techniques within one of the text summarization approaches to convert the input document
into a summary. The approaches are the Extractive, Abstractive and Hybrid Text
Summarization Approaches. The techniques are text summarization operations and
statistical and linguistic features. The building blocks are text representation models,
linguistic analysis and processing techniques, and soft computing techniques. All of
these techniques and approaches are used in developing an algorithm for ATS.

Post-Processing: In this step, problems introduced in the previous step are resolved.
This includes fixing issues in the generated summary sentences, such as anaphora
resolution, and reordering the selected sentences before generating the final summary.

Workflow: The flow of work in which the algorithmic development of the ATS proceeds is:

Corpus → Text → Sentences → Vectors → Similarity Matrix → Graphs →
Sentence Ranking → Summary

This is the workflow for developing an algorithm for ATS.
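
The stages of this workflow (sentences, vectors, similarity matrix, graph, sentence ranking, summary) resemble a graph-based extractive pipeline in the style of TextRank; that reading is an assumption, since the synopsis does not name the ranking method. The sketch below is one possible realization under that assumption: bag-of-words counts stand in for the vectors, cosine similarity fills the matrix, and a PageRank-style iteration ranks the sentences. All function names and the `damping` and `iterations` parameters are illustrative.

```python
import math
import re
from collections import Counter

def sentence_vectors(sentences):
    # "Vectors" stage: bag-of-words term counts per sentence.
    return [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    # "Similarity Matrix" stage: pairwise cosine similarity, zero on the diagonal.
    n = len(vectors)
    return [[0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

def rank_sentences(matrix, damping=0.85, iterations=50):
    # "Graphs" and "Ranking" stages: the matrix is treated as a weighted graph
    # and scored with a PageRank-style power iteration.
    n = len(matrix)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in matrix]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(matrix[j][i] / out_weight[j] * scores[j]
                            for j in range(n) if out_weight[j])
            for i in range(n)
        ]
    return scores

def summarize(sentences, k=2):
    # "Summary" stage: keep the k best-ranked sentences in their original order.
    scores = rank_sentences(similarity_matrix(sentence_vectors(sentences)))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the document receives only the baseline score and is therefore the first to be dropped from the summary.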

4.2 Work Plan: This section discusses the work plan, i.e. approximately how much time
each phase of the research project will take.

• First Phase (Data Collection): The time required for this phase is one and a half
months.

In this phase, data is collected from web sources and then converted into
DataFrames using pandas (a Python library).

• Second Phase (Pre-Processing): The time required for this phase is two months.

In this phase, the data is filtered by applying linguistic techniques such as
stop-word removal and stemming.

• Third Phase (Processing): The time required for this phase is two and a half
months.

In this phase, the main task is to select the summarization approach and apply one
or more techniques with it.

• Fourth Phase (Post-Processing): The time required for this phase is one and a half
months.

In this phase, the sentences of the summarized text are ranked and any remaining
problems are resolved.

These are the four phases of the project, and each of them consists of many small
tasks. In every phase the Python language is used to achieve the desired
output.

4.3 Proposed Contents of the Thesis: This section explains the algorithm and its
expected outcome. I am developing an algorithm for automatic text summarization that
summarizes an input document using an extractive, abstractive or hybrid approach. The
algorithm should take little time and give accurate results, and it should be compatible
with every type of textual content, whether news, user reviews on social media or blogs.
After the summarization process, attention must be paid to the ranking of the sentences:
for example, a sentence that comes last in the input document may come first in the
summary, so the sentences must be ranked properly according to their importance.

4.4 Tools and Techniques: This research project uses tools and techniques such as
pandas for Python, Machine Learning, the Naïve Bayes classifier and the N-gram
algorithm.

• Pandas: Pandas is a high-level data manipulation tool developed by Wes
McKinney. It is built on the NumPy package and its key data structure is called
the DataFrame. DataFrames allow you to store and manipulate tabular data in
rows of observations and columns of variables. Pandas provides many functions
with which we can perform the data handling needed for the techniques used in
this research project.

• Machine Learning: Machine Learning is an application of Artificial Intelligence
(AI) that gives systems the ability to learn and improve automatically from
experience without being explicitly programmed. Machine Learning focuses on
the development of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data, such as
examples, direct experience or instruction, in order to look for patterns in the
data and make better decisions in the future based on the examples provided. The
primary aim is to allow computers to learn automatically, without human
intervention or assistance, and to adjust their actions accordingly. With the
classic algorithms of machine learning, however, text is treated as a sequence of
keywords, whereas an approach based on semantic analysis mimics the human
ability to understand the meaning of text.

• Naïve Bayes Classifier: Naïve Bayes classifiers are a collection of classification
algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of
algorithms that share a common principle: every pair of features being classified
is assumed to be independent of each other, given the class. As an illustration,
consider a fictional dataset that describes the weather conditions for playing a
game of golf. Given the weather conditions, each tuple classifies the conditions
as fit (“Yes”) or unfit (“No”) for playing golf.
• Bayes’ Theorem finds the probability of an event occurring given the
probability of another event that has already occurred. Bayes’ theorem is stated
mathematically as the following equation, where A and B are events and P(B) ≠ 0:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

• Basically, we are trying to find the probability of event A, given that event B is
true. Event B is also termed the evidence.
• P(A) is the prior probability of A, i.e. the probability of the event before the
evidence is seen. The evidence is an attribute value of an unknown instance (here,
it is event B).
• P(A|B) is the posterior probability of A, i.e. the probability of the event after
the evidence is seen.
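
To make the roles of the prior, the evidence and the posterior concrete, the following Python sketch applies Bayes' theorem with the naive independence assumption to a small golf-style dataset. The seven rows of `DATA`, the two features (outlook, windy) and the function names are hypothetical, invented only for illustration; no smoothing is applied, so an unseen feature value gives a class zero probability.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (outlook, windy) -> play golf?
DATA = [
    (("sunny", "no"), "Yes"),
    (("sunny", "yes"), "No"),
    (("rainy", "yes"), "No"),
    (("overcast", "no"), "Yes"),
    (("rainy", "no"), "Yes"),
    (("sunny", "no"), "Yes"),
    (("overcast", "yes"), "Yes"),
]

def train(data):
    class_counts = Counter(label for _, label in data)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in data:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def posterior(features, class_counts, feature_counts):
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(A)
        for i, value in enumerate(features):
            # Likelihood P(B_i | A), multiplied under the independence assumption.
            score *= feature_counts[(i, label)][value] / count
        scores[label] = score
    evidence = sum(scores.values())  # P(B)
    if not evidence:
        return scores
    return {label: s / evidence for label, s in scores.items()}  # posterior P(A | B)
```

With this toy data, `posterior(("sunny", "yes"), *train(DATA))` assigns the larger posterior probability to "No".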

• N-Gram Models: Models that assign probabilities to sequences of words are
called language models or LMs. The simplest model that assigns probabilities to
sequences of words is the n-gram. An n-gram is a sequence of N words: a 2-gram
(or bi-gram) is a two-word sequence such as “This is”, “is a”, “a great” or
“great song”, and a 3-gram (or tri-gram) is a three-word sequence such as
“is a great” or “a great song”. N-gram models can be used to predict the last
word of an n-gram given the previous words, and thus to create new sequences of
words. In a bit of terminological ambiguity, the word “model” is usually dropped,
so the term n-gram is used to mean either the word sequence itself or the
predictive model that assigns it a probability.
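
As a minimal illustration of the simplest case, the Python sketch below builds a bi-gram model by counting word pairs and estimates the probability of the next word by relative frequency (maximum likelihood, without smoothing). The function names and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    # counts[prev][nxt] = number of times `nxt` follows `prev` in the corpus.
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_probability(counts, prev, nxt):
    # Relative-frequency estimate of P(nxt | prev); 0.0 if `prev` was never seen.
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def predict_next(counts, prev):
    # Most frequent follower of `prev`, or None if `prev` was never seen.
    followers = counts[prev]
    return followers.most_common(1)[0][0] if followers else None
```

Trained on a few sentences that contain “a great song” more often than “a great day”, `predict_next(counts, "great")` returns “song”.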

CHAPTER 5
REFERENCES
1. Kumar Nagwani, Naresh, and Shrish Verma. "A Frequent Term and Semantic
Similarity based Single Document Text Summarization Algorithm." International
Journal of Computer Applications (2011).

2. Yang, Guangbing, Wen, Nian-Shing, and Sutinen. "Personalized Text Content
Summarizer for Mobile Learning: An Automatic Text Summarization System
with Relevance Based Language Model." In Technology for Education (T4E),
2012 IEEE Fourth International Conference on, pp. 90-97. IEEE, 2012.

3. Aksoy, Bugdayci, Gur, Uysal, and Can. "Semantic argument frequency-based
multi-document summarization." In Computer and Information Sciences, 2009.
ISCIS 2009. 24th International Symposium on, pp. 460-464. IEEE, 2009.

4. Shams, Rushdi, M. M. A. Hashem, Suraiya Rumana Akter, and Monika Gope.
"Corpus-based web document summarization using statistical and linguistic
approach." In Computer and Communication Engineering (ICCCE), 2010
International Conference on, IEEE, 2010.

5. Foong, Oi-Mean, and Alan Oxley. "A hybrid PSO model in Extractive Text
Summarizer." In Computers & Informatics (ISCI), 2011 IEEE Symposium on, pp.
130-134. IEEE, 2011.

6. S. P. Yong, A. I. Z. Abidin and Y. Y. Chen, “A Neural Based Text
Summarization System”, 6th International Conference of Data Mining, pp. 45-50,
2005.

7. Li Chengcheng, “Automatic Text Summarization Based On Rhetorical Structure
Theory”, International Conference on Computer Application and System
Modeling (ICCASM), vol. 13, pp. 595-598, October 2010.

8. Hongyan Jing, “Sentence Reduction for Automatic Text Summarization”, In
Proceedings of the 6th Applied Natural Language Processing Conference, Seattle,
USA, pp. 310-315, 2000.

9. Nitin Agarwal, Gvr Kiran, Ravi Shankar Reddy and Carolyn Penstein Rosé,
“Towards Multi-Document Summarization of Scientific Articles: Making
Interesting Comparisons with SciSumm”, Proceedings of the Workshop on
Automatic Summarization for Different Genres, Media, and Languages, Portland,
Oregon, pp. 8–15, 2011.

10. Jun'ichi Fukumoto, “Multi-Document Summarization Using Document Set Type
Classification”, Proceedings of NTCIR-4, Tokyo, pp. 412-416, 2004.

11. http://www.summarization.com/mead (Accessed: 20 November 2015).

12. Mehdi Allahyari, Saeid Safaei, Krys Kochut, Seyedamin Pouriyeh, Mehdi Assefi,
Elizabeth D. Trippe, Juan B. Gutierrez, “Text Summarization Techniques: A Brief
Survey”, arXiv:1707.02268v3 [cs.CL], 28 July 2017.

13. Wafaa S. El-Kassas, Cherif R. Salama, Ahmed A. Rafea, Hoda K. Mohamed,
“Automatic Text Summarization: A Comprehensive Survey”, Expert Systems
with Applications (2020).
