
Automatic Text Summarization using Python

Steven Ace B. Galedo
College of Information Technology Education Department
Technological Institute of the Philippines
Manila, Philippines
stevengaledo@gmail.com
Abstract—Manual text summarization takes a great deal of time. This paper therefore presents a simple automatic text summarization tool built using the Cosine Similarity Measure. Cosine similarity is one way of summarizing text: sentence vectors are built from the text, a formula ranks the sentences, and the top-ranking sentences are included in the final summary.

Keywords—Text Summarization, Python, Extractive Summarization, Cosine Similarity, Sentence Vectors

I. INTRODUCTION

Scanning large piles of text and interpreting their meaning is a hard task. That is why summarization has become a crucial part of many industries. Reading a summary gives the reader a glimpse of the meaning or gist of a text without having to read the whole text, especially when the document is lengthy, such as a court case or a scientific paper. Summarization is the process of reducing a text to its main idea and necessary information. Summarizing differs from paraphrasing in that a summary leaves out details and terms, while a paraphrase is a restatement of the meaning of a text or passage using other words. Summarizing helps you understand and learn important information by reducing it to its key ideas. Summaries can be used for annotation and study notes, as well as to expand the depth of your writing [1]. It is very difficult for human beings to manually extract the summary of a large document. There is also plenty of text material available on the Internet, so there is the problem of searching for relevant documents among the many available and absorbing the relevant information from them.

II. AUTOMATIC TEXT SUMMARIZATION

Automatic text summarization employs machines or computers to perform the summarization of one or more documents using some form of heuristic or statistical method. There are different approaches to automatic text summarization, each of which is explained below.

A. Automatic Text Summarization Based on Input Type

Single-document summarization focuses on summarizing a single document only, while multi-document summarization focuses on summarizing multiple documents [2].

B. Automatic Text Summarization Based on Purpose

Automatic text summarization can also be classified by purpose. There are three purpose-based approaches: generic, domain-specific, and query-based. Generic summarization focuses on obtaining a generic summary or abstract of the collection (whether documents, sets of images, videos, news stories, etc.). Domain-specific summarization summarizes only a particular kind of document, and query-based summarization summarizes objects specific to a query [2].

C. Automatic Text Summarization Based on Output Type

In terms of output type, there are two approaches: abstractive and extractive text summarization. In abstractive summarization, words are selected based on semantic understanding, even words that did not appear in the source documents. It aims at producing the important material in a new way: the text is interpreted and examined using advanced natural language techniques in order to generate a new, shorter text that conveys the most critical information from the original. In extractive summarization, by contrast, methods attempt to summarize articles by selecting a subset of words that retain the most important points. This approach weights the important parts of sentences and uses them to form the summary. Different algorithms and techniques are used to assign weights to the sentences and then rank them based on importance and similarity to one another [3].
III. ABSTRACTIVE AND EXTRACTIVE TEXT SUMMARIZATION

A. Abstractive Text Summarization

In order to achieve abstractive text summarization, certain techniques are applied. These techniques are divided into two categories: the structure-based approach and the semantic-based approach. The structure-based approach encodes the most important information from the document through cognitive schemes such as templates, extraction rules, and other structures such as trees, ontologies, and lead-and-body phrase structure. In the semantic-based approach, a semantic representation of the document is fed into a natural language generation (NLG) system; this method focuses on identifying noun phrases and verb phrases by processing linguistic data. The techniques for each approach are described in Table I and Table II.

TABLE I. ABSTRACTIVE TEXT SUMMARIZATION METHODS: USING STRUCTURED BASED APPROACH [4]

Tree Based Method:
- Uses a dependency tree to represent the text of a document.
- Uses either a language generator or an algorithm to generate the summary.

Template Based Method:
- Uses a template to represent a whole document.
- Linguistic patterns or extraction rules are matched to identify text snippets that are mapped into template slots.

Ontology Based Method:
- Uses an ontology (knowledge base) to improve the summarization process.
- Exploits fuzzy ontology to handle uncertain data that a simple domain ontology cannot.

Lead and Body Phrase Method:
- Based on operations on phrases (insertion and substitution) that have the same syntactic head chunk in the lead and body sentences, in order to rewrite the lead sentence.

Rule Based Method:
- Documents to be summarized are represented in terms of categories and a list of aspects.
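Before turning to the semantic-based methods in Table II, here is a tiny illustration of the template-based row above: an extraction rule matches a text snippet and the match is mapped into template slots. This is only a sketch; the regular-expression rule, the template, and the sample sentence are invented for illustration and are not from the cited survey [4].

import re

# A hypothetical extraction rule: match "X acquired Y for Z" style snippets
rule = re.compile(r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+) "
                  r"for (?P<price>\$[\d.]+ \w+)")

# A hypothetical template whose slots the matched snippets fill
template = "{buyer} bought {target} ({price})."

text = "Contoso acquired Fabrikam for $2.1 billion last week."
match = rule.search(text)
if match:
    # Map the matched text snippets into the template slots
    print(template.format(**match.groupdict()))
    # -> Contoso bought Fabrikam ($2.1 billion).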
TABLE II. ABSTRACTIVE TEXT SUMMARIZATION METHODS: USING SEMANTIC BASED APPROACH [4]

Multimodal Semantic Model:
- A semantic model, which captures concepts and the relationships among concepts, is built to represent the contents of multimodal documents.

Information Item Based Method:
- The contents of the summary are generated from an abstract representation of the source documents, rather than from the sentences of the source documents.
- The abstract representation is the Information Item, which is the smallest element of coherent information in a text.

Semantic Graph Based Method:
- Summarizes a document by creating a semantic graph called a Rich Semantic Graph (RSG) for the original document and then reducing the generated semantic graph.

Text Summarization with Neural Networks:
- Involves training neural networks to learn the types of sentences that should be included in the summary.
- Uses a three-layered feed-forward neural network.
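As a rough sketch of the neural-network row above, the following shows the shape of a three-layered feed-forward scorer over hand-crafted sentence features. The feature set, layer sizes, and (untrained, random) weights are all illustrative assumptions; Table II specifies nothing beyond "three-layered feed-forward".

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Three layers: input features -> hidden layer -> include/exclude score
n_features, n_hidden = 4, 8   # e.g. position, length, title overlap, term freq
W1, b1 = rng.normal(size=(n_hidden, n_features)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(1, n_hidden)), np.zeros(1)

def sentence_score(features):
    hidden = np.tanh(W1 @ features + b1)         # hidden layer
    return sigmoid(W2 @ hidden + b2).item()      # probability-like score

# Untrained example: a toy feature vector for one sentence
print(sentence_score(np.array([0.0, 0.6, 0.3, 0.8])))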
B. Extractive Text Summarization

The different techniques for achieving extractive text summarization are listed in Table III; a small illustration of one of them follows the table.
TABLE III. EXTRACTIVE TEXT SUMMARIZATION TECHNIQUES [4]

Term Frequency-Inverse Document Frequency (TF-IDF) Method:
- Sentence frequency is defined as the number of sentences in the document that contain a given term.
- The sentence vectors are scored by their similarity to the query, and the highest-scoring sentences are picked to be part of the summary.

Cluster Based Method:
- It is intuitive to think that summaries should address the different "themes" appearing in the documents.
- If the document collection for which the summary is being produced covers totally different topics, document clustering becomes almost essential to generate a meaningful summary.
- Sentence selection is based on the similarity of the sentence to the theme of its cluster (Ci), the location of the sentence in the document (Li), and its similarity to the first sentence of the document to which it belongs (Fi). The sentence score is
  Si = W1 * Ci + W2 * Fi + W3 * Li,
  where W1, W2, and W3 are the weights used for inclusion in the summary.
- The k-means clustering algorithm is applied.

Graph Theoretic Approach:
- A graph-theoretic representation of passages provides a method for identifying themes.
- After the common pre-processing steps, namely stemming and stop-word removal, the sentences in the documents are represented as nodes in an undirected graph.

Machine Learning Approach:
- The summarization process is modelled as a classification problem: sentences are classified as summary sentences or non-summary sentences based on the features they possess.
- The classification probabilities are estimated statistically using the Naive Bayes rule:
  P(s ∈ S | F1, F2, ..., FN) = P(F1, F2, ..., FN | s ∈ S) * P(s ∈ S) / P(F1, F2, ..., FN),
  where S is the set of summary sentences and F1, ..., FN are the features.

LSA Method:
- The method gets the name LSA (Latent Semantic Analysis) because SVD, applied to document-word matrices, groups documents that are semantically related to each other even when they do not share common words.

Automatic Text Summarization Based on Fuzzy Logic:
- Considers each characteristic of a text, such as similarity to the title, sentence length, and similarity to keywords, as an input to a fuzzy system.

An Approach to Concept-Obtained Text Summarization:
- The idea of this approach is to obtain the concepts of words based on HowNet and to use concepts, instead of words, as features.
- It uses a conceptual vector space model to form a rough summarization and then calculates the degree of semantic similarity between sentences to reduce redundancy [5].

Text Summarization Using Regression for Estimating Feature Weights:
- Mathematical regression is a good model for estimating text feature weights. In this model, a mathematical function relates the output to the input.
- The feature parameters of many manually summarized English documents are used as independent input variables, and the corresponding dependent outputs are specified in the training phase. A relation between inputs and outputs is established, and testing data are then introduced to the system model to evaluate its efficiency [5].

Multi-Document Extractive Summarization:
- Deals with the extraction of summarized information from multiple texts written about the same topic.
- The resulting summary report allows individual users, as well as professional information consumers, to quickly familiarize themselves with the information contained in a large cluster of documents.
- Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions being put together and outlined, every topic is described from multiple perspectives within a single document [5].
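To illustrate the cluster-based scoring formula Si = W1 * Ci + W2 * Fi + W3 * Li from Table III, here is a minimal sketch. The cosine_sim helper, the weight values, the position-based definition of Li, and the toy term-count vectors are all assumptions made for illustration; the cited survey [4] does not fix them.

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two term-count vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_score(sent_vec, theme_vec, first_sent_vec, position, n_sentences,
                  w1=0.5, w2=0.3, w3=0.2):
    # Si = W1*Ci + W2*Fi + W3*Li (weights are illustrative, not prescribed)
    c_i = cosine_sim(sent_vec, theme_vec)        # similarity to cluster theme
    f_i = cosine_sim(sent_vec, first_sent_vec)   # similarity to first sentence
    l_i = 1.0 - position / n_sentences           # earlier sentences score higher
    return w1 * c_i + w2 * f_i + w3 * l_i

# Example with toy term-count vectors (assumed values):
print(cluster_score([1, 2, 0], [1, 1, 1], [2, 1, 0], position=0, n_sentences=5))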

IV. COSINE SIMILARITY

While the approaches presented in Tables I, II, and III have been shown to produce good summaries, it is still worthwhile to build a simple text summarization tool using the Cosine Similarity Measure. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Since we will be representing our sentences as vectors, we can use it to find the similarity among sentences: the angle between the vectors will be 0 if the sentences are similar [3]. Cosine similarity is computed to get a grasp of which sentences are related to each other and can be included in the final summary. Sentence vectors are created by listing all the words in the two sentences being compared and counting the occurrences of each word in each sentence. The cosine similarity of the two sentence vectors is then calculated. The formula for cosine similarity is shown in Fig. 1.

Fig. 1. Cosine similarity formula: similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)
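To make the computation concrete, the following is a minimal worked example of the sentence-vector construction described above. The two example sentences are assumptions for illustration and do not come from the paper's sample file.

from math import sqrt

# Two toy sentences (illustrative)
sent1 = "the cat sat on the mat".split()
sent2 = "the cat lay on the rug".split()

# List every word in the two sentences, then count occurrences per sentence
all_words = sorted(set(sent1 + sent2))
v1 = [sent1.count(w) for w in all_words]
v2 = [sent2.count(w) for w in all_words]

# Cosine similarity: (A . B) / (||A|| * ||B||), as in Fig. 1
dot = sum(a * b for a, b in zip(v1, v2))
cos = dot / (sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2)))
print(round(cos, 3))  # prints 0.75 for these two overlapping sentences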
The Python Program Using Cosine Similarity

By applying the formula in a Python program, a simple text summarizer was created. The code for the program is shown below:

#!/usr/bin/env python
# coding: utf-8

import re

import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx


def read_article(file_name):
    # Read the file and split its first line into sentences on ". "
    file = open(file_name, "r")
    filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []

    for sentence in article:
        # Strip non-letters, then split each sentence into words
        sentences.append(re.sub("[^a-zA-Z]", " ", sentence).split(" "))
    sentences.pop()

    return sentences


def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # Build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # Build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)


def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # ignore if both are the same sentence
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(
                sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(file_name, top_n=5):
    nltk.download("stopwords")
    stop_words = stopwords.words('english')
    summarize_text = []
    # Flow: input article → split into sentences → remove stop words →
    # build a similarity matrix → generate rank based on matrix →
    # pick top N sentences for the summary.

    # Step 1 - Read the text and split it
    sentences = read_article(file_name)

    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in the similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by rank and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)),
                             reverse=True)

    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    print("Summarized Text: \n", ". ".join(summarize_text))


# Let's begin
generate_summary("sample1.txt", 1)

There are four functions in the code: read_article, sentence_similarity, build_similarity_matrix, and generate_summary. The read_article function opens and tokenizes the text file named in the code; sentence_similarity computes the similarity of two sentences using cosine similarity; and build_similarity_matrix builds the similarity matrix over all sentence pairs. Lastly, generate_summary is the function called to generate the summary; it is given the name of the text file and the preferred number of top sentences.
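For context, Step 3 of generate_summary ranks sentences by running PageRank over the graph whose edge weights are the pairwise cosine similarities; this mirrors the TextRank family of methods used in the tutorial cited as [3]. A hypothetical further invocation is shown below; the file name is an assumption, and note that read_article as written only reads the first line of the file, so the article text should sit on a single line.

# Hypothetical usage: produce a three-sentence summary of "article.txt"
generate_summary("article.txt", top_n=3)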


V. CONCLUSION

From this review of text summarization techniques, we were able to create an automatic text summarizer using Python. The method used for summarization is cosine similarity. In the end, it was shown that the cosine similarity measure can be used to create an automatic text summarization tool.
ACKNOWLEDGMENT

First of all, I would like to thank the Almighty God for giving me overflowing knowledge and wisdom to make this review and the project a successful one, and for providing me with everything I needed from the start until the end of this project. I would also like to thank my parents for their unwavering support as I did this research.

REFERENCES

[1] D. Dean, "Original Article with Highlighting and Annotations Bats."
[2] "Unsupervised Text Summarization using Sentence Embeddings." [Online]. Available: https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1. [Accessed: 18-Sep-2019].
[3] "Understand Text Summarization and create your own summarizer in python." [Online]. Available: https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70. [Accessed: 18-Sep-2019].
[4] D. K. Gaikwad and C. N. Mahender, "A Review Paper on Text Summarization," Int. J. Adv. Res. Comput. Commun. Eng., vol. 5, no. 3, pp. 154–160, 2016.
[5] V. Gupta and G. S. Lehal, "A Survey of Text Summarization Extractive Techniques," J. Emerg. Technol. Web Intell., vol. 2, no. 3, pp. 258–268, 2010.
