
Publication House Identification

01-06-20
Department of Electronics & Communication Engineering
MANIT Bhopal

Major Project Report

Group No. : 24

Project Title: Identification of Publication House of any Article using ML

S.No. Scholar No. Name Signature


1 161114094 Shaily Jaloree

2 161114084 Sulabh Jain

3 161114075 Akshay Shrivastava

4 161114106 Diksha Sharma

5 161114115 Deepali Rawat

Name & Signature of faculty mentor : Dr. R.K. Baghel

Title & Area of Work
Publication House Identification

We wished to build an engine that would identify characteristic patterns in the editorial pieces of leading newspapers and, given a new text, identify the newspaper in which it was published.

Identifying the author of a text article is a real-world problem encountered in many different areas of work. In academics, for instance, we often require a mechanism to detect plagiarism.

“When you have wit of your own, it's a pleasure to credit other people for theirs.”

We think this engine is possible because, unlike routine reporting, when a news agency writes an editorial there are certain standards the agency tends to follow. We also believe that every newspaper targets a particular set of readers, which makes the richness of words used by different newspapers a factor for segregation. Our underlying assumption is therefore:

"Every newspaper editorial team has a TRUE, statistically representable style of which their work is a slight variation."

For example, although both of the following excerpts are on demonetization, the writing style differs when they are taken from different newspapers.

DECLARATION:-
We, the undersigned, solemnly declare that the project report Publication House Identification is based on our own work carried out during our study under the supervision of Dr. R.K. Baghel.

We assert that the statements made and conclusions drawn are an outcome of our research work. We further certify that:

I. The work contained in the synopsis is original and has been done by us under the general supervision of our supervisor.
II. The work has not been submitted to any other institution for any other degree/diploma/certificate in this university or any other university in India or abroad.
III. We have followed the guidelines provided by the university in writing the report.
IV. Whenever we have used materials (data, theoretical analysis, and text) from other sources, we have given due credit to them in the text of the report and provided their details in the references.

Name - Sulabh Jain


Scholar No- 161114084

Name - Akshay Shrivastava


Scholar No- 161114075

Name - Shaily Jaloree


Scholar No- 161114094

Name - Diksha Sharma


Scholar No- 161114106

Name - Deepali Rawat


Scholar No- 161114115

CERTIFICATE
MAULANA AZAD NATIONAL INSTITUTE OF TECHNOLOGY
Bhopal-462003

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

This is to certify that Akshay Shrivastava, Deepali Rawat, Diksha Sharma, Shaily Jaloree and Sulabh Jain, students of fourth year, Bachelor of Technology, ELECTRONICS AND COMMUNICATION ENGINEERING, have successfully completed their major project on Publication House Identification in partial fulfillment of the requirements of their Bachelor of Technology in Electronics and Communication Engineering.

Guided by: Dr. R.K Baghel Project Coordinator: Dr. Dheeraj Agarwal

ACKNOWLEDGMENT

We have put considerable effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations. We would like to extend our sincere thanks to all of them.

We are highly indebted to Dr. R. K. Baghel for his guidance and constant supervision, for providing the necessary information regarding the project, and for his support in completing it. We cannot thank him enough for his tremendous support and help. We felt motivated and encouraged every time we attended his meetings. Without his encouragement and guidance this project would not have materialized.

We would also like to thank the developers of the Scrapy, NLTK and scikit-learn libraries in Python, which greatly simplified our tasks of scraping, feature extraction and classification respectively.

It was a great learning exercise for us and we sincerely thank you for giving us this
opportunity.

Table of Contents

1. Abstract
2. Introduction
3. Literature Review
4. Methodology & Work Description
5. Tools & Technology Used
6. Implementation & Coding
7. Result & Analysis
8. Conclusion & Future Scope
9. References

Abstract:-
Author identification of a text is a real issue encountered across various walks of life. Grading of an individual's work is a key attribute of distance education, yet it is difficult for institutions to verify whether the individual participating is the person enrolled. Academics continuously check for plagiarism, while intellectuals often recover unattributed texts whose authors need to be identified. This is what makes this field of machine learning particularly appealing.

In our project we aim to contribute to this area of text analytics. Building on the existing body of work, we train learning algorithms to identify patterns in the editorial pieces published by news publication houses every day, and then attribute any given editorial to its publishing house. Just as different authors have different writing styles, which makes author identification an achievable task, the same idea can be applied to associating a given text article with a publication house.

HOW DO WE DIFFERENTIATE?

Every newspaper editorial is written and supervised by a team of editors. The characteristic traits of an editorial are decided by the supervising editor (the author); identifying these traits is therefore akin to author identification. Moreover, each publication aims at a specific demographic: "The Hindu" is generally aimed at avid readers and intellectuals and therefore uses a richer vocabulary, whereas "TOI" targets a broader readership and tends to use more general vocabulary.

We try to exploit these two points to extract features for our analysis.

Introduction:-

In the era of the information revolution, electronic documents are becoming the principal media of business and academic information. An enormous amount of data is produced and released on the internet every day. To recognize the real author of each of these articles or documents, we need a model that can do so on its own, in a self-learning way.

Text mining is a young field that attempts to extract necessary information from natural language text. It may be defined as the process of analyzing text to extract information that is useful for particular purposes. Unlike the data stored in databases, text is unstructured and ambiguous, and is therefore algorithmically complex to deal with.

A standard approach to plagiarism detection first defines a set of style markers and then either manually counts these markers in the text under study or finds tools that can provide these counts accurately.

NLTK is a well-known platform for writing Python programs that work with human language data. It provides user-friendly interfaces to over 50 lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Applications of text categorization are numerous. Similar techniques are also being used at the sentence level, rather than the document level, for word sense disambiguation.
Literature Review:-
It is difficult for institutions to detect plagiarism when assessing answers submitted by students; hence the need for author identification. It also plays an important role in e-mails, short text messages and many other domains for verifying author identity.

The initial approach was based on Bayesian statistical analysis of the occurrences of prepositions and conjunctions like 'or', 'to' and 'and', thereby giving a way to differentiate the identity of each author.
Later approaches included the "Type-Token Ratio", which was traditionally used for small datasets. Analysts have calculated the average repetition of words to be around 40% of the original word count. Here 'types' represents the number of unique words in the text and 'tokens' represents the total number of words.

The more recent technique, and the basis of our attempt, is the "CUSUM" technique, which assumes that every author tends to use a certain set of words as part of his or her writing style. The proper nouns in the text are treated as a single symbol, the average sentence length is calculated, and each sentence in the document is marked '+' or '-' relative to that average.

"Readability Measures" are based on the complexity of a document, which depends on the number of technical terms in the text. The readability of the document is calculated by counting the technical terms and the average number of syllables per word (readability index formulae are language-dependent; they do not measure understandability and do not consider the actual content). This method is said to be one of the more accurate methods of author identification.

Instead of counting how many words are used a certain number of times, an alternative approach examines how many times individual words are used in the text under study.
This seems to be the most promising method since it requires minimal
computational cost and achieves remarkable results for a wide variety of authors.
Methodology:-
With the underlying assumption that "every newspaper has its own writing style", we researched existing literature and similar projects for leads on potential methodologies. While reading a Stanford research paper ("Whose Book Is It Anyway?") we learned that every author has his or her own writing style, and that the same idea can be applied to newspapers, because every newspaper editorial page is written by a small team of authors, and that team must follow certain standards and guidelines while writing. We therefore took author identification as the basis of our project.

We realized that the most orthodox methodology used to solve popular text analytics problems such as sentiment analysis and document classification is the semantically based deep text analysis approach. But for our purpose, semantics would not be identifiably different from one newspaper to another.

Take, for example, a sentiment analysis problem where we are classifying tweets into positive and negative categories. In this case we know that a positive tweet will have words like 'good', 'impressed', 'excellent', etc., whereas a negative tweet will have words like 'disappointed', 'bad', 'enraged', etc. But different newspaper articles on the same topic would share similar keywords; they would differ from each other more at the level of writing.

It is with this logic that we chose the shallow text analysis approach: it goes hand in hand with our assumption of a statistically based author style and, more importantly, is significantly less expensive to compute.

Further research into shallow text analysis suggested three main approaches:

Token Based: Word-level features such as word length and n-grams.

Vocabulary Based: Vocabulary richness indicators, such as the number of unique words.

Syntactic Features: Sentence-based features, such as parts of speech and punctuation.

(Reference - Whose Book is it Anyway?, Stanford University, CA 94305)

Each of these feature approaches has its pros and cons. Token-level features are easy but, especially in the case of n-grams (contiguous sequences of n words), slow to compute. Vocabulary features are both fast and easy to compute, but can have a high variance. Syntactic features share many of their attributes with vocabulary features, but can be quite complex to compute.

The construction of the models can be divided into the following stages:-

1) Data Collection -
The process of collecting data from varied sources for training and testing our model. We need data from the websites of different publications such as Times of India and The Hindu, and for this purpose we used the "Scrapy" library. Our input files are JSON files containing the editorial articles scraped from the websites of 5 different news agencies using Scrapy.
For this project we have used data from The Guardian, Times of India, Economic Times, Indian Express and Deccan Chronicle.
We were able to scrape 11,000+ news articles in total.

2) Data Cleaning and Pre-processing -

The data collected so far is in raw form, so we cannot use it directly in our algorithm. After collecting the data, the next task is to pre-process it. For the publication data, preprocessing involves the following:
a) After scraping the data we found that the same publication name was producing different classes. For example, some publication names appeared as ['Deccan Chronicle''Deccan Chronicle'] instead of ['Deccan Chronicle'], with the name printed twice.
b) The scraped data contains special characters and newline characters, so we replace all of them with spaces and store one article per line.

3) Feature Extraction -

Given the huge size of the datasets used for classification, computationally expensive features were unappealing. For example, one of the most widely used features in the field of text analytics is n-grams, defined as the occurrence of 'n' words in a sequence. Although a powerful feature, its computational complexity restricted its usage in our project.

The vocabulary indicators we have used are standard and widely used. We also decided to look at different parts of speech such as pronouns, conjunctions and prepositions, as they all have small word pools but are still powerful in determining writing style, representing subject shifts and clause shifts respectively. Our final feature vector treats each article as an individual training example and consists of 108 features divided across seven categories. Note that, aside from a check against a small pool of conjunctions and pronouns, we do not consider the semantic meaning of any individual word; we are only concerned with a word's length and uniqueness.

(Reference - Whose Book is it Anyway?, Stanford University, CA 94305, Table 1)

The NLTK library was very helpful during feature extraction. Most of our steps, such as removal of stop words, stemming and part-of-speech tagging, were greatly simplified by the various NLTK modules.

4) Classification Methodology -

We used multiple popular classifiers to train and predict, namely the Perceptron, Random Forest Classifier, AdaBoost Classifier and Bernoulli Naive Bayes Classifier. The scikit-learn library in Python made these classifiers easy to implement.

We carried out a stratified shuffle split, reserving 30% of the data for testing purposes. The stratified shuffle split helps preserve the class distribution of the original data in the train-test split. We also carried out feature selection using a k-best features metric with k = 2 * num_features / 3 in order to remove redundant features (some features were redundant because the text is not varied enough to provide non-zero values for every data point of every distribution listed in the feature table).
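The split-and-select procedure can be sketched with scikit-learn as follows. The feature matrix here is random synthetic data standing in for our 108 stylometric features and 5 newspaper labels; only the split ratio and k = 2 * num_features / 3 follow the description above.

```python
# Stratified 70/30 split followed by k-best feature selection on synthetic data.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.random((300, 108))          # 300 articles x 108 stylometric features (placeholder values)
y = rng.integers(0, 5, size=300)    # 5 newspaper classes (placeholder labels)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.30, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

k = 2 * X.shape[1] // 3             # keep the 72 highest-scoring features
selector = SelectKBest(f_classif, k=k).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)
```

Fitting the selector on the training split only avoids leaking information from the test set into feature selection.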

Using the matplotlib library, we generated confusion matrices for the different classifiers. By inspecting them we could see the percentage of text articles classified correctly for each newspaper.

Tools & Technology Used:-
1. Pandas - Pandas is a Python library used for data handling and analysis via its DataFrame structure. It also facilitates reading CSV files and preprocessing some data attributes.

2. NLTK - NLTK is a well-known platform for building Python programs that work with human language data. We used it mainly for feature extraction. Tasks such as stop-word removal, stemming and part-of-speech tagging were greatly simplified by the various NLTK modules; for example, we used the Porter stemmer for stemming.

3. Scikit-learn - With the help of the sklearn library in Python, we implemented various classifiers such as RandomForest and BernoulliNB.

4. Matplotlib - With the help of matplotlib we generated confusion matrices and other plots for data analysis.

5. Scrapy - With the help of the Scrapy library, we were able to fetch data from various websites such as:
http://blogs.economictimes.indiatimes.com/et-editorials
http://www.financialexpress.com/print/edits-columns

6. Jupyter Notebook - We used this open-source platform for data cleaning and for sharing code with each other. Sharing of graphs and plots was made very easy by using Jupyter.

Implementation & Coding:-

Data Extraction-

Scraping data using Scrapy for the different newspapers.

Data Cleaning-

Opening the stored data, which was in a JSON file, and storing all the articles in Python lists.

Replacing the special characters (such as '*', '\n', '\0', etc.) with blank spaces and then storing each article in a CSV file along with its publication name.
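A minimal sketch of this cleaning step, assuming each scraped record carries "publication" and "text" keys (the actual key names depend on our Scrapy output):

```python
# Load scraped articles from JSON, replace special characters and newlines with
# spaces, and write one cleaned article per CSV row with its publication name.
import csv
import io
import json
import re

raw = '[{"publication": "The Hindu", "text": "Line one.\\nLine *two*\\u0000!"}]'
articles = json.loads(raw)

rows = []
for a in articles:
    cleaned = re.sub(r"[\*\n\x00]", " ", a["text"])      # special chars -> spaces
    cleaned = re.sub(r"\s+", " ", cleaned).strip()       # collapse repeated spaces
    rows.append((a["publication"], cleaned))

buf = io.StringIO()
csv.writer(buf).writerows(rows)      # one article per line with its label
print(rows[0][1])
```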

Feature Extraction-

Creating the placeholders for the 108 features to be extracted.

Extracting all the features, calculating word lengths, and storing them in a data frame.
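The word-length portion of this step can be sketched as follows; the feature names (avg_word_len, max_word_len, num_words) are illustrative stand-ins for a small part of the full 108-feature vector.

```python
# Compute simple word-length features for each article and collect them in a
# pandas DataFrame, one row per article.
import pandas as pd

articles = [
    "Demonetization was announced suddenly.",
    "The move hit small traders.",
]

rows = []
for text in articles:
    words = text.replace(".", " ").split()
    lengths = [len(w) for w in words]
    rows.append({
        "avg_word_len": sum(lengths) / len(lengths),
        "max_word_len": max(lengths),
        "num_words": len(words),
    })

df = pd.DataFrame(rows)
print(df)
```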

POS (part-of-speech) tagging is done using the NLTK library.

The distribution of pronouns is normalized by the total number of sentences in the article, and likewise the distribution of conjunctions.
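A sketch of the per-sentence normalization follows. The tagged sentences are hard-coded here for illustration; in the project NLTK's pos_tag produces these (tag, word) pairs, with 'PRP' marking personal pronouns and 'CC' coordinating conjunctions in the Penn Treebank tagset.

```python
# Count pronouns (PRP) and conjunctions (CC) across an article's tagged
# sentences and normalize each count by the number of sentences.
tagged_sentences = [
    [("He", "PRP"), ("writes", "VBZ"), ("and", "CC"), ("edits", "VBZ")],
    [("They", "PRP"), ("agree", "VBP")],
]

num_sentences = len(tagged_sentences)
num_pronouns = sum(tag == "PRP" for sent in tagged_sentences for _, tag in sent)
num_conjunctions = sum(tag == "CC" for sent in tagged_sentences for _, tag in sent)

pronouns_per_sentence = num_pronouns / num_sentences
conjunctions_per_sentence = num_conjunctions / num_sentences
print(pronouns_per_sentence, conjunctions_per_sentence)
```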

Calculating the number of unique words, words that occur exactly once, words that occur exactly twice, and the number of stop words.
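These vocabulary counts can be sketched as below. The stop-word set here is a tiny illustrative subset; the project uses NLTK's full English stop-word list.

```python
# Vocabulary-richness counts for one article: unique words, hapax legomena
# (words occurring exactly once), words occurring exactly twice, and stop words.
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "in"}   # illustrative subset only

words = "the cat and the dog sat in the sun and slept".lower().split()
counts = Counter(words)

num_unique = len(counts)
hapax = sum(1 for c in counts.values() if c == 1)   # exactly once
dis = sum(1 for c in counts.values() if c == 2)     # exactly twice
num_stop = sum(1 for w in words if w in STOPWORDS)
print(num_unique, hapax, dis, num_stop)
```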

Classification -

This is the last stage. Here we read our feature file, split the data with the help of a stratified shuffle split, and select the best 2/3 of the features.

We then use the different classifiers for prediction and create a confusion matrix for each.
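This stage can be sketched as follows, using the four classifiers named earlier. The feature matrices are random synthetic placeholders, so the scores are not meaningful; only the workflow (fit each classifier, score it, build a confusion matrix) reflects our pipeline.

```python
# Train the project's four classifiers on placeholder features and build a
# confusion matrix (rows = true newspaper, columns = predicted newspaper).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
X_train, y_train = rng.random((200, 72)), rng.integers(0, 5, 200)
X_test, y_test = rng.random((60, 72)), rng.integers(0, 5, 60)

classifiers = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Perceptron": Perceptron(random_state=0),
    "BernoulliNB": BernoulliNB(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))

cm = confusion_matrix(y_test, classifiers["RandomForest"].predict(X_test))
print(cm)
```

The matrix can then be rendered with matplotlib (e.g. plt.imshow) as described in the methodology.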

Result Analysis:-
The table below summarizes the accuracy of the various classifiers. The highest accuracy that we were able to achieve was 75.4%, using the Random Forest Classifier.

The results of our project are consistent, and slightly more accurate than those achieved by the Stanford students' project to predict the author of a novel.

Classifier              Accuracy (%)
RandomForestClassifier  75.4
AdaBoostClassifier      74.2
Perceptron              65.1
BernoulliNB             65.4

Different confusion matrices for different classifiers-

By observing these matrices, we infer that the Perceptron model misclassifies a lot of instances; it would produce better results if we increased the number of training inputs. We also observe that the newspaper with the largest data pool produced good results in comparison to the other newspapers: accuracy is highest, and the misclassification rate lowest, for The Guardian.

Conclusion and Future Scope:-
We believe a lot of further work can be done in this field, both to increase the accuracy of the model and to broaden the types of problems addressed. More than the classification algorithm, we think work has to be put into extracting better and more revealing features from the text.

In this project we used the most intuitive measures we could come up with. More in-depth research into textual features might reveal better ones, and more data would also improve the accuracy of the model.

We also think more work could be put into identifying the topic on which an article is written. We were able to extract the tags associated with each article, and they should help us in topic classification.

Another application of this project could be to predict how viral an editorial will be, given attributes such as the date, day and sales of the paper when the editorial was published. We believe these are all interesting and worthwhile problems to work on in this age of so many opinions and news sources, where we want to choose our news content more carefully and maintain a healthy balance of perspectives.

We can also break the dataset into smaller binary classification problems, taking editorials from only two newspapers at a time.

References:-

1. Stanko, S., Lu, D., Hsu, I. Whose Book is it Anyway? Using Machine Learning to Identify the Author of Unknown Texts. Stanford University.

2. Luyckx, K., Daelemans, W. Shallow Text Analysis and Machine Learning for Authorship Attribution.

3. https://ieeexplore.ieee.org/document/7506654

4. https://cloud.google.com/blog/products/gcp/problem-solving-with-ml-automatic-document-classification

5. Chandani, V., Deshmane, N., Buva, K., Apte, S., Prasad, R.S. Study of Different Methods for Author Identification. NBN Sinhgad School of Engineering, Savitribai Phule Pune University, Pune, India.

