Department of Electronics & Communication Engineering
MANIT Bhopal
Group No. : 24
Title & Area of Work
Publication House Identification
We wished to build an engine that identifies characteristic patterns in the editorial pieces of leading
newspapers and, given a new text, identifies the newspaper in which it was published.
Identifying the author of a text article is a real-world problem encountered in many
different areas of work. In the field of academics, for example, we often require a mechanism to
detect plagiarism.
We think such an engine is possible because, unlike routine reporting, when a news agency decides
to write an editorial there are certain standards the agency tends to follow.
We also believe that every newspaper targets a particular set of readers. This makes the
richness of the vocabulary used by different newspapers a factor for segregation.
Our underlying assumption is therefore:
"Every newspaper editorial team has a TRUE, statistically representable style of which its
individual pieces are slight variations."
DECLARATION :-
We, the undersigned, solemnly declare that the project report Publication House
Identification is based on our own work, carried out during our study under the supervision of
Dr. R. K. Baghel.
We assert that the statements made and conclusions drawn are an outcome of our research work. We
further certify that:
I. The work contained in the report is original and has been done by us under the general
supervision of our supervisor.
II. The work has not been submitted to any other institution for any other
degree/diploma/certificate in this university or any other university in India or abroad.
III. We have followed the guidelines provided by the university in writing the report.
IV. Whenever we have used material (data, theoretical analysis, or text) from other sources,
we have given due credit in the text of the report and listed the details in the
references.
CERTIFICATE
MAULANA AZAD NATIONAL INSTITUTE OF
TECHNOLOGY
Bhopal-462003
Guided by: Dr. R. K. Baghel Project Coordinator: Dr. Dheeraj Agarwal
ACKNOWLEDGMENT
We have put considerable effort into this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. We would
like to extend our sincere thanks to all of them.
We are highly indebted to Dr. R. K. Baghel for his guidance and constant
supervision, for providing the necessary information regarding the project, and
for his support in completing it. We cannot thank him enough for
his tremendous support and help. We felt motivated and encouraged every time we
attended his meetings. Without his encouragement and guidance this project would not
have materialized.
We would also like to thank the developers of the Scrapy, NLTK and scikit-learn
Python libraries, which greatly simplified our tasks of scraping, feature
extraction and classification respectively.
It was a great learning exercise for us and we sincerely thank you for giving us this
opportunity.
Table of Contents
1. Abstract
2. Introduction
3. Literature Review
4. Methodology & Work Description
5. Tools & Technology Used
6. Implementation & Coding
7. Result & Analysis
8. Conclusion & Future Scope
9. References
Abstract:-
Author identification of a text is a real issue encountered across various
walks of life. Grading an individual's work is a key activity in distance
education, yet it is difficult for institutions to verify whether the individual
participating is the person enrolled. Academics continuously check for plagiarism, while
scholars often recover unattributed texts whose authors need to be identified.
This is what makes this field of machine learning particularly appealing.
HOW DO WE DIFFERENTIATE?
We try to exploit the two points made earlier, editorial standards and
reader-targeted vocabulary, to extract features for our analysis.
Introduction:-
The initial approach was based on Bayesian statistical analysis of the occurrences
of prepositions and conjunctions such as 'or', 'to' and 'and', thereby giving a way to
differentiate the identity of each author.
Later approaches included the "Type-Token Ratio", traditionally used
for small datasets. Analysts have estimated the average repetition of words to be
around 40% of the original word count. The term 'types' refers to the number of
unique words in the text, while 'tokens' refers to the total number of words.
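As a quick illustration, the ratio above can be computed in a few lines of Python. The function below is our own sketch, not the project's code, and uses naive whitespace tokenization:

```python
def type_token_ratio(text):
    """Return (number of unique words) / (total number of words).

    Illustrative sketch: tokens are found by simple whitespace
    splitting after lowercasing, which ignores punctuation.
    """
    tokens = text.lower().split()   # 'tokens' = all words
    types = set(tokens)             # 'types'  = unique words
    return len(types) / len(tokens) if tokens else 0.0
```

A text in which every word is unique scores 1.0; heavy repetition drives the ratio down.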
The more recent technique, and the basis of our attempt, is the "CUSUM" technique,
which assumes that every author tends to use a particular set of words as part of his/her
writing style. The proper nouns used in the text are treated as a single symbol, the average
sentence length is calculated, and each sentence of the document is marked '+' or '-'
according to how it compares to that average.
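The marking step can be sketched as follows. This is a simplified illustration of the CUSUM idea only; the sentence splitting and function name are our own, not the exact procedure used:

```python
import re

def cusum_marks(text):
    """Mark each sentence '+' or '-' depending on whether its word
    count is above or below the document's average sentence length.

    Simplified sketch: sentences are split on ., ! and ?, and
    proper-noun collapsing is omitted.
    """
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    avg = sum(lengths) / len(lengths)
    return ['+' if n > avg else '-' for n in lengths]
```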
We realized that the standard methodology used to solve popular text-analytics
problems such as sentiment analysis and document classification is the semantically
based deep text analysis approach. But for our purpose, semantics would not be
identifiably different from one newspaper to another.
Take, for example, a sentiment analysis problem where we are classifying
tweets into positive and negative categories. Here we know that a positive
tweet will contain words like 'good', 'impressed', 'excellent', etc., whereas a negative
tweet will contain words like 'disappointed', 'bad', 'enraged', etc. But different
newspaper articles on the same topic would share similar keywords; they vary
from each other more at the level of writing style.
It is with this logic that we chose the shallow text analysis approach: it
goes hand in hand with our assumption of a statistically based author style and,
more importantly, is significantly less expensive to compute.
Further research into shallow text analysis suggested three main feature
approaches: token-level, vocabulary, and syntactic.
Each has its pros and cons. Token-level features are easy to extract but,
especially in the case of n-grams (contiguous sequences of n words), slow to compute.
Vocabulary features are both fast and easy to compute, but can have high
variance.
Syntactic features share many of their attributes with vocabulary features, but can
be quite complex to compute.
1) Data Collection -
This is the process of collecting data from varied sources for training and testing our
model. We needed data from the sites of different publications such as the Times of
India and The Hindu, and for this purpose we used the "Scrapy" library. Our input
files are JSON files containing the editorial articles scraped from the websites of
five different news agencies using Scrapy.
For this project we used data from The Guardian, Times of India, Economic
Times, Indian Express and Deccan Chronicle.
We were able to scrape 11,000+ news articles in total.
2) Data Cleaning and Pre-processing -
The data collected from these sources is in raw form, so we cannot use it
directly in our algorithm. After collecting the data, the next task is to pre-process it.
For the publication data, preprocessing involves the following steps:
a) After scraping the data we noticed that the same publication name could produce
different classes: some labels appeared as ['Deccan Chronicle' 'Deccan Chronicle']
where the name should have been ['Deccan Chronicle'] but was printed twice.
b) The scraped text contains special characters and newline characters, so we replaced
them all with spaces and stored one article per line.
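Steps (a) and (b) above can be sketched as follows. The regular expressions and helper names are illustrative assumptions, not the project's actual code:

```python
import re

def clean_article(raw):
    """Step (b): replace newlines and special characters with spaces
    and squeeze repeated whitespace, so one article fits on one line.
    The character classes here are illustrative."""
    text = re.sub(r'[\n\r]', ' ', raw)            # newlines -> spaces
    text = re.sub(r'[^A-Za-z0-9 ]', ' ', text)    # special chars -> spaces
    return ' '.join(text.split())                 # squeeze spacing

def dedupe_label(labels):
    """Step (a): collapse a duplicated label list like
    ['Deccan Chronicle', 'Deccan Chronicle'] to a single name."""
    return labels[0]
```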
3) Feature Extraction -
Given the huge size of the datasets we used for classification, computationally
expensive features were unappealing. For example, one widely used feature in
text analytics is the n-gram, defined as an occurrence of n words in a
sequence. Although a powerful feature, its computational complexity restricted
its usage in our project.
The vocabulary indicators we have used are standard and widely used. We
also decided to look at different parts of speech such as pronouns, conjunctions and
prepositions: these have small word pools, yet are still powerful in
determining writing style, representing subject shifts and clause shifts respectively. Our
final feature vector treats each article as an individual training example
and consists of 108 features divided across seven categories. Note that, aside
from a check against a small pool of conjunctions and pronouns, we do not consider
the semantic meaning of any individual word; we are only concerned with a word's
length and uniqueness.
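A few of the feature categories described above can be sketched as follows. The word pools and feature names here are our own tiny stand-ins, not the 108 features actually used:

```python
# Illustrative small word pools (stand-ins, not the project's lists).
CONJUNCTIONS = {'and', 'but', 'or', 'yet', 'so', 'for', 'nor'}
PRONOUNS = {'he', 'she', 'it', 'they', 'we', 'i', 'you'}

def extract_features(article):
    """Return a handful of length/uniqueness/function-word features
    for one article, treated as one training example."""
    tokens = article.lower().split()
    n = len(tokens)
    return {
        'avg_word_length': sum(len(t) for t in tokens) / n,
        'conjunction_rate': sum(t in CONJUNCTIONS for t in tokens) / n,
        'pronoun_rate': sum(t in PRONOUNS for t in tokens) / n,
        'unique_ratio': len(set(tokens)) / n,
    }
```

Note that none of these features look at what a word means, only how long, how rare, and which grammatical pool it belongs to.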
(Reference - Whose Book is it Anyway?, Stanford University, CA 94305, Table 1)
The NLTK library was very helpful in feature extraction. Most of our steps, such as
stop-word removal, stemming and part-of-speech tagging, were greatly simplified
by NLTK's modules.
4) Classification Methodology -
We carried out a stratified shuffle split, reserving 30% of the data for testing
purposes. The stratified shuffle split helps preserve the class distribution of our
original data in the train-test split. We also carried out feature selection using
a k-best-features metric with k = 2 * num_features / 3 in order to remove redundant
features (there were some, because the text is not varied enough to provide
non-zero values for every data point of every distribution listed in the feature table).
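The split-and-select pipeline described above might look like the following scikit-learn sketch. The data here is synthetic; only the parameter values (30% test share, k = 2/3 of the features) come from the text:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 120 "articles", 9 toy features, 3 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 9))
y = np.repeat([0, 1, 2], 40)

# Stratified shuffle split, reserving 30% for testing.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

# Keep the best k = 2/3 of the features, fit on the training split only.
k = 2 * X.shape[1] // 3
selector = SelectKBest(f_classif, k=k).fit(X[train_idx], y[train_idx])
X_train = selector.transform(X[train_idx])
X_test = selector.transform(X[test_idx])
```

Fitting the selector on the training split alone avoids leaking test-set information into feature selection.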
Tools & Technology Used:-
1. Pandas - Pandas is a Python library used for data handling and analysis in terms
of DataFrames. It also facilitates reading CSV files and preprocessing some data
attributes.
2. NLTK - NLTK greatly simplified our feature-extraction steps such as stop-word
removal, stemming and part-of-speech tagging.
3. scikit-learn - We used scikit-learn for feature selection, the train-test split
and the classifiers themselves.
4. Matplotlib - With the help of Matplotlib we generated confusion matrices and
other plots for data analysis.
5. Scrapy - With the help of the Scrapy library, we were able to fetch data from various
websites such as:
http://blogs.economictimes.indiatimes.com/et-editorials
http://www.financialexpress.com/print/edits-columns
6. Jupyter Notebook - We used this open-source platform for data cleaning and for
sharing code with each other. Sharing of graphs and plots was made very easy by
using Jupyter.
Implementation & Coding:-
Data Extraction-
Data Cleaning-
Opening the stored data, which was in a JSON file, and storing all the articles in
Python lists.
Replacing the special characters (such as '*', '\n', '\0', etc.) with spaces and then
storing each article in a CSV file along with its publication name.
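A minimal sketch of this cleaning step, assuming the scraped JSON is a list of objects with 'text' and 'publication' fields (our assumption about the schema, not the actual one):

```python
import csv
import json
import re

def json_to_csv(json_path, csv_path):
    """Load scraped articles from a JSON file, replace special
    characters with spaces, and write one article per CSV row
    as (publication, cleaned_text).

    The field names 'text' and 'publication' are assumed."""
    with open(json_path, encoding='utf-8') as f:
        articles = json.load(f)
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for item in articles:
            text = re.sub(r'[*\n\0]', ' ', item['text'])
            writer.writerow([item['publication'], ' '.join(text.split())])
```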
Feature Extraction-
Extracting all the features, calculating word lengths, and storing them in a
data frame.
Calculating the number of unique words, words that occur exactly once, words that
occur exactly twice, and the number of stop words.
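These counts can be sketched with collections.Counter; the stop-word list below is a tiny stand-in for NLTK's, and the function is our own illustration:

```python
from collections import Counter

# Tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {'the', 'a', 'an', 'is', 'of', 'and', 'to', 'in'}

def uniqueness_counts(article):
    """Count unique words, words occurring exactly once ('hapax'),
    exactly twice ('dis'), and stop-word occurrences."""
    counts = Counter(article.lower().split())
    return {
        'unique_words': len(counts),
        'hapax': sum(1 for c in counts.values() if c == 1),
        'dis': sum(1 for c in counts.values() if c == 2),
        'stop_words': sum(c for w, c in counts.items() if w in STOP_WORDS),
    }
```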
Classification -
This is the last stage. Here we read our feature file, split the data using a
stratified shuffle split, and select the best 2/3 of the features.
We then use different classifiers for prediction and create a confusion matrix for each.
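The final stage might be sketched as follows with scikit-learn, using synthetic data in place of the real feature file; the classifier name matches the results table, but everything else here is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in: 150 "articles", 6 features, 3 "newspapers".
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))
y = np.repeat([0, 1, 2], 50)
X[y == 1] += 2.0   # shift classes apart so they are separable
X[y == 2] -= 2.0

# Train one of the classifiers from the results table and
# build its confusion matrix.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
cm = confusion_matrix(y, clf.predict(X))
```

Each row of the confusion matrix is a true class and each column a predicted class, so off-diagonal entries show exactly which newspapers get confused with which.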
Result & Analysis:-
The table below summarizes the accuracy of the various classifiers. The
highest accuracy we were able to achieve was close to 77%
using the Random Forest Classifier.
Our results are consistent and slightly more accurate than those achieved by the
Stanford students' project to predict the author of a novel.
Classifier               Accuracy (%)
RandomForestClassifier       75.4
AdaBoostClassifier           74.2
Perceptron                   65.1
BernoulliNB                  65.4
By observing these matrices, we infer that the Perceptron model misclassifies many
instances; it should produce better results with a larger number of training inputs.
We also observe that the newspaper with the largest data pool produced better
results than the others: accuracy is highest and the rate of
misclassification lowest for The Guardian.
Conclusion and Future Scope:-
We believe a lot of further work can be done in this field, both to greatly increase
the accuracy of the model and to broaden the types of problems addressed. More than the
classification algorithm, we think effort should go into extracting better, more
revealing features from the text.
In this project we used the most intuitive measures we could come up with. More
in-depth research into textual features might reveal better ones, and more
data would also improve the accuracy of the model.
We also think more work could be put into identifying the topic on which the article
is written. We were able to extract the tags associated with each article and they
should help us in topic classification.
Another application of this project could be to predict how viral an editorial will
be, given the date, day, sales, etc. of the paper in which it was
published.
We believe these will all be interesting and worthwhile problems to work on in this
day and age of so many opinions and news sources, where we want
to choose our news content more carefully and maintain a healthy balance of
perspectives.
We can also break the dataset into smaller binary classification problems taking
editorials from only two newspapers at a time.
References:-
1. Stanko, S., Lu, D., Hsu, I. Whose Book is it Anyway? Using Machine Learning to Identify the Author of Unknown Texts. Stanford University.
2. Luyckx, K., Daelemans, W. Shallow Text Analysis and Machine Learning for Authorship Attribution.
3. https://ieeexplore.ieee.org/document/7506654
4. https://cloud.google.com/blog/products/gcp/problem-solving-with-ml-automatic-document-classification