
“Data Redundancy using Ma-LSTM”

A report submitted in partial fulfillment of the requirements

Of

MINI-PROJECT

In

Sixth Semester

By

1MS16IS074 Rishu Verma


1MS16IS080 Sachin
1MS16IS054 Piyush Kumar

Under the guidance of

Dr. Sumana Maradithaya


Associate Professor
Dept. of ISE, RIT

RAMAIAH
Institute of Technology

DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING


RAMAIAH INSTITUTE OF TECHNOLOGY
(AUTONOMOUS INSTITUTE AFFILIATED TO VTU)
VIDYA SOUDHA
M. S. RAMAIAH NAGAR, M. S. R. I. T. POST, BANGALORE – 560054

2018-2019
RAMAIAH INSTITUTE OF TECHNOLOGY
(Autonomous Institute Affiliated to VTU)
VIDYA SOUDHA
M. S. Ramaiah Nagar, M. S. R. I. T. Post, Bangalore – 560054

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING

RAMAIAH
Institute of Technology

CERTIFICATE

This is to certify that the project work entitled “Data Redundancy using Ma-LSTM”
is a bonafide work carried out by Rishu Verma, Sachin, and Piyush Kumar, bearing
USNs 1MS16IS074, 1MS16IS080, and 1MS16IS054, in partial fulfillment of the
requirements of the Mini-Project course of Sixth Semester B.E. It is certified that
all corrections/suggestions indicated for internal assessment have been incorporated
in the report. The project has been approved as it satisfies the academic
requirements in respect of project work prescribed by the above said course.

_________________________                    __________________________
Signature of the Guide                       Signature of the HOD

Dr. Sumana Maradithaya                       Dr. Vijaya Kumar B P
Asst. Professor                              Professor and Head,
Dept. of ISE, RIT,                           Dept. of ISE, RIT
Bangalore-54                                 Bangalore-54

Other Examiners
Name of the Examiners:
Signature
1.

2.
Acknowledgements

All sentences or passages quoted in this report from other people's work
have been specifically acknowledged by clear cross-referencing to author,
work and page(s). Any illustrations which are not the work of the author
of this report have been used (where possible) with the explicit
permission of the originator and are specifically acknowledged. I
understand that failure to do this amounts to plagiarism and will be
considered grounds for failure in this project and the degree examination
as a whole.
Abstract

Redundant data is everywhere. Searching the internet for any topic returns a
large amount of repetitive content, and one finds it difficult to locate
concise, non-repetitive information.
An LSTM-based model is proposed that is trained to detect the similarity
between two sentences. The model takes word embeddings as input and gives a
similarity score between 0 and 1. Each word is first transformed into a vector
with the help of the Google News word2vec corpus.
The model cannot be trained directly on raw sentences; each word has to be
represented in the form of a vector. This representation of a word is known as
a word embedding.
The proposed model can thus predict the similarity between sentences and
output concise, non-redundant information to the user.
Contents
1. Introduction 1
1.1 Motivation 4
1.2 Scope 4
1.3 Objectives 5
1.4 Proposed Model 5

2. Literature Review 9

3. System analysis and Design 10

4. Modelling and Implementation 12


4.1 Use Case Diagram 12
4.2 Sequence Diagram 13

5. Testing, Results and Discussion 14

6. Conclusion and Future Work 18

References 19
Chapter-1

INTRODUCTION

The problem of data redundancy across related documents is tackled using the
Manhattan LSTM (MaLSTM), a Siamese deep network, together with the WordNet
lexical database.

A Siamese network is an artificial neural network that uses the same weights
while working in tandem on two different input vectors to compute comparable
output vectors.

The Long Short-Term Memory (LSTM) is a second-order recurrent neural network
architecture that excels at storing sequential short-term memories and
retrieving them many time-steps later.

LSTMs have an edge over conventional feed-forward neural networks and plain
RNNs in many ways, because of their ability to selectively remember patterns
for long durations of time. The remainder of this chapter explains the LSTM
and how it is used in this project.

LSTMs make small modifications to the information through multiplications and
additions. With LSTMs, the information flows through a mechanism known as the
cell state, which lets them selectively remember or forget things. The
information at a particular cell state has three different dependencies, which
can be generalized to any problem as:
1. The previous cell state (the information that was present in memory after
the previous time step)
2. The previous hidden state (the same as the output of the previous cell)
3. The input at the current time step (the new information being fed in at
that moment)
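
To make these three dependencies concrete, the following is a minimal NumPy sketch of a single LSTM step; the gate layout, shapes and variable names are illustrative assumptions, not the exact implementation used in this project.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One LSTM time step. W (4H x D), U (4H x H) and b (4H,) stack the
        parameters of the forget, input, candidate and output gates."""
        z = W @ x_t + U @ h_prev + b          # mixes the current input with the previous hidden state
        f, i, g, o = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
        c_t = f * c_prev + i * np.tanh(g)     # new cell state: keeps part of c_prev, writes new information
        h_t = o * np.tanh(c_t)                # new hidden state, i.e. the output of this step
        return h_t, c_t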
fig 1: Expanded RNN
Fig. 1 shows the unrolled expansion of an RNN for ease of understanding: a
recurrent neural network can be represented as a time-sequenced simple neural
network in which the output of the RNN at any time step serves as input at the
next time step.

WordNet is a lexical database for the English language. It groups English


words into sets of synonyms called synsets, provides short definitions and
usage examples, and records a number of relations among these synonym
sets or their members. WordNet can thus be seen as a combination of
dictionary and thesaurus.

Synsets are interlinked by means of conceptual-semantic and lexical relations,
resulting in a network of meaningfully related words. WordNet is also freely
and publicly available to use. WordNet's structure makes it a useful tool for
computational linguistics and natural language processing.
WordNet superficially resembles a thesaurus, in that it groups words
together based on their meanings. However, there are some important
distinctions. First, WordNet interlinks not just word forms—strings of
letters—but specific senses of words. As a result, words that are found in
close proximity to one another in the network are semantically
disambiguated. Second, WordNet labels the semantic relations among
words, whereas the groupings of words in a thesaurus do not follow
any explicit pattern other than meaning similarity.

Structure of WordNet:

The main relation among words in WordNet is synonymy, as between the


words shut and close or car and automobile. Synonyms--words that
denote the same concept and are interchangeable in many contexts--are
grouped into unordered sets (synsets). Each of WordNet’s 117 000 synsets
is linked to other synsets by means of a small number of “conceptual
relations.” Additionally, a synset contains a brief definition (“gloss”)
and, in most cases, one or more short sentences illustrating the use of
the synset members. Word forms with several distinct meanings are
represented in as many distinct synsets. Thus, each form-meaning pair in
WordNet is unique.
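
This structure can be explored directly with NLTK's WordNet interface (nltk is among the libraries listed in Chapter 3). The snippet below is only an illustration and assumes the WordNet corpus has already been downloaded with nltk.download('wordnet').

    from nltk.corpus import wordnet as wn

    # Synonyms such as "car" and "automobile" resolve to the same synset.
    car = wn.synsets('car')[0]
    print(car.name())           # e.g. 'car.n.01'
    print(car.definition())     # the synset's gloss (brief definition)
    print(car.lemma_names())    # member word forms, e.g. ['car', 'auto', 'automobile', ...]

    # Synsets are linked by conceptual relations, which supports similarity measures.
    dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
    print(dog.path_similarity(cat))   # score in (0, 1]; higher means closer in the hierarchy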

1.1 Motivation
Most topics are covered on the internet. One can study any topic using MOOCs
(massive open online courses), but much of the time the information about a
topic is repeated. This makes it difficult for a person to explore the topic
further. We wanted to build a model that can detect similarity between
sentences and remove redundant data from a document, thus giving the reader
concise information.

1.2 Scope

The model has the following scope:

• Extract useful information from documents while ignoring redundant data.
• Detect plagiarism.
• Check whether two research papers are similar.
• Biomedical informatics: semantic similarity is used in developing biomedical
ontologies such as the Gene Ontology. Similarity methods are mainly used to
compare genes and can also be used for other bio-entities.
• Geo-informatics: similarity measures are also used to find similarities
between geographical feature-type ontologies. Several tools are available for
this task: (i) the OSM Semantic Network, used to compute the semantic
similarity of tags in OpenStreetMap; (ii) the Similarity Calculator, used to
find the similarity between two geographical concepts in the Geo-Net-PT
ontology; and (iii) the SIM-DL similarity server, which computes the
similarity between geographical feature-type ontologies.

1.3 Objectives
• Reduce multiple related documents to a single non-trivial document.
• Improve accuracy to reduce data loss.
• Perform accurately over different genres of documents.

1.4 Proposed Model

Siamese Manhattan LSTM

fig 2: Siamese LSTM

The proposed Manhattan LSTM (MaLSTM) model is outlined in Figure 2. There are
two networks, LSTM(a) and LSTM(b), which each process one of the sentences in
a given pair; in this work we focus solely on Siamese architectures with tied
weights such that LSTM(a) = LSTM(b). Each sentence (represented as a sequence
of word vectors) is passed to the LSTM, which updates its hidden state at each
sequence index.
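
The sketch below shows how such a tied-weight architecture can be written with the Keras API listed in Chapter 3; the hyper-parameter values are placeholders rather than the exact settings used in this project, and the similarity is computed as exp(-||h(a) - h(b)||_1), which maps the Manhattan distance into (0, 1].

    import keras.backend as K
    from keras.layers import Input, Embedding, LSTM, Lambda
    from keras.models import Model

    # Illustrative sizes (placeholders, not the project's exact values).
    max_len, vocab_size, embed_dim, hidden_units = 20, 20000, 300, 50

    def manhattan_similarity(pair):
        h_a, h_b = pair
        return K.exp(-K.sum(K.abs(h_a - h_b), axis=1, keepdims=True))

    left_in = Input(shape=(max_len,))
    right_in = Input(shape=(max_len,))

    # In practice the embedding weights would be initialised from the Google News word2vec matrix.
    embed = Embedding(vocab_size, embed_dim, input_length=max_len)
    shared_lstm = LSTM(hidden_units)          # tied weights: LSTM(a) = LSTM(b)

    score = Lambda(manhattan_similarity)([shared_lstm(embed(left_in)),
                                          shared_lstm(embed(right_in))])
    model = Model(inputs=[left_in, right_in], outputs=score)
    model.compile(loss='mean_squared_error', optimizer='adadelta', metrics=['accuracy'])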

fig 3: Embedding matrix

Fig. 3 demonstrates the conversion of sentences into their word-embedding
matrices.

Word Embeddings:

Word embedding is one of the most popular representations of document
vocabulary. It is capable of capturing the context of a word in a document,
semantic and syntactic similarity, relation with other words, etc.

Loosely speaking, word embeddings are vector representations of particular
words. Having said this, how do we generate them? More importantly, how do
they capture context?

Word2Vec is one of the most popular techniques for learning word embeddings
using a shallow neural network.

Consider the following similar sentences: "Have a good day" and "Have a great
day". They hardly differ in meaning. If we construct an exhaustive vocabulary
(let's call it V), it would be V = {Have, a, good, great, day}.

Now, let us create a one-hot encoded vector for each of these words in V. The
length of each one-hot encoded vector would equal the size of V (= 5). Each
vector would be all zeros except for a one at the index representing the
corresponding word in the vocabulary. The encodings below explain this better:
Have = [1,0,0,0,0]`; a = [0,1,0,0,0]`; good = [0,0,1,0,0]`;
great = [0,0,0,1,0]`; day = [0,0,0,0,1]` (` represents transpose).
If we try to visualize these encodings, we can think of a 5-dimensional space
where each word occupies one dimension and has nothing to do with the rest
(no projection along the other dimensions). This means 'good' and 'great' are
as different as 'day' and 'have', which is not true.

Our objective is to have words with similar context occupy close spatial
positions. Mathematically, the cosine of the angle between such vectors should
be close to 1, i.e., the angle should be close to 0.
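
With the toy vocabulary above this is easy to verify numerically; the short snippet below (illustrative only) shows that every pair of distinct one-hot vectors has cosine similarity 0.

    import numpy as np

    # One-hot vectors for the toy vocabulary V = {Have, a, good, great, day}
    have  = np.array([1, 0, 0, 0, 0])
    good  = np.array([0, 0, 1, 0, 0])
    great = np.array([0, 0, 0, 1, 0])

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cosine(good, great))   # 0.0 -- one-hot encoding treats 'good' and 'great' as unrelated
    print(cosine(good, have))    # 0.0 -- every pair of distinct words is equally dissimilar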

fig 4: Word2Vec example

Fig. 4 shows a famous example of word2vec: when the vector of 'king' is
subtracted from the vector of 'man' and the vector of 'queen' is added, the
result is close to the vector of 'woman'.
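
This analogy can be reproduced with gensim (listed in Chapter 3) against the pre-trained Google News vectors; the file name below assumes the vectors have already been downloaded locally.

    from gensim.models import KeyedVectors

    # Assumes the pre-trained Google News word2vec file is available locally.
    w2v = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin.gz', binary=True)

    # vector(man) - vector(king) + vector(queen) lands near vector(woman)
    print(w2v.most_similar(positive=['man', 'queen'], negative=['king'], topn=1))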

Chapter- 2

Literature Survey
[1] This paper discusses a unified MOOC model for providing information. It
facilitates the exploitation of the experiences produced by the interactions
of the pedagogical actors. The aim is a unified analysis of the massive data
generated by learning actors.

[2] The ability to accurately judge the similarity between natural language
sentences is critical to the performance of several applications such as text
mining, question answering, and text summarization. Given two sentences, an
effective similarity measure should be able to determine whether the sentences
are semantically equivalent or not, taking into account the variability of
natural language expression. That is, the correct similarity judgment should
be made even if the sentences do not share a similar surface form.

[3] Ordering information is a difficult but important task for natural
language generation applications. A wrong ordering of information not only
makes the text difficult to understand, but can also convey an entirely
different idea to the reader.

[4] This work presents a way of encoding sentences into embedding vectors that
specifically targets transfer learning to other NLP tasks. The models are
efficient and achieve accurate performance on diverse transfer tasks. Two
variants of the encoding models allow for trade-offs between accuracy and
compute resources.

[5] Long Short-Term Memory (LSTM) is a machine learning model that is an
improved version of the RNN. A plain RNN has very short memory and cannot
capture dependencies on information seen many time steps earlier. LSTM takes
care of these issues and is preferred for NLP tasks.

[6] Keras is an open-source neural-network library written in Python. It


is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit,
Theano, or PlaidML. Designed to enable fast experimentation with deep
neural networks, it focuses on being user-friendly, modular, and
extensible.
Chapter-3

System analysis and design


Requirements determination:

• The model should provide results as accurate as possible.


• The model should be robust and must not crash during processing
of inputs.
• The interface for the project should be easy to understand and use.

Vanishing and Exploding Gradients:

fig 5a fig 5b

An error gradient is the direction and magnitude calculated during the
training of a neural network that is used to update the network weights in the
right direction and by the right amount. (Fig. 5a) In deep networks or
recurrent neural networks, error gradients can accumulate during an update and
result in very large gradients.

These in turn result in large updates to the network weights, and in
turn, an unstable network. At an extreme, the values of weights can
become so large as to overflow and result in NaN values. The explosion
occurs through exponential growth by repeatedly multiplying gradients
through the network layers that have values larger than 1.0.

Moving backward through the network and calculating the gradients of the loss
(error) with respect to the weights, the gradients tend to get smaller and
smaller as we keep moving backward through the network. This means that the
neurons in the earlier layers learn very slowly compared to the neurons in the
later layers of the hierarchy (fig. 5b). The earlier layers in the network are
the slowest to train, the training process takes too long, and the prediction
accuracy of the model decreases.

Specifications:

OS: Ubuntu
Language: Python 2.7.6
Dataset: Quora duplicate question dataset
Coding interface: Jupyter Notebook
Libraries used:
• keras
• pandas
• numpy
• nltk
• matplotlib
• seaborn
• scikit-learn
• gensim
Chapter-4

Modelling and Implementation

4.1 Use case diagram

fig 6: Use Case Diagram

Release 1: the inputs were sentences and the output was a similarity score
between them.

Release 2: a front layer is built on top of the Release 1 input; here multiple
documents are taken as input and the output is the merged document. (fig. 6)

4.2 Sequence Diagram

fig 7: Sequence Diagram

Stage 1 - User Interface
Job: document input and retrieval; visible to end users.

Stage 2 - Pre-processing
Job: converting sentences to their vector representations and text cleaning.

Stage 3 - Computation
Job: feeding the model and computing the similarity score. (fig. 7)
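
A minimal sketch of Stage 2 is given below: basic text cleaning followed by mapping each word to its row in the embedding matrix. The cleaning rules, toy vocabulary and sequence length are illustrative assumptions, not the project's exact pre-processing code.

    import re
    from keras.preprocessing.sequence import pad_sequences

    def clean_text(text):
        """Lower-case the text and strip everything except letters, digits and spaces."""
        text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
        return text.split()

    def to_indices(sentence, vocab, max_len=20):
        """Map each word to its row in the embedding matrix; 0 is reserved for padding/unknown."""
        idxs = [vocab.get(w, 0) for w in clean_text(sentence)]
        return pad_sequences([idxs], maxlen=max_len)[0]

    vocab = {'how': 1, 'do': 2, 'i': 3, 'learn': 4, 'python': 5}   # toy vocabulary
    print(to_indices("How do I learn Python?", vocab))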

Chapter-5

Testing, Results and Discussion

Training and Testing:

The model was tested on a sample from the Quora dataset. The optimizer used
was the Adadelta optimizer. Gradient clipping was also used to avoid the
exploding-gradient problem.
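
A sketch of this training configuration, reusing the model from the sketch in Section 1.4, is shown below; the data arrays, clip value, batch size and number of epochs are placeholders, not the exact values used for the Quora sample.

    import numpy as np
    from keras.optimizers import Adadelta

    # Placeholder data shaped like padded question pairs (not the actual Quora sample).
    X_left = np.random.randint(1, 20000, size=(1000, 20))
    X_right = np.random.randint(1, 20000, size=(1000, 20))
    y = np.random.randint(0, 2, size=(1000,))

    # clipvalue caps each gradient component, guarding against exploding gradients.
    model.compile(loss='mean_squared_error',
                  optimizer=Adadelta(clipvalue=1.25),
                  metrics=['accuracy'])
    model.fit([X_left, X_right], y, batch_size=64, epochs=25, validation_split=0.2)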

Sample image of our dataset:

Our model predicts values between 0 and 1. We can choose a threshold near 0.5
to predict whether two sentences are similar or not.

fig 8: training phase

Fig. 8 shows how the model was trained on the dataset. It also shows the
increase in accuracy with the increase in epochs.

Results and Discussion:
After training our model on the training set and validating it on the
validation set, we obtained an accuracy greater than 80%. The convergence of
our model is shown by the graphs below.

Fig 8: Observations

With an increase in epochs, the validation curve behaves similarly to the
training curve, in both fig. 8a and fig. 8b.

The model was trained using an LSTM, but there are various other
sentence-similarity architectures, e.g. the GRU, that could be used in future
work to compare accuracy.
A bigger dataset can help increase the accuracy of the model, and a better
optimizer could also give better results on this data.

fig 9: Output

Fig. 9 shows a sample output of the trained model. The interface takes a
question number from the testing set as input and gives the predicted and
actual scores as output.
The predicted score is between 0 and 1; based on that, a threshold can be
decided to predict whether the sentences are similar or not.
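
Converting a predicted score into a duplicate / not-duplicate decision is then a one-line thresholding step, sketched below with an assumed threshold of 0.5.

    import numpy as np

    def to_label(scores, threshold=0.5):
        """Turn MaLSTM similarity scores in [0, 1] into duplicate (1) / not-duplicate (0) labels."""
        return (np.asarray(scores) >= threshold).astype(int)

    print(to_label([0.91, 0.12, 0.55]))   # -> [1 0 1]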

Chapter-6

Conclusion and future work


The model was successfully trained and was able to achieve an accuracy greater
than 80% on the validation set. The model was also able to produce the same
accuracy on the test set.
The model was trained only on the Quora duplicate-question dataset; in the
future, the model will be trained on various other datasets to cover a larger
domain.
In the future, the model will also scrape data from the web and give precise,
non-redundant information from the scraped data. We want to apply other
algorithms for sentence similarity, such as the GRU, to compare efficiency and
training time among algorithms.
Another direction is to choose a different optimizer, since Adadelta does not
perform as well as other methods when those are finely tuned.

References:
[1] Abdelladim Hadioui, Nour-eddine El Faddouli, Yassine Benjelloun Touimi,
and Samir Bennani, "Machine Learning Based On Big Data Extraction of Massive
Educational Knowledge".

[2] Palakorn Achananuparp, Xiaohua Hu, and Shen Xiajiong, "The Evaluation of
Sentence Similarity Measures".

[3] Danushka Bollegala, Naoaki Okazaki, and Mitsuru Ishizuka, "A Machine
Learning Approach to Sentence Ordering for Multidocument Summarization and its
Evaluation".

[4] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, and Nicole Limtiaco,
"Universal Sentence Encoder".

[5] https://en.wikipedia.org/wiki/Long_short-term_memory
[6] https://keras.io/

