
60003180058

VARUN VORA

EXPERIMENT 9

AIM: Understanding and implementing text similarity.

THEORY:
Consider the following 2 sentences:
• My house is empty
• There is nobody at mine
A human could easily determine that these 2 sentences convey a very similar meaning despite being
written in 2 completely different formats. The intersection of the 2 sentences contains only one word,
“is”, and it doesn’t provide any insight into how similar the sentences are. Nonetheless, we’d
still expect a similarity algorithm to return a score that informs us that the sentences are very similar.
This phenomenon describes what we’d refer to as semantic text similarity, where we aim to identify
how similar documents are based on the context of each document. This is quite a difficult problem
because of the complexities that come with natural language.
On the other hand, we have another phenomenon called lexical text similarity. Lexical text similarity
aims to identify how similar documents are on a word level. Many of the traditional techniques tend
to focus on lexical text similarity and they are often much faster to implement than the new deep
learning techniques that have slowly risen to stardom.
Essentially, we may define text similarity as attempting to determine how “close” 2 documents are in
lexical similarity and semantic similarity.
This is a common, yet tricky, problem within the Natural Language Processing (NLP) domain. Some
example use cases of text similarity include modeling the relevance of a document to a query in a
search engine and understanding similar queries in various AI systems in order to provide uniform
responses to users.

Popular Evaluation Metrics for Text Similarity

Whenever we are performing some sort of Natural Language Processing task, we need a way to
interpret the quality of the work we are doing. “The documents are pretty similar” is subjective and not
very informative in comparison to “the model has a 90% accuracy score”. Metrics provide us with
objective and informative feedback to evaluate a task.
Popular metrics include:
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
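
A minimal sketch of these three metrics in Python (the vectors and token sets below are illustrative toy examples, not data from the experiment):

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between two vectors; smaller means more similar
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1 means identical direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_similarity(tokens_a, tokens_b):
    # Overlap of two token sets: |intersection| / |union|
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

v1 = [1, 1, 0, 1]   # toy bag-of-words vector for document 1
v2 = [1, 0, 1, 1]   # toy bag-of-words vector for document 2
print(euclidean_distance(v1, v2))   # 1.414...
print(cosine_similarity(v1, v2))    # 0.666...
print(jaccard_similarity("my house is empty".split(),
                         "there is nobody at mine".split()))   # 1/8 = 0.125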

In order to measure text similarity using NLP techniques, these are the standard steps to be followed:
Text Pre-Processing:
In day-to-day practice, information is gathered from multiple sources, be it the web, documents or
transcriptions from audio, and this information may contain various types of garbage values, noisy text
and encoding artifacts. It needs to be cleaned in order to perform further NLP tasks. The preprocessing
phase typically includes removing non-ASCII values, special characters, HTML tags and stop words,
converting from the raw format, and so on.
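
A minimal pre-processing sketch along these lines, assuming plain Python plus NLTK’s English stop-word list (which needs nltk.download("stopwords") once); the sample string is illustrative:

import re
import unicodedata
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    # Drop non-ASCII characters
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # Strip HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove special characters, keeping only letters and whitespace
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Lowercase, tokenize and remove stop words
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(preprocess("<p>My house is empty!</p>"))   # ['house', 'empty']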

Feature Extraction:
To convert the text data into a numeric format, it needs to be encoded. Various encoding techniques
are widely used to extract word embeddings from the text data; examples of such techniques are
bag-of-words, TF-IDF and word2vec.
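
A minimal feature-extraction sketch using scikit-learn’s TfidfVectorizer (CountVectorizer would give plain bag-of-words counts instead); the two documents are the example sentences from the theory section:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["my house is empty", "there is nobody at mine"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

print(vectorizer.get_feature_names_out())   # learned vocabulary
print(tfidf_matrix.toarray())               # dense TF-IDF vectors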
Vector Similarity:
Once we have vectors for the given text chunks, statistical methods can be used to compute the
similarity between the generated vectors. Such techniques include cosine similarity, Euclidean
distance, Jaccard distance and word mover’s distance. Cosine similarity is the technique most widely
used for text similarity.
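
The sketch below shows this vector-similarity step on the same two example documents, recomputing the TF-IDF vectors so the snippet runs on its own and scoring them with scikit-learn’s cosine_similarity and euclidean_distances:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["my house is empty", "there is nobody at mine"]
tfidf = TfidfVectorizer().fit_transform(docs)

# Because the two sentences share only the word "is", the lexical score is low
print(cosine_similarity(tfidf[0], tfidf[1])[0][0])    # higher = more similar
print(euclidean_distances(tfidf[0], tfidf[1])[0][0])  # lower = more similar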

Decision Function:
From the similarity score, a custom function needs to be defined to decide whether the score classifies
the pair of chunks as similar or not. Cosine similarity returns a score between 0 and 1, where 1 means
the pair of chunks is exactly similar and 0 means they have nothing in common. In regular practice, if
the similarity score is more than 0.5, the pair is likely to be at least somewhat similar. But this can
vary based on the application and use case.
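
A minimal sketch of such a decision function; the 0.5 threshold is the rule of thumb mentioned above and would normally be tuned per application:

def is_similar(score, threshold=0.5):
    # Classify a pair of chunks as similar based on a cosine score in [0, 1]
    return score >= threshold

print(is_similar(0.83))   # True  -> treat the pair as similar
print(is_similar(0.12))   # False -> treat the pair as dissimilar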

BERT
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art technique for
natural language processing pre-training developed by Google. BERT is trained on unlabelled text,
including Wikipedia and the BooksCorpus. BERT uses the transformer architecture, an attention
model, to learn embeddings for words.
BERT consists of two pre-training tasks: Masked Language Modelling (MLM) and Next Sentence
Prediction (NSP). In BERT training, text is represented using three embeddings: Token Embeddings +
Segment Embeddings + Position Embeddings.
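
A minimal sketch of computing sentence similarity with a pre-trained BERT model via the Hugging Face transformers library; the model name (bert-base-uncased), the mean-pooling strategy and the example sentences are assumptions for illustration, not taken from the experiment’s own code:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    # Mean-pool the token embeddings from the last hidden layer into one vector
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

a = embed("My house is empty")
b = embed("There is nobody at mine")
score = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
print(f"BERT cosine similarity: {score:.3f}")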

CODE (Using BERT):



TESTING (OUTPUT):

CONCLUSION: Hence, text similarity was understood and implemented.
