
C5: Word Embeddings and Distance Measurements for Text

The relationship between a word and its neighbourhood tends to define the word's semantics and its overall positioning and presence in a sentence.
Word embedding is a learned representation of a word wherein each word is represented
using a vector in n-dimensional space.
Word2vec captures relationships in text.
vector(Man) + vector(Queen) ≈ vector(King) + vector(Woman)
The thought process here is that the relationship of Man:King is the same as Woman:Queen.
The Word2vec algorithm is able to capture these semantic relationships.
The values obtained from the simple vector arithmetic shown previously are not exactly equal to the actual vector representations of the words; the relationship holds only approximately.
Word2Vec is a model that enables the building of word vectors using contextual information from the neighbourhood of a word: every word's embedding is developed based on the words around it.
Word2vec is an unsupervised methodology for building word embeddings. In the Word2vec
architecture, an attempt is made to do either of the following:

- Predict the target word based on the context word
- Predict the context word based on the target word
The Word2vec algorithm tries to capture relationships between words in the text corpus.
The output of the Word2vec algorithm is a |V| * D matrix, where |V| is the size of the
vocabulary we want vector representations for and D is the number of dimensions used to
represent each word vector.
There is a pretrained, publicly available Word2vec model that Google trained on the Google News dataset. It has a vocabulary of 3 million words and phrases, and each vector has 300 dimensions.
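As a minimal sketch, the pretrained Google News vectors can be fetched through gensim's downloader API (this assumes gensim is installed and the roughly 1.6 GB model can be downloaded); the last line illustrates the Man:King :: Woman:Queen analogy discussed above:

    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")   # returns a KeyedVectors object

    print(wv["king"].shape)                     # (300,) -> one 300-dimensional vector per word
    # The analogy discussed above: vector(King) - vector(Man) + vector(Woman) ~ vector(Queen)
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))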
Word2vec models can be trained using two approaches:

- Predicting the context words using the target word as input, which is referred to as the Skip-gram method
- Predicting the target word using the context words as input, which is referred to as the Continuous Bag-of-Words (CBOW) method
Skip-gram Method

Say we have the sentence "Let us make an effort to understand natural language processing".
Every row in the accompanying figure has one word shaded in brown; this word represents the target word. Each row also has some words shaded in gray; these represent the context words for the corresponding target word. As you may have guessed, the window_size value used here is 5.
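A small sketch of how (target word, context word) training pairs could be generated for the Skip-gram method is shown below. Here window_size is assumed to count the words on each side of the target (conventions differ between implementations), and a smaller value of 2 is used so the output stays short:

    sentence = "Let us make an effort to understand natural language processing".split()
    window_size = 2  # hypothetical value for a compact illustration

    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window_size)
        end = min(len(sentence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sentence[j]))

    print(pairs[:5])
    # [('Let', 'us'), ('Let', 'make'), ('us', 'Let'), ('us', 'make'), ('us', 'an')]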

CBOW Method

The CBOW method works similarly to the Skip-gram method. However, the change here is that the vectors corresponding to the context words are sent in as input and the model tries to predict the target word.

The methods discussed previously are computationally expensive, since all the weights or entries in the embedding and context matrices are updated for every (target word, context word) or (context word, target word) pair. Mikolov et al. addressed this problem by employing two strategies: subsampling and negative sampling.
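The following is a hedged gensim sketch of training both variants (parameter names assume gensim 4.x); sg=1 selects Skip-gram and sg=0 selects CBOW, while the negative and sample parameters correspond to negative sampling and subsampling of frequent words. The toy corpus is made up for illustration:

    from gensim.models import Word2Vec

    corpus = [
        "let us make an effort to understand natural language processing".split(),
        "word embeddings capture the meaning of words from their context".split(),
    ]

    skipgram = Word2Vec(corpus, vector_size=100, window=5, sg=1, negative=5,
                        sample=1e-3, min_count=1)
    cbow = Word2Vec(corpus, vector_size=100, window=5, sg=0, negative=5,
                    sample=1e-3, min_count=1)

    print(skipgram.wv["language"][:5])   # first 5 dimensions of the learned vector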

The size of the vocabulary is equal to the number of unique words in the sentences we have
defined.
Higher-dimensional vectors capture more information across dimensions, especially when
the corpus and vocabulary are big and the data is highly varied.

“The model is as good as the data it was trained on.”

Word Mover's Distance (WMD)


WMD computes the pairwise Euclidean distance between the embeddings of words across two sentences, and it defines the distance between two documents as the minimum cumulative cost, in terms of Euclidean distance, required to move all the words of the first sentence to the second sentence.
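A minimal sketch of WMD with gensim is shown below; the wmdistance call relies on an optimal-transport backend (pyemd or POT, depending on the gensim version), and a small pretrained GloVe model is used here only to keep the download light:

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-50")   # small pretrained vectors for brevity

    sentence_1 = "obama speaks to the media in illinois".split()
    sentence_2 = "the president greets the press in chicago".split()

    distance = wv.wmdistance(sentence_1, sentence_2)
    print(distance)   # smaller values indicate more similar sentences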

C6: Exploring Sentence-, Document-, and Character-Level Embeddings

Similar to Word2Vec, the idea in Doc2Vec is also to predict certain words. However, in addition to using word representations for predicting words, as we did in the Word2Vec model, document representations are used here as well.

These documents are represented using dense vectors, similar to how we represent words.
The vectors are called document or paragraph vectors and are trained to predict words in
the document.
Similar to Word2Vec, Doc2Vec also falls under the class of unsupervised algorithms since
the data that's used here is unlabeled.
The paper described two ways of building paragraph vectors, as follows:

- Distributed Memory Model of Paragraph Vectors (PV-DM): This is similar to the continuous bag-of-words approach we discussed regarding Word2Vec. Paragraph vectors are concatenated with the word vectors to predict the target word.
- Distributed Bag-of-Words Model of Paragraph Vectors (PV-DBOW): In this approach, word vectors aren't taken into account. Instead, the paragraph vector is used to predict randomly sampled words from the paragraph.

The PV-DBOW model is simpler and more memory-efficient, as word vectors don't need to be stored in this approach.

The learned representations that are obtained from both the distributed memory model and
the distributed bag-of-words model can be combined to form the paragraph vector.
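As a hedged gensim sketch, dm=1 gives the PV-DM model and dm=0 gives PV-DBOW; the toy documents, tags, and the final concatenation of the two inferred vectors are made up for illustration:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [
        TaggedDocument(words="natural language processing is fun".split(), tags=[0]),
        TaggedDocument(words="word and document embeddings capture meaning".split(), tags=[1]),
    ]

    pv_dm = Doc2Vec(docs, vector_size=50, dm=1, min_count=1, epochs=40)
    pv_dbow = Doc2Vec(docs, vector_size=50, dm=0, min_count=1, epochs=40)

    # Infer a vector for an unseen document and combine both representations
    new_doc = "embeddings for documents".split()
    combined = np.concatenate([pv_dm.infer_vector(new_doc), pv_dbow.infer_vector(new_doc)])
    print(combined.shape)   # (100,)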

Building word representations using character n-grams from the words themselves is a technique referred to as fastText. The fastText model helps capture morphological information from sub-word representations. fastText is also flexible, as it can provide embeddings for out-of-vocabulary words, since embeddings are the result of sub-word representations.
The original fastText research paper extended the Skip-gram approach of Word2Vec, but today both the Skip-gram and continuous bag-of-words approaches can be used. fastText can be applied to solve a plethora of problems, such as spelling correction and auto-suggestions, since it is based on sub-word representations.

fastText is a very convenient technique for building word representations using character
level features. It outperformed Word2Vec since it incorporated internal word structure
information and associated it with morphological features, which are very important in
certain languages.

It also allows us to represent words not present in the original vocabulary.
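Below is a minimal gensim FastText sketch (the tiny corpus and the misspelled query word are made up); because vectors are built from character n-grams, even an out-of-vocabulary word still receives an embedding:

    from gensim.models import FastText

    corpus = [
        "fasttext builds word vectors from character ngrams".split(),
        "subword information helps with rare and misspelled words".split(),
    ]

    model = FastText(corpus, vector_size=50, window=3, min_count=1, min_n=3, max_n=6)

    print(model.wv["fasttext"][:5])   # in-vocabulary word
    print(model.wv["fasttxt"][:5])    # out-of-vocabulary (misspelled) word still gets a vector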

Sent2Vec combines the continuous bag-of-words approach we discussed regarding Word2Vec with the fastText thought process of using constituent n-grams in order to build sentence embeddings.

Research has shown that Sent2Vec outperforms Doc2Vec in the majority of the tasks it
undertakes and that it is a better representation method for sentences or documents.

The Universal Sentence Encoder is a fairly recent technique that has been open sourced by Google for building sentence- or document-level embeddings.

The Universal Sentence Encoder (USE) is a model for fetching embeddings at the sentence
level. These models are trained using Wikipedia, web news, web question-answer pages,
and discussion forums.

Several models that have been built using USE-based transfer learning have outperformed state-of-the-art results in the recent past. USE can be used similarly to TF-IDF, Word2Vec, Doc2Vec, fastText, and so on for fetching sentence-level embeddings.
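A hedged sketch of fetching sentence embeddings with USE via TensorFlow Hub is shown below; the module URL and the 512-dimensional output correspond to the publicly released USE model, but should be verified against the TensorFlow Hub page:

    import tensorflow_hub as hub

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    sentences = ["How old are you?", "What is your age?"]
    embeddings = embed(sentences)      # one 512-dimensional vector per sentence
    print(embeddings.shape)            # (2, 512)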

C7: Identifying Patterns in Text Using Machine Learning

Naive Bayes is a popular ML algorithm based on Bayes' theorem, which can be represented as follows:

    P(A|B) = P(B|A) * P(A) / P(B)

Here, A and B are events:
P(A|B) is the probability of A given B, while P(B|A) is the probability of B given A.
P(A) is the independent probability of A, while P(B) is the independent probability of B.
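As a tiny worked example with made-up numbers, if P(B|A) = 0.8, P(A) = 0.3, and P(B) = 0.4, then Bayes' theorem gives:

    p_b_given_a, p_a, p_b = 0.8, 0.3, 0.4
    p_a_given_b = p_b_given_a * p_a / p_b
    print(p_a_given_b)   # 0.6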

Naive Bayes assumes that all the features are independent of each other, so the joint probability is simply the product of the individual probabilities. This assumption is naive because it is almost always wrong. Even in our example, an applicant with a high SAT score is more likely to have a high GPA, so these two events are not independent. However, the Naive Bayes assumption has been proven to work well for classification problems.
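As a hedged scikit-learn sketch of Naive Bayes applied to text classification (the tiny training set and labels are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["win a free prize now", "limited offer click here",
             "meeting rescheduled to monday", "please review the attached report"]
    labels = ["spam", "spam", "ham", "ham"]

    # Bag-of-words features feeding a multinomial Naive Bayes classifier
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(texts, labels)

    print(clf.predict(["claim your free prize"]))   # expected: ['spam']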

Sentiment analysis, sometimes called opinion mining or polarity detection, refers to the set
of algorithms and techniques that are used to extract the polarity of a given document; that
is, it determines whether the sentiment of a document is positive, negative, or neutral.
SVM is a supervised ML algorithm that attempts to classify data within a dataset by finding the optimal hyperplane that best segregates the classes.

Each data point in the dataset can be considered a vector in an N-dimensional plane, with each dimension representing a feature of the data. SVM identifies the frontier data points (the points closest to the opposing class), also known as support vectors, and then attempts to find the boundary (also known as the hyperplane in the N-dimensional space) that is farthest from the support vectors of each class.

Say we have a fruit basket with two types of fruit in it and we want to create an algorithm that segregates them. We only have information about two features of the fruits; that is, their weight and radius. Therefore, we can abstract this problem as a linear algebra problem, with each fruit representing a vector in a two-dimensional space. In order to segregate the two types of fruit, we will have to identify a hyperplane (in two dimensions, the hyperplane is a line) whose equation can be represented as follows:

    w1*x1 + w2*x2 + c = 0

Here, w1 and w2 are coefficients and c is a constant. The equation of the hyperplane in n dimensions can be generalized as follows:

    w1*x1 + w2*x2 + ... + wn*xn + c = 0

The algorithm creates a number of candidate hyperplanes and repeats this calculation to identify the hyperplane that is farthest from the support vectors of both classes.
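A minimal scikit-learn sketch of the fruit example above follows; the weights, radii, and labels are made-up numbers, and a linear kernel recovers the separating hyperplane w1*x1 + w2*x2 + c = 0:

    import numpy as np
    from sklearn.svm import SVC

    # features: [weight in grams, radius in cm]; labels: 0 and 1 for the two fruit types
    X = np.array([[120, 3.0], [130, 3.2], [125, 3.1], [180, 4.5], [190, 4.8], [175, 4.4]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear").fit(X, y)

    print(clf.coef_)              # [w1, w2] of the learned hyperplane
    print(clf.intercept_)         # c
    print(clf.support_vectors_)   # the frontier points described above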

Pickling in Python refers to serializing and deserializing Python object structures. In other words, by using the pickle module, we can save the Python objects that are created as part of model training for reuse.
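A minimal sketch of saving and restoring an object with pickle is shown below; the dictionary here simply stands in for any trained model object:

    import pickle

    model = {"weights": [0.1, 0.2], "bias": 0.3}   # stand-in for a trained model object

    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)          # serialize (pickle) the object to disk

    with open("model.pkl", "rb") as f:
        restored = pickle.load(f)      # deserialize it back for reuse

    print(restored == model)           # True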

C8: From Human Neurons to Artificial Neurons for Understanding Text

The evolution of high-end processors in the form of graphical processing units (GPUs) and
tensor processing units (TPUs) has supplemented the rise of neural network-based
applications by making it possible to perform heavy calculations that are very commonly
encountered in any neural network.

Activation functions play a key role in transforming the system of linear equations to a
nonlinear construct (complex nonlinear decision boundaries).

Activation functions introduce nonlinearity into the network. Without nonlinearity, the network would only perform linear mappings from the input to the output, which would be nothing but a multivariate linear equation.
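A small NumPy sketch of common activation functions that introduce the nonlinearity discussed above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        return np.maximum(0.0, x)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # pre-activation values
    print(sigmoid(z))
    print(relu(z))
    print(np.tanh(z))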

There are techniques for initializing weight matrices, such as Xavier initialization, that yield better results than randomly initialized weight matrices.
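As a hedged sketch of Xavier (Glorot) initialization, the common uniform variant draws weights from a range bounded by sqrt(6 / (fan_in + fan_out)); the layer sizes below are made-up examples:

    import numpy as np

    def xavier_uniform(fan_in, fan_out, seed=0):
        rng = np.random.default_rng(seed)
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    W = xavier_uniform(300, 128)   # e.g. 300-dimensional inputs feeding a 128-unit layer
    print(W.shape, W.min(), W.max())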
