
Word Vectors & Text Classification

Bardia Rafieian
Bardia.rafieian@upc.edu

PhD. Candidate
Universitat Politècnica de Catalunya
Technical University of Catalonia
Outline

● Last session recap
● Word Vectors
● Word2Vec
● GloVe
● Cross-lingual Word Embeddings
● Contextual Embeddings
● Text Classification

2
Last session recap

3
Last session recap: Bag of words

Image from: https://victorzhou.com/blog/bag-of-words/

4
Last session recap: Bag of words

● Sparse representation: most vector values are 0, so the distance
between vectors is not informative.

● Missing order information: the vectors do not capture the order of
the tokens in the sentence.

5
Word Vectors

6
One-hot
● If you have 10,000 words in your vocabulary, then you can
represent each word as a 1x10,000 vector
● Example: 4 words in our vocabulary →
Mango [1, 0, 0, 0]
Strawberry [0, 1, 0, 0]
City [0, 0, 1, 0]
Barcelona [0, 0, 0, 1]

7
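A minimal Python sketch of this idea, using the four-word vocabulary above (purely illustrative):

```python
# One-hot encoding: each word becomes a vector of zeros with a single 1
# at the index assigned to that word in the vocabulary.
vocab = ["Mango", "Strawberry", "City", "Barcelona"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("City"))  # [0, 0, 1, 0]
```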
Some useful definitions
( Term Frequency (TF) — Inverse Document Frequency (IDF) )

TF = (number of repetitions of the word in a document) / (total number of words in the document)

IDF = log[(total number of documents) / (number of documents containing the word)]

TF-IDF = TF × IDF

8
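A minimal Python sketch of these definitions, using a made-up toy corpus of three documents:

```python
import math

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["cats", "and", "dogs"],
]

def tf(word, doc):
    # Repetitions of the word divided by the document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Log of (number of documents / number of documents containing the word).
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("cat", docs[0], docs))  # "cat" is frequent here but rare in the corpus
```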
Word Vectors

● Representations learned from a corpus of text.

● Model the probability of words given the context of a fixed-size window.

● Similarity between vectors can be measured and provides information
about the relationship between words (see the sketch below).

● Once trained, these vectors can be employed as features for other models.

9
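For instance, similarity between word vectors is commonly measured with cosine similarity; a minimal sketch, with made-up 3-dimensional vectors purely for illustration:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, for illustration only.
king = [0.8, 0.65, 0.1]
queen = [0.75, 0.7, 0.2]
apple = [0.1, 0.05, 0.9]

print(cosine_similarity(king, queen))  # high: related words
print(cosine_similarity(king, apple))  # low: unrelated words
```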
Jointly Trained Embeddings
● The embedding table serves as the input layer of the network.

● It is initialized from a random distribution, like any other parameter
of the network.

● The parameters are optimized through back-propagation according to the
loss function of the task.

Figure from: (Collobert et al., 2011)


10
Jointly Trained Embeddings

11
Jointly Trained Embeddings

● As we now use the tokens of the sentence as input, we need a way to
match their lengths and form batches.

● Padding: a special token in our vocabulary that fills the missing
positions until the appropriate sequence length is reached
(see the sketch below).

Example: el gato come . <pad> <pad>  →  [23, 2333, 34, 2, 1, 1]

12
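A minimal Python sketch of padding, assuming (as in the example above) that <pad> has index 1; the other indices are illustrative:

```python
# Pad every sequence of token indices to the length of the longest one
# so that the sequences can be stacked into a single batch.
PAD_INDEX = 1  # index of the <pad> token in our (hypothetical) vocabulary

def pad_batch(sequences, pad_index=PAD_INDEX):
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[23, 2333, 34, 2],                  # "el gato come ."
                   [23, 54, 2333, 765, 34, 2]])
print(batch[0])  # [23, 2333, 34, 2, 1, 1]
```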
Jointly Trained Embeddings
● Our embedding table is the first layer of our model.

● For each token, represented by its index in the vocabulary, it retrieves
the corresponding embedding vector.

● This table is optimized like any other layer of our network by applying
back-propagation (see the sketch below).

13
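A minimal PyTorch sketch of an embedding table as the first layer of a model; the vocabulary size, dimensions and indices are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, pad_index = 10_000, 100, 1

# The embedding table: one trainable vector per vocabulary entry.
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)

# A padded batch of token indices (batch_size=2, seq_len=6).
batch = torch.tensor([[23, 2333, 34, 2, 1, 1],
                      [23, 54, 2333, 765, 34, 2]])

vectors = embedding(batch)   # shape: (2, 6, 100), one vector per token
print(vectors.shape)

# The table is updated by back-propagation like any other layer:
# its weights appear in embedding.parameters() and receive gradients.
```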
Jointly Trained Embeddings

● Embeddings can be combined to form sentence representations,
e.g. by averaging or with an LSTM (see the sketch below).

● They can also be applied to constant-size windows of tokens.

14
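A minimal PyTorch sketch of averaging token embeddings into a sentence representation; masking out <pad> positions is one common choice, not prescribed by the slides:

```python
import torch
import torch.nn as nn

pad_index = 1
embedding = nn.Embedding(10_000, 100, padding_idx=pad_index)
batch = torch.tensor([[23, 2333, 34, 2, 1, 1]])        # one padded sentence

vectors = embedding(batch)                              # (1, 6, 100)
mask = (batch != pad_index).unsqueeze(-1).float()       # 0.0 at <pad> positions
sentence_repr = (vectors * mask).sum(1) / mask.sum(1)   # (1, 100) mean of real tokens
print(sentence_repr.shape)
```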
Jointly Trained Embeddings

● Once embeddings are fully trained, they can be employed as features
for other models, neural or not.

● This is useful when unlabelled data is available.

15
Word Vectors

16
Word2Vec

17
Skip-gram and Continuous Bag of Words
Model (CBOW)
● Skip-gram: the distributed representation of the input word is used
to predict the context.

● CBOW: the distributed representations of the context (the surrounding
words) are used to predict the word in the middle.

Figure from: (Mikolov et al., 2013)

18
Word2Vec: Skip-Gram

19
Word2Vec: Skip-Gram

20
Example
Having the sentence: "The future king is the prince"

● We choose a context window size (w) of 2.
● We eliminate the stop words: (the, is)
● After tokenization, we obtain the data points.
● We create a dictionary of one-hot encodings for each X and Y.

21
Example
● The final sizes of these matrices will be n × m, where
n = number of created data points
m = number of unique words

● We now have the X and Y matrices; we choose the embedding size as a
hyperparameter.

● The hidden layer dimension equals the word embedding dimension, and the
output layer is a softmax. The network itself is linear.

● After training the network, we can read off the weights, which are our
embedding values! (See the sketch below.)
22
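A minimal Python sketch of how these (X, Y) data points could be built, following the pre-processing described above:

```python
sentence = "The future king is the prince"
stop_words = {"the", "is"}

tokens = [w for w in sentence.lower().split() if w not in stop_words]
# tokens: ['future', 'king', 'prince']

window = 2
pairs = []  # (input word X, context word Y) data points for Skip-gram
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

vocab = sorted(set(tokens))
one_hot = {w: [int(w == v) for v in vocab] for w in vocab}

X = [one_hot[x] for x, _ in pairs]   # n x m matrix
Y = [one_hot[y] for _, y in pairs]   # n x m matrix
print(len(pairs), len(vocab))        # n data points, m unique words
```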
More information
Similar context?

I think you could expect that synonyms like “intelligent” and “smart” would have very
similar contexts. Or that words that are related, like “engine” and “transmission”,
would probably have similar contexts as well.

23
Word2Vec: CBOW
● Predicts the current word according to the other words in the window.
● The prediction is based on the average representation of the window's
words.
● As in Skip-gram, no notion of order within the window is used.
● Better for common words in the corpus.

24
Example
Sentence:
‘It is a pleasant day’

● We need (context words) and a (target word).

● If the window size is 2 for the context, the (context, target) pairs are:
([it, a], is), ([is, pleasant], a), ([a, day], pleasant)

25
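A minimal Python sketch that generates exactly these (context, target) pairs:

```python
sentence = "It is a pleasant day"
tokens = sentence.lower().split()

half_window = 1  # a window of 2 context words = 1 word on each side
pairs = []
for i in range(half_window, len(tokens) - half_window):
    context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
    pairs.append((context, tokens[i]))

print(pairs)
# [(['it', 'a'], 'is'), (['is', 'pleasant'], 'a'), (['a', 'day'], 'pleasant')]
```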
● Skip-gram: works well with a small amount of training data and represents
even rare words or phrases well.

● CBOW: several times faster to train than Skip-gram, with slightly better
accuracy for frequent words (see the sketch below).

26
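A minimal sketch with gensim (a library choice assumed here, not prescribed by the slides); the sg flag selects Skip-gram or CBOW:

```python
from gensim.models import Word2Vec

sentences = [["the", "future", "king", "is", "the", "prince"],
             ["it", "is", "a", "pleasant", "day"]]

# sg=1 -> Skip-gram, sg=0 -> CBOW (gensim's default).
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

print(skipgram.wv["king"].shape)              # (100,)
print(cbow.wv.most_similar("king", topn=2))   # nearest neighbours in vector space
```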
GloVe

27
GloVe

● Co-occurrence matrix: for each pair of words (i, j), it counts the
number of times word i appears in the context of word j.

● Captures global relations between words across the whole corpus.

28
GloVe

Co-occurrence Matrix:
The cat sat on mat

Figure from: towardsdatascience.com

29
GloVe
Let’s fill a co-occurrence matrix:

[Figure: a co-occurrence matrix for the sentence above, with the cells left to fill in]

Image source: Data Analytics with Hadoop: An Introduction for Data Scientists

30
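A minimal Python sketch that fills a co-occurrence matrix for the sentence above, assuming a symmetric context window of one word:

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]
window = 1  # one context word on each side

vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}

# X[i][j] = number of times word j appears in the context of word i.
X = [[0] * len(vocab) for _ in vocab]
for i, word in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            X[index[word]][index[tokens[j]]] += 1

for w in vocab:
    print(w, X[index[w]])
```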
GloVe
● Co-occurrence probabilities for ice and steam with probe words
● The ratio is better able to distinguish relevant words (solid and gas) from
irrelevant words (water and fashion)
● It is also better able to discriminate between the two relevant words.

31
GloVe

w_i: vector of the current word
w_j: vector of the context word
X: co-occurrence matrix (X_ij = number of co-occurrences of i and j)
b: bias terms

Final embedding: w_i + w_j

32
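For reference, these symbols appear in GloVe's weighted least-squares objective (Pennington et al., 2014):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```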
GloVe

Advantages over Word2Vec:
● Fast training.
● Scales better to bigger corpora.
● Considers the co-occurrence statistics of words across all sentences to
learn word embeddings.
● Captures the relationships between words like "soccer" and "sport" based
on their overall co-occurrence patterns.

Drawbacks over Word2Vec:
● Higher memory consumption.
● More sensitive to learning-rate initialization.

Word2Vec, in contrast:
● Focuses on the local context of words within a limited window.
● Learns word embeddings based on the specific context words surrounding
each word in the corpus.

33
Cross-lingual Word Embeddings

34
Cross-lingual Word Embeddings

35
Cross-lingual Word Embeddings

36
Cross-lingual Word Embeddings

37
Contextual Embeddings

38
Contextual Embeddings

39
ELMo (Embeddings from Language Models)
● Language Modelling: the task of predicting the next tokens of a
sequence, or the probability of a given sequence occurring in the
language (see the sketch below).

● Contextual Embeddings: vector representations that take into account
the context of the sentence.

40
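A minimal PyTorch sketch of the language-modelling objective (predicting the next token); note this toy model is a single forward LSTM, not ELMo's bidirectional architecture, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10_000, 100, 256

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)                 # next-token logits at each position

model = TinyLM()
tokens = torch.randint(0, vocab_size, (2, 7))
logits = model(tokens[:, :-1])                  # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))
print(loss.item())
```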
ELMo (Embeddings from Language Models)

41
ELMo (Embeddings from Language Models)

42
Document Embeddings

43
Doc2vec
● Represents documents as vectors; it is a generalization of the Word2vec
method.

● The inspiration: word vectors are asked to contribute to predicting the
next word.

● The paragraph vector is likewise asked to contribute to the prediction
of the next word, given many contexts sampled from the paragraph.

● Every paragraph is mapped to a unique vector, represented by a column
in a matrix.

44
Doc2vec

● The algorithm has two key stages:

1) Training: obtain word vectors, softmax weights, and paragraph vectors
on already-seen paragraphs.
2) The "inference stage": obtain paragraph vectors D for new (unseen)
paragraphs (see the sketch below).

45
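A minimal sketch with gensim's Doc2Vec (a library choice assumed here, not prescribed by the slides), showing both stages:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
          TaggedDocument(words=["dogs", "and", "cats", "are", "pets"], tags=["doc1"])]

# Stage 1: train paragraph vectors (and word vectors) on seen paragraphs.
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)

# Stage 2: the inference stage, producing a vector for an unseen paragraph.
new_vector = model.infer_vector(["the", "dog", "sat", "on", "the", "mat"])
print(new_vector.shape)  # (50,)
```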
Text Classification

46
Classification Tasks

Text classifiers can be used to organize, structure, and categorize
pretty much any kind of text.

● Feature: the string representing the input text.
● Target: the text's polarity.
● Additional data could be added, such as Part-of-Speech tags.
(See the sketch below.)

47
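A minimal scikit-learn sketch of such a classifier, using TF-IDF features and logistic regression on a made-up toy dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Feature: the raw input string. Target: its polarity (1 = positive, 0 = negative).
texts = ["I loved this film", "great acting and story",
         "terrible plot", "I hated every minute"]
labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["what a great story"]))  # expected: [1]
```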
NLP Tasks

Image source: Dan Jurafsky & Chris Manning, http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf

48


Convolutional Neural Network (CNN)

● CNN: learns a set of filters that are applied over the input, window by
window, to find local relations.

● Pooling: reduces dimensionality by applying an aggregation over windows
of the data, e.g. maximum or average (see the sketch below).
Source: http://cs231n.github.io/convolutional-networks/

49
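A minimal PyTorch sketch of a CNN text classifier with max pooling; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, n_filters, n_classes = 10_000, 100, 64, 2

class TextCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Filters slide over windows of 3 consecutive tokens to find local relations.
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))              # (batch, n_filters, seq_len)
        x = x.max(dim=2).values                   # max pooling over the sequence
        return self.out(x)                        # class logits

model = TextCNN()
print(model(torch.randint(0, vocab_size, (4, 20))).shape)  # (4, 2)
```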
Recurrent Neural Networks (RNN)
RNNs keep a context of the previous steps of the sequence that
conditions the output at each step.

This is useful for predicting new steps of the sequence, or when only
the last step is passed to the next layer to compute a sequence
representation (see the sketch below).

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
50
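A minimal PyTorch sketch of an LSTM classifier that uses the last step's hidden state as the sequence representation; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, n_classes = 10_000, 100, 128, 2

class TextRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        _, (last_hidden, _) = self.lstm(self.embed(tokens))
        return self.out(last_hidden[-1])        # classify from the last step's state

model = TextRNN()
print(model(torch.randint(0, vocab_size, (4, 20))).shape)  # (4, 2)
```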
Further readings
Text Books:
● Goldberg, Y. & Hirst, G. (2017). “Neural Network Methods in Natural Language Processing”. Morgan & Claypool Publishers.
● Manning, C.D., & Schütze, H. (1999). “Foundations of Statistical Natural Language Processing”.

Articles:
● Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P.P. (2011). “Natural Language Processing (almost) from Scratch”. Journal of Machine
Learning Research, 12, 2493-2537.

● Goldberg, Y. (2016). “A Primer on Neural Network Models for Natural Language Processing”. J. Artif. Intell. Res., 57, 345-420.

● Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

● Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP) (pp. 1532-1543).

● Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), Vol. 32, pp. II-1188–II-1196. JMLR.org.

51
Questions?

52
