
Word Vectors & Text Classification

Bardia Rafieian
Bardia.rafieian@upc.edu

PhD. Candidate
Universitat Politècnica de Catalunya
Technical University of Catalonia
Outline

● Last session recap
● Word Vectors
● Word2Vec
● GloVe
● Cross-lingual Word Embeddings
● Contextual Embeddings
● Text Classification

2
Last session recap

3
Last session recap: Bag of words

Image from: https://victorzhou.com/blog/bag-of-words/

4
Last session recap: Bag of words

● Sparse representation: most vector values are 0, so the distance
between vectors is not informative.

● Missing order information: the vectors do not capture the order of
the tokens in the sentence.

5
Word Vectors

6
One-hot
● If you have 10,000 words in your vocabulary, then you can
represent each word as a 1x10,000 vector
● Example: 4 words in our vocabulary →
Mango [1, 0, 0, 0]
Strawberry [0, 1, 0, 0]
City [0, 0, 1, 0]
Barcelona [0, 0, 0, 1]

7
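A minimal Python sketch of this idea, using the four-word vocabulary above (purely illustrative):

```python
# One-hot encoding: each word becomes a vector of zeros with a single 1
# at the index assigned to that word in the vocabulary.
vocab = ["Mango", "Strawberry", "City", "Barcelona"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("City"))  # [0, 0, 1, 0]
```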
Some useful definitions
( Term Frequency (TF) — Inverse Document Frequency (IDF) )

TF = (number of repetitions of the word in a document) / (total number of words in the document)

IDF = log[(total number of documents) / (number of documents containing the word)]

TF-IDF = TF × IDF

8
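A minimal Python sketch of these definitions, using a made-up toy corpus of three documents:

```python
import math

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["cats", "and", "dogs"],
]

def tf(word, doc):
    # Repetitions of the word divided by the document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Log of (number of documents / number of documents containing the word).
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("cat", docs[0], docs))  # "cat" is frequent here but rare in the corpus
```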
Word Vectors

● Representations learned from a corpus of text.

● Model the probability of words given the context of a fixed-size window.

● Similarity between vectors can be measured and provides information
about the relationship between words (see the sketch below).

● Once trained, these vectors can be employed as features for other models.

9
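For instance, similarity between word vectors is commonly measured with cosine similarity; a minimal sketch, with made-up 3-dimensional vectors purely for illustration:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, for illustration only.
king = [0.8, 0.65, 0.1]
queen = [0.75, 0.7, 0.2]
apple = [0.1, 0.05, 0.9]

print(cosine_similarity(king, queen))  # high: related words
print(cosine_similarity(king, apple))  # low: unrelated words
```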
Jointly Trained Embeddings
● The embedding table serves as the input layer of the network.

● It is initialized from a random distribution, like any other parameter
of the network.

● The parameters are optimized through back-propagation according to the
loss function of the task.

Figure from: (Collobert et al., 2011)


10
Jointly Trained Embeddings

11
Jointly Trained Embeddings

● As we now use the tokens of the sentence as input, we need a way to
match their lengths and form batches.

● Padding: a special token in our vocabulary that fills the missing
positions until the appropriate sequence length is reached
(see the sketch below).

Example: el gato come . <pad> <pad>  →  [23, 2333, 34, 2, 1, 1]

12
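A minimal Python sketch of padding, assuming (as in the example above) that <pad> has index 1; the other indices are illustrative:

```python
# Pad every sequence of token indices to the length of the longest one
# so that the sequences can be stacked into a single batch.
PAD_INDEX = 1  # index of the <pad> token in our (hypothetical) vocabulary

def pad_batch(sequences, pad_index=PAD_INDEX):
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[23, 2333, 34, 2],                  # "el gato come ."
                   [23, 54, 2333, 765, 34, 2]])
print(batch[0])  # [23, 2333, 34, 2, 1, 1]
```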
Jointly Trained Embeddings
● Our embedding table is the first layer of our model.

● For each token, represented by its index in the vocabulary, it retrieves
the corresponding embedding vector.

● This table is optimized like any other layer of our network by applying
back-propagation (see the sketch below).

13
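A minimal PyTorch sketch of an embedding table as the first layer of a model; the vocabulary size, dimensions and indices are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, pad_index = 10_000, 100, 1

# The embedding table: one trainable vector per vocabulary entry.
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)

# A padded batch of token indices (batch_size=2, seq_len=6).
batch = torch.tensor([[23, 2333, 34, 2, 1, 1],
                      [23, 54, 2333, 765, 34, 2]])

vectors = embedding(batch)   # shape: (2, 6, 100), one vector per token
print(vectors.shape)

# The table is updated by back-propagation like any other layer:
# its weights appear in embedding.parameters() and receive gradients.
```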
Jointly Trained Embeddings

● Embeddings can be combined to form sentence representations,
e.g. by averaging or with an LSTM (see the sketch below).

● They can also be applied to constant-size windows of tokens.

14
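A minimal PyTorch sketch of averaging token embeddings into a sentence representation; masking out <pad> positions is one common choice, not prescribed by the slides:

```python
import torch
import torch.nn as nn

pad_index = 1
embedding = nn.Embedding(10_000, 100, padding_idx=pad_index)
batch = torch.tensor([[23, 2333, 34, 2, 1, 1]])        # one padded sentence

vectors = embedding(batch)                              # (1, 6, 100)
mask = (batch != pad_index).unsqueeze(-1).float()       # 0.0 at <pad> positions
sentence_repr = (vectors * mask).sum(1) / mask.sum(1)   # (1, 100) mean of real tokens
print(sentence_repr.shape)
```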
Jointly Trained Embeddings

● Once embeddings are fully trained, they can be employed as features
for other models, neural or not.

● This is useful when unlabelled data is available.

15
Word Vectors

16
Word2Vec

17
Skip-gram and Continuous Bag of Words
Model (CBOW)
● Skip-gram: the distributed representation of the input word is used
to predict the context.

● CBOW: the distributed representations of the context (the surrounding
words) are used to predict the word in the middle.

Figure from: (Mikolov et al., 2013)

18
Word2Vec: Skip-Gram

19
Word2Vec: Skip-Gram

20
Example
Having the sentence: "The future king is the prince"

● We choose a context window size (w) of 2.
● We eliminate the stop words: (the, is)
● After tokenization, we obtain the data points.
● We create a dictionary of one-hot encodings for each X and Y.

21
Example
● The final sizes of these matrices will be n × m, where
n = number of created data points
m = number of unique words

● We now have the X and Y matrices; we choose the embedding size as a
hyperparameter.

● The hidden layer dimension equals the word embedding dimension, and the
output layer is a softmax. The network itself is linear.

● After training the network, we can read off the weights, which are our
embedding values! (See the sketch below.)
22
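A minimal Python sketch of how these (X, Y) data points could be built, following the pre-processing described above:

```python
sentence = "The future king is the prince"
stop_words = {"the", "is"}

tokens = [w for w in sentence.lower().split() if w not in stop_words]
# tokens: ['future', 'king', 'prince']

window = 2
pairs = []  # (input word X, context word Y) data points for Skip-gram
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

vocab = sorted(set(tokens))
one_hot = {w: [int(w == v) for v in vocab] for w in vocab}

X = [one_hot[x] for x, _ in pairs]   # n x m matrix
Y = [one_hot[y] for _, y in pairs]   # n x m matrix
print(len(pairs), len(vocab))        # n data points, m unique words
```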
More information
Similar context?

I think you could expect that synonyms like “intelligent” and “smart” would have very
similar contexts. Or that words that are related, like “engine” and “transmission”,
would probably have similar contexts as well.

23
Word2Vec: CBOW
● Predicts the current word according to the other words in the window.
● The prediction is based on the average representation of the window's
words.
● As in Skip-gram, no notion of order within the window is used.
● Better for common words in the corpus.

24
Example
Sentence:
‘It is a pleasant day’

● We need (context words) and a (target word).

● If the window size is 2 for the context, the (context, target) pairs are:
([it, a], is), ([is, pleasant], a), ([a, day], pleasant)

25
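A minimal Python sketch that generates exactly these (context, target) pairs:

```python
sentence = "It is a pleasant day"
tokens = sentence.lower().split()

half_window = 1  # a window of 2 context words = 1 word on each side
pairs = []
for i in range(half_window, len(tokens) - half_window):
    context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
    pairs.append((context, tokens[i]))

print(pairs)
# [(['it', 'a'], 'is'), (['is', 'pleasant'], 'a'), (['a', 'day'], 'pleasant')]
```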
● Skip-gram: works well with a small amount of training data and represents
even rare words or phrases well.

● CBOW: several times faster to train than Skip-gram, with slightly better
accuracy for frequent words (see the sketch below).

26
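A minimal sketch with gensim (a library choice assumed here, not prescribed by the slides); the sg flag selects Skip-gram or CBOW:

```python
from gensim.models import Word2Vec

sentences = [["the", "future", "king", "is", "the", "prince"],
             ["it", "is", "a", "pleasant", "day"]]

# sg=1 -> Skip-gram, sg=0 -> CBOW (gensim's default).
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

print(skipgram.wv["king"].shape)              # (100,)
print(cbow.wv.most_similar("king", topn=2))   # nearest neighbours in vector space
```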
GloVe

27
GloVe

● Co-occurrence matrix: for each pair of words (i, j), it counts the
number of times word i appears in the context of word j.

● Captures global relations between words across the whole corpus.

28
GloVe

Co-occurrence Matrix:
The cat sat on mat

Figure from: towardsdatascience.com

29
GloVe
Let’s fill a co-occurrence matrix:

[Figure: a co-occurrence matrix for the sentence above, with the cells left to fill in]

Image source: Data Analytics with Hadoop: An Introduction for Data Scientists

30
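A minimal Python sketch that fills a co-occurrence matrix for the sentence above, assuming a symmetric context window of one word:

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]
window = 1  # one context word on each side

vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}

# X[i][j] = number of times word j appears in the context of word i.
X = [[0] * len(vocab) for _ in vocab]
for i, word in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            X[index[word]][index[tokens[j]]] += 1

for w in vocab:
    print(w, X[index[w]])
```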
GloVe
● Co-occurrence probabilities for ice and steam with probe words
● The ratio is better able to distinguish relevant words (solid and gas) from
irrelevant words (water and fashion)
● It is also better able to discriminate between the two relevant words.

31
GloVe

w_i: vector of the current word
w_j: vector of the context word
X: co-occurrence matrix (X_ij = number of co-occurrences of i and j)
b: bias terms

Final embedding: w_i + w_j

32
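For reference, these symbols appear in GloVe's weighted least-squares objective (Pennington et al., 2014):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```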
GloVe

Advantages over Word2Vec:
● Fast training.
● Scales better to bigger corpora.
● Considers the co-occurrence statistics of words across all sentences to
learn word embeddings.
● Captures the relationships between words like "soccer" and "sport" based
on their overall co-occurrence patterns.

Drawbacks over Word2Vec:
● Higher memory consumption.
● More sensitive to learning-rate initialization.

Word2Vec, in contrast:
● Focuses on the local context of words within a limited window.
● Learns word embeddings based on the specific context words surrounding
each word in the corpus.

33
Cross-lingual Word Embeddings

34
Cross-lingual Word Embeddings

35
Cross-lingual Word Embeddings

36
Cross-lingual Word Embeddings

37
Contextual Embeddings

38
Contextual Embeddings

39
ELMo (Embeddings from Language Models)
● Language Modelling: the task of predicting the next tokens of a
sequence, or the probability of a given sequence occurring in the
language (see the sketch below).

● Contextual Embeddings: vector representations that take into account
the context of the sentence.

40
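A minimal PyTorch sketch of the language-modelling objective (predicting the next token); note this toy model is a single forward LSTM, not ELMo's bidirectional architecture, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10_000, 100, 256

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)                 # next-token logits at each position

model = TinyLM()
tokens = torch.randint(0, vocab_size, (2, 7))
logits = model(tokens[:, :-1])                  # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))
print(loss.item())
```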
ELMo (Embeddings from Language Models)

41
ELMo (Embeddings from Language Models)

42
Document Embeddings

43
Doc2vec
● Represents documents as vectors; it is a generalization of the Word2vec
method.

● The inspiration: word vectors are asked to contribute to predicting the
next word.

● The paragraph vector is likewise asked to contribute to the prediction
of the next word, given many contexts sampled from the paragraph.

● Every paragraph is mapped to a unique vector, represented by a column
in a matrix.

44
Doc2vec

● The algorithm has two key stages:

1) Training: obtain word vectors, softmax weights, and paragraph vectors
on already-seen paragraphs.
2) The "inference stage": obtain paragraph vectors D for new (unseen)
paragraphs (see the sketch below).

45
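A minimal sketch with gensim's Doc2Vec (a library choice assumed here, not prescribed by the slides), showing both stages:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
          TaggedDocument(words=["dogs", "and", "cats", "are", "pets"], tags=["doc1"])]

# Stage 1: train paragraph vectors (and word vectors) on seen paragraphs.
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)

# Stage 2: the inference stage, producing a vector for an unseen paragraph.
new_vector = model.infer_vector(["the", "dog", "sat", "on", "the", "mat"])
print(new_vector.shape)  # (50,)
```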
Text Classification

46
Classification Tasks

Text classifiers can be used to organize, structure, and categorize
pretty much any kind of text.

● Feature: the string representing the input text.
● Target: the text's polarity.
● Additional data could be added, such as Part-of-Speech tags.
(See the sketch below.)

47
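A minimal scikit-learn sketch of such a classifier, using TF-IDF features and logistic regression on a made-up toy dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Feature: the raw input string. Target: its polarity (1 = positive, 0 = negative).
texts = ["I loved this film", "great acting and story",
         "terrible plot", "I hated every minute"]
labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["what a great story"]))  # expected: [1]
```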
NLP Tasks

Image source: Dan Jurafsky & Chris Manning, http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf

48


Convolutional Neural Network (CNN)

● CNN: learns a set of filters that are applied over the input, window by
window, to find local relations.

● Pooling: reduces dimensionality by applying an aggregation over windows
of the data, e.g. maximum or average (see the sketch below).
Source: http://cs231n.github.io/convolutional-networks/

49
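A minimal PyTorch sketch of a CNN text classifier with max pooling; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, n_filters, n_classes = 10_000, 100, 64, 2

class TextCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Filters slide over windows of 3 consecutive tokens to find local relations.
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))              # (batch, n_filters, seq_len)
        x = x.max(dim=2).values                   # max pooling over the sequence
        return self.out(x)                        # class logits

model = TextCNN()
print(model(torch.randint(0, vocab_size, (4, 20))).shape)  # (4, 2)
```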
Recurrent Neural Networks (RNN)
RNNs keep a context of the previous steps of the sequence that
conditions the output at each step.

This is useful for predicting new steps of the sequence, or when only
the last step is passed to the next layer to compute a sequence
representation (see the sketch below).

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
50
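A minimal PyTorch sketch of an LSTM classifier that uses the last step's hidden state as the sequence representation; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, n_classes = 10_000, 100, 128, 2

class TextRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        _, (last_hidden, _) = self.lstm(self.embed(tokens))
        return self.out(last_hidden[-1])        # classify from the last step's state

model = TextRNN()
print(model(torch.randint(0, vocab_size, (4, 20))).shape)  # (4, 2)
```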
Further readings
Text Books:
● Goldberg, Y. & Hirst, G. (2017). “Neural Network Methods in Natural Language Processing”. Morgan & Claypool Publishers.
● Manning, C.D., & Schütze, H. (1999). “Foundations of Statistical Natural Language Processing”.

Articles:
● Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P.P. (2011). “Natural Language Processing (almost) from Scratch”. Journal of Machine
Learning Research, 12, 2493-2537.

● Goldberg, Y. (2016). “A Primer on Neural Network Models for Natural Language Processing”. J. Artif. Intell. Res., 57, 345-420.

● Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

● Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP) (pp. 1532-1543).

● Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), Vol. 32, pp. II-1188–II-1196. JMLR.org.

51
Questions?

52
