
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Overview
deeplearning.ai
Some basic applications of word embeddings

[Figure: word clusters, e.g. {village, city, town, country}, {gas, oil, petroleum}, {happy, sad, joyful}]

● Semantic analogies and similarity
● Sentiment analysis
● Classification of customer feedback
Advanced applications of word embeddings

● Machine translation
● Information extraction
● Question answering


Learning objectives (prerequisite: neural networks)

● Identify the key concepts of word representations

● Generate word embeddings

● Prepare text for machine learning

● Implement the continuous bag-of-words model


Basic Word
Representations
deeplearning.ai
Outline
● Integers

● One-hot vectors

● Word embeddings
Integers
Word Number
a 1
able 2
about 3
... ...
hand 615
… …
happy 621
... ...
zebra 1000
Integers
+ Simple

- Ordering carries little semantic sense: "hand" (615) < "happy" (621) < "zebra" (1000)?!
One-hot vectors
Each word maps to a column vector with one row per vocabulary word (1000 rows here): a 1 in that word's row and 0 everywhere else.

“a”     → 1 in the "a" row, 0 elsewhere
“happy” → 1 in the "happy" row, 0 elsewhere
“zebra” → 1 in the "zebra" row, 0 elsewhere
One-hot vectors

Word      Number    "happy" one-hot vector
a         1         0
able      2         0
about     3         0
...       ...       ⋮
hand      615       0
…         …         ⋮
happy     621       1
...       ...       ⋮
zebra     1000      0
One-hot vectors
+ Simple

+ No implied ordering

- Huge vectors: ~10k to 1M+ rows, one per vocabulary word (from "a" to "Zyzzyva")

- No embedded meaning: all pairs of words are equidistant, e.g. d(paper, excited) = d(paper, happy) = d(excited, happy)
Word
Embeddings
deeplearning.ai
Meaning as vectors

One dimension (negative ↔ positive): rage (-2.52), anger (-2.08), spider (-1.53), boring (-0.91), paper (0.03), kitten (1.09), happy (1.75), excited (2.31)
Meaning as vectors
Two dimensions (x: negative ↔ positive, y: abstract ↔ concrete):
rage (-2.52, -0.54), anger (-2.08, -0.71), spider (-1.53, 0.41), snake (-1.53, 0.41), thought (0.03, -0.93), paper (0.03, 0.79), puppy (0.98, 0.57), kitten (1.09, 0.57), happy (1.75, -0.81), excited (2.31, -0.54)
Word embedding vectors
Example: "happy" → [0.123, -4.059, 1.891, …] with ~100 to ~1000 rows

+ Low dimension (~100 to ~1000 rows)

+ Embed meaning
○ e.g. semantic distance: forest ≈ tree, forest ≉ ticket
○ e.g. analogies: Paris : France :: Rome : ?
Terminology
Integers, one-hot vectors, and word embedding vectors are all word vectors in the broad sense; in practice, the terms "word vectors" and "word embeddings" usually refer to word embedding vectors.
Summary
● Words as integers

● Words as vectors
○ One-hot vectors
○ Word embedding vectors

● Benefits of word embeddings for NLP


How to Create
Word
Embeddings
deeplearning.ai
Word embedding process
● Corpus: general-purpose (e.g. Wikipedia) or specialized (e.g. contracts, law books); provides words in context, i.e. meaning
● Transformation: words → integers, vectors
● Machine learning model: an embedding method with a self-supervised (= unsupervised + supervised) learning task, e.g. predict the missing word in "I think [???] I am"
● Hyperparameters: word embedding size, ...
● Output: word embeddings
Word
Embedding
Methods
deeplearning.ai
Basic word embedding methods
● word2vec (Google, 2013)
○ Continuous bag-of-words (CBOW)
○ Continuous skip-gram / Skip-gram with negative sampling (SGNS)

● Global Vectors (GloVe) (Stanford, 2014)

● fastText (Facebook, 2016)


○ Supports out-of-vocabulary (OOV) words
Advanced word embedding methods
Deep learning, contextual embeddings (tunable pre-trained models available)

● BERT (Google, 2018)

● ELMo (Allen Institute for AI, 2018)

● GPT-2 (OpenAI, 2018)


Continuous
Bag-of-Words
Model
deeplearning.ai
Continuous bag-of-words word embedding process

Corpus → Transformation → CBOW (embedding method) → Word embeddings
Example: "I think [???] I am" → "therefore"
Corpus → Transformation → CBOW

Center word prediction: rationale

The little ? is barking → dog, puppy, hound, terrier, ...
Corpus → Transformation → CBOW

Creating a training example

I am [happy] because I am learning
● Center word: happy
● Context words: I am ... because I
● Context half-size: C = 2
● Window size: 2C + 1 = 5
Corpus → Transformation → CBOW

From corpus to training

Corpus → Transformation → (context words, center word) → CBOW → predicted center word → Word embeddings

I am happy because I am learning

Context words                 Center word
I am because I                happy
am happy I am                 because
happy because am learning     I
CBOW in a nutshell

Source: Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space.
Cleaning and
Tokenization
deeplearning.ai
Cleaning and tokenization matters
● Letter case: “The” == “the” == “THE” → lowercase / upper case

● Punctuation: , ! . ? → .    “ ‘ « » ’ ” → ∅    … !! ??? → .

● Numbers: 1 2 3 5 8 → ∅    3.14159, 90210 → as is / <NUMBER>

● Special characters: ∇ $ € § ¶ ** → ∅

● Special words: 😊 → :happy:    #nlp → #nlp


Example in Python: corpus

Who ❤ "word embeddings" in 2020? I do!!!

(contains an emoji, punctuation, and a number)
Example in Python: libraries

# pip install nltk
# pip install emoji

import re
import nltk
from nltk.tokenize import word_tokenize
import emoji

nltk.download('punkt')  # download pre-trained Punkt tokenizer for English

Example in Python: code

corpus = 'Who ❤ "word embeddings" in 2020? I do!!!'

# replace interrupting punctuation with a period
data = re.sub(r'[,!?;-]+', '.', corpus)
# → 'Who ❤ "word embeddings" in 2020. I do.'

# tokenize the string into words
data = nltk.word_tokenize(data)
# → ['Who', '❤', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']

# keep only alphabetic tokens, periods, and emoji, and lowercase everything
data = [ ch.lower() for ch in data
         if ch.isalpha()
         or ch == '.'
         or emoji.get_emoji_regexp().search(ch)
       ]
# → ['who', '❤', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']


Sliding Window
of Words in
Python
deeplearning.ai
Sliding window of words in Python
def get_windows(words, C):
i = C
while i < len(words) - C:
center_word = words[i]
context_words = words[(i - C):i] + words[(i+1):(i+C+1)]
yield context_words, center_word
i += 1

Indices: I(0) am(1) happy(2) because(3) I(4) am(5) learning(6)
Sliding window of words in Python

for x, y in get_windows(
        ['i', 'am', 'happy', 'because', 'i', 'am', 'learning'],
        2
):
    print(f'{x}\t{y}')

→ ['i', 'am', 'because', 'i']             happy
  ['am', 'happy', 'i', 'am']              because
  ['happy', 'because', 'am', 'learning']  i
Transforming
Words into
Vectors
deeplearning.ai
Transforming center words into vectors
Corpus: I am happy because I am learning
Vocabulary: am, because, happy, I, learning

One-hot vectors (one column per word):

            am   because   happy   I   learning
am           1      0        0     0      0
because      0      1        0     0      0
happy        0      0        1     0      0
I            0      0        0     1      0
learning     0      0        0     0      1
Transforming context words into vectors
Average of the individual one-hot vectors, e.g. for the context "I am because I":

            I    am   because   I         average
am          0    1      0       0         0.25
because     0    0      1       0         0.25
happy       0 +  0 +    0    +  0   /4 =  0
I           1    0      0       1         0.5
learning    0    0      0       0         0
Final prepared training set

Context words Context words vector Center word Center word vector

I am because I [0.25; 0.25; 0; 0.5; 0] happy [0; 0; 1; 0; 0]

am happy I am [0.5; 0; 0.25; 0.25; 0] because [0; 1; 0; 0; 0]

happy because am learning [0.25; 0.25; 0.25; 0; 0.25] I [0; 0; 0; 1; 0]
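A minimal Python sketch of how this training set could be built with NumPy, reusing the get_windows generator from earlier; the word2ind mapping and the helper names below are illustrative, not part of the lesson code:

import numpy as np

word2ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
V = len(word2ind)

def word_to_one_hot_vector(word, word2ind, V):
    # one-hot vector: 1 in the word's row, 0 elsewhere
    one_hot = np.zeros(V)
    one_hot[word2ind[word]] = 1
    return one_hot

def context_words_to_vector(context_words, word2ind, V):
    # average of the context words' one-hot vectors
    vectors = [word_to_one_hot_vector(w, word2ind, V) for w in context_words]
    return np.mean(vectors, axis=0)

for context_words, center_word in get_windows(
        ['i', 'am', 'happy', 'because', 'i', 'am', 'learning'], 2):
    print(context_words_to_vector(context_words, word2ind, V),
          word_to_one_hot_vector(center_word, word2ind, V))
# first pair → [0.25 0.25 0.   0.5  0.  ] [0. 0. 1. 0. 0.]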


Architecture
of the CBOW
Model
deeplearning.ai
Architecture of the CBOW model
Hyperparameters: N = word embedding size, ...
Example: "I am happy because I am learning" → V = 5

Input layer: context words vector x (V rows)
Hidden layer: h = ReLU(W1 x + b1), with weights W1 and biases b1 (N rows)
Output layer: ŷ = softmax(W2 h + b2), the predicted center word vector, with weights W2 and biases b2 (V rows)
Architecture
of the CBOW
Model:
deeplearning.ai
Dimensions
Dimensions (single input)

x:  V x 1
W1: N x V    b1: N x 1    z1 = W1 x + b1    N x 1
                          h  = ReLU(z1)     N x 1
W2: V x N    b2: V x 1    z2 = W2 h + b2    V x 1
                          ŷ  = softmax(z2)  V x 1
Dimensions (single input)

Column vectors: z1 = W1 x + b1, with W1: N x V, x: V x 1, b1: N x 1, z1: N x 1

Row vectors: z1 = x W1ᵀ + b1, with x: 1 x V, W1ᵀ: V x N, b1: 1 x N, z1: 1 x N
Architecture of
the CBOW
Model:
deeplearning.ai
Dimensions 2
Dimensions (batch input)

Context words vectors are stacked as columns of a matrix X = [x(1) … x(m)] (V x m), and the biases are broadcast across the batch: b1 → B1 = [b1 … b1] (N x m), b2 → B2 = [b2 … b2] (V x m).

X:  V x m
W1: N x V    B1: N x m    Z1 = W1 X + B1    N x m
                          H  = ReLU(Z1)     N x m
W2: V x N    B2: V x m    Z2 = W2 H + B2    V x m
                          Ŷ  = softmax(Z2)  V x m
Dimensions (batch input)

X = [x(1) … x(m)] is the context words matrix (V x m), Ŷ = [ŷ(1) … ŷ(m)] is the predicted center word matrix (V x m), and H is N x m.
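A quick NumPy shape check of the batch equations above; the random initialization is only an assumption for illustration (the lesson does not prescribe an initializer):

import numpy as np

V, N, m = 5, 3, 4                    # vocabulary size, embedding size, batch size (example values)

W1 = np.random.rand(N, V); b1 = np.random.rand(N, 1)
W2 = np.random.rand(V, N); b2 = np.random.rand(V, 1)
X = np.random.rand(V, m)             # batch of context words vectors as columns

Z1 = W1 @ X + b1                     # b1 (N x 1) broadcasts to B1 (N x m)
H = np.maximum(0, Z1)                # ReLU
Z2 = W2 @ H + b2                     # b2 (V x 1) broadcasts to B2 (V x m)
print(Z1.shape, H.shape, Z2.shape)   # (3, 4) (3, 4) (5, 4)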
Architecture
of the CBOW
Model
deeplearning.ai
Activation Functions
Rectified Linear Unit (ReLU)
ReLU(x) = max(0, x)

Hidden layer: z1 = W1 x + b1,  h = ReLU(z1)

Example: z1 = [5.1, -0.3, -4.6, 0.2]  →  h = ReLU(z1) = [5.1, 0, 0, 0.2]
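The same example in NumPy (a one-line ReLU):

import numpy as np

z1 = np.array([5.1, -0.3, -4.6, 0.2])
h = np.maximum(0, z1)   # ReLU zeroes out the negative entries
print(h)                # [5.1 0.  0.  0.2]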
Softmax
softmax maps z ∈ ℝ^V to ŷ ∈ [0, 1]^V with Σ ŷᵢ = 1, so the entries can be read as probabilities.

Output layer: z = W2 h + b2,  ŷ = softmax(z)

ŷᵢ = exp(zᵢ) / Σⱼ exp(zⱼ) is the probability of vocabulary word i (a, …, happy, …, zebra) being the center word.
Softmax: example

word        z     exp(z)      ŷ = softmax(z)
am          9     8103        0.083
because     8     2981        0.03
happy       11    59874       0.612   ← predicted center word
I           10    22026       0.225
learning    8.5   4915        0.05
                  Σ = 97899   Σ = 1
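A NumPy sketch that reproduces this example (in practice, subtracting max(z) before exponentiating is common for numerical stability):

import numpy as np

def softmax(z):
    e_z = np.exp(z)            # exponentiate each score
    return e_z / np.sum(e_z)   # normalize so the entries sum to 1

z = np.array([9, 8, 11, 10, 8.5])   # am, because, happy, I, learning
y_hat = softmax(z)
print(np.round(y_hat, 3))           # [0.083 0.03  0.612 0.225 0.05 ]
print(np.argmax(y_hat))             # 2 → 'happy' is the predicted center word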
Training a
CBOW Model
deeplearning.ai
Cost Function
Loss

x (context words vector) → CBOW model → ŷ (predicted center word vector), compared with y (actual center word vector)
Input → machine learning model → prediction vs. truth

The loss measures the prediction error; training adjusts the parameters W1, b1, W2, b2 to minimize it. Loss function: cross-entropy loss.
Cross-entropy loss
y = [y1, …, yV] (actual), ŷ = [ŷ1, …, ŷV] (predicted)
J = -Σₖ yₖ log(ŷₖ)

Example: center word "happy" in "I am happy because I am learning"

word        y    ŷ       log(ŷ)   y ⊙ log(ŷ)
am          0    0.083   -2.49    0
because     0    0.03    -3.49    0
happy       1    0.611   -0.49    -0.49
I           0    0.225   -1.49    0
learning    0    0.05    -2.49    0

J = -Σ y ⊙ log(ŷ) = 0.49
Cross-entropy loss (poor prediction)

word        y    ŷ       log(ŷ)   y ⊙ log(ŷ)
am          0    0.96    -0.04    0
because     0    0.01    -4.61    0
happy       1    0.01    -4.61    -4.61
I           0    0.01    -4.61    0
learning    0    0.01    -4.61    0

J = 4.61  >  J(correct prediction) = 0.49
Cross-entropy loss
With a one-hot y, the loss reduces to J = -log(ŷ of the actual center word).

word        y    ŷ
am          0    0.96
because     0    0.01
happy       1    0.01   → J = 4.61
I           0    0.01
learning    0    0.01

Incorrect predictions are penalized with a large loss; correct predictions are rewarded with a small loss.
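A minimal NumPy sketch of the loss, reproducing the two examples above:

import numpy as np

def cross_entropy_loss(y_hat, y):
    # J = -Σ y ⊙ log(ŷ); with a one-hot y this is -log(ŷ of the actual word)
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 0, 1, 0, 0])                                               # actual center word: happy
print(cross_entropy_loss(np.array([0.083, 0.03, 0.611, 0.225, 0.05]), y))   # ≈ 0.49
print(cross_entropy_loss(np.array([0.96, 0.01, 0.01, 0.01, 0.01]), y))      # ≈ 4.61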
Training a
CBOW Model
deeplearning.ai
Forward Propagation
Training process
● Forward propagation

● Cost

● Backpropagation and gradient descent


Forward propagation

Z1 = W1 X + B1       H = ReLU(Z1)
Z2 = W2 H + B2       Ŷ = softmax(Z2)

Input layer: context words matrix X → Hidden layer: H → Output layer: predicted center word matrix Ŷ = [ŷ(1) … ŷ(m)]
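A minimal NumPy sketch of this forward pass; the column-wise softmax is the batch version of the earlier single-vector softmax:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e_z = np.exp(z)
    return e_z / np.sum(e_z, axis=0, keepdims=True)   # each column sums to 1

def forward_prop(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1            # b1 broadcasts across the m columns
    H = relu(Z1)
    Z2 = W2 @ H + b2
    Y_hat = softmax(Z2)
    return Y_hat, H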
Cost
Cost: mean of the per-example losses over the batch, J_batch = (1/m) Σᵢ J(i)

Ŷ = [ŷ(1) … ŷ(m)] is the predicted center word matrix; Y = [y(1) … y(m)] is the actual center word matrix.
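As a NumPy sketch, the batch cost is just the average of the per-column cross-entropy losses:

import numpy as np

def compute_cost(Y_hat, Y):
    m = Y.shape[1]
    losses = -np.sum(Y * np.log(Y_hat), axis=0)   # cross-entropy loss of each example
    return np.sum(losses) / m                     # mean over the batch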


Training a
CBOW Model
deeplearning.ai
Backpropagation and
Gradient Descent
Minimizing the cost
● Backpropagation: calculate partial derivatives of cost with respect to
weights and biases

● Gradient descent: update weights and biases


Gradient descent
Hyperparameter: learning rate ⍺
Update each parameter in the direction that reduces the cost, e.g. W1 := W1 - ⍺ ∂J/∂W1 (and similarly for b1, W2, b2).
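The slides do not spell out the gradients; the sketch below uses the standard derivation for this softmax + cross-entropy + ReLU architecture, followed by a plain gradient descent update:

import numpy as np

def back_prop(X, Y_hat, Y, H, W2):
    m = X.shape[1]
    dZ2 = Y_hat - Y                                     # gradient at the output (V x m)
    dW2 = (1 / m) * dZ2 @ H.T                           # V x N
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # V x 1
    dZ1 = (W2.T @ dZ2) * (H > 0)                        # ReLU gradient is 0 where H == 0
    dW1 = (1 / m) * dZ1 @ X.T                           # N x V
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)  # N x 1
    return dW1, db1, dW2, db2

def gradient_descent_step(W1, b1, W2, b2, grads, alpha):
    dW1, db1, dW2, db2 = grads
    # update each parameter: θ := θ - ⍺ ∂J/∂θ
    return W1 - alpha * dW1, b1 - alpha * db1, W2 - alpha * dW2, b2 - alpha * db2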
Extracting Word
Embedding
Vectors
deeplearning.ai
Extracting word embedding vectors: option 1
Use each column of W1 as the embedding vector of the corresponding vocabulary word:
W1 = [w(1) … w(V)]  (N x V), one column per word (am, because, happy, I, learning).
Extracting word embedding vectors: option 2
Use each row of W2 as the embedding vector of the corresponding vocabulary word:
W2 = [w(1); …; w(V)]  (V x N), one row per word (am, because, happy, I, learning).
Extracting word embedding vectors: option 3
Average the two:
W3 = 0.5 (W1 + W2ᵀ) = [w3(1) … w3(V)]  (N x V), one column per word (am, because, happy, I, learning).
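A sketch of the three options in NumPy; it assumes W1 (N x V) and W2 (V x N) come from the trained model and word2ind is the vocabulary mapping used earlier:

import numpy as np

def extract_embeddings(W1, W2, option=3):
    # option 1: columns of W1; option 2: rows of W2 (transposed to columns); option 3: their average
    if option == 1:
        return W1                 # N x V
    if option == 2:
        return W2.T               # N x V
    return 0.5 * (W1 + W2.T)      # N x V

# embeddings = extract_embeddings(W1, W2)
# happy_vector = embeddings[:, word2ind['happy']]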
Evaluating
Word
Embeddings
deeplearning.ai
Intrinsic Evaluation
Intrinsic evaluation
Test relationships between words
● Analogies
Semantic analogies
“France” is to “Paris” as “Italy” is to <?>

Syntactic analogies
“seen” is to “saw” as “been” is to <?>

⚡ Ambiguity
“wolf” is to “pack” as “bee” is to <?> → swarm? colony?
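One common way to test an analogy such as “France” is to “Paris” as “Italy” is to <?> is to search for the word whose embedding is closest, by cosine similarity, to v(Paris) - v(France) + v(Italy). A sketch, assuming an embeddings dictionary that maps words to NumPy vectors:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def solve_analogy(a, b, c, embeddings):
    # "a is to b as c is to ?": find the word closest to v(b) - v(a) + v(c)
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# solve_analogy('france', 'paris', 'italy', embeddings)  # ideally → 'rome'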
Intrinsic evaluation
Test relationships between words

● Analogies

● Clustering

● Visualization (e.g. clusters such as {village, city, town, country}, {gas, oil, petroleum}, {happy, sad, joyful})

Source: Zhai, M., Tan, J., & Choi, J. D. (2016). Intrinsic and Extrinsic Evaluations of Word Embeddings.
Evaluating
Word
Embeddings
deeplearning.ai
Extrinsic Evaluation
Extrinsic evaluation
Test word embeddings on an external task,
e.g. named entity recognition, part-of-speech tagging

Named entity example: "Andrew works at deeplearning.ai" → Andrew (person), deeplearning.ai (organization)
Extrinsic evaluation
Test word embeddings on external task
e.g. named entity recognition, part-of-speech tagging

+ Evaluates actual usefulness of embeddings

- Time-consuming

- More difficult to troubleshoot


Conclusion
deeplearning.ai
Recap and assignment
● Data preparation

● Word representations

● Continuous bag-of-words model

● Evaluation
Going further
● Advanced language modelling and word embeddings

● NLP and machine learning libraries


Keras
from keras.layers.embeddings import Embedding
embed_layer = Embedding(10000, 400)   # vocabulary size 10,000; embedding size 400

PyTorch
import torch.nn as nn
embed_layer = nn.Embedding(10000, 400)
