
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Overview
deeplearning.ai
Some basic applications of word embeddings

[Figure: word clusters, e.g. {village, city, town, country}, {gas, oil, petroleum}, {happy, sad, joyful}]

● Semantic analogies and similarity
● Sentiment analysis
● Classification of customer feedback
Advanced applications of word embeddings

● Machine translation
● Information extraction
● Question answering


Learning objectives (prerequisite: neural networks)

● Identify the key concepts of word representations

● Generate word embeddings

● Prepare text for machine learning

● Implement the continuous bag-of-words model


Basic Word
Representations
deeplearning.ai
Outline
● Integers

● One-hot vectors

● Word embeddings
Integers
Word Number
a 1
able 2
about 3
... ...
hand 615
… …
happy 621
... ...
zebra 1000
Integers
+ Simple

- Ordering carries little semantic sense: "hand" (615) < "happy" (621) < "zebra" (1000)?!
One-hot vectors
Each word maps to a column vector with one row per vocabulary word (1000 rows here): a 1 in that word's row and 0 everywhere else.

“a”     → 1 in the "a" row, 0 elsewhere
“happy” → 1 in the "happy" row, 0 elsewhere
“zebra” → 1 in the "zebra" row, 0 elsewhere
One-hot vectors

Word      Number    "happy" one-hot vector
a         1         0
able      2         0
about     3         0
...       ...       ⋮
hand      615       0
…         …         ⋮
happy     621       1
...       ...       ⋮
zebra     1000      0
One-hot vectors
+ Simple

+ No implied ordering

- Huge vectors: ~10k to 1M+ rows, one per vocabulary word (from "a" to "Zyzzyva")

- No embedded meaning: all pairs of words are equidistant, e.g. d(paper, excited) = d(paper, happy) = d(excited, happy)
Word
Embeddings
deeplearning.ai
Meaning as vectors

One dimension (negative ↔ positive): rage (-2.52), anger (-2.08), spider (-1.53), boring (-0.91), paper (0.03), kitten (1.09), happy (1.75), excited (2.31)
Meaning as vectors
Two dimensions (x: negative ↔ positive, y: abstract ↔ concrete):
rage (-2.52, -0.54), anger (-2.08, -0.71), spider (-1.53, 0.41), snake (-1.53, 0.41), thought (0.03, -0.93), paper (0.03, 0.79), puppy (0.98, 0.57), kitten (1.09, 0.57), happy (1.75, -0.81), excited (2.31, -0.54)
Word embedding vectors
Example: "happy" → [0.123, -4.059, 1.891, …] with ~100 to ~1000 rows

+ Low dimension (~100 to ~1000 rows)

+ Embed meaning
○ e.g. semantic distance: forest ≈ tree, forest ≉ ticket
○ e.g. analogies: Paris : France :: Rome : ?
Terminology
Integers, one-hot vectors, and word embedding vectors are all word vectors in the broad sense; in practice, the terms "word vectors" and "word embeddings" usually refer to word embedding vectors.
Summary
● Words as integers

● Words as vectors
○ One-hot vectors
○ Word embedding vectors

● Benefits of word embeddings for NLP


How to Create
Word
Embeddings
deeplearning.ai
Word embedding process
● Corpus: general-purpose (e.g. Wikipedia) or specialized (e.g. contracts, law books); provides words in context, i.e. meaning
● Transformation: words → integers, vectors
● Machine learning model: an embedding method with a self-supervised (= unsupervised + supervised) learning task, e.g. predict the missing word in "I think [???] I am"
● Hyperparameters: word embedding size, ...
● Output: word embeddings
Word
Embedding
Methods
deeplearning.ai
Basic word embedding methods
● word2vec (Google, 2013)
○ Continuous bag-of-words (CBOW)
○ Continuous skip-gram / Skip-gram with negative sampling (SGNS)

● Global Vectors (GloVe) (Stanford, 2014)

● fastText (Facebook, 2016)


○ Supports out-of-vocabulary (OOV) words
Advanced word embedding methods
Deep learning, contextual embeddings (tunable pre-trained models available)

● BERT (Google, 2018)

● ELMo (Allen Institute for AI, 2018)

● GPT-2 (OpenAI, 2018)


Continuous
Bag-of-Words
Model
deeplearning.ai
Continuous bag-of-words word embedding process

Corpus → Transformation → CBOW (embedding method) → Word embeddings
Example: "I think [???] I am" → "therefore"
Corpus → Transformation → CBOW

Center word prediction: rationale

The little ? is barking → dog, puppy, hound, terrier, ...
Corpus → Transformation → CBOW

Creating a training example

I am [happy] because I am learning
● Center word: happy
● Context words: I am ... because I
● Context half-size: C = 2
● Window size: 2C + 1 = 5
Corpus → Transformation → CBOW

From corpus to training

Corpus → Transformation → (context words, center word) → CBOW → predicted center word → Word embeddings

I am happy because I am learning

Context words                 Center word
I am because I                happy
am happy I am                 because
happy because am learning     I
CBOW in a nutshell

Source: Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space.
Cleaning and
Tokenization
deeplearning.ai
Cleaning and tokenization matters
● Letter case: “The” == “the” == “THE” → lowercase / upper case

● Punctuation: , ! . ? → .    “ ‘ « » ’ ” → ∅    … !! ??? → .

● Numbers: 1 2 3 5 8 → ∅    3.14159, 90210 → as is / <NUMBER>

● Special characters: ∇ $ € § ¶ ** → ∅

● Special words: 😊 → :happy:    #nlp → #nlp


Example in Python: corpus

Who ❤ "word embeddings" in 2020? I do!!!

(contains an emoji, punctuation, and a number)
Example in Python: libraries

# pip install nltk
# pip install emoji

import re
import nltk
from nltk.tokenize import word_tokenize
import emoji

nltk.download('punkt')  # download pre-trained Punkt tokenizer for English

Example in Python: code

corpus = 'Who ❤ "word embeddings" in 2020? I do!!!'

# replace interrupting punctuation with a period
data = re.sub(r'[,!?;-]+', '.', corpus)
# → 'Who ❤ "word embeddings" in 2020. I do.'

# tokenize the string into words
data = nltk.word_tokenize(data)
# → ['Who', '❤', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']

# keep only alphabetic tokens, periods, and emoji, and lowercase everything
data = [ ch.lower() for ch in data
         if ch.isalpha()
         or ch == '.'
         or emoji.get_emoji_regexp().search(ch)
       ]
# → ['who', '❤', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']


Sliding Window
of Words in
Python
deeplearning.ai
Sliding window of words in Python
def get_windows(words, C):
i = C
while i < len(words) - C:
center_word = words[i]
context_words = words[(i - C):i] + words[(i+1):(i+C+1)]
yield context_words, center_word
i += 1

Indices: I(0) am(1) happy(2) because(3) I(4) am(5) learning(6)
Sliding window of words in Python

for x, y in get_windows(
        ['i', 'am', 'happy', 'because', 'i', 'am', 'learning'],
        2
):
    print(f'{x}\t{y}')

→ ['i', 'am', 'because', 'i']             happy
  ['am', 'happy', 'i', 'am']              because
  ['happy', 'because', 'am', 'learning']  i
Transforming
Words into
Vectors
deeplearning.ai
Transforming center words into vectors
Corpus: I am happy because I am learning
Vocabulary: am, because, happy, I, learning

One-hot vectors (one column per word):

            am   because   happy   I   learning
am           1      0        0     0      0
because      0      1        0     0      0
happy        0      0        1     0      0
I            0      0        0     1      0
learning     0      0        0     0      1
Transforming context words into vectors
Average of the individual one-hot vectors, e.g. for the context "I am because I":

            I    am   because   I         average
am          0    1      0       0         0.25
because     0    0      1       0         0.25
happy       0 +  0 +    0    +  0   /4 =  0
I           1    0      0       1         0.5
learning    0    0      0       0         0
Final prepared training set

Context words Context words vector Center word Center word vector

I am because I [0.25; 0.25; 0; 0.5; 0] happy [0; 0; 1; 0; 0]

am happy I am [0.5; 0; 0.25; 0.25; 0] because [0; 1; 0; 0; 0]

happy because am learning [0.25; 0.25; 0.25; 0; 0.25] I [0; 0; 0; 1; 0]
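A minimal Python sketch of how this training set could be built with NumPy, reusing the get_windows generator from earlier; the word2ind mapping and the helper names below are illustrative, not part of the lesson code:

import numpy as np

word2ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
V = len(word2ind)

def word_to_one_hot_vector(word, word2ind, V):
    # one-hot vector: 1 in the word's row, 0 elsewhere
    one_hot = np.zeros(V)
    one_hot[word2ind[word]] = 1
    return one_hot

def context_words_to_vector(context_words, word2ind, V):
    # average of the context words' one-hot vectors
    vectors = [word_to_one_hot_vector(w, word2ind, V) for w in context_words]
    return np.mean(vectors, axis=0)

for context_words, center_word in get_windows(
        ['i', 'am', 'happy', 'because', 'i', 'am', 'learning'], 2):
    print(context_words_to_vector(context_words, word2ind, V),
          word_to_one_hot_vector(center_word, word2ind, V))
# first pair → [0.25 0.25 0.   0.5  0.  ] [0. 0. 1. 0. 0.]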


Architecture
of the CBOW
Model
deeplearning.ai
Architecture of the CBOW model
Hyperparameters: N = word embedding size, ...
Example: "I am happy because I am learning" → V = 5

Input layer: context words vector x (V rows)
Hidden layer: h = ReLU(W1 x + b1), with weights W1 and biases b1 (N rows)
Output layer: ŷ = softmax(W2 h + b2), the predicted center word vector, with weights W2 and biases b2 (V rows)
Architecture
of the CBOW
Model:
deeplearning.ai
Dimensions
Dimensions (single input)

x:  V x 1
W1: N x V    b1: N x 1    z1 = W1 x + b1    N x 1
                          h  = ReLU(z1)     N x 1
W2: V x N    b2: V x 1    z2 = W2 h + b2    V x 1
                          ŷ  = softmax(z2)  V x 1
Dimensions (single input)

Column vectors: z1 = W1 x + b1, with W1: N x V, x: V x 1, b1: N x 1, z1: N x 1

Row vectors: z1 = x W1ᵀ + b1, with x: 1 x V, W1ᵀ: V x N, b1: 1 x N, z1: 1 x N
Architecture of
the CBOW
Model:
deeplearning.ai
Dimensions 2
Dimensions (batch input)

Context words vectors are stacked as columns of a matrix X = [x(1) … x(m)] (V x m), and the biases are broadcast across the batch: b1 → B1 = [b1 … b1] (N x m), b2 → B2 = [b2 … b2] (V x m).

X:  V x m
W1: N x V    B1: N x m    Z1 = W1 X + B1    N x m
                          H  = ReLU(Z1)     N x m
W2: V x N    B2: V x m    Z2 = W2 H + B2    V x m
                          Ŷ  = softmax(Z2)  V x m
Dimensions (batch input)

X = [x(1) … x(m)] is the context words matrix (V x m), Ŷ = [ŷ(1) … ŷ(m)] is the predicted center word matrix (V x m), and H is N x m.
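A quick NumPy shape check of the batch equations above; the random initialization is only an assumption for illustration (the lesson does not prescribe an initializer):

import numpy as np

V, N, m = 5, 3, 4                    # vocabulary size, embedding size, batch size (example values)

W1 = np.random.rand(N, V); b1 = np.random.rand(N, 1)
W2 = np.random.rand(V, N); b2 = np.random.rand(V, 1)
X = np.random.rand(V, m)             # batch of context words vectors as columns

Z1 = W1 @ X + b1                     # b1 (N x 1) broadcasts to B1 (N x m)
H = np.maximum(0, Z1)                # ReLU
Z2 = W2 @ H + b2                     # b2 (V x 1) broadcasts to B2 (V x m)
print(Z1.shape, H.shape, Z2.shape)   # (3, 4) (3, 4) (5, 4)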
Architecture
of the CBOW
Model
deeplearning.ai
Activation Functions
Rectified Linear Unit (ReLU)
ReLU(x) = max(0, x)

Hidden layer: z1 = W1 x + b1,  h = ReLU(z1)

Example: z1 = [5.1, -0.3, -4.6, 0.2]  →  h = ReLU(z1) = [5.1, 0, 0, 0.2]
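The same example in NumPy (a one-line ReLU):

import numpy as np

z1 = np.array([5.1, -0.3, -4.6, 0.2])
h = np.maximum(0, z1)   # ReLU zeroes out the negative entries
print(h)                # [5.1 0.  0.  0.2]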
Softmax
softmax maps z ∈ ℝ^V to ŷ ∈ [0, 1]^V with Σ ŷᵢ = 1, so the entries can be read as probabilities.

Output layer: z = W2 h + b2,  ŷ = softmax(z)

ŷᵢ = exp(zᵢ) / Σⱼ exp(zⱼ) is the probability of vocabulary word i (a, …, happy, …, zebra) being the center word.
Softmax: example

word        z     exp(z)      ŷ = softmax(z)
am          9     8103        0.083
because     8     2981        0.03
happy       11    59874       0.612   ← predicted center word
I           10    22026       0.225
learning    8.5   4915        0.05
                  Σ = 97899   Σ = 1
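A NumPy sketch that reproduces this example (in practice, subtracting max(z) before exponentiating is common for numerical stability):

import numpy as np

def softmax(z):
    e_z = np.exp(z)            # exponentiate each score
    return e_z / np.sum(e_z)   # normalize so the entries sum to 1

z = np.array([9, 8, 11, 10, 8.5])   # am, because, happy, I, learning
y_hat = softmax(z)
print(np.round(y_hat, 3))           # [0.083 0.03  0.612 0.225 0.05 ]
print(np.argmax(y_hat))             # 2 → 'happy' is the predicted center word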
Training a
CBOW Model
deeplearning.ai
Cost Function
Loss

x (context words vector) → CBOW model → ŷ (predicted center word vector), compared with y (actual center word vector)
Input → machine learning model → prediction vs. truth

The loss measures the prediction error; training adjusts the parameters W1, b1, W2, b2 to minimize it. Loss function: cross-entropy loss.
Cross-entropy loss
y = [y1, …, yV] (actual), ŷ = [ŷ1, …, ŷV] (predicted)
J = -Σₖ yₖ log(ŷₖ)

Example: center word "happy" in "I am happy because I am learning"

word        y    ŷ       log(ŷ)   y ⊙ log(ŷ)
am          0    0.083   -2.49    0
because     0    0.03    -3.49    0
happy       1    0.611   -0.49    -0.49
I           0    0.225   -1.49    0
learning    0    0.05    -2.49    0

J = -Σ y ⊙ log(ŷ) = 0.49
Cross-entropy loss (poor prediction)

word        y    ŷ       log(ŷ)   y ⊙ log(ŷ)
am          0    0.96    -0.04    0
because     0    0.01    -4.61    0
happy       1    0.01    -4.61    -4.61
I           0    0.01    -4.61    0
learning    0    0.01    -4.61    0

J = 4.61  >  J(correct prediction) = 0.49
Cross-entropy loss
With a one-hot y, the loss reduces to J = -log(ŷ of the actual center word).

word        y    ŷ
am          0    0.96
because     0    0.01
happy       1    0.01   → J = 4.61
I           0    0.01
learning    0    0.01

Incorrect predictions are penalized with a large loss; correct predictions are rewarded with a small loss.
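A minimal NumPy sketch of the loss, reproducing the two examples above:

import numpy as np

def cross_entropy_loss(y_hat, y):
    # J = -Σ y ⊙ log(ŷ); with a one-hot y this is -log(ŷ of the actual word)
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 0, 1, 0, 0])                                               # actual center word: happy
print(cross_entropy_loss(np.array([0.083, 0.03, 0.611, 0.225, 0.05]), y))   # ≈ 0.49
print(cross_entropy_loss(np.array([0.96, 0.01, 0.01, 0.01, 0.01]), y))      # ≈ 4.61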
Training a
CBOW Model
deeplearning.ai
Forward Propagation
Training process
● Forward propagation

● Cost

● Backpropagation and gradient descent


Forward propagation

Z1 = W1 X + B1       H = ReLU(Z1)
Z2 = W2 H + B2       Ŷ = softmax(Z2)

Input layer: context words matrix X → Hidden layer: H → Output layer: predicted center word matrix Ŷ = [ŷ(1) … ŷ(m)]
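A minimal NumPy sketch of this forward pass; the column-wise softmax is the batch version of the earlier single-vector softmax:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e_z = np.exp(z)
    return e_z / np.sum(e_z, axis=0, keepdims=True)   # each column sums to 1

def forward_prop(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1            # b1 broadcasts across the m columns
    H = relu(Z1)
    Z2 = W2 @ H + b2
    Y_hat = softmax(Z2)
    return Y_hat, H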
Cost
Cost: mean of the per-example losses over the batch, J_batch = (1/m) Σᵢ J(i)

Ŷ = [ŷ(1) … ŷ(m)] is the predicted center word matrix; Y = [y(1) … y(m)] is the actual center word matrix.
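As a NumPy sketch, the batch cost is just the average of the per-column cross-entropy losses:

import numpy as np

def compute_cost(Y_hat, Y):
    m = Y.shape[1]
    losses = -np.sum(Y * np.log(Y_hat), axis=0)   # cross-entropy loss of each example
    return np.sum(losses) / m                     # mean over the batch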


Training a
CBOW Model
deeplearning.ai
Backpropagation and
Gradient Descent
Minimizing the cost
● Backpropagation: calculate partial derivatives of cost with respect to
weights and biases

● Gradient descent: update weights and biases


Gradient descent
Hyperparameter: learning rate ⍺
Update each parameter in the direction that reduces the cost, e.g. W1 := W1 - ⍺ ∂J/∂W1 (and similarly for b1, W2, b2).
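The slides do not spell out the gradients; the sketch below uses the standard derivation for this softmax + cross-entropy + ReLU architecture, followed by a plain gradient descent update:

import numpy as np

def back_prop(X, Y_hat, Y, H, W2):
    m = X.shape[1]
    dZ2 = Y_hat - Y                                     # gradient at the output (V x m)
    dW2 = (1 / m) * dZ2 @ H.T                           # V x N
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # V x 1
    dZ1 = (W2.T @ dZ2) * (H > 0)                        # ReLU gradient is 0 where H == 0
    dW1 = (1 / m) * dZ1 @ X.T                           # N x V
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)  # N x 1
    return dW1, db1, dW2, db2

def gradient_descent_step(W1, b1, W2, b2, grads, alpha):
    dW1, db1, dW2, db2 = grads
    # update each parameter: θ := θ - ⍺ ∂J/∂θ
    return W1 - alpha * dW1, b1 - alpha * db1, W2 - alpha * dW2, b2 - alpha * db2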
Extracting Word
Embedding
Vectors
deeplearning.ai
Extracting word embedding vectors: option 1
Use each column of W1 as the embedding vector of the corresponding vocabulary word:
W1 = [w(1) … w(V)]  (N x V), one column per word (am, because, happy, I, learning).
Extracting word embedding vectors: option 2
Use each row of W2 as the embedding vector of the corresponding vocabulary word:
W2 = [w(1); …; w(V)]  (V x N), one row per word (am, because, happy, I, learning).
Extracting word embedding vectors: option 3
Average the two:
W3 = 0.5 (W1 + W2ᵀ) = [w3(1) … w3(V)]  (N x V), one column per word (am, because, happy, I, learning).
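A sketch of the three options in NumPy; it assumes W1 (N x V) and W2 (V x N) come from the trained model and word2ind is the vocabulary mapping used earlier:

import numpy as np

def extract_embeddings(W1, W2, option=3):
    # option 1: columns of W1; option 2: rows of W2 (transposed to columns); option 3: their average
    if option == 1:
        return W1                 # N x V
    if option == 2:
        return W2.T               # N x V
    return 0.5 * (W1 + W2.T)      # N x V

# embeddings = extract_embeddings(W1, W2)
# happy_vector = embeddings[:, word2ind['happy']]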
Evaluating
Word
Embeddings
deeplearning.ai
Intrinsic Evaluation
Intrinsic evaluation
Test relationships between words
● Analogies
Semantic analogies
“France” is to “Paris” as “Italy” is to <?>

Syntactic analogies
“seen” is to “saw” as “been” is to <?>

⚡ Ambiguity
“wolf” is to “pack” as “bee” is to <?> → swarm? colony?
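One common way to test an analogy such as “France” is to “Paris” as “Italy” is to <?> is to search for the word whose embedding is closest, by cosine similarity, to v(Paris) - v(France) + v(Italy). A sketch, assuming an embeddings dictionary that maps words to NumPy vectors:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def solve_analogy(a, b, c, embeddings):
    # "a is to b as c is to ?": find the word closest to v(b) - v(a) + v(c)
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# solve_analogy('france', 'paris', 'italy', embeddings)  # ideally → 'rome'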
Intrinsic evaluation
Test relationships between words

● Analogies

● Clustering

● Visualization (e.g. clusters such as {village, city, town, country}, {gas, oil, petroleum}, {happy, sad, joyful})

Source: Zhai, M., Tan, J., & Choi, J. D. (2016). Intrinsic and Extrinsic Evaluations of Word Embeddings.
Evaluating
Word
Embeddings
deeplearning.ai
Extrinsic Evaluation
Extrinsic evaluation
Test word embeddings on an external task,
e.g. named entity recognition, part-of-speech tagging

Named entity example: "Andrew works at deeplearning.ai" → Andrew (person), deeplearning.ai (organization)
Extrinsic evaluation
Test word embeddings on external task
e.g. named entity recognition, part-of-speech tagging

+ Evaluates actual usefulness of embeddings

- Time-consuming

- More difficult to troubleshoot


Conclusion
deeplearning.ai
Recap and assignment
● Data preparation

● Word representations

● Continuous bag-of-words model

● Evaluation
Going further
● Advanced language modelling and word embeddings

● NLP and machine learning libraries


Keras
from keras.layers.embeddings import Embedding
embed_layer = Embedding(10000, 400)   # vocabulary size 10,000; embedding size 400

PyTorch
import torch.nn as nn
embed_layer = nn.Embedding(10000, 400)
