
Lecture 7: Word Embeddings

Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16

This lecture
v Learning word vectors (Cont.)

v Representation learning in NLP

Recap: Latent Semantic Analysis

v Data representation
v Encode single-relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Synonyms (e.g., from a thesaurus)

v Factorization
v Apply SVD to the matrix to find latent components

v Measuring degree of relation


v Cosine of latent vectors

Recap: Mapping to Latent Space via SVD

$\mathbf{C} \approx \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^\top$, where $\mathbf{C}$ is $d \times n$, $\mathbf{U}$ is $d \times k$, $\boldsymbol{\Sigma}$ is $k \times k$, and $\mathbf{V}^\top$ is $k \times n$
v SVD generalizes the original data
v Uncovers relationships not explicit in the thesaurus
v Term vectors projected to 𝑘-dim latent space
v Word similarity:
cosine of two column vectors in $\boldsymbol{\Sigma}\mathbf{V}^\top$
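To make the recipe concrete, a minimal numpy sketch is shown below; the toy co-occurrence matrix, the word list, and the choice k = 2 are made up for illustration.

```python
# A minimal numpy sketch of LSA word similarity via truncated SVD.
import numpy as np

words = ["sunny", "rainy", "car", "cab"]        # hypothetical vocabulary (columns of C)
C = np.array([[3., 2., 0., 0.],                 # rows: contexts / documents
              [2., 3., 0., 1.],
              [0., 0., 4., 3.],
              [0., 1., 3., 4.]])

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
word_vecs = np.diag(s[:k]) @ Vt[:k, :]          # k x n: each column is a word in the latent space

def cosine(i, j):
    a, b = word_vecs[:, i], word_vecs[:, j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(0, 1))   # sunny vs. rainy: high
print(cosine(0, 2))   # sunny vs. car: much lower
```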

Low rank approximation

v Frobenius norm. $C$ is an $m \times n$ matrix:

   $\|C\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |c_{ij}|^2}$

v Rank of a matrix:
   v The maximum number of linearly independent rows (or columns) of the matrix
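A quick numerical check of both definitions, assuming nothing beyond numpy and an arbitrary example matrix:

```python
import numpy as np

C = np.array([[1., 2., 3.],
              [2., 4., 6.],     # 2x the first row, so it adds nothing to the rank
              [0., 1., 1.]])

frob = np.sqrt((np.abs(C) ** 2).sum())     # sqrt of the sum of squared entries
print(frob, np.linalg.norm(C, 'fro'))      # both print the same value
print(np.linalg.matrix_rank(C))            # 2: only two linearly independent rows
```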

Low rank approximation

v Low rank approximation problem:


   $\min_{X} \|C - X\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = k$
v If I can only use k independent vectors to describe
the points in the space, what are the best choices?
Essentially, we minimize the “reconstruction loss” under a low rank constraint

Low rank approximation
v Assume the rank of $C$ is $r$
v SVD: $C = U \Sigma V^\top$, with $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r, 0, \ldots, 0)$ ($r$ non-zero singular values on the diagonal)

v Zero out the $r - k$ trailing singular values:

   $\Sigma' = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k, 0, \ldots, 0)$

v $C_k = U \Sigma' V^\top$ is the best rank-$k$ approximation:

   $C_k = \arg\min_{X} \|C - X\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = k$
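A minimal numpy sketch of this construction; the matrix C is just a random example:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 5))
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)
s_k = s.copy()
s_k[k:] = 0.0                             # zero out the r - k trailing singular values
C_k = U @ np.diag(s_k) @ Vt               # best rank-k approximation in Frobenius norm

print(np.linalg.matrix_rank(C_k))         # k
print(np.linalg.norm(C - C_k, 'fro'))     # equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
print(np.sqrt((s[k:] ** 2).sum()))
```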

Word2Vec

v LSA: a compact representation of the co-occurrence matrix
v Word2Vec: predict surrounding words (skip-gram)
   v Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
   v Easy to incorporate new words or sentences

Word2Vec
v Similar to a language model, but predicting the next word is not the goal
v Idea: words that are semantically similar often occur near each other in text
   v Embeddings that are good at predicting neighboring words are also good at representing similarity

Skip-gram vs. Continuous Bag-of-Words

v What are the differences?

Objective of Word2Vec (Skip-gram)

v Maximize the log likelihood of the context words
   $w_{t-m}, w_{t-m+1}, \ldots, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+m}$
   given the center word $w_t$

v $m$ is usually 5~10
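Written out in the standard skip-gram form, averaged over a corpus of $T$ tokens, the training objective that later slides minimize as $J(\theta)$ is:

```latex
J(\theta) \;=\; -\frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p\bigl(w_{t+j} \mid w_t\bigr)
```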

Objective of Word2Vec (Skip-gram)

v How to model $\log P(w_{t+j} \mid w_t)$?

   $p(w_{t+j} \mid w_t) = \dfrac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$

v The softmax function, again!
v Every word has 2 vectors
   v $v_w$: when $w$ is the center word
   v $u_w$: when $w$ is the outside word (context word)
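A minimal numpy sketch of this softmax; the vocabulary size, embedding dimension, and random vectors are made up for illustration:

```python
import numpy as np

V, d = 1000, 50
rng = np.random.default_rng(0)
U  = rng.normal(scale=0.1, size=(V, d))   # u_w: outside (context) vectors, one row per word
Vc = rng.normal(scale=0.1, size=(V, d))   # v_w: center-word vectors, one row per word

def p_context_given_center(o, c):
    """p(w_o | w_c) = exp(u_o . v_c) / sum_{w'} exp(u_{w'} . v_c)"""
    scores = U @ Vc[c]                    # dot product of v_c with every outside vector
    scores -= scores.max()                # subtract the max for numerical stability
    e = np.exp(scores)
    return e[o] / e.sum()

print(p_context_given_center(o=42, c=7))  # roughly 1/V for random vectors
```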

How to update?

   $p(w_{t+j} \mid w_t) = \dfrac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$

v How to minimize $J(\theta)$?


v Gradient descent!
v How to compute the gradient?

Recap: Calculus

v Gradient: for $\boldsymbol{x}^\top = (x_1 \; x_2 \; x_3)$,

   $\nabla \phi(\boldsymbol{x}) = \begin{pmatrix} \partial \phi(\boldsymbol{x})/\partial x_1 \\ \partial \phi(\boldsymbol{x})/\partial x_2 \\ \partial \phi(\boldsymbol{x})/\partial x_3 \end{pmatrix}$

v If $\phi(\boldsymbol{x}) = \boldsymbol{a} \cdot \boldsymbol{x}$ (also written $\boldsymbol{a}^\top \boldsymbol{x}$), then $\nabla \phi(\boldsymbol{x}) = \boldsymbol{a}$
Recap: Calculus

v If $y = f(u)$ and $u = g(x)$ (i.e., $y = f(g(x))$), then

   $\dfrac{dy}{dx} = \dfrac{df(u)}{du}\,\dfrac{dg(x)}{dx} = \dfrac{dy}{du}\,\dfrac{du}{dx}$

Exercises:
1. $y = (x^4 + 6)^3$   2. $y = \ln(x^2 + 5)$   3. $y = \exp(x^2 + 3x + 2)$
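As a worked instance of the chain rule, using exercise 2 above in its reconstructed form $y = \ln(x^2 + 5)$:

```latex
y = \ln(u), \quad u = x^2 + 5
\quad\Longrightarrow\quad
\frac{dy}{dx} \;=\; \frac{dy}{du}\,\frac{du}{dx} \;=\; \frac{1}{u}\cdot 2x \;=\; \frac{2x}{x^2 + 5}
```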

Other useful formulas

v $y = \exp(x) \;\Rightarrow\; \dfrac{dy}{dx} = \exp(x)$
v $y = \log x \;\Rightarrow\; \dfrac{dy}{dx} = \dfrac{1}{x}$

When I say log (in this course), usually I mean ln

Example

v Assume the vocabulary set is $W$. We have one center word $c$ and one context word $o$.
v What is the conditional probability $p(o \mid c)$?

   $p(o \mid c) = \dfrac{\exp(u_o \cdot v_c)}{\sum_{w'} \exp(u_{w'} \cdot v_c)}$

v What is the gradient of the log likelihood w.r.t. $v_c$?

   $\dfrac{\partial \log p(o \mid c)}{\partial v_c} = u_o - E_{w \sim p(w \mid c)}[u_w]$
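A small numerical sanity check of this identity, a sketch using random vectors and a finite-difference comparison (dimensions, indices, and the step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 20, 8
U   = rng.normal(size=(V, d))      # outside vectors u_w, one row per word
v_c = rng.normal(size=d)           # center vector v_c
o   = 3                            # index of the observed context word

def log_p(o, v_c):
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - E_{w ~ p(w|c)}[u_w]
probs = np.exp(U @ v_c)
probs /= probs.sum()
grad_analytic = U[o] - probs @ U

# Finite-difference gradient for comparison
eps = 1e-6
grad_fd = np.array([
    (log_p(o, v_c + eps * np.eye(d)[i]) - log_p(o, v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # True
```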

Gradient Descent

   $\min_{w} J(w)$

   Update $w$: $w \leftarrow w - \eta \nabla J(w)$

Local minimum vs. global minimum

Stochastic gradient descent
v Let $J(w) = \frac{1}{n} \sum_{i=1}^{n} J_i(w)$
v Gradient descent update rule:
   $w \leftarrow w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla J_i(w)$
v Stochastic gradient descent:
   v Approximate $\frac{1}{n} \sum_{i=1}^{n} \nabla J_i(w)$ by the gradient at a single example, $\nabla J_i(w)$ (why?)
   v At each step: randomly pick an example $i$ and update
      $w \leftarrow w - \eta \nabla J_i(w)$
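A minimal sketch contrasting the two update rules on a toy least-squares problem; the data, the per-example loss J_i, the learning rate, and the iteration counts are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
eta = 0.01

def grad_Ji(w, i):
    # gradient of J_i(w) = 0.5 * (x_i . w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

# Batch gradient descent: average the gradient over all n examples per step
w = np.zeros(d)
for _ in range(1000):
    w -= (eta / n) * sum(grad_Ji(w, i) for i in range(n))

# Stochastic gradient descent: one randomly picked example per step
w_sgd = np.zeros(d)
for _ in range(200 * n):
    i = rng.integers(n)
    w_sgd -= eta * grad_Ji(w_sgd, i)

print(np.linalg.norm(w - w_sgd))   # the two estimates end up close (exact agreement is not expected)
```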

Negative sampling

v With a large vocabulary set, stochastic gradient descent is still not enough (why?)

   $\dfrac{\partial \log p(o \mid c)}{\partial v_c} = u_o - E_{w \sim p(w \mid c)}[u_w]$

v Let's approximate it again!
   v Only sample a few words that do not appear in the context
   v Essentially, put more weight on the positive samples
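A sketch of the resulting skip-gram negative-sampling update for a single (center, context) pair; the number of negatives, the uniform noise distribution, and the learning rate below are illustrative choices, not the lecture's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, K, eta = 1000, 50, 5, 0.05
U  = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))   # center vectors v_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(c, o):
    """One SGD step on: log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_neg_k . v_c)."""
    negs = rng.integers(V, size=K)        # K sampled "negative" words (uniform noise here)
    v_c = Vc[c]
    # positive pair: push u_o and v_c together
    g = 1.0 - sigmoid(U[o] @ v_c)
    grad_vc = g * U[o]
    U[o] += eta * g * v_c
    # negative pairs: push u_neg and v_c apart
    for k in negs:
        g = -sigmoid(U[k] @ v_c)
        grad_vc += g * U[k]
        U[k] += eta * g * v_c
    Vc[c] += eta * grad_vc

sgns_step(c=7, o=42)
```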

More about Word2Vec – relation to LSA

v LSA factorizes a matrix of co-occurrence counts
v Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

   $\text{PMI}(w, c) = \log \dfrac{P(w \mid c)}{P(w)} = \log \dfrac{P(w, c)}{P(w)P(c)} = \log \dfrac{\#(w, c) \cdot |D|}{\#(w)\,\#(c)}$
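A sketch of building the (shifted) PMI matrix from raw counts; the count matrix and the shift $\log k$ are illustrative:

```python
import numpy as np

# counts[w, c] = number of times word w was seen with context c (made-up numbers)
counts = np.array([[10., 0., 2.],
                   [ 0., 8., 1.],
                   [ 3., 1., 5.]])
D = counts.sum()                                  # |D|: total number of (w, c) pairs
Pw  = counts.sum(axis=1, keepdims=True) / D       # P(w)
Pc  = counts.sum(axis=0, keepdims=True) / D       # P(c)
Pwc = counts / D                                  # P(w, c)

with np.errstate(divide='ignore'):
    pmi = np.log(Pwc / (Pw * Pc))                 # log( #(w,c) |D| / (#(w) #(c)) )

k = 5                                             # number of negative samples
shifted_pmi = pmi - np.log(k)                     # the matrix skip-gram implicitly factorizes
print(np.maximum(shifted_pmi, 0))                 # positive-PMI variant commonly used in practice
```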

All problems solved?

Continuous Semantic Representations

[Figure: words embedded in a 2-D semantic space, with clusters such as {sunny, rainy, cloudy, windy}, {car, cab, wheel}, and {emotion, feeling, joy, sad}]

Semantics Needs More Than Similarity

Tomorrow will be rainy.
Tomorrow will be sunny.

$\mathit{similar}(\text{rainy}, \text{sunny})$?
$\mathit{antonym}(\text{rainy}, \text{sunny})$?

Polarity Inducing LSA [Yih, Zweig, Platt 2012]

v Data representation
v Encode two opposite relations in a matrix using
“polarity”
v Synonyms & antonyms (e.g., from a thesaurus)

v Factorization
v Apply SVD to the matrix to find latent components

v Measuring degree of relation


v Cosine of latent vectors

Encode Synonyms & Antonyms in Matrix

v Joyfulness: joy, gladden; sorrow, sadden


v Sad: sorrow, sadden; joy, gladden

Encode each target word (thesaurus group) as a row vector; antonyms get opposite, "polarity-induced" signs:

                          joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"      1       1        -1       -1        0
Group 2: "sad"            -1      -1         1        1        0
Group 3: "affection"       0       0         0        0        1

Cosine score: positive for synonyms, negative for antonyms

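A toy check of why the polarity trick flips the sign of the cosine, computed directly in the raw matrix space (before the SVD step); the matrix is the small example above:

```python
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1,  1, -1, -1, 0],    # group "joyfulness"
              [-1, -1,  1,  1, 0],    # group "sad"
              [ 0,  0,  0,  0, 1]],   # group "affection"
             dtype=float)

def cos(a, b):
    i, j = words.index(a), words.index(b)
    u, v = M[:, i], M[:, j]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos("joy", "gladden"))   # +1.0: synonyms
print(cos("joy", "sadden"))    # -1.0: antonyms
```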
Continuous representations for entities

[Figure: entities embedded in a 2-D space, e.g., Democratic Party, Republican Party, George W. Bush, Laura Bush, Michelle Obama]

Continuous representations for entities

• Useful resources for NLP applications


• Semantic Parsing & Question Answering
• Information Extraction

