
Lecture 7: Word Embeddings

Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16

This lecture
v Learning word vectors (Cont.)

v Representation learning in NLP

Recap: Latent Semantic Analysis

v Data representation
v Encode single-relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Synonyms (e.g., from a thesaurus)

v Factorization
v Apply SVD to the matrix to find latent components

v Measuring degree of relation


v Cosine of latent vectors

Recap: Mapping to Latent Space via SVD

$\mathbf{C} \approx \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^\top$, where $\mathbf{C}$ is $d \times n$, $\mathbf{U}$ is $d \times k$, $\boldsymbol{\Sigma}$ is $k \times k$, and $\mathbf{V}^\top$ is $k \times n$
v SVD generalizes the original data
v Uncovers relationships not explicit in the thesaurus
v Term vectors projected to 𝑘-dim latent space
v Word similarity:
cosine of two column vectors in $\boldsymbol{\Sigma}\mathbf{V}^\top$
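To make the recipe concrete, a minimal numpy sketch is shown below; the toy co-occurrence matrix, the word list, and the choice k = 2 are made up for illustration.

```python
# A minimal numpy sketch of LSA word similarity via truncated SVD.
import numpy as np

words = ["sunny", "rainy", "car", "cab"]        # hypothetical vocabulary (columns of C)
C = np.array([[3., 2., 0., 0.],                 # rows: contexts / documents
              [2., 3., 0., 1.],
              [0., 0., 4., 3.],
              [0., 1., 3., 4.]])

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
word_vecs = np.diag(s[:k]) @ Vt[:k, :]          # k x n: each column is a word in the latent space

def cosine(i, j):
    a, b = word_vecs[:, i], word_vecs[:, j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(0, 1))   # sunny vs. rainy: high
print(cosine(0, 2))   # sunny vs. car: much lower
```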

Low rank approximation

v Frobenius norm. $C$ is an $m \times n$ matrix:

   $\|C\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |c_{ij}|^2}$

v Rank of a matrix:
   v The maximum number of linearly independent rows (or columns) of the matrix
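A quick numerical check of both definitions, assuming nothing beyond numpy and an arbitrary example matrix:

```python
import numpy as np

C = np.array([[1., 2., 3.],
              [2., 4., 6.],     # 2x the first row, so it adds nothing to the rank
              [0., 1., 1.]])

frob = np.sqrt((np.abs(C) ** 2).sum())     # sqrt of the sum of squared entries
print(frob, np.linalg.norm(C, 'fro'))      # both print the same value
print(np.linalg.matrix_rank(C))            # 2: only two linearly independent rows
```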

Low rank approximation

v Low rank approximation problem:


   $\min_{X} \|C - X\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = k$
v If I can only use k independent vectors to describe
the points in the space, what are the best choices?
Essentially, we minimize the “reconstruction loss” under a low rank constraint

Low rank approximation
v Assume the rank of $C$ is $r$
v SVD: $C = U \Sigma V^\top$, with $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r, 0, \ldots, 0)$ ($r$ non-zero singular values on the diagonal)

v Zero out the $r - k$ trailing singular values:

   $\Sigma' = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k, 0, \ldots, 0)$

v $C_k = U \Sigma' V^\top$ is the best rank-$k$ approximation:

   $C_k = \arg\min_{X} \|C - X\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = k$
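A minimal numpy sketch of this construction; the matrix C is just a random example:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 5))
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)
s_k = s.copy()
s_k[k:] = 0.0                             # zero out the r - k trailing singular values
C_k = U @ np.diag(s_k) @ Vt               # best rank-k approximation in Frobenius norm

print(np.linalg.matrix_rank(C_k))         # k
print(np.linalg.norm(C - C_k, 'fro'))     # equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
print(np.sqrt((s[k:] ** 2).sum()))
```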

Word2Vec

v LSA: a compact representation of the co-occurrence matrix
v Word2Vec: predict surrounding words (skip-gram)
   v Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
   v Easy to incorporate new words or sentences

Word2Vec
v Similar to a language model, but predicting the next word is not the goal
v Idea: words that are semantically similar often occur near each other in text
   v Embeddings that are good at predicting neighboring words are also good at representing similarity

Skip-gram vs. Continuous Bag-of-Words

v What are the differences?

Objective of Word2Vec (Skip-gram)

v Maximize the log likelihood of the context words
   $w_{t-m}, w_{t-m+1}, \ldots, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+m}$
   given the center word $w_t$

v $m$ is usually 5~10
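Written out in the standard skip-gram form, averaged over a corpus of $T$ tokens, the training objective that later slides minimize as $J(\theta)$ is:

```latex
J(\theta) \;=\; -\frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p\bigl(w_{t+j} \mid w_t\bigr)
```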

Objective of Word2Vec (Skip-gram)

v How to model $\log P(w_{t+j} \mid w_t)$?

   $p(w_{t+j} \mid w_t) = \dfrac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$

v The softmax function, again!
v Every word has 2 vectors
   v $v_w$: when $w$ is the center word
   v $u_w$: when $w$ is the outside word (context word)
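A minimal numpy sketch of this softmax; the vocabulary size, embedding dimension, and random vectors are made up for illustration:

```python
import numpy as np

V, d = 1000, 50
rng = np.random.default_rng(0)
U  = rng.normal(scale=0.1, size=(V, d))   # u_w: outside (context) vectors, one row per word
Vc = rng.normal(scale=0.1, size=(V, d))   # v_w: center-word vectors, one row per word

def p_context_given_center(o, c):
    """p(w_o | w_c) = exp(u_o . v_c) / sum_{w'} exp(u_{w'} . v_c)"""
    scores = U @ Vc[c]                    # dot product of v_c with every outside vector
    scores -= scores.max()                # subtract the max for numerical stability
    e = np.exp(scores)
    return e[o] / e.sum()

print(p_context_given_center(o=42, c=7))  # roughly 1/V for random vectors
```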

How to update?

   $p(w_{t+j} \mid w_t) = \dfrac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$

v How to minimize $J(\theta)$?


v Gradient descent!
v How to compute the gradient?

Recap: Calculus

v Gradient: for $\boldsymbol{x}^\top = (x_1 \; x_2 \; x_3)$,

   $\nabla \phi(\boldsymbol{x}) = \begin{pmatrix} \partial \phi(\boldsymbol{x})/\partial x_1 \\ \partial \phi(\boldsymbol{x})/\partial x_2 \\ \partial \phi(\boldsymbol{x})/\partial x_3 \end{pmatrix}$

v If $\phi(\boldsymbol{x}) = \boldsymbol{a} \cdot \boldsymbol{x}$ (also written $\boldsymbol{a}^\top \boldsymbol{x}$), then $\nabla \phi(\boldsymbol{x}) = \boldsymbol{a}$
Recap: Calculus

v If $y = f(u)$ and $u = g(x)$ (i.e., $y = f(g(x))$), then

   $\dfrac{dy}{dx} = \dfrac{df(u)}{du}\,\dfrac{dg(x)}{dx} = \dfrac{dy}{du}\,\dfrac{du}{dx}$

Exercises:
1. $y = (x^4 + 6)^3$   2. $y = \ln(x^2 + 5)$   3. $y = \exp(x^2 + 3x + 2)$
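As a worked instance of the chain rule, using exercise 2 above in its reconstructed form $y = \ln(x^2 + 5)$:

```latex
y = \ln(u), \quad u = x^2 + 5
\quad\Longrightarrow\quad
\frac{dy}{dx} \;=\; \frac{dy}{du}\,\frac{du}{dx} \;=\; \frac{1}{u}\cdot 2x \;=\; \frac{2x}{x^2 + 5}
```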

Other useful formulas

v $y = \exp(x) \;\Rightarrow\; \dfrac{dy}{dx} = \exp(x)$
v $y = \log x \;\Rightarrow\; \dfrac{dy}{dx} = \dfrac{1}{x}$

When I say log (in this course), usually I mean ln

Example

v Assume the vocabulary set is $W$. We have one center word $c$ and one context word $o$.
v What is the conditional probability $p(o \mid c)$?

   $p(o \mid c) = \dfrac{\exp(u_o \cdot v_c)}{\sum_{w'} \exp(u_{w'} \cdot v_c)}$

v What is the gradient of the log likelihood w.r.t. $v_c$?

   $\dfrac{\partial \log p(o \mid c)}{\partial v_c} = u_o - E_{w \sim p(w \mid c)}[u_w]$
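A small numerical sanity check of this identity, a sketch using random vectors and a finite-difference comparison (dimensions, indices, and the step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 20, 8
U   = rng.normal(size=(V, d))      # outside vectors u_w, one row per word
v_c = rng.normal(size=d)           # center vector v_c
o   = 3                            # index of the observed context word

def log_p(o, v_c):
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - E_{w ~ p(w|c)}[u_w]
probs = np.exp(U @ v_c)
probs /= probs.sum()
grad_analytic = U[o] - probs @ U

# Finite-difference gradient for comparison
eps = 1e-6
grad_fd = np.array([
    (log_p(o, v_c + eps * np.eye(d)[i]) - log_p(o, v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # True
```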

Gradient Descent

   $\min_{w} J(w)$

   Update $w$: $w \leftarrow w - \eta \nabla J(w)$

Local minimum vs. global minimum

Stochastic gradient descent
v Let $J(w) = \frac{1}{n} \sum_{i=1}^{n} J_i(w)$
v Gradient descent update rule:
   $w \leftarrow w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla J_i(w)$
v Stochastic gradient descent:
   v Approximate $\frac{1}{n} \sum_{i=1}^{n} \nabla J_i(w)$ by the gradient at a single example, $\nabla J_i(w)$ (why?)
   v At each step: randomly pick an example $i$ and update
      $w \leftarrow w - \eta \nabla J_i(w)$
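A minimal sketch contrasting the two update rules on a toy least-squares problem; the data, the per-example loss J_i, the learning rate, and the iteration counts are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
eta = 0.01

def grad_Ji(w, i):
    # gradient of J_i(w) = 0.5 * (x_i . w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

# Batch gradient descent: average the gradient over all n examples per step
w = np.zeros(d)
for _ in range(1000):
    w -= (eta / n) * sum(grad_Ji(w, i) for i in range(n))

# Stochastic gradient descent: one randomly picked example per step
w_sgd = np.zeros(d)
for _ in range(200 * n):
    i = rng.integers(n)
    w_sgd -= eta * grad_Ji(w_sgd, i)

print(np.linalg.norm(w - w_sgd))   # the two estimates end up close (exact agreement is not expected)
```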

Negative sampling

v With a large vocabulary set, stochastic gradient descent is still not enough (why?)

   $\dfrac{\partial \log p(o \mid c)}{\partial v_c} = u_o - E_{w \sim p(w \mid c)}[u_w]$

v Let's approximate it again!
   v Only sample a few words that do not appear in the context
   v Essentially, put more weight on the positive samples
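A sketch of the resulting skip-gram negative-sampling update for a single (center, context) pair; the number of negatives, the uniform noise distribution, and the learning rate below are illustrative choices, not the lecture's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, K, eta = 1000, 50, 5, 0.05
U  = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))   # center vectors v_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(c, o):
    """One SGD step on: log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_neg_k . v_c)."""
    negs = rng.integers(V, size=K)        # K sampled "negative" words (uniform noise here)
    v_c = Vc[c]
    # positive pair: push u_o and v_c together
    g = 1.0 - sigmoid(U[o] @ v_c)
    grad_vc = g * U[o]
    U[o] += eta * g * v_c
    # negative pairs: push u_neg and v_c apart
    for k in negs:
        g = -sigmoid(U[k] @ v_c)
        grad_vc += g * U[k]
        U[k] += eta * g * v_c
    Vc[c] += eta * grad_vc

sgns_step(c=7, o=42)
```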

More about Word2Vec – relation to LSA

v LSA factorizes a matrix of co-occurrence counts
v Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

   $\text{PMI}(w, c) = \log \dfrac{P(w \mid c)}{P(w)} = \log \dfrac{P(w, c)}{P(w)P(c)} = \log \dfrac{\#(w, c) \cdot |D|}{\#(w)\,\#(c)}$
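A sketch of building the (shifted) PMI matrix from raw counts; the count matrix and the shift $\log k$ are illustrative:

```python
import numpy as np

# counts[w, c] = number of times word w was seen with context c (made-up numbers)
counts = np.array([[10., 0., 2.],
                   [ 0., 8., 1.],
                   [ 3., 1., 5.]])
D = counts.sum()                                  # |D|: total number of (w, c) pairs
Pw  = counts.sum(axis=1, keepdims=True) / D       # P(w)
Pc  = counts.sum(axis=0, keepdims=True) / D       # P(c)
Pwc = counts / D                                  # P(w, c)

with np.errstate(divide='ignore'):
    pmi = np.log(Pwc / (Pw * Pc))                 # log( #(w,c) |D| / (#(w) #(c)) )

k = 5                                             # number of negative samples
shifted_pmi = pmi - np.log(k)                     # the matrix skip-gram implicitly factorizes
print(np.maximum(shifted_pmi, 0))                 # positive-PMI variant commonly used in practice
```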

All problems solved?

Continuous Semantic Representations

[Figure: words embedded in a 2-D semantic space, with clusters such as {sunny, rainy, cloudy, windy}, {car, cab, wheel}, and {emotion, feeling, joy, sad}]

Semantics Needs More Than Similarity

Tomorrow will be rainy.
Tomorrow will be sunny.

$\mathit{similar}(\text{rainy}, \text{sunny})$?
$\mathit{antonym}(\text{rainy}, \text{sunny})$?

Polarity Inducing LSA [Yih, Zweig, Platt 2012]

v Data representation
v Encode two opposite relations in a matrix using
“polarity”
v Synonyms & antonyms (e.g., from a thesaurus)

v Factorization
v Apply SVD to the matrix to find latent components

v Measuring degree of relation


v Cosine of latent vectors

Encode Synonyms & Antonyms in Matrix

v Joyfulness: joy, gladden; sorrow, sadden


v Sad: sorrow, sadden; joy, gladden

Encode each target word (thesaurus group) as a row vector; antonyms get opposite, "polarity-induced" signs:

                          joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"      1       1        -1       -1        0
Group 2: "sad"            -1      -1         1        1        0
Group 3: "affection"       0       0         0        0        1

Cosine score: positive for synonyms, negative for antonyms

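A toy check of why the polarity trick flips the sign of the cosine, computed directly in the raw matrix space (before the SVD step); the matrix is the small example above:

```python
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1,  1, -1, -1, 0],    # group "joyfulness"
              [-1, -1,  1,  1, 0],    # group "sad"
              [ 0,  0,  0,  0, 1]],   # group "affection"
             dtype=float)

def cos(a, b):
    i, j = words.index(a), words.index(b)
    u, v = M[:, i], M[:, j]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos("joy", "gladden"))   # +1.0: synonyms
print(cos("joy", "sadden"))    # -1.0: antonyms
```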
Continuous representations for entities

[Figure: entities embedded in a 2-D space, e.g., Democratic Party, Republican Party, George W. Bush, Laura Bush, Michelle Obama]

Continuous representations for entities

• Useful resources for NLP applications


• Semantic Parsing & Question Answering
• Information Extraction

