Dan Jurafsky and James Martin Speech and Language Processing
Chapter 6:
Vector Semantics, Part II
Tf-idf and PPMI are
sparse representations
Sparse versus dense vectors
Why dense vectors?
◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than storing explicit counts
◦ They may do better at capturing synonymy:
◦ car and automobile are synonyms, but they are distinct dimensions
◦ a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, they work better
Dense embeddings you can
download!
9/7/18
Skip-Gram Training Data
Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
             c1         c2 t       c3  c4
Skip-Gram Goal
Given a tuple (t,c) = target, context
◦ (apricot, jam)
◦ (apricot, aardvark)
Return probability that c is a real context word:
P(+|t,c)
P(−|t,c) = 1−P(+|t,c)
How to compute p(+|t,c)?
Intuition:
◦ Words are likely to appear near similar words
◦ Model similarity with dot-product!
◦ Similarity(t,c) ∝ t · c
Problem:
◦ Dot product is not a probability!
◦ (Neither is cosine)
Turning dot product into a probability
To turn the dot product into a probability, we'll use the sigmoid (logistic) function, the core of logistic regression. The sigmoid lies between 0 and 1:

σ(x) = 1 / (1 + e^(−x))

The probability that word c is a real context word for target word t:

P(+|t, c) = 1 / (1 + e^(−t·c))
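As a minimal sketch (using NumPy; the vectors and names here are illustrative toy values, not anything from a trained model), the probability is just the sigmoid of a dot product:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(t, c):
    # P(+|t,c) = sigma(t . c): a higher dot product means a higher
    # probability that c is a real context word for target t
    return sigmoid(np.dot(t, c))

t = np.array([0.5, -0.2, 0.8])   # toy target embedding ("apricot")
c = np.array([0.4,  0.1, 0.7])   # toy context embedding ("jam")
print(p_positive(t, c))          # a value strictly between 0 and 1
```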
The sigmoid function just returns a number between 0 and 1, so we need to make sure that the total probability of the two possible events (c being a context word, and c not being a context word) sums to 1. The probability that word c is not a real context word for t is thus:

P(−|t, c) = 1 − P(+|t, c) = e^(−t·c) / (1 + e^(−t·c))
For all the context words:
Assume all context words are independent

The equation above gives us the probability for one word, but we need to take account of the multiple context words in the window. Skip-gram makes the simplifying assumption that all context words are independent, so we multiply their probabilities:

P(+|t, c_1:k) = ∏_{i=1}^{k} 1 / (1 + e^(−t·c_i))

log P(+|t, c_1:k) = ∑_{i=1}^{k} log [ 1 / (1 + e^(−t·c_i)) ]
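Under the independence assumption, the window probability is a product of per-word sigmoids, or equivalently a sum in log space. A small sketch with made-up random vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_p_window(t, contexts):
    # log P(+|t, c_1:k) = sum_i log sigma(t . c_i)
    # (assumes the k context words are independent given t)
    return sum(np.log(sigmoid(np.dot(t, c))) for c in contexts)

rng = np.random.default_rng(0)
t = rng.normal(size=5)                             # toy target vector
contexts = [rng.normal(size=5) for _ in range(4)]  # toy c_1..c_4

# The log of the product equals the sum of the logs:
prod = np.prod([sigmoid(np.dot(t, c)) for c in contexts])
assert np.isclose(np.log(prod), log_p_window(t, contexts))
```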
Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.
Skip-Gram Training
Let's start by considering a single piece of the training data, from the sentence above:

Training sentence:
... lemon, a [tablespoon of apricot jam, a] pinch ...
              c1         c2 t       c3   c4

This example has a target word t (apricot) and 4 context words in the window, resulting in 4 positive training instances (on the left below). For training a binary classifier we also need negative examples: for each positive example, we'll create k negative examples, using noise words (any random word that isn't t). Here k = 2:

positive examples +      negative examples −
t        c               t        c           t        c
apricot  tablespoon      apricot  aardvark    apricot  twelve
apricot  of              apricot  puddle      apricot  hello
apricot  preserves       apricot  where       apricot  dear
apricot  or              apricot  coaxial     apricot  forever
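Negative example generation can be sketched as follows (the vocabulary, k, and function name are toy choices for illustration; real word2vec samples noise words from a smoothed unigram distribution, P(w)^0.75, rather than uniformly):

```python
import random

def negative_samples(target, positives, vocab, k, rng):
    # For each positive (target, context) pair, draw k noise words:
    # any random vocabulary word that isn't the target itself.
    pairs = []
    for _c in positives:
        for _ in range(k):
            noise = rng.choice(vocab)
            while noise == target:       # "any random word that isn't t"
                noise = rng.choice(vocab)
            pairs.append((target, noise))
    return pairs

rng = random.Random(0)
vocab = ["aardvark", "puddle", "where", "coaxial", "twelve", "hello", "apricot"]
positives = ["tablespoon", "of", "preserves", "or"]
negs = negative_samples("apricot", positives, vocab, k=2, rng=rng)
print(len(negs))   # 4 positives x k=2 -> 8 negative pairs
```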
Learning the classifier
Iterative process.
We’ll start with 0 or random weights
Then adjust the word weights to
◦ make the positive pairs more likely
◦ and the negative pairs less likely
over the entire training set:
Objective Criteria
We want to maximize…

∑_{(t,c)∈+} log P(+|t, c) + ∑_{(t,c)∈−} log P(−|t, c)
L(θ) = ∑_{(t,c)∈+} log P(+|t, c) + ∑_{(t,c)∈−} log P(−|t, c)

That is, we want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with the k negative sampled non-neighbor words.
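For a single target word, the quantity being maximized can be sketched like this (vector values and names are illustrative; note that log P(−|t,c) = log σ(−t·c)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(t, pos_contexts, neg_contexts):
    # L = sum over real pairs of log P(+|t,c)
    #   + sum over noise pairs of log P(-|t,c)
    # Maximizing it pushes t toward real contexts and away from noise words.
    pos = sum(np.log(sigmoid(np.dot(t, c))) for c in pos_contexts)
    neg = sum(np.log(sigmoid(-np.dot(t, c))) for c in neg_contexts)
    return pos + neg

t        = np.array([1.0, 0.0])    # toy "apricot" vector
jam      = np.array([0.9, 0.1])    # toy real-context vector
aardvark = np.array([-0.8, 0.3])   # toy noise vector
print(objective(t, [jam], [aardvark]))  # a log-probability, so always <= 0
```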
[Figure: the target embedding matrix W (|V| rows, d columns) and the context embedding matrix C. For the training phrase "…apricot jam…", increase similarity(apricot, jam) = w_j · c_k, where jam is a neighbor word; decrease similarity(apricot, aardvark) = w_j · c_n, where aardvark is a random noise word.]
Train using gradient descent
Actually learns two separate embedding matrices W and C
Can use W and throw away C, or merge them somehow
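One stochastic-gradient step for a single (target, context, label) triple can be sketched as follows (a minimal version with illustrative names; W holds target embeddings and C context embeddings, and label is 1 for a real pair, 0 for a noise pair):

```python
import numpy as np

def sgd_step(W, C, t_idx, c_idx, label, lr=0.05):
    # Copy the rows so each update uses the pre-step values of both vectors.
    t, c = W[t_idx].copy(), C[c_idx].copy()
    # Gradient of the negative log-likelihood for this one pair:
    # d/dc = (sigma(t.c) - label) * t,  d/dt = (sigma(t.c) - label) * c
    g = 1.0 / (1.0 + np.exp(-np.dot(t, c))) - label
    W[t_idx] = t - lr * g * c
    C[c_idx] = c - lr * g * t

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 4))   # toy target embeddings
C = rng.normal(scale=0.1, size=(5, 4))   # toy context embeddings
before = float(W[0] @ C[1])
for _ in range(50):
    sgd_step(W, C, 0, 1, label=1)        # treat (0, 1) as a positive pair
after = float(W[0] @ C[1])
print(before, "->", after)               # dot product grows for a positive pair
```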
Summary: How to learn word2vec
(skip-gram) embeddings
Start with V random 300-dimensional vectors as
initial embeddings
Use logistic regression, the second most basic
classifier used in machine learning after naïve
Bayes
◦ Take a corpus and take pairs of words that co-occur as
positive examples
◦ Take pairs of words that don't co-occur as negative
examples
◦ Train the classifier to distinguish these by slowly adjusting
all the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings.
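The recipe above can be sketched end to end on a toy corpus (window size, dimensionality, k, epochs, and the function name are arbitrary toy choices; real training uses a large corpus, unigram-based noise sampling, and subsampling of frequent words):

```python
import numpy as np

def train_sgns(tokens, dim=8, window=2, k=2, epochs=50, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Two embedding matrices: W for target words, C for context words.
    W = rng.normal(scale=0.1, size=(V, dim))
    C = rng.normal(scale=0.1, size=(V, dim))

    def update(t, c, label):
        # One SGD step on -log P(label | t, c)
        g = 1.0 / (1.0 + np.exp(-W[t] @ C[c])) - label
        W[t], C[c] = W[t] - lr * g * C[c], C[c] - lr * g * W[t]

    for _ in range(epochs):
        for i, w in enumerate(tokens):
            t = idx[w]
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j == i:
                    continue
                update(t, idx[tokens[j]], 1)        # co-occurring pair: positive
                for _ in range(k):                  # k uniformly sampled noise words
                    # (for simplicity, noise may occasionally hit a real neighbor)
                    update(t, int(rng.integers(V)), 0)
    return vocab, W, C                              # keep W; C can be discarded

tokens = "lemon a tablespoon of apricot jam a pinch".split()
vocab, W, C = train_sgns(tokens)
```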
Evaluating embeddings
Compare to human scores on word
similarity-type tasks:
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset
(Huang et al., 2012)
• TOEFL dataset: Levied is closest in meaning to: imposed,
believed, requested, correlated
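Evaluation against a human-judged similarity set is typically reported as a rank (Spearman) correlation between model cosines and human scores. A self-contained sketch with made-up scores (real evaluations use the datasets above; this simple rank function ignores ties):

```python
import numpy as np

def ranks(x):
    # Rank positions of each value (no tie handling; enough for a sketch)
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    # Pearson correlation of the ranks = Spearman correlation
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Hypothetical word-pair data: human similarity judgments vs. model cosines
human = [9.2, 8.5, 6.3, 3.1, 1.0]
model = [0.81, 0.77, 0.40, 0.22, 0.05]
print(spearman(human, model))   # 1.0 here, since the rankings are identical
```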
Properties of embeddings
Similarity depends on window size C
Embeddings can help study
word history!
Train embeddings on old books to study
changes in word meaning!!
Will Hamilton
Diachronic word embeddings for
studying language change!
Visualizing changes
Project 300 dimensions down into 2
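The projection down to 2 dimensions is commonly done with PCA (or t-SNE); a minimal PCA sketch via SVD, on made-up vectors:

```python
import numpy as np

def pca_2d(X):
    # Center the data, then keep the two directions of highest variance.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T        # shape (n_words, 2)

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 300))   # 10 toy "word" vectors, 300-dimensional
print(pca_2d(emb).shape)           # (10, 2)
```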
Embeddings and bias
Embeddings reflect cultural bias
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." In Advances in Neural Information Processing Systems, pp. 4349–4357. 2016.
Change in association of Chinese names with adjectives, 1910–1990

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644.
Changes in framing:
adjectives associated with Chinese

Table 3. Top Asian (vs. White) adjectives in 1910, 1950, and 1990, by relative norm difference in the COHA embedding (Garg et al., 2018)

1910            1950            1990
Irresponsible   Disorganized    Inhibited
Envious         Outrageous      Passive
Barbaric        Pompous         Dissolute
Aggressive      Unstable        Haughty
Transparent     Effeminate      Complacent
Monstrous       Unprincipled    Forceful
Hateful         Venomous        Fixed
Cruel           Disobedient     Active
Greedy          Predatory       Sensitive
Bizarre         Boisterous      Hearty
Conclusion
Concepts or word senses
◦ Have a complex many-to-many association with words
(homonymy, multiple senses)
◦ Have relations with each other
◦ Synonymy, Antonymy, Superordinate
◦ But are hard to define formally (necessary & sufficient
conditions)
Embeddings = vector models of meaning
◦ More fine-grained than just a string or index
◦ Especially good at modeling similarity/analogy
◦ Just download them and use cosines!!
◦ Can use sparse models (tf-idf) or dense models (word2vec, GloVe)
◦ Useful in practice but know they encode cultural stereotypes