
Dan Jurafsky and James Martin

Speech and Language Processing

Chapter 6:
Vector Semantics, Part II
Tf-idf and PPMI are
sparse representations

tf-idf and PPMI vectors are


◦long (length |V|= 20,000 to 50,000)
◦sparse (most elements are zero)
Alternative: dense vectors

vectors which are


◦ short (length 50-1000)
◦ dense (most elements are non-zero)

Sparse versus dense vectors
Why dense vectors?
◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than storing explicit counts
◦ They may do better at capturing synonymy:
◦ car and automobile are synonyms, but are distinct dimensions
◦ a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, they work better
Dense embeddings you can
download!

Word2vec (Mikolov et al.)


https://code.google.com/archive/p/word2vec/
Fasttext http://www.fasttext.cc/
Glove (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/
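A minimal sketch of loading one of these pretrained sets, assuming the gensim package is installed; "glove-wiki-gigaword-100" is one of the datasets hosted by gensim's downloader, not something distributed with these slides:

```python
# Load pretrained GloVe vectors via gensim's downloader (downloads on first use).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors object

print(vectors["apricot"].shape)                 # (100,) -- a short, dense vector
print(vectors.most_similar("car", topn=3))      # nearest words by cosine similarity
```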
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
Word2vec
◦Instead of counting how often each
word w occurs near "apricot"
◦Train a classifier on a binary
prediction task:
◦Is w likely to show up near "apricot"?

◦We don’t actually care about this task


◦But we'll take the learned classifier weights
as the word embeddings
Brilliant insight: Use running text as
implicitly supervised training data!
• A word w near apricot
• Acts as gold ‘correct answer’ to the
question
• “Is word w likely to show up near apricot?”
• No need for hand-labeled supervision
• The idea comes from neural language
modeling
• Bengio et al. (2003)
• Collobert et al. (2011)
Word2Vec: Skip-Gram Task
Word2vec provides a variety of options. Let's do
◦ "skip-gram with negative sampling" (SGNS)
Skip-gram algorithm
1. Treat the target word and a neighboring
context word as positive examples.
2. Randomly sample other words in the
lexicon to get negative samples
3. Use logistic regression to train a classifier
to distinguish those two cases
4. Use the weights as the embeddings
Skip-Gram Training Data
Training sentence:
... lemon, a [tablespoon of apricot jam a] pinch ...
                  c1      c2    t    c3  c4

Assume context words are those in a +/- 2 word window
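A short sketch of how the positive (target, context) pairs fall out of a +/- 2 window; the sentence is the slide's example, tokenized naively:

```python
# Extract (target, context) pairs from one sentence with a +/- 2 word window.
sentence = "lemon a tablespoon of apricot jam a pinch".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# Positive examples whose target is "apricot":
print([p for p in pairs if p[0] == "apricot"])
# [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
```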
Skip-Gram Goal
Given a tuple (t,c) = target, context
◦ (apricot, jam)
◦ (apricot, aardvark)
Return probability that c is a real context word:

P(+|t,c)
P(−|t,c) = 1−P(+|t,c)
How to compute p(+|t,c)?
Intuition:
◦ Words are likely to appear near similar words
◦ Model similarity with dot-product!
◦ Similarity(t,c) ∝ t · c
Problem:
◦Dot product is not a probability!
◦ (Neither is cosine)
Turning dot product into a probability

The sigmoid, the core of logistic regression, lies between 0 and 1:

σ(x) = 1 / (1 + e^(−x))

The probability that word c is a real context word for target word t:

P(+|t, c) = 1 / (1 + e^(−t·c))        (6.19)

The sigmoid just returns a number between 0 and 1, so to make this a probability we need the two cases (c being a context word, and c not being a context word) to sum to 1. The probability that word c is not a real context word for t is thus:

P(−|t, c) = 1 − P(+|t, c) = e^(−t·c) / (1 + e^(−t·c))
For all the context words:
◦ Assume all context words are independent

Equation 6.19 gives us the probability for one word, but we need to take account of the multiple context words in the window. Skip-gram makes the simplifying assumption that all context words are independent, so we can just multiply their probabilities:

P(+|t, c_1:k) = Π_(i=1..k) 1 / (1 + e^(−t·c_i))

log P(+|t, c_1:k) = Σ_(i=1..k) log [1 / (1 + e^(−t·c_i))]

In summary, skip-gram trains a probabilistic classifier that, given a target word t and its context window of k words c_1:k, assigns a probability based on how similar the context window is to the target word.
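A minimal numeric sketch of these formulas; the vectors here are random stand-ins rather than trained embeddings:

```python
# Turn dot products into probabilities with the sigmoid, then combine the
# (assumed independent) context words by multiplying, or summing in log space.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
t = rng.normal(size=300)                 # target embedding, e.g. "apricot"
contexts = rng.normal(size=(4, 300))     # embeddings of c1..c4

p_pos = sigmoid(contexts @ t)            # P(+|t, c_i) for each context word
p_window = np.prod(p_pos)                # P(+|t, c_1:k) under independence
log_p_window = np.sum(np.log(p_pos))     # equivalent log form

print(p_pos, p_window, log_p_window)
```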


Skip-Gram Training Data
Training sentence:
... lemon, a [tablespoon of apricot jam a] pinch ...
                  c1      c2    t    c3  c4

Training data: input/output pairs centering on apricot
Assume a +/- 2 word window
Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.
Skip-Gram Training
Let's start by considering a single piece of the training data, from the sentence above:

... lemon, a [tablespoon of apricot jam a] pinch ...
                  c1      c2    t    c3  c4

This example has a target word t (apricot) and 4 context words in the window, resulting in 4 positive training instances:

positive examples +
t        c
apricot  tablespoon
apricot  of
apricot  preserves
apricot  or

For training a binary classifier we also need negative examples:
• For each positive example, we'll create k negative examples
• Using noise words
• Any random word that isn't t
Skip-Gram Training
... lemon, a [tablespoon of apricot jam a] pinch ...
                  c1      c2    t    c3  c4

This example has a target word t (apricot) and 4 context words in the L = +/- 2 window, resulting in 4 positive training instances (on the left below). With k = 2, each positive example gets 2 negative examples (on the right):

positive examples +       negative examples -
t        c                t        c          t        c
apricot  tablespoon       apricot  aardvark   apricot  twelve
apricot  of               apricot  puddle     apricot  hello
apricot  preserves        apricot  where      apricot  dear
apricot  or               apricot  coaxial    apricot  forever

For training a binary classifier we also need negative examples; in fact skip-gram uses more negative than positive examples, with the ratio set by the parameter k.

Choosing noise words

Could pick noise words w according to their unigram frequency P(w): we would then choose the word "the" as a noise word with probability p("the"), aardvark with probability p("aardvark"), and so on.

More common to choose them according to P_α(w), with α = .75, i.e. the weighting p^(3/4)(w):

P_α(w) = count(w)^α / Σ_w′ count(w′)^α        (6.23)

Setting α = ¾ works well because it gives rare noise words slightly higher probability: for rare words, P_α(w) > P(w).

To see this, imagine two events, P(a) = .99 and P(b) = .01:

P_α(a) = .99^.75 / (.99^.75 + .01^.75) = .97
P_α(b) = .01^.75 / (.99^.75 + .01^.75) = .03        (6.24)
Given the set of positive and negative training instances, and an initial set of embeddings, the goal of learning is to adjust those embeddings so the classifier can tell the two sets apart.
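A small sketch of the α-weighted noise distribution; the counts are toy numbers standing in for real corpus counts:

```python
# Build the alpha-weighted unigram distribution and sample k noise words from it.
import numpy as np

counts = {"the": 1000, "of": 800, "apricot": 5, "aardvark": 2}
alpha = 0.75

words = list(counts)
weighted = np.array([counts[w] ** alpha for w in words])
p_alpha = weighted / weighted.sum()            # P_alpha(w): rare words get a boost

rng = np.random.default_rng(0)
k = 2
noise = rng.choice(words, size=k, p=p_alpha)   # k noise words per positive example
print(dict(zip(words, p_alpha.round(3))), noise)
```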
Setup
Let's represent words as vectors of some length (say
300), randomly initialized.
So we start with 300 * V random parameters
Over the entire training set, we’d like to adjust those
word vectors such that we
◦ Maximize the similarity of the target word, context
word pairs (t,c) drawn from the positive data
◦ Minimize the similarity of the (t,c) pairs drawn from
the negative data.
Learning the classifier
Iterative process.
We’ll start with 0 or random weights
Then adjust the word weights to
◦ make the positive pairs more likely
◦ and the negative pairs less likely
over the entire training set:
Objective Criteria
We want to maximize…
Σ_(t,c)∈+ log P(+|t, c) + Σ_(t,c)∈− log P(−|t, c)

Maximize the + label for the pairs from the positive training data, and the − label for the pairs sampled from the negative data.

L(θ) = Σ_(t,c)∈+ log P(+|t, c) + Σ_(t,c)∈− log P(−|t, c)

Focusing on one target word t:

Focusing in on one word/context pair (t, c) with its k noise words n_1 ... n_k, the objective L is:

L(θ) = log P(+|t, c) + Σ_(i=1..k) log P(−|t, n_i)
     = log σ(c·t) + Σ_(i=1..k) log σ(−n_i·t)
     = log [1 / (1 + e^(−c·t))] + Σ_(i=1..k) log [1 / (1 + e^(n_i·t))]

That is, we want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with the k negative sampled non-neighbor words.
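A minimal sketch of one stochastic gradient step on this objective, assuming toy random vectors; the gradient expressions follow from d/dx log σ(x) = 1 − σ(x) and are my derivation, not code from the slides:

```python
# One SGD (gradient-ascent) step on the SGNS objective for a single
# (target t, context c, noise n_1..n_k) training example.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, k, lr = 50, 2, 0.1
t = rng.normal(scale=0.1, size=d)           # target-word embedding (row of W)
c = rng.normal(scale=0.1, size=d)           # context-word embedding (row of C)
noise = rng.normal(scale=0.1, size=(k, d))  # noise-word embeddings (rows of C)

loss = np.log(sigmoid(c @ t)) + np.sum(np.log(sigmoid(-noise @ t)))

# Push t toward c and away from the noise words.
g_pos = 1.0 - sigmoid(c @ t)                # scalar
g_neg = sigmoid(noise @ t)                  # shape (k,)
t_new = t + lr * (g_pos * c - g_neg @ noise)
c_new = c + lr * g_pos * t
noise_new = noise - lr * g_neg[:, None] * t

print(loss)
```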
[Figure: the two embedding matrices. W (d x V) holds target-word embeddings (rows w_j); C holds context/noise-word embeddings. Training increases similarity(apricot, jam) = w_j · c_k, where jam is a neighbor word ("...apricot jam..."), and decreases similarity(apricot, aardvark) = w_j · c_n, where aardvark is a random noise word.]
Train using gradient descent
Actually learns two separate embedding matrices W and C
Can use W and throw away C, or merge them somehow
Summary: How to learn word2vec
(skip-gram) embeddings
Start with V random 300-dimensional vectors as
initial embeddings
Use logistic regression, the second most basic classifier used in machine learning after naïve Bayes
◦ Take a corpus and take pairs of words that co-occur as
positive examples
◦ Take pairs of words that don't co-occur as negative
examples
◦ Train the classifier to distinguish these by slowly adjusting
all the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings.
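A hedged end-to-end sketch using gensim's Word2Vec implementation (assumes gensim >= 4; the toy sentences stand in for a real corpus):

```python
# Train skip-gram with negative sampling and keep only the embeddings.
from gensim.models import Word2Vec

sentences = [
    ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"],
    ["a", "pinch", "of", "apricot", "preserves"],
]

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality d
    window=2,          # +/- 2 context window
    sg=1,              # skip-gram (not CBOW)
    negative=5,        # k negative samples per positive example
    ns_exponent=0.75,  # the alpha = .75 weighting for noise words
    min_count=1,
)

vec = model.wv["apricot"]   # keep the embeddings (model.wv), discard the classifier
```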
Evaluating embeddings
Compare to human scores on word
similarity-type tasks:
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset
(Huang et al., 2012)
• TOEFL dataset: Levied is closest in meaning to: imposed,
believed, requested, correlated
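A sketch of the usual evaluation recipe: correlate the model's cosine similarities with human judgments using Spearman's rho. The three (w1, w2, score) triples are illustrative stand-ins for a full set like WordSim-353:

```python
# Score an embedding model against human word-similarity judgments.
import numpy as np
from scipy.stats import spearmanr
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # same pretrained vectors as above

human = [("tiger", "cat", 7.35), ("car", "automobile", 8.94), ("king", "cabbage", 0.23)]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

model_sims = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in human]
human_sims = [score for _, _, score in human]

print(spearmanr(model_sims, human_sims).correlation)
```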
Properties of embeddings
Similarity depends on window size C

C = ±2 The nearest words to Hogwarts:


◦ Sunnydale
◦ Evernight
C = ±5 The nearest words to Hogwarts:
◦ Dumbledore
◦ Malfoy
◦ halfblood
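A rough way to see the window-size effect yourself, using gensim's downloadable "text8" corpus (a small Wikipedia sample) rather than the Harry Potter data behind this slide; the general tendency is that short windows surface words that can substitute for the query, while longer windows pull in topically related words:

```python
# Compare neighbors under a +/- 2 window vs. a +/- 5 window.
import gensim.downloader as api
from gensim.models import Word2Vec

corpus = list(api.load("text8"))   # iterable of tokenized sentences

small = Word2Vec(corpus, vector_size=100, window=2, sg=1, min_count=5)
large = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=5)

print(small.wv.most_similar("france", topn=5))
print(large.wv.most_similar("france", topn=5))
```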
Analogy: Embeddings capture
relational meaning!
vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
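With pretrained vectors, the parallelogram analogy is a one-liner (this GloVe vocabulary is lowercased):

```python
# Analogy by vector arithmetic: nearest neighbor of king - man + woman, etc.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# vector('king') - vector('man') + vector('woman') ~= vector('queen')
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# vector('paris') - vector('france') + vector('italy') ~= vector('rome')
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```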

Embeddings can help study
word history!
Train embeddings on old books to study
changes in word meaning!!

Will Hamilton
Diachronic word embeddings for
studying language change!

[Figure: word vectors trained on 1920 text vs. word vectors trained on 1990 text, e.g. comparing the "dog" 1920 word vector with the "dog" 1990 word vector across decades (1900, 1950, 2000).]
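Embeddings trained on different decades live in different coordinate systems, so before comparing the 1920 and 1990 vectors for a word they are usually rotated into a common space. A minimal sketch of the orthogonal Procrustes alignment used in this line of work (following Hamilton et al.); the matrices here are random stand-ins, row-aligned over a shared vocabulary:

```python
# Align the 1920 embedding space onto the 1990 space, then compare one word.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "cat", "broadcast", "gay"]
X_1920 = rng.normal(size=(len(vocab), 300))
X_1990 = rng.normal(size=(len(vocab), 300))

# Orthogonal matrix R minimizing ||X_1920 @ R - X_1990||_F
U, _, Vt = np.linalg.svd(X_1920.T @ X_1990)
R = U @ Vt
X_1920_aligned = X_1920 @ R

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i = vocab.index("dog")
print(cosine(X_1920_aligned[i], X_1990[i]))   # low cosine suggests meaning change
```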
Visualizing changes
Project 300 dimensions down into 2

~30 million books, 1850-1990, Google Books data
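A quick sketch of the projection step with PCA; here over a handful of present-day GloVe vectors, whereas the diachronic plots do the same with one vector per word per decade:

```python
# Project high-dimensional embeddings down to 2 dimensions for plotting.
import numpy as np
import gensim.downloader as api
from sklearn.decomposition import PCA

vectors = api.load("glove-wiki-gigaword-100")
words = ["dog", "cat", "car", "automobile", "king", "queen"]

X = np.stack([vectors[w] for w in words])      # shape (6, 100)
xy = PCA(n_components=2).fit_transform(X)      # shape (6, 2)

for word, (x, y) in zip(words, xy):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```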


The evolution of sentiment words
Negative words change faster than positive words

Embeddings and bias
Embeddings reflect cultural bias
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and
Adam T. Kalai. "Man is to computer programmer as woman is to
homemaker? debiasing word embeddings." In Advances in Neural
Information Processing Systems, pp. 4349-4357. 2016.

Ask “Paris : France :: Tokyo : x”


◦ x = Japan
Ask “father : doctor :: mother : x”
◦ x = nurse
Ask “man : computer programmer :: woman : x”
◦ x = homemaker
Embeddings reflect cultural bias
Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from
language corpora contain human-like biases. Science 356:6334, 183-186.

Implicit Association test (Greenwald et al 1998): How associated are


◦ concepts (flowers, insects) & attributes (pleasantness, unpleasantness)?
◦ Studied by measuring timing latencies for categorization.

Psychological findings on US participants:


◦ African-American names are associated with unpleasant words (more than European-
American names)
◦ Male names associated more with math, female names with arts
◦ Old people's names with unpleasant words, young people with pleasant words.

Caliskan et al. replication with embeddings:


◦ African-American names (Leroy, Shaniqua) had a higher GloVe cosine
with unpleasant words (abuse, stink, ugly)
◦ European American names (Brad, Greg, Courtney) had a higher cosine
with pleasant words (love, peace, miracle)
Embeddings reflect and replicate all sorts of pernicious biases.
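A cosine-based sketch in the spirit of Caliskan et al.'s association test, using the pretrained GloVe vectors; the word lists below are the short examples from this slide, not the published stimuli, and the scoring function is a simplified illustration rather than the full WEAT statistic:

```python
# Mean cosine with pleasant words minus mean cosine with unpleasant words.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

pleasant = ["love", "peace", "miracle"]
unpleasant = ["abuse", "stink", "ugly"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def association(word):
    v = vectors[word]
    return (np.mean([cosine(v, vectors[w]) for w in pleasant])
            - np.mean([cosine(v, vectors[w]) for w in unpleasant]))

for name in ["brad", "courtney", "leroy", "shaniqua"]:
    if name in vectors:          # skip names missing from this vocabulary
        print(name, round(association(name), 3))
```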
Directions
Debiasing algorithms for embeddings
◦ Bolukbasi, Tolga, Chang, Kai-Wei, Zou, James Y.,
Saligrama, Venkatesh, and Kalai, Adam T. (2016). Man is
to computer programmer as woman is to homemaker?
debiasing word embeddings. In Advances in Neural Infor-
mation Processing Systems, pp. 4349–4357.
Use embeddings as a historical tool to study bias
Embeddings as a window onto history
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Use the Hamilton historical embeddings


The cosine similarity of embeddings for decade X
for occupations (like teacher) to male vs female
names
◦ Is correlated with the actual percentage of women
teachers in decade X
History of biased framings of women
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Embeddings for competence adjectives are


biased toward men
◦ Smart, wise, brilliant, intelligent, resourceful,
thoughtful, logical, etc.
This bias is slowly decreasing
Embeddings reflect ethnic
stereotypes over time
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

• Princeton trilogy experiments


• Attitudes toward ethnic groups (1933,
1951, 1969) scores for adjectives
• industrious, superstitious, nationalistic, etc
• Cosine of Chinese name embeddings with
those adjective embeddings correlates with
human ratings.
Change in linguistic framing 1910-1990
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

[Figure: plot from the paper showing this association by decade, 1910-1990. Data and code: https://github.com/nikhgarg/EmbeddingDynamicStereotypes]
Changes in framing: adjectives associated with Chinese
Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Table 3. Top Asian (vs. White) adjectives in 1910, 1950, and 1990, by relative norm difference in the COHA embedding

1910            1950            1990
Irresponsible   Disorganized    Inhibited
Envious         Outrageous      Passive
Barbaric        Pompous         Dissolute
Aggressive      Unstable        Haughty
Transparent     Effeminate      Complacent
Monstrous       Unprincipled    Forceful
Hateful         Venomous        Fixed
Cruel           Disobedient     Active
Greedy          Predatory       Sensitive
Bizarre         Boisterous      Hearty
Conclusion
Concepts or word senses
◦ Have a complex many-to-many association with words
(homonymy, multiple senses)
◦ Have relations with each other
◦ Synonymy, Antonymy, Superordinate
◦ But are hard to define formally (necessary & sufficient
conditions)
Embeddings = vector models of meaning
◦ More fine-grained than just a string or index
◦ Especially good at modeling similarity/analogy
◦ Just download them and use cosines!!
◦ Can use sparse models (tf-idf) or dense models (word2vec, GloVe)
◦ Useful in practice but know they encode cultural stereotypes
