Distributional Semantics 2
Pawan Goyal
CSE, IIT Kharagpur
Basic intuition

word1   word2          freq(1,2)   freq(1)    freq(2)
dog     small          855         33,338     490,580
dog     domesticated   29          33,338     918

Association measures are used to give more weight to contexts that are more significantly associated with a target word.
The less frequent the target and the context element are, the higher the weight given to their co-occurrence count should be.
Co-occurrence with the frequent context element small is less informative than co-occurrence with the rarer domesticated.
Different measures exist, e.g., Mutual Information, Log-likelihood ratio.
Pointwise Mutual Information (PMI)

PMI(w_1, w_2) = \log_2 \frac{P_{corpus}(w_1, w_2)}{P_{ind}(w_1, w_2)}

PMI(w_1, w_2) = \log_2 \frac{P_{corpus}(w_1, w_2)}{P_{corpus}(w_1)\, P_{corpus}(w_2)}

where P_{corpus}(w_1, w_2) = \frac{freq(w_1, w_2)}{N}, \qquad P_{corpus}(w) = \frac{freq(w)}{N}
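A minimal sketch of computing PMI from the counts in the table above, assuming a hypothetical corpus size N (the absolute values depend on N, but the comparison between the two pairs does not):

```python
import math

# Assumed total number of tokens in the corpus (hypothetical value).
N = 100_000_000

def pmi(freq_12, freq_1, freq_2, n):
    """Pointwise mutual information from raw (co-)occurrence counts."""
    p_12 = freq_12 / n            # P_corpus(w1, w2)
    p_1 = freq_1 / n              # P_corpus(w1)
    p_2 = freq_2 / n              # P_corpus(w2)
    return math.log2(p_12 / (p_1 * p_2))

print(pmi(855, 33_338, 490_580, N))   # dog + small
print(pmi(29, 33_338, 918, N))        # dog + domesticated (higher PMI)
```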
\frac{f_{ij}}{\min(f_i, f_j)}

\frac{f_{ij}+1}{\min(f_i, f_j)+1}
Phi coefficient
The square of the Phi coefficient is related to the chi-squared statistic for a 2×2 contingency table.
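In standard notation, for a 2×2 contingency table with cell counts n_{11}, n_{10}, n_{01}, n_{00}, marginals n_{1\bullet}, n_{0\bullet}, n_{\bullet 1}, n_{\bullet 0}, and total N:

\phi = \frac{n_{11}\, n_{00} - n_{10}\, n_{01}}{\sqrt{n_{1\bullet}\, n_{0\bullet}\, n_{\bullet 1}\, n_{\bullet 0}}}, \qquad \phi^2 = \frac{\chi^2}{N}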
Similarity Measures

Let X and Y denote the binary distributional vectors for words X and Y.

Dice coefficient: \frac{2|X \cap Y|}{|X| + |Y|}
Similarity Measures

Cosine similarity: \cos(\vec{X}, \vec{Y}) = \frac{\vec{X} \cdot \vec{Y}}{|\vec{X}|\,|\vec{Y}|}
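A minimal sketch of cosine similarity between two context-count vectors (NumPy and the toy counts are illustrative assumptions):

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two distributional vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([855.0, 29.0, 0.0])   # toy context counts for one word
y = np.array([800.0, 10.0, 5.0])   # toy context counts for another word
print(cosine(x, y))                # value in [-1, 1]; 1 = identical direction
```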
Similarity Measures

\frac{p}{p+q}
Attributional Similarity
The attributional similarity between two words a and b depends on the degree of correspondence between the properties of a and b.
Ex: dog and wolf

Relational Similarity
Two pairs (a, b) and (c, d) are relationally similar if they share many similar relations.
Ex: dog : bark and cat : meow
Structured DSMs

Basic Issue
Words may not be the basic context units anymore.
How to capture and represent syntactic information?
X solves Y and Y is solved by X

An Ideal Formalism
Should mirror semantic relationships as closely as possible
Should incorporate word-based information and syntactic analysis
Should be applicable to different languages
Use the Dependency grammar framework
Structured DSMs

Using Dependency Structure: How does it help?
The teacher eats a red apple.
Structured DSMs

Distributional models, as guided by dependency
Ex: For the sentence "This virus affects the body's defense system.", the dependency parse is:

Word vectors
<system, dobj, affects> ...
Corpus-derived ternary data can also be mapped onto a 2-way matrix.
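As a rough sketch of this idea (the triples and counting scheme below are illustrative, not from the slides), dependency triples like <system, dobj, affects> can be turned into a word-by-(relation, head) co-occurrence structure:

```python
from collections import Counter, defaultdict

# Hypothetical parsed triples: (dependent, relation, head).
triples = [
    ("system", "dobj", "affects"),
    ("virus", "nsubj", "affects"),
    ("system", "dobj", "affects"),
]

# word -> Counter over structured contexts such as "dobj-affects"
vectors = defaultdict(Counter)
for dep, rel, head in triples:
    vectors[dep][f"{rel}-{head}"] += 1

print(vectors["system"])  # Counter({'dobj-affects': 2})
```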
2-way matrix
             noun 1   noun 2   noun 3
obj-carry    0.1      0.3      0.4
obj-buy      0.4      0.5      0.4
obj-drive    0.8      0        0
obj-eat      0.02     0.6      0.5
obj-store    0.2      0.3      0.4
sub-fly      0.05     0.05     0.02
...          ...      ...      ...
Selectional Preferences

Suppose we want to compute the selectional preferences of nouns as the object of the verb eat.
The n nouns having the highest weight in the dimension obj-eat are selected; let {vegetable, biscuit, ...} be the set of these n nouns.
The complete vectors of these n nouns are used to obtain an object prototype of the verb.
The object prototype will indicate various attributes, such as that these nouns can be consumed, bought, carried, stored, etc.
The similarity of a noun to this object prototype is used to denote the plausibility of that noun being an object of the verb eat.
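A minimal sketch of this procedure, assuming NumPy; the noun vectors below are toy values over the dependency dimensions shown earlier, and the third noun is only an illustrative addition:

```python
import numpy as np

# Toy vectors over dimensions [obj-carry, obj-buy, obj-drive, obj-eat, obj-store, sub-fly].
nouns = {
    "vegetable": np.array([0.3, 0.5, 0.0, 0.6, 0.3, 0.05]),
    "biscuit":   np.array([0.4, 0.4, 0.0, 0.5, 0.4, 0.02]),
    "car":       np.array([0.1, 0.4, 0.8, 0.02, 0.2, 0.05]),  # hypothetical third noun
}

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Object prototype of "eat": centroid of the n nouns with the highest obj-eat weight.
OBJ_EAT = 3                                    # index of the obj-eat dimension
top_n = sorted(nouns, key=lambda w: nouns[w][OBJ_EAT], reverse=True)[:2]
prototype = np.mean([nouns[w] for w in top_n], axis=0)

# Plausibility of each noun as an object of "eat": similarity to the prototype.
for w, v in nouns.items():
    print(w, round(cosine(v, prototype), 3))
```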
Constraints on T_W
W1 = W2
Inverse link constraint: <<marine, use, bomb>, v_t> ↔ <<bomb, use⁻¹, marine>, v_t>
Relation Classification

Cause-Effect   cycling-happiness
Purpose        album-picture
Location-At    pain-chest
Time-At        snack-midnight
Word Vectors

One-hot representation
Suppose our vocabulary has only five words: King, Queen, Man, Woman,
and Child.
We could encode the word Queen as:
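A minimal sketch of what that one-hot vector looks like, assuming the word order King, Queen, Man, Woman, Child:

```python
vocab = ["King", "Queen", "Man", "Woman", "Child"]

# One-hot vector for "Queen": 1 at its index, 0 everywhere else.
queen = [1 if w == "Queen" else 0 for w in vocab]
print(queen)  # [0, 1, 0, 0, 0]
```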
w_i \in \mathbb{R}^d, i.e., a d-dimensional vector, which is mostly learnt!
Distributional Representation
Word Embeddings
It has been found that the learned word representations in fact capture meaningful syntactic and semantic regularities in a very simple way.
Specifically, the regularities are observed as constant vector offsets between pairs of words sharing a particular relationship.
Perhaps more surprisingly, we find that this is also the case for a variety of semantic relations.
Analogy Testing
Basic Idea
Instead of capturing co-occurrence counts directly, predict the surrounding words of every word.
Code as well as word vectors: https://code.google.com/p/word2vec/
CBOW

The context words form the input layer. Each word is encoded in one-hot form.
There is a single hidden layer and a single output layer.
Given C input word vectors, the activation function for the hidden layer h amounts to simply summing the corresponding hot rows in W1 and dividing by C to take their average.
From the hidden layer to the output layer, the second weight matrix W2 can be used to compute a score for each word in the vocabulary, and softmax can be used to obtain the posterior distribution of words.
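A minimal NumPy sketch of this forward pass; the dimensions, random initialization, and context indices are arbitrary toy choices, not the actual word2vec code:

```python
import numpy as np

V, d, C = 10, 4, 3                     # vocabulary size, embedding size, context size
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, d))           # input->hidden weights (one row per word)
W2 = rng.normal(size=(d, V))           # hidden->output weights

context_ids = [2, 5, 7]                # one-hot context words, given by their indices

# Hidden layer: average of the rows of W1 selected by the one-hot inputs.
h = W1[context_ids].mean(axis=0)

# Output layer: one score per vocabulary word, then softmax.
scores = h @ W2
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.shape, probs.sum())        # (10,) 1.0 -- a distribution over the vocabulary
```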
Skip-gram Model
The skip-gram model is the opposite of the CBOW model. It is constructed
with the focus word as the single input vector, and the target context words are
now at the output layer:
The activation function for the hidden layer simply amounts to copying the corresponding row from the weight matrix W1 (linear), as we saw before.
At the output layer, we now output C multinomial distributions instead of just one.
The training objective is to minimize the summed prediction error across all context words in the output layer. In our example, the input would be learning, and we hope to see (an, efficient, method, for, high, quality, distributed, vector) at the output layer.
Skip-gram Model

Details
Predict the surrounding words in a window of length c around each word.
Objective function: maximize the log probability of any context word given the current center word:

J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)
Word Vectors

p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_w}^{\top} v_{w_I}\right)}

where v and v' are the input and output vector representations of w (so every word has two vectors).
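A minimal sketch of this softmax with toy matrices (in practice word2vec avoids the full softmax using tricks such as hierarchical softmax or negative sampling):

```python
import numpy as np

V, d = 10, 4
rng = np.random.default_rng(1)
v_in = rng.normal(size=(V, d))    # input vectors, one per word
v_out = rng.normal(size=(V, d))   # output vectors, one per word

def p_context_given_center(w_o, w_i):
    """p(w_O | w_I) via the full softmax over the vocabulary."""
    scores = v_out @ v_in[w_i]            # v'_w^T v_{w_I} for every w
    scores -= scores.max()                # numerical stability
    exp = np.exp(scores)
    return exp[w_o] / exp.sum()

print(p_context_given_center(3, 7))
```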
Parameters
With d-dimensional word vectors and V words, the model has 2dV parameters: an input and an output vector for every word.
\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial J(\theta)}{\partial \theta_j^{old}}
Implementation Tricks

Stochastic gradient descent: update the parameters using the gradient of the objective at a single position t:

\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial J_t(\theta)}{\partial \theta_j^{old}}
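A minimal sketch of one such update step (generic SGD with an assumed learning rate alpha; not the actual word2vec training loop):

```python
import numpy as np

def sgd_step(theta, grad, alpha=0.025):
    """One stochastic gradient descent update: theta_new = theta_old - alpha * grad."""
    return theta - alpha * grad

theta = np.zeros(5)
grad_t = np.array([0.1, -0.2, 0.0, 0.3, -0.1])   # gradient of J_t at the current example
theta = sgd_step(theta, grad_t)
print(theta)
```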
The input and output vectors can be combined into the final word representations: L_{final} = L + L'

An interactive demo: https://ronxin.github.io/wevi/
GloVe

Combines the best of both worlds: count-based methods as well as direct prediction methods
Fast training
Scalable to huge corpora
Good performance even with a small corpus and small vectors
Code and vectors: http://nlp.stanford.edu/projects/glove/
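A minimal sketch of loading pretrained vectors from the plain-text format distributed at the URL above (the file name is only an example; each line holds a word followed by its vector components):

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a whitespace-separated text file: word v1 v2 ... vd."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

# vectors = load_glove("glove.6B.100d.txt")   # example file from the GloVe release
```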
GloVe Visualisations