Distributional Semantics - Part II
Pawan Goyal (IIT Kharagpur)
Context weighting: words as context

Basic intuition

    word1   word2           freq(1,2)   freq(1)   freq(2)
    dog     small           855         33,338    490,580
    dog     domesticated    29          33,338    918

Association measures are used to give more weight to contexts that are more
significantly associated with a target word: the less frequent the target and
the context element are, the higher the weight given to their co-occurrence
count should be.
Co-occurrence with the frequent context element small is less informative than
co-occurrence with the rarer domesticated.
Different measures are used, e.g., mutual information and the log-likelihood
ratio.
Pointwise Mutual Information (PMI)

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P_{corpus}(w_1, w_2)}{P_{ind}(w_1, w_2)} = \log_2 \frac{P_{corpus}(w_1, w_2)}{P_{corpus}(w_1)\,P_{corpus}(w_2)}$$

where

$$P_{corpus}(w_1, w_2) = \frac{freq(w_1, w_2)}{N}, \qquad P_{corpus}(w) = \frac{freq(w)}{N}$$
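As a quick check of the intuition, here is a minimal sketch that computes PMI
straight from the counts in the earlier table; the corpus size N is a
hypothetical value, not given on the slide.

```python
import math

def pmi(freq_12, freq_1, freq_2, N):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_12 = freq_12 / N
    p_1 = freq_1 / N
    p_2 = freq_2 / N
    return math.log2(p_12 / (p_1 * p_2))

N = 100_000_000  # hypothetical total token count
print(pmi(855, 33_338, 490_580, N))   # (dog, small)        ~ 2.4
print(pmi(29, 33_338, 918, N))        # (dog, domesticated) ~ 6.6
```

With these counts, (dog, domesticated) scores far higher than (dog, small),
matching the intuition that the rarer context is the more informative one.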
PMI: Issues and Variations

Positive PMI
All PMI values less than zero are replaced with zero.

Discounting factor
$$\delta_{ij} = \frac{f_{ij}}{f_{ij}+1} \cdot \frac{\min(f_i, f_j)}{\min(f_i, f_j)+1}$$
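A sketch of both variations, reusing pmi() and N from the previous snippet.
Multiplying the PMI value by the discounting factor is the usual convention;
that combination is assumed here rather than stated on the slide.

```python
def ppmi(pmi_value):
    # Positive PMI: values below zero are replaced with zero.
    return max(0.0, pmi_value)

def discount(f_ij, f_i, f_j):
    # The discounting factor above: shrinks scores that rest on small
    # co-occurrence counts or on rare marginal counts.
    return (f_ij / (f_ij + 1)) * (min(f_i, f_j) / (min(f_i, f_j) + 1))

# Discounted PPMI for (dog, domesticated):
score = discount(29, 33_338, 918) * ppmi(pmi(29, 33_338, 918, N))
```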
Distributional Vectors: Example
Application to Query Expansion: Addressing Term Mismatch

User query: insurance cover which pays for long term care.

A relevant document may contain terms different from the actual user query.
Some relevant words concerning this query: {medicare, premiums, insurers}
Query Expansion using Unstructured DSMs

TREC Topic 104: catastrophic health insurance

Query Representation: surtax:1.0 hcfa:0.97 medicare:0.93 hmos:0.83
medicaid:0.8 hmo:0.78 beneficiaries:0.75 ambulatory:0.72 premiums:0.72
hospitalization:0.71 hhs:0.7 reimbursable:0.7 deductible:0.69

Broad expansion terms: medicare, beneficiaries, premiums . . .
Specific domain terms: CNES (Centre National d'Études Spatiales) and NASDA
(National Space Development Agency of Japan)
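A minimal sketch of how such a weighted representation might be produced from
an unstructured DSM: rank all vocabulary terms by similarity to the query. The
embeddings mapping and the centroid-of-query-vectors choice are illustrative
assumptions, not the exact method behind the weights above.

```python
import numpy as np

def expand_query(query_terms, embeddings, k=10):
    """Rank vocabulary terms by cosine similarity to the centroid of
    the query's word vectors. `embeddings` is a hypothetical
    {word: np.ndarray} mapping."""
    vecs = [embeddings[t] for t in query_terms if t in embeddings]
    centroid = np.mean(vecs, axis=0)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    scored = [(w, cos(v, centroid)) for w, v in embeddings.items()
              if w not in query_terms]
    return sorted(scored, key=lambda x: -x[1])[:k]
```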
Phi coefficient

The square of the phi coefficient is related to the chi-squared statistic for a
2 × 2 contingency table.
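For reference (the slide itself does not spell it out), the standard definition
in terms of the cell counts and marginals of the 2 × 2 table, with n the total
count:

```latex
% cells n_{11}, n_{10}, n_{01}, n_{00};
% marginals n_{1.}, n_{0.}, n_{.1}, n_{.0}
\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}
            {\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}},
\qquad
\phi^2 = \frac{\chi^2}{n}
```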
Similarity Measures for Binary Vectors

Let X and Y denote the binary distributional vectors for words X and Y
(interpreting each vector as the set of its active contexts).

Similarity Measures

$$\text{Dice coefficient}: \quad \frac{2\,|X \cap Y|}{|X| + |Y|}$$

$$\text{Jaccard coefficient}: \quad \frac{|X \cap Y|}{|X \cup Y|}$$

$$\text{Overlap coefficient}: \quad \frac{|X \cap Y|}{\min(|X|, |Y|)}$$
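A minimal sketch treating each binary vector as the set of its active context
dimensions; the example contexts are illustrative, not from the slides.

```python
def dice(X, Y):
    # X, Y are sets of active context dimensions (binary vectors).
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X, Y):
    return len(X & Y) / len(X | Y)

def overlap(X, Y):
    return len(X & Y) / min(len(X), len(Y))

X = {"bark", "leash", "pet", "tail"}
Y = {"pet", "tail", "whiskers"}
print(dice(X, Y), jaccard(X, Y), overlap(X, Y))  # 0.571..., 0.4, 0.666...
```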
Similarity Measures for Vector Spaces

Similarity Measures

$$\text{Cosine similarity}: \quad \cos(\vec{X}, \vec{Y}) = \frac{\vec{X} \cdot \vec{Y}}{|\vec{X}|\,|\vec{Y}|}$$

$$\text{Euclidean distance (on length-normalised vectors)}: \quad |\vec{X} - \vec{Y}| = \sqrt{\sum_{i=1}^{n} \left( \frac{x_i}{|\vec{X}|} - \frac{y_i}{|\vec{Y}|} \right)^2}$$
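Both measures in a few lines of numpy, directly following the formulas above:

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def norm_euclidean(x, y):
    # Euclidean distance between the length-normalised vectors,
    # matching the formula above.
    xn = x / np.linalg.norm(x)
    yn = y / np.linalg.norm(y)
    return np.linalg.norm(xn - yn)
```

On length-normalised vectors the two are monotonically related: the squared
distance equals 2 - 2*cos(x, y), so rankings by cosine and by this distance
coincide (in reverse order).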
Similarity Measure for Probability Distributions
Attributional Similarity vs. Relational Similarity

Attributional Similarity
The attributional similarity between two words a and b depends on the degree
of correspondence between the properties of a and b.
Ex: dog and wolf

Relational Similarity
Two pairs (a, b) and (c, d) are relationally similar if they have many similar
relations.
Ex: dog : bark and cat : meow
Relational Similarity: Pair-pattern matrix

Pair-pattern matrix
Row vectors correspond to pairs of words, such as mason : stone and
carpenter : wood.
Column vectors correspond to the patterns in which the pairs occur, e.g.,
"X cuts Y" and "X works with Y".
Compute the similarity of rows to find similar pairs.
Structured DSMs

Basic Issue
Words may not be the basic context units anymore.
How to capture and represent syntactic information?
X solves Y vs. Y is solved by X

An Ideal Formalism
Should mirror semantic relationships as closely as possible
Should incorporate word-based information and syntactic analysis
Should be applicable to different languages
Structured DSMs

Word vectors
<system, dobj, affects> ...
Corpus-derived ternary data can also be mapped onto a 2-way matrix.
2-way matrix
Structured DSMs for Selectional Preferences

Selectional Preferences
Suppose we want to compute the selectional preferences of nouns as objects of
the verb eat.
The n nouns having the highest weight in the dimension obj-eat are selected;
let {vegetable, biscuit, . . .} be the set of these n nouns.
The complete vectors of these n nouns are used to obtain an object prototype
of the verb.
The object prototype will indicate various attributes, such as that these
nouns can be consumed, bought, carried, stored, etc.
The similarity of a noun to this object prototype is used to denote the
plausibility of that noun being an object of the verb eat.
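The procedure fits in a few lines; this is a sketch assuming the vectors come
from the structured DSM and that the prototype is a plain average (the slide
does not fix the averaging scheme).

```python
import numpy as np

def object_plausibility(noun_vec, top_object_vecs):
    """Average the full vectors of the verb's n highest-weighted
    objects into a prototype, then score a candidate noun by cosine
    similarity to that prototype."""
    prototype = np.mean(top_object_vecs, axis=0)
    return (noun_vec @ prototype
            / (np.linalg.norm(noun_vec) * np.linalg.norm(prototype)))
```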
Sequences of dependency edges as paths
Three parameters: Padó and Lapata (2007)
Choice of parameters by Padó
Distributional Memory (DM); Baroni and Lenci (2010)
Weighted tuple structure

Constraints on T_W
W1 = W2 (both word slots draw on the same vocabulary)
Inverse link constraint:
$$\langle\langle marine, use, bomb\rangle, v_t\rangle \Leftrightarrow \langle\langle bomb, use^{-1}, marine\rangle, v_t\rangle$$
The DM semantic spaces
Some Pair-Problems Addressed

Relation Classification

    Relation       Example pair
    Cause-Effect   cycling - happiness
    Purpose        album - picture
    Location-At    pain - chest
    Time-At        snack - midnight
Word Vectors

One-hot representation
Word Vectors - One-hot Encoding

Suppose our vocabulary has only five words: King, Queen, Man, Woman, and
Child. We could encode the word Queen as [0, 1, 0, 0, 0].
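A minimal sketch of this encoding for the five-word vocabulary, with the index
order following the listing above:

```python
vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word):
    # A |V|-dimensional vector with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("Queen"))  # [0, 1, 0, 0, 0]
```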
Limitations of One-hot encoding

Every pair of one-hot vectors is orthogonal and equidistant, so the encoding
captures no notion of similarity between words, and its dimensionality grows
with the size of the vocabulary.
Word2Vec: A distributed representation

Distributional representation → word embedding?
Any word wi in the corpus is given a distributional representation by an
embedding

$$w_i \in \mathbb{R}^d$$

i.e., a d-dimensional vector, which is mostly learnt!
Distributional Representation
Distributional Representation: Illustration

If we label the dimensions in a hypothetical word vector (there are no such
pre-assigned labels in the algorithm, of course), it might look a bit like this:
Word Embeddings
Reasoning with Word Vectors

It has been found that the learned word representations in fact capture
meaningful syntactic and semantic regularities in a very simple way.
Specifically, the regularities are observed as constant vector offsets between
pairs of words sharing a particular relationship: for example,
vector(king) - vector(man) + vector(woman) lies close to vector(queen),
and so on.
Reasoning with Word Vectors

Perhaps more surprisingly, we find that this is also the case for a variety of
semantic relations.

Good at answering analogy questions
a is to b as c is to ?
man is to woman as uncle is to ? (aunt)

A simple vector offset method based on cosine distance reveals the relation,
as in the sketch below.
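A sketch of the offset method; embeddings is a hypothetical {word: np.ndarray}
mapping, and excluding the three input words from the candidates is the usual
convention rather than something stated on the slide.

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?': find the word whose vector is
    closest (by cosine) to b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in embeddings.items():
        if w in (a, b, c):
            continue  # standard convention: skip the query words
        sim = v @ target / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# analogy("man", "woman", "uncle", embeddings)  ->  ideally "aunt"
```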
Vector Offset for Gender Relation
Vector Offset for Singular-Plural Relation
Encoding Other Dimensions of Similarity

Analogy Testing
Country-capital city relationships
More Analogy Questions
Element-Wise Addition
Learning Word Vectors
Basic Idea
Two Variations: CBOW and Skip-grams
CBOW

The context words form the input layer. Each word is encoded in one-hot form.
There is a single hidden layer and a single output layer.
CBOW: Training Objective
CBOW: Input to Hidden Layer

Since our input vectors are one-hot, multiplying an input vector by the weight
matrix W1 amounts to simply selecting a row from W1.
Given C input word vectors, the activation function for the hidden layer h
amounts to summing the corresponding "hot" rows of W1 and dividing by C to
take their average.
CBOW: Hidden to Output Layer

From the hidden layer to the output layer, the second weight matrix W2 can be
used to compute a score for each word in the vocabulary, and softmax can be
used to obtain the posterior distribution over words. A sketch of the full
forward pass follows.
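Putting the last two slides together, a minimal numpy sketch of the CBOW
forward pass; the sizes and the random initialisation are illustrative
assumptions.

```python
import numpy as np

V, d = 10_000, 100                  # assumed vocabulary and embedding sizes
W1 = 0.01 * np.random.randn(V, d)   # input -> hidden weights, one row per word
W2 = 0.01 * np.random.randn(d, V)   # hidden -> output weights

def cbow_forward(context_ids):
    # One-hot inputs just select rows of W1; the hidden layer h is
    # their average over the C context words.
    h = W1[context_ids].mean(axis=0)
    # W2 turns h into one score per vocabulary word; softmax gives the
    # posterior distribution over the center word.
    scores = h @ W2
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()
```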
Skip-gram Model

The skip-gram model is the opposite of the CBOW model: it is constructed with
the focus word as the single input vector, and the target context words are
now at the output layer.
Skip-gram Model: Training

The activation function for the hidden layer simply amounts to copying the
corresponding row from the weight matrix W1 (linear), as we saw before.
At the output layer, we now output C multinomial distributions instead of
just one.
The training objective is to minimize the summed prediction error across all
context words in the output layer. In our example, the input would be
learning, and we hope to see (an, efficient, method, for, high, quality,
distributed, vector) at the output layer.
Skip-gram Model

Details
Predict the surrounding words in a window of length c around each word.

Objective Function: maximize the log probability of any context word given
the current center word:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
Word Vectors

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\!\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_w}^{\!\top} v_{w_I}\right)}$$

where v and v' are the input and output vector representations of w (so every
word has two vectors).
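The softmax translates directly into numpy; a sketch assuming the two vector
sets are stored as row matrices indexed by word id.

```python
import numpy as np

def skipgram_prob(o, i, V_in, V_out):
    """p(w_O | w_I) from the softmax above. V_in holds the input
    vectors v_w as rows, V_out the output vectors v'_w as rows;
    o and i are word indices."""
    scores = V_out @ V_in[i]             # v'_w . v_{w_I} for every w
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp[o] / exp.sum()
```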
Parameters
Gradient Descent for Parameter Updates

$$\theta_j^{new} = \theta_j^{old} - \alpha \, \frac{\partial}{\partial \theta_j^{old}} J(\theta)$$
Implementation Tricks

Use stochastic updates: compute the gradient on one window at a time,

$$\theta_j^{new} = \theta_j^{old} - \alpha \, \frac{\partial}{\partial \theta_j^{old}} J_t(\theta)$$
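The update rule as one line of code; the learning rate value is just a typical
choice, not prescribed by the slide.

```python
def sgd_step(theta, grad, alpha=0.025):
    # One stochastic gradient-descent update: with J_t, `grad` is the
    # gradient of the objective on a single window, not the whole corpus.
    return theta - alpha * grad
```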
Two sets of vectors

Each word has two vectors, one from each weight matrix; one common option is
to sum them:

$$L_{final} = L + L'$$

An interactive demo:
https://ronxin.github.io/wevi/
GloVe

Combines the best of both worlds: count-based methods as well as direct
prediction methods.

Fast training
Scalable to huge corpora
Good performance even with a small corpus, and small vectors

Code and vectors: http://nlp.stanford.edu/projects/glove/
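A sketch for loading the released vectors; the plain-text format (one word per
line, followed by its space-separated components) matches the files at the URL
above, and the file name in the usage comment is one of the released files.

```python
import numpy as np

def load_glove(path):
    """Load vectors in the plain-text GloVe release format."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# e.g. embeddings = load_glove("glove.6B.100d.txt")
```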
GloVe Visualisations