
How to make words with vectors:

Phrase generation in distributional semantics

Georgiana Dinu and Marco Baroni


Center for Mind/Brain Sciences
University of Trento, Italy
(georgiana.dinu|marco.baroni)@unitn.it

Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 624-633, Baltimore, Maryland, USA, June 23-25 2014. © 2014 Association for Computational Linguistics

Abstract

We introduce the problem of generation in distributional semantics: Given a distributional vector representing some meaning, how can we generate the phrase that best expresses that meaning? We motivate this novel challenge on theoretical and practical grounds and propose a simple data-driven approach to the estimation of generation functions. We test this in a monolingual scenario (paraphrase generation) as well as in a cross-lingual setting (translation by synthesizing adjective-noun phrase vectors in English and generating the equivalent expressions in Italian).

1 Introduction

Distributional methods for semantics approximate the meaning of linguistic expressions with vectors that summarize the contexts in which they occur in large samples of text. This has been a very successful approach to lexical semantics (Erk, 2012), where semantic relatedness is assessed by comparing vectors. Recently these methods have been extended to phrases and sentences by means of composition operations (see Baroni (2013) for an overview). For example, given the vectors representing red and car, composition derives a vector that approximates the meaning of red car.

However, the link between language and meaning is, obviously, bidirectional: As message recipients we are exposed to a linguistic expression and we must compute its meaning (the synthesis problem). As message producers we start from the meaning we want to communicate (a thought) and we must encode it into a word sequence (the generation problem). If distributional semantics is to be considered a proper semantic theory, then it must deal not only with synthesis (going from words to vectors), but also with generation (from vectors to words).

Besides these theoretical considerations, phrase generation from vectors has many useful applications. We can, for example, synthesize the vector representing the meaning of a phrase or sentence, and then generate alternative phrases or sentences from this vector to accomplish true paraphrase generation (as opposed to paraphrase detection or ranking of candidate paraphrases).

Generation can be even more useful when the source vector comes from another modality or language. Recent work on grounding language in vision shows that it is possible to represent images and linguistic expressions in a common vector-based semantic space (Frome et al., 2013; Socher et al., 2013). Given a vector representing an image, generation can be used to productively construct phrases or sentences that describe the image (as opposed to simply retrieving an existing description from a set of candidates). Translation is another potential application of the generation framework: Given a semantic space shared between two or more languages, one can compose a word sequence in one language and generate translations in another, with the shared semantic vector space functioning as interlingua.

Distributional semantics assumes a lexicon of atomic expressions (that, for simplicity, we take to be words), each associated to a vector. Thus, at the single-word level, the problem of generation is solved by a trivial generation-by-synthesis approach: Given an arbitrary target vector, generate the corresponding word by searching through the lexicon for the word with the closest vector to the target. This is however unfeasible for larger expressions: Given $n$ vocabulary elements, this approach requires checking $n^k$ phrases of length $k$. This becomes prohibitive already for relatively short phrases, as reasonably-sized vocabularies do not go below tens of thousands of words. The search space for 3-word phrases in a 10K-word vocabulary is already in the order of trillions. In
this paper, we introduce a more direct approach to phrase generation, inspired by the work in compositional distributional semantics. In short, we revert the composition process and we propose a framework of data-induced, syntax-dependent functions that decompose a single vector into a vector sequence. The generated vectors can then be efficiently matched against those in the lexicon or fed to the decomposition system again to produce longer phrases recursively.

2 Related work

To the best of our knowledge, we are the first to explicitly and systematically pursue the generation problem in distributional semantics. Kalchbrenner and Blunsom (2013) use top-level, composed distributed representations of sentences to guide generation in a machine translation setting. More precisely, they condition the target language model on the composed representation (addition of word vectors) of the source language sentence.

Andreas and Ghahramani (2013) discuss the issue of generating language from vectors and present a probabilistic generative model for distributional vectors. However, their emphasis is on reversing the generative story in order to derive composed meaning representations from word sequences. The theoretical generating capabilities of the methods they propose are briefly exemplified, but not fully explored or tested.

Socher et al. (2011) come closest to our target problem. They introduce a bidirectional language-to-meaning model for compositional distributional semantics that is similar in spirit to ours. However, we present a clearer decoupling of synthesis and generation and we use different (and simpler) training methods and objective functions. Moreover, Socher and colleagues do not train separate decomposition rules for different syntactic configurations, so it is not clear how they would be able to control the generation of different output structures. Finally, the potential for generation is only addressed in passing, by presenting a few cases where the generated sequence has the same syntactic structure of the input sequence.

3 General framework

We start by presenting the familiar synthesis setting, focusing on two-word phrases. We then introduce generation for the same structures. Finally, we show how synthesis and generation of longer phrases is handled by recursive extension of the two-word case. We assume a lexicon $L$, that is, a bi-directional look-up table containing a list of words $L_w$ linked to a matrix $L_v$ of vectors. Both synthesis and generation involve a trivial lexicon look-up step to retrieve vectors associated to words and vice versa: We ignore it in the exposition below.

3.1 Synthesis

To construct the vector representing a two-word phrase, we must compose the vectors associated to the input words. More formally, similarly to Mitchell and Lapata (2008), we define a syntax-dependent composition function yielding a phrase vector $\vec{p}$:

$$\vec{p} = f_{comp_R}(\vec{u}, \vec{v})$$

where $\vec{u}$ and $\vec{v}$ are the vector representations associated to words $u$ and $v$. $f_{comp_R}: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ (for $d$ the dimensionality of vectors) is a composition function specific to the syntactic relation $R$ holding between the two words.[1]

[1] Here we make the simplifying assumption that all vectors have the same dimensionality, however this need not necessarily be the case.

Although we are not bound to a specific composition model, throughout this paper we use the method proposed by Guevara (2010) and Zanzotto et al. (2010) which defines composition as application of linear transformations to the two constituents followed by summing the resulting vectors: $f_{comp_R}(\vec{u}, \vec{v}) = W_1\vec{u} + W_2\vec{v}$. We will further use the following equivalent formulation:

$$f_{comp_R}(\vec{u}, \vec{v}) = W_R[\vec{u}; \vec{v}]$$

where $W_R \in \mathbb{R}^{d \times 2d}$ and $[\vec{u}; \vec{v}]$ is the vertical concatenation of the two vectors (using Matlab notation). Following Guevara, we learn $W_R$ using examples of word and phrase vectors directly extracted from the corpus (for the rest of the paper, we refer to these phrase vectors extracted non-compositionally from the corpus as observed vectors). To estimate, for example, the weights in the $W_{AN}$ (adjective-noun) matrix, we use the corpus-extracted vectors of the words in tuples such as ⟨red, car, red.car⟩, ⟨evil, cat, evil.cat⟩, etc. Given a set of training examples stacked into matrices $U$, $V$ (the constituent vectors) and $P$ (the corresponding observed vectors), we estimate $W_R$ by solving the least-squares regression problem:
$$\min_{W_R \in \mathbb{R}^{d \times 2d}} \| P - W_R [U; V] \| \quad (1)$$

We use the approximation of observed phrase vectors as objective because these vectors can provide direct evidence of the polysemous behaviour of words: For example, the corpus-observed vectors of green jacket and green politician reflect how the meaning of green is affected by its occurrence with different nouns. Moreover, it has been shown that for two-word phrases, despite their relatively low frequency, such corpus-observed representations are still difficult to outperform in phrase similarity tasks (Dinu et al., 2013; Turney, 2012).

3.2 Generation

Generation of a two-word sequence from a vector proceeds in two steps: decomposition of the phrase vectors into two constituent vectors, and search for the nearest neighbours of each constituent vector in $L_v$ (the lexical matrix) in order to retrieve the corresponding words from $L_w$.

Decomposition We define a syntax-dependent decomposition function:

$$[\vec{u}; \vec{v}] = f_{decomp_R}(\vec{p})$$

where $\vec{p}$ is a phrase vector, $\vec{u}$ and $\vec{v}$ are vectors associated to words standing in the syntactic relation $R$ and $f_{decomp_R}: \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$.

We assume that decomposition is also a linear transformation, $W'_R \in \mathbb{R}^{2d \times d}$, which, given an input phrase vector, returns two constituent vectors:

$$f_{decomp_R}(\vec{p}) = W'_R \vec{p}$$

Again, we can learn from corpus-observed vectors associated to tuples of word pairs and the corresponding phrases by solving:

$$\min_{W'_R \in \mathbb{R}^{2d \times d}} \| [U; V] - W'_R P \| \quad (2)$$

If a composition function $f_{comp_R}$ is available, an alternative is to learn a function that can best revert this composition. The decomposition function is then trained as follows:

$$\min_{W'_R \in \mathbb{R}^{2d \times d}} \| [U; V] - W'_R W_R [U; V] \| \quad (3)$$

where the matrix $W_R$ is a given composition function for the same relation $R$. Training with observed phrases, as in eq. (2), should be better at capturing the idiosyncrasies of the actual distribution of phrases in the corpus and it is more robust by being independent from the availability and quality of composition functions. On the other hand, if the goal is to revert as faithfully as possible the composition process and retrieve the original constituents (e.g., in a different modality or a different language), then the objective in eq. (3) is more motivated.

Nearest neighbour search We retrieve the nearest neighbours of each constituent vector $\vec{u}$ obtained by decomposition by applying a search function $s$:

$$NN_{\vec{u}} = s(\vec{u}, L_v, t)$$

where $NN_{\vec{u}}$ is a list containing the $t$ nearest neighbours of $\vec{u}$ from $L_v$, the lexical vectors. Depending on the task, $t$ might be set to 1 to retrieve just one word sequence, or to larger values to retrieve $t$ alternatives. The similarity measure used to determine the nearest neighbours is another parameter of the search function; we omit it here as we only experiment with the standard cosine measure (Turney and Pantel, 2010).[2]

[2] Note that in terms of computational efficiency, cosine-based nearest neighbour searches reduce to vector-matrix multiplications, for which many efficient implementations exist. Methods such as locality sensitive hashing can be used for further speedups when working with particularly large vocabularies (Andoni and Indyk, 2008).

3.3 Recursive (de)composition

Extension to longer sequences is straightforward if we assume binary tree representations as syntactic structures. In synthesis, the top-level vector can be obtained by applying composition functions recursively. For example, the vector of big red car would be obtained as: $f_{comp_{AN}}(\vec{big}, f_{comp_{AN}}(\vec{red}, \vec{car}))$, where $f_{comp_{AN}}$ is the composition function for adjective-noun phrase combinations. Conversely, for generation, we decompose the phrase vector with $f_{decomp_{AN}}$. The first vector is used for retrieving the nearest adjective from the lexicon, while the second vector is further decomposed.

In the experiments in this paper we assume that the syntactic structure is given. In Section 7, we discuss ways to eliminate this assumption.
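As a rough illustration of the framework in Section 3, the following NumPy sketch estimates the composition matrix of eq. (1) and the decomposition matrices of eqs. (2) and (3) by least squares, then performs one-step generation by cosine nearest-neighbour search. It is not the authors' implementation: the function names (fit_linear_map, nearest, generate), the 4-dimensional toy lexicon and the use of plain sums as stand-ins for corpus-observed phrase vectors are all assumptions made for the example.

```python
import numpy as np

def fit_linear_map(X, Y):
    """Return W minimising ||Y - W X|| (Frobenius norm); training examples are columns."""
    # np.linalg.lstsq solves A B ~ C for B, so we pass the transposed problem X^T W^T ~ Y^T.
    W_T, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return W_T.T

def nearest(vec, L_v, L_w, t=1):
    """Search function s: the t words of L_w whose rows in L_v have the highest cosine with vec."""
    sims = (L_v @ vec) / (np.linalg.norm(L_v, axis=1) * np.linalg.norm(vec))
    top = np.argsort(-sims)[:t]
    return [(L_w[i], float(sims[i])) for i in top]

def generate(p, W_dec, L_v, L_w):
    """One-step generation: decompose p into [u; v], then look up each constituent."""
    d = p.shape[0]
    uv = W_dec @ p
    return nearest(uv[:d], L_v, L_w)[0][0], nearest(uv[d:], L_v, L_w)[0][0]

# Toy lexicon (hypothetical 4-d unit vectors): adjectives use dims 0-1, nouns dims 2-3.
L_w = ["red", "big", "evil", "car", "cat", "house"]
L_v = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0.6, 0.8, 0, 0],
                [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0.6, 0.8]], dtype=float)
vec = dict(zip(L_w, L_v))

train = [("red", "car"), ("big", "cat"), ("red", "cat"), ("big", "house"), ("evil", "car")]
U = np.column_stack([vec[a] for a, _ in train])
V = np.column_stack([vec[n] for _, n in train])
P = U + V                # assumption: sums stand in for corpus-observed phrase vectors
UV = np.vstack([U, V])   # the [U; V] matrix, shape 2d x N

W_comp = fit_linear_map(UV, P)                 # eq. (1), shape d x 2d
W_dec_obs = fit_linear_map(P, UV)              # eq. (2), shape 2d x d
W_dec_comp = fit_linear_map(W_comp @ UV, UV)   # eq. (3), shape 2d x d

# Compose a phrase that was not in the training tuples, then generate from its vector.
p = W_comp @ np.concatenate([vec["evil"], vec["cat"]])
print(generate(p, W_dec_obs, L_v, L_w))    # ('evil', 'cat')
print(generate(p, W_dec_comp, L_v, L_w))   # ('evil', 'cat')
```

In this toy setting the simulated observed vectors coincide with the composed ones, so eqs. (2) and (3) yield the same decomposition matrix; with real corpus vectors they would generally differ. Recursive generation would feed the second vector returned by the decomposition back into a further decomposition step, as described in Section 3.3.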
4 Evaluation setting

In our empirical part, we focus on noun phrase generation. A noun phrase can be a single noun or a noun with one or more modifiers, where a modifier can be an adjective or a prepositional phrase. A prepositional phrase is in turn composed of a preposition and a noun phrase. We learn two composition (and corresponding decomposition) functions: one for modifier-noun phrases, trained on adjective-noun (AN) pairs, and a second one for prepositional phrases, trained on preposition-noun (PN) combinations. For the rest of this section we describe the construction of the vector spaces and the (de)composition function learning procedure.

Construction of vector spaces We test two types of vector representations. The cbow model introduced in Mikolov et al. (2013a) learns vector representations using a neural network architecture by trying to predict a target word given the words surrounding it. We use the word2vec software[3] to build vectors of size 300 and using a context window of 5 words to either side of the target. We set the sub-sampling option to 1e-05 and estimate the probability of a target word with the negative sampling method, drawing 10 samples from the noise distribution (see Mikolov et al. (2013a) for details). We also implement a standard count-based bag-of-words distributional space (Turney and Pantel, 2010) which counts occurrences of a target word with other words within a symmetric window of size 5. We build a 300Kx300K symmetric co-occurrence matrix using the top most frequent words in our source corpus, apply positive PMI weighting and Singular Value Decomposition to reduce the space to 300 dimensions. For both spaces, the vectors are finally normalized to unit length.[4]

For both types of vectors we use 2.8 billion tokens as input (ukWaC + Wikipedia + BNC). The Italian language vectors for the cross-lingual experiments of Section 6 were trained on 1.6 billion tokens from itWaC.[5] A word token is a word-form + POS-tag string. We extract both word vectors and the observed phrase vectors which are required for the training procedures. We sanity-check the two spaces on MEN (Bruni et al., 2012), a 3,000 items word similarity data set. cbow significantly outperforms count (0.80 vs. 0.72 Spearman correlations with human judgments). count performance is consistent with previously reported results.[6]

[3] Available at https://code.google.com/p/word2vec/
[4] The parameters of both models have been chosen without specific tuning, based on their observed stable performance in previous independent experiments.
[5] Corpus sources: http://wacky.sslmit.unibo.it, http://www.natcorp.ox.ac.uk
[6] See Baroni et al. (2014) for an extensive comparison of the two types of vector representations.

(De)composition function training The training data sets consist of the 50K most frequent ⟨u, v, p⟩ tuples for each phrase type, for example, ⟨red, car, red.car⟩ or ⟨in, car, in.car⟩.[7] We concatenate $\vec{u}$ and $\vec{v}$ vectors to obtain the $[U; V]$ matrix and we use the observed $\vec{p}$ vectors (e.g., the corpus vector of the red.car bigram) to obtain the phrase matrix $P$. We use these data sets to solve the least squares regression problems in eqs. (1) and (2), obtaining estimates of the composition and decomposition matrices, respectively. For the decomposition function in eq. (3), we replace the observed phrase vectors with those composed with $f_{comp_R}(\vec{u}, \vec{v})$, where $f_{comp_R}$ is the previously estimated composition function for relation $R$.

[7] For PNs, we ignore determiners and we collapse, for example, in.the.car and in.car occurrences.

Composition function performance Since the experiments below also use composed vectors as input to the generation process, it is important to provide independent evidence that the composition model is of high quality. This is indeed the case: We tested our composition approach on the task of retrieving observed AN and PN vectors, based on their composed vectors (similarly to Baroni and Zamparelli (2010), we want to retrieve the observed red.car vector using $f_{comp_{AN}}$(red, car)). We obtain excellent results, with minimum accuracy of 0.23 (chance level <0.0001). We also test on the AN-N paraphrasing test set used in Dinu et al. (2013) (in turn adapting Turney (2012)). The dataset contains 620 ANs, each paired with a single-noun paraphrase (e.g., false belief/fallacy, personal appeal/charisma). The task is to rank all nouns in the lexicon by their similarity to the phrase, and return the rank of the correct paraphrase. Results are reported in the first row of Table 1. To facilitate comparison, we search, like Dinu et al., through a vocabulary containing the 20K most frequent nouns. The count vectors results are similar to those reported by Dinu and colleagues for the same model, and with cbow vectors we obtain a median rank that is considerably higher than that of the methods they test.

Input   Output   cbow     count
AN      N        11       171
N       A, N     67,29    204,168

Table 1: Median rank on the AN-N set of Dinu et al. (2013) (e.g., personal appeal/charisma). First row: the A and N are composed and the closest N is returned as a paraphrase. Second row: the N vector is decomposed into A and N vectors and their nearest (POS-tag consistent) neighbours are returned.

5 Noun phrase generation

5.1 One-step decomposition

We start with testing one-step decomposition by generating two-word phrases. A first straightforward evaluation consists in decomposing a phrase vector into the correct constituent words. For this purpose, we randomly select (and consequently remove) from the training sets 200 phrases of each type (AN and PN) and apply decomposition operations to 1) their corpus-observed vectors and 2) their composed representations. We generate two words by returning the nearest neighbours (with appropriate POS tags) of the two vectors produced by the decomposition functions. Table 2 reports generation accuracy, i.e., the proportion of times in which we retrieved the correct constituents. The search space consists of the top most frequent 20K nouns, 20K adjectives and 25 prepositions respectively, leading to chance accuracy <0.0001 for nouns and adjectives and <0.05 for prepositions. We obtain relatively high accuracy, with cbow vectors consistently outperforming count ones. Decomposing composed rather than observed phrase representations is easier, which is to be expected given that composed representations are obtained with a simpler, linear model. Most of the errors consist in generating synonyms (hard case → difficult case, true cost → actual cost) or related phrases (stereo speakers → omni-directional sound).

Input   Output   cbow         count
A.N     A, N     0.36,0.61    0.20,0.41
P.N     P, N     0.93,0.79    0.60,0.57
AN      A, N     1.00,1.00    0.86,0.99
PN      P, N     1.00,1.00    1.00,1.00

Table 2: Accuracy of generation models at retrieving (at rank 1) the constituent words of adjective-noun (AN) and preposition-noun (PN) phrases. Observed (A.N) and composed representations (AN) are decomposed with observed- (eq. 2) and composed-trained (eq. 3) functions respectively.

Next, we use the AN-N dataset of Dinu and colleagues for a more interesting evaluation of one-step decomposition. In particular, we reverse the original paraphrasing direction by attempting to generate, for example, personal charm from charisma. It is worth stressing the nature of the paraphrase-by-generation task we tackle here and in the next experiments. Compositional distributional semantic systems are often evaluated on phrase and sentence paraphrasing data sets (Blacoe and Lapata, 2012; Mitchell and Lapata, 2010; Socher et al., 2011; Turney, 2012). However, these experiments assume a pre-compiled list of candidate paraphrases, and the task is to rank correct paraphrases above foils (paraphrase ranking) or to decide, for a given pair, if the two phrases/sentences are mutual paraphrases (paraphrase detection). Here, instead, we do not assume a given set of candidates: For example, in N → AN paraphrasing, any of $20K^2$ possible combinations of adjectives and nouns from the lexicon could be generated. This is a much more challenging task and it paves the way to more realistic applications of distributional semantics in generation scenarios.

The median ranks of the gold A and N of the Dinu set are shown in the second row of Table 1. As the top-generated noun is almost always, uninterestingly, the input one, we return the next noun. Here we report results for the more motivated corpus-observed training of eq. (2) (unsurprisingly, using composed-phrase training for the task of decomposing single nouns leads to lower performance).

Although considerably more difficult than the previous task, the results are still very good, with median ranks under 100 for the cbow vectors (random median rank at 10K). Also, the dataset provides only one AN paraphrase for each noun, out of many acceptable ones. Examples of generated phrases are given in Table 3. In addition to generating topically related ANs, we also see nouns disambiguated in different ways than intended in
the gold standard (for example vitriol and folk in Table 3). Other interesting errors consist of decomposing a noun into two words which both have the same meaning as the noun, generating for example religion → religious religions. We observe moreover that sometimes the decomposition reflects selectional preference effects, by generating adjectives that denote typical properties of the noun to be paraphrased (e.g., animosity is a (political, personal,...) hostility or a fridge is a (big, large, small,...) refrigerator). This effect could be exploited for tasks such as property-based concept description (Kelly et al., 2012).

Input          Output                    Gold
reasoning      deductive thinking        abstract thought
jurisdiction   legal authority           legal power
thunderstorm   thundery storm            electrical storm
folk           local music               common people
superstition   old-fashioned religion    superstitious notion
vitriol        political bitterness      sulfuric acid
zoom           fantastic camera          rapid growth
religion       religious religion        religious belief

Table 3: Examples of generating ANs from Ns using the data set of Dinu et al. (2013).

5.2 Recursive decomposition

We continue by testing generation through recursive decomposition on the task of generating noun-preposition-noun (NPN) paraphrases of adjective-nouns (AN) phrases. We introduce a dataset containing 192 AN-NPN pairs (such as pre-election promises → promises before election), which was created by the second author and additionally corrected by an English native speaker. The data set was created by analyzing a list of randomly selected frequent ANs. 49 further ANs (with adjectives such as amazing and great) were judged not NPN-paraphrasable and were used for the experiment reported in Section 7. The paraphrased subset focuses on preposition diversity and on including prepositions which are rich in semantic content and relevant to paraphrasing the AN. This has led to excluding of, which in most cases has the purely syntactic function of connecting the two nouns. The data set contains the following 14 prepositions: after, against, at, before, between, by, for, from, in, on, per, under, with, without.[8]

[8] This dataset is available at http://clic.cimec.unitn.it/composes

NPN phrase generation involves the application of two decomposition functions. In the first step we decompose using the modifier-noun rule ($f_{decomp_{AN}}$). We generate a noun from the head slot vector and the adjective vector is further decomposed using $f_{decomp_{PN}}$ (returning the top noun which is not identical to the previously generated one). The results, in terms of top 1 accuracy and median rank, are shown in Table 4. Examples are given in Table 5.

Input   Output    Training   cbow                                 count
AN      N, P, N   observed   0.98(1), 0.08(5.5), 0.13(20.5)       0.82(1), 0.17(4.5), 0.05(71.5)
AN      N, P, N   composed   0.99(1), 0.02(12), 0.12(24)          0.99(1), 0.06(10), 0.05(150.5)

Table 4: Top 1 accuracy (median rank) on the AN → NPN paraphrasing data set. AN phrases are composed and then recursively decomposed into N, (P, N). Comma-delimited scores reported for first noun, preposition, second noun in this order. Training is performed on observed (eq. 2) and composed (eq. 3) phrase representations.

Input                    Output                           Gold
mountainous region       region in highlands              region with mountains
undersea cable           cable through ocean              cable under sea
underground cavern       cavern through rock              cavern under ground
interdisciplinary field  field into research              field between disciplines
inter-war years          years during 1930s               years between wars
post-operative pain      pain through patient             pain after operation
pre-war days             days after wartime               days before war
intergroup differences   differences between intergroup   differences between minorities
superficial level        level between levels             level on surface

Table 5: Examples of generating NPN phrases from composed ANs.

For observed phrase vector training, accuracy and rank are well above chance for all constituents (random accuracy 0.00005 for nouns and 0.04 for prepositions, corresponding median ranks: 10K, 12). Preposition generation is clearly a more difficult task. This is due at least in part to their highly ambiguous and broad semantics, and the way in which they interact with the nouns. For example, cable through ocean in Table 5 is a reasonable paraphrase of undersea cable despite the gold preposition being under. Other than several cases which are acceptable paraphrases but not in the gold standard, phrases related in meaning but not synonymous are the most common error (overcast skies → skies in sunshine). We also observe that often the A and N meanings are not fully separated when decomposing and traces of the adjective or of the original noun meaning can be found in both generated nouns (for example nearby school → schools after school). To a lesser degree, this might be desirable as a disambiguation-in-context effect as, for example, in underground cavern, in secret would not be a context-appropriate paraphrase of underground.

6 Noun phrase translation

This section describes preliminary experiments performed in a cross-lingual setting on the task of composing English AN phrases and generating Italian translations.

Creation of cross-lingual vector spaces A common semantic space is required in order to map words and phrases across languages. This problem has been extensively addressed in the bilingual lexicon acquisition literature (Haghighi et al., 2008; Koehn and Knight, 2002). We opt for a very simple yet accurate method (Klementiev et al., 2012; Rapp, 1999) in which a bilingual dictionary is used to identify a set of shared dimensions across spaces and the vectors of both languages are projected into the subspace defined by these (Subspace Projection - SP). This method is applicable to count-type vector spaces, for which the dimensions correspond to actual words. As the cbow dimensions do not correspond to words, we align the cbow spaces by using a small dictionary to learn a linear map which transforms the English vectors into Italian ones as done in Mikolov et al. (2013b). This method (Translation Matrix - TM) is applicable to both cbow and count spaces. We tune the parameters (TM or SP for count and dictionary size 5K or 25K for both spaces) on a standard task of translating English words into Italian. We obtain TM-5K for cbow and SP-25K for count as optimal settings. The two methods perform similarly for low frequency words while cbow-TM-5K significantly outperforms count-SP-25K for high frequency words. Our results for the cbow-TM-5K setting are similar to those reported by Mikolov et al. (2013b).

Cross-lingual decomposition training Training proceeds as in the monolingual case, this time concatenating the training data sets and estimating a single (de)-composition function for the two languages in the shared semantic space. We train both on observed phrase representations (eq. 2) and on composed phrase representations (eq. 3).

Adjective-noun translation dataset We randomly extract 1,000 AN-AN En-It phrase pairs from a phrase table built from parallel movie subtitles, available at http://opus.lingfil.uu.se/ (OpenSubtitles2012, en-it) (Tiedemann, 2012).

Input     Output     cbow         count
AN (En)   A,N (It)   0.31,0.59    0.24,0.54
AN (It)   A,N (En)   0.50,0.62    0.28,0.48

Table 6: Accuracy of En → It and It → En phrase translation: phrases are composed in source language and decomposed in target language. Training on composed phrase representations (eq. (3)) (with observed phrase training (eq. 2) results are 50% lower).

Results are presented in Table 6. While in these preliminary experiments we lack a proper term of comparison, the performance is very good both quantitatively (random < 0.0001) and qualitatively. The En → It examples in Table 7 are representative. In many cases (e.g., vicious killer, rough neighborhood) we generate translations that are arguably more natural than those in the gold standard. Again, some differences can be explained by different disambiguations (chest as breast, as in the generated translation, or box, as in the gold). Translation into related but not equivalent phrases and generating the same meaning in both constituents (stellar star) are again the most significant errors. We also see cases in which this has the desired effect of disambiguating the constituents, such as in the examples in Table 8, showing the nearest neighbours when translating black tie and indissoluble tie.
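The Subspace Projection (SP) alignment described in Section 6 can be sketched in a few lines of plain Python. The example is illustrative only: the sparse count vectors, the three-entry bilingual dictionary and the helper names (project, cosine, translate) are hypothetical, and the sketch leaves out PMI weighting and dimensionality reduction.

```python
import math

def project(vec, dim_map):
    """Keep only dimensions covered by the dictionary, renamed to shared ids."""
    shared = {}
    for dim, weight in vec.items():
        if dim in dim_map:
            key = dim_map[dim]
            shared[key] = shared.get(key, 0.0) + weight
    return shared

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def translate(src_vec, src_map, tgt_space, tgt_map, t=1):
    """Return the t target words closest to src_vec in the shared subspace."""
    query = project(src_vec, src_map)
    scored = [(cosine(query, project(v, tgt_map)), w) for w, v in tgt_space.items()]
    scored.sort(reverse=True)
    return [(w, round(s, 3)) for s, w in scored[:t]]

# Hypothetical bilingual dictionary: each entry defines one shared dimension.
dictionary = [("road", "strada"), ("engine", "motore"), ("fur", "pelo")]
en_map = {en: i for i, (en, _) in enumerate(dictionary)}
it_map = {it: i for i, (_, it) in enumerate(dictionary)}

# Toy count vectors: context word -> co-occurrence count (made-up numbers).
en_space = {"car": {"road": 5, "engine": 4, "fast": 3},
            "cat": {"fur": 6, "road": 1, "cute": 2}}
it_space = {"macchina": {"strada": 4, "motore": 5, "veloce": 2},
            "gatto": {"pelo": 5, "strada": 1}}

print(translate(en_space["car"], en_map, it_space, it_map))  # [('macchina', 0.976)]
print(translate(en_space["cat"], en_map, it_space, it_map))  # [('gatto', 0.999)]
```

Context words without a dictionary entry (fast, cute, veloce) are simply dropped by the projection, which is why this approach presupposes count-type spaces whose dimensions are actual words.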
Input Output Gold
vicious killer assassino feroce (ferocious killer) killer pericoloso
spectacular woman donna affascinante (fascinating woman) donna eccezionale
huge chest petto grande (big chest) scrigno immenso
rough neighborhood zona malfamata (ill-repute zone) quartiere difficile
mortal sin peccato eterno (eternal sin) pecato mortale
canine star stella stellare (stellar star) star canina

Table 7: EnIt translation examples (back-translations of generated phrases in parenthesis).

black tie EnIt ItEn


cravatta (tie) nero (black)
velluto (velvet) bianco (white) Thr. Accuracy Cov. Accuracy Cov.
giacca (jacket) giallo (yellow) 0.00 0.21 100% 0.32 100%
indissoluble tie 0.55 0.25 70% 0.40 63%
alleanza (alliance) indissolubile (indissoluble) 0.60 0.31 32% 0.45 37%
legame (bond) sacramentale (sacramental)
amicizia (friendship) inscindibile (inseparable)
0.65 0.45 9% 0.52 16%

Table 8: Top 3 translations of black tie and indis- Table 9: AN-AN translation accuracy (both A and
soluble tie, showing correct disambiguation of tie. N correct) when imposing a confidence threshold
(random: 1/20K 2 ).

7 Generation confidence and generation


quality
In Section 3.2 we have defined a search function s returning a list of lexical nearest neighbours for a constituent vector produced by decomposition. Together with the neighbours, this function can naturally return their similarity score (in our case, the cosine). We call the score associated to the top neighbour the generation confidence: if this score is low, the vector has no good match in the lexicon. We observe significant Spearman correlations between the generation confidence of a constituent and its quality (e.g., accuracy, inverse rank) in all the experiments. For example, for the AN(En)→AN(It) experiment, the correlations between the confidence scores and the inverse ranks for As and Ns, for both cbow and count vectors, range between 0.34 (p < 1e-28) and 0.42. In the translation experiments, we can use this to automatically determine a subset on which we can translate with very high accuracy. Table 9 shows AN-AN accuracies and coverage when translating only if confidence is above a certain threshold.

Throughout this paper we have assumed that the syntactic structure of the phrase to be generated is given. In future work we will exploit the correlation between confidence and quality for the purpose of eliminating this assumption. As a concrete example, we can use confidence scores to distinguish the two subsets of the AN-NPN dataset introduced in Section 5: the ANs which are paraphrasable with an NPN from those that do not have this property. We assign an AN to the NPN-paraphrasable class if the mean confidence of the PN expansion in its attempted N(PN) decomposition is above a certain threshold. We plot the ROC curve in Figure 1. We obtain a significant AUC of 0.71.

Figure 1: ROC of distinguishing ANs paraphrasable as NPNs from non-paraphrasable ones.

8 Conclusion

In this paper we have outlined a framework for the task of generation with distributional semantic models. We proposed a simple but effective approach to reverting the composition process to obtain meaningful reformulations of phrases through a synthesis-generation process.

For future work we would like to experiment with more complex models for (de-)composition in order to improve the performance on the tasks we used in this paper. Following this, we

would like to extend the framework to handle arbitrary phrases, including making (confidence-based) choices on the syntactic structure of the phrase to be generated, which we have assumed to be given throughout this paper.

In terms of applications, we believe that the line of research in machine translation that is currently focusing on replacing parallel resources with large amounts of monolingual text provides an interesting setup to test our methods. For example, Klementiev et al. (2012) reconstruct phrase tables based on phrase similarity scores in semantic space. However, they resort to scoring phrase pairs extracted from an aligned parallel corpus, as they do not have a method to freely generate these. Similarly, in the recent work on common vector spaces for the representation of images and text, the current emphasis is on retrieving existing captions (Socher et al., 2014) and not actual generation of image descriptions.

From a more theoretical point of view, our work fills an important gap in distributional semantics, making it a bidirectional theory of the connection between language and meaning. We can now translate linguistic strings into vector thoughts, and the latter into their most appropriate linguistic expression. Several neuroscientific studies suggest that thoughts are represented in the brain by patterns of activation over broad neural areas, and vectors are a natural way to encode such patterns (Haxby et al., 2001; Huth et al., 2012). Some research has already established a connection between neural and distributional semantic vector spaces (Mitchell et al., 2008; Murphy et al., 2012). Generation might be the missing link to powerful computational models that take the neural footprint of a thought as input and produce its linguistic expression.

Acknowledgments

We thank Kevin Knight, Andrew Anderson, Roberto Zamparelli, Angeliki Lazaridou, Nghia The Pham, German Kruszewski and Peter Turney for helpful discussions and the anonymous reviewers for their useful comments. We acknowledge the ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).

References

Alexandr Andoni and Piotr Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, January.

Jacob Andreas and Zoubin Ghahramani. 2013. A generative model of vector space semantics. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 91–99, Sofia, Bulgaria.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, pages 1183–1193, Boston, MA.

Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, To appear, Baltimore, MD.

Marco Baroni. 2013. Composition in distributional semantics. Language and Linguistics Compass, 7(10):511–522.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP, pages 546–556, Jeju Island, Korea.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in Technicolor. In Proceedings of ACL, pages 136–145, Jeju Island, Korea.

Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. General estimation and evaluation of compositional distributional semantic models. In Proceedings of ACL Workshop on Continuous Vector Space Models and their Compositionality, pages 50–58, Sofia, Bulgaria.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.

Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Proceedings of NIPS, pages 2121–2129, Lake Tahoe, Nevada.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of GEMS, pages 33–37, Uppsala, Sweden.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL, pages 771–779, Columbus, OH, USA, June.

James Haxby, Ida Gobbini, Maura Furey, Alumit Ishai, Jennifer Schouten, and Pietro Pietrini. 2001. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293:2425–2430.

Alexander Huth, Shinji Nishimoto, An Vu, and Jack Gallant. 2012. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6):1210–1224.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, October. Association for Computational Linguistics.

Colin Kelly, Barry Devereux, and Anna Korhonen. 2012. Semi-supervised learning for automatic conceptual property extraction. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics, pages 11–20, Montreal, Canada.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of EACL, pages 130–140, Avignon, France.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of ACL Workshop on Unsupervised Lexical Acquisition, pages 9–16, Philadelphia, PA, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. http://arxiv.org/abs/1301.3781/.

Tomas Mikolov, Quoc Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for Machine Translation. http://arxiv.org/abs/1309.4168.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL, pages 236–244, Columbus, OH.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Tom Mitchell, Svetlana Shinkareva, Andrew Carlson, Kai-Min Chang, Vincente Malave, Robert Mason, and Marcel Just. 2008. Predicting human brain activity associated with the meanings of nouns. Science, 320:1191–1195.

Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Selecting corpus-semantic models for neurolinguistic decoding. In Proceedings of *SEM, pages 114–123, Montreal, Canada.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, ACL '99, pages 519–526. Association for Computational Linguistics.

Richard Socher, Eric Huang, Jeffrey Pennington, Andrew Ng, and Christopher Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of NIPS, pages 801–809, Granada, Spain.

Richard Socher, Milind Ganjoo, Christopher Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Proceedings of NIPS, pages 935–943, Lake Tahoe, Nevada.

Richard Socher, Quoc Le, Christopher Manning, and Andrew Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics. In press.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.

Peter Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Peter Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585.

Fabio Zanzotto, Ioannis Korkontzelos, Francesca Falucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distributional semantics. In Proceedings of COLING, pages 1263–1271, Beijing, China.
