
From Word Embeddings To Document Distances

Matt J. Kusner MKUSNER@WUSTL.EDU
Yu Sun YUSUN@WUSTL.EDU
Nicholas I. Kolkin N.KOLKIN@WUSTL.EDU
Kilian Q. Weinberger KILIAN@WUSTL.EDU
Washington University in St. Louis, 1 Brookings Dr., St. Louis, MO 63130

Abstract
We present the Word Mover's Distance (WMD), a novel distance function between text documents. Our work is based on recent results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. The WMD distance measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document. We show that this distance metric can be cast as an instance of the Earth Mover's Distance, a well studied transportation problem for which several highly efficient solvers have been developed. Our metric has no hyperparameters and is straight-forward to implement. Further, we demonstrate on eight real world document classification data sets, in comparison with seven state-of-the-art baselines, that the WMD metric leads to unprecedented low k-nearest neighbor document classification error rates.

Figure 1. An illustration of the word mover's distance. All non-stop words (bold) of both documents are embedded into a word2vec space. The distance between the two documents is the minimum cumulative distance that all words in document 1 need to travel to exactly match document 2. (Best viewed in color.)

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Introduction

Accurately representing the distance between two documents has far-reaching applications in document retrieval (Salton & Buckley, 1988), news categorization and clustering (Ontrup & Ritter, 2001; Greene & Cunningham, 2006), song identification (Brochu & Freitas, 2002), and multi-lingual document matching (Quadrianto et al., 2009).

The two most common ways documents are represented is via a bag of words (BOW) or by their term frequency-inverse document frequency (TF-IDF). However, these features are often not suitable for document distances due to their frequent near-orthogonality (Schölkopf et al., 2002; Greene & Cunningham, 2006). Another significant drawback of these representations is that they do not capture the distance between individual words. Take for example the two sentences in different documents: "Obama speaks to the media in Illinois" and "The President greets the press in Chicago". While these sentences have no words in common, they convey nearly the same information, a fact that cannot be represented by the BOW model. In this case, the closeness of the word pairs (Obama, President), (speaks, greets), (media, press), and (Illinois, Chicago) is not factored into the BOW-based distance.

There have been numerous methods that attempt to circumvent this problem by learning a latent low-dimensional representation of documents. Latent Semantic Indexing (LSI) (Deerwester et al., 1990) eigendecomposes the BOW feature space, and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) probabilistically groups similar words into topics and represents documents as distributions over these topics. At the same time, there are many competing variants of BOW/TF-IDF (Salton & Buckley, 1988; Robertson & Walker, 1994). While these approaches produce a more coherent document representation than BOW, they often do not improve the empirical performance of BOW on distance-based tasks (e.g., nearest-neighbor classifiers) (Petterson et al., 2010; Mikolov et al., 2013c).
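The orthogonality problem described above is easy to reproduce. The following minimal sketch is ours, not part of the paper; it assumes scikit-learn and uses its built-in English stop-word list in place of the SMART list used later in the paper. Under BOW, the two example sentences have a cosine similarity of exactly zero.

```python
# Minimal illustration (not the paper's pipeline): two sentences with no
# shared content words get a cosine similarity of zero under BOW.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Obama speaks to the media in Illinois",
        "The President greets the press in Chicago"]
# sklearn's 'english' stop-word list stands in for the SMART list used in the paper.
bow = CountVectorizer(stop_words="english").fit_transform(docs)
print(cosine_similarity(bow[0], bow[1]))  # [[0.]] -- BOW sees no overlap at all
```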
In this paper we introduce a new metric for the distance be- (2007) which first decomposes each document into a set of
tween text documents. Our approach leverages recent re- subtopic units via TextTiling (Hearst, 1994), and then mea-
sults by Mikolov et al. (2013b) whose celebrated word2vec sures the effort required to transform a subtopic set into
model generates word embeddings of unprecedented qual- another via the EMD (Monge, 1781; Rubner et al., 1998).
ity and scales naturally to very large data sets (e.g., we use
New approaches for learning document representations
a freely-available model trained on approximately 100 bil-
include Stacked Denoising Autoencoders (SDA) (Glorot
lion words). The authors demonstrate that semantic rela-
et al., 2011), and the faster mSDA (Chen et al., 2012),
tionships are often preserved in vector operations on word
which learn word correlations via dropout noise in stacked
vectors, e.g., vec(Berlin) - vec(Germany) + vec(France)
neural networks. Recently, the Componential Counting
is close to vec(Paris). This suggests that distances and
Grid (Perina et al., 2013) merges LDA (Blei et al., 2003)
between embedded word vectors are to some degree se-
and Counting Grid (Jojic & Perina, 2011) models, allow-
mantically meaningful. Our metric, which we call the
ing ‘topics’ to be mixtures of word distributions. As well,
Word Mover’s Distance (WMD), utilizes this property of
Le & Mikolov (2014) learn a dense representation for doc-
word2vec embeddings. We represent text documents as a
uments using a simplified neural language model, inspired
weighted point cloud of embedded words. The distance be-
by the word2vec model (Mikolov et al., 2013a).
tween two text documents A and B is the minimum cumu-
lative distance that words from document A need to travel The use of the EMD has been pioneered in the computer vi-
to match exactly the point cloud of document B. Figure 1 sion literature (Rubner et al., 1998; Ren et al., 2011). Sev-
shows a schematic illustration of our new metric. eral publications investigate approximations of the EMD
for image retrieval applications (Grauman & Darrell, 2004;
The optimization problem underlying WMD reduces to
Shirdhonkar & Jacobs, 2008; Levina & Bickel, 2001). As
a special case of the well-studied Earth Mover’s Dis-
word embeddings improve in quality, document retrieval
tance (Rubner et al., 1998) transportation problem and
enters an analogous setup, where each word is associated
we can leverage existing literature on fast specialized
with a highly informative feature vector. To our knowledge,
solvers (Pele & Werman, 2009). We also compare several
our work is the first to make the connection between high
lower bounds and show that these can be used as approxi-
quality word embeddings and EMD retrieval algorithms.
mations or to prune away documents that are provably not
amongst the k-nearest neighbors of a query. Cuturi (2013) introduces an entropy penalty to the EMD
objective, which allows the resulting approximation to be
The WMD distance has several intriguing properties: 1.
solved with very efficient iterative matrix updates. Further,
it is hyper-parameter free and straight-forward to under-
the vectorization enables parallel computation via GPGPUs
stand and use; 2. it is highly interpretable as the dis-
However, their approach assumes that the number of di-
tance between two documents can be broken down and
mensions per document is not too high, which in our set-
explained as the sparse distances between few individual
ting is extremely large (all possible words). This removes
words; 3. it naturally incorporates the knowledge encoded
the main benefit (parallelization on GPGPUs) of their ap-
in the word2vec space and leads to high retrieval accu-
proach and so we develop a new EMD approximation that
racy—it outperforms all 7 state-of-the-art alternative doc-
appears to be very effective for our problem domain.
ument distances in 6 of 8 real world classification tasks.

2. Related Work 3. Word2Vec Embedding


Recently Mikolov et al. (2013a;b) introduced word2vec, a
Constructing a distance between documents is closely tied
novel word-embedding procedure. Their model learns a
with learning new document representations. One of the
vector representation for each word using a (shallow) neu-
first works to systematically study different combinations
ral network language model. Specifically, they propose a
of term frequency-based weightings, normalization terms,
neural network architecture (the skip-gram model) that con-
and corpus-based statistics is Salton & Buckley (1988).
sists of an input layer, a projection layer, and an output
Another variation is the Okapi BM25 function (Robertson
layer to predict nearby words. Each word vector is trained
& Walker, 1994) which describes a score for each (word,
to maximize the log probability of neighboring words in a
document) pair and is designed for ranking applications.
corpus, i.e., given a sequence of words w1 , . . . , wT ,
Aslam & Frost (2003) derive an information-theoretic sim-
ilarity score between two documents, based on probability T
1X X
of word occurrence in a document corpus. Croft & Lafferty log p(wj |wt )
T t=1
(2003) use a language model to describe the probability of j∈nb(t)
generating a word from a document, similar to LDA (Blei where nb(t) is the set of neighboring words of word wt and
et al., 2003). Most similar to our method is that of Wan p(wj |wt ) is the hierarchical softmax of the associated word
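As a concrete illustration of these vector regularities, the sketch below (ours, not from the paper) queries a pre-trained word2vec model with gensim; the file name of the Google News binary is an assumption, and gensim itself is not used in the paper.

```python
# A small sketch, assuming gensim and the publicly released Google News
# vectors; the file name below is illustrative.
from gensim.models import KeyedVectors

# ~3 million 300-dimensional vectors trained on ~100 billion words.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# vec(Berlin) - vec(Germany) + vec(France) should land near vec(Paris).
print(w2v.most_similar(positive=["Berlin", "France"],
                       negative=["Germany"], topn=3))
print(w2v.similarity("Illinois", "Chicago"))  # semantically related words are close
```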
From Word Embeddings To Document Distances

4. Word Mover's Distance

Assume we are provided with a word2vec embedding matrix X ∈ R^{d×n} for a finite size vocabulary of n words. The i-th column, x_i ∈ R^d, represents the embedding of the i-th word in d-dimensional space. We assume text documents are represented as normalized bag-of-words (nBOW) vectors, d ∈ R^n. To be precise, if word i appears c_i times in the document, we denote d_i = c_i / ∑_{j=1}^{n} c_j. An nBOW vector d is naturally very sparse as most words will not appear in any given document. (We remove stop words, which are generally category independent.)

nBOW representation. We can think of the vector d as a point on the (n−1)-dimensional simplex of word distributions. Two documents with different unique words will lie in different regions of this simplex. However, these documents may still be semantically close. Recall the earlier example of two similar, but word-different sentences in one document: "Obama speaks to the media in Illinois" and in another: "The President greets the press in Chicago". After stop-word removal, the two corresponding nBOW vectors d and d' have no common non-zero dimensions and therefore have close to maximum simplex distance, although their true distance is small.

Word travel cost. Our goal is to incorporate the semantic similarity between individual word pairs (e.g. President and Obama) into the document distance metric. One such measure of word dissimilarity is naturally provided by their Euclidean distance in the word2vec embedding space. More precisely, the distance between word i and word j becomes c(i, j) = ||x_i − x_j||_2. To avoid confusion between word and document distances, we will refer to c(i, j) as the cost associated with "traveling" from one word to another.
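The following numpy sketch (ours; the helper names nbow and travel_cost are not from the paper) makes the two ingredients above concrete: the nBOW vector d and the travel-cost matrix c(i, j) computed from an embedding matrix X of shape (d, n). In practice one would restrict X to the words appearing in the documents under comparison rather than the full vocabulary.

```python
# Sketch of the nBOW vector and word-to-word travel costs, assuming an
# embedding matrix X of shape (d, n) and per-document word counts are given.
import numpy as np

def nbow(counts):
    """Normalized bag-of-words: d_i = c_i / sum_j c_j (Section 4)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def travel_cost(X):
    """c(i, j) = ||x_i - x_j||_2 for every pair of vocabulary words."""
    # X has one column per word; broadcast pairwise differences.
    diff = X[:, :, None] - X[:, None, :]          # shape (d, n, n)
    return np.sqrt((diff ** 2).sum(axis=0))       # shape (n, n)

# Toy usage with a random 5-word vocabulary in a 3-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
d = nbow([2, 0, 1, 0, 1])   # word 0 appears twice, words 2 and 4 once
C = travel_cost(X)
print(d, C.shape)
```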
Document distance. The "travel cost" between two words is a natural building block to create a distance between two documents. Let d and d' be the nBOW representation of two text documents in the (n − 1)-simplex. First, we allow each word i in d to be transformed into any word in d' in total or in parts. Let T ∈ R^{n×n} be a (sparse) flow matrix where T_ij ≥ 0 denotes how much of word i in d travels to word j in d'. To transform d entirely into d' we ensure that the entire outgoing flow from word i equals d_i, i.e. ∑_j T_ij = d_i. Further, the amount of incoming flow to word j must match d'_j, i.e. ∑_i T_ij = d'_j. Finally, we can define the distance between the two documents as the minimum (weighted) cumulative cost required to move all words from d to d', i.e. ∑_{i,j} T_ij c(i, j).

Transportation problem. Formally, the minimum cumulative cost of moving d to d' given the constraints is provided by the solution to the following linear program,

$$
\begin{aligned}
\min_{\mathbf{T} \ge 0} \quad & \sum_{i,j=1}^{n} T_{ij}\, c(i,j) \\
\text{subject to:} \quad & \sum_{j=1}^{n} T_{ij} = d_i \quad \forall i \in \{1,\dots,n\} \qquad (1) \\
& \sum_{i=1}^{n} T_{ij} = d'_j \quad \forall j \in \{1,\dots,n\}.
\end{aligned}
$$

The above optimization is a special case of the earth mover's distance metric (EMD) (Monge, 1781; Rubner et al., 1998; Nemhauser & Wolsey, 1988), a well studied transportation problem for which specialized solvers have been developed (Ling & Okada, 2007; Pele & Werman, 2009). To highlight this connection we refer to the resulting metric as the word mover's distance (WMD). As the cost c(i, j) is a metric, it can readily be shown that the WMD is also a metric (Rubner et al., 1998).
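To make the constraints of Eq. (1) concrete, here is a sketch (ours) that feeds the WMD linear program to a generic LP solver, scipy.optimize.linprog. The paper itself relies on specialized EMD solvers (Pele & Werman, 2009), which are far more efficient; the helpers nbow() and travel_cost() are the ones sketched earlier.

```python
# Generic linear-programming form of Eq. (1). scipy.optimize.linprog is used
# here only for clarity, not as the paper's solver.
import numpy as np
from scipy.optimize import linprog

def wmd(d1, d2, C):
    """Word mover's distance between nBOW vectors d1, d2 with cost matrix C."""
    n = len(d1)
    # Row constraints: sum_j T_ij = d1_i; column constraints: sum_i T_ij = d2_j.
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # outgoing flow of word i
        A_eq[n + i, i::n] = 1.0            # incoming flow into word i
    b_eq = np.concatenate([d1, d2])
    res = linprog(C.reshape(-1), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Usage: wmd(d, d, C) returns 0, since a document is at distance 0 from itself.
```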
Figure 2. (Top:) The components of the WMD metric between a query D0 ("The President greets the press in Chicago.") and two sentences D1 ("Obama speaks to the media in Illinois.", distance 1.07 = 0.45 + 0.24 + 0.20 + 0.18) and D2 ("The band gave a concert in Japan.", distance 1.63 = 0.49 + 0.42 + 0.44 + 0.28) with equal BOW distance. The arrows represent flow between two words and are labeled with their distance contribution. (Bottom:) The flow between two sentences D3 ("Obama speaks in Illinois.", distance 1.30) and D0 with different numbers of words. This mismatch causes the WMD to move words to multiple similar words.

Visualization. Figure 2 illustrates the WMD metric on two sentences D1 and D2 which we would like to compare to the query sentence D0. First, stop-words are removed, leaving President, greets, press, Chicago in D0, each with d_i = 0.25. The arrows from each word i in sentences D1 and D2 to word j in D0 are labeled with their contribution to the distance T_ij c(i, j). We note that the WMD agrees with our intuition, and "moves" words to semantically similar words. Transforming Illinois into Chicago is much cheaper than transforming Japan into Chicago. This is because the word2vec embedding places the vector vec(Illinois) closer to vec(Chicago) than vec(Japan). Consequently, the distance from D0 to D1 (1.07) is significantly smaller than to D2 (1.63). Importantly however, both sentences D1 and D2 have the same bag-of-words/TF-IDF distance from D0, as neither shares any words in common with D0. An additional example D3 highlights the flow when the number of words does not match. D3 has term weights d_j = 0.33 and excess flow is sent to other similar words. This increases the distance, although the effect might be artificially magnified due to the short document lengths, as longer documents may contain several similar words.

4.1. Fast Distance Computation

The best average time complexity of solving the WMD optimization problem scales O(p^3 log p), where p denotes the number of unique words in the documents (Pele & Werman, 2009). For datasets with many unique words (i.e., high-dimensional) and/or a large number of documents, solving the WMD optimal transport problem can become prohibitive. We can however introduce several cheap lower bounds of the WMD transportation problem that allow us to prune away the majority of the documents without ever computing the exact WMD distance.

Word centroid distance. Following the work of Rubner et al. (1998) it is straight-forward to show (via the triangle inequality) that the 'centroid' distance ||Xd − Xd'||_2 must lower bound the WMD between documents d, d',

$$
\begin{aligned}
\sum_{i,j=1}^{n} T_{ij}\, c(i,j) &= \sum_{i,j=1}^{n} T_{ij} \lVert x_i - x'_j \rVert_2 = \sum_{i,j=1}^{n} \lVert T_{ij} (x_i - x'_j) \rVert_2 \\
&\ge \Big\lVert \sum_{i,j=1}^{n} T_{ij} (x_i - x'_j) \Big\rVert_2
 = \Big\lVert \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} T_{ij} \Big) x_i - \sum_{j=1}^{n} \Big( \sum_{i=1}^{n} T_{ij} \Big) x'_j \Big\rVert_2 \\
&= \Big\lVert \sum_{i=1}^{n} d_i x_i - \sum_{j=1}^{n} d'_j x'_j \Big\rVert_2 = \lVert \mathbf{X}\mathbf{d} - \mathbf{X}\mathbf{d}' \rVert_2 .
\end{aligned}
$$

We refer to this distance as the Word Centroid Distance (WCD) as each document is represented by its weighted average word vector. It is very fast to compute via a few matrix operations and scales O(dp). For nearest-neighbor applications we can use this centroid distance to inform our nearest neighbor search about promising candidates, which allows us to speed up the exact WMD search significantly. We can also use WCD to limit our k-nearest neighbor search to a small subset of most promising candidates, resulting in an even faster approximate solution.
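A sketch of the WCD under the same assumptions as the earlier snippets (embedding matrix X of shape (d, n), nBOW vectors d and d'); it is a single matrix-vector product per document.

```python
# Word centroid distance (WCD): each document collapses to its weighted mean
# word vector, giving the cheap O(dp) lower bound derived above.
import numpy as np

def wcd(X, d1, d2):
    """||X d - X d'||_2 where X is (d, n) and d1, d2 are nBOW vectors."""
    return float(np.linalg.norm(X @ d1 - X @ d2))
```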
Relaxed word moving distance. Although the WCD is fast to compute, it is not very tight (see Section 5). We can obtain much tighter bounds by relaxing the WMD optimization problem and removing one of the two constraints respectively (removing both constraints results in the trivial lower bound T = 0). If just the second constraint is removed, the optimization becomes

$$
\begin{aligned}
\min_{\mathbf{T} \ge 0} \quad & \sum_{i,j=1}^{n} T_{ij}\, c(i,j) \\
\text{subject to:} \quad & \sum_{j=1}^{n} T_{ij} = d_i \quad \forall i \in \{1,\dots,n\}.
\end{aligned}
$$

This relaxed problem must yield a lower bound to the WMD distance, which is evident from the fact that every WMD solution (satisfying both constraints) must remain a feasible solution if one constraint is removed.

The optimal solution is for each word in d to move all its probability mass to the most similar word in d'. Precisely, an optimal T* matrix is defined as

$$
T^{*}_{ij} = \begin{cases} d_i & \text{if } j = \operatorname{argmin}_{j} c(i,j) \\ 0 & \text{otherwise.} \end{cases} \qquad (2)
$$

The optimality of this solution is straight-forward to show. Let T be any feasible matrix for the relaxed problem; the contribution to the objective value for any word i, with closest word j* = argmin_j c(i, j), cannot be smaller:

$$
\sum_{j} T_{ij}\, c(i,j) \ge \sum_{j} T_{ij}\, c(i,j^{*}) = c(i,j^{*}) \sum_{j} T_{ij} = c(i,j^{*})\, d_i = \sum_{j} T^{*}_{ij}\, c(i,j).
$$

Therefore, T* must yield a minimum objective value. Computing this solution requires only the identification of j* = argmin_j c(i, j), which is a nearest neighbor search in the Euclidean word2vec space: for each word vector x_i in document D we need to find the most similar word vector x_j in document D'. The second setting, when the first constraint is removed, is almost identical except that the nearest neighbor search is reversed. Both lower bounds ultimately rely on pairwise distance computations between word vectors. These computations can be combined and reused to obtain both bounds jointly at little additional overhead. Let the two relaxed solutions be ℓ1(d, d') and ℓ2(d, d') respectively. We can obtain an even tighter bound by taking the maximum of the two, ℓr(d, d') = max(ℓ1(d, d'), ℓ2(d, d')), which we refer to as the Relaxed WMD (RWMD). This bound is significantly tighter than WCD. The nearest neighbor search has a time complexity of O(p^2), and it can be sped up further by leveraging out-of-the-box tools for fast (approximate or exact) nearest neighbor retrieval (Garcia et al., 2008; Yianilos, 1993; Andoni & Indyk, 2006).
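The relaxed bounds reduce to a nearest-neighbor search over the cost matrix. The sketch below (ours; it restricts the search to words that actually occur in the other document, which is how the bound is used in practice) computes ℓ1, ℓ2, and their maximum, the RWMD.

```python
# Relaxed WMD (RWMD): with only the outgoing-flow constraint kept, each word
# of one document sends all of its mass to its closest word in the other
# document (Eq. 2); the max of the two one-sided bounds is the RWMD.
import numpy as np

def relaxed_bound(d_from, d_to, C):
    """Lower bound with only the outgoing-flow constraint enforced."""
    i = np.flatnonzero(d_from)    # words present in the source document
    j = np.flatnonzero(d_to)      # candidate targets in the other document
    nearest_cost = C[np.ix_(i, j)].min(axis=1)
    return float(d_from[i] @ nearest_cost)

def rwmd(d1, d2, C):
    return max(relaxed_bound(d1, d2, C), relaxed_bound(d2, d1, C))
```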
Prefetch and prune. We can use the two lower bounds to drastically reduce the amount of WMD distance computations we need to make in order to find the k nearest neighbors of a query document. We first sort all documents in increasing order of their (extremely cheap) WCD distance to the query document and compute the exact WMD distance to the first k of these documents. Subsequently, we traverse the remaining documents. For each we first check if the RWMD lower bound exceeds the distance of the current k-th closest document; if so, we can prune it. If not, we compute the WMD distance and update the k nearest neighbors if necessary. As the RWMD approximation is very tight, it allows us to prune up to 95% of all documents on some data sets. If the exact k nearest neighbors are not required, an additional speedup can be obtained if this traversal is limited to m < n documents. We refer to this algorithm as prefetch and prune. If m = k, this is equivalent to returning the k nearest neighbors of the WCD distance. If m = n it is exact, as only provably non-neighbors are pruned.
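A compact sketch of the prefetch-and-prune search described above, reusing the wcd(), rwmd(), and wmd() helpers from the earlier snippets; function and variable names are ours, not the paper's. With m = n the result is exact; smaller m trades accuracy for speed.

```python
# Sketch of prefetch-and-prune k-NN search over documents given as nBOW
# vectors, an embedding matrix X, and a shared cost matrix C.
import heapq
import numpy as np

def knn_prefetch_prune(query, docs, X, C, k, m):
    """Return indices of the (approximate) k nearest documents to `query`."""
    order = np.argsort([wcd(X, query, d) for d in docs])       # cheap WCD sort
    # Exact WMD for the first k prefetched candidates (max-heap via negation).
    heap = [(-wmd(query, docs[i], C), i) for i in order[:k]]
    heapq.heapify(heap)
    for i in order[k:m]:
        kth_dist = -heap[0][0]                                  # current k-th distance
        if rwmd(query, docs[i], C) >= kth_dist:
            continue                                            # provably not a neighbor
        dist = wmd(query, docs[i], C)
        if dist < kth_dist:
            heapq.heapreplace(heap, (-dist, i))
    return [i for _, i in sorted(heap, reverse=True)]           # nearest first
```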
5. Results

We evaluate the word mover's distance in the context of kNN classification on eight benchmark document categorization tasks. We first describe each dataset and a set of classic and state-of-the-art document representations and distances. We then compare the nearest-neighbor performance of WMD and the competing methods on these datasets. Finally, we examine how the fast lower bound distances can speed up nearest neighbor computation by prefetching and pruning neighbors. Matlab code for the WMD metric is available at http://matthewkusner.com.

5.1. Dataset Description and Setup

We evaluate all approaches on 8 supervised document datasets: BBCSPORT: BBC sports articles between 2004-2005; TWITTER: a set of tweets labeled with sentiments 'positive', 'negative', or 'neutral' (Sanders, 2011) (the set is reduced due to the unavailability of some tweets); RECIPE: a set of recipe procedure descriptions labeled by their region of origin; OHSUMED: a collection of medical abstracts categorized by different cardiovascular disease groups (for computational efficiency we subsample the dataset, using the first 10 classes); CLASSIC: sets of sentences from academic papers, labeled by publisher name; REUTERS: a classic news dataset labeled by news topics (we use the 8-class version with train/test split as described in Cardoso-Cachopo (2007)); AMAZON: a set of Amazon reviews which are labeled by product category in {books, dvd, electronics, kitchen} (as opposed to by sentiment); and 20NEWS: news articles classified into 20 different categories (we use the "bydate" train/test split, http://qwone.com/~jason/20Newsgroups/, as in Cardoso-Cachopo (2007)). We preprocess all datasets by removing all words in the SMART stop word list (Salton & Buckley, 1971). For 20NEWS, we additionally remove all words that appear less than 5 times across all documents. Finally, to speed up the computation of WMD (and its lower bounds) we limit all 20NEWS documents to the most common 500 words (in each document) for WMD-based methods.

Table 1. Dataset characteristics, used in evaluation.

NAME       n      BOW DIM.  UNIQUE WORDS (AVG)  |Y|
BBCSPORT   517    13243     117                 5
TWITTER    2176   6344      9.9                 3
RECIPE     3059   5708      48.5                15
OHSUMED    3999   31789     59.2                10
CLASSIC    4965   24277     38.6                4
REUTERS    5485   22425     37.1                8
AMAZON     5600   42063     45.0                4
20NEWS     11293  29671     72                  20

We split each dataset into training and testing subsets (if not already done so). Table 1 shows relevant statistics for each of these training datasets including the number of inputs n, the bag-of-words dimensionality, the average number of unique words per document, and the number of classes |Y|. The word embedding used in our WMD implementation is the freely-available word2vec word embedding (https://code.google.com/p/word2vec/), which has an embedding for 3 million words/phrases (from Google News), trained using the approach in Mikolov et al. (2013b). Words that are not present in the pre-computed word2vec model are dropped for the WMD metric (and its lower bounds), but kept for all baselines (thus giving the baselines a slight competitive advantage).

We compare against 7 document representation baselines:

bag-of-words (BOW). A vector of word counts of dimensionality d, the size of the dictionary.

TFIDF term frequency-inverse document frequency (Salton & Buckley, 1988): the bag-of-words representation divided by each word's document frequency.

BM25 Okapi (Robertson et al., 1995): a ranking function that extends TF-IDF for each word w in a document D:

$$BM25(w, D) = \frac{IDF(w)\, TF(w, D)\,(k_1 + 1)}{TF(w, D) + k_1 \left(1 - b + b\,\frac{|D|}{D_{avg}}\right)}$$

where IDF(w) is the inverse document frequency of word w, TF(w, D) is its term frequency in document D, |D| is the number of words in the document, D_avg is the average size of a document, and k_1 and b are free parameters.
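A direct transcription of the BM25 score above (ours); the particular IDF formula and the default values of k1 and b are common choices, not ones fixed by the paper.

```python
# BM25 score of one word in one document; word counts, document frequencies,
# and the parameters k1 and b are assumed to be given.
import math

def bm25(tf, idf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """BM25(w, D) as in the formula above."""
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

def inverse_doc_frequency(n_docs, doc_freq):
    """A common IDF variant; the paper does not fix a specific one."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
```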
LSI Latent Semantic Indexing (Deerwester et al., 1990): uses singular value decomposition on the BOW representation to arrive at a semantic feature space.

LDA Latent Dirichlet Allocation (Blei et al., 2003): a celebrated generative model for text documents that learns representations for documents as distributions over word topics. We use the Matlab Topic Modeling Toolbox (Steyvers & Griffiths, 2007) and allow 100 iterations for burn-in and run the chain for 1000 iterations afterwards. Importantly, for each dataset we train LDA transductively, i.e. we train on the union of the training and holdout sets.

mSDA Marginalized Stacked Denoising Autoencoder (Chen et al., 2012): a representation learned from stacked denoising autoencoders (SDAs), marginalized for fast training. In general, SDAs have been shown to have state-of-the-art performance for document sentiment analysis tasks (Glorot et al., 2011). For high-dimensional datasets (i.e., all except BBCSPORT, TWITTER, and RECIPE) we use either the high-dimensional version of mSDA (Chen et al., 2012) or limit the features to the top 20% of the words (ordered by occurrence), whichever performs better.

CCG Componential Counting Grid (Perina et al., 2013): a generative model that directly generalizes the Counting Grid (Jojic & Perina, 2011), which models documents as a mixture of word distributions, and LDA (Blei et al., 2003). We use the grid location admixture probability of each document as the new representation.

For each baseline we use the Euclidean distance for kNN classification. All free hyperparameters were set with Bayesian optimization for all algorithms (Snoek et al., 2012). We use the open source MATLAB implementation "bayesopt.m" from Gardner et al. (2014) (http://tinyurl.com/bayesopt).
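The paper's experiments use a MATLAB implementation; purely as an illustration, the sketch below (ours) shows how a precomputed WMD (or baseline) distance matrix can be plugged into an off-the-shelf kNN classifier via scikit-learn's "precomputed" metric.

```python
# k-NN classification from precomputed document distances.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_error(D_train, D_test, y_train, y_test, k):
    """D_train: (n_train, n_train) distances; D_test: (n_test, n_train)."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    clf.fit(D_train, y_train)
    return np.mean(clf.predict(D_test) != y_test)
```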

5.2. Document classification

Document similarities are particularly useful for classification given a supervised training dataset, via the kNN decision rule (Cover & Hart, 1967). Different from other classification techniques, kNN provides an interpretable certificate (i.e., in the form of nearest neighbors) that allows practitioners to verify the prediction result. Moreover, such similarities can be used for ranking and recommendation. To assess the performance of our metric on classification, we compare the kNN results of the WMD with each of the aforementioned document representations/distances. For all algorithms we split the training set into an 80/20 train/validation split for hyper-parameter tuning. It is worth emphasizing that BOW, TF-IDF, BM25 and WMD have no hyperparameters and thus we only optimize the neighborhood size (k ∈ {1, . . . , 19}) of kNN.

Figure 3. The kNN test error results on 8 document classification data sets, compared to canonical and state-of-the-art baseline methods.
Figure 3 shows the kNN test error of the 8 aforementioned algorithms on the 8 document classification datasets. For datasets without predefined train/test splits (BBCSPORT, TWITTER, RECIPE, CLASSIC, AMAZON) we averaged over 5 train/test splits and report means and standard errors. We order the methods by their average performance. Perhaps surprisingly, LSI and LDA outperform the more recent approaches CCG and mSDA. For LDA this is likely because it is trained transductively. One explanation for why LSI performs so well may be the power of Bayesian Optimization to tune the single LSI hyperparameter: the number of basis vectors to use in the representation. Fine-tuning the number of latent vectors may allow LSI to create a very accurate representation. On all datasets except two (BBCSPORT, OHSUMED), WMD achieves the lowest test error. Notably, WMD achieves almost a 10% reduction in (relative) error over the second best method on TWITTER (LSI). It even reaches error levels as low as 2.8% on CLASSIC and 3.5% on REUTERS, even outperforming transductive LDA, which has direct access to the features of the test set. One possible explanation for the WMD performance on OHSUMED is that many of these documents contain technical medical terms which may not have a word embedding in our model. These words must be discarded, possibly harming the accuracy of the metric.

Figure 4. The kNN test errors of various document metrics averaged over all eight datasets, relative to kNN with BOW.

Figure 4 shows the average improvement of each method, relative to BOW, across all datasets. On average, WMD results in only 0.42 of the BOW test-error and outperforms all other metrics that we compared against.

5.3. Word embeddings

As our technique is naturally dependent on a word embedding, we examine how different word embeddings affect the quality of k-nearest neighbor classification with the WMD. Apart from the aforementioned freely-available Google News word2vec model, we trained two other word2vec models on a papers corpus (NIPS) and a product review corpus (AMZ). Specifically, we extracted text from all NIPS conference papers within the years 2004-2013 and trained a Skip-gram model on the dataset as per Mikolov et al. (2013b). We also used the 340,000 Amazon review dataset of Blitzer et al. (2006) (a superset of the AMAZON classification dataset used above) to train the review-based word2vec model. Prior to training we removed stop words for both models, resulting in 36,425,259 words for the NIPS dataset (50-dimensional) and 90,005,609 words for the Reviews dataset (100-dimensional) (compared to the 100 billion word dataset of Google News (NEWS)).

Table 2. Test error percentage and standard deviation for different text embeddings. NIPS, AMZ, NEWS are word2vec (w2v) models trained on different data sets whereas HLBL and CW were obtained with other embedding algorithms.

DATASET    HLBL   CW     NIPS (W2V)  AMZ (W2V)  NEWS (W2V)
BBCSPORT   4.5    8.2    9.5         4.1        5.0
TWITTER    33.3   33.7   29.3        28.1       28.3
RECIPE     47.0   51.6   52.7        47.4       45.1
OHSUMED    52.0   56.2   55.6        50.4       44.5
CLASSIC    5.3    5.5    4.0         3.8        3.0
REUTERS    4.2    4.6    7.1         9.1        3.5
AMAZON     12.3   13.3   13.9        7.8        7.2

Additionally, we experimented with the pre-trained embeddings of the hierarchical log-bilinear model (HLBL) (Mnih & Hinton, 2009) and the model of Collobert & Weston (2008) (CW), both available at http://metaoptimize.com/projects/wordreprs/. The HLBL model contains 246,122 unique 50-dimensional word embeddings and the Collobert model has 268,810 unique word embeddings (also 50-dimensional). Table 2 shows classification results on all data sets except 20NEWS (which we dropped due to running time constraints). On the five larger data sets, the 3 million word Google NEWS model performs superior to the smaller models. This result is in line with those of Mikolov et al. (2013a), that in general more data (as opposed to simply relevant data) creates better embeddings. Additionally, the three word2vec (w2v) models outperform the HLBL and Collobert models on all datasets. The classification error deteriorates when the underlying model is trained on a very different vocabulary (e.g. NIPS papers vs cooking recipes), although the performance of the Google NEWS corpus is surprisingly competitive throughout.

5.4. Lower Bounds and Pruning

Figure 6. The WCD, RWMD, and WMD distances (sorted by WMD) for a random test query document.

Although WMD yields by far the most accurate classification results, it is fair to say that it is also the slowest metric to compute. We can therefore use the lower bounds from Section 4 to speed up the distance computations. Figure 6 shows the WMD distance of all training inputs to two randomly chosen test queries from TWITTER and AMAZON in increasing order. The graph also depicts the WCD and RWMD lower bounds. The RWMD is typically very close to the exact WMD distance, whereas the cheaper WCD approximation is rather loose. The tightness of RWMD makes it valuable to prune documents that are provably not amongst the k nearest neighbors. Although the WCD is too loose for pruning, its distances increase with the exact WMD distance, which makes it a useful heuristic to identify promising nearest neighbor candidates.
Figure 5. (Left:) The average distances of lower bounds as a ratio w.r.t. WMD. (Right:) kNN test error results on 8 datasets, compared to canonical and state-of-the-art baseline methods.

The tightness of each lower bound can be seen in the left image in Figure 5 (averaged across all test points). RWMD_c1 and RWMD_c2 correspond to WMD with only constraints #1 and #2 respectively, which result in comparable tightness. WCD is by far the loosest and RWMD the tightest bound. Interestingly, this tightness does not directly translate into retrieval accuracy. The right image shows the average kNN errors (relative to the BOW kNN error) if the lower bounds are used directly for nearest neighbor retrieval. The two left-most columns represent the two individual lower bounds of the RWMD approximation. Both perform poorly (worse than WCD), however their maximum (RWMD) is surprisingly accurate and yields kNN errors that are only slightly less accurate than the exact WMD. In fact, we would like to emphasize that the average kNN error with RWMD (0.45 relative to BOW) still outperforms all other baselines (see Figure 4).

Figure 7. Test errors (in %) and speedups of the prefetch and prune algorithm as the document cutoff, m, changes. All speedups are relative to the exhaustive k-NN search with the WMD (shown on the very top). The error bars indicate standard error (if applicable).

Finally, we evaluate the speedup and accuracy of the exact and approximate versions of the prefetch and prune algorithm from Section 4 under various values of m (Figure 7). When m = k we use the WCD metric for classification (and drop all WMD computations, which are unnecessary). For all other results we prefetch m instances via WCD, use RWMD to check if a document can be pruned, and only if not compute the exact WMD distance. The last bar for each dataset shows the test error obtained with the RWMD metric (omitting all WMD computations). All speedups are reported relative to the time required for the exhaustive WMD metric (very top of the figure) and were run in parallel on 4 cores (8 cores for 20NEWS) of an Intel L5520 CPU with a 2.27GHz clock frequency.

First, we notice in all cases the increase in error through prefetching is relatively minor whereas the speedup can be substantial. The exact method (m = n) typically results in a speedup between 2× and 5×, which appears pronounced with increasing document lengths (e.g. 20NEWS). It is interesting to observe that the error drops most between m = k and m = 2k, which might yield a sweet spot between accuracy and retrieval time for time-sensitive applications. As noted before, using RWMD directly leads to impressively low error rates and average retrieval times below 1s on all data sets. We believe the actual timing could be improved substantially with more sophisticated implementations (our code is in MATLAB) and parallelization.

6. Discussion and Conclusion

It is worthwhile considering why the WMD metric leads to such low error rates across all data sets. We attribute this to its ability to utilize the high quality of the word2vec embedding. Trained on billions of words, the word2vec embedding captures knowledge about text documents in the English language that may not be obtainable from the training set alone. As pointed out by Mikolov et al. (2013a), other algorithms (such as LDA or LSI) do not scale naturally to data sets of such scale without special approximations which often counteract the benefit of large-scale data (although it is a worthy area of future work). Surprisingly, this "latent" supervision benefits tasks that are different from the data used to learn the word embedding.

One attractive feature of the WMD, that we would like to explore in the future, is its interpretability. Document distances can be dissected into sparse distances between words, which can be visualized and explained to humans. Another interesting direction will be to incorporate document structure into the distances between words by adding penalty terms if two words occur in different sections of similarly structured documents. If for example the WMD metric is used to measure the distance between academic papers, it might make sense to penalize word movements between the introduction and method section more than word movements from one introduction to another.

Acknowledgments

KQW and MJK are supported by NSF grants IIA-1355406, IIS-1149882, EFRI-1137211. We thank the anonymous reviewers for insightful feedback.

References

Andoni, A. and Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pp. 459–468. IEEE, 2006.

Aslam, J. A. and Frost, M. An information-theoretic measure for document similarity. In SIGIR, volume 3, pp. 449–450. Citeseer, 2003.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, pp. 120–128. ACL, 2006.

Brochu, E. and Freitas, N. D. Name that song! In NIPS, pp. 1505–1512, 2002.

Cardoso-Cachopo, A. Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.

Chen, M., Xu, Z., Weinberger, K. Q., and Sha, F. Marginalized denoising autoencoders for domain adaptation. In ICML, 2012.

Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.

Cover, T. and Hart, P. Nearest neighbor pattern classification. Information Theory, IEEE Transactions on, 13(1):21–27, 1967.

Croft, W. B. and Lafferty, J. Language modeling for information retrieval, volume 13. Springer, 2003.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pp. 2292–2300, 2013.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.

Garcia, V., Debreuve, E., and Barlaud, M. Fast k nearest neighbor search using gpu. In CVPR Workshop, pp. 1–6. IEEE, 2008.

Gardner, J., Kusner, M. J., Xu, E., Weinberger, K. Q., and Cunningham, J. Bayesian optimization with inequality constraints. In ICML, pp. 937–945, 2014.

Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.

Grauman, K. and Darrell, T. Fast contour matching using approximate earth mover's distance. In CVPR, volume 1, pp. I–220. IEEE, 2004.

Greene, D. and Cunningham, P. Practical solutions to the problem of diagonal dominance in kernel document clustering. In ICML, pp. 377–384. ACM, 2006.

Hearst, M. A. Multi-paragraph segmentation of expository text. In ACL, pp. 9–16. Association for Computational Linguistics, 1994.

Jojic, N. and Perina, A. Multidimensional counting grids: Inferring word order from disordered bags of words. In UAI, 2011.

Le, Q. V. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014.

Levina, E. and Bickel, P. The earth mover's distance is the mallows distance: Some insights from statistics. In ICCV, volume 2, pp. 251–256. IEEE, 2001.

Ling, H. and Okada, K. An efficient earth mover's distance algorithm for robust histogram comparison. Pattern Analysis and Machine Intelligence, 29(5):840–853, 2007.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR, 2013a.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013b.

Mikolov, T., Yih, W., and Zweig, G. Linguistic regularities in continuous space word representations. In NAACL, pp. 746–751. Citeseer, 2013c.

Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In NIPS, pp. 1081–1088, 2009.

Monge, G. Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale, 1781.

Nemhauser, G. L. and Wolsey, L. A. Integer and combinatorial optimization, volume 18. Wiley New York, 1988.

Ontrup, J. and Ritter, H. Hyperbolic self-organizing maps for semantic navigation. In NIPS, volume 14, pp. 2001, 2001.

Pele, O. and Werman, M. Fast and robust earth mover's distances. In ICCV, pp. 460–467. IEEE, 2009.

Perina, A., Jojic, N., Bicego, M., and Truski, A. Documents as multiple overlapping windows into grids of counts. In NIPS, pp. 10–18, 2013.

Petterson, J., Buntine, W., Narayanamurthy, S. M., Caetano, T. S., and Smola, A. J. Word features for latent dirichlet allocation. In NIPS, pp. 1921–1929, 2010.

Quadrianto, N., Song, L., and Smola, A. J. Kernelized sorting. In NIPS, pp. 1289–1296, 2009.

Ren, Z., Yuan, J., and Zhang, Z. Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. In Proceedings of the ACM International Conference on Multimedia, pp. 1093–1096. ACM, 2011.

Robertson, S. E. and Walker, S. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 232–241. Springer-Verlag New York, Inc., 1994.

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., Gatford, M., et al. Okapi at TREC-3. NIST Special Publication SP, pp. 109–109, 1995.

Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In ICCV, pp. 59–66. IEEE, 1998.

Salton, G. and Buckley, C. The SMART retrieval system: experiments in automatic document processing. 1971.

Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

Sanders, N. J. Sanders-Twitter sentiment corpus, 2011.

Schölkopf, B., Weston, J., Eskin, E., Leslie, C., and Noble, W. S. A kernel approach for learning from almost orthogonal patterns. In ECML, pp. 511–528. Springer, 2002.

Shirdhonkar, S. and Jacobs, D. W. Approximate earth movers distance in linear time. In CVPR, pp. 1–8. IEEE, 2008.

Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012.

Steyvers, M. and Griffiths, T. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440, 2007.

Turian, J., Ratinov, L., and Bengio, Y. Word representations: a simple and general method for semi-supervised learning. In ACL, pp. 384–394. Association for Computational Linguistics, 2010.

Wan, X. A novel document similarity measure based on earth movers distance. Information Sciences, 177(18):3718–3730, 2007.

Yianilos, P. N. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 311–321. Society for Industrial and Applied Mathematics, 1993.