
GloVe: Global Vectors for Word Representation

Jeffrey Pennington, Richard Socher, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.

1 Introduction

Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013).

Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference. For example, the analogy "king is to queen as man is to woman" should be encoded in the vector space by the vector equation king − queen = man − woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).

The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990) and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c). Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark.

We provide the source code for the model as well as trained word vectors at http://nlp.

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543,
October 25-29, 2014, Doha, Qatar. 2014 Association for Computational Linguistics

2 Related Work

Matrix Factorization Methods. Matrix factorization methods for generating low-dimensional word representations have roots stretching as far back as LSA. These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus. The particular type of information captured by such matrices varies by application. In LSA, the matrices are of "term-document" type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus. In contrast, the Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), for example, utilizes matrices of "term-term" type, i.e., the rows and columns correspond to words and the entries correspond to the number of times a given word occurs in the context of another given word.

A main problem with HAL and related methods is that the most frequent words contribute a disproportionate amount to the similarity measure: the number of times two words co-occur with the or and, for example, will have a large effect on their similarity despite conveying relatively little about their semantic relatedness. A number of techniques exist that address this shortcoming of HAL, such as the COALS method (Rohde et al., 2006), in which the co-occurrence matrix is first transformed by an entropy- or correlation-based normalization. An advantage of this type of transformation is that the raw co-occurrence counts, which for a reasonably sized corpus might span 8 or 9 orders of magnitude, are compressed so as to be distributed more evenly in a smaller interval. A variety of newer models also pursue this approach, including a study (Bullinaria and Levy, 2007) that indicates that positive pointwise mutual information (PPMI) is a good transformation. More recently, a square root type transformation in the form of Hellinger PCA (HPCA) (Lebret and Collobert, 2014) has been suggested as an effective way of learning word representations.

Shallow Window-Based Methods. Another approach is to learn word representations that aid in making predictions within local context windows. For example, Bengio et al. (2003) introduced a model that learns word vector representations as part of a simple neural network architecture for language modeling. Collobert and Weston (2008) decoupled the word vector training from the downstream training objectives, which paved the way for Collobert et al. (2011) to use the full context of a word for learning the word representations, rather than just the preceding context as is the case with language models.

Recently, the importance of the full neural network structure for learning useful word representations has been called into question. The skip-gram and continuous bag-of-words (CBOW) models of Mikolov et al. (2013a) propose a simple single-layer architecture based on the inner product between two word vectors. Mnih and Kavukcuoglu (2013) also proposed closely-related vector log-bilinear models, vLBL and ivLBL, and Levy et al. (2014) proposed explicit word embeddings based on a PPMI metric.

In the skip-gram and ivLBL models, the objective is to predict a word's context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context. Through evaluation on a word analogy task, these models demonstrated the capacity to learn linguistic patterns as linear relationships between the word vectors.

Unlike the matrix factorization methods, the shallow window-based methods suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.

3 The GloVe Model

The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations, and although many such methods now exist, the question still remains as to how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning. In this section, we shed some light on this question. We use our insights to construct a new model for word representation which we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.

First we establish some notation. Let the matrix of word-word co-occurrence counts be denoted by X, whose entries X_ij tabulate the number of times word j occurs in the context of word i. Let X_i = Σ_k X_ik be the number of times any word appears in the context of word i. Finally, let P_ij = P(j|i) = X_ij / X_i be the probability that word j appears in the context of word i.
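As a concrete sketch of this notation (the toy corpus and window size are invented for illustration; this is not the paper's released implementation), the counts X, the marginals X_i, and the co-occurrence probabilities P_ij can be computed as:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """X[i][j]: number of times word j appears within `window`
    words of word i (symmetric context)."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word_i in enumerate(tokens):
        for ctx in range(max(0, pos - window), pos):
            word_j = tokens[ctx]
            X[word_i][word_j] += 1.0  # j appears in i's context
            X[word_j][word_i] += 1.0  # and vice versa
    return X

tokens = "the ice melts and the steam rises from the water".split()
X = cooccurrence_counts(tokens, window=2)

X_the = sum(X["the"].values())            # X_i = sum_k X_ik
P_the_water = X["the"]["water"] / X_the   # P_ij = X_ij / X_i
```

For a symmetric window, X is symmetric by construction, a property the derivation below relies on when exchanging the roles of word and context word.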

Table 1: Co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus. Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.

Probability and Ratio   k = solid     k = gas       k = water     k = fashion
P(k|ice)                1.9 x 10^-4   6.6 x 10^-5   3.0 x 10^-3   1.7 x 10^-5
P(k|steam)              2.2 x 10^-5   7.8 x 10^-4   2.2 x 10^-3   1.8 x 10^-5
P(k|ice)/P(k|steam)     8.9           8.5 x 10^-2   1.36          0.96

We begin with a simple example that showcases how certain aspects of meaning can be extracted directly from co-occurrence probabilities. Consider two words i and j that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take i = ice and j = steam. The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k. For words k related to ice but not steam, say k = solid, we expect the ratio P_ik/P_jk will be large. Similarly, for words k related to steam but not ice, say k = gas, the ratio should be small. For words k like water or fashion, that are either related to both ice and steam, or to neither, the ratio should be close to one. Table 1 shows these probabilities and their ratios for a large corpus, and the numbers confirm these expectations. Compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion) and it is also better able to discriminate between the two relevant words.

The above argument suggests that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves. Noting that the ratio P_ik/P_jk depends on three words i, j, and k, the most general model takes the form,

    F(w_i, w_j, w̃_k) = P_ik / P_jk ,    (1)

where w ∈ R^d are word vectors and w̃ ∈ R^d are separate context word vectors whose role will be discussed in Section 4.2. In this equation, the right-hand side is extracted from the corpus, and F may depend on some as-of-yet unspecified parameters. The number of possibilities for F is vast, but by enforcing a few desiderata we can select a unique choice. First, we would like F to encode the information present in the ratio P_ik/P_jk in the word vector space. Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences. With this aim, we can restrict our consideration to those functions F that depend only on the difference of the two target words, modifying Eqn. (1) to,

    F(w_i − w_j, w̃_k) = P_ik / P_jk .    (2)

Next, we note that the arguments of F in Eqn. (2) are vectors while the right-hand side is a scalar. While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture. To avoid this issue, we can first take the dot product of the arguments,

    F((w_i − w_j)^T w̃_k) = P_ik / P_jk ,    (3)

which prevents F from mixing the vector dimensions in undesirable ways. Next, note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary and we are free to exchange the two roles. To do so consistently, we must not only exchange w ↔ w̃ but also X ↔ X^T. Our final model should be invariant under this relabeling, but Eqn. (3) is not. However, the symmetry can be restored in two steps. First, we require that F be a homomorphism between the groups (R, +) and (R_>0, ×), i.e.,

    F((w_i − w_j)^T w̃_k) = F(w_i^T w̃_k) / F(w_j^T w̃_k) ,    (4)

which, by Eqn. (3), is solved by,

    F(w_i^T w̃_k) = P_ik = X_ik / X_i .    (5)

The solution to Eqn. (4) is F = exp, or,

    w_i^T w̃_k = log(P_ik) = log(X_ik) − log(X_i) .    (6)
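The homomorphism requirement of Eqn. (4) and its solution F = exp can be checked numerically; the vector dimension and random seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
w_i, w_j, w_k = rng.standard_normal((3, 5))  # two word vectors and one context vector

# F = exp maps differences in (R, +) to ratios in (R_>0, x), so
# F((w_i - w_j)^T w_k) = F(w_i^T w_k) / F(w_j^T w_k):
lhs = np.exp((w_i - w_j) @ w_k)
rhs = np.exp(w_i @ w_k) / np.exp(w_j @ w_k)
```

The identity holds for any choice of vectors, which is exactly what makes F = exp compatible with the difference structure on the left of Eqn. (2) and the ratio structure on the right.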

Next, we note that Eqn. (6) would exhibit the exchange symmetry if not for the log(X_i) on the right-hand side. However, this term is independent of k so it can be absorbed into a bias b_i for w_i. Finally, adding an additional bias b̃_k for w̃_k restores the symmetry,

    w_i^T w̃_k + b_i + b̃_k = log(X_ik) .    (7)

Eqn. (7) is a drastic simplification over Eqn. (1), but it is actually ill-defined since the logarithm diverges whenever its argument is zero. One resolution to this issue is to include an additive shift in the logarithm, log(X_ik) → log(1 + X_ik), which maintains the sparsity of X while avoiding the divergences. The idea of factorizing the log of the co-occurrence matrix is closely related to LSA and we will use the resulting model as a baseline in our experiments. A main drawback to this model is that it weighs all co-occurrences equally, even those that happen rarely or never. Such rare co-occurrences are noisy and carry less information than the more frequent ones, yet even just the zero entries account for 75-95% of the data in X, depending on the vocabulary size and corpus.

We propose a new weighted least squares regression model that addresses these problems. Casting Eqn. (7) as a least squares problem and introducing a weighting function f(X_ij) into the cost function gives us the model

    J = Σ_{i,j=1}^V f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2 ,    (8)

where V is the size of the vocabulary. The weighting function should obey the following properties:

1. f(0) = 0. If f is viewed as a continuous function, it should vanish as x → 0 fast enough that lim_{x→0} f(x) log^2 x is finite.
2. f(x) should be non-decreasing so that rare co-occurrences are not overweighted.
3. f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.

Of course a large number of functions satisfy these properties, but one class of functions that we found to work well can be parameterized as,

    f(x) = (x/x_max)^α  if x < x_max ;  1  otherwise .    (9)

The performance of the model depends weakly on the cutoff, which we fix to x_max = 100 for all our experiments. We found that α = 3/4 gives a modest improvement over a linear version with α = 1. Although we offer only empirical motivation for choosing the value 3/4, it is interesting that a similar fractional power scaling was found to give the best performance in (Mikolov et al., 2013a).

Figure 1: Weighting function f with α = 3/4.

3.1 Relationship to Other Models

Because all unsupervised methods for learning word vectors are ultimately based on the occurrence statistics of a corpus, there should be commonalities between the models. Nevertheless, certain models remain somewhat opaque in this regard, particularly the recent window-based methods like skip-gram and ivLBL. Therefore, in this subsection we show how these models are related to our proposed model.

The starting point for the skip-gram or ivLBL methods is a model Q_ij for the probability that word j appears in the context of word i. For concreteness, let us assume that Q_ij is a softmax,

    Q_ij = exp(w_i^T w̃_j) / Σ_{k=1}^V exp(w_i^T w̃_k) .    (10)

Most of the details of these models are irrelevant for our purposes, aside from the fact that they attempt to maximize the log probability as a context window scans over the corpus. Training proceeds in an on-line, stochastic fashion, but the implied global objective function can be written as,

    J = − Σ_{i∈corpus} Σ_{j∈context(i)} log Q_ij .    (11)

Evaluating the normalization factor of the softmax for each term in this sum is costly. To allow for efficient training, the skip-gram and ivLBL models introduce approximations to Q_ij. However, the sum in Eqn. (11) can be evaluated much more efficiently if we first group together those
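A minimal sketch of the weighting function of Eqn. (9) and the cost of Eqn. (8) follows; the function and variable names are our own, and the released GloVe implementation is organized differently:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function of Eqn. (9): (x/x_max)^alpha below the
    cutoff, 1 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(W, W_ctx, b, b_ctx, X):
    """Weighted least squares cost of Eqn. (8), summed over the
    nonzero entries of the co-occurrence matrix X only."""
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        residual = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        total += float(f(X[i, j])) * residual ** 2
    return total
```

Because the loop runs only over nonzero entries, the cost (and its gradient) never touches the zero co-occurrences, which is the source of the efficiency claims in Section 3.2.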

terms that have the same values for i and j,

    J = − Σ_{i=1}^V Σ_{j=1}^V X_ij log Q_ij ,    (12)

where we have used the fact that the number of like terms is given by the co-occurrence matrix X. Recalling our notation for X_i = Σ_k X_ik and P_ij = X_ij / X_i, we can rewrite J as,

    J = − Σ_{i=1}^V X_i Σ_{j=1}^V P_ij log Q_ij = Σ_{i=1}^V X_i H(P_i, Q_i) ,    (13)

where H(P_i, Q_i) is the cross entropy of the distributions P_i and Q_i, which we define in analogy to X_i. As a weighted sum of cross-entropy error, this objective bears some formal resemblance to the weighted least squares objective of Eqn. (8). In fact, it is possible to optimize Eqn. (13) directly as opposed to the on-line training methods used in the skip-gram and ivLBL models. One could interpret this objective as a "global skip-gram" model, and it might be interesting to investigate further. On the other hand, Eqn. (13) exhibits a number of undesirable properties that ought to be addressed before adopting it as a model for learning word vectors.

To begin, cross entropy error is just one among many possible distance measures between probability distributions, and it has the unfortunate property that distributions with long tails are often modeled poorly with too much weight given to the unlikely events. Furthermore, for the measure to be bounded it requires that the model distribution Q be properly normalized. This presents a computational bottleneck owing to the sum over the whole vocabulary in Eqn. (10), and it would be desirable to consider a different distance measure that did not require this property of Q. A natural choice would be a least squares objective in which normalization factors in Q and P are discarded,

    Ĵ = Σ_{i,j} X_i (P̂_ij − Q̂_ij)^2 ,    (14)

where P̂_ij = X_ij and Q̂_ij = exp(w_i^T w̃_j) are the unnormalized distributions. At this stage another problem emerges, namely that X_ij often takes very large values, which can complicate the optimization. An effective remedy is to minimize the squared error of the logarithms of P̂ and Q̂ instead,

    Ĵ = Σ_{i,j} X_i (log P̂_ij − log Q̂_ij)^2 = Σ_{i,j} X_i (w_i^T w̃_j − log X_ij)^2 .    (15)

Finally, we observe that while the weighting factor X_i is preordained by the on-line training method inherent to the skip-gram and ivLBL models, it is by no means guaranteed to be optimal. In fact, Mikolov et al. (2013a) observe that performance can be increased by filtering the data so as to reduce the effective value of the weighting factor for frequent words. With this in mind, we introduce a more general weighting function, which we are free to take to depend on the context word as well. The result is,

    Ĵ = Σ_{i,j} f(X_ij) (w_i^T w̃_j − log X_ij)^2 ,    (16)

which is equivalent¹ to the cost function of Eqn. (8), which we derived previously.

¹ We could also include bias terms in Eqn. (16).

3.2 Complexity of the model

As can be seen from Eqn. (8) and the explicit form of the weighting function f(X), the computational complexity of the model depends on the number of nonzero elements in the matrix X. As this number is always less than the total number of entries of the matrix, the model scales no worse than O(|V|^2). At first glance this might seem like a substantial improvement over the shallow window-based approaches, which scale with the corpus size, |C|. However, typical vocabularies have hundreds of thousands of words, so that |V|^2 can be in the hundreds of billions, which is actually much larger than most corpora. For this reason it is important to determine whether a tighter bound can be placed on the number of nonzero elements of X.

In order to make any concrete statements about the number of nonzero elements in X, it is necessary to make some assumptions about the distribution of word co-occurrences. In particular, we will assume that the number of co-occurrences of word i with word j, X_ij, can be modeled as a power-law function of the frequency rank of that word pair, r_ij:

    X_ij = k / (r_ij)^α .    (17)

The total number of words in the corpus is proportional to the sum over all elements of the co-occurrence matrix X,

    |C| ∼ Σ_{i,j} X_ij = Σ_{r=1}^{|X|} k / r^α = k H_{|X|,α} ,    (18)

where we have rewritten the last sum in terms of the generalized harmonic number H_{n,m}. The upper limit of the sum, |X|, is the maximum frequency rank, which coincides with the number of nonzero elements in the matrix X. This number is also equal to the maximum value of r in Eqn. (17) such that X_ij ≥ 1, i.e., |X| = k^{1/α}. Therefore we can write Eqn. (18) as,

    |C| ∼ |X|^α H_{|X|,α} .    (19)

We are interested in how |X| is related to |C| when both numbers are large. For this purpose we use the expansion of generalized harmonic numbers (Apostol, 1976),

    H_{x,s} = x^{1−s} / (1−s) + ζ(s) + O(x^{−s})  if s > 0, s ≠ 1 ,    (20)

giving,

    |C| ∼ |X| / (1−α) + ζ(α) |X|^α + O(1) ,    (21)

where ζ(s) is the Riemann zeta function. In the limit that X is large, only one of the two terms on the right hand side of Eqn. (21) will be relevant, and which term that is depends on whether α > 1,

    |X| = O(|C|)        if α < 1 ,
    |X| = O(|C|^{1/α})  if α > 1 .    (22)

For the corpora studied in this article, we observe that X_ij is well-modeled by Eqn. (17) with α = 1.25. In this case we have that |X| = O(|C|^{0.8}). Therefore we conclude that the complexity of the model is much better than the worst case O(V^2), and in fact it does somewhat better than the on-line window-based methods, which scale like O(|C|).

Table 2: Results on the word analogy task, given as percent accuracy. Underlined scores are best within groups of similarly-sized models; bold scores are best overall. HPCA vectors are publicly available²; (i)vLBL results are from (Mnih et al., 2013); skip-gram (SG) and CBOW results are from (Mikolov et al., 2013a,b); we trained SG† and CBOW† using the word2vec tool³. See text for details and a description of the SVD models.

Model   Dim.  Size  Sem.  Syn.  Tot.
ivLBL   100   1.5B  55.9  50.1  53.2
HPCA    100   1.6B   4.2  16.4  10.8
GloVe   100   1.6B  67.5  54.3  60.3
SG      300   1B    61    61    61
CBOW    300   1.6B  16.1  52.6  36.1
vLBL    300   1.5B  54.2  64.8  60.0
ivLBL   300   1.5B  65.2  63.0  64.0
GloVe   300   1.6B  80.8  61.5  70.3
SVD     300   6B     6.3   8.1   7.3
SVD-S   300   6B    36.7  46.6  42.1
SVD-L   300   6B    56.6  63.0  60.1
CBOW†   300   6B    63.6  67.4  65.7
SG†     300   6B    73.0  66.0  69.1
GloVe   300   6B    77.4  67.0  71.7
CBOW    1000  6B    57.3  68.9  63.7
SG      1000  6B    66.1  65.1  65.6
SVD-L   300   42B   38.4  58.2  49.2
GloVe   300   42B   81.9  69.3  75.0

4 Experiments

4.1 Evaluation methods

We conduct experiments on the word analogy task of Mikolov et al. (2013a), a variety of word similarity tasks, as described in (Luong et al., 2013), and on the CoNLL-2003 shared benchmark dataset for NER (Tjong Kim Sang and De Meulder, 2003).

Word analogies. The word analogy task consists of questions like, "a is to b as c is to ?" The dataset contains 19,544 such questions, divided into a semantic subset and a syntactic subset. The semantic questions are typically analogies about people or places, like "Athens is to Greece as Berlin is to ?". The syntactic questions are typically analogies about verb tenses or forms of adjectives, for example "dance is to dancing as fly is to ?". To correctly answer the question, the model should uniquely identify the missing term, with only an exact correspondence counted as a correct match. We answer the question "a is to b as c is to ?" by finding the word d whose representation w_d is closest to w_b − w_a + w_c according to the cosine similarity.⁴

² http://lebret.ch/words/
³ http://code.
⁴ Levy et al. (2014) introduce a multiplicative analogy evaluation, 3COSMUL, and report an accuracy of 68.24% on the analogy task. This number is evaluated on a subset of the dataset so it is not included in Table 2. 3COSMUL performed worse than cosine similarity in almost all of our experiments.
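The evaluation described above (find the word d whose vector is most cosine-similar to w_b − w_a + w_c, excluding the three question words) can be sketched as follows; the 2-d vectors are fabricated so that the analogy holds by construction:

```python
import numpy as np

def answer_analogy(a, b, c, vectors):
    """Return the word d whose vector is most cosine-similar to
    w_b - w_a + w_c, excluding the three question words."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = (vec / np.linalg.norm(vec)) @ target
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Fabricated vectors in which king - queen = man - woman holds exactly.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.5, 1.0]),
    "woman": np.array([0.5, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}
```

Excluding the question words matters: without it, one of a, b, or c is frequently the nearest neighbor of the target vector.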

Word similarity. While the analogy task is our primary focus since it tests for interesting vector space substructures, we also evaluate our model on a variety of word similarity tasks in Table 3. These include WordSim-353 (Finkelstein et al., 2001), MC (Miller and Charles, 1991), RG (Rubenstein and Goodenough, 1965), SCWS (Huang et al., 2012), and RW (Luong et al., 2013).

Named entity recognition. The CoNLL-2003 English benchmark dataset for NER is a collection of documents from Reuters newswire articles, annotated with four entity types: person, location, organization, and miscellaneous. We train models on the CoNLL-03 training data and test on three datasets: 1) CoNLL-03 testing data, 2) ACE Phase 2 (2001-02) and ACE-2003 data, and 3) MUC7 Formal Run test set. We adopt the BIO2 annotation standard, as well as all the preprocessing steps described in (Wang and Manning, 2013). A total of 437,905 discrete features were generated for the CoNLL-2003 training dataset. In addition, 50-dimensional vectors for each word of a five-word context are added and used as continuous features. With these features as input, we trained a conditional random field (CRF) with exactly the same setup as the CRFjoin model of (Wang and Manning, 2013).

4.2 Corpora and training details

We trained our model on five corpora of varying sizes: a 2010 Wikipedia dump with 1 billion tokens; a 2014 Wikipedia dump with 1.6 billion tokens; Gigaword 5, which has 4.3 billion tokens; the combination Gigaword5 + Wikipedia2014, which has 6 billion tokens; and 42 billion tokens of web data, from Common Crawl⁵. We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words⁶, and then construct a matrix of co-occurrence counts X. In constructing X, we must choose how large the context window should be and whether to distinguish left context from right context. We explore the effect of these choices below. In all cases we use a decreasing weighting function, so that word pairs that are d words apart contribute 1/d to the total count. This is one way to account for the fact that very distant word pairs are expected to contain less relevant information about the words' relationship to one another.

For all our experiments, we set x_max = 100, α = 3/4, and train the model using AdaGrad (Duchi et al., 2011), stochastically sampling nonzero elements from X, with an initial learning rate of 0.05. We run 50 iterations for vectors smaller than 300 dimensions, and 100 iterations otherwise (see Section 4.6 for more details about the convergence rate). Unless otherwise noted, we use a context of ten words to the left and ten words to the right.

The model generates two sets of word vectors, W and W̃. When X is symmetric, W and W̃ are equivalent and differ only as a result of their random initializations; the two sets of vectors should perform equivalently. On the other hand, there is evidence that for certain types of neural networks, training multiple instances of the network and then combining the results can help reduce overfitting and noise and generally improve results (Ciresan et al., 2012). With this in mind, we choose to use the sum W + W̃ as our word vectors. Doing so typically gives a small boost in performance, with the biggest increase in the semantic analogy task.

⁵ To demonstrate the scalability of the model, we also trained it on a much larger sixth corpus, containing 840 billion tokens of web data, but in this case we did not lowercase the vocabulary, so the results are not directly comparable.
⁶ For the model trained on Common Crawl data, we use a larger vocabulary of about 2 million words.

Figure 2: Accuracy on the analogy task as a function of vector size and window size/type: (a) symmetric context, varying vector dimension; (b) symmetric context, varying window size; (c) asymmetric context, varying window size. All models are trained on the 6 billion token corpus. In (a), the window size is 10. In (b) and (c), the vector size is 100.
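The distance-decaying count described above, in which a pair of words d tokens apart contributes 1/d rather than 1, can be sketched as a small variant of plain windowed counting; the toy sentence and window size are invented:

```python
from collections import defaultdict

def weighted_cooccurrence(tokens, window=10):
    """Accumulate co-occurrence counts in which a pair of words
    d tokens apart contributes 1/d to X[(i, j)]."""
    X = defaultdict(float)
    for pos, word_i in enumerate(tokens):
        for d in range(1, window + 1):
            ctx = pos - d
            if ctx < 0:
                break
            word_j = tokens[ctx]
            X[(word_i, word_j)] += 1.0 / d  # left context, decayed by distance
            X[(word_j, word_i)] += 1.0 / d  # symmetric window
    return X

X = weighted_cooccurrence("a b c a".split(), window=2)
```

With this scheme, adjacent pairs contribute a full count while pairs at the edge of a 10-word window contribute only 0.1, matching the intuition that distant pairs carry less information.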

1 58.000 most GloVe 6B 65. except for the CoNLL 7 We also investigated several other weighting schemes for test set. The fact that this frequently each word occurs with only the top basic SVD model does not scale well to large cor- 10.3 25. what we report here performed best.4 79.9 38. We Collobert and Weston (2008) (CW). with a the models of Huang et al.8 Section 4. The model labeled Discrete is due to a number of factors. riety of state-of-the-art models. With SVD 6B 35. pora lends further evidence to the necessity of the cal of many matrix-factorization-based methods as type of weighting scheme proposed in our model. We conclude that the GloVe vectors are weighting schemes like PPMI destroy the sparsity of X and therefore cannot feasibly be used with large vocabularies.5 71.5 37. Many better.2 65.6 68. The L-BFGS training termi- better than the other baselines.1 56. (2012) (HSMN) and substantial corresponding performance boost. often with smaller nates when no improvement has been achieved on vector sizes and smaller corpora. from the word vectors by first normalizing each The singular vectors of this matrix constitute feature across the vocabulary and then calculat- the baseline “SVD”.7 58.8 53.0 6 billion token corpus (Wikipedia 2014 + Giga. except in this case but they perform very poorly on the word analogy task.0 32. we generate a truncated matrix Xtrunc which retains the information of how L model on this larger corpus.9 We used 10 negative samples. This is Manning (2013). on which the HPCA method does slightly transforming X.6 word2vec.3 Results words of news data.the sum W + W̃ as our word vectors.1 frequent words and a context window size of 10. Both methods help compress the tors available on the word2vec website that are otherwise large range of values in X. with the larity tasks. 8 We use the same parameters as above.5 SG† 6B 62.6 to be a good choice for this corpus. CBOW∗ 100B 68.9 59. useful in downstream NLP tasks. 
including our choice to the baseline using a comprehensive set of discrete use negative sampling (which typically works bet- features that comes with the standard distribution ter than the hierarchical softmax). but with no word vec- negative samples.2 57. A similarity score is obtained otherwise computationally expensive. we found 5 negative samples to work slightly better than 10.5 38.6 75.7 continuous bag-of-words (CBOW† ) models on the SVD-L 6B 65.0 53.000 most frequent words. as can be The GloVe model outperforms all other methods seen by the decreased performance of the SVD- on all evaluation metrics. Doing so typ- Table 3: Spearman rank correlation on word simi- ically gives a small boost in performance. these information-theoretic trans- formations do indeed work well on word similarity measures. Otherwise all config- ing the word2vec tool are somewhat better than urations are identical to those used by Wang and most of the previously published results.3 39.2 word 5) with a vocabulary of the top 400. and the choice of the corpus. the extra columns can contribute a disproportion- Table 3 shows results on five different word ate number of zero entries and the methods are similarity datasets.3 35. we also compare to trained on a large 42 billion token corpus.7 trained with word and phrase vectors on 100B 4. In addition to the HPCA and SVD We demonstrate that the model can easily be models discussed previously.5 71. as well as with our own results produced using the word2vec Model Size WS353 MC RG SCWS RW tool and with several baselines using SVDs. as was first With smaller vocabularies. tor features. This step is typi. 1539 . we train the skip-gram (SG† ) and SVD-S 6B 56. which we show in GloVe 42B 75.8 72. We compute Spearman’s baselines: √ “SVD-S” in which we take the SVD of rank correlation coefficient between this score and Xtrunc .1 42.7 72.8 65. We present results on the word analogy task in Ta- Table 4 shows results on the NER task with the ble 2. 
Table 4 shows results on the NER task with the CRF-based model. The L-BFGS training terminates when no improvement has been achieved on the dev set for 25 iterations. Otherwise all configurations are identical to those used by Wang and Manning (2013). The model labeled Discrete is the baseline using a comprehensive set of discrete features that comes with the standard distribution of the Stanford NER model, but with no word vector features. In addition to the HPCA and SVD models discussed previously, we also compare to the models of Huang et al. (2012) (HSMN) and Collobert and Weston (2008) (CW). We trained the CBOW model using the word2vec tool with the same parameters as above, except that for this task we found 5 negative samples to work slightly better than 10. The GloVe model outperforms all other methods on all evaluation metrics, except for the CoNLL test set, on which the HPCA method does slightly better. We conclude that the GloVe vectors are useful in downstream NLP tasks, as was first shown for neural vectors by Turian et al. (2010).
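The word analogy evaluation behind Table 2 rewards vectors for which "a is to b as c is to ?" is answered by simple vector arithmetic. A minimal sketch, assuming unit-normalized rows and the standard convention of excluding the three query words from the candidates (`analogy` is a hypothetical helper name):

```python
import numpy as np

def analogy(W, vocab, a, b, c):
    """Answer 'a is to b as c is to ?' by nearest neighbor to b - a + c.

    Rows of W are unit-normalized first; the three query words are
    excluded from the candidate set, as is standard for this benchmark.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    target = Wn[idx[b]] - Wn[idx[a]] + Wn[idx[c]]
    scores = Wn @ target           # cosine ranking over the vocabulary
    for w in (a, b, c):
        scores[idx[w]] = -np.inf   # exclude the query words
    return vocab[int(np.argmax(scores))]
```

Accuracy on the benchmark is simply the fraction of analogy questions for which the top-ranked candidate matches the gold answer.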

Table 4: F1 score on the NER task with 50d vectors. Discrete is the baseline without word vectors. We use publicly-available vectors for HPCA, HSMN, and CW. See text for details.

Model     Dev    Test   ACE    MUC7
Discrete  91.0   85.4   77.4   73.4
SVD       90.8   85.7   77.3   73.7
SVD-S     91.0   85.5   77.6   74.3
SVD-L     90.5   84.8   73.6   71.5
HPCA      92.6   88.7   81.7   80.7
HSMN      90.5   85.7   78.7   74.7
CW        92.2   87.4   81.7   80.2
CBOW      93.1   88.2   82.2   81.1
GloVe     93.2   88.3   82.9   82.2

4.4 Model Analysis: Vector Length and Context Size

In Fig. 2, we show the results of experiments that vary vector length and context window. A context window that extends to the left and right of a target word will be called symmetric, and one which extends only to the left will be called asymmetric. In (a), we observe diminishing returns for vectors larger than about 200 dimensions. In (b) and (c), we examine the effect of varying the window size for symmetric and asymmetric context windows. Performance is better on the syntactic subtask for small and asymmetric context windows, which aligns with the intuition that syntactic information is mostly drawn from the immediate context and can depend strongly on word order. Semantic information, on the other hand, is more frequently non-local, and more of it is captured with larger window sizes.

4.5 Model Analysis: Corpus Size

In Fig. 3, we show performance on the word analogy task for 300-dimensional vectors trained on different corpora. On the syntactic subtask, there is a monotonic increase in performance as the corpus size increases. This is to be expected since larger corpora typically produce better statistics. Interestingly, the same trend is not true for the semantic subtask, where the models trained on the smaller Wikipedia corpora do better than those trained on the larger Gigaword corpus. This is likely due to the large number of city- and country-based analogies in the analogy dataset and the fact that Wikipedia has fairly comprehensive articles for most such locations. Moreover, Wikipedia's entries are updated to assimilate new knowledge, whereas Gigaword is a fixed news repository with outdated and possibly incorrect information.

Figure 3: Accuracy [%] on the semantic, syntactic, and overall analogy tasks for 300-dimensional vectors trained on different corpora (Wiki2010, 1B tokens; Wiki2014, 1.6B tokens; Gigaword5, 4.3B tokens; Gigaword5 + Wiki2014, 6B tokens; Common Crawl, 42B tokens).

4.6 Model Analysis: Run-time

The total run-time is split between populating X and training the model. The former depends on many factors, including window size, vocabulary size, and corpus size. Though we did not do so, this step could easily be parallelized across multiple machines (see, e.g., Lebret and Collobert (2014) for some benchmarks). Using a single thread of a dual 2.1GHz Intel Xeon E5-2658 machine, populating X with a 10 word symmetric window, a 400,000 word vocabulary, and a 6 billion token corpus takes about 85 minutes. Given X, the time it takes to train the model depends on the vector size and the number of iterations. For 300-dimensional vectors with the above settings (and using all 32 cores of the above machine), a single iteration takes 14 minutes. See Fig. 4 for a plot of the learning curve.

4.7 Model Analysis: Comparison with word2vec

A rigorous quantitative comparison of GloVe with word2vec is complicated by the existence of many parameters that have a strong effect on performance. We control for the main sources of variation that we identified in Sections 4.4 and 4.5 by setting the vector length, context window size, corpus, and vocabulary size to the configuration mentioned in the previous subsection. The most important remaining variable to control for is training time. For GloVe, the relevant parameter is the number of training iterations. For word2vec, the obvious choice would be the number of training epochs.
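Populating X with a symmetric or asymmetric context window, as described above, can be sketched as follows. This toy version stores X as a dictionary keyed by word pairs; `build_cooccurrence` is a hypothetical helper name, and the 1/d distance weighting follows the decreasing weighting function used in GloVe's released implementation.

```python
from collections import defaultdict

def build_cooccurrence(tokens, window=10, symmetric=True):
    """Accumulate a sparse word-word co-occurrence matrix X.

    A symmetric window counts context words on both sides of the
    target; an asymmetric one looks only to the left, matching the
    distinction drawn in the text. A context word d positions away
    contributes 1/d to the count.
    """
    X = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d >= 0:                         # left context
                X[(w, tokens[i - d])] += 1.0 / d
            if symmetric and i + d < len(tokens):  # right context
                X[(w, tokens[i + d])] += 1.0 / d
    return X
```

A production version would stream the corpus, map words to integer ids, and shard the accumulator, which is why this step parallelizes so easily across machines.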

Unfortunately, the word2vec code is currently designed for only a single epoch: it specifies a learning schedule specific to a single pass through the data, making a modification for multiple passes a non-trivial task. Another choice is to vary the number of negative samples. Adding negative samples effectively increases the number of training words seen by the model, so in some ways it is analogous to extra epochs.

For the same corpus, vocabulary, window size, and training time, GloVe consistently outperforms word2vec. To make this comparison, we train 300-dimensional vectors on the same 6B token corpus (Wikipedia 2014 + Gigaword 5) with the same 400,000 word vocabulary, and use a symmetric context window of size 10. We set any unspecified parameters to their default values, assuming that they are close to optimal, though we acknowledge that this simplification should be relaxed in a more thorough analysis.

In Fig. 4, we plot the overall performance on the analogy task as a function of training time, which is governed by the number of iterations for GloVe and by the number of negative samples for CBOW (a) and skip-gram (b). In all cases, GloVe outperforms word2vec: it achieves better results faster, and also obtains the best results irrespective of speed. We note that word2vec's performance actually decreases if the number of negative samples increases beyond about 10. Presumably this is because the negative sampling method does not approximate the target probability distribution well. In contrast, noise-contrastive estimation is an approximation which improves with more negative samples: in Table 1 of (Mnih et al., 2013), accuracy on the analogy task is a non-decreasing function of the number of negative samples.

Figure 4: Overall accuracy on the word analogy task as a function of training time, which is governed by the number of iterations for GloVe and by the number of negative samples for CBOW (a) and skip-gram (b). The two x-axes at the bottom indicate the corresponding number of training iterations for GloVe and negative samples for word2vec.

5 Conclusion

Recently, considerable attention has been focused on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods. Currently, prediction-based models garner substantial support; Baroni et al. (2014), for example, argue that these models perform better across a range of tasks. In this work we argue that the two classes of methods are not dramatically different at a fundamental level, since they both probe the underlying co-occurrence statistics of the corpus, but the efficiency with which the count-based methods capture global statistics can be advantageous. We construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. The result is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. Stanford University gratefully acknowledges the support of the Defense Threat Reduction Agency (DTRA) under Air Force Research Laboratory (AFRL) contract no. FA8650-10-C-7020 and the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under AFRL contract no. FA8750-13-2-0040. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DTRA, AFRL, DEFT, or the US government.

References

Tom M. Apostol. 1976. Introduction to Analytic Number Theory.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL.

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. JMLR, 3:1137–1155.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Dan C. Ciresan, Alessandro Giusti, Luca M. Gambardella, and Jürgen Schmidhuber. 2012. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2852–2860.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. JMLR, 12:2493–2537.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414. ACM.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In ACL.

Rémi Lebret and Ronan Collobert. 2014. Word embeddings through Hellinger PCA. In EACL.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In CoNLL-2014.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28:203–208.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL-2013.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop Papers.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In HLT-NAACL.

George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS.

Douglas L. Rohde, Laura M. Gonnerman, and David C. Plaut. 2006. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8:627–633.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34:1–47.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with Compositional Vector Grammars. In ACL.

Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL-2003.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.

Mengqiu Wang and Christopher D. Manning. 2013. Effect of non-linear deep architecture in sequence labeling. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP).