
An Exploratory Analysis of Word2Vec

Vijay Hareesh Yadav Avula
(vavula@umail.iu.edu)

Praneet Vizzapu
(pranvizz@indiana.edu)

Introduction: When this course started out, we were more focused on the exploratory data analysis part. As the classes progressed and we came to dimensionality-reduction topics such as PCA and CMDS, we pondered over which topics are currently being talked about in the industry; that is how we came across word2vec, and here we present our paper exploring it. In this report we investigate the significance of Word2Vec for research going forward and how it relates and compares to prior art in the field.

Arguably, research into word embeddings is one of the most interesting areas in the deep learning world at the moment. Most tasks in natural language processing and understanding involve looking at words, and could benefit from word representations that do not treat individual words as unique symbols but instead reflect the similarities and dissimilarities between them by turning them into vectors. Most prominent among these new techniques is a group of related algorithms commonly referred to as Word2Vec, which came out of Google research.

A vector is a quantity having direction as well as magnitude, especially one determining the position of one point in space relative to another. Word2vec assigns a vector to every word: it creates vectors that are numerical representations of word features, building word projections in a latent space of N dimensions (N being the size of the word vectors obtained). The float values represent the coordinates of the words in this N-dimensional space, and every dimension of the vector tries to encapsulate some property of the word; one dimension could group man, woman, king and queen as 'people', while another could associate them with 'gender' or 'royalty'. In short, word2vec detects similarities mathematically and can make accurate guesses about a word's association with other words, which is useful in many fields, e.g. sentiment analysis. The classic example is the analogy v_queen - v_woman + v_man ≈ v_king.
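As a concrete illustration, the query below is a minimal sketch of this analogy using the gensim library cited in the references. The model file name is only a placeholder for whatever pretrained Word2Vec model is available, and the gensim 4.x KeyedVectors interface (model.wv) is assumed.

    # Minimal sketch of the v_queen - v_woman + v_man ~ v_king analogy with gensim.
    from gensim.models import Word2Vec

    model = Word2Vec.load("wiki.en.word2vec.model")   # hypothetical path to a pretrained model
    wv = model.wv                                     # KeyedVectors holding the word embeddings

    # queen - woman + man should land near "king"
    print(wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))

    # Each word is an N-dimensional float vector whose values are its coordinates
    # in the latent space.
    print(wv["king"].shape)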

Word2vec does this without human intervention: given enough data, usage, and contexts, it learns the relationships between words automatically. Using large amounts of unannotated plain text, word2vec can make highly accurate guesses about a word's meaning based on its past appearances. Word meaning and the relationships between words are encoded spatially; word2vec converts literal words into statistical vectors, i.e. distributed numerical representations of word features.

Research Questions: Which relationships between words does Word2Vec capture, and can geometric properties of the learned vectors, such as principal-component weights, clusters of similar words, and the slopes of the line segments connecting analogous pairs, be used to draw meaningful inferences?

Methodology: Word2Vec and the concept of word embeddings originate in the domain of NLP, where the relevant features are things such as the context of individual words. However, as we shall see, the idea of words in the context of a sentence or a surrounding word window can be generalized to any problem domain dealing with sequences of sets of related data points.

In our implementation, the principal components were drawn for 100 components, and the first two components were plotted against a t-SNE plot reduced to two axes; surprisingly, the plots looked the same, and they are presented here. For each of the top ten principal components we produced two word cloud plots, depicting the positive and the negative direction of the component so that we would not miss the negative relations between words. The positive and negative here are nothing but a normalization of the component values: adding 1 to every value pushes the values above 0, while subtracting every value from 1 flips the order of the words and likewise keeps all the values above 0. The words with the top five individual component values, in either order, are also shown to draw some inference between the words. From the word clouds of principal component 10, some interesting pairs of similar words showed up, such as Inch - mile and Sister - brother.
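The dimensionality-reduction step described above could be sketched roughly as follows. It reuses the wv object from the earlier example, uses scikit-learn for PCA and t-SNE, and assumes the word vectors have at least 100 dimensions; the size of the vocabulary subset is an arbitrary choice.

    # Sketch of the reduction described above: PCA with 100 components (first two plotted)
    # side by side with a t-SNE embedding reduced to 2 axes.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    words = wv.index_to_key[:2000]                 # a manageable subset of the vocabulary
    X = np.array([wv[w] for w in words])

    X_pca = PCA(n_components=100).fit_transform(X)              # keep 100 principal components
    X_tsne = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].scatter(X_pca[:, 0], X_pca[:, 1], s=2)
    axes[0].set_title("First two principal components (of 100)")
    axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], s=2)
    axes[1].set_title("t-SNE reduced to 2 axes")
    plt.show()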

[Top-weighted words for principal components 57, 59, 60, 62 and 67, with their component values (roughly 1.38 to 1.88): consonant, vowel, syllable, plural, verb, sentence, speak, reply, oil, oxygen, molecule, gas, heat, ice, air, wind, solution, moon, decimal, chord, string, row, seat, minute, pass, notice, office, family, sister, brother, wife, children, baby, age, born, death, and others.]

However, the relation between the positive and negative frequencies was not found to be sufficient, and we changed the way we approached the problem. We now take a given word, find its 50 most similar words, and apply PCA to them to see how the clusters are formed. In a similar way we take another word that is similar to the first but does not appear among its top 50 similar words according to the gensim model, and apply the same method to it. Our approach is to make an inference between these two words and find similar words in this shared context: we would like to correlate the top 50 words of each cluster, form a possible relation out of them, and see whether that relation has any importance in the real world. But we have to see whether this works. For our example we took 'Apple' and 'apple' and found the 50 most similar words for each; the difference we observed is that the word with a capital 'A' refers to the major electronics company, while the word with a lowercase 'a' refers to the fruit that we come across in our daily life.
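A rough sketch of this per-word procedure, under the same assumptions as above (the wv KeyedVectors from gensim, plus the word_cloud package cited in the references), might look like the following. The function name is ours, the 50 neighbours and the normalizations follow the text, and the extra rescaling step is an assumption added so that both clouds receive positive weights.

    # Sketch: take the 50 nearest neighbours of a word, run PCA over their vectors,
    # and build "positive" and "negative" clouds from one principal component using
    # the two normalizations described above (value + 1, and 1 - value).
    import numpy as np
    from sklearn.decomposition import PCA
    from wordcloud import WordCloud

    def component_clouds(word, component=0, topn=50):
        neighbours = [w for w, _ in wv.most_similar(word, topn=topn)]  # 50 nearest neighbours
        X = np.array([wv[w] for w in neighbours])
        scores = PCA(n_components=10).fit_transform(X)[:, component]   # one component to visualise
        scores = 0.99 * scores / (np.abs(scores).max() + 1e-12)        # scale into (-1, 1)

        positive = {w: s + 1.0 for w, s in zip(neighbours, scores)}    # add 1 to every value
        negative = {w: 1.0 - s for w, s in zip(neighbours, scores)}    # subtract from 1, flipping the order
        return (WordCloud().generate_from_frequencies(positive),
                WordCloud().generate_from_frequencies(negative))

    company_clouds = component_clouds("Apple")   # the electronics-company sense
    fruit_clouds = component_clouds("apple")     # the fruit sense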

Analysis & Results:

The word cloud plots that we tried to analyse did not show any major correlation between the categories of words in the positive and negative word clouds, i.e. between the top and bottom frequencies: where the positive category had words mostly related to family, the negative category had words related to a chemistry lab, which were completely unrelated from any perspective. However, the apple and Apple PCA cluster plots did carry meaning, with a clustered set of words visible in the plot shown.

The next experiment was intended to obtain, for a company name, the corresponding fruit name as output, applying the same idea as the Man-Woman, King-Queen example:

    model.most_similar(positive=['apple', 'Raspberry'], negative=['Apple'], topn=1)
    [('peach', 0.5871326923370361)]

    model.most_similar(positive=['apple', 'BlackBerry'], negative=['Apple'], topn=1)
    [('strawberry', 0.6707682013511658)]

From the analogy, Raspberry got 'peach' as its most similar word and BlackBerry got 'strawberry'. We succeeded in getting a fruit name, but not the same fruit that is associated with the brand name given as input. We would also like to draw line segments between these pairs and observe their slopes and data points.
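One possible way to draw those line segments, again assuming the wv KeyedVectors from the earlier sketches and that all of the tokens exist in the vocabulary, is sketched below; the 2-D coordinates come from a PCA projection of just the words involved, which is our own choice of projection.

    # Sketch: project the company/fruit words to 2-D and draw the segment connecting
    # each pair, so the slopes discussed below can be inspected visually.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    pairs = [("Apple", "apple"), ("BlackBerry", "strawberry"), ("Raspberry", "peach")]
    words = sorted({w for pair in pairs for w in pair})
    points = PCA(n_components=2).fit_transform(np.array([wv[w] for w in words]))
    coords = dict(zip(words, points))

    for a, b in pairs:
        (x1, y1), (x2, y2) = coords[a], coords[b]
        plt.plot([x1, x2], [y1, y2], marker="o")     # line segment for this pair
        plt.annotate(a, (x1, y1))
        plt.annotate(b, (x2, y2))
    plt.title("2-D projection of company/fruit pairs")
    plt.show()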

These are some interesting results about the similarity: though the returned fruit was not very similar to the company name, the vectors that connected these pairs had similar negative slopes. But if we plot the actual fruit names against the company names, we get the following.

However.6%. the raspberry pair has a positive slope. just changing the middle ‘B’ to ‘b’ and we still get ‘strawberry’ as the most similar fruit. Another interesting experiment was made using these set by changing the ‘BlackBerry’ to ‘Blackberry’ i.Well. in this case. if we carefully observe the plots we still see a similarity in the slope of the vectors ‘Apple’’apple’ and ‘Blackberry’-’strawberry’(72. But. Which is apart from the remaining two vectors plotted. I think we could infer that the Word2Vec model tries to find out inference on the basis of slopes as well.8%) but ‘Blackberry’-’blackberry’ had a negative slope crossing the vector of ‘Apple’-’apple’ and still having the similarity of 30..9% compared to strawberry which was 67%. From this experiment. . That was surprising since we also thought that vectors that are parallel to each other might have higher or at least comparable similarity rates. even if the BlackBerry pair had similar slope to the apple pair. the model didn’t predict this pair when we tried to calculate the similarity. the result was just 8.e.


Conclusion: The complete scope of word2vec depends on the corpus that the user uses to train the model. Since the King, Queen, Man and Woman example is a generic one and can easily be found in the Wikipedia corpus, the model works correctly for it. However, the model did not work for the experiment that we made: though the Wikipedia model used for training covers a lot of topics, it essentially could not have covered these company-with-fruit-name combinations. Regarding the slopes of the vectors that we thought would be similar, we found that although pairs the model judges similar can have similar slopes, for analogies such as the man-woman-king-queen example the words that we believe are similar can have a lower probability and contradict the slope of the original analogy, as demonstrated in the 'Blackberry'-'blackberry' example. Thus, the inferences have to be tested well within the boundaries of the corpus the model uses to train.

Future Scope: 1. Word2Vec could be implemented over a Twitter feed to build a dynamic story selection that could be put in tabs. E.g. the highest-repeating word in one's feed could be transformed into a tab dynamically, making it easier for the user to navigate between the top trending topics of their personalised feed. 2. Feeding Word2Vec with a related corpus in which we would like to find analogies. 3. Relating the importance of the slopes of vectors in finding out the analogies between similar words.

References: 1. Word2Vec: https://code.google.com/p/word2vec/ 2. gensim: https://radimrehurek.com/gensim/ 3. Wiki trained model: https://github.com/idio/wiki2vec 4. Word cloud: https://github.com/amueller/word_cloud 5. English words: https://gist.github.com/deekayen/414874