

Word Embedding Using Neural Network
Anees Ahmed
Dept. of Computer Science
PAF KIET Institute of Technology
Karachi, Pakistan
Engr.ahmed@gmail.com

Muhammad Fahad
Dept. of Computer Science
PAF KIET Institute of Technology
Karachi, Pakistan
Swiftfahad2@gmail.com

Dr. Salman
Faculty member, Dept. of Computer Science
PAF KIET Institute of Technology
Karachi, Pakistan
@gmail.com

Abstract— Today's advanced fields such as speech recognition and computer vision are a gift of the neural network [1, 2]. Neural networks have also contributed to Natural Language Processing (NLP), a challenging domain in which they have nevertheless made excellent contributions [3]. For text-based applications, however, the performance of neural networks is still questionable. This study addresses that question by applying a neural network that integrates word embeddings into a text-based application. It also explores the applications, implicit benefits and future implementations of neural networks with text embeddings, i.e. dense vector representations of words formed by identifying similar words in a vector space through NLP.
I. INTRODUCTION
Dealing with traditional NLP is difficult because it requires deep knowledge of linguistics in addition to domain expertise; whole linguistic classes, for example, are dedicated to terms such as morphemes and phonemes.

This study uses the Word2vec model to implement word embeddings, i.e. vector representations of words. Word2vec is first applied as a preprocessing step, and the learned vectors are then passed on to a discriminative neural model for predictive analysis.

A. Neural Network
In the recent past, a wide range of neural-network applications can be seen, from image recognition to natural language processing to time-series forecasting [2]. In our study the main focus of deep learning is the embedding, a method used to represent discrete variables as continuous vectors. This technique has found practical applications in word embeddings for machine translation and in entity embeddings for categorical variables [2].

B. Word Embedding
A word embedding is a dense vector representation of words and documents. The approach is an enhanced form of the old bag-of-words model, in which a score was assigned to each word and entire vocabularies were represented as large sparse vectors. The problem with that representation is sparseness: the vocabulary vectors are filled with many zeros. This study instead uses a dense representation, in which a vector is the projection of a word into a continuous vector space. The main thing learned is the position of a word within the vector space, derived from the text based on the surrounding words; this position is referred to as the word's embedding.

It is quite easy to identify the relation between different entities when the data is in the form of audio, video or images, because these types of data are high-dimensional, rich vectors encoded with raw pixel and intensity information; it is easy, for example, to identify the relation between a cat and a dog.

In natural language processing this becomes quite difficult, because traditionally "cat" and "dog" were treated as separate discrete atomic symbols, represented for example as Id537 and Id143 respectively. Such ID encoding provides no useful relation between the individual words, so a model can leverage only the minimal information it has captured about the cat while processing the dog (such as that both are animals and both are pets).

In this study we therefore represent each word as a vector rather than as a discrete, unique ID, which reduces data sparseness and helps to train better statistical models.

C. Word Vectors
In the word-vector strategy each word is represented as a d-dimensional vector. The approach used to arrange the words is known as a co-occurrence matrix; this matrix records all the words that appear next to each other in the corpus (data set), e.g. for the sentence "I love NLP and I like ...".

Table 1: Co-Occurrence Matrix
Table 1 clearly shows that the dimensionality of each word vector increases linearly as we increase the size of the corpus. As the size increases we face the problem of sparseness: if we had a million words we would have to create a million-by-million matrix, and the span of zeros grows. The big drawback is a great deal of wasted memory. The Word2vec model optimizes this approach and provides a more compact way of representing words.
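As an illustration (not taken from the paper), a window-based co-occurrence matrix of the kind summarized in Table 1 can be built as follows; the ±1 window and the truncated example sentence are assumptions made only for this sketch.

```python
# Minimal sketch: build a word-word co-occurrence matrix with a +/-1 window.
def co_occurrence(tokens, window=1):
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for pos, word in enumerate(tokens):
        for ctx in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
            if ctx != pos:
                counts[index[word]][index[tokens[ctx]]] += 1
    return vocab, counts

vocab, counts = co_occurrence("I love NLP and I like".split())
for word, row in zip(vocab, counts):
    print(f"{word:>5}", row)
```

Each row of the printed matrix is the sparse count vector of one word, which is exactly the representation whose size grows with the corpus.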
D. Vector Space Model
The model that embeds words in a continuous vector space is known as the vector space model. It maps similar words to nearby points. Natural language processing and the vector space model share a long, rich history; there are several methods for NLP, and all of them depend on the distributional hypothesis, according to which words that appear in, and share, the same contexts have similar meanings. The approaches fall into two categories: count-based approaches (e.g. Latent Semantic Analysis) and predictive methods (e.g. neural probabilistic language models).
Fig. 1: Dense representation of vectors


In the first (count-based) approach we count the number of co-occurrences of a word with its neighboring words within the whole stream of words, and then map each word's count vector to a small, dense vector. Predictive models instead try to predict a word directly from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).
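To make the count-based route concrete, a common way to map the large count vectors to small, dense vectors is a truncated SVD of the co-occurrence matrix; the toy matrix and the target dimensionality d = 2 below are assumptions for illustration, not values from the paper.

```python
# Sketch: reduce sparse co-occurrence counts to small dense vectors via truncated SVD.
import numpy as np

words = ["I", "love", "NLP", "and", "like"]
# Toy symmetric co-occurrence counts (rows/columns follow `words`).
counts = np.array([
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=float)

d = 2                                # target embedding size
U, S, Vt = np.linalg.svd(counts)     # full SVD of the count matrix
dense = U[:, :d] * S[:d]             # keep the top-d components per word

for w, vec in zip(words, dense):
    print(w, np.round(vec, 2))
```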
E. Word2Vec
Word2vec is a new, computationally efficient method for learning word representations from raw text. The model can be implemented using two techniques:
● Continuous Bag of Words (CBOW)
● Skip-Gram model
The two techniques are quite similar; the difference lies in the approach. CBOW analyses the source text and predicts the target word, while skip-gram predicts the source (context) words from the target word [3].
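A minimal usage sketch of these two training modes, assuming the gensim library (version 4.x) and a tiny toy corpus; none of the hyperparameter values come from the paper.

```python
# Sketch: train word2vec in skip-gram and CBOW mode with gensim (assumed dependency).
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["i", "love", "nlp", "and", "i", "like", "word", "embeddings"],
]

skip_gram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
cbow      = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 -> CBOW

print(skip_gram.wv["fox"][:5])            # first few dimensions of the learned vector
print(skip_gram.wv.most_similar("fox"))   # nearest neighbours in the vector space
```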
F. Neural Probabilistic Model
Neural probabilistic language models are traditionally trained using the maximum-likelihood principle to maximize the probability of the next word w_t (for "target") given the previous words h (for "history") in terms of a softmax function:

P(w_t | h) = softmax(score(w_t, h)) = exp(score(w_t, h)) / Σ_{w' ∈ V} exp(score(w', h))

where score(w_t, h) computes the compatibility of the target word w_t with the context h (a dot product is commonly used).
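The softmax above can be sketched directly in code; the dot-product score, the vocabulary size and the random vectors below are assumptions used only to illustrate the formula.

```python
# Sketch: P(w_t | h) = exp(score(w_t, h)) / sum_w' exp(score(w', h)), score = dot product.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
output_embeddings = rng.normal(size=(vocab_size, dim))  # one vector per candidate word
h = rng.normal(size=dim)                                # representation of the history

scores = output_embeddings @ h                          # score(w, h) for every word w
scores -= scores.max()                                  # numerical stability
probs = np.exp(scores) / np.exp(scores).sum()           # softmax over the vocabulary

target = 3                                              # hypothetical index of w_t
print("P(w_t | h) =", probs[target])
print("negative log-likelihood =", -np.log(probs[target]))
```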
II. LITERATURE REVIEW
Lilleberg, Joseph demonstrates the comparison of tf-idf with word2vec used in combination with tf-idf. The results suggest that the two techniques in combination can outperform tf-idf alone because of the complementary features they offer [3].

III. METHODOLOGY
How a neural network is combined with word embeddings depends on the nature of the task, although the general encoding is the same. The vector representing each word is passed directly to an embedding layer; several hidden layers follow, and one output layer presents the results. The word embedding also updates itself through a self-learning process that depends on the specific model. The actual learning operates on the weights that represent the connections and is achieved through back-propagation.

Fig. 2: Layered Architecture

These values are then compared with a one-hot encoding; that transformation is unsupervised, which could be a big issue. The embedding keeps improving gradually as the learning process is carried out by the neural network on a supervised task.

To perform parameter-based embedding, the weights are adjusted to keep the loss as low as possible. The resulting vector represents the category it belongs to, and all the other vectors of that category remain close to each other.

For example, from a vocabulary of 50,000 words collected from movie reviews, a 100-dimensional embedding for each word can be generated using a neural network, and these dimensions are trained to predict the sentiment of the reviews. Words such as "Brilliant" or "Excellent" lie in the same category and come closer together in the vector space, indicating a positive review, on the basis of what the network has learned previously.
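A minimal sketch of such a setup, assuming TensorFlow/Keras and the 50,000-word vocabulary and 100-dimensional embedding mentioned above; the pooling layer, padding length and dummy training data are placeholders, not details from the paper.

```python
# Sketch: an embedding layer trained end-to-end on a supervised sentiment task.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

vocab_size, embed_dim, review_len = 50_000, 100, 200

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embed_dim),  # 50,000 x 100 embedding table
    GlobalAveragePooling1D(),                               # average the word vectors of a review
    Dense(1, activation="sigmoid"),                         # positive / negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy integer-encoded reviews and labels, just to show the expected shapes.
x = np.random.randint(1, vocab_size, size=(32, review_len))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, verbose=0)

word_vectors = model.layers[0].get_weights()[0]  # the learned 50,000 x 100 embedding matrix
print(word_vectors.shape)
```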
A. Skip-gram (SG) Algorithm
The skip-gram representation is considered more accurate than older traditional methods such as bag-of-words. The main reason for this accuracy is its more generalized context generation [5].
B. Implementation
Each word gets two initial random vectors of size d, named the center vector and the context vector. For a center word, we predict which context words c_j ∈ V appear around it; for example, given "fox" we predict "quick", "brown", "jumps" and "over".
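The "given fox, predict quick, brown, jumps, over" example corresponds to generating (center, context) training pairs with a symmetric window; the sketch below assumes a window of 2 and is illustrative only.

```python
# Sketch: enumerate (center, context) pairs for skip-gram with a symmetric window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, window=2):
    if center == "fox":
        print(center, "->", context)   # fox -> quick, brown, jumps, over
```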

Fig. 3: Skip-Gram Model

C. Algorithm

● A single-layer neural network with an identity transfer function in the hidden layer.
● The input x to the network is a one-hot vector that identifies the center word.
● V is the vocabulary matrix, which contains a d-dimensional vector w_i for each word; x · V just returns the i-th row of V (w_i), where i is the index of the 1 in the x vector.
● The hidden layer therefore contains the d-dimensional dense vector of the center word (its word representation, in the case of skip-gram).
● The output classes are the words of the vocabulary.
● We normalize the output values to probabilities by using the softmax function.
● Based on the center word w_i we are trying to predict which word c_j ∈ V appears in its context.

Fig. 4: Skip-Gram Sketch

We can then summarize our neural network as follows:

p(c_j | w_i) = softmax(w_i^T · c_j) = exp(w_i^T · c_j) / Σ_{c_l ∈ V} exp(w_i^T · c_l)    (Eq. 1)

D. Skip Gram Model
(Eq. 2)
We predict that the context word c_j is the one with the largest probability p(c_j | w_i). Since softmax is a monotonic (order-preserving) function, we actually maximize the dot product w_i^T · c_j:

c* = argmax_{c_j ∈ V} p(c_j | w_i) = argmax_{c_j ∈ V} w_i^T · c_j    (Eq. 3)

The dot product is larger if the vectors u and v are more similar (cosine similarity). In the model, we actually iterate over the vocabulary V and find the most similar word to w_i.
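The bullet points and Eqs. (1)–(3) can be summarized in a short NumPy sketch; the vocabulary, dimensionality and random matrices are assumptions for illustration, and the vectors are random rather than trained.

```python
# Sketch of the skip-gram forward pass: one-hot lookup, dot products, softmax, argmax.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
d = 4
V = rng.normal(size=(len(vocab), d))   # center-word vectors (one d-dim row w_i per word)
C = rng.normal(size=(len(vocab), d))   # context-word vectors c_j

i = vocab.index("fox")
x = np.zeros(len(vocab)); x[i] = 1.0   # one-hot input identifying the center word
w_i = x @ V                            # x . V returns the i-th row of V (the hidden layer)

scores = C @ w_i                       # dot product w_i^T . c_j for every candidate c_j
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # softmax -> p(c_j | w_i), as in Eq. (1)

best = int(np.argmax(probs))           # Eq. (3): the most probable / most similar context word
print(vocab[best], probs[best])
```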

Fig. 5: Cosine Similarity of words

IV. COMPARATIVE ANALYSIS AND DISCUSSION
Some simple factors can be used to measure the performance of the skip-gram neural network. Equation 1 shows that the output of the first hidden layer is the product of the input vector of each center word c_j with the input vector of each context word c_l, with a softmax applied over the products. Setting a cost of 1 for each arithmetic operation, we can obtain the cost of each neuron: the operations performed between each center word and its context words amount to 2V per softmax operation.
For example, for 1000 input words there will be:

N = V * Cj + softmax(V * C) + (V * C) + softmax(V * C)
N = 1000 * 1000 + (1000) + (1000 * 1000) + 1000
N = (10^12 + 2 × 10^3) for a single neuron
N = 2 × 10^16 for 10 neurons per layer
N = 2 × 10^58 for 1 input, 2 hidden and 1 output layer

So the running cost of our neural-network embeddings will be 2Vc, where V is the input sample.

Linear memory space is required for each word, because only one extra floating-point value is needed to store the computation at each step of neuron processing:

Memory = V * [length of max word] * 8 bytes (one double-precision floating-point number)
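Plugging illustrative numbers into the memory formula (the vocabulary size and maximum word length below are assumptions, not values from the paper):

```python
# Sketch: evaluate Memory = V * [length of max word] * 8 bytes for assumed values.
V = 50_000            # vocabulary size (assumed, matching the earlier movie-review example)
max_word_len = 12     # length of the longest word (assumed)
bytes_per_value = 8   # 8-byte floating-point value, as in the formula above

memory_bytes = V * max_word_len * bytes_per_value
print(f"{memory_bytes:,} bytes = {memory_bytes / 1_000_000:.1f} MB")
```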

V. CONCLUSION

Neural network embeddings provide a low-dimensional representation of discrete data as continuous vectors. Traditional encoding methods have gained a new spark, and limitations such as visualization, mapping and finding nearest neighbours are overcome by this new approach.

Neural network embeddings are quite intuitive and simple from an implementation point of view. We firmly believe that anyone can use word embeddings to represent discrete variables and thereby present a useful application of deep learning.

REFERENCES
[1] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2010, pp. 253–256.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[3] J. Lilleberg, Y. Zhu, and Y. Zhang, "Support vector machines and word2vec for text classification with semantic features," in 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE, 2015.
[4] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.
[5] M. Tutek (2017), "Word Embeddings and Neural Networks for Natural Language Processing." [Available online at] https://www.fer.unizg.hr/_download/repository/TAR-07-WENN.pdf [Accessed on] 05 December 2018.
[6] Skymind (n.d.), "A Beginner's Guide to Word2Vec and Neural Word Embeddings." [Available online at] https://skymind.ai/wiki/word2vec [Accessed on] 06 December 2018.
[7] J. Brownlee (2017), "How to Use Word Embedding Layers for Deep Learning with Keras." [Available online at] https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/ [Accessed on] 06 December 2018.
