
[1] This work proposes sentiment classification using word2vec, a word embedding model that attains remarkable performance. It uses the two word2vec training algorithms, CBOW and Skip-gram, to capture semantic context regardless of word order, and experiments with Twitter airline posts. Word2vec works analogously to the human mind, which uses word association to identify likely word combinations. Every tweet is limited to 140 characters, which makes pattern recognition difficult. Stylometry is the literary style that differentiates one author from another; it is not restricted to a user's vocabulary but also covers syntax and spelling. The experimental analysis not only performs sentiment classification but also analyzes the similarities between posts to identify their origin. The use of word embeddings means it is not necessary to create stylometry-based features for sentiment classification.

Earlier work used the CBOW and Skip-gram (SG) models for continuous vector representations of extensive datasets. The efficacy of these model architectures was analyzed against the Feedforward Neural Net Language Model, the Recurrent Neural Net Language Model, parallel training of neural networks, and the new log-linear models CBOW and SG. The simple CBOW and SG architectures outperformed the others owing to their low computational complexity and high accuracy over extensive datasets. Another work applied different ML algorithms with a lexicon-based approach. Pang and Lee stated that SVM outperformed the other classifiers for sentiment classification. Maas et al. used supervised and unsupervised learning for document-level sentiment classification, keeping the work of Pang and Lee, who contributed sentence-level sentiment classification, as the benchmark. Go et al. removed emoticons and non-word tokens, then applied ML algorithms and reported that a Maximum Entropy classifier reaches 83% accuracy using both unigrams and bigrams. Hashtags differentiate social media text from other text data: Kouloumpis et al. used a dataset containing hashtags and reported an improvement in sentiment analysis, but the inclusion of emoticons degrades performance compared with hashtags. Lilleberg compared word2vec against TF-IDF and stated that word2vec weighted by TF-IDF performs best for sentiment analysis; he also noted that accuracy depends on the number of categories and degrades as the number of categories grows. Wang et al. showed the efficacy of TF-IDF with a six-tuple vector model in conjunction with a High Adverb of Degree Count (HADC) and demonstrated that the six-tuple model with HADC performs best for sentiment analysis, although it was evaluated only on Chinese reviews and not tested on English text. Sarkar et al. stated that Naive Bayes is not effective on its own but gives more accurate results when used in conjunction with other classifier models. They also used a two-step feature selection method that first selects the most frequently used words (identified with a univariate chi-squared test) and then applies clustering to reduce the feature space, outperforming more traditional methods based on greedy search and wrapper or correlation-based feature selection.
The proposed work uses an imbalanced Twitter dataset with 15 attributes and three classes: positive, negative and neutral. The data description is also discussed; the analysis is skewed toward the negative class because it has the most instances. The data partition is set to 70:30. The data are pre-processed by removing stop words, transforming to lowercase, tokenizing sentences into words, and building word vectors using the word2vec representation. The NLTK Punkt tokenizer is used to split the words, characters and symbols in each sentence. Two algorithms are used, CBOW and Skip-gram, and their performance can be enhanced with parameter tuning. The dimensionality of the feature vectors is set to 300, the context window size to 10 and the minimum word count to 5, and the model is tested with both hierarchical softmax and negative sampling. Hierarchical softmax is chosen because its accuracy is higher.
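A minimal sketch of this word2vec configuration, assuming the Gensim 4.x API; `tokenized_tweets` (a list of token lists) is a hypothetical pre-processed input:

```python
from gensim.models import Word2Vec

# Sketch only: parameters follow the settings described above.
model = Word2Vec(
    sentences=tokenized_tweets,  # hypothetical list of tokenised tweets
    vector_size=300,             # dimensionality of the feature vectors
    window=10,                   # context window size
    min_count=5,                 # minimum word count
    sg=1,                        # 1 = Skip-gram, 0 = CBOW
    hs=1, negative=0,            # hierarchical softmax; swap for negative sampling to compare
)
print(model.wv["flight"].shape)  # (300,)
```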
Before training, the average vectors are scaled using the scikit-learn scale function. An average feature vector per tweet is used because tweets have different lengths, and the averaged vectors are scaled to obtain more accurate results. A random oversampler is used to select random data instances until an equal number of instances is present in every class. Several classifiers (LR, NB and SVM) are used, and their performance is measured using scikit-learn metrics.
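A hedged sketch of this pipeline (averaging, scaling, oversampling, classification); `model`, `tokenized_tweets` and `labels` are hypothetical inputs from the previous steps, and the oversampler comes from the imbalanced-learn package:

```python
import numpy as np
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import RandomOverSampler

def average_vector(tokens, w2v, dim=300):
    # Average the word2vec vectors of the tokens present in the vocabulary.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = scale(np.vstack([average_vector(t, model) for t in tokenized_tweets]))
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, random_state=42)            # 70:30 split

# Randomly duplicate minority-class instances until the classes are balanced.
X_train, y_train = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

for clf in (LogisticRegression(max_iter=1000), GaussianNB(), SVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```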
The similarity between tweet posts is assessed with a Gensim TF-IDF similarity model combined with LSI to analyze the keywords and topics of the tweets. The limitations of the proposed work are the 140-character limit per post and the uneven class distribution. The similarity between tweet sample 0 and the rest of the samples is high because of the 140-character limit, the large sample size, and the overlapping context of the tweets. After removal of stop words, the LSI model analyzes the latent semantic inclination and generalizes the topic of each post. The cosine similarity measure is used to find the similarities between sample 0 and the remaining samples, and the TF-IDF measure produces output values greater than 0 and less than 1. The smaller the angle between two text vectors, the more similar the texts; if the angle is 90 degrees, the two texts are dissimilar.
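A small sketch of this similarity step with Gensim's TF-IDF and LSI models; `tokenized_tweets` is again a hypothetical pre-processed input and `num_topics=10` is an assumed value:

```python
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(tokenized_tweets)
bow = [dictionary.doc2bow(t) for t in tokenized_tweets]

tfidf = models.TfidfModel(bow)                                  # weights between 0 and 1
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=10)

index = similarities.MatrixSimilarity(lsi[tfidf[bow]])          # cosine similarity index
query = lsi[tfidf[dictionary.doc2bow(tokenized_tweets[0])]]     # tweet sample 0
sims = index[query]                                             # similarity to every tweet
print(sorted(enumerate(sims), key=lambda x: -x[1])[:5])         # five most similar tweets
```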
Social media platforms such as Facebook and Twitter provide feedback on stakeholder satisfaction with products, which gives companies and governments an enormous amount of data; this makes Twitter an ideal platform for sentiment analysis. A challenge of Twitter posts is the 140-character limit: after subtracting the length of the user's Twitter handle, there is not much space left to complete a proper sentence, so users resort to abbreviations, slang and hashtags. The use of word2vec avoids manually created stylometry-based features for classifying sentiments correctly. Another challenge is the imbalanced dataset. ML algorithms such as NB, SVC and LR classify 4,000 test instances after training on 10,000 instances. Among the classifiers, SVC yields 72% accuracy using Skip-gram word2vec as the feature representation, and the results can be further improved by proper fine-tuning of the parameters.
[2] Le and Fokkens stated that taxonomy-based approaches are more reliable than corpus-based approaches in reproducing human word-similarity rankings, whereas vector (distributional) models give better coverage. The similarities of adjectives are identified by finding the shortest path distance between the derivationally related noun forms of the adjectives. A hybrid method combines the taxonomy-based and vector-based approaches to acquire the best of both in terms of reliability and coverage. Le and Fokkens experimented with the SimLex-999 dataset to identify the similarities between word pairs without being affected by the relatedness of the words. They showed that the taxonomy-based approach outperforms the corpus-based one, because the corpus-based approach is affected by association whereas taxonomy-based approaches use vertical relations that are well suited for determining similarity. They proposed three WordNet-based adjective similarity measures and assessed them on SimLex-999, which requires attention to how adjectives are represented in WordNet. Ranking adjective pairs by their similarity is more important than assigning a specific number to each pair, so Spearman correlation is used to evaluate the similarity estimates.
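A toy illustration of why Spearman correlation fits this evaluation: only the ordering of the pairs matters, not the absolute scores (both lists below are hypothetical):

```python
from scipy.stats import spearmanr

gold  = [9.2, 7.5, 6.1, 3.3, 1.0]        # hypothetical human similarity ratings
model = [0.81, 0.77, 0.40, 0.35, 0.05]   # hypothetical model similarity scores

rho, _ = spearmanr(gold, model)
print(rho)   # 1.0: the rankings agree perfectly even though the scales differ
```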
The authors insisted that using multiple assessment methods may lead to different conclusions about the results. They also proposed using ordering accuracy with a tie correction, awarding a partial similarity score to word pairs involved in ties; taxonomy-based approaches have more ties than corpus-based approaches. The intuition behind this proposal is that the overall ranking is more important than arbitrary local differences. Le and Fokkens' comparison by group, where adjective pairs are grouped by the difference in their gold-standard similarity scores, is useful for seeing how well different models perform at varying levels of granularity. HSO and LESK are two classical measures that perform well. The proposed method based on the derivationally related forms (DRF) associated with adjective lemmas gives good results but suffers from limited coverage. Another approach using WordNet attributes can be used as an alternative, but it is not feasible to incorporate attributes in distance metrics. Finally, a hybrid method that combines the taxonomy-based and vector-based models is used. Resnik applies these measures to all senses of each word and takes the highest similarity score. Hill et al., in their judgement task, looked for similarities and were biased toward selecting the most similar senses; this is strengthened by the LESK results, which yield a stronger correlation of rho = 0.51, while the correlation of the HSO scores with SimLex-999 almost doubled to rho = 0.45. The adjective similarity between derivationally related forms is represented in WordNet, and the similarity between adjectives is a function of the properties they describe. There are 111 adjective pairs in SimLex-999 with which to assess the performance of this measure; to perform the evaluation, all adjective pairs for which WordNet 3.0 specifies derivationally related nouns (DRNs) are selected, which results in 88 adjective pairs and 89 different adjectives. The distance measure is defined as follows:
For adjectives A and B, collect the lists of all synsets corresponding to A and B.
From these, generate two lists of derivationally related nouns, DRN_A and DRN_B.
The distance between A and B is then
$$\mathrm{distance}(A,B)=\min\{\mathrm{distance}(x,y):\langle x,y\rangle\in \mathrm{DRN}_A\times \mathrm{DRN}_B\},$$
where distance(x, y) is the WordNet shortest path distance.
The authors predicted that this distance would correlate with the gold-standard similarities, and the expectation is validated by the results: the measure has a Spearman correlation of rho = -0.64 with SimLex-999. However, derivationally related nouns are available for only 41% of the adjective pairs, which limits the coverage of this technique.
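A hedged sketch of this measure with NLTK's WordNet interface; the word pair in the example is arbitrary, and the function returns None when either adjective has no derivationally related noun, reflecting the coverage limitation just mentioned:

```python
from nltk.corpus import wordnet as wn

def drn_synsets(adjective):
    # Collect the noun synsets derivationally related to any sense of the adjective.
    nouns = set()
    for synset in wn.synsets(adjective, pos=wn.ADJ):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == "n":
                    nouns.add(related.synset())
    return nouns

def drf_distance(adj_a, adj_b):
    # Minimum shortest-path distance over all pairs of derivationally related nouns.
    distances = [a.shortest_path_distance(b)
                 for a in drn_synsets(adj_a)
                 for b in drn_synsets(adj_b)]
    distances = [d for d in distances if d is not None]
    return min(distances) if distances else None   # None signals a coverage gap

print(drf_distance("happy", "glad"))
```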
Two attribute-based approaches were also tried, but neither produced a significant correlation with the SimLex-999 data:
i) take the shortest path distance between all attributes of the first/all senses of A and B;
ii) use the size of the overlap between the sets of attributes of A and B.
WordNet 3.0 has only 620 adjectives that have attributes at all, and on average each adjective has 1.03 attributes. In sum, to obtain attribute-based similarity measures it is necessary to map adjectives to all their possible attributes; Allen describes a method to learn automatically from WordNet glosses which attributes an adjective can describe. The hybrid model combines WordNet and vector-based similarities as follows:
Generate similarity values for all pairs using WordNet and another approach X, giving two lists of similarity values, Lw and Lx.
Sort both lists to obtain a ranking of all pairs; in Lw there will be many pairs with the same rank.
Create a list Lo, initially a copy of Lw, and use the values from Lx as a tie-breaker so that all pairs have a unique rank.
Iterate over all pairs p in Lx that do not occur in Lw. The first pair is a special case: if p is the first item of Lx, put it at the start of Lo. Otherwise, take the pair immediately preceding p in Lx, look up its position in Lo, and insert p immediately after that position in Lo.
The result Lo is a sorted list that maintains the structure of Lw but also contains all the pairs under consideration. For the SimLex dataset, the hybrid approach achieves a correlation of rho = -0.62, compared with rho = -0.58 for Baroni et al.'s vectors alone.
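A hedged Python sketch of this merge (purely illustrative; `wn_scores` and `vec_scores` are hypothetical dictionaries mapping word pairs to similarity values, with the vector model covering all pairs):

```python
def hybrid_ranking(wn_scores, vec_scores):
    # Pairs covered by WordNet, ranked by WN score with the vector score as tie-breaker (Lo).
    covered = [p for p in vec_scores if p in wn_scores]
    ordering = sorted(covered, key=lambda p: (-wn_scores[p], -vec_scores[p]))

    # Pairs only the vector model covers, visited in vector-ranked order (Lx).
    vec_order = sorted(vec_scores, key=lambda p: -vec_scores[p])
    for i, p in enumerate(vec_order):
        if p in wn_scores:
            continue
        if i == 0:
            ordering.insert(0, p)                       # special case: best pair overall
        else:
            prev = vec_order[i - 1]                     # pair immediately preceding p in Lx
            ordering.insert(ordering.index(prev) + 1, p)
    return ordering                                     # keeps Lw's structure, covers all pairs

pairs = {("old", "new"): 0.1, ("smart", "clever"): 0.9, ("happy", "glad"): 0.8}
wn = {("old", "new"): 2.0, ("smart", "clever"): 2.0}    # toy WordNet scores with a tie
print(hybrid_ranking(wn, pairs))
```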
Precision is gained by involving DRF in the estimation of similarity values, as measured by Spearman correlation; the DRF and vector-based methods achieve comparable results, and the DRF method has an advantage over the vector-based one. When the differences between two word pairs are small, the vector-based approach seems to have the upper hand in determining which pair is more similar; when the differences are larger, the hybrid approach seems better at determining which pair is more similar. With the tie correction, the vector-based approach has better ordering accuracy: a score of 0.5 is awarded for a tie, and because DRF produces more ties, its average score is drawn toward 50%.
New information could also be added to adjective synsets to increase coverage. Adjective hierarchy: GermaNet organizes adjectives in a hierarchy using hyponymy relations, which would allow WordNet distance metrics to be applied directly to adjective synsets, but the mapping between GermaNet and the Princeton WordNet is still incomplete. Add new cross-POS relations: two types of cross-POS links are available in WordNet, attributes and derivationally related forms, and a more diverse set of relations between adjectives and nouns would help. EuroWordNet has xpos_near_synonym, xpos_has_hyperonym and xpos_has_hyponym relations that act as access points into the noun hierarchy, and WordNet.PT has similar relations. Including links like the derivationally-related-to relation would encode similar information without requiring the two words to resemble each other morphologically; adding such relations gives better coverage and good scores. Add domain information: each synset in WordNet is related to a particular domain. Like the property-of relation, domain information does not seem to be helpful for the actual ranking, but knowing whether two adjectives are associated with the same domain may serve as a useful bias.

In summary, WordNet-based measures of adjective similarity such as LESK and HSO were considered, along with two new measures based on specific cross-POS links and the shortest path distance between the nouns they are related to. DRF is used to obtain state-of-the-art results on SimLex-999, and when coverage is an issue the hybrid model is better than the vector-based model alone; on inspection, these two measures do not seem to capture the same information. Future work will focus on combining distributional and taxonomy-based measures. Another way to improve similarity estimation is to extend WordNet with new information: the attributes relation seems unusable for similarity-related work at present but may become useful if more attribute links are added to WordNet. A lot of promising work is being done with other wordnets to explore the relation between WordNet and lexical similarity.
[3] The Skip-gram model learns high-dimensional word vectors that capture the semantic relationships and linguistic regularities that exist between words. An issue with the Skip-gram model is that it does not take lexical ambiguity into account and maintains a single representation for each word. Several variants of the Skip-gram model overcome this issue with multi-prototype word representations, but they require a fixed number of senses or learn the senses through greedy heuristics. The Adaptive Skip-gram (AdaGram) model, a non-parametric Bayesian extension of Skip-gram, automatically learns the required number of representations per word at the desired resolution. The authors derive an online variational learning algorithm for the model and empirically show its efficacy on a word sense induction (WSI) task.

Many NLP applications build on continuous-valued word representations: they are given as input to higher-level algorithms in a text-processing pipeline and help overcome lexical sparseness. These input features have semantic properties, and relationships between the concepts represented by words are preserved. Current deep learning methods adopt neural networks to learn such word representations; the CBOW and Skip-gram models are used to obtain distributed high-dimensional feature vectors. They are efficient and can process text data in an online streaming setting. However, both CBOW and Skip-gram retain only a single representation per word and therefore fail to represent words with two or more senses, such as the word "apple": either the most frequent sense dominates or the senses are mixed, and neither scenario is suitable for real applications. Multi-prototype word representations, an unsupervised approach that maintains multiple representations per word, capture the different senses of a single word. WSI automatically identifies the different meanings of a word, distinguishing the senses by separate representations, where a sense is a distinguishable interpretation of an identically spelled word. WSI is related to WSD, whose objective is to choose the appropriate sense for an ambiguous word from a sense inventory based on its context; the sense inventory is obtained from WSI or provided as external information. NLP applications thereby gain the capability to deal with lexical ambiguity. Word representations are widely used as features in dependency parsing, NER, sentiment analysis and similar tasks, and multi-prototype representations could enhance the performance of such representation-based approaches. In this work the Adaptive Skip-gram representation is used to combine fast online learning with high-quality word representations while automatically learning the number of prototypes per word at the desired semantic resolution.

The Skip-gram model is defined as a set of grouped word-prediction tasks: it predicts a target word given another word, using their input and output representations respectively:
$$p(v\mid w,\theta)=\frac{\exp(\mathrm{in}_w^{\top}\mathrm{out}_v)}{\sum_{v'=1}^{V}\exp(\mathrm{in}_w^{\top}\mathrm{out}_{v'})},$$
where the global parameters $\theta=\{\mathrm{in}_v,\mathrm{out}_v\}_{v=1}^{V}$ are the input and output representations for all words of the dictionary, indexed $1,\dots,V$. Both input and output representations are real vectors of dimensionality D.
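A tiny numpy sketch of this softmax (toy sizes V and D with random vectors, not trained parameters) makes the O(V) cost of each prediction explicit:

```python
import numpy as np

V, D = 1000, 300                                  # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
IN = rng.normal(scale=0.1, size=(V, D))           # input representations in_w
OUT = rng.normal(scale=0.1, size=(V, D))          # output representations out_v

def p_context_given_word(w):
    scores = OUT @ IN[w]                          # in_w . out_v for every v
    scores -= scores.max()                        # numerical stability
    e = np.exp(scores)
    return e / e.sum()                            # full softmax: O(V) per prediction

probs = p_context_given_word(42)
print(probs.shape, probs.sum())                   # (1000,) 1.0
```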
The Skip-gram model predicts the context words y given a word x:
$$p(y\mid x,\theta)=\prod_{j}p(y_j\mid x,\theta).$$
An input text consisting of N words $o_1,o_2,\dots,o_N$ is treated as a sequence of input words $X=\{x_i\}_{i=1}^{N}$ and their contexts $Y=\{y_i\}_{i=1}^{N}$. The i-th training object is $\{x_i,y_i\}$, where $x_i=o_i$ and $y_i=\{o_t\}_{t\in C(i)}$, with $C(i)$ the set of indices t such that $|t-i|\le C/2$ and $t\ne i$.
The Skip-gram objective function is the probability of the contexts given the corresponding input words:
$$p(Y\mid X,\theta)=\prod_{i=1}^{N}p(y_i\mid x_i,\theta)=\prod_{i=1}^{N}\prod_{j=1}^{C}p(y_{ij}\mid x_i,\theta).$$
Even though neighbouring context words are adjacent and their windows intersect, the model treats the corresponding prediction problems as independent. The Skip-gram model is trained on a stream of input words; the objective is optimized stochastically by sampling the i-th word and its context, estimating gradients and updating the parameters. After the model is trained, the input representations of the trained model are treated as word features, and it has been demonstrated that they capture semantic similarity between the concepts represented by words; the input representations are also known as prototypes. Differentiating and evaluating these probabilities has complexity linear in the dictionary size, which is too expensive for practical applications, so the softmax prediction is replaced by hierarchical softmax prediction:
$$p(v\mid w,\theta)=\prod_{n\in \mathrm{path}(v)}\sigma\big(\mathrm{ch}(n)\,\mathrm{in}_w^{\top}\mathrm{out}_n\big).$$
Here the output representations correspond to the nodes of a binary tree whose leaves are the words, and path(v) is the set of nodes on the path from the root to the leaf v. The function ch(n) assigns either 1 or -1 to each node in path(v), depending on whether n is the left or right child of the previous node on the path. The expression above is guaranteed to sum to 1, i.e. to be a distribution over v, because $\sigma(x)=1/(1+\exp(-x))=1-\sigma(-x)$. Skip-gram uses a Huffman tree to build the hierarchy and improve efficiency.
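A hedged sketch of this factorisation (the node vectors, path and sign codes below are hypothetical placeholders, not a trained tree): the probability of a word is a product of sigmoid decisions along its path, so each prediction costs O(log V) instead of O(V):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(in_w, path_nodes, path_signs, node_vectors):
    # path_signs[j] plays the role of ch(n): +1 for a left child, -1 for a right child.
    p = 1.0
    for n, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * np.dot(in_w, node_vectors[n]))
    return p

rng = np.random.default_rng(0)
node_vectors = rng.normal(scale=0.1, size=(7, 300))   # toy inner-node output vectors
in_w = rng.normal(scale=0.1, size=300)                # toy input representation of word w
print(hs_probability(in_w, path_nodes=[0, 2, 5], path_signs=[+1, -1, +1],
                     node_vectors=node_vectors))
```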
The original Skip-gram model maintains one input representation per word and therefore cannot capture all of its senses, yet it is necessary to learn the right number of prototypes to cover the possible senses of a specific word, which requires an adaptive allocation of additional prototypes for ambiguous words. Adaptive Skip-gram automatically learns the required number of prototypes for each word using a Bayesian non-parametric approach. Each word has k meanings, each associated with its own prototype representation; the aim is to identify the particular choice of sense, so a latent variable z is introduced that encodes the index of the active sense:
$$p(v\mid z=k,w,\theta)=\prod_{n\in \mathrm{path}(v)}\sigma\big(\mathrm{ch}(n)\,\mathrm{in}_{wk}^{\top}\mathrm{out}_n\big).$$
Introducing the latent variable z breaks the symmetry between input and output representations: compared with the standard model, only the input prototypes now depend on the particular word sense. The number of prototypes for a particular word is determined from the training corpus. This problem is approached by bringing Bayesian non-parametrics into the Skip-gram model: a Dirichlet process (DP) is used for the automatic determination of the required number of prototypes, as is widely done in infinite mixture modelling and other problems where the number of structural components is not known a priori. The DP is defined through its stick-breaking representation to give a prior over the senses of a word: the sense probabilities are obtained by dividing the total probability mass into an infinite number of diminishing pieces that sum to 1. The prior probability of the k-th sense of word w is
$$p(z=k\mid w,\beta)=\beta_{wk}\prod_{r=1}^{k-1}(1-\beta_{wr}),\qquad p(\beta_{wk}\mid\alpha)=\mathrm{Beta}(\beta_{wk}\mid 1,\alpha),\quad k=1,2,\dots$$
Although this formally allows an infinite number of prototypes, the number of prototypes actually used for a word cannot exceed the number of its occurrences, denoted $n_w$. The hyperparameter α controls how many prototypes are allocated to a word a priori: asymptotically, the expected number of prototypes of a word w is proportional to $\alpha\log(n_w)$. Larger values of α produce more prototypes, which may lead to more granular or specific meanings being captured by the learned representations, and the number of prototypes scales logarithmically with the number of occurrences. Another property of the DP is that the complexity of the latent-variable space grows as more data arrive, which allows the model to identify the distinct senses of a word in a large text corpus.
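A small simulation sketch (not part of the paper) that illustrates this prior: sampling sense assignments from the equivalent Chinese-restaurant-process view of the Dirichlet process shows the number of distinct senses growing roughly like α·log(n_w); the value α = 0.1 is an arbitrary choice for the illustration:

```python
import numpy as np

def crp_num_senses(n_occurrences, alpha, rng):
    # Chinese restaurant process: each occurrence joins an existing sense with
    # probability proportional to its count, or creates a new sense with
    # probability proportional to alpha.
    counts = []
    for _ in range(n_occurrences):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)      # a new prototype is allocated
        else:
            counts[k] += 1        # an existing prototype is reused
    return len(counts)

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    draws = [crp_num_senses(n, alpha=0.1, rng=rng) for _ in range(20)]
    print(n, round(np.mean(draws), 2), round(0.1 * np.log(n), 2))  # empirical vs alpha*log(n)
```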
Putting all the parts together, the AdaGram model is
$$p(Y,Z,\beta\mid X,\alpha,\theta)=\prod_{w=1}^{V}\prod_{k=1}^{\infty}p(\beta_{wk}\mid\alpha)\prod_{i=1}^{N}\Big[p(z_i\mid x_i,\beta)\prod_{j=1}^{C}p(y_{ij}\mid z_i,x_i,\theta)\Big],$$
where $Z=\{z_i\}_{i=1}^{N}$ is the set of sense assignments for all words. AdaGram is trained by maximizing the marginal likelihood of the model:
$$\log p(Y\mid X,\theta,\alpha)=\log\int\sum_{Z}p(Y,Z,\beta\mid X,\alpha,\theta)\,d\beta.$$
The marginal likelihood is intractable because of the latent variables Z and β, and because β and θ are infinite-dimensional parameters; the model therefore cannot be trained straightforwardly with gradient methods. To make training tractable, a variational lower bound on the marginal likelihood is considered,
$$\mathcal{L}=\mathbb{E}_q\big[\log p(Y,Z,\beta\mid X,\alpha,\theta)-\log q(Z,\beta)\big],$$
where $q(Z,\beta)=q(Z)\,q(\beta)=\prod_{i=1}^{N}q(z_i)\prod_{w=1}^{V}\prod_{k=1}^{T}q(\beta_{wk})$ is a fully factorized variational approximation to the posterior; maximizing the bound is equivalent to minimizing the KL divergence between $q(Z,\beta)$ and the true posterior. Within this approximation, the variational lower bound $\mathcal{L}(q(Z),q(\beta),\theta)$ takes the form
$$\mathcal{L}(q(Z),q(\beta),\theta)=\mathbb{E}_q\Big[\sum_{w=1}^{V}\sum_{k=1}^{T}\big(\log p(\beta_{wk}\mid\alpha)-\log q(\beta_{wk})\big)+\sum_{i=1}^{N}\Big(\log p(z_i\mid x_i,\beta)-\log q(z_i)+\sum_{j=1}^{C}\log p(y_{ij}\mid z_i,x_i,\theta)\Big)\Big].$$
Setting the derivatives of $\mathcal{L}(q(Z),q(\beta),\theta)$ with respect to $q(Z)$ and $q(\beta)$ to zero yields the standard update equations.

Fact extraction, coreference resolution and other NLP tasks depend heavily on existing word taxonomies or ontologies. A word taxonomy is constructed by extracting taxonomic relations from a dictionary or encyclopaedia and consists of several relation types, and the quality of the extracted taxonomy depends strongly on the WSD results. Many WSD approaches are modelled as ML tasks. The proposed work uses the Lesk algorithm, whose feature representation relies heavily on word association, Word2Vec, which provides neural-network features, and AdaGram as the word sense representation model, and applies several dictionary-based WSD algorithms. Much of the work concentrates on the impact of different approaches to mining WSD features. WSD detects the precise sense of a word from the set of all its possible senses based on the context; an ambiguous noun, for example, has multiple senses, and the hypernym relation represents the "is a" relationship between nouns. The challenge associated with this kind of WSD is the generation of a word taxonomy.
A word taxonomy is a directed graph whose nodes represent word senses and whose edges represent the hyponym-hypernym relation; for example, "an apple is a fruit", where apple is the specific word and fruit is the generic word. Semantic resources include lexical databases, thesauri, ontologies and so on, and typical NLP uses are constructing a database of semantic similarity or a tool for term generalization. There are several kinds of approaches to creating or updating a taxonomy. A word taxonomy can be
i) generated manually by lexicographers,
ii) converted from an existing structured resource,
iii) extracted from or derived from a corpus, or
iv) derived from a vector semantic model trained on a corpus.
Corpus-extraction methods and their efficiency vary greatly with the corpus, with notable work performed on corpora of dictionary glosses, formal text corpora and large general corpora. Each of these approaches is a trade-off between the labour required and the quality of the resulting taxonomy.
For taxonomy extraction, a monolingual dictionary provides definition sentences that contain a hypernym, and the hypernym occupies the same syntactic position, which makes it possible to generate high-quality taxonomies by mining hypernym relations from such corpora. This work is restricted to extracting a taxonomy for the Russian language from a monolingual dictionary. WSD methods developed for general corpora are suited to this kind of WSD task. The paper focuses on the application of WSD methods to hypernym disambiguation in a monolingual dictionary and describes the parameters of the WSD methods that help solve the WSD problem. The paper is organized as follows:
1. a brief overview of existing approaches to the WSD problem;
2. a description of the data sources, data preparation and annotation;
3. a description of a WSD pipeline that compares different feature extraction and machine learning configurations;
4. a description and analysis of the WSD parameters and their performance;
5. a discussion of the results and concluding remarks.
Background
All approaches to WSD are based on a context that defines the word sense, though what counts as context varies. The Lesk algorithm measures a similarity metric between two contexts; even though it performs well, it suffers from data sparsity. The simplest way to overcome this limitation is to use semantic relatedness databases, for example using WordNet synsets to add more overlapping context words.
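For illustration, a minimal sketch with NLTK's built-in simplified Lesk implementation (the sentence and target word are arbitrary examples): the sense whose gloss overlaps most with the context is returned.

```python
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank", pos="n")      # pick the noun sense with the best gloss overlap
print(sense, "-", sense.definition() if sense else "no sense found")
```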
Sidorov et al. increase the number of matches between two contexts using an extended synonym dictionary and a dedicated derivational morphology system, achieving high WSD precision on a Spanish corpus. Many other attempts use ML for the WSD task, e.g. LDA, maximum entropy classifiers, genetic algorithms and others. Other approaches to WSD are based on neural networks with an autoencoder or similar topology; however, the early approaches had high computational demands and slow, noisy learning algorithms. Mikolov et al. trained such a model, Word2Vec, and showed a correspondence between arithmetic operations on the derived word embeddings and semantic relations, finding the Skip-gram model to be better than the CBOW model. A word embedding model does not provide a single canonical way to turn a word context into a feature vector, so word embedding features for WSD have been composed from the corpus in several ways, testing several representations of a sense such as a concatenation or different weighted averages of the word context vectors. Many works construct word embedding models in which senses are assigned to vectors, yielding a semantically disambiguated Skip-gram model whose resulting set of vectors is used as a semantic relatedness database in WSD tasks. Chen et al. iteratively perform WSD on a corpus using a Skip-gram model and then train the model on the resulting corpus, improving WSD performance over the naive Skip-gram model, but the procedure is very demanding in both time and space. A practical implementation of WSD is the direct induction of word sense embeddings: AdaGram, a non-parametric extension of the Skip-gram model, performs Bayesian induction of word senses and optimizes the word sense embedding representations and the word sense probabilities in a given context. RNN-based NLP approaches tackle WSD with LSTM models that use a coarse model of how a human reads a sentence sequentially.
[J] Word Sense Disambiguation is widely used in natural language processing and machine learning tasks. The experimental analysis of the proposed work includes context mining, feature analysis and text classification. Adjectives play a vital part in text classification with machine learning algorithms. Applications of the proposed work include document indexing based on a controlled vocabulary, adjective-based word sense disambiguation, constructing hierarchical categorizations of web pages, spam detection, topic labelling, web search, document summarization and so on. Feature extraction is performed with a cuckoo search algorithm, and text classification is performed with a linear support vector machine.
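A hedged sketch of the classification stage only, using scikit-learn: TF-IDF features, a chi-squared feature-selection step standing in for the cuckoo-search feature selection described here (which is not reproduced), and a linear SVM; `docs` and `labels` are hypothetical inputs.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("select", SelectKBest(chi2, k=1000)),   # placeholder for the cuckoo-search feature selection
    ("svm", LinearSVC()),
])

# pipeline.fit(docs, labels)                 # hypothetical training documents and labels
# print(pipeline.score(test_docs, test_labels))
```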
Text document mining uses both machine learning algorithms and deep neural networks. WSD detects the exact sense intended for a word, distinguishing it from all other possible senses, and text document mining removes irrelevant, redundant and noisy features. Feature extraction (FE) and feature selection (FS) are the two techniques used for feature reduction: FE constructs a new, low-dimensional feature space (e.g. PCA, LDA), while feature analysis techniques assign each feature a weight in the range (0, 1). The WSD step identifies the senses of words; knowledge-based frameworks highlight the importance of word senses, as seen in the most recent SensEval evaluation.
