ASSIGNMENT I
WEKA (Waikato Environment for Knowledge Analysis) is a comprehensive suite of Java class libraries
that implements many state-of-the-art machine learning and data mining algorithms. WEKA
provides implementations of learning algorithms that you can easily apply to a dataset. It also
includes a set of tools for transforming datasets, for example by discretization and sampling.
This paper reviews how to preprocess a dataset, feed it into a learning scheme, and analyze the
resulting classifier and its performance, all without writing any program code, using Weka.
1. Download Weka and Install
Visit the Weka Download page and locate a version of Weka suitable for your computer
(Windows, Mac, or Linux).
2. Start Weka
Start Weka. This may involve finding it in your program launcher or double-clicking the weka.jar
file. This will start the Weka GUI Chooser. The Weka GUI Chooser lets you choose among the
Explorer, the Experimenter, the KnowledgeFlow, and the Simple CLI (command line interface).
6. Review Results
Sentiment Analysis can be considered a classification process as illustrated in the diagram above.
There are three main classification levels in SA: document-level, sentence-level, and aspect-level
SA.
(i) Document-level SA aims to classify an opinion document as expressing a positive or
negative opinion or sentiment. It considers the whole document a basic information unit
(talking about one topic).
(ii) Sentence-level SA aims to classify sentiment expressed in each sentence. The first step is to
identify whether the sentence is subjective or objective. If the sentence is subjective,
Sentence-level SA will determine whether the sentence expresses positive or negative
opinions. Wilson et al. have pointed out that sentiment expressions are not necessarily
subjective in nature. However, there is no fundamental difference between document and
sentence-level classification, because sentences are just short documents. Classifying text at
the document level or at the sentence level does not provide the detail needed about opinions
on all aspects of the entity, which many applications require; to obtain these details, we need
to go to the aspect level.
(iii) Aspect-level SA aims to classify the sentiment with respect to the specific aspects of
entities. The first step is to identify the entities and their aspects. Opinion holders can
give different opinions on different aspects of the same entity, as in the sentence “The voice
quality of this phone is not good, but the battery life is long”. This survey tackles the first
two kinds of SA.
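The two-step sentence-level procedure described above (first a subjectivity check, then polarity) can be sketched with a toy lexicon-based classifier. The word lists below are hypothetical stand-ins for a real sentiment lexicon such as SentiWordNet, and the function names are illustrative:

```python
# Hypothetical toy lexicons; a real system would use a resource like SentiWordNet.
POSITIVE = {"good", "great", "long", "excellent"}
NEGATIVE = {"bad", "poor", "short", "terrible"}
SUBJECTIVE_CUES = POSITIVE | NEGATIVE | {"love", "hate", "think"}

def is_subjective(sentence: str) -> bool:
    """Step 1: subjectivity check -- does the sentence carry any opinion cue?"""
    words = [w.strip(".,") for w in sentence.lower().split()]
    return any(w in SUBJECTIVE_CUES for w in words)

def classify(sentence: str) -> str:
    """Step 2: polarity of a subjective sentence by simple lexicon vote."""
    if not is_subjective(sentence):
        return "objective"
    words = [w.strip(".,") for w in sentence.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("The battery life is long."))  # → positive
print(classify("The phone has a screen."))    # → objective
```

Note that a vote like this ignores negation (“not good”), which is one reason real sentence-level SA needs more than a lexicon.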
Sentiment classification techniques
b. Topic modeling
Topic Modelling discovers abstract topics in a corpus based on clusters of words found in each
document and their respective frequency. A document typically contains multiple topics in
different proportions, thus the widget also reports on the topic weight per document.
The widget wraps gensim’s topic models (LSI, LDA, HDP).
The first, LSI, can return both positive and negative words (words that are in a topic and words
that are not), together with topic weights, which can be positive or negative. As stated by the main
gensim’s developer, Radim Řehůřek: “LSI topics are not supposed to make sense; since LSI
allows negative numbers, it boils down to delicate cancellations between topics and there’s no
straightforward way to interpret a topic."
LDA can be more easily interpreted, but is slower than LSI. HDP has many parameters - the
parameter that corresponds to the number of topics is Top level truncation level (T). The smallest
number of topics that one can retrieve is 10.
o Latent Semantic Indexing. Returns both negative and positive words and topic
weights.
o Latent Dirichlet Allocation
o Hierarchical Dirichlet Process
Parameters for the algorithm. LSI and LDA accept only the number of topics
modelled, with the default set to 10. HDP, however, has more parameters. As this algorithm is
computationally very demanding, we recommend trying it on a subset of the data, or setting all
the required parameters in advance and only then running the algorithm (connecting the input to
the widget).
o First level concentration (γ): distribution at the first (corpus) level of Dirichlet
Process
o Second level concentration (α): distribution at the second (document) level of
Dirichlet Process
o The topic Dirichlet (α): concentration parameter used for the topic draws
o Top level truncation (Τ): corpus-level truncation (no of topics)
o Second level truncation (Κ): document-level truncation (no of topics)
o Learning rate (κ): step size
o Slow down parameter (τ)
Produce a report.
If Commit Automatically is on, changes are communicated automatically.
Alternatively press Commit.
c. Word embedding
Word embedding — the mapping of words into numerical vector spaces — has proved to be an
incredibly important method for natural language processing (NLP) tasks in recent years, enabling
various machine learning models that rely on vector representation as input to enjoy richer
representations of text input. These representations preserve more semantic and syntactic
information on words, leading to improved performance in almost every imaginable NLP task.
Document Embedding parses the n-grams of each document in the corpus, obtains an embedding for
each n-gram using a pretrained model for the chosen language, and produces one vector per
document by aggregating the n-gram embeddings with one of the offered aggregators. Note that the
method will work on any n-grams, but it gives the best results if the corpus is preprocessed so
that the n-grams are words (because the model was trained to embed words).
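A minimal sketch of the aggregation step just described: embed each token, then collapse the token vectors into one document vector. The lookup table below is a tiny hand-made stand-in for a real pretrained model (a real setup would load, e.g., fastText vectors for the chosen language), and `embed_document` is an illustrative helper, not the widget's API:

```python
import numpy as np

# Stand-in for a pretrained embedding model: a tiny hand-made lookup table.
EMB = {
    "battery": np.array([0.9, 0.1, 0.0]),
    "life":    np.array([0.7, 0.2, 0.1]),
    "is":      np.array([0.0, 0.0, 0.1]),
    "long":    np.array([0.5, 0.4, 0.0]),
}

def embed_document(tokens, aggregator="mean"):
    """Embed each token, then aggregate token vectors into one document vector."""
    vecs = [EMB[t] for t in tokens if t in EMB]  # skip out-of-vocabulary tokens
    if not vecs:
        return np.zeros(3)
    stacked = np.stack(vecs)
    return stacked.mean(axis=0) if aggregator == "mean" else stacked.sum(axis=0)

doc = ["battery", "life", "is", "long"]
print(embed_document(doc, "mean"))  # one fixed-length vector for the document
```

Whatever the document's length, the output has the dimensionality of the word vectors, which is what lets downstream models consume documents as fixed-size inputs.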
A possible way to map the field is into the following four prominent approaches:
1. Summarizing word vectors: This is the classic approach. Bag-of-words does exactly this
for one-hot word vectors, and the various weighting schemes you can apply to it are variations on
this way of summarizing word vectors. However, this approach is also valid when used with the
most state-of-the-art word representations (usually by averaging instead of summing), especially
when word embeddings are optimized with this use in mind, and can stand its ground against any
of the sexier methods covered here.
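The claim that bag-of-words "does exactly this" for one-hot vectors can be checked in a few lines: summing one-hot word vectors yields exactly the word-count vector. The vocabulary here is a made-up example:

```python
# Vocabulary index; each word maps to a one-hot vector of length |V|.
vocab = {"good": 0, "bad": 1, "battery": 2, "screen": 3}

def one_hot(word):
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

def bag_of_words(tokens):
    """Summing one-hot word vectors gives the word-count (bag-of-words) vector."""
    doc_vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            for i, x in enumerate(one_hot(t)):
                doc_vec[i] += x
    return doc_vec

print(bag_of_words(["good", "battery", "good"]))  # → [2, 0, 1, 0]
```

Swapping the sum for a mean over dense pretrained vectors gives the averaging variant mentioned above.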
2. Topic modelling: While this is not usually the main application for topic modeling
techniques like LDA and PLSI, they inherently generate a document embedding space meant to
model and explain word distribution in the corpus and where dimensions can be seen as latent
semantic structures hidden in the data, and are thus useful in our context. I don’t really cover this
approach in this post (except a brief intro to LDA), since I think that it is both well represented by
LDA and well known generally.
3. Encoder-decoder models: This is the newest unsupervised addition to the scene, featuring
the likes of doc2vec and skip-thought. While this approach has been around since the early 2000s
— under the name of neural probabilistic language models — it has gained new life recently with
its successful application to word embedding generation, with current research focusing on how to
extend its use to document embedding. This approach gains more than others from the increasing
availability of large unlabeled corpora.
4. Supervised representation learning: This approach owes its life to the great rise (or
resurgence) of neural network models, and their ability to learn rich representations of input data
using various non-linear multi-layer operators, which can approximate a wide range of mappings.