
https://bit.ly/2MWyuD7

Text Mining: Social Sentiment Analysis Using WEKA Tools
Sepuluh Nopember Institute of Technology (ITS)
29-31 Oct. 2019

Dr Shuzlina Abdul Rahman, Associate Professor
Ts Dr Sofianita Mutalib, Senior Lecturer
Centre for Information Systems Studies, Faculty of Computer & Mathematical Sciences,
Universiti Teknologi MARA (UiTM), Shah Alam, MALAYSIA
Learning Outcomes from the Workshop
• At the end of the workshop, participants should be able to use WEKA for Sentiment Analysis, specifically to:
  • pre-process textual data
  • develop classification models using machine learning algorithms
INTRODUCTION
• WHAT IS SENTIMENT ANALYSIS (SA)?
• WHY IS SA INTERESTING?
• HOW TO APPLY SA IN WEKA?
What is Sentiment Analysis (SA)?
• SA is the process of analyzing people’s opinions, feelings, and attitudes towards a specific product, organization, or service [1].
Why is SA interesting?
[2] https://www.pewresearch.org/fact-tank/2019/04/04/indonesians-optimistic-about-their-countrys-democracy-and-economy-as-elections-near/
Basic Concepts of Text Analytics
• A document can be described by a set of representative keywords called index terms.
• Different index terms have varying relevance when used to describe document contents.
• This effect is captured through the assignment of numerical weights to each index term of a document (e.g. frequency, or term frequency-inverse document frequency, tf-idf).
• DBMS analogy:
  • Index Terms → Attributes
  • Weights → Attribute Values

Han et al., 2011


Basic Concepts
• Index term (attribute) selection - tokenization:
  • Stop list – set of irrelevant words (a, the, of, …)
  • Word stem – (drug: drugs, drugged) viewed as different occurrences of the same word
• Index term weighting methods
  • Term × document frequency matrix – measures the number of occurrences of term t in document d

Han et al., 2011
Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy)
• Classify new documents; e.g. Google mail apps
• A standard classification (supervised learning) problem

[Figure: past reviews feed a categorization system, which then labels new reviews as Positive, Neutral, or Negative]

Han et al., 2011
How to apply SA in WEKA Tools?
• Data Acquisition and Preparation
• Text Pre-processing
  • Tokenization
  • Feature selection
  • Feature transformation
• Text Classification using Machine Learning
  • Random Forest (RF)
  • Naïve Bayes (NB)
  • Logistic Regression (LR)
  • Support Vector Machine (SVM)
• Analysis of the results


WEKA
Machine Learning Software in Java

https://www.kaggle.com/datasets
https://archive.ics.uci.edu/ml/datasets.php
WEKA 3
• Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rule mining, and visualization.
• https://www.cs.waikato.ac.nz/ml/weka/
• Two versions:
  • Stable ver. (3.8)
  • Developer ver. (3.9)
Waikato Environment for Knowledge Analysis (WEKA)
• In 1993, the University of Waikato in New Zealand began development of the original version of Weka.
• It is released as open source software under the GNU GPL. It is written in Java and provides a well-documented API that promotes integration into your own applications.
• WEKA provides a collection of data mining and machine learning algorithms and processing tools.
• WEKA is an environment for comparing learning algorithms.


WEKA GUI Chooser – Data Mining Tool
https://www.cs.waikato.ac.nz/ml/weka/index.html
WEKA Interfaces

E-BOOKS: https://www.cs.waikato.ac.nz/ml/weka/documentation.html

The Explorer panels:
• Preprocess – load a dataset and manipulate the data into a form that you want to work with.
• Classify – select and run classification and regression algorithms to operate on your data.
• Cluster – select and run clustering algorithms on your dataset.
• Associate – run association algorithms to extract insights from your data.
• Select attributes – run attribute selection algorithms on your data to select those attributes that are relevant to the feature you want to predict.
• Visualize – visualize the relationship between attributes.
WEKA Data Formats
• .arff
• .csv
• .dat
WEKA 3

LET’S BEGIN
Dataset - Product Reviews
• Amazon Kindle
• https://www.kaggle.com/bharadwaj6/kindle-reviews
• Amazon Kindle Store category from May 1996 - July 2014.
• Contains a total of 982,619 entries.
• Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.
Download and open the file, then save it in Excel
• You need to save the file as .csv (Comma delimited).
Outline
• Data Preparation
  • Transform
  • Merge
  • Rename
• Text Pre-processing
  • Tokenization
  • Attribute selection
• Text Classification using Machine Learning
  • Random Forest (RF)
  • Naïve Bayes (NB)
  • Logistic Regression (LR)
  • Support Vector Machine (SVM)
• Analysis of the results


Data Preparation
• For this exercise:
  • 10 attributes
  • 2500 instances
• Two primary attributes:
  • overall
  • reviewText
• Delete the other attributes: no, asin, helpful, reviewTime, reviewID, reviewerName, summary, and unixReviewTime
Data Preparation
• Then, load your dataset into Weka (an API alternative is sketched below).
• Click ‘Open file…’ -> find your ‘.csv’ data file -> click ‘Open’
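For readers who prefer scripting the same steps, here is a minimal sketch of loading the CSV through the WEKA Java API. The file name and path are placeholders for this exercise.

// Minimal sketch: load the Kindle reviews CSV with the WEKA Java API.
import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadReviews {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("kindle_reviews.csv"));   // placeholder: adjust to your path
        Instances data = loader.getDataSet();
        System.out.println("Loaded " + data.numInstances() + " instances with "
                + data.numAttributes() + " attributes");
    }
}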
Data Preparation
• ‘overall’ is the overall rating given with the review.
• We have reviews with ratings in the range [1-5].
• Label a rating of 1 or 2 as a “Negative” review.
• Consider a rating of 3 as a “Neutral” review.
• Label a rating of 4 or 5 as a “Positive” review.
• Then, run the dataset in Weka again.
Data Preparation
• Change ‘overall’ from numeric to nominal (see the code sketch below).
• Steps:
  • Filter >> Unsupervised >> Attribute >> NumericToNominal
  • Change “attributeIndices” to first
  • Click OK >> Apply
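The same filter can be applied through the API. A sketch continuing from the loading snippet above, with `data` holding the loaded Instances; the attribute position is an assumption based on this exercise.

// Convert the numeric 'overall' rating (assumed to be the first attribute) to nominal.
// Additional imports for this step (place at the top of the file):
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

NumericToNominal numToNom = new NumericToNominal();
numToNom.setAttributeIndices("first");   // same as setting 'attributeIndices' to first in the GUI
numToNom.setInputFormat(data);
data = Filter.useFilter(data, numToNom);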
Data Preparation
• Merge TWO values of the attribute “overall”: ‘1’ with ‘2’, and ‘4’ with ‘5’ (see the code sketch below).
• Steps:
  • Filter >> Unsupervised >> Attribute >> MergeTwoValues
  • Change “attributeIndex” to first
  • Change “firstValueIndex” to 1 or first
  • Change “secondValueIndex” to 2
  • Click OK >> Apply
  • Repeat the filter to merge the values ‘4’ and ‘5’.
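A sketch of the same merge via the API, applied twice because MergeTwoValues merges one pair of values per pass. The value positions after the first merge are assumptions based on the default value ordering.

// Merge rating values '1'+'2' and then '4'+'5' of the (now nominal) 'overall' attribute.
// Additional import for this step:
import weka.filters.unsupervised.attribute.MergeTwoValues;

MergeTwoValues mergeLow = new MergeTwoValues();
mergeLow.setAttributeIndex("first");
mergeLow.setFirstValueIndex("1");    // value '1'
mergeLow.setSecondValueIndex("2");   // value '2' -> merged label becomes '1_2'
mergeLow.setInputFormat(data);
data = Filter.useFilter(data, mergeLow);

MergeTwoValues mergeHigh = new MergeTwoValues();
mergeHigh.setAttributeIndex("first");
mergeHigh.setFirstValueIndex("3");   // '4' now sits at position 3 ('1_2', '3', '4', '5')
mergeHigh.setSecondValueIndex("4");  // '5' now sits at position 4 -> merged label '4_5'
mergeHigh.setInputFormat(data);
data = Filter.useFilter(data, mergeHigh);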

RECALL our objective:
• ‘overall’ is the overall rating given with the review.
• We have reviews with ratings in the range [1-5].
• Label a rating of 1 or 2 as a “Negative” review.
• Consider a rating of 3 as a “Neutral” review.
• Label a rating of 4 or 5 as a “Positive” review.
Data Preparation
• Rename the new labels (see the code sketch below).
• Steps:
  • Filter >> Unsupervised >> Attribute >> RenameNominalValues
  • Change “selectedAttributes” to 1
  • Type in “valueReplacements”: 1_2:neg, 3:neu, 4_5:pos
  • Click OK >> Apply
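An equivalent API call, as a sketch continuing with `data`; the replacement string mirrors the GUI setting above.

// Rename the merged rating values to sentiment labels: neg / neu / pos.
// Additional import for this step:
import weka.filters.unsupervised.attribute.RenameNominalValues;

RenameNominalValues rename = new RenameNominalValues();
rename.setSelectedAttributes("1");                     // the 'overall' attribute
rename.setValueReplacements("1_2:neg,3:neu,4_5:pos");  // old:new pairs, comma separated
rename.setInputFormat(data);
data = Filter.useFilter(data, rename);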
Data Preparation
• Set attribute “overall” as the class label (see the code sketch below).
• Steps:
  • Click Edit, go to the “overall” header column, right-click, and choose “Attribute as class”.
• Save the file as kindle_review_labelled.arff
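Programmatically, setting the class attribute and saving the ARFF file might look like the following sketch; the attribute name and position are assumptions based on this exercise.

// Mark 'overall' as the class label and write the prepared data to an .arff file.
// Additional imports for this step:
import java.io.File;
import weka.core.converters.ArffSaver;

data.setClassIndex(data.attribute("overall").index());  // or data.setClassIndex(0) if it is first

ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File("kindle_review_labelled.arff"));
saver.writeBatch();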
Data Preparation with Excel (alternative)
• Go to Excel and use Find and Replace to change the ‘overall’ numbers into labels.
Outline
• Data Preparation
  • Transform
  • Merge
  • Rename
• Text Pre-processing
  • Tokenization
  • Attribute selection
• Text Classification using Machine Learning
  • Random Forest (RF)
  • Naïve Bayes (NB)
  • Logistic Regression (LR)
  • Support Vector Machine (SVM)
• Analysis of the results


Text Pre-processing (Tokenization)
• Convert the text attribute (reviewText) from Nominal to String (see the code sketch below).
• Steps:
  • Choose Filter >> Unsupervised >> Attribute >> NominalToString
  • Change “attributeIndexes” to 1
  • Click OK, then Apply
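The matching API sketch, continuing with `data`; the index assumes the review text is the first attribute, as in the GUI steps above.

// Convert the nominal review text attribute to a true String attribute.
// Additional import for this step:
import weka.filters.unsupervised.attribute.NominalToString;

NominalToString nomToStr = new NominalToString();
nomToStr.setAttributeIndexes("1");   // matches the GUI 'attributeIndexes' setting
nomToStr.setInputFormat(data);
data = Filter.useFilter(data, nomToStr);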
Text Pre-processing (Tokenization)
• Process the text attribute (reviewText) to create index terms (see the code sketch below).
• Steps:
  • Choose Filter >> Unsupervised >> Attribute >> StringToWordVector
  • Change “IDFTransform” to True
  • Change “TFTransform” to True
  • Change “attributeIndices” to 1
  • Change “lowerCaseTokens” to True
  • Change “outputWordCounts” to True
  • Go to “stemmer” and choose “IteratedLovinsStemmer”
  • Go to “stopwordsHandler” and choose “Rainbow”
  • Go to “tokenizer” and choose “WordTokenizer”, then type in the delimiters as follows:
    .,;:'"()?!@#$%^&*(){}[]/|\1234567890-=+_)
  • Change “wordsToKeep” to 200
  • Click OK, then Apply
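The same StringToWordVector configuration can be scripted. A sketch under the assumption that `data` holds the prepared Instances; the delimiter string mirrors the GUI entry, escaped for Java and with a leading space added so words are also split on whitespace.

// Build TF-IDF word-vector features from the review text.
// Additional imports for this step:
import weka.core.Instances;
import weka.core.stemmers.IteratedLovinsStemmer;
import weka.core.stopwords.Rainbow;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

StringToWordVector s2wv = new StringToWordVector();
s2wv.setAttributeIndices("1");          // the review text attribute
s2wv.setIDFTransform(true);
s2wv.setTFTransform(true);
s2wv.setLowerCaseTokens(true);
s2wv.setOutputWordCounts(true);
s2wv.setWordsToKeep(200);
s2wv.setStemmer(new IteratedLovinsStemmer());
s2wv.setStopwordsHandler(new Rainbow());

WordTokenizer tok = new WordTokenizer();
tok.setDelimiters(" .,;:'\"()?!@#$%^&*(){}[]/|\\1234567890-=+_");  // slide delimiters plus a space
s2wv.setTokenizer(tok);

s2wv.setInputFormat(data);
Instances vectorised = Filter.useFilter(data, s2wv);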
• WordTokenizer tokenizes the text using a set of delimiter characters.
• Filter -> choose ‘Unsupervised’ -> ‘Attribute’ -> click ‘StringToWordVector’
• About 200 attributes are kept for each class:
  • wordsToKeep = 200
Text Pre-processing (Tokenization): Output

Attribute Selection
• Attribute selection can also be done on the ‘Classify’ tab (see the code sketch below).
• On the ‘Classify’ tab -> ‘Classifier’ -> ‘Meta’ -> ‘AttributeSelectedClassifier’
  • Classifier -> choose ‘J48’
  • Evaluator -> choose ‘GainRatioAttributeEval’
  • Search -> choose ‘Ranker’
• Click on ‘Ranker’ -> change ‘numToSelect’ = 100, which means Weka will select the top 100 attributes by rank.
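A sketch of the same setup via the API, continuing with the word-vector data `vectorised`, whose class attribute is assumed to be set.

// Wrap J48 in AttributeSelectedClassifier: rank attributes by gain ratio and keep the top 100.
// Additional imports for this step:
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;

AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
asc.setClassifier(new J48());
asc.setEvaluator(new GainRatioAttributeEval());

Ranker ranker = new Ranker();
ranker.setNumToSelect(100);          // top 100 attributes by rank
asc.setSearch(ranker);

asc.buildClassifier(vectorised);     // attribute selection happens inside the wrapper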
Alternative Approach:
• On the ‘Select attributes’ tab >> choose ‘CorrelationAttributeEval’ as the attribute evaluator; choose “Ranker” as the search method.
• Click Start; always ensure the class label is the correct one.

CorrelationAttributeEval
• Evaluates the worth of an attribute by measuring the correlation (Pearson’s) of the attribute with the class.
• Nominal attributes are considered on a value-by-value basis by treating each value as an indicator. An overall correlation for a nominal attribute is arrived at via a weighted average.
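For the Select attributes route, a sketch using the API; the numToSelect value is an assumption (the GUI default ranks all attributes), and `vectorised` is the word-vector data from the earlier sketch.

// Rank attributes by correlation with the class and produce the reduced dataset.
// Additional imports for this step:
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;

AttributeSelection selector = new AttributeSelection();
selector.setEvaluator(new CorrelationAttributeEval());

Ranker ranker = new Ranker();
ranker.setNumToSelect(100);                   // keep the 100 best-correlated attributes (assumed)
selector.setSearch(ranker);

selector.SelectAttributes(vectorised);        // note WEKA's capitalised method name
Instances reduced = selector.reduceDimensionality(vectorised);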
• ‘Test options’ – for this exercise we are going to use the default percentage split of 66% -> click ‘Start’
Text pre-processing: Save reduced dataset
• In the result list, go to the run result, right-click, choose “Save reduced data”, and name the file.
Alternative Tokenization Technique: NGram
• Open the original 2500-instance dataset (or retrieve the saved file).
• Perform a similar process to get the result below:
Text pre-processing – NGram Tokenization and Stop Words
• NGramTokenizer is used to split a string into n-grams with minimum and maximum gram sizes (see the code sketch below).
• Filter -> Unsupervised -> ‘StringToWordVector’
  • Change IDFTransform, TFTransform, lowerCaseTokens, and outputWordCounts to ‘True’
  • stopwordsHandler = ‘MultiStopwords’
  • tokenizer = ‘NGramTokenizer’
  • wordsToKeep = 200
  • Click OK & Apply
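A sketch of the n-gram variant via the API; the minimum and maximum gram sizes are assumptions, since the slide keeps the tokenizer defaults.

// StringToWordVector with an NGramTokenizer and the MultiStopwords handler.
// Additional imports for this step:
import weka.core.stopwords.MultiStopwords;
import weka.core.tokenizers.NGramTokenizer;

StringToWordVector s2wvNgram = new StringToWordVector();
s2wvNgram.setAttributeIndices("1");
s2wvNgram.setIDFTransform(true);
s2wvNgram.setTFTransform(true);
s2wvNgram.setLowerCaseTokens(true);
s2wvNgram.setOutputWordCounts(true);
s2wvNgram.setWordsToKeep(200);
s2wvNgram.setStopwordsHandler(new MultiStopwords());

NGramTokenizer ngram = new NGramTokenizer();
ngram.setNGramMinSize(1);   // unigrams ...
ngram.setNGramMaxSize(2);   // ... up to bigrams (assumed; adjust as needed)
s2wvNgram.setTokenizer(ngram);

s2wvNgram.setInputFormat(data);
Instances ngramData = Filter.useFilter(data, s2wvNgram);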
Text pre-processing – NGram Tokenization and Stop Words
• Check the Select attributes tab and the words ranked in it.
• Attribute evaluator -> choose ‘CorrelationAttributeEval’ -> click ‘Start’
Tokenization:
Use Own Dictionary and Stop Words
Best Practice - Always save files in .arff format
• Preprocess -> click ‘Save’ -> name your file .arff -> click ‘Save’
Outline
• Data Preparation
  • Transform
  • Merge
  • Rename
• Text Pre-processing
  • Tokenization
  • Attribute selection
• Text Classification using Machine Learning
  • Random Forest (RF)
  • Naïve Bayes (NB)
  • Logistic Regression (LR)
  • Support Vector Machine (SVM)
• Analysis of the results


Text Classification
• Text classification based on sentiment
• Sentiment is labelled based on the case study
Text Classification – Random Forest (RF)
• Test Options: Cross-validation, 10 folds (a code sketch follows below)
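For reference, a self-contained sketch of the same 10-fold cross-validation through the WEKA Java API; the file name and class attribute name are assumptions based on the earlier preparation steps.

// 10-fold cross-validation of RandomForest on the labelled, vectorised reviews.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateRF {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kindle_review_vectorised.arff");  // placeholder file name
        data.setClassIndex(data.attribute("overall").index());              // class label from this exercise

        RandomForest rf = new RandomForest();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());   // TP rate, FP rate, precision, recall per class
    }
}

Swapping `new RandomForest()` for `new NaiveBayes()` (weka.classifiers.bayes) or `new Logistic()` (weka.classifiers.functions) reproduces the NB and LR runs on the following slides.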
Save Models
• Classify -> in ‘Result list’ -> right-click on ‘RandomForest’ -> click ‘Save model’
• Name your model and save it as “.model”
• Repeat the step for each model.
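Programmatically, a trained classifier can be saved and reloaded with SerializationHelper. A short sketch, continuing from the cross-validation example above (the file name is a placeholder).

// Persist the trained model to disk and load it back later.
import weka.core.SerializationHelper;

SerializationHelper.write("randomforest.model", rf);                                     // save
RandomForest restored = (RandomForest) SerializationHelper.read("randomforest.model");   // reload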
Text Classification – Naïve Bayes (NB)
Test Options: Cross-validation 10 folds
Text Classification – Logistic Regression (LR)
Test Options: Cross-validation 10 folds
Text Classification – Support Vector Machine (SVM)
Test Options: Cross-validation 10 folds
• To use SVM classification, you need to install LibSVM first.
• Go to the Weka GUI Chooser -> Tools -> click on Package manager
• Download the latest version and install it.
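Once the package is installed, the wrapper class weka.classifiers.functions.LibSVM becomes available. A sketch of running it with 10-fold cross-validation, continuing with `data` from the earlier sketch (default LibSVM parameters are assumed).

// Cross-validate the LibSVM wrapper (requires the LibSVM package from the package manager).
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM;

LibSVM svm = new LibSVM();
Evaluation svmEval = new Evaluation(data);
svmEval.crossValidateModel(svm, data, 10, new Random(1));
System.out.println(svmEval.toClassDetailsString());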


Text Classification – RandomForest (RF)
Test Options: Percentage split by default = 66% training, 34% testing
Text Classification – Naïve Bayes (NB)
Test Options: Percentage split by default = 66% training, 34% testing
Text Classification – Logistic Regression (LR)
Test Options: Percentage split by default = 66% training, 34% testing
Text Classification – Support Vector Machine (SVM)
Test Options: Percentage split by default = 66% training, 34% testing
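The 66%/34% percentage split can also be reproduced in code. A sketch in which the random seed mirrors the Explorer default of 1 and `data` is the prepared dataset; it uses RandomForest, but any of the four classifiers can be substituted.

// Hold out 34% of the data for testing, mirroring the Explorer's percentage-split option.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

data.randomize(new Random(1));
int trainSize = (int) Math.round(data.numInstances() * 0.66);
Instances train = new Instances(data, 0, trainSize);
Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

RandomForest rf = new RandomForest();
rf.buildClassifier(train);

Evaluation eval = new Evaluation(train);
eval.evaluateModel(rf, test);
System.out.println(eval.toClassDetailsString());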
Outline
• Data Preparation
  • Transform
  • Merge
  • Rename
• Text Pre-processing
  • Tokenization
  • Attribute selection
• Text Classification using Machine Learning
  • Random Forest (RF)
  • Naïve Bayes (NB)
  • Logistic Regression (LR)
  • Support Vector Machine (SVM)
• Analysis of the results


Analysis of the Results
• These results are the weighted averages reported by Weka.

Method                        | 10-fold cross-validation            | Percentage split (66% train / 34% test)
                              | TP Rate  FP Rate  Precision  Recall | TP Rate  FP Rate  Precision  Recall
Random Forest (RF)            | 0.611    0.256    0.566      0.611  | 0.586    0.265    0.535      0.586
Naïve Bayes (NB)              | 0.505    0.271    0.530      0.505  | 0.502    0.265    0.531      0.502
Logistic Regression (LR)      | 0.612    0.222    0.593      0.612  | 0.576    0.235    0.563      0.576
Support Vector Machine (SVM)  | 0.640    0.235    0.614      0.640  | 0.638    0.236    0.637      0.638

• SVM achieves the highest weighted TP rate, precision, and recall under both test options.
Classifier Evaluation Metrics: Precision and Recall
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?
  precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  recall = TP / (TP + FN)
• A perfect score is 1.0.
• There is an inverse relationship between precision and recall.

Actual class \ Predicted class |  C1                    |  ~C1
C1                             |  True Positives (TP)   |  False Negatives (FN)
~C1                            |  False Positives (FP)  |  True Negatives (TN)
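A quick worked example with illustrative counts (not taken from the Kindle results above): suppose a classifier produces TP = 50, FP = 10, and FN = 20. Then

\[
\text{precision} = \frac{TP}{TP + FP} = \frac{50}{50 + 10} \approx 0.83,
\qquad
\text{recall} = \frac{TP}{TP + FN} = \frac{50}{50 + 20} \approx 0.71 .
\]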
Basic Measures for Text Retrieval

[Figure: Venn diagram over All Documents showing the Relevant and Retrieved sets and their intersection, Relevant & Retrieved]

• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
  precision = | {Relevant} ∩ {Retrieved} | / | {Retrieved} |

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
  recall = | {Relevant} ∩ {Retrieved} | / | {Relevant} |
References
• [1] Sentiment Analysis, https://devopedia.org/sentiment-analysis
• [2] Indonesians optimistic about their country’s democracy and economy as elections near, April 4, 2019, https://www.pewresearch.org/fact-tank/2019/04/04/indonesians-optimistic-about-their-countrys-democracy-and-economy-as-elections-near/
• [3] Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011.
• [4] An Overview of Sentiment Analysis, https://jgateplus.com/home/2019/01/16/an-overview-of-sentiment-analysis/
How to Assign Weights
• Two-fold heuristics based on frequency
• TF (Term frequency)
  • More frequent within a document → more relevant to semantics
  • e.g., “query” vs. “commercial”
• IDF (Inverse document frequency)
  • Less frequent among documents → more discriminative
  • e.g., “algebra” vs. “science”
TF Weighting
• Weighting:
  • More frequent => more relevant to topic
  • e.g., “query” vs. “commercial”
• Raw TF = f(t,d): how many times term t appears in doc d
• Normalization:
  • Document length varies => relative frequency preferred
  • e.g., maximum frequency normalization
IDF Weighting
• Idea:
  • Less frequent among documents → more discriminative
• Formula: IDF(t) = log(n / k)
  • n — total number of docs
  • k — # docs with term t appearing (the DF, document frequency)
TF-IDF Weighting
• TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  • Frequent within doc → high TF → high weight
  • Selective among docs → high IDF → high weight
• Recall the vector space (VS) model
  • Each selected term represents one dimension
  • Each doc is represented by a feature vector
  • Its coordinate for term t is the TF-IDF weight of t in document d
• This is more reasonable
• Just for illustration …
  • Many complex and more effective weighting variants exist in practice
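A minimal worked example with illustrative numbers (not from the workshop data): suppose term t occurs 3 times in document d, and 10 out of n = 1000 documents contain t. With a base-10 logarithm,

\[
\mathrm{weight}(t,d) = TF(t,d) \times IDF(t) = 3 \times \log_{10}\frac{1000}{10} = 3 \times 2 = 6 .
\]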
Regex
• https://www.w3schools.com/python/python_regex.asp
