Text Mining: Social Sentiment Analysis Using WEKA Tools
Sepuluh Nopember Institute of Technology (ITS), 29-31 Oct. 2019
[2] https://www.pewresearch.org/fact-tank/2019/04/04/indonesians-optimistic-about-their-countrys-democracy-and-economy-as-elections-near/
Basic Concepts of Text Analytics
• A document can be described by a set of representative keywords called index terms.
• Different index terms have varying relevance when used to describe document contents.
• This effect is captured through the assignment of numerical weights to each index term of a document (e.g., raw frequency, or term frequency-inverse document frequency, tf-idf).
• DBMS analogy:
  • Index Terms → Attributes
  • Weights → Attribute Values
Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy)
• Classify new documents; e.g., Google mail apps
• A standard classification (supervised learning) problem

[Diagram: past reviews labeled Positive/Neutral/Negative train a categorization system, which then assigns labels to new reviews] (Han et al., 2011)
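The supervised setup in the diagram can be sketched in a few lines of Python. The document uses WEKA's GUI; scikit-learn stands in here only to illustrate the same idea, and the reviews and labels are hypothetical toy data.

```python
# Supervised text categorization: labeled past reviews train a model
# that then labels new, unseen reviews.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled "past reviews".
past_reviews = ["great product, loved it",
                "terrible battery, broke fast",
                "works fine, nothing special",
                "excellent screen and price"]
labels = ["Positive", "Negative", "Neutral", "Positive"]

vectorizer = CountVectorizer()              # index terms -> attributes
X = vectorizer.fit_transform(past_reviews)  # weights -> attribute values
clf = MultinomialNB().fit(X, labels)        # train the categorization system

# Classify a new review.
print(clf.predict(vectorizer.transform(["loved the screen"]))[0])
```

Any classifier could replace Naïve Bayes here; the later sections of the document swap in Random Forest, Logistic Regression, and SVM in exactly this slot.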
Data Acquisition and Preparation
How to apply SA in WEKA Tools?
• Tokenization
• Text classification using Random Forest (RF) and Naïve Bayes (NB)

WEKA data mining tool: https://www.cs.waikato.ac.nz/ml/weka/index.html
WEKA Interfaces
E-books: https://www.cs.waikato.ac.nz/ml/weka/documentation.html
WEKA Data Format
Let's Begin
Dataset - Product Reviews
• Amazon Kindle: https://www.kaggle.com/bharadwaj6/kindle-reviews
• Reviews from the Amazon Kindle Store category, May 1996 - July 2014
• Contains a total of 982,619 entries
• Each reviewer has at least 5 reviews, and each product has at least 5 reviews, in this dataset
• Download the file, open it, and save it as .csv in Excel
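As an alternative to Excel, the CSV can be inspected with pandas. The column names used below ('reviewText', 'overall') are assumptions about the Kaggle file; check the actual header after downloading. A tiny inline CSV stands in for the real ~982,619-row file so the sketch is self-contained.

```python
# Sketch: reading review data with pandas instead of Excel.
import io
import pandas as pd

# Tiny stand-in for the downloaded kindle-reviews CSV (hypothetical columns).
csv_data = io.StringIO(
    "reviewText,overall\n"
    '"Great read, could not put it down",5\n'
    '"Formatting was broken on my device",2\n'
)
df = pd.read_csv(csv_data)
print(len(df), "reviews loaded")   # the real file has ~982,619
print(df["overall"].mean())        # average star rating
```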
Outline
• Data Preparation: Transform, Merge, Rename
• Text Pre-processing: Tokenization, Attribute selection
• Text Classification using Machine Learning: Random Forest (RF), Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM)
Text Pre-processing (Tokenization)
• 200 attributes to be kept for each class: set wordsToKeep = 200
Text Pre-processing (Tokenization): Output
• Attribute selection can also be done in the 'Classify' tab.
• On the 'Classify' tab -> 'Classifier' -> 'Meta' -> 'AttributeSelectedClassifier'
Cont…
• Click on 'Ranker' -> change 'numToSelect' = 100, which means WEKA will select the top 100 attributes by rank.
Alternative approach:
On the 'Select attributes' tab >> choose 'AttributeSelection' >> 'CorrelationAttributeEval'; choose 'Ranker'.
Click 'Start'; always ensure the class label is the correct one.
CorrelationAttributeEval
• Evaluates the worth of an attribute by measuring the correlation (Pearson's) of the attribute with the class.
• Nominal attributes are considered on a value-by-value basis by treating each value as an indicator. An overall correlation for a nominal attribute is arrived at via a weighted average.
• 'Test options': for this exercise we use the default percentage split (66%) -> click 'Start'
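WEKA's CorrelationAttributeEval plus Ranker can be approximated outside the GUI by computing Pearson correlation per attribute and keeping the top-ranked ones. The data below is a hypothetical toy matrix, and numpy stands in for WEKA; this only illustrates the ranking idea.

```python
# Rough analogue of CorrelationAttributeEval + Ranker (numToSelect).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 instances, 5 attributes
y = X[:, 2] + 0.1 * rng.normal(size=100)  # class strongly tied to attribute 2

# Absolute Pearson correlation of each attribute with the class.
scores = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)
num_to_select = 2                         # like numToSelect in Ranker
ranked = np.argsort(scores)[::-1]         # best-correlated attributes first
print("kept attributes:", ranked[:num_to_select])
```

Attribute 2 dominates the ranking here because the class was built from it; on real text data the ranked attributes are the most class-correlated words.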
Text Preprocessing: Save Reduced Dataset
In the StringToWordVector filter:
• Set IDFTransform, TFTransform, lowerCaseTokens, and outputWordCounts to 'True'
• stopwordsHandler = 'MultiStopwords'
• tokenizer = 'NGramTokenizer'
• wordsToKeep = 200
• Click OK & Apply
Text preprocessing – NGramTokenization and Stop Words
• Check the 'Select attributes' tab and the word ranking in it:
  Attribute Evaluator -> choose 'CorrelationAttributeEval' -> click 'Start'
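The filter settings above map naturally onto scikit-learn's TfidfVectorizer, which is used here as an analogue of StringToWordVector (not WEKA itself); the example reviews are hypothetical.

```python
# StringToWordVector-style preprocessing, sketched with TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Loved this Kindle book",
        "Did not love the formatting",
        "The formatting was fine"]   # hypothetical reviews
vec = TfidfVectorizer(
    lowercase=True,        # lowerCaseTokens = True
    stop_words="english",  # stop-word handling (MultiStopwords analogue)
    ngram_range=(1, 2),    # NGramTokenizer: unigrams + bigrams
    max_features=200,      # wordsToKeep = 200
)
X = vec.fit_transform(docs)   # TF/IDF-weighted document-term matrix
print(X.shape)                # (3 documents, at most 200 features)
```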
Tokenization:
Use Own Dictionary and Stop Words
Best Practice - Always save files in .arff format
• Preprocess -> click 'Save' -> name your file with a .arff extension -> click 'Save'
Data Preparation
• Transform
• Merge
• Rename
Text Classification using Machine Learning: Results (columns as in WEKA's weighted-average summary output)

Classifier                     TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area
Random Forest (RF)             0.611    0.256    0.566      0.611   0.586      0.265  0.535     0.586
Naïve Bayes (NB)               0.505    0.271    0.530      0.505   0.502      0.265  0.531     0.502
Logistic Regression (LR)       0.612    0.222    0.593      0.612   0.576      0.235  0.563     0.576
Support Vector Machine (SVM)   0.640    0.235    0.614      0.640   0.638      0.236  0.637     0.638
Classifier Evaluation Metrics:
Precision and Recall
• Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
  precision = TP / (TP + FP)
• Recall (completeness): what % of positive tuples did the classifier label as positive?
  recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision & recall

Actual class \ Predicted class      C1                       ~C1
C1                                  True Positives (TP)      False Negatives (FN)
~C1                                 False Positives (FP)     True Negatives (TN)
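The two formulas can be checked with a worked example on a hypothetical confusion matrix:

```python
# Hypothetical confusion matrix counts for class C1.
TP, FN = 90, 10   # actual C1 instances
FP, TN = 30, 70   # actual ~C1 instances

precision = TP / (TP + FP)   # 90 / 120 = 0.75
recall = TP / (TP + FN)      # 90 / 100 = 0.90
print(precision, recall)
```

Note the trade-off the slide mentions: labeling fewer instances positive tends to raise precision (smaller FP) while lowering recall (larger FN).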
Basic Measures for Text Retrieval
[Venn diagram: the Relevant and Retrieved document sets within All Documents]
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
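The set formulas above, spelled out with Python sets and hypothetical document IDs:

```python
# Retrieval-style precision and recall over document ID sets.
relevant = {1, 2, 3, 4, 5}    # documents relevant to the query
retrieved = {3, 4, 5, 6}      # documents the system returned

hit = relevant & retrieved              # {Relevant} ∩ {Retrieved} = {3, 4, 5}
precision = len(hit) / len(retrieved)   # 3 / 4 = 0.75
recall = len(hit) / len(relevant)       # 3 / 5 = 0.60
print(precision, recall)
```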
References
• [1] Sentiment Analysis, https://devopedia.org/sentiment-analysis
• [3] Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and
Techniques, 3rd edition, Morgan Kaufmann, 2011.
TF Weighting
• Weighting:
• More frequent => more relevant to topic
• e.g. “query” vs. “commercial”
• Normalization:
• Document length varies => relative frequency preferred
• e.g., Maximum frequency normalization
IDF Weighting
• Idea: less frequent among documents → more discriminative
• Formula (one common form): IDF(t) = log(n / k)
  • n = total number of docs
  • k = number of docs in which term t appears (the document frequency, DF)
TF-IDF Weighting
• TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  • Frequent within a doc → high TF → high weight
  • Selective among docs → high IDF → high weight
• Recall the vector space (VS) model:
  • Each selected term represents one dimension
  • Each doc is represented by a feature vector
  • The term-t coordinate of document d is the TF-IDF weight of t in d
• This weighting is more reasonable than raw frequency, but is shown here just for illustration; many more complex and effective weighting variants exist in practice
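The weighting scheme above, computed by hand: weight(t, d) = TF(t, d) * IDF(t), with IDF(t) = log(n / k) as in the IDF slide (one common variant among several). The tokenized documents are hypothetical.

```python
# TF-IDF computed directly from the slide's definitions.
import math
from collections import Counter

docs = [["query", "optimization", "query"],   # hypothetical tokenized docs
        ["commercial", "break"],
        ["query", "commercial"]]
n = len(docs)                                 # total number of docs

def tf_idf(term, doc):
    tf = Counter(doc)[term]                   # raw term frequency in doc
    k = sum(term in d for d in docs)          # document frequency of term
    return tf * math.log(n / k)               # weight(t, d) = TF * IDF

# "query" occurs twice in doc 0 and appears in 2 of the 3 docs:
print(round(tf_idf("query", docs[0]), 3))     # 2 * log(3/2) ≈ 0.811
```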
Regex
• https://www.w3schools.com/python/python_regex.asp
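A minimal example of the kind of pattern the linked page covers, applied to cleaning review text before tokenization (the review string is hypothetical):

```python
# Regex-based cleanup with Python's re module.
import re

review = "GREAT read!!! 5/5 -- see http://example.com"
review = re.sub(r"http\S+", "", review)       # drop URLs
review = re.sub(r"[^a-zA-Z\s]", " ", review)  # keep letters and whitespace only
tokens = review.lower().split()               # lowercase and tokenize
print(tokens)
```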