Rodrigo Martínez-Castaño
Juan C. Pichel
Pablo Gamallo
Sentiment Analysis on
Multilingual Tweets using
Big Data Technologies
Centro Singular de Investigación en Tecnoloxías da Información
Universidade de Santiago de Compostela
Index
1. Introduction
2. Background & Related Work
3. Architecture of the System
4. Tweet Mining Module
5. Sentiment Analysis Module
6. Performance Results
7. Conclusions
8. Evolution of the System
Centro de Investigación en Tecnoloxías da Información (CiTIUS) 2
Introduction 1/2
Sentiment Analysis
Consists of identifying the opinion expressed in a text
Twitter is a large source of short texts
Huge amounts of text allow useful conclusions to be drawn
Analysing tweets is a big challenge
Human subjectivity
Too short to be linguistically analysed
Introduction 2/2
Parallel architecture using Big Data
Standard solutions cannot handle GBs or TBs of text in
reasonable time
Apache Hadoop cluster with HBase
Goals
Sentiment classifier should perform as well as other
state-of-the-art classifiers in different languages
Millions of tweets should be processed in short times
Background & Related Work 1/2
Big Data processing
MapReduce programming model
Two phases: map and reduce
Inputs and outputs are key-value pairs
Apache Hadoop
HDFS (filesystem)
YARN (resource-management platform)
MapReduce Framework
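To make the two phases concrete, here is a minimal in-memory sketch of the map, shuffle and reduce steps over key-value pairs. It is plain Python rather than the Hadoop API, and every name in it (`run_job`, `word_mapper`, `sum_reducer`) is an illustrative assumption.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and yield its key-value pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Toy mapper: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Toy reducer: add up the counts emitted for one key."""
    return sum(values)


def run_job(records):
    """Chain the three steps for the toy word-count job."""
    return reduce_phase(shuffle(map_phase(records, word_mapper)), sum_reducer)


if __name__ == "__main__":
    print(run_job(["big data", "big tweets"]))
```

In a real Hadoop job the shuffle step is handled by the framework and the data lives in HDFS; only the mapper and reducer are user code.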
Background & Related Work 2/2
Sentiment analysis
Machine learning
Learning algorithms over a known dataset
Training features such as bag of words or PoS tags
Lexicon-based
Polarity lexicons (dictionaries)
Main strategy: machine learning + polarity lexicons + shallow
syntactic information to detect polarity shifters
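The lexicon-based part of this strategy, combined with a polarity shifter, might look roughly like the sketch below. It covers only the lexicon and shifter side, not the machine learning component. The toy lexicon, the shifter list and the rule that a shifter flips the next polar word are all assumptions made for illustration.

```python
# Toy polarity lexicon and shifter list (hypothetical, for illustration only).
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}
SHIFTERS = {"not", "never", "no"}


def lexicon_polarity(tokens):
    """Return -1, 0 or 1 from lexicon hits, flipping the polar word after a shifter."""
    score = 0
    negate = False
    for tok in tokens:
        if tok in SHIFTERS:
            # Simplification: the shifter stays active until the next polar word.
            negate = True
            continue
        value = LEXICON.get(tok, 0)
        if value:
            score += -value if negate else value
            negate = False
    # Collapse the accumulated score to its sign.
    return (score > 0) - (score < 0)


if __name__ == "__main__":
    print(lexicon_polarity("this is not good".split()))
```

Here "not good" is scored as negative because the shifter inverts the polarity of "good".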
Architecture of the System
Tweet Mining Module
Streaming API consumer
Consumes a sample stream (around 1%)
Not enough
Web scraper
Acquires tweets from the Twitter web interface
Multi-threaded
Loop queries based on a term list
Storage under Apache HBase
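The scraper loop can be sketched as a pool of threads, each querying one term and writing only unseen tweets to a shared store. This is a sketch under stated assumptions: `fetch_tweets` is a hypothetical stand-in for a real scraper query, `TweetStore` is an in-memory substitute for the HBase table, and the sample data is invented.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Invented sample data standing in for what a query would return.
SAMPLE = [
    ("1", "elections and government"),
    ("2", "government news"),
    ("3", "weather today"),
]


def fetch_tweets(term):
    """Hypothetical stand-in for one scraper query: (tweet_id, text) pairs."""
    return [(tid, text) for tid, text in SAMPLE if term in text.split()]


class TweetStore:
    """In-memory substitute for the HBase table, keyed by tweet ID."""

    def __init__(self):
        self._rows = {}
        self._lock = Lock()

    def put(self, tweet_id, text):
        """Store a tweet; return True only if it was not already present."""
        with self._lock:
            if tweet_id in self._rows:
                return False
            self._rows[tweet_id] = text
            return True

    def __len__(self):
        return len(self._rows)


def mine(terms, store, workers=4, rounds=1):
    """Query every term `rounds` times in parallel; return the count of new tweets."""

    def worker(term):
        new = 0
        for _ in range(rounds):
            for tweet_id, text in fetch_tweets(term):
                new += store.put(tweet_id, text)
        return new

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(worker, terms))


if __name__ == "__main__":
    store = TweetStore()
    print(mine(["elections", "government"], store), len(store))
```

Keying the store by tweet ID means a tweet matched by several terms is counted once, which corresponds to counting unique collected tweets.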
Sentiment Analysis Module 1/4
CitiusSentiment
Naive Bayes classifier
Optimal time performance
Reasonable accuracy
Independence among linguistic features
Multilingual
Spanish, English, Portuguese and Galician
Lexicon-based features
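For readers unfamiliar with Naive Bayes, the following is a small multinomial variant with add-one smoothing, trained on invented examples. It is not the CitiusSentiment implementation; the class name, the smoothing choice and the training data are assumptions. It only illustrates the independence assumption: each token contributes its own log-probability to the score.

```python
import math
from collections import Counter


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training examples
        self.vocab = set()

    def train(self, examples):
        """Accumulate counts from (tokens, label) pairs."""
        for tokens, label in examples:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the label with the highest log-posterior for the tokens."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                # Features are treated as independent given the label.
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train([
        (["love", "it"], "positive"),
        (["great", "day"], "positive"),
        (["hate", "it"], "negative"),
        (["awful", "day"], "negative"),
    ])
    print(nb.predict(["love", "day"]))
```

Training and prediction are both simple counting passes, which is what makes this family of classifiers fast.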
Sentiment Analysis Module 2/4
Strategy & features
The annotated corpus only contains positive and negative
examples of tweets
The tweet is considered neutral if it does not contain any
word within the polarity lexicon
Precision higher than 80%
Pre-processing (URLs, hashtags, emoticons, etc.)
Considered features
Lemmas
Multiwords
Polarity lexicons (10,850 –English–, 4,564 –Spanish–)
Valence shifters
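The pre-processing step and the neutral rule could be sketched as follows. The placeholder tokens (`URL`, `EMO_POS`, `EMO_NEG`), the tiny lexicon and the decision to treat emoticon placeholders as lexicon entries are assumptions for illustration, not details taken from the system.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Hypothetical emoticon mapping and toy polarity lexicon.
EMOTICONS = {":)": "EMO_POS", ":(": "EMO_NEG"}
POLARITY_LEXICON = {"good", "bad", "EMO_POS", "EMO_NEG"}


def preprocess(text):
    """Normalize URLs, hashtags and emoticons, then lowercase the other tokens."""
    tokens = []
    for raw in text.split():
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
        elif URL_RE.fullmatch(raw):
            tokens.append("URL")
        elif raw.startswith("#") and len(raw) > 1:
            # Keep the hashtag word itself, without the '#'.
            tokens.append(raw[1:].lower())
        else:
            tokens.append(raw.lower().strip(".,!?"))
    return [t for t in tokens if t]


def is_neutral(tokens):
    """Neutral rule: no token of the tweet appears in the polarity lexicon."""
    return not any(t in POLARITY_LEXICON for t in tokens)


if __name__ == "__main__":
    print(preprocess("Good news #Good http://t.co/x :)"))
    print(is_neutral(preprocess("Meeting at noon http://t.co/x")))
```

Tweets that pass the neutral rule never reach the positive/negative classifier, which is how a corpus with only positive and negative examples can still yield three classes.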
Sentiment Analysis Module 3/4
Integration into a Big Data infrastructure
Perldoop for the translation of the classifier to Java
MapReduce application
Mappers
- Every tweet to be processed must match the query terms
- If so, the text is processed through the classifier modules
- Two key-value pairs are emitted
To increment the counter of successfully processed tweets
To indicate the polarity of the processed tweet (-1, 0 or 1)
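The mapper logic described above can be sketched as a generator that emits the two key-value pairs. The key names (`"processed"`, `"polarity"`), the rule that matching any query term is enough, and the `toy_classify` stand-in for the classifier are assumptions; the real mappers are Java code produced with Perldoop.

```python
def toy_classify(text):
    """Hypothetical stand-in for the classifier: returns -1, 0 or 1."""
    words = text.lower().split()
    if "good" in words:
        return 1
    if "bad" in words:
        return -1
    return 0


def tweet_mapper(tweet_text, query_terms, classify):
    """Emit two key-value pairs for a tweet that matches the query terms."""
    tokens = set(tweet_text.lower().split())
    # Assumption: a tweet matches if it contains at least one query term.
    if not tokens & set(query_terms):
        return
    polarity = classify(tweet_text)
    yield "processed", 1         # increments the processed-tweet counter
    yield "polarity", polarity   # carries the polarity of this tweet


if __name__ == "__main__":
    print(list(tweet_mapper("gobierno is good", {"gobierno"}, toy_classify)))
```

Tweets that do not match emit nothing, so they affect neither the counter nor the score.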
Sentiment Analysis Module 4/4
Integration into a Big Data infrastructure
Reducer
- Computes the total number of processed tweets
- Sums the polarity scores
- Calculates the positivity ratio
Positivity ratio
A normalized value between 0 and 1
- Negative: < 0.45
- Positive: > 0.55
Contrasts the positive tweets with the negative ones
\[
\text{positivity ratio} = \frac{\dfrac{\sum \text{polarities}}{\text{No. of tweets}} + 1}{2}
\]
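As a worked example of the positivity ratio, the sketch below averages the polarities, shifts the result from [-1, 1] to [0, 2] and halves it. The function names are illustrative, and the label `"neutral"` for ratios between the two thresholds is an assumption, since the slide only names the negative and positive ranges.

```python
def positivity_ratio(polarities):
    """Map the mean polarity from [-1, 1] onto [0, 1]."""
    n = len(polarities)
    if n == 0:
        raise ValueError("at least one processed tweet is required")
    return (sum(polarities) / n + 1) / 2


def ratio_label(ratio):
    """Apply the thresholds: below 0.45 is negative, above 0.55 is positive."""
    if ratio < 0.45:
        return "negative"
    if ratio > 0.55:
        return "positive"
    return "neutral"  # assumed name for the in-between band


if __name__ == "__main__":
    ratio = positivity_ratio([1, 1, -1, 0])
    print(ratio, ratio_label(ratio))
```

For the polarities 1, 1, -1 and 0 the sum is 1 over 4 tweets, so the ratio is (0.25 + 1) / 2 = 0.625, which falls in the positive range.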
Web interface
Performance results 1/5
Tweet Mining Module evaluation
Average number of unique collected tweets per second, filtered by language.
Lists of terms
5,000 most frequent words in Spanish
5,000 most frequent words in English
Terms distributed over 72 threads
Much higher performance with the full TMM system
Performance results 2/5
Sentiment analysis evaluation
F-score and ranking of our system (CitiusSentiment) in the TASS
competition: Sentiment analysis on Spanish tweets.
F-score and ranking of our system (CitiusSentiment) in the SemEval competition, namely task 9 focused on
sentiment analysis in Twitter (only English microtexts).
Performance results 3/5
Evaluation of the Big Data infrastructure 1/3
Successfully processed tweets, matches and positivity ratio for the selected Spanish terms.
Processing time (in minutes) for the selected Spanish terms.
Performance results 4/5
Evaluation of the Big Data infrastructure 2/3
[Figure: System speedup with respect to the number of HBase regions (1, 2, 4, 8, 16, 32) for the terms corrupción, gobierno and elecciones]
Performance results 5/5
Evaluation of the Big Data infrastructure 3/3
68 popular Spanish terms selected as targets for the TMM
Manual splits of the original HBase table with a single region
50 GiB of RAM per node for YARN containers (5 nodes)
The system scales well when there are enough physical
resources to handle the launched tasks
A higher number of splits with a low number of matches
causes overhead
Conclusions
Twitter is a large source of short texts with opinions
Performing sentiment analysis on tweets is challenging
Our classifier performs above average in two competitions
Big Data technologies help speed up the sentiment analysis
process
The Tweet Mining Module improves the number of captured
tweets
Evolution of the System
A real-time system
Apache Storm for real-time processing
Inter-module RAM-based buffers for faster I/O
Apache Spark for real-time queries on the polarized
tweets
RESTful API & web interface
- Exploration of popular terms
- Custom real-time queries
- Results shown as charts, divided into custom time intervals
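Dividing results into custom time intervals might be done along the lines of the sketch below, which groups polarized tweets into fixed-width buckets and computes the positivity ratio for each. The function name, the integer Unix-style timestamps and the fixed-width bucketing are assumptions; the planned system may organize this differently.

```python
from collections import defaultdict


def ratio_per_interval(tweets, interval_seconds):
    """Group (timestamp, polarity) pairs into buckets and compute each ratio."""
    buckets = defaultdict(list)
    for timestamp, polarity in tweets:
        # Bucket key is the start of the interval that contains the timestamp.
        start = timestamp - timestamp % interval_seconds
        buckets[start].append(polarity)
    return {
        start: (sum(values) / len(values) + 1) / 2
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    sample = [(0, 1), (30, -1), (60, 1), (90, 1)]
    print(ratio_per_interval(sample, 60))
```

Each bucket yields one point on a chart, so changing `interval_seconds` changes the chart resolution without reprocessing the tweets.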
Thank you!
Rodrigo Martínez-Castaño: rodrigo.martinez@usc.es
Juan C. Pichel: juancarlos.pichel@usc.es
Pablo Gamallo: pablo.gamallo@usc.es