This document describes a system for performing sentiment analysis on tweets in multiple languages using big data technologies. The system includes a tweet mining module that collects tweets from Twitter's API and via web scraping. It stores the tweets in Apache Hadoop's HBase database. A sentiment analysis module uses a Naive Bayes classifier with lexicon and part-of-speech features to classify tweets in several languages as positive, negative or neutral. MapReduce programs integrate the classifier into the big data infrastructure to analyze millions of tweets quickly. Evaluation results show the classifier performs well compared to other methods and the system scales to handle large volumes of tweets efficiently using Hadoop clusters.

Sentiment Analysis on Multilingual Tweets using Big Data Technologies

Rodrigo Martínez-Castaño
Juan C. Pichel
Pablo Gamallo

Centro Singular de Investigación en Tecnoloxías da Información
Universidade de Santiago de Compostela

Index
1. Introduction
2. Background & Related Work
3. Architecture of the System
4. Tweet Mining Module
5. Sentiment Analysis Module
6. Performance Results
7. Conclusions
8. Evolution of the System

Centro de Investigación en Tecnoloxías da Información (CiTIUS) 2

Introduction 1/2
Sentiment Analysis
• Consists of identifying the opinion expressed in a text
• Twitter is a large source of short texts
• Useful conclusions can be drawn from huge amounts of text
Analysing tweets is a big challenge
• Human subjectivity
• Tweets are often too short to be linguistically analysed

Introduction 2/2
Parallel architecture using Big Data
• Standard solutions cannot handle GBs or TBs of text in a reasonable time
• Apache Hadoop cluster with HBase
Goals
• The sentiment classifier should perform as well as other state-of-the-art classifiers in different languages
• Millions of tweets should be processed in a short time

Background & Related Work 1/2
Big Data processing

MapReduce programming model
• Two phases: map and reduce
• Inputs and outputs are key-value pairs

Apache Hadoop
• HDFS (filesystem)
• YARN (resource-management platform)
• MapReduce framework
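
The two phases and the key-value contract can be illustrated with a minimal in-memory sketch. This is not Hadoop code: `word_mapper`, `sum_reducer` and `run_job` are hypothetical names chosen for illustration, and the grouping step stands in for the shuffle that the framework performs between map and reduce.

```python
from collections import defaultdict

def word_mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    """Reduce: collapse all values that share a key into one total."""
    return key, sum(values)

def run_job(records, mapper, reducer):
    """Run map, group the pairs by key (the shuffle step), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

counts = run_job(["big data", "big tweets"], word_mapper, sum_reducer)
print(counts)
```

In a real cluster the mapper calls run in parallel over input splits and the grouped keys are distributed across reducers; the sketch only shows the data flow.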

Background & Related Work 2/2
Sentiment analysis
• Machine learning
  - Learning algorithms over a known dataset
  - Training features such as bag of words or PoS tags
• Lexicon-based
  - Polarity lexicons (dictionaries)
• Main strategy: machine learning + polarity lexicons + shallow syntactic information to detect polarity shifters

Architecture of the System

Tweet Mining Module
Streaming API consumer
• Consumes a sample stream (around 1% of public tweets)
• Not enough on its own

Web scraper
• Acquires tweets from the Twitter web interface
• Multi-threaded
• Loops over queries based on a term list
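
A rough sketch of the multi-threaded query loop follows, under stated assumptions: `fetch_tweets` is a hypothetical stand-in that returns fabricated sample pairs instead of contacting Twitter, and `scrape_once` is an illustrative name. The slides do not describe how the real scraper issues requests or removes duplicates, so the de-duplication by tweet ID shown here is only one plausible approach.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_tweets(term):
    """Hypothetical stand-in for one search query.

    Returns (tweet_id, text) pairs. A real scraper would request and
    parse the Twitter web interface here; this stub returns sample data.
    """
    return [(f"{term}-{i}", f"sample text about {term}") for i in range(2)]

def scrape_once(terms, max_workers=4):
    """Query every term on a thread pool and keep each tweet ID once."""
    unique = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in pool.map(fetch_tweets, terms):
            for tweet_id, text in batch:
                unique.setdefault(tweet_id, text)
    return unique

# One pass over the term list; the scraper would repeat this in a loop.
tweets = scrape_once(["gobierno", "elecciones", "gobierno"])
print(len(tweets))
```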

Storage under Apache HBase

Sentiment Analysis Module 1/4
CitiusSentiment

Naive Bayes classifier
• Optimal time performance
• Reasonable accuracy
• Assumes independence among linguistic features
Multilingual
• Spanish, English, Portuguese and Galician
Lexicon-based features
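
To make the independence assumption concrete, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is a textbook formulation over plain tokens, not the CitiusSentiment implementation; the `train` and `predict` names and the toy training data are assumptions made for illustration.

```python
import math
from collections import Counter

def train(examples):
    """Count labels and per-label tokens from (tokens, label) pairs."""
    label_counts = Counter()
    token_counts = {}
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log-probability.

    Each token contributes an independent factor, which is the
    'naive' independence assumption. Unknown tokens are skipped.
    """
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for this example.
model = train([
    (["good", "great"], "positive"),
    (["bad", "awful"], "negative"),
])
print(predict(model, ["great"]))
```

Because scoring is a sum of per-token log terms, classification cost grows linearly with tweet length, which is consistent with the time-performance point above.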

Sentiment Analysis Module 2/4
Strategy & features
• The annotated corpus only contains positive and negative examples of tweets
• A tweet is considered neutral if it does not contain any word from the polarity lexicon
  - Precision higher than 80%
• Pre-processing (URLs, hashtags, emoticons, etc.)
• Considered features
  - Lemmas
  - Multiwords
  - Polarity lexicons (10,850 for English, 4,564 for Spanish)
  - Valence shifters
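
The neutral rule can be sketched as a gate in front of the binary classifier. Everything named here is an assumption for illustration: `LEXICON` is a tiny invented lexicon, and `lexicon_vote` is a simple stand-in for the trained positive/negative classifier, not the Naive Bayes model itself.

```python
# Tiny invented polarity lexicon: word -> +1 (positive) or -1 (negative).
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}

def lexicon_vote(tokens):
    """Stand-in binary classifier: sign of the summed lexicon scores."""
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    return 1 if score >= 0 else -1

def tweet_polarity(tokens, lexicon=LEXICON, classify=lexicon_vote):
    """Return 0 (neutral) when no token is in the lexicon.

    Otherwise defer to a classifier that only knows the positive (1)
    and negative (-1) classes, as in the annotated corpus.
    """
    if not any(tok in lexicon for tok in tokens):
        return 0
    return classify(tokens)

print(tweet_polarity(["the", "bus", "arrives", "at", "nine"]))
print(tweet_polarity(["awful", "service"]))
```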

Sentiment Analysis Module 3/4
Integration into a Big Data infrastructure

Perldoop translates the classifier from Perl to Java

MapReduce application
• Mappers
  - Every tweet to be processed must match the query terms
  - If so, the text is processed through the classifier modules
  - Two key-value pairs are emitted
    · One to increment the counter of successfully processed tweets
    · One to record the polarity of the processed tweet (-1, 0 or 1)
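
The mapper logic described above might look roughly like the following sketch. The key names `"count"` and `"polarity"`, the substring matching, and the `classify` callback are assumptions; the slides do not specify the actual keys or how query terms are matched, and the real mapper is Java code running on Hadoop.

```python
def tweet_mapper(tweet_text, query_terms, classify):
    """Emit two key-value pairs for a tweet that matches the query.

    Tweets that match none of the query terms produce no output.
    `classify` is any function returning a polarity of -1, 0 or 1.
    """
    text = tweet_text.lower()
    if not any(term.lower() in text for term in query_terms):
        return
    polarity = classify(tweet_text)
    yield "count", 1            # one more successfully processed tweet
    yield "polarity", polarity  # polarity of this tweet

# Usage with a trivial stand-in classifier that always answers "positive".
pairs = list(tweet_mapper("El gobierno va bien", ["gobierno"], lambda t: 1))
print(pairs)
```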

Sentiment Analysis Module 4/4
Integration into a Big Data infrastructure
• Reducer
  - Computes the total number of processed tweets
  - Sums the polarities into a total score
  - Calculates the positivity ratio

• Positivity ratio
  - A normalized value between 0 and 1
    · Negative < 0.45
    · Positive > 0.55
  - Contrasts the positive tweets with the negative ones
$$\text{positivity ratio} = \frac{\dfrac{\sum \text{polarities}}{\text{No. of tweets}} + 1}{2}$$

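
A small sketch of the reducer-side computation: the mean polarity lies in [-1, 1], so adding 1 and halving maps it to [0, 1]. The function names are illustrative, and treating the band from 0.45 to 0.55 as "neutral" is an assumption, since the slide only states the negative and positive thresholds.

```python
def positivity_ratio(polarities):
    """Normalize the mean polarity (-1, 0 or 1 per tweet) to [0, 1]."""
    n = len(polarities)
    if n == 0:
        raise ValueError("no processed tweets")
    return (sum(polarities) / n + 1) / 2

def ratio_label(ratio):
    """Apply the thresholds from the slide; the middle band is assumed neutral."""
    if ratio < 0.45:
        return "negative"
    if ratio > 0.55:
        return "positive"
    return "neutral"

ratio = positivity_ratio([1, 1, -1, 0])
print(ratio, ratio_label(ratio))
```

For four tweets with polarities 1, 1, -1 and 0, the mean is 0.25, so the ratio is (0.25 + 1) / 2 = 0.625, which falls above the 0.55 threshold.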
Web interface

Performance results 1/5
Tweet Mining Module evaluation

Average number of unique collected tweets per second, filtered by language.

Lists of terms
• 5,000 most frequent words in Spanish
• 5,000 most frequent words in English
Terms distributed over 72 threads
Much higher performance with the full Tweet Mining Module (TMM) system

Performance results 2/5
Sentiment analysis evaluation

F-score and ranking of our system (CitiusSentiment) in the TASS competition: sentiment analysis on Spanish tweets.

F-score and ranking of our system (CitiusSentiment) in the SemEval competition, namely Task 9, focused on sentiment analysis in Twitter (English microtexts only).

Performance results 3/5
Evaluation of the Big Data infrastructure 1/3

Successfully processed tweets, matches and positivity ratio for the selected Spanish terms.

Processing time (in minutes) for the selected Spanish terms.

Performance results 4/5
Evaluation of the Big Data infrastructure 2/3

Figure: System speedup with respect to the number of HBase regions (1, 2, 4, 8, 16 and 32) for the terms corrupción, gobierno and elecciones.

Performance results 5/5
Evaluation of the Big Data infrastructure 3/3

• 68 popular Spanish terms selected as targets for the TMM
• Manual splits of the original HBase table, which had a single region
• 50 GiB of RAM per node for YARN containers (5 nodes)
• The system scales quite well when there are enough physical resources to handle the launched tasks
• A high number of splits with a low number of matches causes overhead

Conclusions
• Twitter is a large source of short texts with opinions
• Performing sentiment analysis on tweets is challenging
• Our classifier performs above average in two competitions
• Big Data technologies help speed up the sentiment analysis process
• The Tweet Mining Module increases the number of captured tweets

Evolution of the System
A real-time system
• Apache Storm for real-time processing
• Inter-module RAM-based buffers for faster I/O
• Apache Spark for real-time queries on the polarized tweets
• RESTful API & web interface
  - Exploration of popular terms
  - Custom real-time queries
  - Results charted over custom time intervals

Thank you!

Rodrigo Martínez-Castaño: rodrigo.martinez@usc.es

Juan C. Pichel: juancarlos.pichel@usc.es
Pablo Gamallo: pablo.gamallo@usc.es
