On Real-Time Twitter Analysis
Mikio L. Braun (mikiobraun), TWIMPACT UG (haftungsbeschränkt), with Matthias Jugel (thinkberg). Apache Hadoop Get Together, Berlin, April 28, 2012. [Link] [Link]
Apache Hadoop Get Together, April 18, 2012, Berlin
2012 TWIMPACT
Big Data and Data Science and Social Media
There's a lot you can do with social media data:
- Trend analysis (trending topics)
- Sentiment analysis
- Impact analysis (Klout, Kred, etc.)
- More general studies (diameter of the network, distribution patterns, etc.)

Types of data
- Event streams (Twitter stream)
- Graph data (user relationships, retweet networks)
- Text data (sentiment analysis, word clouds)
- URLs
Social Media Streaming Data
Examples:
- Twitter firehose/sprinkler
- Click-through data [Link]
- URL resolution requests

Some numbers:
- up to a few thousand events per second
- events are small, up to a few kilobytes
What's in a Tweet?
- Tweet
- Hashtag
- Link
- User
- Mention
- Keywords
- Retweeting user
- Retweeted user
- Retweeted tweet
- Timestamp
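As a rough illustration, these parts can be pulled out of the JSON payload the Twitter API delivers. The field names below (`entities.hashtags`, `retweeted_status`, etc.) follow the classic Twitter REST API; the helper itself is a sketch, not TWIMPACT code:

```python
import json

def tweet_anatomy(raw):
    """Extract the parts listed above from a Twitter API JSON payload."""
    t = json.loads(raw)
    entities = t.get("entities", {})
    rt = t.get("retweeted_status")  # present only for native retweets
    return {
        "timestamp": t.get("created_at"),
        "user": t.get("user", {}).get("screen_name"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "links": [u["expanded_url"] for u in entities.get("urls", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "retweeted_user": rt["user"]["screen_name"] if rt else None,
        "retweeted_tweet": rt["text"] if rt else None,
    }
```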
TWIMPACT - Retweet trends
- Trending by retweet activity
- Robust matching of tweets, even if shortened or (slightly) edited
- Compute trends for links, hashtags, URLs
- Aggregate TWIMPACT score for users
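The talk doesn't spell out TWIMPACT's matching scheme; a minimal sketch of "robust matching" might normalize away the RT prefix, mentions, links, and punctuation, then compare word shingles. The shingle size `k` and the threshold are illustrative assumptions, not the actual algorithm:

```python
import re

def normalize(text):
    """Reduce a tweet to its message core for matching."""
    text = re.sub(r'^\s*RT\s+@\w+:?\s*', '', text)   # manual-retweet prefix
    text = re.sub(r'@\w+|https?://\S+', '', text)    # mentions and links
    text = re.sub(r'[^\w\s]', '', text)              # punctuation
    return re.sub(r'\s+', ' ', text.lower()).strip()

def similarity(a, b, k=3):
    """Jaccard similarity over k-word shingles of the normalized texts."""
    def shingles(t):
        w = normalize(t).split()
        return {tuple(w[i:i + k]) for i in range(max(1, len(w) - k + 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def same_retweet(a, b, threshold=0.6):
    return similarity(a, b) >= threshold
```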
How to scale stream processing?
History of approaches
- Started in June 2009
- Free Twitter stream (capped at 50 tweets/s)

[Table: versions 1-3 by language and storage backend; Version 3 uses stream mining with in-memory storage]
Putting it all in a database

Insert millions of rows into the database, get reports with:

SELECT *, COUNT(*) FROM events
WHERE created_at > … AND created_at < …
GROUP BY id
ORDER BY COUNT(*) DESC
LIMIT 100;

Hardly real-time. Also, databases become slower and slower...
NoSQL: Cassandra
- Structure: families (tables), rows, key/value pairs
- Easy clustering (peer-to-peer configuration)
- Flexible consistency, read-repair, hinted handoff, etc.
- No locking; (in 0.6.x) no support for indices; counters required a complete rewrite
- Operations profile (about 50:50 read/write)
Cassandra: Multithreading

Multithreading helps (but without locking support?)

[Chart: tweets per second vs. seconds for 2, 4, 8, 16, 32, and 64 threads; Core i7, 4 cores (2 + 2 HT)]
Cassandra: Configuration

[Chart: flush and compaction activity; memtables, indexes, etc. Memtable size: 128 MB, JVM heap: 3 GB, 12 column families]
Cassandra: Configuration (cont.)

[Chart: tweets per second over time, with dips during compaction and a big GC pause]
NoSQL/Cassandra - Summary
- Works quite well, faster than PostgreSQL (from 200 to 600 tweets/s)
- Lack of locking/indices requires a lot of manual management
- Configuration is messy
- 4-node cluster vs. single node: the single node is consistently 1.5-3 times faster!
- Ultimately becomes slower and slower
- Doesn't handle deletions gracefully
Stream processing frameworks

Stream processing = scalable actor-based concurrency. For example:
- Twitter's (BackType's) Storm [Link]
- Yahoo's S4 [Link]
- Esper [Link]
- StreamBase [Link]
Stream processing: some thoughts
- Maximum throughput is hard to estimate
- Not everything can be parallelized
- A scalable storage system is still necessary
- How to deal with failure/congestion?
- Persistent messaging middleware may not be what you want
The DataSift infrastructure

[Link]

Pipeline: Parse -> Augment Content -> Custom Filters -> Delivery (plus monitoring & accounting)

- 936 CPU cores
- Analyzes 250 million tweets per day
- Peak throughput: 120,000 tweets/s
- C++, PHP, Java/Scala, Ruby
- MySQL on SSDs, HBase (30 nodes, 400 TB), memcached, Redis for some queues
- 0MQ, Kafka (LinkedIn)

But: 120,000 / 936 = 128.2 tweets per second per core
Principles of Stream Processing
- Keep resource needs constant
- Control maximum processing rates
- Disks are too slow: keep data in RAM
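The first principle, constant resource needs, can be as simple as a fixed-capacity buffer that sheds load under overload instead of growing without bound. A minimal sketch (not the TWIMPACT implementation):

```python
from collections import deque

class BoundedBuffer:
    """Fixed-capacity event buffer: memory use stays constant; when full,
    the oldest events are shed rather than letting the buffer grow."""

    def __init__(self, capacity):
        # deque with maxlen evicts from the head when a new item is appended
        self.events = deque(maxlen=capacity)
        self.dropped = 0

    def offer(self, event):
        if len(self.events) == self.events.maxlen:
            self.dropped += 1  # account for the event about to be shed
        self.events.append(event)
```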
Stream mining

[Diagram: a fixed number of counter slots holding the current heavy hitters with their approximate counts]

- Focus on relevant data, discard the rest
- Provably approximates the true counts
- Keep data in memory

Space Saving algorithm (Metwally, Agrawal, El Abbadi: "Efficient Computation of Frequent and Top-k Elements in Data Streams", International Conference on Database Theory, 2005)
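A minimal Python version of the Space Saving idea: with all slots taken, a new item evicts the current minimum and inherits its count plus one, so estimates may overestimate but never underestimate the true count. This is a sketch of the published algorithm (without the per-slot error terms), not TWIMPACT's code:

```python
class SpaceSaving:
    """Space Saving (Metwally et al., 2005): approximate top-k counts
    over a stream using a fixed number of counter slots."""

    def __init__(self, slots):
        self.slots = slots
        self.counts = {}  # item -> estimated count

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.slots:
            self.counts[item] = 1  # free slot available
        else:
            # evict the item with the smallest count; the newcomer
            # inherits that count + 1 (an overestimate, never an underestimate)
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, n):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]
```

Heavy hitters like highly retweeted tweets stay in the slots; one-off items churn through the minimum slot and are discarded.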
TWIMPACT: Real-time Twitter Retweet Analysis
- Stream mining to keep a hot set of the few hundred thousand most active retweets in memory
- Secondary indices, bipartite graphs, object stores
- Write snapshots to disk for later analysis
- Up to several thousand tweets per second in single-threaded operation
2011 in Retweets
Our Analysis Pipeline

[Diagram: tweets flow through JSON parsing (threads 1..k, synchronized worker threads) into retweet matching & retweet trends (single-threaded); daily snapshots (day 1 ... day n) feed a map-reduce-like stage that analyzes dependent trends (links/hashtags/etc.) and produces trends]
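The thread split above can be sketched as k parser threads feeding a single-threaded trend counter. The `rt_id` field here is a hypothetical stand-in for whatever identifies a matched retweet; the structure, not the field names, is the point:

```python
import json
import queue
import threading

def run_pipeline(raw_tweets, workers=4):
    """k parser threads feed a single-threaded trend stage, mirroring the
    'synchronized workers -> single-threaded trends' split in the diagram."""
    parsed = queue.Queue()

    def parse_worker(chunk):
        for raw in chunk:
            parsed.put(json.loads(raw))  # JSON parsing parallelizes trivially

    # stride-partition the input across k worker threads
    threads = [threading.Thread(target=parse_worker,
                                args=(raw_tweets[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # single-threaded trend stage: count occurrences per retweet id
    trends = {}
    while not parsed.empty():
        tweet = parsed.get()
        key = tweet.get("rt_id")  # hypothetical matched-retweet identifier
        if key is not None:
            trends[key] = trends.get(key, 0) + 1
    return trends
```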
Most retweeted users
Most retweeted tweets
Social network buzz
Summary
- Many interesting challenges in social media; many different data types, including streams.
- MapReduce doesn't really fit stream processing.
- You can't just scale into real-time.
- Principles of stream processing: bounded hot set of data in memory; mine the stream, discard irrelevant data.
- Real-world applications often mix multithreading, stream processing, map-reduce, and single-threaded stages.