
The Art of Social Media Analysis with Twitter & Python

http://www.oscon.com/oscon2012/public/schedule/detail/23130

krishna sankar @ksankar

Update [Aug 5, 2012]
I have a set of blog posts annotating these slides; I have tried to add more details on what was covered at the tutorial. Let me know if you need more explanations and I will update the blog.
http://doubleclix.wordpress.com/2012/08/04/big-data-with-twitter-api-twitter-tips-a-bakers-dozen/

Intro
API, Objects,

o House Rules (1 of 2)

o Doesn't assume any knowledge of the Twitter API
o Goal: everybody on the same page & get a working knowledge of the Twitter API
o To bootstrap your exploration into Social Network Analysis & Twitter
o Simple programs, to illustrate usage & data manipulation

Twitter Network Analysis Pipeline

We will analyze @clouderati: 2072 followers, exploding to ~980,000 distinct users one level down.

o NLP, NLTK, Sentiment Analysis
o #tag network
o @mention network
o Retweet analytics, information contagion
o Cliques, social graph
o Growth, weak ties

Intro
API, Objects,

o House Rules (2 of 2)
o Am using the requests library
o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help.
o Many areas to explore, not enough time. So decided to focus on the social graph, cliques & networkx.


About Me
Director of Engineering & Data Science / Data Scientist/AWS Ops Guy at genophen.com
o Co-chair, 2012 IEEE Precision Time Synchronization - http://www.ispcs.org/2012/index.html
o Blog: http://doubleclix.wordpress.com/
o Quora: http://www.quora.com/Krishna-Sankar

Prior Gigs

o Lead Architect (Egnyte)
o Distinguished Engineer (CSCO)
o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!

Current Focus:

o Design, build & ops of BioInformatics/Consumer infrastructure on AWS, MongoDB, Solr, Drupal, GitHub
o Big Data (more of variety, variability, context & graphs, than volume or velocity so far!)
o Overlay based semantic search & ranking - http://goo.gl/P1rhc

Other related presentations:

o Big Data Engineering - Top 10 Pragmatics (Summary) - http://goo.gl/0SQDV
o The Art of Big Data (Detailed) - http://goo.gl/EaUKH
o The Hitchhiker's Guide to Kaggle - OSCON 2011 Tutorial

Twitter Tips - A Baker's Dozen

1. Twitter APIs are (more or less) congruent & symmetric.
2. Twitter is usually right & simple - recheck when you get unexpected results before blaming Twitter.
   o I was getting numbers when I was expecting screen_names in user objects. Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong - screen_name instead of user_id.
   o Always test with one or two records before a long run! - learned the hard way.
3. Twitter APIs are very powerful - consistent use can bear huge data.
   o In a week, you can pull in 4-5 million users & some tweets!
   o Night runs are far faster & more error-free.
4. Use a NOSQL data store as a command buffer & data buffer.
   o Would make it easy to work with Twitter at scale. I use MongoDB.
   o Keep the schema simple & no fancy transformation - as far as possible, the same as the (json) response.
   o Use the NOSQL CLI for trimming records et al.
5. Always use a big data pipeline.
   o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize.
   o That way you can orthogonally extend, with functional components like command buffers, validation et al.

Twitter Tips - A Baker's Dozen

6. Use a functional approach for a scalable pipeline.
   o Compose your big data pipeline with well defined granular functions, each doing only one thing.
   o Don't overload the functional components (i.e. no collect, unroll & store as a single component).
   o Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques. This did create some trouble for me, as we will see later.
7. Crawl-Store-Validate-Recrawl-Refresh cycle.
   o The equivalent of the traditional ETL.
   o The validation stage & validation routines are important: you cannot expect perfect runs, and cannot manually look at the data either, when data is at scale.
8. Have control numbers to validate runs & monitor them.
   o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
   o There will be a separate printout of the control numbers that will be kept in the operations files.

Twitter Tips - A Baker's Dozen

9. Program defensively.
   o More so for REST-based big data analytics systems.
   o Expect failures at the transport layer & accommodate for them.
10. Have Erlang-style supervisors in your pipeline.
   o Fail fast & move on. Don't linger and try to fix errors that cannot be controlled at that layer.
   o A higher-layer process will circle back and do incremental runs to correct missing spiders and crawls.
   o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions.
   o I have an example in Part 2.
11. Data will never be perfect.
   o Know your data & accommodate for its idiosyncrasies - for example: 0 followers, protected users, 0 friends, ...

Twitter Tips - A Baker's Dozen

12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache.
   o See a MongoDB example in Part 2.
13. Don't bombard the URL. (See the sketch after this list.)
   o Wait a few seconds between calls. You will end up with a scalable system, eventually.
   o I found 10 seconds to be the sweet spot. 5 seconds gave retry errors; was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in!
14. Always measure the elapsed time of your API runs & processing.
   o Kind of an early warning when something is wrong.
15. Develop incrementally; don't fail to check cut & paste errors.
16. The Twitter big data pipeline has lots of opportunities for parallelism.
   o Leverage data parallelism frameworks like MapReduce.
   o But first: prototype as a linear system; optimize and tweak the functional modules & cache strategies; note down stages and tasks that can be parallelized; then parallelize them.
   o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial.

Twitter Tips - A Baker's Dozen

17. Pay attention to handoffs between stages.
   o They might require transformation - for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation.
   o But resist the urge to overload collect with transform - i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents.
   o Add transformation as a granular function - of course, with appropriate buffering, caching, checkpoints & restart techniques.
18. Have a good log management system to capture and wade through logs.
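A minimal sketch of tips 13 & 14 together (hypothetical helper, Python 2.7 with the requests library from the setup slide) - pace the calls and measure elapsed time as an early warning:

    import time
    import requests

    def timed_get(url, payload, pause=10):
        # Tip 13: wait between calls - 10 seconds was the sweet spot
        # Tip 14: measure elapsed time of each API call
        start = time.time()
        r = requests.get(url, params=payload)
        print "elapsed %.2fs, status %s" % (time.time() - start, r.status_code)
        time.sleep(pause)
        return r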

Twitter Tips A Bakers Dozen


19. Understand the underlying network characteristics for the inference you want to make.
   o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph.
   o The Twitter Network is more of an Interest Network.
   o So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense.
   o But others, like Cliques and Bipartite Graphs, do.

Twitter Gripes

1. Need richer APIs for #tags
   o Somewhat similar to users, viz. followers, friends et al.
   o Might make sense to make #tags a top-level object with its own semantics.
2. HTTP error returns are not uniform
   o Returns 400 Bad Request instead of 420.
   o Granted, there is enough information to figure this out.
3. Need an easier way to get screen_name from user_id.
4. following vs. friends_count - i.e. following is a dummy variable.
   o There are a few like this, most probably for backward compatibility.
5. Parameter validation is not uniform
   o Gives 404 Not Found instead of 406 Not Acceptable or 413 Too Long or 416 Range Unacceptable.
6. Overall, more validation would help
   o Granted, it is more growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out.

A Fork

o Deep dive into Tweets - NLP, NLTK & Sentiment Analysis
o Social Graph

Not enough time for both - I chose the Social Graph route.


A minute about Twitter as platform & its evolution



https://dev.twitter.com/blog/delivering-consistent-twitter-experience

".. we want to make sure that the Twitter experience is straightforward and easy to understand - whether you're on Twitter.com or elsewhere on the web" - Michael

"The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." - Chenda, CBS News

My Wish & Hope
o I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive.
o I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn.
  - I don't think showing Tweets in LinkedIn took anything away from the Twitter experience.
  - The LinkedIn experience & Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that.
o I sincerely hope that the platform grows with a rich developer ecosystem. An orthogonally extensible platform is essential - of course, along with a congruent user experience: a core Twitter consumption experience through consistent tools.

For Hands on Today

Setup

o Python 2.7.3
o easy_install -v requests
  http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
o easy_install -v requests-oauth
o Hands-on programs at https://github.com/xsankar/oscon2012-handson

For advanced data science with social graphs:

o easy_install -v networkx
o easy_install -v numpy
o easy_install -v nltk (not for this tutorial, but good for sentiment analysis et al)
o MongoDB (I used MongoDB in AWS m2.xlarge, RAID 10 x 8 x 15 GB EBS)
o graphviz - http://www.graphviz.org/; easy_install pygraphviz
o easy_install pydot

Thanks To these Giants


Problem Domain For this tutorial



Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
o Not for Twitter-based apps for real-time tweets
o Not for web sites with real-time tweets

By looking at the domain in aggregate to derive inferences & actionable recommendations. Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend).

Twitter & Comics

http://www.buzzfeed.com/mattbuchanan/the-first-30-tweets-ever
http://frankandernest.com/cgi/view/display.pl?111-02-24

Agenda

I. Mechanics: Twitter API (1:30 PM - 3:00 PM)
   o Essential Fundamentals (Rate Limit, HTTP Codes et al)
   o Objects
   o API
   o Hands-on (2:45 PM - 3:00 PM)
II. Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
   o Underlying Concepts
   o Social Graph Analysis of @clouderati - Stages, Strategies & Tasks - Code Walk-thru


Twitter API: Read These First

Using the Twitter Brand
o New logo & associated guidelines: https://twitter.com/about/logos
o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
o Developer Rules of the Road: https://dev.twitter.com/terms/api-terms

Read These Links First
1. https://dev.twitter.com/docs/things-every-developer-should-know
2. https://dev.twitter.com/docs/faq
3. Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
4. Security: https://dev.twitter.com/docs/security-best-practices
5. Media Best Practices: https://dev.twitter.com/media
6. Consolidated Page: https://dev.twitter.com/docs
7. Streaming APIs: https://dev.twitter.com/docs/streaming-apis
8. How to Appeal (not that you all would need it!): https://support.twitter.com/articles/72585

Only one version of the Twitter APIs

API Status Page
o https://dev.twitter.com/status
o https://dev.twitter.com/issues
o https://dev.twitter.com/discussions

http://www.buzzfeed.com/tommywilhelm/google- users-being-total-dicks-about-the-twitter

Open This First


Install pre-reqs as per the setup slide.

Run:
o oscon2012_open_this_first.py - to test connectivity (canary query)
o oscon2012_rate_limit_status.py - use http://www.epochconverter.com to check reset_time
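If you'd rather not leave the terminal, the same conversion epochconverter.com does is one line of Python:

    import time

    reset_epoch = 1341366831  # an x-ratelimit-reset value from a header
    print time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(reset_epoch))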

Twitter API - formats: xml, json, atom & rss

o REST (Twitter REST)
  - Core Data, Core Twitter Objects
  - Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet
  - Rate Limit: 150/350
o Search & Trends (Twitter Search)
  - Keywords, Specific User, Trends
  - Rate Limit: Complexity & Frequency
o Streaming
  - Near-realtime, High Volume
  - Follow users, topics, data mining
  - Public Streams, User Streams, Site Streams, Firehose
Rate Limit

Rate Limits - by API type & Authentication Mode

API        | No authC               | authC  | Error when limited
-----------|------------------------|--------|-------------------
REST       | 150/hr                 | 350/hr | 400
Search     | Complexity & Frequency | -N/A-  | 420
Streaming  | Up to 1%               |        |
Firehose   | none                   | none   |

Rate Limit Header


{ "status": "200 OK", "vary": "Accept-Encoding", "x-frame-options": "SAMEORIGIN", "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6", "x-ratelimit-class": "api", "x-ratelimit-limit": "150", "x-ratelimit-remaining": "149", "x-ratelimit-reset": "1340467358", "x-runtime": "0.04144", "x-transaction": "2b49ac31cf8709af", "x-transaction-mask": "a6183a5f8ca9431b53b5644ef114df9d6bba" }

Rate-Limited Header

{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "150",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:48:25 GMT",
  "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
  "server": "tfe",
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341363230",
  "x-runtime": "0.01126"
}

Rate Limit Example


Run
o oscon2012_rate_limit_02.py
o It iterates through a list to get followers; the list is 2072 long.

First response - last time it gave me 5 min; now the reset timer is 1 hour (150 calls, not authenticated):
{
  "date": "Wed, 04 Jul 2012 00:54:16 GMT",
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.02768",
  "x-transaction": "f1bafd60112dddeb",
  "x-transaction-mask": "a6183a5f8ca9431b53b5644ef11417281dbc"
}

And the rate limit kicked in:
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:55:04 GMT",
  "status": "400 Bad Request",
  "transfer-encoding": "chunked",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.01342"
}

API with OAuth - api_identified class, 1 hr reset, 350 calls

{
  "date": "Wed, 04 Jul 2012 01:32:01 GMT",
  "etag": "\"dd419c02ed00fc6b2a825cc27wbe040\"",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
  "pragma": "no-cache",
  "server": "tfe",
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-access-level": "read",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341369121",
  "x-runtime": "0.05539",
  "x-transaction": "9f8508fe4c73a407",
  "x-transaction-mask": "a6183a5f8ca9431b53b5644ef11417281dbc"
}

The rate limit resets during consecutive calls (+1 hour between resets):
{ "date": "Thu, 05 Jul 2012 14:56:05 GMT", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "133", "x-ratelimit-reset": "1341500165" } ******** 2416
{ "date": "Thu, 05 Jul 2012 14:56:18 GMT", "status": "200 OK", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "349", "x-ratelimit-reset": "1341503776" } ******** 2417

Unexplained Errors

While trying to get details of 1,000,000 users, I get this error - usually 10-6 AM PST. Got around it by trap & wait 5 seconds. Night runs are relatively error-free.

Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C... (a batch of 100 user_ids, elided here)

After trapping the error & sleeping:
{
  "date": "Fri, 06 Jul 2012 03:41:09 GMT",
  "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
  "server": "tfe",
  "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341546334",
  "x-runtime": "0.01918"
}
Error, sleeping
{
  "date": "Fri, 06 Jul 2012 03:46:12 GMT",
  "status": "200 OK",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349"
}

A Day in the Life of the Twitter Rate Limit
o Missed by 4 min!
o OK after 5 min sleep.
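The "trap & wait" workaround is a few lines; a sketch (hypothetical helper name) that catches the transport-layer ConnectionError and retries:

    import time
    import requests

    def get_with_retry(client, url, payload, pause=5, max_tries=3):
        # Trap transport failures (tips 9 & 10) and wait before retrying;
        # anything still failing is left for the validate/re-crawl pass.
        for attempt in range(max_tries):
            try:
                return client.get(url, params=payload)
            except requests.exceptions.ConnectionError:
                print "connection error, sleeping", pause, "seconds"
                time.sleep(pause)
        return None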

Strategies
I have no exotic strategies, so far!
1. Obvious: track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism - for example, AWS instances

http://www.epochconverter.com/ <- useful to debug the timer

Please share your tips and tricks for conserving the Rate Limit.

Authentication

Authentication
Three modes:
o Anonymous
o HTTP Basic Auth
o OAuth

As of Aug 31, 2010, only Anonymous or OAuth are supported.
OAuth enables the user to authorize an application without sharing credentials; it also has the ability to revoke access.
Twitter supports OAuth 1.0a. OAuth 2.0 is the new standard, much simpler - but no timeframe for Twitter support yet.

OAuth Pragmatics
Helpful Links:
o https://dev.twitter.com/docs/auth/oauth
o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html

Discussion of OAuth's internal mechanisms is better left for another day.
For headless applications to get an OAuth token, go to https://dev.twitter.com/apps, create an application & get the four credential pieces:
o Consumer Key, Consumer Secret, Access Token & Access Token Secret

All the frameworks have support for OAuth, so plug in these values & use the framework's calls. I used the requests-oauth library like so:

request-oauth

    import requests
    from oauth_hook import OAuthHook  # pypi: requests-oauth

    def get_oauth_client():
        # Get the client using the token, key & secret from dev.twitter.com/apps
        consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
        consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
        access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
        access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
        header_auth = True
        oauth_hook = OAuthHook(access_token, access_token_secret,
                               consumer_key, consumer_secret, header_auth)
        client = requests.session(hooks={'pre_request': oauth_hook})
        return client

    def get_followers(user_id):
        url = 'https://api.twitter.com/1/followers/ids.json'
        payload = {"user_id": user_id}
        # if a cursor is needed: {"cursor": -1, "user_id": user_id}
        r = requests.get(url, params=payload)

Use the client instead of requests:

    def get_followers_with_oauth(user_id, client):
        url = 'https://api.twitter.com/1/followers/ids.json'
        payload = {"user_id": user_id}
        r = client.get(url, params=payload)

Ref: http://pypi.python.org/pypi/requests-oauth
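When an account has more than 5,000 followers, the ids come back a page at a time; a minimal cursor walk, building on get_followers_with_oauth above (a sketch - the v1 response carries ids and next_cursor):

    import json

    def get_all_followers(user_id, client):
        # Walk the REST cursor until Twitter returns next_cursor == 0
        # (v1 followers/ids returns up to 5,000 ids per page)
        url = 'https://api.twitter.com/1/followers/ids.json'
        ids, cursor = [], -1
        while cursor != 0:
            r = client.get(url, params={"user_id": user_id, "cursor": cursor})
            data = json.loads(r.text)
            ids.extend(data["ids"])
            cursor = data["next_cursor"]
        return ids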

OAuth Authorize screen


The user authenticates with Twitter & grants access to Forbes Social. Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account.

HTTP Status Codes

HTTP Status Codes

o 0 - Never made it to the Twitter servers (library error)
o 200 OK
o 304 Not Modified
o 400 Bad Request
  - Check the error message for an explanation
  - REST Rate Limit!
o 401 Unauthorized
o 403 Forbidden
  - Hit an update limit (> max Tweets/day, following too many people)
o 404 Not Found
o 406 Not Acceptable
o 413 Too Long
o 416 Range Unacceptable
o 420 Enhance Your Calm
  - Rate Limited
o 500 Internal Server Error
o 502 Bad Gateway
  - Down for maintenance
o 503 Service Unavailable
  - Overloaded - Fail whale
  - Beware: you could get this for other reasons as well
o 504 Gateway Timeout

https://dev.twitter.com/docs/error-codes-responses

HTTP Status Code - Example


{ "cache-control": "no-cache, max-age=300", "content-encoding": "gzip", "content-length": "91", "content-type": "application/json; charset=utf-8", "date": "Sat, 23 Jun 2012 00:06:56 GMT", "expires": "Sat, 23 Jun 2012 00:11:56 GMT", "server": "tfe", "status": "401 Unauthorized", "vary": "Accept-Encoding", "www-authenticate": "OAuth realm=\"https://api.twitter.com\"", "x-ratelimit-class": "api", "x-ratelimit-limit": "0", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1340413616", "x-runtime": "0.01997" } { "errors": [ { "code": 53, "message": "Basic authentication is not supported" } ] }

Detailed error message in JSON ! I like this

HTTP Status Code Confusing Example


GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true

o Spelling mistake: should be screen_name
o But a confusing error! Should be 406 Not Acceptable or 413 Too Long, showing a parameter error

{ "pragma": "no-cache", "server": "tfe", "status": "404 Not Found" }
{ "errors": [ { "code": 34, "message": "Sorry, that page does not exist" } ] }

HTTP Status Code - Example


{ "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0", "content-encoding": "gzip", "content-length": "112", Sometimes, the errors are "content-type": "application/json;charset=utf-8", "date": "Sat, 23 Jun 2012 01:23:47 GMT", not correct. I got this error "expires": "Tue, 31 Mar 1981 05:00:00 GMT", for user_timeline.json w/ user_id=20,15,12 "status": "401 Unauthorized", "www-authenticate": "OAuth realm=\"https://api.twitter.com\"", Clearly a parameter error "x-frame-options": "SAMEORIGIN", (i.e. more parameters) "x-ratelimit-class": "api", "x-ratelimit-limit": "150", "x-ratelimit-remaining": "147", "x-ratelimit-reset": "1340417742", "x-transaction": "d545a806f9c72b98" } { "error": "Not authorized", "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20" }

Objects

Twitter Platform Objects

o Users
  - Friends: users one follows
  - Followers: users one is followed by
o Tweets (Status Updates)
  - Temporally ordered into a Timeline
o Entities (embedded in tweets)
  - user_mentions, urls, media, hashtags
o Places

https://dev.twitter.com/docs/platform-objects

Tweets
A.k.a. Status Updates. Interesting fields:
o coordinates <- geo location
o created_at
o entities (will see later)
o id, id_str
o possibly_sensitive
o user (will see later)
o "perspectival" attributes - embedded within a child object of an unlike parent; hard to maintain at scale
  https://dev.twitter.com/docs/faq#6981
o withheld_in_countries
  https://dev.twitter.com/blog/new-withheld-content-fields-api-responses

https://dev.twitter.com/docs/platform-objects/tweets

A word about id, id_str


June 1, 2010
o Snowflake - the id generator service
o The full ID is composed of a timestamp, a worker number, and a sequence number
o Had problems with JavaScript handling numbers > 53 bits
o id: 819797
o id_str: "819797"

http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake

Tweets - example
Let us run oscon2012-tweets.py. In the example tweet, note:
o coordinates
o id
o id_str

Users
Interesting fields:
o followers_count
o geo_enabled
o id, id_str
o name, screen_name
o protected
o status, statuses_count
o withheld_in_countries

https://dev.twitter.com/docs/platform-objects/users

Users - Let us run some examples

Run:
o oscon_2012_users.py - lookup users by screen_name
o oscon12_first_20_ids.py - lookup users by user_id

Inspect the results:
o id, name, status, statuses_count, protected, followers (for the top 10 followers), withheld users

You can use this information for customizing the user's screen in your web app.

Entities
Metadata & contextual information. You could parse tweets yourself, but Entities parse them out for you as structured data.
o REST API/Search API: include_entities=1
o Streaming API: included by default
o hashtags, media, urls, user_mentions

https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper

Entities
Run
o oscon2012_entities.py

Inspect hashtags, urls et al
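In the spirit of oscon2012_entities.py (a sketch - the actual script may differ), pulling the structured entities out of a tweet looks like this:

    import json
    import requests

    url = "https://api.twitter.com/1/statuses/user_timeline.json"
    payload = {"screen_name": "twitterapi", "count": 1, "include_entities": 1}
    tweet = json.loads(requests.get(url, params=payload).text)[0]

    entities = tweet["entities"]
    print "hashtags:", [h["text"] for h in entities["hashtags"]]
    print "urls    :", [u["url"] for u in entities["urls"]]
    print "mentions:", [m["screen_name"] for m in entities["user_mentions"]]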

Places
Interesting fields:
o attributes
o bounding_box
o id (as a string!)
o country
o name

https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes

Places
Can search for tweets near a place, like so:
o Get the lat/long of the convention center: [45.52929,-122.66289]
o Tweets near that place
o Tweets near San Jose: [37.395715,-122.102308]

We will not go further here, but this is very useful.
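For reference, the Search API's geocode parameter does this kind of near-a-place query; a minimal sketch against the v1 search endpoint:

    import json
    import requests

    # Tweets within 1 mile of the Oregon Convention Center
    params = {"q": "oscon", "geocode": "45.52929,-122.66289,1mi", "rpp": 10}
    r = requests.get("http://search.twitter.com/search.json", params=params)
    for result in json.loads(r.text)["results"]:
        print result["from_user"], ":", result["text"]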

Timelines
Collections of tweets ordered by time. Use max_id & since_id for navigation.

https://dev.twitter.com/docs/working-with-timelines
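A minimal sketch of walking a timeline backwards with max_id, per the working-with-timelines doc:

    import json
    import requests

    url = "https://api.twitter.com/1/statuses/user_timeline.json"
    params = {"screen_name": "twitterapi", "count": 200}
    tweets = []
    while True:
        batch = json.loads(requests.get(url, params=params).text)
        if not batch:
            break
        tweets.extend(batch)
        # Next page: everything strictly older than the oldest tweet seen
        params["max_id"] = batch[-1]["id"] - 1
    print len(tweets), "tweets collected"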

Other Objects & APIs


o Lists
o Notifications
o Friendships/exists - to see if one user follows another

Twitter Platform Objects (diagram repeated) - https://dev.twitter.com/docs/platform-objects

Hands-on Exercise (15 min)

o Setup environment - slide #14
o Sanity-check environment & libraries:
  - oscon2012_open_this_first.py
  - oscon2012_rate_limit_status.py
o Get objects (show calls):
  - Lookup users by screen_name - oscon12_users.py
  - Lookup users by id - oscon12_first_20_ids.py
  - Lookup tweets - oscon12_tweets.py
  - Get entities - oscon12_entities.py
o Inspect the results, explore a little bit, discussion

Twitter APIs

(Twitter API overview diagram - REST, Search & Trends, Streaming - repeated.)

Twitter REST API

o https://dev.twitter.com/docs/api
o What we have been using so far is the REST API
o Request-Response; Anonymous or OAuth
o Rate Limited: 150/350

Twitter Trends
o oscon2012-trends.py
o Trends/weekly, Trends/monthly

Let us run some examples:
o oscon2012_trends_daily.py
o oscon2012_trends_weekly.py

Trends & hashtags
o #hashtag euro2012 - http://hashtags.org/euro2012
o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/

Brand Rank w/ Twitter

o Walk-through & results: followed 10 user-brands for a few days to find growth
o Brand Rank: growth of a brand w.r.t. the industry
o A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
o Code: oscon2012_brand_01.py

API: url='https://api.twitter.com/1/users/lookup.json'
payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}

Brand Rank w/ Twitter
o Clouderati is very stable

Brand Rank w/ Twitter - Tech Brands
o Google I/O showed a spike on 6/27-6/28; OReillyMedia shares some of the spike
o Looking at a few days' worth of data, our best inference is that oscon doesn't track with googleio
o Clouderati doesn't track at all

Brand Rank w/ Twitter - World of Soccer
o FOXSoccer & UEFAcom track each other
o The numbers seldom decrease, so calculating -ve velocity will not work. OTOH, if you see a -ve velocity, investigate

Brand Rank w/ Twitter - World of Basketball
o NBA, MiamiHeat & okcthunder track each other
o Used % rather than absolute numbers to compare
o The hike from 7/6 to 7/10 is interesting

Brand Rank w/ Twitter - Rising Tide
o For some reason, all numbers are going up 7/6 thru 7/10 - except for clouderati!
o Is a rising (Twitter) tide lifting all (well, almost all)?

Trivia: Search API

o Search (search.twitter.com) was built by Summize, which was acquired by Twitter in 2008
o Summize described itself as "sentiment mining"

Search API
o Very simple: GET http://search.twitter.com/search.json?q=<blah>
o Based on a search criteria
o The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets
o Recent = the last 6-9 days' worth of tweets
o Anonymous calls
o Rate Limit: not number of calls/hour, but Complexity & Frequency

https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search

Search API - Filters
o Search is URL encoded: @ = %40, # = %23
o Emoticons :) and :(
  http://search.twitter.com/search.atom?q=sometimes+%3A)
  http://search.twitter.com/search.atom?q=sometimes+%3A(
o Location filters, date filters
o Content searches

Streaming API
o Not request-response, but a stream
o The Twitter frameworks have support for it
o Rate Limit: up to 1%
o Stall warning if the client is falling behind
o Good documentation links:
  - https://dev.twitter.com/docs/streaming-apis/connecting
  - https://dev.twitter.com/docs/streaming-apis/parameters
  - https://dev.twitter.com/docs/streaming-apis/processing

Firehose
o ~400 million public tweets/day
o If you are working with the Twitter firehose, I envy you!
o If you hit real limits, then explore the firehose route. AFAIK it is not cheap, but worth it.

API Best Practices


1. Use JSON.
2. Use user_id rather than screen_name.
   o user_id is constant, while screen_name can change.
3. max_id and since_id.
   o For example, direct messages: if you have the last message, use since_id for the search.
   o max_id - how far to go back.
4. Cache as much as you can.
5. Set the User-Agent header for debugging.

I have listed a few good blogs that have API best practices in the Reference section at the end of this presentation. These are gathered from various books, blogs & other media I used for this tutorial - see References (at the end) for the sources.

(Twitter API overview diagram - REST, Search & Trends, Streaming - repeated.)

Questions ?


Part II
SNA

Part II - Twitter Network Analysis

1. Collect
2. Store
3. Transform & Analyze
4. Model & Reason
5. Predict, Recommend & Visualize
   (with a validate-dataset & re-crawl/refresh loop feeding back into Collect)

Tip 1: Implement as a staged pipeline, never a monolith.
Tip 3: Keep the schema simple; don't be afraid to transform.

Most important & the ugliest slide in this deck!

Trivia
Social Network Analysis originated as Sociometry & the social network was called a sociogram. Back then, "Facebook" was called the SocioBinder! Jacob Levy Moreno is considered the originator.
o NYTimes, April 3, 1933, p. 17

Twitter Networks - Definitions

Nodes
o Users
o #tags

Edges (directed)
o Follows
o Friends
o @mentions
o #tags

Twitter Networks - Definitions

In-degree
o Followers

Out-degree
o Friends/Follows

Centrality Measures
Hubs & Authorities
o Hubs/Directories tell us where Authorities are
o "Of Mortals & Celebrities" is more Twitter-style

Twitter Networks - Properties

Concepts from Citation Networks (node letters refer to the example graph in the slides):
o Cocitation - common papers that cite a paper = Common Followers
  - e.g. C & G (followed by F & H)
o Bibliographic Coupling - citing the same papers = Common Friends (i.e. following the same person)
  - e.g. D, E, F & H follow C
  - H & F follow C & G, so H & F have high coupling
  - Hence, if H follows A, we can recommend that F follow A

Twitter Networks - Properties

Bipartite/Affiliation Networks
o Two disjoint subsets
o The bipartite concept is very relevant to the Twitter social graph
o Membership in Lists - a lists-vs-users bipartite graph
o Common #tags in tweets - a #tags-vs-members bipartite graph
o @mentioned together? Can this be a bipartite graph? How would we fold it?
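A minimal networkx sketch of a lists-vs-users bipartite graph and its one-mode "fold" (hypothetical user & list names):

    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()
    # Users on one side, Twitter lists on the other
    B.add_nodes_from(["alice", "bob", "carol"], bipartite=0)
    B.add_nodes_from(["cloud-list", "devops-list"], bipartite=1)
    B.add_edges_from([("alice", "cloud-list"), ("bob", "cloud-list"),
                      ("bob", "devops-list"), ("carol", "devops-list")])

    # Fold onto the user side: users are linked if they share a list
    users = bipartite.projected_graph(B, ["alice", "bob", "carol"])
    print users.edges()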

Other Metrics & Mechanisms
(Not covered here, but potential for an encore!)

Kronecker Graph Models
o The Kronecker product is a way of generating self-similar matrices
o Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
o Application: generating models for analysis, prediction, anomaly detection et al

Erdos-Renyi Random Graphs
o Easy to build a G(n,p) graph
o Assumes equal likelihood of edges between any two nodes
o In a Twitter social network, we can create a more realistic expected distribution (adding the social-reality dimension) by inspecting the #tags & @mentions

Network Diameter, Weak Ties
Follower velocity (+ve & -ve), association strength
o Unfollow - not a reliable measure
o But an interesting property to investigate when it happens

Ref: Jure Leskovec - Kronecker Graphs, Random Graphs
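networkx can generate the G(n,p) model directly; a one-liner sketch (n and p chosen arbitrarily here to match the @clouderati follower count):

    import networkx as nx

    # Erdos-Renyi G(n,p): 2072 nodes, each possible edge present with p = 0.01
    G = nx.gnp_random_graph(2072, 0.01)
    print G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges"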

Twitter Networks - Properties

o Twitter != LinkedIn, Twitter != Facebook; Twitter Network == Interest Network
o Be cognizant of the above when you apply traditional network properties to Twitter
o For example:
  - Six degrees of separation doesn't make sense (most of the time) in Twitter, except maybe for Cliques
  - Is diameter a reliable measure for a Twitter network? Probably not
  - Do cut sets make sense? Probably not
  - But citation network principles do apply; we can learn from cliques
  - Bipartite graphs do make sense

Cliques (1 of 2)
o A maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other
o A cohesive subgroup, closely connected
o In practice, near-cliques rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
o k-plex cliques to discover subgroups in a sparse network; a 1-plex being the perfect clique

Ref: Networks, An Introduction - Newman

Cliques (2 of 2)
o k-core - connected to at least k others in the subset; an (n-k)-plex
o k-clique - no more than distance k away
  - Path inside or outside the subset
  - k-clan or k-club (path inside the subset)
o We will apply k-plex cliques for one of our hands-on exercises

Ref: Networks, An Introduction - Newman
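networkx ships a maximal-clique enumerator (Bron-Kerbosch); a toy sketch of the idea - the @clouderati analysis later applies the same call at scale:

    import networkx as nx

    G = nx.Graph()
    # Mutual-follow pairs become undirected "strong tie" edges
    G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

    for clique in nx.find_cliques(G):  # maximal cliques
        print clique                   # ['a', 'b', 'c'] and ['c', 'd']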

Sentiment Analysis

Sentiment Analysis is important & interesting work on the Twitter platform:
o Collect Tweets
o Opinion Estimation - pass through a Classifier & Sentiment Lexicons (Naive Bayes / Max Entropy Classifier / SVM)
o Aggregated Text Sentiment / Moving Average

I chose not to dive deeper because of time constraints:
o Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs.

The next 3 slides have a couple of interesting examples.

Sentiment Analysis
o Twitter Mining for Airline Sentiment
o Opinion Lexicon: +ve 2000, -ve 4800

http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon

Need I say more?

"A bit of clever math can uncover interesting patterns that are not visible to the human eye."

http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf

Project Ideas

Interesting Vectors of Exploration

1. Find trending #tags & then related #tags using cliques over co-#tag-citation, which infers topics related to trending topics.
2. Related #tag topics over a set of tweets by a user or group of users.
3. Analysis of In/Out flow, Tweet Flow.
4. Find affiliation networks by List memberships, #tags or frequent @mentions.
5. Use centrality measures to determine mortals vs. celebrities.
6. Classify tweet networks/cliques based on message-passing characteristics.
7. Retweet Network
   o Tweets vs. Retweets, number of retweets
   o Measure influence by retweet count & frequency
   o Information contagion, by looking at different retweet network subcomponents - who, when, how much, ...

Twitter Network Graph Analysis - An Example

Analysis Story Board

@clouderati is a popular cloud-related Twitter account. Goals:
o In this tutorial:
  - Analyze the social graph characteristics of the users who are following the account; dig one level deep, to the followers & friends of the followers of @clouderati
  - How many cliques? How strong are they?
o For you to explore!!
  - Does the @mention network support the clique inferences?
  - What are the retweet characteristics?
  - How does the #tag network graph look?

Twitter Analysis Pipeline Story Board - Stages, Strategies, APIs & Tasks

Stage 3
o Get the distinct user list by applying the set(union(list)) operation

Stage 4
o Get & store user details (distinct user list)
o Unroll
o Note: needed a command buffer to manage scale (~980,000 users); the unroll stage took time & missteps

Stage 5
o For each @clouderati follower, find friend=follower - set intersection

Stage 6
o Create the social graph
o Apply network theory
o Infer cliques & other properties

@clouderati Twitter Social Graph

Stats (in retrospect, after the runs):
o Stage 1 - @clouderati has 2072 followers
o Stage 2 - limiting followers to 5,000 per user
o Stage 3 - digging one level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
o MongoDB of the cache and intermediate datasets: ~10 GB
o The database was hosted at AWS (Hi-Mem XLarge m2.xlarge), 8 x 15 GB, RAID 10, opened to the Internet with DB authentication

Code & Run Walk Through - Stage 1
o Get @clouderati followers; store in MongoDB
o Code: oscon_2012_user_list_spider_01.py
o Challenges: nothing fancy - get the record and store
o Interesting points:
  - Would have had to recurse through a REST cursor if there were more than 5,000 followers
  - @clouderati has 2072 followers
Code & Run Walk Through - Stage 2
o Crawl one level deep: get friends & followers; validate, re-crawl & refresh
o Code:
  - oscon_2012_user_list_spider_02.py
  - oscon_2012_twitter_utils.py
  - oscon_2012_mongo.py
  - oscon_2012_validate_dataset.py
o Challenges: multiple runs, errors et al!
o Interesting points:
  - Set operation between two mongo collections for the restart buffer
  - Protected users; some had 0 followers or 0 friends
  - Interesting operations for validate, re-crawl and refresh
  - Added status_code to differentiate protected users: {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
  - Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)

Validate-Recrawl-Refresh Logs

o 1st run - 132 bad records
o This is the classic Erlang-style supervisor: the crawl continues on transport errors without worrying about retry; validate will recrawl & refresh as needed

pymongo version = 2.2
Connected to DB!
2075
Error Friends : <type 'exceptions.KeyError'>
43cd40e5557c00c7000000 - none has 2072 followers & 0 friends
Error Friends : <type 'exceptions.KeyError'>
43a958e5557cfc58000000 - none has 2072 followers & 0 friends
Error Friends : <type 'exceptions.KeyError'>
43ccdee5557c00b6000000 - none has 2072 followers & 0 friends
43d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends
43d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends
43d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends
43d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends
43d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends
4475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends
4475d3e5557c165700007d - 327286780 has 0 followers & 0 friends
Looks like we have 132 not so good records
Elapsed Time = 0.546846

Code & Run Walk Through - Stage 3
o Get the distinct user list by applying the set(union(list)) operation
o Code: oscon2012_analytics_01.py
o Challenges: figure out the right set operations
o Interesting points:
  - 973,323 unique users!
  - Recursively apply set union over ~400,000 lists
  - The set operations took slightly more than a minute
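A minimal sketch of the Stage 3 set union (hypothetical collection layout - one follower/friend id list per crawled user):

    import pymongo

    db = pymongo.Connection("localhost", 27017)["oscon"]

    distinct_users = set()
    for doc in db.user_lists.find({}, {"followers": 1, "friends": 1}):
        # set.update is the running union over all the id lists
        distinct_users.update(doc.get("followers", []))
        distinct_users.update(doc.get("friends", []))
    print len(distinct_users), "distinct users"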

Code & Run Walk Through - Stage 4
o Get & store user details (distinct user list); unroll
o Code:
  - oscon2012_analytics_01.py (focus on the cmd string creation)
  - oscon2012_get_user_info_01.py
  - oscon2012_unroll_user_list_01.py
  - oscon2012_unroll_user_list_02.py
o Challenges: Where do I start? In the next few slides. Took me a few days to get it right (along with my day job!)
o Interesting points:
  - Unfortunately I did not employ parallelism & didn't use my MacPro with 32 GB memory, so the runs were long
  - But learned hard lessons on checkpoint & restart
  - Tracking control numbers
  - Time: a marathon unroll run - 19:33:33!

Twitter @ Scale Pattern

o Challenge: you want to get screen names, follower counts and other details for a million users
o Problem: no easy REST API; https://api.twitter.com/1/users/lookup.json will take 100 user_ids and give details
o Solution: this is a scalability challenge - approach it like so (a sketch follows):
  - Create a command buffer collection in MongoDB, splitting a million user_ids into batches of 100
  - Have a done flag initialized to 0 for checkpoint & restart
  - After each cmd string is executed, set done:1
  - For subsequent runs, ignore done:1
  - Also helps in control number tracking
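A minimal sketch of the command-buffer pattern (collection & field names - api_str, t_users_info, seq_no, done - follow the control-number slides; pymongo 2.2-era API):

    import json
    import pymongo
    import requests

    db = pymongo.Connection("localhost", 27017)["oscon"]

    def build_command_buffer(user_ids):
        # Split the ids into batches of 100 - one command doc per API call
        for seq, i in enumerate(range(0, len(user_ids), 100)):
            batch = ",".join(str(u) for u in user_ids[i:i + 100])
            db.api_str.insert({"seq_no": seq, "api_str": batch, "done": 0})

    def run_command_buffer(client):
        url = "https://api.twitter.com/1/users/lookup.json"
        for cmd in db.api_str.find({"done": 0}):
            r = client.get(url, params={"user_id": cmd["api_str"]})
            if r.status_code == 200:
                for user in json.loads(r.text):
                    db.t_users_info.insert(user)
                # Checkpoint: mark the command done so a restart skips it
                db.api_str.update({"_id": cmd["_id"]},
                                  {"$set": {"done": 1}})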

Control Numbers

> db.t_users_info.count()
8122
> db.api_str.count({"done":0,"seq_no":{"$lt":8185}})
63
> db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1})
{ "_id" : ObjectId("44daeae5557c28bf001d53"), "seq_no" : 5433 }
{ "_id" : ObjectId("44daeae5557c28bf001d59"), "seq_no" : 5439 }
{ "_id" : ObjectId("44daeae5557c28bf001d5f"), "seq_no" : 5445 }
{ "_id" : ObjectId("44daebe5557c28bf001d74"), "seq_no" : 5466 }
{ "_id" : ObjectId("44daece5557c28bf001d7a"), "seq_no" : 5472 }
{ "_id" : ObjectId("44daece5557c28bf001d80"), "seq_no" : 5478 }
{ "_id" : ObjectId("44daede5557c28bf001d90"), "seq_no" : 5494 }
{ "_id" : ObjectId("44daefe5557c28bf001daf"), "seq_no" : 5525 }

The collection should have 8185 documents, but it has only 8122. Where did the rest go? 63 of them still have done=0; 8122 + 63 = 8185! Aha, mystery solved - they fell through the cracks. Need a catch-all final run.

Day in the Life of a Control Number Detective - Run #1

Remember: 973,323 users, so 9,734 cmd strings (100 users per string).

> db.api_str.count()
9831
> db.api_str.count({"done":0})
239
> db.t_users_info.count()
9592
> db.api_str.count({"api_str":""})
97

So we should have 9831 - 97 = 9734 records. The second run should generate 9734 - 9592 = 142 calls (i.e. 350 - 142 = 208 rate-limit should remain). Let us see:

{ "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "209" }

Yep, 209 left.

Day in the Life of a Control Number Detective - Run #2

> db.t_users_info.count()
9728
> db.api_str.count({"api_str":""})
97
> db.api_str.count({"done":0})
103

9734 - 9728 = 6, same as 103 - 97! Run once more!

> db.api_str.find({"done":0},{"seq_no":1})
{ "_id" : ObjectId("44dbd4e5557c28bf002e22"), "seq_no" : 9736 }
{ "_id" : ObjectId("44db05e5557c28bf001f47"), "seq_no" : 5933 }
{ "_id" : ObjectId("44db8be5557c28bf0028f6"), "seq_no" : 8412 }
{ "_id" : ObjectId("44dba2e5557c28bf002a8c"), "seq_no" : 8818 }
{ "_id" : ObjectId("44dbaee5557c28bf002b69"), "seq_no" : 9039 }
{ "_id" : ObjectId("44dbb8e5557c28bf002c1c"), "seq_no" : 9218 }

{ "x-ratelimit-limit": "350", "x-ratelimit-remaining": "344" }

Yep, 6 more records:

> db.t_users_info.count()
9734

Good, got 9734!

Professor Layton would be proud !


In fact, I have all four & plan to spend some time with them & a Laphroaig!

Monitor runs & track control numbers

Unroll run 8:48 PM to ~4:08 PM next day !

Track errors & the document numbers

Code & Run Walk Through - Stage 5
o For each @clouderati follower, find friend=follower - set intersection
o Code:
  - oscon2012_find_strong_ties_01.py
  - oscon2012_social_graph_stats_01.py
o Challenges: none - Python set operations made this easy
o Interesting points:
  - Even at this scale, a single machine is not enough; should have tried data parallelism
  - This task is well suited to leverage data parallelism, as it is commutative & associative
  - Was getting an invalid cursor error from MongoDB, so had to do the updates in two steps
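A minimal sketch of the friend=follower intersection (hypothetical document layout from Stage 2):

    import pymongo

    db = pymongo.Connection("localhost", 27017)["oscon"]

    for doc in db.user_lists.find({}, {"user_id": 1, "followers": 1, "friends": 1}):
        # A strong tie = a mutual follow, i.e. follower-set & friend-set intersection
        strong = set(doc.get("followers", [])) & set(doc.get("friends", []))
        db.strong_ties.insert({"user_id": doc["user_id"],
                               "ties": list(strong)})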

Code & Run Walk Through - Stage 6
o Create the social graph; apply network theory; infer cliques & other properties
o Code: oscon2012_find_cliques_01.py
o Challenges: memory!
o Interesting points:
  - Lots of good information hidden in the data!
  - Graph, list & set operations
  - networkx has lots of interesting graph algorithms
  - collections.Counter to the rescue
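A minimal sketch of Stage 6 (assuming the hypothetical strong_ties layout from the Stage 5 sketch): build the graph from strong-tie edges, enumerate maximal cliques, and tally with collections.Counter:

    from collections import Counter
    import networkx as nx
    import pymongo

    db = pymongo.Connection("localhost", 27017)["oscon"]

    G = nx.Graph()
    for doc in db.strong_ties.find():
        for tie in doc["ties"]:
            G.add_edge(doc["user_id"], tie)

    clique_sizes = Counter()
    membership = Counter()
    for clique in nx.find_cliques(G):    # maximal cliques
        clique_sizes[len(clique)] += 1   # -> the clique distribution below
        for user in clique:
            membership[user] += 1        # who shows up in the most cliques
    print clique_sizes
    print membership.most_common(10)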

Twitter Social Graph Analysis of @clouderati

o 2072 followers; 973,323 unique users one level down, w/ followers/friends trimmed at 5,000
o Strong ties (follower = friend): 235,697 users, 462,419 edges
o 501,367 cliques
o 8,906 cliques w/ > 10 users, containing 253 unique users
o GeorgeReese is in 7,973 of them! (See the list for the first 125.)
o krishnan 3,446; randy 2,197; joe 1,977; sam 1,937; jp 485; stu 403; urquhart 263; beaker 226; acroll 149; adrian 63; gevaperry 24
o Of course, clique analysis does not tell us the whole story

Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}

Twitter Social Graph Analysis of @clouderati

o Sorting by followers vs. sorting by strong ties is interesting:
  - Celebrity: very low strong ties
  - Higher celebrity: low strong ties
  - Medium celebrity: medium strong ties
o A higher strong-ties number is interesting - it means a very high follower-friend intersection (Reeves 62%, bgolden 85%)
o But a high clique count with smaller strong ties shows a more cohesive & stronger social graph (e.g. Krishnan 15% friends-followers, Samj 33%)

Twitter Social Graph Analysis of @clouderati

o Ideas for more exploration:
  - Include all followers (instead of stopping at the 5,000 cap)
  - Get tweets & track @mentions - frequent @mentions show stronger ties
  - #tag analysis could show some interesting networks

Recap: Twitter Tips - A Baker's Dozen & Twitter Gripes (the slides from the opening section, repeated)

Thanks To these Giants

I had a good time researching & preparing for this Tutorial.  I hope you learned a few new things & have a few vectors to follow
