http://www.oscon.com/oscon2012/public/schedule/detail/23130
Update [Aug 5, 2012]: I have a set of blogs annotating these slides. I have tried to add more details which were covered at the Tutorial. Let me know if you need more explanations; I will update the blog:
http://doubleclix.wordpress.com/2012/08/04/big-data-with-twitter-api-twitter-tips-a-bakers-dozen/
Intro: API, Objects
o House Rules (1 of 2)
o Doesn't assume any knowledge of the Twitter API
o Goal: Everybody on the same page & a working knowledge of the Twitter API
o To bootstrap your exploration into Social Network Analysis & Twitter
o Simple programs, to illustrate usage & data manipulation
We will analyze @clouderati: 2072 followers, exploding to ~980,000 distinct users down one level
@mention network
Retweet analytics, information contagion
Intro: API, Objects
o House Rules (2 of 2)
o Am using the requests library
o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
o Many areas to explore, not enough time. So decided to focus on the social graph, cliques & networkx
About Me
Director of Engineering & Data Science / Data Scientist / AWS Ops Guy at genophen.com
o Co-chair, 2012 IEEE Precision Time Synchronization http://www.ispcs.org/2012/index.html
o Blog: http://doubleclix.wordpress.com/
o Quora: http://www.quora.com/Krishna-Sankar
Prior Gigs
o Lead Architect (Egnyte), Distinguished Engineer (CSCO)
o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
Current Focus
o Design, build & ops of BioInformatics/Consumer infrastructure on AWS, MongoDB, Solr, Drupal, GitHub
o Big Data (more of variety, variability, context & graphs, than volume or velocity so far!)
o Overlay based semantic search & ranking http://goo.gl/P1rhc
o Big Data Engineering Top 10 Pragmatics (Summary) http://goo.gl/0SQDV
o The Art of Big Data (Detailed) http://goo.gl/EaUKH
o The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
3. / 4. Twitter APIs are very powerful; consistent use can bear huge data
5. Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
o That way you can orthogonally extend, with functional components like command buffers, validation et al
6. Crawl-Store-Validate-Recrawl-Refresh cycle
o The equivalent of the traditional ETL
o Validation stage & validation routines are important
o Cannot expect perfect runs
o Cannot manually look at data either, when data is at scale
7. Compose your big data pipeline with well defined granular functions, each doing only one thing
o Don't overload the functional components (i.e. no collect, unroll & store as a single component)
o Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
o This did create some trouble for me, as we will see later
8. Keep control numbers through the pipeline
o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs! There would be a separate printout of the control numbers, kept in the operations files
o ... more so for a REST-based Big Data analytics system
o Expect failures at the transport layer & accommodate for them
o Wait a few seconds between successive calls. This will end up with a scalable system, eventually
o I found 10 seconds to be the sweet spot. 5 seconds gave a retry error; was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in!
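The trap-and-wait pattern above can be sketched generically. This is a minimal illustration, not the tutorial's actual code; the function name and retry counts are my own, and in practice you would catch only transport errors, not every exception.

```python
import time

def call_with_retry(fn, retries=3, wait_seconds=10):
    """Call fn(); on failure, wait and retry (trap-and-wait pattern).

    wait_seconds=10 reflects the sweet spot noted above.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as e:  # in real code: catch transport errors only
            last_error = e
            time.sleep(wait_seconds)
    raise last_error
```

A fetch that fails twice and then succeeds completes on the third attempt.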
14. Always measure the elapsed time of your API runs & processing
o Kind of an early warning when something is wrong
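Elapsed-time measurement as an early-warning signal (tip 14) can be wrapped in a small context manager; this is a sketch of my own, with a hypothetical `warn_after` threshold, not code from the tutorial scripts.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, warn_after=30.0, log=print):
    """Log the elapsed time of a block; flag runs slower than warn_after seconds."""
    start = time.time()
    yield
    elapsed = time.time() - start
    flag = " SLOW?" if elapsed > warn_after else ""
    log("%s: %.3f s%s" % (label, elapsed, flag))
```

Usage: `with timed("get_followers batch"): ...` around each API run, so a slow crawl stands out in the logs.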
15. Develop incrementally; don't fail to check cut & paste errors
16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce
o But first: prototype as a linear system, optimize and tweak the functional modules & cache strategies, note down stages and tasks that can be parallelized, and then parallelize them
o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Transform data between stages as needed
o Stages might require transformation; for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform; i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
o Add transformation as a granular function, of course with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
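The unroll/flatten stage described above (arrays at collect time, one document per user for the model) might look like this; the field names `user_id` and `followers` are hypothetical, not taken from the tutorial's schema.

```python
def unroll(user_doc):
    """Flatten a collect-stage document {user_id, followers: [...]}
    into one document per follower, for downstream aggregation."""
    return [
        {"user_id": user_doc["user_id"], "follower_id": fid}
        for fid in user_doc.get("followers", [])
    ]
```

Keeping this as its own granular stage (rather than folding it into collect) is exactly the separation tip 17 argues for.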
Twitter Gripes
1. Need more rich APIs for #tags
o Somewhat similar to users, viz. followers, friends et al
o Might make sense to make #tags a top level object with its own semantics
2. Returns 400 Bad Request instead of 420 when rate limited
o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id
4. following vs. friends_count, i.e. following is a dummy variable
o There are a few like this, most probably for backward compatibility
5. Gives 404 Not Found instead of 406 Not Acceptable or 413 Too Long or 416 Range Unacceptable
o Granted, it is more growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
A Fork
4. Sentiment Analysis: deep into Tweets, NLP & NLTK
".. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" - Michael
"The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." - Chenda, CBS News
My Wish & Hope
I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
I did like the fact that tweets were part of LinkedIn. I still used Twitter more than LinkedIn
o I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
o The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
I sincerely hope that the platform grows with a rich developer ecosystem
o An orthogonally extensible platform is essential
o Of course, along with a congruent user experience: a core Twitter consumption experience through consistent tools
Setup
By looking at the domain in aggregate to derive inferences & actionable recommendations
Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
http://www.buzzfeed.com/mattbuchanan/the-first-30-tweets-ever
http://frankandernest.com/cgi/view/display.pl?111-02-24
Agenda
I.
II. Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
o Underlying Concepts
o Social Graph Analysis of @clouderati
o Stages, Strategies & Tasks
o Code Walk thru
https://dev.twitter.com/status
http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
Run
o oscon2012_rate_limit_status.py
o Use http://www.epochconverter.com to check reset_time
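epochconverter.com is handy interactively; in code, the `x-ratelimit-reset` epoch can be rendered with the standard library. A minimal sketch (the helper name is mine):

```python
import time

def reset_time_utc(reset_epoch):
    """Render the x-ratelimit-reset header (epoch seconds) as a UTC timestamp."""
    return time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime(int(reset_epoch)))
```

For example, the reset value 1341366831 seen in the headers later in these slides lands about an hour after the request's Jul 4 00:54 GMT date header.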
Twitter API
o REST
  o Twitter REST: Core Data, Core Twitter Objects; Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet; Rate Limit: 150/350
  o Twitter Search: Keywords, Specific User, Trends; Rate Limit: Complexity & Frequency
o Streaming: Near-realtime, High Volume
  o Public Streams, User Streams, Site Streams, Firehose
Rate Limits
By API type & Authentication Mode
o REST: 150/hr (no authC), 350/hr (authC); rate-limit error: 400
o Search: Complexity & Frequency based; rate-limit error: 420
o Streaming: Upto 1% (of the firehose)
o Firehose: -N/A- (none)
Un-authenticated call (150 calls; last time it gave me 5 min, now the reset timer is 1 hour):
{ "date": "Wed, 04 Jul 2012 00:54:16 GMT", "status": "200 OK", "vary": "Accept-Encoding", "x-frame-options": "SAMEORIGIN", "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8", "x-ratelimit-class": "api", "x-ratelimit-limit": "150", "x-ratelimit-remaining": "147", "x-ratelimit-reset": "1341366831", "x-runtime": "0.02768", "x-transaction": "f1bafd60112dddeb", "x-transaction-mask": "a6183a5f8ca9431b53b5644ef11417281dbc" }
And the Rate Limit kicked in:
{ "cache-control": "no-cache, max-age=300", "content-encoding": "gzip", "content-type": "application/json; charset=utf-8", "date": "Wed, 04 Jul 2012 00:55:04 GMT", "status": "400 Bad Request", "transfer-encoding": "chunked", "vary": "Accept-Encoding", "x-ratelimit-class": "api", "x-ratelimit-limit": "150", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341366831", "x-runtime": "0.01342" }
Authenticated calls; note the reset moving +1 hour between consecutive requests:
******** 2416
{ "date": "Thu, 05 Jul 2012 14:56:05 GMT", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "133", "x-ratelimit-reset": "1341500165" }
******** 2417
{ "date": "Thu, 05 Jul 2012 14:56:18 GMT", "status": "200 OK", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "349", "x-ratelimit-reset": "1341503776" }
Unexplained Errors
Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C... (a long comma-separated list of user_ids)
o While trying to get details of 1,000,000 users, I get this error, usually 10-6 AM PST
o Got around by Trap & wait 5 seconds
o Night runs are relatively error free
{ "date": "Fri, 06 Jul 2012 03:41:09 GMT", "expires": "Fri, 06 Jul 2012 03:46:09 GMT", "server": "tfe", "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT", "status": "400 Bad Request", "vary": "Accept-Encoding", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341546334", "x-runtime": "0.01918" }
Error, sleeping
{ "date": "Fri, 06 Jul 2012 03:46:12 GMT", "status": "200 OK", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "349" }
Missed by 4 min!
Strategies
I have no exotic strategies, so far!
1. Obvious: Track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism, for example AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Pl share your tips and tricks for conserving the Rate Limit
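Strategy 1 (track & sleep) falls out of the rate-limit headers shown earlier. A minimal sketch, assuming the response exposes a dict of headers; the function name is mine:

```python
import time

def seconds_until_reset(headers, now=None):
    """How long to sleep once x-ratelimit-remaining hits 0.

    headers: dict containing x-ratelimit-remaining and
    x-ratelimit-reset (epoch seconds, per the header dumps above).
    """
    if int(headers.get("x-ratelimit-remaining", 1)) > 0:
        return 0
    now = time.time() if now is None else now
    return max(0, int(headers["x-ratelimit-reset"]) - now)
```

Calling `time.sleep(seconds_until_reset(r.headers))` after each request implements the obvious strategy without guessing the window.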
Authentication
Three modes:
o Anonymous
o HTTP Basic Auth
o OAuth
As of Aug 31, 2010, only Anonymous or OAuth are supported
OAuth enables the user to authorize an application without sharing credentials
o Also has the ability to revoke
Twitter supports OAuth 1.0a
OAuth 2.0 is the new standard, much simpler
o No timeframe for Twitter support, yet
OAuth Pragmatics
Helpful Links
o https://dev.twitter.com/docs/auth/oauth
o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
Discussion on OAuth internal mechanisms is better left for another day
For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
Create an application & get four credential pieces
o Consumer Key, Consumer Secret, Access Token & Access Token Secret
All the frameworks have support for OAuth. So plug in these values & use the framework's calls
I used the requests-oauth library like so:
requests-oauth

def get_oauth_client():
    # Get a client using the token, key & secret from dev.twitter.com/apps
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret,
                           consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)
    return r

def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)
    return r

Ref: http://pypi.python.org/pypi/requests-oauth
404 Not Found
406 Not Acceptable
413 Too Long
416 Range Unacceptable
420 Enhance Your Calm
o Rate Limited
500 Internal Server Error
502 Bad Gateway
o Down for maintenance
503 Service Unavailable
o Overloaded: the Fail whale
https://dev.twitter.com/docs/error-codes-responses
Objects
o Users: Friends (Follow), Followers (Are Followed By)
o Tweets: Status Update; Entities (urls, media, hashtags); Places; embed
o TimeLine: Tweets, Temporally Ordered
https://dev.twitter.com/docs/platform-objects
Tweets
A.k.a Status Updates
Interesting fields:
o coordinates <- geo location
o created_at
o entities (will see later)
o id, id_str
o possibly_sensitive
o user (will see later)
o perspectival attributes: embedded within a child object of an unlike parent; hard to maintain at scale https://dev.twitter.com/docs/faq#6981
o withheld_in_countries https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
Tweets - example
Let us run oscon2012-tweets.py
Example of tweet:
o coordinates
o id
o id_str
Users
o followers_count
o geo_enabled
o id, id_str
o name, screen_name
o protected
o status, statuses_count
o withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users
Can use this information for customizing the user's screen in your web app
Entities
Metadata & Contextual Information
You can parse them yourself, but Entities parse them out as structured data
o REST API/Search API: include_entities=1
o Streaming API: included by default
o hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
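With include_entities=1 the hashtags, urls and mentions arrive pre-parsed inside the tweet JSON. Extracting them looks roughly like this; the sample tweet in the usage note is made up, and real entity dicts carry more keys (indices, display_url, etc.) than shown.

```python
def extract_entities(tweet):
    """Pull hashtag texts, expanded urls and mentioned screen_names
    from a tweet's entities block (empty lists if absent)."""
    ent = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in ent.get("hashtags", [])],
        "urls": [u.get("expanded_url") for u in ent.get("urls", [])],
        "mentions": [m["screen_name"] for m in ent.get("user_mentions", [])],
    }
```

This is the structured-data payoff: no regex parsing of the tweet text itself.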
Entities
Run
o oscon2012_entities.py
Places
o attributes
o bounding_box
o id (as a string!)
o country
o name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
Places
Can search for tweets near a place, like so:
o Get the latlong of the convention center [45.52929,-122.66289]
o Tweets near that place
o Tweets near San Jose [37.395715,-122.102308]
We will not see further here. But very useful
Timelines
Collections of tweets ordered by time
Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
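The max_id walk from the working-with-timelines doc can be sketched as below. This is an illustration of the navigation pattern only: `fetch` is a hypothetical callable standing in for the actual API request, and real code would also thread since_id forward on refresh.

```python
def walk_timeline(fetch, since_id=None):
    """Page backwards through a timeline using max_id.

    fetch(max_id, since_id) -> list of tweets (dicts with 'id'),
    newest first; returns [] when exhausted.
    """
    tweets, max_id = [], None
    while True:
        page = fetch(max_id, since_id)
        if not page:
            return tweets
        tweets.extend(page)
        max_id = page[-1]["id"] - 1  # step below the oldest id seen so far
```

Subtracting 1 from the oldest id avoids re-fetching the boundary tweet on the next page.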
Lookup users by screen_name - oscon12_users.py
Lookup users by id - oscon12_first_20_ids.py
Lookup tweets - oscon12_tweets.py
Get entities - oscon12_entities.py
Twitter APIs
Twitter Trends
oscon2012-trends.py
Trends/weekly, Trends/monthly
Let us run some examples:
o oscon2012_trends_daily.py
o oscon2012_trends_weekly.py
o Growth of a brand w.r.t the industry
o A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
o Google I/O showed a spike on 6/27-6/28; OReillyMedia shares some spike
o Looking at a few days' worth of data, our best inference is that oscon doesn't track with googleio; Clouderati doesn't track at all
o NBA, MiamiHeat, okcthunder track each other. Used % rather than absolute numbers to compare. The hike from 7/6 to 7/10 is interesting
Search API
Very simple
o GET http://search.twitter.com/search.json?q=<blah>
o Based on a search criteria
"The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
o Recent = last 6-9 days' worth of tweets
o Anonymous call
o Rate Limit: not no. of calls/hour, but Complexity & Frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
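Building the encoded query string (the %40 / %23 escapes discussed on the next slide) is a one-liner with the standard library; the helper name here is mine, not from the tutorial scripts.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode       # Python 2, the tutorial's vintage

def search_url(query, **params):
    """Build a Search API URL; @ becomes %40 and # becomes %23."""
    params["q"] = query
    return "http://search.twitter.com/search.json?" + urlencode(params)
```

So `search_url("#oscon @clouderati", rpp=100)` yields a properly escaped GET URL without hand-encoding.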
Search API
Filters
o Search URL encoded: @ = %40, # = %23
o emoticons :) and :(
o http://search.twitter.com/search.atom?q=sometimes+%3A)
o http://search.twitter.com/search.atom?q=sometimes+%3A(
Streaming API
Not request-response, but a stream
Twitter frameworks have the support
Rate Limit: Upto 1%
Stall warning if the client is falling behind
Good documentation links:
o https://dev.twitter.com/docs/streaming-apis/connecting
o https://dev.twitter.com/docs/streaming-apis/parameters
o https://dev.twitter.com/docs/streaming-apis/processing
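The stream delivers newline-delimited JSON, with blank keep-alive lines and stall warnings mixed in with tweets. A sketch of the per-line dispatch (the exact message shapes here are illustrative; see the processing doc linked above for the full set):

```python
import json

def classify_stream_line(line):
    """Classify one line from a streaming connection:
    '' keep-alive, a stall warning, or a tweet/other message."""
    if not line.strip():
        return ("keepalive", None)
    msg = json.loads(line)
    if "warning" in msg:   # stall warning: the client is falling behind
        return ("warning", msg["warning"])
    return ("message", msg)
```

A real consumer would loop over `response.iter_lines()` and act on the warning (drain faster, or reconnect) before Twitter drops the connection.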
Firehose
~400 million public tweets/day
If you are working with the Twitter firehose, I envy you!
If you hit real limits, then explore the firehose route
o AFAIK, it is not cheap, but worth it
4. Cache as much as you can
5. Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the Reference section, at the end of this presentation
These tips are gathered from various books, blogs & other media I used for this tutorial. See Reference (at the end) for the sources
Questions?
Part II: SNA
1. Collect
2. Store
Trivia
Social Network Analysis originated as Sociometry & the social network was called a sociogram
Back then, Facebook was called SocioBinder!
Jacob Levy Moreno is considered the originator
o NYTimes, April 3, 1933, p. 17
Twitter Networks - Definitions
Nodes
o Users
o #tags
Edges (directed)
o Follows
o Friends
o @mentions
o #tags
Twitter Networks - Definitions
In-degree
o Followers
Out-degree
o Friends/Follow
Twitter Networks - Properties
Concepts from Citation Networks (illustrated with an example graph of nodes A-N):
o Cocitation: common papers that cite a paper ~ Common Followers
  o C & G (followed by F & H)
o Bibliographic Coupling: cite the same papers ~ Common Friends (i.e. follow the same person)
  o D, E, F & H follow C
  o H & F follow C & G, so H & F have high coupling
  o Hence, if H follows A, we can recommend F to follow A
Twitter Networks - Properties
Bipartite/Affiliation Networks
o Two disjoint subsets
o The bipartite concept is very relevant to the Twitter social graph
o Membership in Lists: lists vs. users bipartite graph
o Common #Tags in Tweets: #tags vs. members bipartite graph
o @mention together? Can this be a bipartite graph? How would we fold this?
Network Diameter, Weak Ties
Follower velocity (+ve & -ve), Association strength
o Unfollow is not a reliable measure
o But an interesting property to investigate when it happens
o Easy to build a Gn,p random graph, which assumes equal likelihood of edges between two nodes
o In a Twitter social network, we can create a more realistic expected distribution (adding the social reality dimension) by inspecting the #tags & @mentions
Twitter Networks - Properties
Be cognizant of the above when you apply traditional network properties to Twitter. For example:
o Six degrees of separation doesn't make sense (most of the time) in Twitter, except maybe for Cliques
o Is diameter a reliable measure for a Twitter Network? Probably not
o Do cut sets make sense? Probably not
o But citation network principles do apply; we can learn from cliques
o Bipartite graphs do make sense
Cliques (1 of 2)
A maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other
o A cohesive subgroup, closely connected
Near-cliques, rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
o k-plex cliques help discover subgroups in a sparse network; a 1-plex is the perfect clique
Cliques (2 of 2)
k-core: connected to at least k others in the subset; an (n-k)-plex
k-clique: no more than k distance away
o Path can be inside or outside the subset
o k-clan or k-club: path inside the subset
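networkx's `find_cliques` enumerates exactly these maximal cliques; for intuition, here is a minimal pure-Python Bron-Kerbosch sketch of the same idea (no pivoting, so it is for small graphs only, nothing like the ~980,000-user graph later in the deck):

```python
def maximal_cliques(adj):
    """Bron-Kerbosch: all maximal cliques of an undirected graph.
    adj: {node: set of neighbors}."""
    cliques = []
    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed nodes
        if not p and not x:
            cliques.append(r)   # r cannot be extended: maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.discard(v)
            x.add(v)
    expand(set(), set(adj), set())
    return cliques
```

On a triangle 1-2-3 with a pendant edge 3-4, it finds exactly {1,2,3} and {3,4}.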
Sentiment Analysis
Sentiment Analysis is important & interesting work on the Twitter platform
o Collect Tweets
o Opinion Estimation: pass thru a Classifier, Sentiment Lexicons
o Naive Bayes / Max Entropy Classifier / SVM
Sentiment Analysis
Twitter Mining for Airline Sentiment
o Opinion Lexicon: +ve 2000, -ve 4800
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
Project Ideas
7. Retweet Network
Twitter Analysis Pipeline Story Board: Stages, Strategies, APIs & Tasks
Stage 3
o Get distinct user list, applying the set(union(list)) operation
Stage 4
o Get & store user details (distinct user list)
o Unroll
o Note: Needed a command buffer to manage scale (~980,000 users)
o Note: The unroll stage took time & missteps
Stage 5
o For each @clouderati follower, find the friend = follower intersection set
Stage 6
o Create the social graph
o Apply network theory
o Infer cliques, ties & other properties
o Challenges: Nothing fancy; get the record and store
o Would have had to recurse through a REST cursor if there were more than 5000 followers; @clouderati has 2072 followers
Stage 2
o Crawl 1 level deep
o Get friends & followers
o Validate, re-crawl & refresh
o Challenges:
  o Set operation between two mongo collections for the restart buffer
  o Protected users; some had 0 followers, or 0 friends
  o Interesting operations for validate, re-crawl and refresh
o Interesting Points:
  o Added status_code to differentiate protected users {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
  o Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
Validate-Recrawl-Refresh Logs
o pymongo version = 2.2; Connected to DB!
o 1st run: 132 bad records
o This is the classic Erlang-style supervisor: the crawl continues on transport errors without worrying about retry; Validate will recrawl & refresh as needed

Error Friends : <type 'exceptions.KeyError'>
43cd40e5557c00c7000000 - none has 2072 followers & 0 friends
Error Friends : <type 'exceptions.KeyError'>
43a958e5557cfc58000000 - none has 2072 followers & 0 friends
Error Friends : <type 'exceptions.KeyError'>
43ccdee5557c00b6000000 - none has 2072 followers & 0 friends
43d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends
43d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends
43d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends
43d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends
43d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends
4475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends
4475d3e5557c165700007d - 327286780 has 0 followers & 0 friends
Looks like we have 132 not so good records
Elapsed Time = 0.546846
o Challenges:
  o Figure out the right Set operations
o Interesting Points:
  o 973,323 unique users!
  o Recursively apply set union over 400,000 lists
  o Set operations took slightly more than a minute
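The distinct-user step above (set union over the collected follower lists) is plain Python; a minimal sketch, with a function name of my own:

```python
def distinct_users(follower_lists):
    """Union follower-id lists into one distinct-user set.

    Union is commutative & associative, so this step is
    trivially data-parallel (partition the lists, union the partials).
    """
    distinct = set()
    for lst in follower_lists:
        distinct |= set(lst)   # the set(union(list)) operation per list
    return distinct
```

The commutativity/associativity is what makes this stage a natural candidate for the MapReduce-style parallelism mentioned in the tips.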
o Challenges:
  o Where do I start? In the next few slides
  o Took me a few days to get it right (along with my daily job!)
  o Unfortunately I did not employ parallelism & didn't use my MacPro with 32 GB memory. So the runs were long
  o But learned hard lessons on checkpoint & restart
o Interesting Points:
  o Tracking Control Numbers
  o Time: marathon unroll run, 19:33:33!
Control Numbers
> db.t_users_info.count()
8122
> db.api_str.count({"done":0,"seq_no":{"$lt":8185}})
63
> db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1})
{ "_id" : ObjectId("44daeae5557c28bf001d53"), "seq_no" : 5433 }
{ "_id" : ObjectId("44daeae5557c28bf001d59"), "seq_no" : 5439 }
{ "_id" : ObjectId("44daeae5557c28bf001d5f"), "seq_no" : 5445 }
{ "_id" : ObjectId("44daebe5557c28bf001d74"), "seq_no" : 5466 }
{ "_id" : ObjectId("44daece5557c28bf001d7a"), "seq_no" : 5472 }
{ "_id" : ObjectId("44daece5557c28bf001d80"), "seq_no" : 5478 }
{ "_id" : ObjectId("44daede5557c28bf001d90"), "seq_no" : 5494 }
{ "_id" : ObjectId("44daefe5557c28bf001daf"), "seq_no" : 5525 }
o The collection should have 8185 documents, but it has only 8122. Where did the rest go?
o 63 of them still have done=0; 8122 + 63 = 8185!
o Aha, mystery solved. They fell through the cracks. Need a catch-all final run
oscon2012_find_strong_ties_01.py
oscon2012_social_graph_stats_01.py
o Challenges:
  o None. Python set operations made this easy
  o Was getting an invalid cursor error from MongoDB, so had to do the updates in two steps
o Interesting Points:
  o Even at this scale, a single machine is not enough; should have tried data parallelism
  o This task is well suited to leverage data parallelism, as it is commutative & associative
  o Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}
  o Strong ties show a more cohesive & stronger social graph, e.g. Krishnan 15% friends-followers, Samj 33%
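A clique-size distribution like the one shown above can be computed from a clique list (e.g. the output of networkx's find_cliques) with a Counter; a small sketch:

```python
from collections import Counter

def clique_distribution(cliques):
    """Map clique size -> count, like the distribution shown above."""
    return dict(Counter(len(c) for c in cliques))
```

Feeding it the @clouderati maximal cliques would reproduce the {size: count} dict on this slide.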
I had a good time researching & preparing for this Tutorial. I hope you learned a few new things & have a few vectors to follow