http://www.oscon.com/oscon2012/public/schedule/detail/23130
Update [Aug 5, 2012]: I have a set of blogs annotating these slides. I have tried to add more details which were covered at the Tutorial. Let me know if you need more explanations; I will update the blog:
http://doubleclix.wordpress.com/2012/08/04/big-data-with-twitter-api-twitter-tips-a-bakers-dozen/
Intro: API, Objects
o House Rules (1 of 2)
o Doesn't assume any knowledge of the Twitter API
o Goal: Everybody on the same page & a working knowledge of the Twitter API
o To bootstrap your exploration into Social Network Analysis & Twitter
o Simple programs, to illustrate usage & data manipulation
We will analyze @clouderati: 2072 followers, exploding to ~980,000 distinct users down one level
@mention network
Retweet analytics, information contagion
Intro: API, Objects
o House Rules (2 of 2)
o Am using the requests library
o There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
o Many areas to explore, not enough time. So decided to focus on the social graph, cliques & networkx
About Me
Director of Engineering & Data Science / Data Scientist / AWS Ops Guy at genophen.com
o Co-chair, 2012 IEEE Precision Time Synchronization http://www.ispcs.org/2012/index.html
o Blog: http://doubleclix.wordpress.com/
o Quora: http://www.quora.com/Krishna-Sankar
Prior Gigs
o Lead Architect (Egnyte), Distinguished Engineer (CSCO)
o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
Current Focus
o Design, build & ops of BioInformatics/Consumer infrastructure on AWS, MongoDB, Solr, Drupal, GitHub
o Big Data (more of variety, variability, context & graphs, than volume or velocity so far!)
o Overlay based semantic search & ranking http://goo.gl/P1rhc
o Big Data Engineering Top 10 Pragmatics (Summary) http://goo.gl/0SQDV
o The Art of Big Data (Detailed) http://goo.gl/EaUKH
o The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial
3. / 4. Twitter APIs are very powerful; consistent use can bear huge data
5. Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
o That way you can orthogonally extend, with functional components like command buffers, validation et al
6. Crawl-Store-Validate-Recrawl-Refresh cycle
o The equivalent of the traditional ETL
o Validation stage & validation routines are important
o Cannot expect perfect runs
o Cannot manually look at data either, when data is at scale
7. Compose your big data pipeline with well defined granular functions, each doing only one thing
o Don't overload the functional components (i.e. no collect, unroll & store as a single component)
o Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
o This did create some trouble for me, as we will see later
8. Keep control numbers through the pipeline
o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs! There would be a separate printout of the control numbers, kept in the operations files
o ... more so for a REST-based Big Data analytics system
o Expect failures at the transport layer & accommodate for them
o Wait a few seconds between successive calls. This will end up with a scalable system, eventually
o I found 10 seconds to be the sweet spot. 5 seconds gave a retry error; was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in!
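The trap-and-wait pattern above can be sketched generically. This is a minimal illustration, not the tutorial's actual code; the function name and retry counts are my own, and in practice you would catch only transport errors, not every exception.

```python
import time

def call_with_retry(fn, retries=3, wait_seconds=10):
    """Call fn(); on failure, wait and retry (trap-and-wait pattern).

    wait_seconds=10 reflects the sweet spot noted above.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as e:  # in real code: catch transport errors only
            last_error = e
            time.sleep(wait_seconds)
    raise last_error
```

A fetch that fails twice and then succeeds completes on the third attempt.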
14. Always measure the elapsed time of your API runs & processing
o Kind of an early warning when something is wrong
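Elapsed-time measurement as an early-warning signal (tip 14) can be wrapped in a small context manager; this is a sketch of my own, with a hypothetical `warn_after` threshold, not code from the tutorial scripts.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, warn_after=30.0, log=print):
    """Log the elapsed time of a block; flag runs slower than warn_after seconds."""
    start = time.time()
    yield
    elapsed = time.time() - start
    flag = " SLOW?" if elapsed > warn_after else ""
    log("%s: %.3f s%s" % (label, elapsed, flag))
```

Usage: `with timed("get_followers batch"): ...` around each API run, so a slow crawl stands out in the logs.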
15. Develop incrementally; don't fail to check cut & paste errors
16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce
o But first: prototype as a linear system, optimize and tweak the functional modules & cache strategies, note down stages and tasks that can be parallelized, and then parallelize them
o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Transform data between stages as needed
o Stages might require transformation; for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform; i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array to separate documents
o Add transformation as a granular function, of course with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
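The unroll/flatten stage described above (arrays at collect time, one document per user for the model) might look like this; the field names `user_id` and `followers` are hypothetical, not taken from the tutorial's schema.

```python
def unroll(user_doc):
    """Flatten a collect-stage document {user_id, followers: [...]}
    into one document per follower, for downstream aggregation."""
    return [
        {"user_id": user_doc["user_id"], "follower_id": fid}
        for fid in user_doc.get("followers", [])
    ]
```

Keeping this as its own granular stage (rather than folding it into collect) is exactly the separation tip 17 argues for.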
Twitter Gripes
1. Need more rich APIs for #tags
o Somewhat similar to users, viz. followers, friends et al
o Might make sense to make #tags a top level object with its own semantics
2. Returns 400 Bad Request instead of 420 when rate limited
o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id
4. following vs. friends_count, i.e. following is a dummy variable
o There are a few like this, most probably for backward compatibility
5. Gives 404 Not Found instead of 406 Not Acceptable or 413 Too Long or 416 Range Unacceptable
o Granted, it is more growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
A Fork
4. Sentiment Analysis: deep into Tweets, NLP & NLTK
".. we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" - Michael
"The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." - Chenda, CBS News
My Wish & Hope
I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
I did like the fact that tweets were part of LinkedIn. I still used Twitter more than LinkedIn
o I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
o The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
I sincerely hope that the platform grows with a rich developer ecosystem
o An orthogonally extensible platform is essential
o Of course, along with a congruent user experience: a core Twitter consumption experience through consistent tools
Setup
By looking at the domain in aggregate to derive inferences & actionable recommendations
Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
http://www.buzzfeed.com/mattbuchanan/the-first-30-tweets-ever
http://frankandernest.com/cgi/view/display.pl?111-02-24
Agenda
I.
II. Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
o Underlying Concepts
o Social Graph Analysis of @clouderati
o Stages, Strategies & Tasks
o Code Walk thru
https://dev.twitter.com/status
http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
Run
o oscon2012_rate_limit_status.py
o Use http://www.epochconverter.com to check reset_time
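epochconverter.com is handy interactively; in code, the `x-ratelimit-reset` epoch can be rendered with the standard library. A minimal sketch (the helper name is mine):

```python
import time

def reset_time_utc(reset_epoch):
    """Render the x-ratelimit-reset header (epoch seconds) as a UTC timestamp."""
    return time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime(int(reset_epoch)))
```

For example, the reset value 1341366831 seen in the headers later in these slides lands about an hour after the request's Jul 4 00:54 GMT date header.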
Twitter API
o REST
  o Twitter REST: Core Data, Core Twitter Objects; Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet; Rate Limit: 150/350
  o Twitter Search: Keywords, Specific User, Trends; Rate Limit: Complexity & Frequency
o Streaming: Near-realtime, High Volume
  o Public Streams, User Streams, Site Streams, Firehose
Rate Limits
By API type & Authentication Mode
o REST: 150/hr (no authC), 350/hr (authC); rate-limit error: 400
o Search: Complexity & Frequency based; rate-limit error: 420
o Streaming: Upto 1% (of the firehose)
o Firehose: -N/A- (none)
Un-authenticated call (150 calls; last time it gave me 5 min, now the reset timer is 1 hour):
{ "date": "Wed, 04 Jul 2012 00:54:16 GMT", "status": "200 OK", "vary": "Accept-Encoding", "x-frame-options": "SAMEORIGIN", "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8", "x-ratelimit-class": "api", "x-ratelimit-limit": "150", "x-ratelimit-remaining": "147", "x-ratelimit-reset": "1341366831", "x-runtime": "0.02768", "x-transaction": "f1bafd60112dddeb", "x-transaction-mask": "a6183a5f8ca9431b53b5644ef11417281dbc" }
And the Rate Limit kicked in:
{ "cache-control": "no-cache, max-age=300", "content-encoding": "gzip", "content-type": "application/json; charset=utf-8", "date": "Wed, 04 Jul 2012 00:55:04 GMT", "status": "400 Bad Request", "transfer-encoding": "chunked", "vary": "Accept-Encoding", "x-ratelimit-class": "api", "x-ratelimit-limit": "150", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341366831", "x-runtime": "0.01342" }
Authenticated calls; note the reset moving +1 hour between consecutive requests:
******** 2416
{ "date": "Thu, 05 Jul 2012 14:56:05 GMT", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "133", "x-ratelimit-reset": "1341500165" }
******** 2417
{ "date": "Thu, 05 Jul 2012 14:56:18 GMT", "status": "200 OK", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "349", "x-ratelimit-reset": "1341503776" }
Unexplained Errors
Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C... (a long comma-separated list of user_ids)
o While trying to get details of 1,000,000 users, I get this error, usually 10-6 AM PST
o Got around by Trap & wait 5 seconds
o Night runs are relatively error free
{ "date": "Fri, 06 Jul 2012 03:41:09 GMT", "expires": "Fri, 06 Jul 2012 03:46:09 GMT", "server": "tfe", "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT", "status": "400 Bad Request", "vary": "Accept-Encoding", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341546334", "x-runtime": "0.01918" }
Error, sleeping
{ "date": "Fri, 06 Jul 2012 03:46:12 GMT", "status": "200 OK", "x-ratelimit-class": "api_identified", "x-ratelimit-limit": "350", "x-ratelimit-remaining": "349" }
Missed by 4 min!
Strategies
I have no exotic strategies, so far!
1. Obvious: Track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism, for example AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Pl share your tips and tricks for conserving the Rate Limit
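Strategy 1 (track & sleep) falls out of the rate-limit headers shown earlier. A minimal sketch, assuming the response exposes a dict of headers; the function name is mine:

```python
import time

def seconds_until_reset(headers, now=None):
    """How long to sleep once x-ratelimit-remaining hits 0.

    headers: dict containing x-ratelimit-remaining and
    x-ratelimit-reset (epoch seconds, per the header dumps above).
    """
    if int(headers.get("x-ratelimit-remaining", 1)) > 0:
        return 0
    now = time.time() if now is None else now
    return max(0, int(headers["x-ratelimit-reset"]) - now)
```

Calling `time.sleep(seconds_until_reset(r.headers))` after each request implements the obvious strategy without guessing the window.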
Authentication
Three modes:
o Anonymous
o HTTP Basic Auth
o OAuth
As of Aug 31, 2010, only Anonymous or OAuth are supported
OAuth enables the user to authorize an application without sharing credentials
o Also has the ability to revoke
Twitter supports OAuth 1.0a
OAuth 2.0 is the new standard, much simpler
o No timeframe for Twitter support, yet
OAuth Pragmatics
Helpful Links
o https://dev.twitter.com/docs/auth/oauth
o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
Discussion on OAuth internal mechanisms is better left for another day
For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
Create an application & get four credential pieces
o Consumer Key, Consumer Secret, Access Token & Access Token Secret
All the frameworks have support for OAuth. So plug in these values & use the framework's calls
I used the requests-oauth library like so:
requests-oauth

def get_oauth_client():
    # Get a client using the token, key & secret from dev.twitter.com/apps
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret,
                           consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)
    return r

def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}
    # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)
    return r

Ref: http://pypi.python.org/pypi/requests-oauth
404 Not Found
406 Not Acceptable
413 Too Long
416 Range Unacceptable
420 Enhance Your Calm
o Rate Limited
500 Internal Server Error
502 Bad Gateway
o Down for maintenance
503 Service Unavailable
o Overloaded: the Fail whale
https://dev.twitter.com/docs/error-codes-responses
Objects
o Users: Friends (Follow), Followers (Are Followed By)
o Tweets: Status Update; Entities (urls, media, hashtags); Places; embed
o TimeLine: Tweets, Temporally Ordered
https://dev.twitter.com/docs/platform-objects
Tweets
A.k.a Status Updates
Interesting fields:
o coordinates <- geo location
o created_at
o entities (will see later)
o id, id_str
o possibly_sensitive
o user (will see later)
o perspectival attributes: embedded within a child object of an unlike parent; hard to maintain at scale https://dev.twitter.com/docs/faq#6981
o withheld_in_countries https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
Tweets - example
Let us run oscon2012-tweets.py
Example of tweet:
o coordinates
o id
o id_str
Users
o followers_count
o geo_enabled
o id, id_str
o name, screen_name
o protected
o status, statuses_count
o withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users
Can use this information for customizing the user's screen in your web app
Entities
Metadata & Contextual Information
You can parse them yourself, but Entities parse them out as structured data
o REST API/Search API: include_entities=1
o Streaming API: included by default
o hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
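With include_entities=1 the hashtags, urls and mentions arrive pre-parsed inside the tweet JSON. Extracting them looks roughly like this; the sample tweet in the usage note is made up, and real entity dicts carry more keys (indices, display_url, etc.) than shown.

```python
def extract_entities(tweet):
    """Pull hashtag texts, expanded urls and mentioned screen_names
    from a tweet's entities block (empty lists if absent)."""
    ent = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in ent.get("hashtags", [])],
        "urls": [u.get("expanded_url") for u in ent.get("urls", [])],
        "mentions": [m["screen_name"] for m in ent.get("user_mentions", [])],
    }
```

This is the structured-data payoff: no regex parsing of the tweet text itself.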
Entities
Run
o oscon2012_entities.py
Places
o attributes
o bounding_box
o id (as a string!)
o country
o name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
Places
Can search for tweets near a place, like so:
o Get the latlong of the convention center [45.52929,-122.66289]
o Tweets near that place
o Tweets near San Jose [37.395715,-122.102308]
We will not see further here. But very useful
Timelines
Collections of tweets ordered by time
Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
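The max_id walk from the working-with-timelines doc can be sketched as below. This is an illustration of the navigation pattern only: `fetch` is a hypothetical callable standing in for the actual API request, and real code would also thread since_id forward on refresh.

```python
def walk_timeline(fetch, since_id=None):
    """Page backwards through a timeline using max_id.

    fetch(max_id, since_id) -> list of tweets (dicts with 'id'),
    newest first; returns [] when exhausted.
    """
    tweets, max_id = [], None
    while True:
        page = fetch(max_id, since_id)
        if not page:
            return tweets
        tweets.extend(page)
        max_id = page[-1]["id"] - 1  # step below the oldest id seen so far
```

Subtracting 1 from the oldest id avoids re-fetching the boundary tweet on the next page.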
Lookup users by screen_name - oscon12_users.py
Lookup users by id - oscon12_first_20_ids.py
Lookup tweets - oscon12_tweets.py
Get entities - oscon12_entities.py
Twitter APIs
Twitter Trends
oscon2012-trends.py
Trends/weekly, Trends/monthly
Let us run some examples:
o oscon2012_trends_daily.py
o oscon2012_trends_weekly.py
o Growth of a brand w.r.t the industry
o A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
o Google I/O showed a spike on 6/27-6/28; OReillyMedia shares some spike
o Looking at a few days' worth of data, our best inference is that oscon doesn't track with googleio; Clouderati doesn't track at all
o NBA, MiamiHeat, okcthunder track each other. Used % rather than absolute numbers to compare. The hike from 7/6 to 7/10 is interesting
Search API
Very simple
o GET http://search.twitter.com/search.json?q=<blah>
o Based on a search criteria
"The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
o Recent = last 6-9 days' worth of tweets
o Anonymous call
o Rate Limit: not no. of calls/hour, but Complexity & Frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
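Building the encoded query string (the %40 / %23 escapes discussed on the next slide) is a one-liner with the standard library; the helper name here is mine, not from the tutorial scripts.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode       # Python 2, the tutorial's vintage

def search_url(query, **params):
    """Build a Search API URL; @ becomes %40 and # becomes %23."""
    params["q"] = query
    return "http://search.twitter.com/search.json?" + urlencode(params)
```

So `search_url("#oscon @clouderati", rpp=100)` yields a properly escaped GET URL without hand-encoding.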
Search API
Filters
o Search URL encoded: @ = %40, # = %23
o emoticons :) and :(
o http://search.twitter.com/search.atom?q=sometimes+%3A)
o http://search.twitter.com/search.atom?q=sometimes+%3A(
Streaming API
Not request-response, but a stream
Twitter frameworks have the support
Rate Limit: Upto 1%
Stall warning if the client is falling behind
Good documentation links:
o https://dev.twitter.com/docs/streaming-apis/connecting
o https://dev.twitter.com/docs/streaming-apis/parameters
o https://dev.twitter.com/docs/streaming-apis/processing
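The stream delivers newline-delimited JSON, with blank keep-alive lines and stall warnings mixed in with tweets. A sketch of the per-line dispatch (the exact message shapes here are illustrative; see the processing doc linked above for the full set):

```python
import json

def classify_stream_line(line):
    """Classify one line from a streaming connection:
    '' keep-alive, a stall warning, or a tweet/other message."""
    if not line.strip():
        return ("keepalive", None)
    msg = json.loads(line)
    if "warning" in msg:   # stall warning: the client is falling behind
        return ("warning", msg["warning"])
    return ("message", msg)
```

A real consumer would loop over `response.iter_lines()` and act on the warning (drain faster, or reconnect) before Twitter drops the connection.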
Firehose
~400 million public tweets/day
If you are working with the Twitter firehose, I envy you!
If you hit real limits, then explore the firehose route
o AFAIK, it is not cheap, but worth it
4. Cache as much as you can
5. Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the Reference section, at the end of this presentation
These tips are gathered from various books, blogs & other media I used for this tutorial. See Reference (at the end) for the sources
Questions?
Part II: SNA
1. Collect
2. Store
Trivia
Social Network Analysis originated as Sociometry & the social network was called a sociogram
Back then, Facebook was called SocioBinder!
Jacob Levy Moreno is considered the originator
o NYTimes, April 3, 1933, p. 17
Twitter Networks - Definitions
Nodes
o Users
o #tags
Edges (directed)
o Follows
o Friends
o @mentions
o #tags
Twitter Networks - Definitions
In-degree
o Followers
Out-degree
o Friends/Follow
Twitter Networks - Properties
Concepts from Citation Networks (illustrated with an example graph of nodes A-N):
o Cocitation: common papers that cite a paper ~ Common Followers
  o C & G (followed by F & H)
o Bibliographic Coupling: cite the same papers ~ Common Friends (i.e. follow the same person)
  o D, E, F & H follow C
  o H & F follow C & G, so H & F have high coupling
  o Hence, if H follows A, we can recommend F to follow A
Twitter Networks - Properties
Bipartite/Affiliation Networks
o Two disjoint subsets
o The bipartite concept is very relevant to the Twitter social graph
o Membership in Lists: lists vs. users bipartite graph
o Common #Tags in Tweets: #tags vs. members bipartite graph
o @mention together? Can this be a bipartite graph? How would we fold this?
Network Diameter, Weak Ties
Follower velocity (+ve & -ve), Association strength
o Unfollow is not a reliable measure
o But an interesting property to investigate when it happens
o Easy to build a Gn,p random graph, which assumes equal likelihood of edges between two nodes
o In a Twitter social network, we can create a more realistic expected distribution (adding the social reality dimension) by inspecting the #tags & @mentions
Twitter Networks - Properties
Be cognizant of the above when you apply traditional network properties to Twitter. For example:
o Six degrees of separation doesn't make sense (most of the time) in Twitter, except maybe for Cliques
o Is diameter a reliable measure for a Twitter Network? Probably not
o Do cut sets make sense? Probably not
o But citation network principles do apply; we can learn from cliques
o Bipartite graphs do make sense
Cliques (1 of 2)
A maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other
o A cohesive subgroup, closely connected
Near-cliques, rather than a perfect clique (k-plex, i.e. connected to at least n-k others)
o k-plex cliques help discover subgroups in a sparse network; a 1-plex is the perfect clique
Cliques (2 of 2)
k-core: connected to at least k others in the subset; an (n-k)-plex
k-clique: no more than k distance away
o Path can be inside or outside the subset
o k-clan or k-club: path inside the subset
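networkx's `find_cliques` enumerates exactly these maximal cliques; for intuition, here is a minimal pure-Python Bron-Kerbosch sketch of the same idea (no pivoting, so it is for small graphs only, nothing like the ~980,000-user graph later in the deck):

```python
def maximal_cliques(adj):
    """Bron-Kerbosch: all maximal cliques of an undirected graph.
    adj: {node: set of neighbors}."""
    cliques = []
    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed nodes
        if not p and not x:
            cliques.append(r)   # r cannot be extended: maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.discard(v)
            x.add(v)
    expand(set(), set(adj), set())
    return cliques
```

On a triangle 1-2-3 with a pendant edge 3-4, it finds exactly {1,2,3} and {3,4}.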
Sentiment Analysis
Sentiment Analysis is important & interesting work on the Twitter platform
o Collect Tweets
o Opinion Estimation: pass thru a Classifier, Sentiment Lexicons
o Naive Bayes / Max Entropy Classifier / SVM
Sentiment Analysis
Twitter Mining for Airline Sentiment
o Opinion Lexicon: +ve 2000, -ve 4800
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf
Project Ideas
7. Retweet Network
Twitter Analysis Pipeline Story Board: Stages, Strategies, APIs & Tasks
Stage 3
o Get distinct user list, applying the set(union(list)) operation
Stage 4
o Get & store user details (distinct user list)
o Unroll
o Note: Needed a command buffer to manage scale (~980,000 users)
o Note: The unroll stage took time & missteps
Stage 5
o For each @clouderati follower, find the friend = follower intersection set
Stage 6
o Create the social graph
o Apply network theory
o Infer cliques, ties & other properties
o Challenges: Nothing fancy; get the record and store
o Would have had to recurse through a REST cursor if there were more than 5000 followers; @clouderati has 2072 followers
Stage 2
o Crawl 1 level deep
o Get friends & followers
o Validate, re-crawl & refresh
o Challenges:
  o Set operation between two mongo collections for the restart buffer
  o Protected users; some had 0 followers, or 0 friends
  o Interesting operations for validate, re-crawl and refresh
o Interesting Points:
  o Added status_code to differentiate protected users {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
  o Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
Validate-Recrawl-Refresh Logs
o pymongo version = 2.2; Connected to DB!
o 1st run: 132 bad records
o This is the classic Erlang-style supervisor: the crawl continues on transport errors without worrying about retry; Validate will recrawl & refresh as needed

Error Friends : <type 'exceptions.KeyError'>
43cd40e5557c00c7000000 - none has 2072 followers & 0 friends
Error Friends : <type 'exceptions.KeyError'>
43a958e5557cfc58000000 - none has 2072 followers & 0 friends
Error Friends : <type 'exceptions.KeyError'>
43ccdee5557c00b6000000 - none has 2072 followers & 0 friends
43d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends
43d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends
43d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends
43d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends
43d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends
4475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends
4475d3e5557c165700007d - 327286780 has 0 followers & 0 friends
Looks like we have 132 not so good records
Elapsed Time = 0.546846
o Challenges:
  o Figure out the right Set operations
o Interesting Points:
  o 973,323 unique users!
  o Recursively apply set union over 400,000 lists
  o Set operations took slightly more than a minute
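The distinct-user step above (set union over the collected follower lists) is plain Python; a minimal sketch, with a function name of my own:

```python
def distinct_users(follower_lists):
    """Union follower-id lists into one distinct-user set.

    Union is commutative & associative, so this step is
    trivially data-parallel (partition the lists, union the partials).
    """
    distinct = set()
    for lst in follower_lists:
        distinct |= set(lst)   # the set(union(list)) operation per list
    return distinct
```

The commutativity/associativity is what makes this stage a natural candidate for the MapReduce-style parallelism mentioned in the tips.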
o Challenges:
  o Where do I start? In the next few slides
  o Took me a few days to get it right (along with my daily job!)
  o Unfortunately I did not employ parallelism & didn't use my MacPro with 32 GB memory. So the runs were long
  o But learned hard lessons on checkpoint & restart
o Interesting Points:
  o Tracking Control Numbers
  o Time: marathon unroll run, 19:33:33!
Control Numbers
> db.t_users_info.count()
8122
> db.api_str.count({"done":0,"seq_no":{"$lt":8185}})
63
> db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1})
{ "_id" : ObjectId("44daeae5557c28bf001d53"), "seq_no" : 5433 }
{ "_id" : ObjectId("44daeae5557c28bf001d59"), "seq_no" : 5439 }
{ "_id" : ObjectId("44daeae5557c28bf001d5f"), "seq_no" : 5445 }
{ "_id" : ObjectId("44daebe5557c28bf001d74"), "seq_no" : 5466 }
{ "_id" : ObjectId("44daece5557c28bf001d7a"), "seq_no" : 5472 }
{ "_id" : ObjectId("44daece5557c28bf001d80"), "seq_no" : 5478 }
{ "_id" : ObjectId("44daede5557c28bf001d90"), "seq_no" : 5494 }
{ "_id" : ObjectId("44daefe5557c28bf001daf"), "seq_no" : 5525 }
o The collection should have 8185 documents, but it has only 8122. Where did the rest go?
o 63 of them still have done=0; 8122 + 63 = 8185!
o Aha, mystery solved. They fell through the cracks. Need a catch-all final run
oscon2012_find_strong_ties_01.py
oscon2012_social_graph_stats_01.py
o Challenges:
  o None. Python set operations made this easy
  o Was getting an invalid cursor error from MongoDB, so had to do the updates in two steps
o Interesting Points:
  o Even at this scale, a single machine is not enough; should have tried data parallelism
  o This task is well suited to leverage data parallelism, as it is commutative & associative
  o Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}
  o Strong ties show a more cohesive & stronger social graph, e.g. Krishnan 15% friends-followers, Samj 33%
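A clique-size distribution like the one shown above can be computed from a clique list (e.g. the output of networkx's find_cliques) with a Counter; a small sketch:

```python
from collections import Counter

def clique_distribution(cliques):
    """Map clique size -> count, like the distribution shown above."""
    return dict(Counter(len(c) for c in cliques))
```

Feeding it the @clouderati maximal cliques would reproduce the {size: count} dict on this slide.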
I had a good time researching & preparing for this Tutorial. I hope you learned a few new things & have a few vectors to follow