Twarfing: Gathering Intelligence From Twitter Data

Morton Swimmer
Senior Threat Researcher
Trend Micro, Inc.
Thursday, January 21, 2010

Overview
• What is Twitter?
• Malicious activity on Twitter
• URL shortening services
• The Twarfing architecture
• Examples
• Conclusions

Thanks
• Ben April, Trend, for providing the massive hardware
• Based in part on a Virus Bulletin presentation with Costin Raiu, Chief of R&D, Kaspersky Lab

What is Twitter
• Founded by Jack Dorsey, Biz Stone and Evan Williams back in 2006
• Publish/Subscribe communications system
• SMS
• Web site
• WS API and RSS
• Application

Twitter Internals
• 140 chars max to be SMS compatible
• SMS has a 160 char restriction
• But Twitter needed to add the user name
• Just text and metadata

Users not necessarily human!
• Devices
• From buoys to power meters
• Search for Twitter on instructables.com

Users not necessarily human!
• Not surprising that malware would use it, but
• Not good for Command and Control
• Easily blocked after detection
• … and Twitter has been trigger-happy with blocking

August 2008: Malware on Twitter

April 2009 – Twitter gets hit by XSS worm
• Multiple variants of JS.Twettir.a-h identified
• Thousands of spam messages containing the word "Mikeyy" filled the timeline
• Proof of concept – no malicious intent
• Later, the author (Mikey Mooney) got a job at exqSoft Solutions, a web security company
• September 2009 - Yet another attack
• Phishing for the login details

2009 Trending topics start being exploited

June 2009 – Koobface spreading in Twitter
• Originally only targeted Facebook and MySpace users
• Now spreads through more social networks:
• Facebook, MySpace, Hi5, Bebo, Tagged, Netlog and most recently… Twitter

Twitter architecture, historically
• Multiple Ruby on Rails servers
• Mongrel HTTP servers
• Centrally MySQL backed

Twitter architecture, now
• Details super-secret, but this is what we think
• Front end
• Ruby-based front end
• Mongrel HTTP servers
• Back end
• Starling for queuing/messaging
• Scala-based code
• MySQL
• denormalized data whenever possible
• Only for backup and persistence
• Lots of caching using memcached

Stats (June 2009)
• 25M users
• I measured 475K different users posting over a one
week period in August 2009
• 300 tweets/sec
• MySQL handles 2400 reqs per second
• API traffic == 10x website traffic!
• Indicates that far more people are using applications
• TweetDeck, Twitterrific, Digsby, Twhirl
• Many are Adobe AIR based (!)
• One key to Twitter's success!

Twitter metadata in each tweet!
<id>
<user> (user information)
<text> (tweet text)
<created_at>
<source>
<truncated>
<in_reply_to_status_id>
<in_reply_to_user_id>
<favorited>
<in_reply_to_screen_name>

Tweet text
• Retweets: RT passes the note along
• now also in metadata
• Location: L tells friends where I am
• Hashtags: #
• topics
• show group associations
• just for tagging
• User reference: @ for public discussion
• URLs

Long URLs, short URLs
• URL shortening services have grown up around Twitter
• longurl.org counts 208 different ones
• Malicious URLs are one potential threat
• obscure the true URL
• May become malicious later
• RickRolling, but maliciously
• Benefits:
• ‘bit.ly’ and others block malicious URLs

Google SB API
• August 2009
• Twitter began filtering malicious URLs
• Mikko Hyppönen:
• seemed to indicate Google SB API!
• After more testing, we discovered it used
some additional filtering

User data
<id>
<name>
<screen_name>
<location>
<favourites_count>
<description>
<utc_offset>
<profile_image_url>
<time_zone>
<url>
<profile_background_image_url>
<protected>
<profile_background_tile>
<followers_count>
<statuses_count>
<profile_background_color>
<notifications>
<profile_text_color>
<following>
<profile_link_color>
<verified>
<profile_sidebar_fill_color>
<profile_sidebar_border_color>
<friends_count>
<created_at>

Twitter to RDF
• Goal
• Avoid inventing own vocabulary
• Makes it easier to mix-in data later
• e.g. Facebook, Tumblr, ...
• Most Twitter data fit readily into the SIOC ontology
• The rest used Dublin Core
• Proprietary ontology for internal data

Using existing vocabs
• Dublin Core, http://dublincore.org/
• SIOC, http://sioc-project.org/
• describe information from online communities
• FOAF, http://www.foaf-project.org/
• machine-readable pages describing people
• GeoOWL

Example
float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf
Sun, 23 Aug 2009 15:10:31 +0000
<http://twitter.com/float3r/status/3492845110> a sioc:Post ;
    dc:created "Sun, 23 Aug 2009 15:10:31 +0000" ;
    sioc:has_creator <http://twitter.com/float3r> ;
    sioc:content "RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf"@no ;
    .
<http://twitter.com/float3r> a sioc:User ;
    sioc:id "251294"^^xsd:integer ;
    rdfs:label "float3r" ;
    sioc:avatar "http://a1.twimg.com/profile_images/53008680/Picture_022_normal.jpg"^^xsd:anyURI ;
    .
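
The mapping above can be sketched in a few lines of Python. This is only an illustration, not the Twarf converter: the function names `tweet_to_turtle` and `turtle_escape` and the layout of the input dict are assumptions, and prefix declarations are omitted just as on the slide. Only the sioc:, dc: and rdfs: terms come from the example.

```python
def turtle_escape(text):
    """Escape backslashes, quotes and line breaks for a Turtle string literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def tweet_to_turtle(tweet):
    """Serialize one tweet dict with the SIOC/DC terms used in the example.

    The dict layout (id, created_at, text, user{id, screen_name,
    profile_image_url}) is a hypothetical stand-in for parsed API output.
    """
    user = tweet["user"]
    user_uri = "http://twitter.com/%s" % user["screen_name"]
    post_uri = "%s/status/%d" % (user_uri, tweet["id"])
    lines = [
        "<%s> a sioc:Post ;" % post_uri,
        '    dc:created "%s" ;' % turtle_escape(tweet["created_at"]),
        "    sioc:has_creator <%s> ;" % user_uri,
        '    sioc:content "%s" .' % turtle_escape(tweet["text"]),
        "<%s> a sioc:User ;" % user_uri,
        '    sioc:id "%d"^^xsd:integer ;' % user["id"],
        '    rdfs:label "%s" ;' % turtle_escape(user["screen_name"]),
        '    sioc:avatar "%s"^^xsd:anyURI .'
        % turtle_escape(user["profile_image_url"]),
    ]
    return "\n".join(lines)


example = {
    "id": 3492845110,
    "created_at": "Sun, 23 Aug 2009 15:10:31 +0000",
    "text": "RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf",
    "user": {
        "id": 251294,
        "screen_name": "float3r",
        "profile_image_url": "http://a1.twimg.com/profile_images/53008680/"
                             "Picture_022_normal.jpg",
    },
}
print(tweet_to_turtle(example))
```

Building the strings by hand keeps the sketch dependency-free; a real converter would more likely use an RDF library to handle escaping and prefixes.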

Example, continued
float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf
Sun, 23 Aug 2009 15:10:31 +0000

WhiteTwarf/RedTwarf

WhiteTwarf
• Receives a subset of the tweets via Twitter search
• Stores metadata from Twitter
• Processes text part for internal metadata
• User references, hashtags, informal tags, URLs
• Creates canonical text representations
• Export to an RDF store for analysis
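
A rough sketch of the text-processing step (entity extraction plus a canonical text representation) might look like the following. The regular expressions, the function names and the choice of NFKD folding for the Unicode handling are all assumptions made for illustration; the slides do not say how WhiteTwarf actually does this.

```python
import re
import unicodedata

# Deliberately simple patterns; real tweets need more careful handling.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_entities(text):
    """Return the URLs, user references and hashtags found in a tweet."""
    urls = URL_RE.findall(text)
    rest = URL_RE.sub(" ", text)  # keep '#' or '@' inside URLs out of the results
    return {
        "urls": urls,
        "mentions": MENTION_RE.findall(rest),
        "hashtags": HASHTAG_RE.findall(rest),
    }


def text_signature(text):
    """Reduce a tweet to a canonical string that ignores small differences.

    Removes URLs, user references and the RT marker, folds Unicode
    look-alikes to ASCII (assumption: NFKD is enough), lowercases, and
    strips everything that is not a letter or digit.
    """
    stripped = URL_RE.sub(" ", text)
    stripped = re.sub(r"@\w+:?", " ", stripped)
    stripped = re.sub(r"\bRT\b:?", " ", stripped)
    folded = unicodedata.normalize("NFKD", stripped)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", folded.lower())


print(extract_entities("RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf"))
print(text_signature("This link is cool http://cool.com/ice.html"))
print(text_signature("RT: @iceman: This  link is COOL http://c00l.com/ice.exe"))
```

Both signature calls reduce to the same string, which is what allows an original tweet and a modified retweet to be matched later in the RDF store.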

WhiteTwarf architecture v.2
[Diagram. Components shown: Twitter Stream, tweet processing, CouchDB (URLs, tweets, users), URL processing, RDF Converter, Domain Reputation, RDF Graph Converter, 4Store]

Tweet Processing
• Tweet fetch process
• Fetch a limited number of tweets as JSON
• Extract metadata and text
• Separate user metadata from tweet metadata
• Store in the CouchDB database
• Tweet processing
• CouchDB views
• Extract tags, URLs, user references
• Text signatures
• Meant to remove small differences in text
• Normalization and whitespace removal
• UTF-8 tricks expansion/removal

CouchDB
• Scalable document and key-value store
• APIs REST and JSON based
• Embedded JavaScript and MapReduce for querying
• http://couchdb.apache.org/

4Store
• Scalable RDF Store
• Developed at Garlik in C and is GPLed
• Has REST and native APIs
• Queries via SPARQL
• http://4store.org/

RDF Stores in general
• Store RDF triples (often as quads or named graphs)
• Have recently become much more scalable
• Still have limitations, though
• e.g. have locking issues on writes
• AllegroGraph 4 (Franz, Inc) may resolve some of these for us

URL redirector processing
• For every URL extracted by the view we follow the link
• In most cases we get a 30x response (redirect)
• These are also followed after storing
• Testing showed: usually faster to use shortener APIs
• So we are testing code that uses the API instead for known shorteners
• We also capture other HTTP metadata
• Basically we are looking for malicious links
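
Following a shortened link hop by hop, and recording every intermediate URL, can be sketched as below. This is a minimal illustration rather than the Twarf redirector: `head` and `follow_redirects` are hypothetical names, the hop limit of 10 is an arbitrary choice, and the `short.example` URLs are made up so the demo runs without network access.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def follow_redirects(url, fetch=head, max_hops=10):
    """Return the list of (url, status) hops, stopping on loops or max_hops."""
    hops, seen = [], set()
    while url not in seen and len(hops) < max_hops:
        seen.add(url)
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops


# Offline demo with a canned response table instead of real HTTP.
canned = {
    "http://short.example/a": (301, "http://short.example/b"),
    "http://short.example/b": (302, "/final.exe"),
    "http://short.example/final.exe": (200, None),
}
for hop in follow_redirects("http://short.example/a", fetch=lambda u: canned[u]):
    print(hop)
```

Every hop is kept, not just the final target, because an intermediate redirector can itself be the part that later turns malicious.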

Detecting malicious activity
• Data exported to an RDF Store
• This is a graph database
• Allows for complex queries
• Does have some performance issues and is not real time
• Simple Attack scenario
• User is observed to post to a malicious domain
• We want to see what else he has posted

Correlation with domain reputations
Graph, written as edges:
mal tw:posts tweet/1234
tweet/1234 tw:hasURL http://mal.com/evil.exe
http://mal.com/evil.exe drs:hasFQDN mal.com
mal.com drs:hasRating malicious
mal tw:posts tweet/5678
tweet/5678 tw:hasURL http://unk.com/what.exe
Note that the examples in this presentation are all radically simplified for clarity.

Matching this in SPARQL
Graph pattern, written as edges:
?m tw:posts ?t1
?t1 tw:hasURL ?u1
?u1 drs:hasFQDN ?f1
?f1 drs:hasRating malicious
?m tw:posts ?t2
?t2 tw:hasURL ?u2

SELECT ?m ?u2
WHERE {
    ?m tw:posts ?t1.
    ?t1 tw:hasURL ?u1.
    ?u1 drs:hasFQDN ?f1.
    ?f1 drs:hasRating drs:Malicious.
    ?m tw:posts ?t2.
    ?t2 tw:hasURL ?u2.
}
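
To make the join behind this pattern concrete, here is a small in-memory sketch in Python that evaluates the same pattern over the simplified example graph. It stands in for the RDF store only for illustration; the function names and the tuple representation of triples are assumptions, and a real deployment would send the SPARQL query to the store instead.

```python
from collections import defaultdict

# The simplified example graph as (subject, predicate, object) triples.
TRIPLES = [
    ("mal", "tw:posts", "tweet/1234"),
    ("tweet/1234", "tw:hasURL", "http://mal.com/evil.exe"),
    ("http://mal.com/evil.exe", "drs:hasFQDN", "mal.com"),
    ("mal.com", "drs:hasRating", "drs:Malicious"),
    ("mal", "tw:posts", "tweet/5678"),
    ("tweet/5678", "tw:hasURL", "http://unk.com/what.exe"),
]


def build_index(triples):
    """Map (predicate, subject) to the set of objects for quick lookups."""
    idx = defaultdict(set)
    for s, p, o in triples:
        idx[(p, s)].add(o)
    return idx


def urls_from_malicious_posters(triples):
    """Return sorted (user, url) pairs matching the pattern.

    A user ?m qualifies if any tweet of theirs links to a URL whose FQDN is
    rated drs:Malicious; every URL that user posted is then returned as ?u2.
    """
    idx = build_index(triples)
    posters = {s for s, p, _ in triples if p == "tw:posts"}
    results = set()
    for m in posters:
        tweets = idx[("tw:posts", m)]
        is_malicious = any(
            "drs:Malicious" in idx[("drs:hasRating", f1)]
            for t1 in tweets
            for u1 in idx[("tw:hasURL", t1)]
            for f1 in idx[("drs:hasFQDN", u1)]
        )
        if is_malicious:
            for t2 in tweets:
                for u2 in idx[("tw:hasURL", t2)]:
                    results.add((m, u2))
    return sorted(results)


for user, url in urls_from_malicious_posters(TRIPLES):
    print(user, url)
```

As with the SPARQL query, the known-malicious URL itself appears in the result, since ?t2 is allowed to be the same tweet as ?t1; the interesting row is the unrated URL from the same user.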

Advantages of using RDF
• Avoid multiple explicit joins
• An intuitive visualization is available
• Mixing in new data is as easy as adding it to the RDF store

Another attack scenario
• Observed: User modified URL on retweet to be malicious
@iceman: This link is cool http://cool.com/ice.html
@notniceman: RT: @iceman: This link is cool http://c00l.com/ice.exe
Graph, written as edges:
iceman tw:posts tweet/1001
tweet/1001 tw:hasURL http://cool.com/ice.html
tweet/1001 tw:hasTextSignature thislinkiscool
notniceman tw:posts tweet/1005
tweet/1005 tw:hasTextSignature thislinkiscool
tweet/1005 tw:hasURL http://c001.com/ice.exe

Matching it
Graph pattern, written as edges:
?m1 tw:posts ?t1
?t1 tw:hasURL ?u1
?t1 tw:hasTextSignature ?ts
?m2 tw:posts ?t2
?t2 tw:hasTextSignature ?ts
?t2 tw:hasURL ?u2
?u1 ≠ ?u2
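
The same match can be sketched outside the RDF store by grouping tweets on their text signature and flagging groups whose URLs differ. The `Tweet` record and the function name `find_url_mismatches` are hypothetical, and the sample rows are taken from the simplified graph of the attack scenario plus one invented harmless retweet by a made-up user.

```python
from collections import defaultdict, namedtuple
from itertools import combinations

# Hypothetical flat record: one row per (tweet, URL) pair.
Tweet = namedtuple("Tweet", "user tweet_id signature url")


def find_url_mismatches(tweets):
    """Return pairs of tweets that share a text signature but link elsewhere.

    Each unordered pair is reported once; the SPARQL-style pattern would
    return it in both orders.
    """
    by_signature = defaultdict(list)
    for tw in tweets:
        by_signature[tw.signature].append(tw)
    pairs = []
    for group in by_signature.values():
        for a, b in combinations(group, 2):
            if a.url != b.url:
                pairs.append((a, b))
    return pairs


sample = [
    Tweet("iceman", "tweet/1001", "thislinkiscool", "http://cool.com/ice.html"),
    Tweet("notniceman", "tweet/1005", "thislinkiscool", "http://c001.com/ice.exe"),
    # Invented: an honest retweet that keeps the original URL.
    Tweet("niceman", "tweet/1009", "thislinkiscool", "http://cool.com/ice.html"),
]
for a, b in find_url_mismatches(sample):
    print(a.user, a.url, "<->", b.user, b.url)
```

The pattern only says that two URLs differ. Deciding which side is the malicious one still needs extra evidence, such as tweet order or the domain reputation data.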

A few stats
• New architecture since October
• Since the new version started, we logged 3.6 million users
• and keep about 12 million tweets around
• Which is about 300 million triples
• Once out of experimental phase we need to increase this number dramatically

RedTwarf
• Goal: look for new attack patterns
• Based on WhiteTwarf data
• Using data- and text-mining techniques to detect rules
• Probably will be Lucene based

Conclusions
• Twarf is still an experimental side project
• Twitter is becoming a popular attack vector
• Goal: protecting you, our customers
• Identifying the future development directions of Twitter threats
I would like to thank you for your support with 140 characters or less and guess what? I just did it! #swnyc

Questions?
morton@swimmer.org
