
Twarfing: Gathering Intelligence from Twitter Data


Morton Swimmer
Senior Threat Researcher
Trend Micro, Inc.
Thursday, January 21, 2010
Overview

• What is Twitter?
• Malicious activity on Twitter
• URL shortening services
• The Twarfing architecture
• Examples
• Conclusions



Thanks

• Ben April, Trend, for providing the massive hardware
• Based in part on a Virus Bulletin presentation with Costin Raiu, Chief of R&D, Kaspersky Lab



What is Twitter

• Founded by Jack Dorsey, Biz Stone and Evan Williams back in 2006
• Publish/Subscribe communications system
• SMS
• Web site
• WS API and RSS
• Application



Twitter Internals

• 140 chars max to be SMS compatible
• SMS has a 160 char restriction
• But Twitter needed to add the user name
• Just text and metadata



Users not necessarily human!

• Devices
• From buoys to power meters
• Search for Twitter on instructables.com



Users not necessarily human!

• Not surprising that malware would use it, but
• Not good for Command and Control
• Easily blocked after detection
• … and Twitter has been trigger-happy with blocking



August 2008: Malware on Twitter



April 2009 – Twitter gets hit by XSS worm

• Multiple variants of JS.Twettir.a-h identified
• Thousands of spam messages containing the word "Mikeyy" filled the timeline
• Proof of concept – no malicious intent
• Later, the author (Mikey Mooney) got a job at exqSoft Solutions, a web security company
• September 2009 – Yet another attack
• Phishing for the login details



2009: Trending topics start being exploited



June 2009 – Koobface spreading in Twitter
• Originally only targeted Facebook and MySpace users
• Now spreads through more social networks:
• Facebook, MySpace, Hi5, Bebo, Tagged, Netlog and most recently… Twitter



Twitter architecture, historically

• Multiple Ruby on Rails servers


• Mongrel HTTP servers
• Centrally MySQL backed



Twitter architecture, now
• Details super-secret, but this is what we think
• Front end
  • Ruby-based front end
  • Mongrel HTTP servers
• Back end
  • Starling for queuing/messaging
  • Scala-based code
• MySQL
  • denormalized data whenever possible
  • Only for backup and persistence
• Lots of caching using memcached



Stats (June 2009)
• 25M users
• I measured 475K different users posting over a one
week period in August 2009
• 300 tweets/sec
• MySQL handles 2400 reqs per second
• API traffic == 10x website traffic!
• Indicates that far more people are using applications
• TweetDeck, Twitterrific, Digsby, Twhirl
• Many are Adobe AIR based (!)
• One key to Twitter's success!



Twitter metadata in each tweet!

<id>
<user> (user information)
<text> (tweet text)
<created_at>
<source>
<truncated>
<in_reply_to_status_id>
<in_reply_to_user_id>
<favorited>
<in_reply_to_screen_name>



Tweet text
• Retweets: RT passes the note along
  • now also in metadata
• Location: L tells friends where I am
• Hashtags: #
  • topics
  • show group associations
  • just for tagging
• User reference: @ for public discussion
• URLs
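
The conventions above can be pulled out of the tweet text with a few regular expressions. This is a minimal sketch, not the Twarf code: the function name `parse_tweet_text` and the patterns are assumptions chosen for illustration, and the `L` location convention is left out because its exact syntax is not given here.

```python
import re

# Deliberately simple, hypothetical patterns; real tweets need more care.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")   # "@name", but not "a@b.org"
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
RETWEET_RE = re.compile(r"^RT:?\s+@(\w+)")  # "RT @name" or "RT: @name"


def parse_tweet_text(text):
    """Split tweet text into retweet marker, mentions, hashtags and URLs."""
    urls = URL_RE.findall(text)
    # Drop URLs first so a "#fragment" inside a URL is not taken as a hashtag.
    without_urls = URL_RE.sub(" ", text)
    retweet = RETWEET_RE.match(without_urls)
    return {
        "retweet_of": retweet.group(1) if retweet else None,
        "mentions": MENTION_RE.findall(without_urls),
        "hashtags": HASHTAG_RE.findall(without_urls),
        "urls": urls,
    }


if __name__ == "__main__":
    print(parse_tweet_text("RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf"))
```

Removing URLs before looking for `#` and `@` is one possible design choice; it avoids treating URL fragments as topics.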



Long URLs, short URLs

• URL shortening services have grown up around Twitter
• longurl.org counts 208 different ones
• Malicious URLs are one potential threat
• Shorteners obscure the true URL
• A link may become malicious later
• RickRolling, but maliciously
• Benefits:
• ‘bit.ly’ and others block malicious URLs



Google SB API

• August 2009
• Twitter began filtering malicious URLs
• Mikko Hyppönen: the filtering seemed to indicate the Google Safe Browsing (SB) API!
• After more testing, we discovered it used some additional filtering



User data

<id>
<name>
<screen_name>
<location>
<description>
<profile_image_url>
<url>
<protected>
<followers_count>
<profile_background_color>
<profile_text_color>
<profile_link_color>
<profile_sidebar_fill_color>
<profile_sidebar_border_color>
<friends_count>
<created_at>
<favourites_count>
<utc_offset>
<time_zone>
<profile_background_image_url>
<profile_background_tile>
<statuses_count>
<notifications>
<following>
<verified>



Twitter to RDF

• Goal
• Avoid inventing own vocabulary
• Makes it easier to mix-in data later
• e.g. Facebook, Tumblr, ...
• Most Twitter data fit readily into the SIOC ontology
• The rest used Dublin Core
• Proprietary ontology for internal data



Using existing vocabs

• DublinCore, http://dublincore.org/
• SIOC, http://sioc-project.org/
• describe information from online communities
• FOAF, http://www.foaf-project.org/
• machine-readable pages describing people
• GeoOWL



Example

float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf
Sun, 23 Aug 2009 15:10:31 +0000

<http://twitter.com/float3r/status/3492845110> a sioc:Post ;
    dc:created "Sun, 23 Aug 2009 15:10:31 +0000" ;
    sioc:has_creator <http://twitter.com/float3r> ;
    sioc:content "RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf"@no ;
    .
<http://twitter.com/float3r> a sioc:User ;
    sioc:id "251294"^^xsd:integer ;
    rdfs:label "float3r" ;
    sioc:avatar "http://a1.twimg.com/profile_images/53008680/Picture_022_normal.jpg"^^xsd:anyURI ;
    .
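
A conversion like the one in this example could be sketched as follows. This is an illustration under stated assumptions, not the actual converter: the flattened input keys (`id`, `screen_name`, `user_id`, `created_at`, `text`) and both function names are hypothetical, prefix declarations are omitted as on the slide, and the language tag and avatar are left out for brevity.

```python
import json


def turtle_literal(value):
    """Quote a string as a Turtle literal.

    JSON string escaping (quote, backslash, control characters) is also
    valid inside a Turtle double-quoted literal, so json.dumps is reused.
    """
    return json.dumps(value, ensure_ascii=False)


def tweet_to_turtle(tweet):
    """Render one tweet (hypothetical flattened dict) as SIOC/DC Turtle."""
    user_uri = "http://twitter.com/%s" % tweet["screen_name"]
    post_uri = "%s/status/%d" % (user_uri, tweet["id"])
    lines = [
        "<%s> a sioc:Post ;" % post_uri,
        "    dc:created %s ;" % turtle_literal(tweet["created_at"]),
        "    sioc:has_creator <%s> ;" % user_uri,
        "    sioc:content %s ." % turtle_literal(tweet["text"]),
        "<%s> a sioc:User ;" % user_uri,
        '    sioc:id "%d"^^xsd:integer ;' % tweet["user_id"],
        "    rdfs:label %s ." % turtle_literal(tweet["screen_name"]),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(tweet_to_turtle({
        "id": 3492845110,
        "screen_name": "float3r",
        "user_id": 251294,
        "created_at": "Sun, 23 Aug 2009 15:10:31 +0000",
        "text": "RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf",
    }))
```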



Example, continued

float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf
Sun, 23 Aug 2009 15:10:31 +0000



WhiteTwarf/RedTwarf

White Twarf



WhiteTwarf
• Receives a subset of the tweets via Twitter search
• Stores metadata from Twitter
• Processes text part for internal metadata
• User references, hashtags, informal tags, URLs
• Creates canonical text representations
• Exports to an RDF store for analysis



WhiteTwarf architecture v.2

[Architecture diagram. Components: Twitter Stream, tweet processing, CouchDB (tweets, users, URLs), URL processing, RDF Converter, Domain Reputation with its own RDF Converter, and the RDF Graph in 4Store]

Tweet Processing

• Tweet fetch process
  • Fetch a limited number of tweets as JSON
  • Extract metadata and text
  • Separate user metadata from tweet metadata
  • Store in the CouchDB database
• Tweet processing
  • CouchDB views
  • Extract tags, URLs, user references
  • Text signatures
    • Meant to remove small differences in text
    • Normalization and whitespace removal
    • UTF-8 tricks expansion/removal

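
A text signature of the kind listed above could look roughly like this. It is a sketch only: the actual normalization rules are not given in these slides, so the function name `text_signature`, the use of Unicode NFKD folding for the "UTF-8 tricks", and the decision to drop URLs, `RT` markers and user references are all assumptions.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"\bRT\b:?")
MENTION_RE = re.compile(r"(?<!\w)@\w+:?")


def text_signature(text):
    """Reduce tweet text to a canonical form so near-duplicates compare equal."""
    # NFKD folds compatibility characters (e.g. full-width letters) and
    # splits accented letters into a base letter plus a combining mark.
    text = unicodedata.normalize("NFKD", text)
    text = URL_RE.sub("", text)
    text = RT_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    # Keep letters and digits only: drops whitespace, punctuation and
    # the combining marks produced by NFKD.
    return "".join(ch for ch in text.lower() if ch.isalnum())


if __name__ == "__main__":
    print(text_signature("@iceman: This link is cool http://cool.com/ice.html"))
```

With these assumed rules, an original tweet and a retweet of it with a different link produce the same signature, which is what the later retweet scenario relies on.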
CouchDB

• Scalable document and key-value store
• APIs are REST and JSON based
• Embedded JavaScript and MapReduce for querying
• http://couchdb.apache.org/



4Store

• Scalable RDF Store


• Developed at Garlik in C and is GPLed
• Has REST and native APIs
• Queries via SPARQL
• http://4store.org/



RDF Stores in general

• Store RDF triples (often as quads or named graphs)
• Have recently become much more scalable
• Still have limitations, though
• e.g. have locking issues on writes
• AllegroGraph 4 (Franz, Inc) may resolve some of these for us



URL redirector processing

• For every URL extracted by the view we follow the link
• In most cases we get a 30x response (redirect)
• These are also followed after storing
• Testing showed: usually faster to use shortener APIs
• So we are testing code that uses the API instead for known shorteners
• We also capture other HTTP metadata
• Basically we are looking for malicious links
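
Following a chain of 30x responses hop by hop, recording each hop before moving on, might be sketched like this. The function names, the hop limit and the use of HEAD requests are assumptions for illustration; the shortener-API variant mentioned above is not shown.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def http_head(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status, Location header or None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def follow_redirects(url, fetch=http_head, max_hops=10):
    """Return the list of (url, status) hops, recording each before following."""
    hops = []
    seen = set()
    while len(hops) < max_hops and url not in seen:
        seen.add(url)
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # the Location header may be relative
    return hops
```

The `fetch` parameter is injectable so the chain logic can be exercised without network access, for example `follow_redirects(u, fetch=lambda x: table[x])` with a small dictionary of canned responses.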



Detecting malicious activity

• Data exported to an RDF Store


• This is a graph database
• Allows for complex queries
• Does have some performance issues and is not real time
• Simple Attack scenario
• User is observed to post to a malicious domain
• We want to see what else he has posted



Correlation with domain reputations
mal tw:posts tweet/1234
tweet/1234 tw:hasURL http://mal.com/evil.exe
http://mal.com/evil.exe drs:hasFQDN mal.com
mal.com drs:hasRating malicious
mal tw:posts tweet/5678
tweet/5678 tw:hasURL http://unk.com/what.exe

Note that the examples in this presentation are all radically simplified for clarity.

Matching this in SPARQL
?m tw:posts ?t1
?t1 tw:hasURL ?u1
?u1 drs:hasFQDN ?f1
?f1 drs:hasRating malicious
?m tw:posts ?t2
?t2 tw:hasURL ?u2



Matching this in SPARQL
SELECT ?m ?u2
WHERE {
    ?m tw:posts ?t1.
    ?t1 tw:hasURL ?u1.
    ?u1 drs:hasFQDN ?f1.
    ?f1 drs:hasRating drs:Malicious.
    ?m tw:posts ?t2.
    ?t2 tw:hasURL ?u2.
}
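
To see what this query matches, here is a plain-Python stand-in for the graph pattern, run over the simplified triples from the previous slide. It is an illustrative sketch, not how 4Store evaluates SPARQL: the `match` function is a naive hypothetical matcher and the triples are written as plain strings.

```python
TRIPLES = {
    ("mal", "tw:posts", "tweet/1234"),
    ("tweet/1234", "tw:hasURL", "http://mal.com/evil.exe"),
    ("http://mal.com/evil.exe", "drs:hasFQDN", "mal.com"),
    ("mal.com", "drs:hasRating", "drs:Malicious"),
    ("mal", "tw:posts", "tweet/5678"),
    ("tweet/5678", "tw:hasURL", "http://unk.com/what.exe"),
}

QUERY = [
    ("?m", "tw:posts", "?t1"),
    ("?t1", "tw:hasURL", "?u1"),
    ("?u1", "drs:hasFQDN", "?f1"),
    ("?f1", "drs:hasRating", "drs:Malicious"),
    ("?m", "tw:posts", "?t2"),
    ("?t2", "tw:hasURL", "?u2"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is bound to something else.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, new)


# Equivalent of SELECT ?m ?u2, with duplicates removed.
results = {(b["?m"], b["?u2"]) for b in match(QUERY, TRIPLES)}
```

As in the SPARQL version, nothing forces `?t2` to differ from `?t1`, so the result contains both the known malicious URL and the unknown one posted by the same user.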



Advantages of using RDF

• Avoid multiple explicit joins
• An intuitive visualization available
• Mixing in new data as easy as adding it to the RDF store



Another attack scenario

• Observed: User modified URL on retweet to be malicious
@iceman: This link is cool http://cool.com/ice.html
@notniceman: RT: @iceman: This link is cool http://c00l.com/ice.exe

iceman tw:posts tweet/1001
tweet/1001 tw:hasURL http://cool.com/ice.html
tweet/1001 tw:hasTextSignature thislinkiscool
notniceman tw:posts tweet/1005
tweet/1005 tw:hasTextSignature thislinkiscool
tweet/1005 tw:hasURL http://c00l.com/ice.exe



Matching it

?m1 tw:posts ?t1
?t1 tw:hasURL ?u1
?t1 tw:hasTextSignature ?ts
?m2 tw:posts ?t2
?t2 tw:hasTextSignature ?ts
?t2 tw:hasURL ?u2
?u1 ≠ ?u2
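
The same match can be sketched outside the RDF store by grouping tweets on their text signature and reporting pairs whose URLs differ. The tuple layout and the function name `find_modified_retweets` are assumptions for illustration only.

```python
from collections import defaultdict


def find_modified_retweets(tweets):
    """Yield pairs of tweets with the same text signature but different URLs.

    tweets: iterable of (user, tweet_id, signature, url) tuples.
    """
    by_signature = defaultdict(list)
    for tweet in tweets:
        by_signature[tweet[2]].append(tweet)
    for group in by_signature.values():
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if first[3] != second[3]:
                    yield first, second


if __name__ == "__main__":
    sample = [
        ("iceman", "tweet/1001", "thislinkiscool", "http://cool.com/ice.html"),
        ("notniceman", "tweet/1005", "thislinkiscool", "http://c00l.com/ice.exe"),
    ]
    for original, suspect in find_modified_retweets(sample):
        print(original[1], suspect[1], suspect[3])
```

A differing URL is only a hint: two shorteners can point at the same target, so in practice the URLs would presumably be resolved and checked against domain reputation before anything is flagged.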



A few stats

• New architecture since October
• Since the new version started, we logged 3.6 million users
• and keep about 12 million tweets around
• Which is about 300 million triples
• Once out of experimental phase we need to increase this number dramatically



RedTwarf

• Goal: look for new attack patterns
• Based on WhiteTwarf data
• Using data- and text-mining techniques to detect rules
• Probably will be Lucene based



Conclusions

• Twarf is still an experimental side project
• Twitter is becoming a popular attack vector
• Goal: protecting you, our customers
• Identifying the future development directions of Twitter threats

I would like to thank you for your support with 140 characters or less and guess what? I just did it! #swnyc



Questions?
morton@swimmer.org

