Foursquare - ML Presentation

Big Data @ foursquare
Infrastructure, Analytics, Prediction, and Beyond
Justin Moore - @injust
Matthew Rathbone - @rathboma
3/22/2011 Machine Learning Meetup

Overview
•  What is foursquare
•  Analytics and Data
•  Machine Learning, Recommendations

What is Foursquare?
•  Location-based startup; an application that helps you explore your city and discover new places
•  Visit places, check in, earn rewards, stay connected with your friends
•  Game elements: single-player, multi-player

What is Foursquare? (cont.)
•  7M+ users, 15M+ venues, 500M+ check-ins
•  Large reach (every country, North Pole, Space, Everest)
•  Native app for almost every smartphone; also available on SMS, web, mobile web

Explore
•  Our new social recommendation engine
•  Real-time suggestions based on your social graph.

Data Model
Users, Check-ins, Venues, Shouts, Tips/To-dos

Analytics @ Foursquare
I’m going to talk about:
•  Why production DBs are bad for analytics
•  What we do to make it better (hint: Hadoop)
•  Our custom dashboard
•  Usage examples
•  Thoughts about the Hadoop/Hive experience

Our Data: Problems using the Production Databases

Our data: So we turn to our friends
Our reporting / analytics / data mining stack exists thanks to open source software

Our data: What we do instead
Log Files
About Hadoop and Hive
Hadoop:
•  Distributed data processing framework (map-reduce).
•  Written in Java

Hive:
•  SQL layer on top of Hadoop
•  Lets us do “select count(1) from checkins” instead of having to write our own map-reduce Java classes.
Image from ibm.com

About Hive
•  Create/Drop/Insert/Select etc
•  Table Joins
•  Aggregation Functions
•  Date Functions
•  URL parsing functions
•  Cool n-gram functions
•  Just now getting database drivers for popular languages (Java)

About Hive
Select * from x;
Select count(1) from x;
Select sum(x.price) from x;
Select a, sum(price) from x group by a;
Select a from x where datediff('2011-01-01', d) = 0;
Drop table x;

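Hadoop vs Hive

Hadoop streaming (Ruby):

  # mapper: emit "date,country" for every check-in record
  $stdin.each do |line|
    date, country, _id = line.split
    puts "#{date},#{country}"
  end

  # reducer: count how often each "date,country" key appears
  counts = Hash.new(0)
  $stdin.each do |line|
    counts[line.chomp] += 1
  end
  counts.each { |key, total| puts "#{key},#{total}" }

VS

Hive:

  SELECT created_date, country, count(1)
  FROM checkins
  GROUP BY created_date, country;
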
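
The mapper and reducer steps can also be tried without a cluster by running both in a single Ruby process. This is a minimal sketch: the sample records, the whitespace-separated field layout, and the names map_checkin and reduce_counts are assumptions made for illustration, not part of the real job.

```ruby
# Single-process imitation of the streaming job: map every record, then reduce.
# The sample data and the "date country id" layout are assumptions.

# Map step: turn one "date country id" record into a "date,country" key.
def map_checkin(line)
  date, country, _id = line.split
  "#{date},#{country}"
end

# Reduce step: count how many times each key was emitted.
def reduce_counts(keys)
  keys.each_with_object(Hash.new(0)) { |key, counts| counts[key] += 1 }
end

SAMPLE_CHECKINS = [
  "2011-03-01 US 101",
  "2011-03-01 US 102",
  "2011-03-01 CH 103",
  "2011-03-02 US 104"
].freeze

MAPPED_KEYS = SAMPLE_CHECKINS.map { |line| map_checkin(line) }
COUNTS_BY_DAY_AND_COUNTRY = reduce_counts(MAPPED_KEYS)

# Print one "date,country,total" row per group, like the Hive GROUP BY would.
COUNTS_BY_DAY_AND_COUNTRY.each { |key, total| puts "#{key},#{total}" }
```

On a real cluster Hadoop sorts and groups the mapper output between the two steps; here the in-memory array stands in for that shuffle.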
Our Hadoop Infrastructure
•  We use clusters generated through Amazon’s Elastic MapReduce
•  That means we store all of our data in flat files in Amazon S3 (which keeps things simple)
•  We export data from both MongoDB and HTTP proxy log files
•  We manage everything using a custom Ruby on Rails dashboard
“rake cluster:start[30]” => starts a 30-node cluster, just like that

Our Dashboard
•  Define and schedule reports through it
•  Allows ad-hoc access to (internal) users
•  Controls data imports into S3 from Mongo/log files
•  Provides an intermediate DB layer for rollup data caching (experimental at the moment)
•  Allows you to do a bunch of cool stuff with queries after they’ve run

Example: Importing Data

Example: Query Walkthrough
Find top 20 venues in Switzerland

venuename                       city            total
zurich airport (zrh)            kloten          3746
geneva-cointrin airport (gva)   grand-saconnex  3012
zurich hauptbahnhof             zurich          1780
sony ericsson football hotspot  basel           773
basel bahnhof sbb               basel           761
gare de cornavin                geneva          760
bern hauptbahnhof               bern            736
gare de lausanne                lausanne        672
apple store                     zurich          670
bahnhof luzern                  luzern          477
terminal e                      kloten          458
bellevueplatz                   zurich          457
terminal a                      kloten          455
bahnhof oerlikon                zurich          453
bahnhof stadelhofen             zurich          444
sihlcity                        zurich          400
zurich flughafen bahnhof        zurich          400
bahnhof olten                   olten           391
bahnhof winterthur              winterthur      379
bahnhof hardbrücke              zurich          369

Walkthrough: Start the query

Walkthrough: Get the results in email

Walkthrough: Top Venues

Walkthrough
If we want to schedule
something to run daily/weekly/
monthly we can do that too

Reports are represented as ActiveRecord models

Walkthrough: Reports feed our
dashboards

Walkthrough: queries allow data exploration

Stats on the Stats Stack
•  25-machine clusters
•  Reports on check-in data (joining venues and/or users) usually take 5-15 minutes to run
•  Reports on log data usually take 10-20 minutes to run
•  We run 10-30 reports a day
•  Most data goes into a Google spreadsheet for people to look at.

Thoughts on Amazon’s EMR
•  The API has very low rate limits
•  Everything is an HTTP GET request (even creating a cluster)
•  The Ruby client is unusable as a client library (we shell out to it in order to capture the resulting JSON)

Thoughts on Hive
•  Generally good
•  Sometimes it will act crazy
•  Partitioning data is harder than it looks
•  The JSON serde makes all sorts of weird stuff
happen when you’re joining tables
•  Always join LAST!

Working With Hive

OK:

  SELECT
    v.venuename,
    count(*)
  FROM
    checkins c
    JOIN venues v
    ON c.venueid = v.id
  GROUP BY v.venuename

BETTER:

  SELECT
    v.venuename,
    c.total
  FROM
    (SELECT
       venueid,
       count(1) AS total
     FROM checkins
     GROUP BY venueid
    ) c
    JOIN venues v
    ON c.venueid = v.id

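
The reason for joining last can be illustrated outside Hive. The sketch below is a hedged, in-memory imitation of the two query shapes: CHECKINS and VENUES are made-up stand-ins for the real tables, and the counts are sample data. Both plans return the same totals, but the second one joins only one row per venue instead of one row per check-in.

```ruby
# Hypothetical in-memory stand-ins for the checkins and venues tables.
CHECKINS = [
  { venueid: 1 }, { venueid: 1 }, { venueid: 1 }, { venueid: 2 }, { venueid: 2 }
].freeze
VENUES = { 1 => "zurich hauptbahnhof", 2 => "bellevueplatz" }.freeze

# "OK" shape: join every check-in row to its venue, then count per venue name.
def join_then_count(checkins, venues)
  joined = checkins.map { |c| venues.fetch(c[:venueid]) } # one join per check-in
  totals = joined.each_with_object(Hash.new(0)) { |name, acc| acc[name] += 1 }
  { totals: totals, rows_joined: joined.size }
end

# "BETTER" shape: count per venueid first, then join only the aggregated rows.
def count_then_join(checkins, venues)
  per_venue = checkins.each_with_object(Hash.new(0)) { |c, acc| acc[c[:venueid]] += 1 }
  totals = per_venue.each_with_object({}) do |(venueid, total), acc|
    acc[venues.fetch(venueid)] = total # one join per venue
  end
  { totals: totals, rows_joined: per_venue.size }
end

OK_PLAN = join_then_count(CHECKINS, VENUES)
BETTER_PLAN = count_then_join(CHECKINS, VENUES)

puts "same totals: #{OK_PLAN[:totals] == BETTER_PLAN[:totals]}"
puts "rows joined: #{OK_PLAN[:rows_joined]} vs #{BETTER_PLAN[:rows_joined]}"
```

With hundreds of millions of check-ins and far fewer venues, the gap between the two row counts is what makes the second shape cheaper.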
Our Data: End
•  Hadoop + Hive > Mongo + Scripts
•  Simple ruby dashboard == super useful
•  Lots of data == fun charts
QUESTIONS?
foursquare 3.0: Explore

Engineering an Online Recommendation System

Engineering cont.
Goals:
•  “Here and now”
•  No new signals
•  Use all of our textual data
•  100ms per query

Engineering cont.
Pain points:
•  Geo indexes, compound geo indexes
•  Limiting queries in minimally impactful ways
•  Cached datastores (building rollup collections)
•  Geo indexes

Computing a Similarity Matrix
•  Analyzing similarity functions is OK on a single machine
•  10M+ venues = 100 trillion element sparse matrix
–  Compute without visiting every element
–  Parallelize across machines

Compute Similarity Matrix, cont.
•  Leverage Mahout’s library of similarity functions, easy to extend
•  Job system controls execution of sequential dependent M-R tasks
•  Hadoop: easily scalable to large commodity machine clusters; Elastic MapReduce makes increasing cluster size trivial

Compute Similarity Matrix, cont.
Series of “Jobs,” each doing a Map-Reduce
1.  Convert input flat file dumped from Hive to binary sparse vector representation
2.  Compute pairwise co-occurrences
3.  Compute column-based weights (column normalization), retrieve all vectors with co-occurrences
4.  Compute pairwise similarities, store in sparse matrix format
5.  Flatten sparse matrix to text format that we can load into DB
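
The five steps above can be sketched on a single machine with toy data. This is only an illustration under stated assumptions: the tab-separated "userid, venueid" rows are hypothetical, cosine similarity over binary vectors is an assumed choice of similarity function, and nothing here uses Mahout or Map-Reduce. It does show how counting co-occurrences per user avoids visiting every cell of the venue-by-venue matrix.

```ruby
require "set"

# Hypothetical flat-file rows ("userid<TAB>venueid"), standing in for a Hive dump.
FLAT_ROWS = [
  "u1\tcafe", "u1\tpark", "u2\tcafe", "u2\tpark", "u2\tbar", "u3\tcafe"
].freeze

# Step 1: sparse representation, one set of venues per user.
def venues_by_user(rows)
  rows.each_with_object(Hash.new { |h, k| h[k] = Set.new }) do |row, acc|
    user, venue = row.split("\t")
    acc[user] << venue
  end
end

# Step 2: pairwise co-occurrences, touching only venue pairs that share a user.
def cooccurrences(by_user)
  by_user.values.each_with_object(Hash.new(0)) do |venues, counts|
    venues.to_a.sort.combination(2) { |pair| counts[pair] += 1 }
  end
end

# Step 3: per-venue weights (here simply the number of distinct visitors).
def visitor_counts(by_user)
  by_user.values.each_with_object(Hash.new(0)) do |venues, counts|
    venues.each { |venue| counts[venue] += 1 }
  end
end

# Step 4: cosine similarity for binary vectors: shared / sqrt(n_a * n_b).
def similarities(cooc, visitors)
  cooc.each_with_object({}) do |((a, b), shared), out|
    out[[a, b]] = shared / Math.sqrt(visitors[a] * visitors[b])
  end
end

# Step 5: flatten the sparse result to text rows that a DB import could read.
def flatten(sims)
  sims.map { |(a, b), score| format("%s\t%s\t%.4f", a, b, score) }
end

BY_USER = venues_by_user(FLAT_ROWS)
SIMS = similarities(cooccurrences(BY_USER), visitor_counts(BY_USER))
SIMILARITY_ROWS = flatten(SIMS)

puts SIMILARITY_ROWS
```

Only the three venue pairs that actually share a visitor are ever stored, which is the property that keeps a 10M-venue matrix tractable.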

The Value of Why
•  Show people which friends visited, which places are co-visited (not the same as “similar”?)
•  Lowers the bar for precision
–  Allows users to choose for themselves among recs
–  Increases propensity to check in (sales pitch for the venue)
•  Mix with the social, story-telling aspects of the product
•  Collaborative filtering allows for easy description

Case Study: Defining “Interesting”
•  Need to show ranked venues for “cold-start”
•  Various influencing factors in what makes a place “interesting”
–  Number of users checked in
–  Average visits per user
–  Tips left, to-dos done
–  How people check in (broadcast to Twitter/Facebook, off-the-grid?)
–  Trending direction (more popular lately?)
•  Measuring raw popularity poses problems
–  Places open just for lunch, smaller dining rooms, longer meal times
–  Been in system longer, opened recently
–  Differences between categories (coffee shops != burger joints)
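
Two of the factors listed above, number of users checked in and average visits per user, can be computed directly from a check-in log. The sketch below is a minimal illustration: CHECKIN_LOG and its venue names are made-up sample data, and venue_stats is a hypothetical helper, not the production ranking.

```ruby
# Hypothetical check-in log: [userid, venue] pairs.
CHECKIN_LOG = [
  ["u1", "corner cafe"], ["u1", "corner cafe"], ["u1", "corner cafe"],
  ["u2", "corner cafe"], ["u2", "corner cafe"], ["u2", "corner cafe"],
  ["u1", "landmark"], ["u2", "landmark"], ["u3", "landmark"], ["u4", "landmark"]
].freeze

# For each venue, compute unique users and average visits per user.
def venue_stats(log)
  # visits[venue][user] holds how often that user checked in at that venue.
  visits = Hash.new { |h, venue| h[venue] = Hash.new(0) }
  log.each { |user, venue| visits[venue][user] += 1 }

  visits.each_with_object({}) do |(venue, per_user), stats|
    unique_users = per_user.size
    total_visits = per_user.values.sum
    stats[venue] = {
      unique_users: unique_users,
      visits_per_user: total_visits.to_f / unique_users
    }
  end
end

VENUE_STATS = venue_stats(CHECKIN_LOG)

VENUE_STATS.each do |venue, s|
  puts "#{venue}: #{s[:unique_users]} users, #{s[:visits_per_user]} visits per user"
end
```

In this toy data the cafe has few users who return often, while the landmark has more users who each visit once; raw check-in totals alone (6 vs 4) would hide that difference.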

Defining “Interesting” cont.
[Chart: Visits Per User (y-axis, 0 to 7) against Unique Users (x-axis); regions labeled “Local Favorite” (high visits per user) and “Must See” (low visits per user)]

Future Directions
•  Still a big unknown, collect user feedback to drive development
•  Scale beyond just co-occurrences, improve prediction in new territory
•  Planning mode (beyond the here and now)
•  Joint recommendations (where do I go with this set of friends?)

Help us get there
foursquare is hiring
www.foursquare.com/jobs

Justin Moore - @injust - justin@foursquare.com
Matthew Rathbone - @rathboma - matthew@foursquare.com

