Adventures in Machine Learning, Statistics, and Big Data

What is Foursquare?
• Location-based startup, application that helps you explore your city
• Visit places, check in, earn rewards, stay connected with your friends
• Game elements: single-player, multi-player

What is Foursquare? (cont.)
• 4M+ users, 10M+ venues, 200M+ check-ins
• Large reach (most major countries, North Pole, Space)
• Native app for almost every smartphone; also available on SMS, web, mobile-web

Data Model
Users, Check-ins, Venues



Big Data
• Some problems are no longer solvable by simple/naïve algorithms or simple crowd-sourcing
• Interesting problems can now be solved using statistical methods: prediction, classification, optimization

Example Problems
• Prediction: Recommending places to people, people to places, people to people, places from places, tips, events, checking-in, “interestingness”
• Optimization: Search, ranking, user experience
• Classification: Categorizing venues, removing junk, spam, duplicates
• The list goes on…

Zooming in… Recommending Places to People
• What do we have available?
– Check-ins (user history, venue history)
– Venue meta-data, user meta-data
– Friend graph

• What do we want to do?
– Social, fun (interesting >? optimal)
– Be smart, provide serendipity, hit the tail

Initial Thoughts
• Want a hybrid model, many features
• Need to be scalable, fast (web-scale; offline computations are OK as long as they scale linearly)
• Start with dumb, get smarter where possible. Iterate. Something is better than nothing. Data is key.

Start with Simple
• Popularity: user independent, works for cold-start, can be extended:
– Decay popularity: recently popular, new, long-term
– Break down by time of day, day of week
– Unique users vs. hits per user (must see vs. hidden gem vs. generally popular)

• Bubble up “interesting” things: specials, todos, similar tips
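The "decay popularity" and "unique users vs. hits" ideas above could be sketched as follows. This is a minimal illustration, not Foursquare's implementation: `CheckIn`, `DecayedPopularity`, and the half-life parameter are hypothetical names, and exponential decay is just one possible choice of decay curve.

```scala
// Hypothetical sketch: all names and the half-life decay are assumptions for illustration.
final case class CheckIn(userId: Long, venueId: Long, timestampDays: Double)

object DecayedPopularity {
  // Each check-in contributes 0.5^(age / halfLife): a check-in made one
  // half-life ago counts half as much as one made right now.
  def score(checkIns: Seq[CheckIn], nowDays: Double, halfLifeDays: Double): Map[Long, Double] =
    checkIns
      .groupBy(_.venueId)
      .map { case (venueId, cs) =>
        venueId -> cs.map(c => math.pow(0.5, (nowDays - c.timestampDays) / halfLifeDays)).sum
      }

  // Unique users per venue, to contrast with the raw hit count.
  def uniqueUsers(checkIns: Seq[CheckIn]): Map[Long, Int] =
    checkIns
      .groupBy(_.venueId)
      .map { case (venueId, cs) => venueId -> cs.map(_.userId).distinct.size }
}
```

A short half-life favors "recently popular" venues, while a long one approximates long-term popularity; both are linear in the number of check-ins, so they fit the offline, linearly scaling constraint.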

Unstructured vs. Commuting
http://

Breakfast vs. Brunch

More Complex
• Add some social elements. Where do your friends go? Can we rate your friends?
– Good for users with a small check-in history and a large # of friends
– Can we determine friend quality from check-ins?
– Even if weak mathematically, social can triumph: “Jane, John, and 17 other friends went here”
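The social signal above can be as simple as collecting which friends have visited a venue and rendering the result as a line of text. The sketch below assumes friend check-ins arrive as (name, venue id) pairs; `FriendSignal` and its methods are hypothetical names, not part of any Foursquare API.

```scala
// Hypothetical sketch: names and input shape are assumptions for illustration.
object FriendSignal {
  // friendVisits holds one (friendName, venueId) pair per friend check-in.
  def friendsAt(friendVisits: Seq[(String, Long)], venueId: Long): Seq[String] =
    friendVisits.collect { case (name, v) if v == venueId => name }.distinct

  // Name the first two friends and summarize the rest as a count.
  def socialLine(friends: Seq[String]): String = {
    val named = friends.take(2)
    val others = friends.size - named.size
    if (named.isEmpty) ""
    else if (others == 0) named.mkString(" and ") + " went here"
    else {
      val tail = if (others == 1) "1 other friend" else s"$others other friends"
      named.mkString(", ") + s", and $tail went here"
    }
  }
}
```

Which two friends get named is left to the caller's ordering of the input; a friend-quality score, if one exists, would be a natural sort key.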

Hard: Check-in History
• Can we accurately predict where you want to go based on where you went? Seems good, but how to do it? How to scale it?
• Lots of research in this area lately, mostly because of Netflix and Amazon before them
– Collaborative filtering (venue-to-venue, user-to-user)
– Factorization, dimensionality reduction
– Clustering
– SVM
– Linear models
– Context Filtering/Search

• The branching factor for choosing a method increases dramatically at this stage. Although it provides the most value, it is the most difficult to do right.

Choosing a Venue Similarity Metric
• Correlation: cov(A, B) / (std(A) * std(B))
• Cosine similarity: (A · B) / (||A|| * ||B||)
• How do we adjust for scale? How much?
• How to remove neighborhood effect?
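A minimal sketch of the two metrics, assuming each venue is represented as a vector of per-user check-in counts (the slides do not say how A and B are built, so that representation is an assumption, as are the names `Similarity`, `cosine`, and `correlation`):

```scala
// Hypothetical sketch: vector representation and names are assumptions for illustration.
object Similarity {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Cosine similarity: (A . B) / (||A|| * ||B||).
  // Returning 0.0 for a zero-length vector is a convention chosen here.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.size == b.size, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(y => y * y).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Pearson correlation: cov(A, B) / (std(A) * std(B)),
  // which equals the cosine similarity of the mean-centered vectors.
  def correlation(a: Seq[Double], b: Seq[Double]): Double = {
    val (ma, mb) = (mean(a), mean(b))
    cosine(a.map(_ - ma), b.map(_ - mb))
  }
}
```

The only difference between the two is the mean-centering step: correlation ignores a constant offset in check-in volume, while cosine similarity does not, which is one concrete angle on the "adjust for scale" question.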


kNN Collaborative Filtering

    val result =
      for {
        historyVenue <- venueHistory
        vPair <- scores.get(currentVenue, historyVenue)
        if vPair.distance > 0 && vPair.r > 0
        modScore = vPair.r /
          (1 + math.exp(vPair.distance / -1000.0 + 1.75))  // WTF?
      } yield PairScore(venue, modScore, vPair.numCheckins)

    // Keep the K highest-scoring neighbors
    val kNN = result.sortBy(_.modScore).reverse.take(K)

    // The text before each "=>" is cut off on the slide; kNN.map(n ...) is assumed
    kNN.map(n => n.modScore * math.log(n.numCheckins + 1)).sum /
      kNN.map(n => n.modScore).sum
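The factor flagged "WTF?" in the code above, 1 / (1 + exp(-d / 1000 + 1.75)), is a logistic curve in the distance d: roughly 0.15 at d = 0, exactly 0.5 at d = 1750, and approaching 1 for large d. Assuming d is in meters (the slide does not state units), it down-weights the similarity of venues that are very close together, which is one plausible reading of "remove neighborhood effect". The sketch below isolates that factor and the final weighted average; `DistanceDamping`, `Neighbor`, `weight`, and `predict` are hypothetical names.

```scala
// Hypothetical sketch: names and the meters assumption are for illustration only.
object DistanceDamping {
  // Same expression as the damping factor in the slide's code; d assumed to be in meters.
  def weight(d: Double): Double = 1.0 / (1.0 + math.exp(d / -1000.0 + 1.75))

  final case class Neighbor(modScore: Double, numCheckins: Int)

  // Score-weighted average of log(numCheckins + 1) over the neighbors,
  // mirroring the final expression on the slide. Returns 0.0 when there
  // is no positive weight to average over (a convention chosen here).
  def predict(neighbors: Seq[Neighbor]): Double = {
    val total = neighbors.map(_.modScore).sum
    if (total == 0.0) 0.0
    else neighbors.map(n => n.modScore * math.log(n.numCheckins + 1.0)).sum / total
  }
}
```

With these constants, a pair of venues 300 m apart keeps only a fraction of its raw correlation, while a pair 7.6 km apart keeps nearly all of it.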

Similarity Matrix (symmetric?)
[Figure: venue-by-venue similarity matrix; both axes labeled "Venue"]

Similarity Data: Correlation?
[Figure: correlation (axis from -0.25 to 0.3) plotted against Distance Between Venues (0 to 20 km); annotations on the plot: "No Data", "Lazy?", "Interesting!"]

What’s the Difference?

Relatively close (300 m), high correlation (0.26)

Relatively far (7.6 km), low correlation (0.04)

We’re Hiring!
• Looking for developers in the field of ML and/or Statistics; also hiring across the board
• We use cool tech: Scala, Lift, MongoDB
• Small company (35 people), flexible work environment, lots of big projects to work on

Seriously, Come Work for Us
• Lots of ex-finance employees (almost half our engineers!), lots of ex-*soft employees (more than half our engineers!)… note the “ex-”
• Fast-growing company, lots of innovation
• Many very smart people with common goals
