Online Learning With Streamdrill

Mikio L. Braun, @mikiobraun http://blog.mikiobraun.de TWIMPACT http://twimpact.com

Recommender Stammtisch, plista, Berlin, June 5, 2013 © 2013 Mikio L. Braun

Event Data
Finance, Gaming, Monitoring, Advertisement, Sensor Networks, Social Media

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

Event Data is Huge: Volume

The problem: You easily get A LOT OF DATA!
– 100 events per second
– 360k events per hour
– 8.6M events per day
– 260M events per month
– 3.2B events per year
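
The figures above follow from simple multiplication, which the short sketch below reproduces. The 30-day month and 365-day year are assumptions made here; the slide's numbers appear to be rounded versions of these products.

```python
# Back-of-the-envelope event volumes at a constant rate of 100 events/second.
# Assumption (not from the slides): a month has 30 days, a year has 365 days.
RATE_PER_SECOND = 100

SECONDS = {
    "hour": 60 * 60,
    "day": 24 * 60 * 60,
    "month": 30 * 24 * 60 * 60,
    "year": 365 * 24 * 60 * 60,
}


def events_per(period: str, rate: int = RATE_PER_SECOND) -> int:
    """Number of events seen in one period at `rate` events per second."""
    return rate * SECONDS[period]


for period in SECONDS:
    print(f"{events_per(period):>15,} events per {period}")
```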

Event Data is Huge: Diversity

Potentially large spaces:
– distinct words: >100k
– IP addresses: >100M
– users in a social network: >10M

http://wordle.net

http://www.flickr.com/photos/arenamontanus/269158554/

Recommenders

Stream Mining to the rescue

Stream mining algorithms: answer “stream queries” with finite resources

Typical examples:
– how often does an item appear in a stream?
– how many distinct elements are in the stream?
– what are the top-k most frequent items?

[Figure labels: Continuous Stream of Data, Bounded Resource Analyzer, Stream Queries]

The Trade-Off
[Figure: trade-off between Big Data, Fast, and Exact; approaches shown: Stream Mining, Map Reduce and friends]

First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra

Heavy Hitters (a.k.a. Top-k)

Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users).
Interested in most active elements only.

Fixed tables of counts:

frank    15
paul     12
jan       8
leo       5
conrad    3
stephan   2

Case 1: element already in database. “paul” arrives: paul 12 becomes paul 13.
Case 2: new element. “hugo” arrives: stephan 2 is replaced by hugo 3.

Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
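
The two cases on this slide can be sketched in a few lines of Python. This is a minimal illustration in the spirit of the Space-Saving scheme from the cited paper, not streamdrill's implementation; the class name `SpaceSaving` and its methods are made up for this example, and the linear scan for the smallest entry is a simplification.

```python
from typing import Dict, List, Tuple


class SpaceSaving:
    """Fixed-size table of counts (hypothetical helper for illustration)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: Dict[str, int] = {}

    def update(self, item: str) -> None:
        if item in self.counts:
            # Case 1: element already in the table, just increment it.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start a new counter.
            self.counts[item] = 1
        else:
            # Case 2: new element and the table is full. Evict the smallest
            # entry; the newcomer inherits its count plus one.
            victim = min(self.counts, key=lambda name: self.counts[name])
            smallest = self.counts.pop(victim)
            self.counts[item] = smallest + 1

    def top(self, k: int) -> List[Tuple[str, int]]:
        """Return the k entries with the largest counts, largest first."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Replay the example from the slide.
table = SpaceSaving(capacity=6)
table.counts = {"frank": 15, "paul": 12, "jan": 8,
                "leo": 5, "conrad": 3, "stephan": 2}
table.update("paul")  # Case 1: paul 12 becomes 13
table.update("hugo")  # Case 2: stephan (2) is replaced by hugo (3)
print(table.top(3))
```

Counts in such a table are upper bounds: an item that replaced another one may be overestimated by the evicted count. A production version would also avoid the linear scan for the minimum.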

Heavy Hitters over Time-Window

Keep quite a big log (a month?)
Constant write/erase in database
Alternative: Exponential decay

[Figure labels: Time, DB]

Exponential Decay

Instead of a fixed window, use exponential timestamp decay
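[Formula not preserved in this transcript; its annotated terms: score, halftime]

The beauty: updates are recursive

[Formula not preserved in this transcript; its annotated term: time shift term]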
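
Since the slide's formulas did not survive extraction, the following is a sketch of one common formulation, not necessarily the one shown in the talk. It assumes the score is a sum of event weights, each halved for every `halftime` that has passed, and that timestamps arrive in non-decreasing order. The class name `DecayedCounter` is hypothetical.

```python
from typing import Optional


class DecayedCounter:
    """Exponentially decayed count with recursive updates (illustrative)."""

    def __init__(self, halftime: float) -> None:
        if halftime <= 0:
            raise ValueError("halftime must be positive")
        self.halftime = halftime
        self.score = 0.0
        self.last_time: Optional[float] = None

    def _decay_factor(self, t: float) -> float:
        # Factor by which the stored score shrinks between last_time and t.
        assert self.last_time is not None
        return 2.0 ** (-(t - self.last_time) / self.halftime)

    def update(self, t: float, weight: float = 1.0) -> float:
        """Recursive update: age the old score to time t, then add the event."""
        if self.last_time is not None:
            self.score *= self._decay_factor(t)
        self.score += weight
        self.last_time = t
        return self.score

    def value_at(self, t: float) -> float:
        """Score as seen at time t, without registering a new event."""
        if self.last_time is None:
            return 0.0
        return self.score * self._decay_factor(t)


counter = DecayedCounter(halftime=10.0)
counter.update(0.0)             # first event, score 1.0
counter.update(10.0)            # old score halved, plus the new event
print(counter.value_at(20.0))   # one more halftime later
```

Only the current score and the time of the last update are stored per item, so no event log needs to be kept or erased.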

Count-Min Sketches

Summarize histograms over large feature sets
Like Bloom filters, but better

[Figure: table of counters with m bins per row and one row for each of n different hash functions; example query result: 1]

Updates for new entry: increment one bin per hash function
Query: Take minimum over all hash functions

G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
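
A count-min sketch along these lines fits in a small class. This is a minimal sketch under stated assumptions: the class name `CountMinSketch` is made up here, and deriving the n hash functions by salting BLAKE2b with the row index is a choice made for this example, not something taken from the paper or the talk.

```python
import hashlib
from typing import List


class CountMinSketch:
    """n hash functions (rows) times m bins of counters (illustrative)."""

    def __init__(self, bins: int, hashes: int) -> None:
        if bins < 1 or hashes < 1:
            raise ValueError("bins and hashes must be positive")
        self.bins = bins
        self.hashes = hashes
        self.table: List[List[int]] = [[0] * bins for _ in range(hashes)]

    def _bin(self, row: int, item: str) -> int:
        # One hash function per row: BLAKE2b salted with the row index.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.bins

    def update(self, item: str, count: int = 1) -> None:
        """Update for a new entry: increment one bin in every row."""
        for row in range(self.hashes):
            self.table[row][self._bin(row, item)] += count

    def query(self, item: str) -> int:
        """Estimate the count: minimum over all hash functions."""
        return min(
            self.table[row][self._bin(row, item)]
            for row in range(self.hashes)
        )


sketch = CountMinSketch(bins=64, hashes=4)
for word in ["a"] * 5 + ["b"] * 2:
    sketch.update(word)
print(sketch.query("a"), sketch.query("b"))
```

Collisions can only add to a bin, so with non-negative updates the estimate never falls below the true count; taking the minimum over rows picks the least contaminated bin.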

What you can do with Stream Mining

Count things:
– user interactions
– items purchased
– pages viewed
– songs played
– …

Trends:
– most active pages
– most active users
– …

But there is more: Counting is Statistics

Empirical mean:

Correlations:

Principal Component Analysis:
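
To illustrate why counting is enough for such statistics, the sketch below keeps five running sums over a stream of (x, y) pairs and derives the empirical mean, covariance and correlation from them. The slide's own formulas are not preserved, so these are the textbook definitions (with 1/n normalization), and the class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Mean, covariance and correlation from running sums (illustrative)."""

    def __init__(self) -> None:
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_yy = 0.0
        self.sum_xy = 0.0

    def update(self, x: float, y: float) -> None:
        # Each observation only increments counters; no raw data is kept.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_yy += y * y
        self.sum_xy += x * y

    def mean_x(self) -> float:
        return self.sum_x / self.n

    def mean_y(self) -> float:
        return self.sum_y / self.n

    def covariance(self) -> float:
        return self.sum_xy / self.n - self.mean_x() * self.mean_y()

    def correlation(self) -> float:
        # Undefined (division by zero) if either variable is constant.
        var_x = self.sum_xx / self.n - self.mean_x() ** 2
        var_y = self.sum_yy / self.n - self.mean_y() ** 2
        return self.covariance() / math.sqrt(var_x * var_y)


stats = RunningStats()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    stats.update(x, y)
print(stats.mean_x(), stats.correlation())
```

The same idea extends to more dimensions: a covariance matrix built from pairwise sums is the input a principal component analysis needs. Plain sums can lose precision on long streams, so a numerically safer update rule may be preferable in practice.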

Profiling User Behavior

Statistics and Outlier Detection

When does behavior change?

Maximum-Likelihood

Estimate probabilistic models

based on

which is slightly biased, but simpler

But wait, how do I “1/n” with randomly spaced events?

Outlier detection

Once you have a model, you can compute p-values (based on recent time frames!)
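
As a concrete illustration, the sketch below computes a two-sided p-value under a Gaussian model. The choice of a Gaussian is an assumption made for this example; the slides do not say which model is used. The mean and standard deviation would come from statistics estimated over recent time frames, and the function name `gaussian_p_value` is hypothetical.

```python
import math


def gaussian_p_value(x: float, mean: float, std: float) -> float:
    """Two-sided p-value of observation x under N(mean, std^2).

    Assumption: the behavior being monitored is roughly Gaussian.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    z = abs(x - mean) / std
    # P(|Z| >= z) for a standard normal Z equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# An observation far from the recent mean gets a small p-value.
print(gaussian_p_value(x=130.0, mean=100.0, std=10.0))
```

An observation would then be flagged as an outlier when its p-value falls below a chosen threshold.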

Online TF-IDF

estimate word – document frequencies

● for each word: update(word, t, 1.0)
● for each document: update(“#docs”, t, 1.0)
● query: score(word) / score(“#docs”)
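
The three steps can be sketched on top of exponentially decayed scores. The names `update` and `score` follow the slide; everything else is an assumption for illustration: the class `DecayedScores`, the helper functions, the explicit time argument to `score` (so that both scores are decayed to the same moment), and counting each distinct word once per document.

```python
from typing import Dict, Iterable, Tuple


class DecayedScores:
    """Exponentially decayed score per key (illustrative)."""

    def __init__(self, halftime: float) -> None:
        self.halftime = halftime
        # key -> (score at last update, time of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def update(self, key: str, t: float, weight: float = 1.0) -> None:
        old, last = self._state.get(key, (0.0, t))
        aged = old * 2.0 ** (-(t - last) / self.halftime)
        self._state[key] = (aged + weight, t)

    def score(self, key: str, t: float) -> float:
        if key not in self._state:
            return 0.0
        old, last = self._state[key]
        return old * 2.0 ** (-(t - last) / self.halftime)


def observe_document(scores: DecayedScores, words: Iterable[str], t: float) -> None:
    # Assumption: each distinct word counts once per document.
    for word in set(words):
        scores.update(word, t, 1.0)
    scores.update("#docs", t, 1.0)


def document_frequency(scores: DecayedScores, word: str, t: float) -> float:
    """query: score(word) / score("#docs"), both decayed to time t."""
    docs = scores.score("#docs", t)
    return scores.score(word, t) / docs if docs > 0 else 0.0


scores = DecayedScores(halftime=10.0)
observe_document(scores, ["stream", "mining", "stream"], t=0.0)
observe_document(scores, ["stream"], t=0.0)
print(document_frequency(scores, "mining", t=10.0))
```

Dividing by the decayed document count rather than by a plain n is what keeps the ratio meaningful when events arrive at irregular intervals. An inverse document frequency can then be derived from this ratio, for example as its negative logarithm.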

Streamdrill
● Heavy Hitters counting + exponential decay
● Instant counts & top-k results over time windows.
● Indices!
● Snapshots for historical analysis
● Beta demo available at http://streamdrill.com, launch imminent

Architecture Overview

Example: Twitter Stock Analysis

http://play.streamdrill.com/vis/

Summary
● Doesn't always have to be scaling!
● Stream mining: Approximate results with finite resources.
● streamdrill: stream analysis engine
