Online Learning with Stream Mining

Mikio L. Braun, @mikiobraun http://blog.mikiobraun.de TWIMPACT http://twimpact.com

Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

Event Data
Finance Gaming Monitoring

Advertisment

Sensor Networks

Social Media

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

Online Learning

Isn't all learning online?
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

April 24. if http://leon.bottou.org/research/stochastic Machine Learning Meetup San Francisco. 2013 (c) 2013 by TWIMPACT . e.Isn't Machine Learning easily Online? ● Stochastic gradient descent converges.g.

com/site/advancedstreamingtutorial Machine Learning Meetup San Francisco. 2013 (c) 2013 by TWIMPACT . April 24. on these issues at http://sites.Online vs Batch: Non-stationarities Very good tutorial by Albert Bifet et al.google.

April 24. Learning rate You can't just do online learning on event data! Machine Learning Meetup San Francisco. 2013 (c) 2013 by TWIMPACT .Time horizons vs.

Event Data is huge ● The problem: You easily get A LOT OF DATA! – – – – – 100 events per second 360k events per hour 8. 2013 (c) 2013 by TWIMPACT .6M events per day 260M events per month 3. April 24.Also.2B events per year Machine Learning Meetup San Francisco.

online learning challenges: So much data! Concept Drift Online (as in “not batch”) is not the whole story.So. 2013 (c) 2013 by TWIMPACT . April 24. ● ● ● Machine Learning Meetup San Francisco.

Digging into Least Squares Idea: Batch method like least squares on recent portion of the data. April 24. this could be huge! d with entries d d x d is probably ok But: It's just a sum! Machine Learning Meetup San Francisco. 2013 (c) 2013 by TWIMPACT .

net http://www. 2013 (c) 2013 by TWIMPACT .Another Problem: High-dimensional Spaces ● Potentially large spaces: – – – distinct words: >100k IP addresses: >100M users in a social network: >10M http://wordle.com/photos/arenamontanus/269158554/ Machine Learning Meetup San Francisco. April 24.flickr.

Stream Mining to the rescue ● Stream mining algorithms: – answer “stream queries” with finite resources how often does an item appear in a stream? how many distinct elements are in the stream? what are the top-k most frequent items? Continuous Stream of Data ● Typical examples: – – – Bounded Resource Analyzer Stream Queries Machine Learning Meetup San Francisco. 2013 (c) 2013 by TWIMPACT . April 24.

net/acunu/realtime-analytics-with-apache-cassandra Machine Learning Meetup San Francisco.slideshare. 2013 (c) 2013 by TWIMPACT . April 24.The Trade-Off Big Data Stream Mining Map Reduce and friends Fast Exact First seen here: http://www.

even more. 2013 (c) 2013 by TWIMPACT . Abbadi. e. Internation Conference on Database Theory. IP addresses. Case 1: element already in data base 142 142 12 132 142 432 553 712 023 15 12 8 5 3 2 713 3 Case 2: new element 713 023 2 13 ● Fixed tables of counts Metwally. Agrawal.g.a. Efficient computation of Frequent and Top-k Elements in Data Streams. April 24. Top-k) ● Count activities over large item sets (millions. 2005 Machine Learning Meetup San Francisco. Twitter users) Interested in most active elements only.k.Heavy Hitters (a.

Algorithm 55(1): 58-75 (2005) . Cormode and S. LATIN 2004. April 24.Count-Min Sketches ● ● Summarize histograms over large feature sets Like bloom filters. Muthukrishnan. J. . 2013 (c) 2013 by TWIMPACT G. An improved data stream summary: The count-min sketch and its applications. but better m bins 0 1 0 2 0 1 5 4 3 0 3 5 0 2 2 0 0 0 1 0 2 3 3 2 0 5 7 0 0 2 3 8 n different hash functions Query result: 1 ● Updates for new entry Query: Take minimum over all hash functions Machine Learning Meetup San Francisco.

IEEE International Conference on Data Engineering . A Framework for Clustering Massive-Domain Data Streams. 2009 Machine Learning Meetup San Francisco. April 24. 2013 (c) 2013 by TWIMPACT .Clustering with count-min Sketches ● Online clustering – For each data point: ● ● Map to closest centroid (⇒ compute distances) Update centroid – count-min sketches to represent sum over all vectors in a class 0 1 0 2 0 1 5 4 3 0 3 5 0 2 2 0 0 0 1 0 2 3 3 2 0 5 7 0 0 2 3 8 Aggarwal.

2013 (c) 2013 by TWIMPACT .Heavy Hitters over Time-Window ● Time ● Keep quite a big log (a month?) Constant write/erase in database Alternative: Exponential decay ● DB Machine Learning Meetup San Francisco. April 24.

Exponential Decay ● Instead of a fixed window. use exponential timestamp decay score halftime ● The beauty: updates are recursive time shift term Machine Learning Meetup San Francisco. April 24. 2013 (c) 2013 by TWIMPACT .

timestamp. item.lastupdate = timestamp ● score(C.ts[item]) * C. C.counters[item] C. C.counters[item] Machine Learning Meetup San Francisco. item) – return score return weight(C. count) – update counts C.lastupdate.ts[item] = timestamp C.Exponential Decay ● Collect stats by a table of expdecay counters counters[item] ts[item] # counters # last timestamp ● update(C.counters[item] = count + weight(timestamp.ts[item]) * C. April 24. 2013 (c) 2013 by TWIMPACT .

Least Squares Revisited ● ● Need to compute For each – – do ● Then. reconstruct – As a reminder: Machine Learning Meetup San Francisco. 2013 (c) 2013 by TWIMPACT . April 24.

2013 (c) 2013 by TWIMPACT . April 24. but simpler ● But wait. how do I “1/n” with randomly spaced events? Machine Learning Meetup San Francisco.More: Maximum-Likelihood ● Estimate probabilistic models based on which is slightly biased.

Outlier detection ● Once you have a model. you can compute p-values (based on recent time frames!) Machine Learning Meetup San Francisco. April 24. 2013 (c) 2013 by TWIMPACT .

1. April 24. t. 2013 (c) 2013 by TWIMPACT .0) query: score(word) / score(“#docs”) Machine Learning Meetup San Francisco. 1.TF-IDF ● estimate word – document frequencies ● ● ● for each word: update(word.0) for each document: update(“#docs”. t.

April 24. 2013 (c) 2013 by TWIMPACT .Extracting a relevant subset Machine Learning Meetup San Francisco.

April 24. right? frequency of word in document Number of times word appears in class class priors Priors Multinomnial naïve Bayes Total number of words in class Machine Learning Meetup San Francisco. 2013 (c) 2013 by TWIMPACT .Classification with Naïve Bayes ● Naive Bayes is also just counting.

Classification with Naive Bayes ICML 2003 Machine Learning Meetup San Francisco. April 24. 2013 (c) 2013 by TWIMPACT .

+ 1) IDF-style normalization square length normalization use complement probability another log normalize those weights again Predict linearly using those weights Machine Learning Meetup San Francisco. April 24.Classification with Naive Bayes ● 7 Steps to improve NB: – – – – – – – transform TF to log( . 2013 (c) 2013 by TWIMPACT .

g. April 24. e. SVMs sum over all ● elements! Could still use streamdrill to extract a representative subset. Machine Learning Meetup San Francisco. no real accumulation of information in statistics. 2013 (c) 2013 by TWIMPACT .What about non-parametric methods and Kernel Methods? ● Problem here.

Indices! Snapshots for historical analysis Beta demo available at http://streamdrill. 2013 (c) 2013 by TWIMPACT . April 24.Streamdrill ● ● Heavy Hitters counting + exponential decay Instant counts & top-k results over time windows.com. launch imminent ● ● ● Machine Learning Meetup San Francisco.

2013 (c) 2013 by TWIMPACT .Architecture Overview Machine Learning Meetup San Francisco. April 24.

April 24. 2013 (c) 2013 by TWIMPACT ● .hour ● Update a word /1/update/plays/frank:123123:San+Francisco ● Another word (with timestamp) /1/query/plays?city=San+Francisco /1/query/plays?user=paul /1/update/plays/paul:145323:Berlin?ts=131341354135 ● Get most played songs for SF or Paul Get score for a word /1/query/score/hello Machine Learning Meetup San Francisco.REST API ● Create a trend /1/create/plays/user:song:location?size=1000? timescales=day.

April 24.Example: Twitter Stock Analysis http://play. 2013 (c) 2013 by TWIMPACT .com/vis/ Machine Learning Meetup San Francisco.streamdrill.

wsj. 2013 (c) 2013 by TWIMPACT .com/15fHaZW Machine Learning Meetup San Francisco.Example: Twitter Stock Analysis ● Trends: – – – – – – symbol:combinations symbol:hashtag symbol:keywords symbol:mentions symbol trend symbol:url $AAPL:$GOOG $AAPL:#trading $GOOG:disruption $GOOG:WallStreetCom $AAPL $FB:http://on. April 24.

Example: Twitter Stock Analysis Machine Learning Meetup San Francisco. April 24. 2013 (c) 2013 by TWIMPACT .

2013 (c) 2013 by TWIMPACT . April 24.Example: Twitter Stock Analysis Machine Learning Meetup San Francisco.

April 24. 2013 (c) 2013 by TWIMPACT .Example: Twitter Stock Analysis Twitter tweets JavaScript via REST Tweet Analyzer updates streamdrill Machine Learning Meetup San Francisco.

Summary ● ● Doesn't always have to be scaling! Stream mining: Approximate results with finite resources. April 24. 2013 (c) 2013 by TWIMPACT . streamdrill: stream analysis engine ● Machine Learning Meetup San Francisco.

Sign up to vote on this title
UsefulNot useful