
Clustering &

Retrieval:
A machine learning perspective
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2016 Emily Fox & Carlos Guestrin, Machine Learning Specialization
Part of a specialization

This course is part of the
Machine Learning Specialization

1. Foundations
2. Regression
3. Classification
4. Clustering & Retrieval
5. Recommender Systems
6. Capstone

What is the course about?

What is retrieval?
Search for related items

Data → ML Method (Nearest Neighbor) → Intelligence

- Input x, {x'}: features for the query point + features of all other datapoints
- Method: compute distances to all other points x'
- Output x^NN: "nearest" point (or set of points) to the query
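The pipeline above can be sketched as a brute-force search in a few lines of Python. This is a minimal illustration with hypothetical data, not the course's own implementation; Euclidean distance stands in for whichever distance measure is chosen later.

```python
import numpy as np

def nearest_neighbor(query, corpus):
    """Brute-force NN search: compute the distance from the query
    to every other datapoint and return the index of the closest."""
    dists = np.linalg.norm(corpus - query, axis=1)
    return int(np.argmin(dists))

# Toy 2-D feature vectors (hypothetical data)
corpus = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
query = np.array([0.9, 1.2])
print(nearest_neighbor(query, corpus))  # → 1 (the point [1.0, 1.0])
```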
Retrieve “nearest neighbor” article
Space of all articles,
organized by similarity of text

query article

nearest neighbor
Or set of nearest neighbors
Space of all articles,
organized by similarity of text

query article

set of nearest neighbors


Retrieval applications
Just about everything…
- Products
- Images
- Streaming content: songs, movies, TV shows, …
- Social networks (people you might want to connect with)
- News articles
What is clustering?
Discover groups of similar inputs

Data → ML Method (Clustering) → Intelligence

- Input {x}: features for the points in the dataset
- Method: separate points into disjoint sets
- Output {z}: cluster labels per datapoint
Case study:
Clustering documents by "topic"

Data → ML Method (Clustering) → Intelligence

Example clusters: SPORTS, WORLD NEWS, ENTERTAINMENT, SCIENCE
Just like retrieval, clustering has applications
almost everywhere

Clustering images
For search, group as:
- Ocean
- Pink flower
- Dog
- Sunset
- Clouds
- …

Or Coursera learners…
Discover groups of
learners for better
targeting of courses

Impact of retrieval & clustering

Impact of retrieval & clustering
•  Foundational ideas
•  Lots of information can be extracted using these tools
(exploring user interests and uncovering interpretable structure
that relates groups of users based on observed behavior)

Course overview

Course philosophy: Always use case studies & …

Core concept → Visual → Algorithm → Practical → Implement

Advanced topics (OPTIONAL)
Overview of content

Models                       Algorithms                   Core ML
Nearest neighbors            KD-trees                     Distance metrics
Clustering                   Locality sensitive hashing   Approximation algorithms
Mixture of Gaussians         k-means                      Unsupervised learning
Latent Dirichlet allocation  MapReduce                    Probabilistic modeling
                             Expectation Maximization     Data parallel problems
                             Gibbs sampling               Bayesian inference
Course outline

Overview of content

Models                                  Algorithms                               Core ML
Nearest neighbors (Module 1)            KD-trees (Module 1)                      Distance metrics (Module 1)
Clustering (Modules 2, 3)               Locality sensitive hashing (Module 1)    Approximation algorithms (Module 1)
Mixture of Gaussians (Module 3)         k-means (Module 2)                       Unsupervised learning (Module 2)
Latent Dirichlet allocation (Module 4)  MapReduce (Module 2)                     Probabilistic modeling (Modules 2, 3, 4)
                                        Expectation Maximization (Module 3)      Data parallel problems (Module 2)
                                        Gibbs sampling (Module 4)                Bayesian inference (Module 4)
Module 1: Nearest neighbor search
Reading a doc and want to find related docs

Module 1: Nearest neighbor search
Compute distances to all other documents and return the closest

Critical elements:
-  Doc representation
-  Distance measure

query article
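Both critical elements appear in even a tiny sketch: represent each doc as a bag-of-words count vector, then pick a distance measure. Cosine distance is one common choice because it ignores document length; the vocabulary and counts below are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity between two word-count vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical bag-of-words counts over a 3-word vocabulary
doc_query = np.array([3.0, 0.0, 1.0])
doc_other = np.array([6.0, 0.0, 2.0])  # same direction, twice as long
print(cosine_distance(doc_query, doc_other))  # ≈ 0.0: length is ignored
```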
Module 1: Nearest neighbor search
Efficient and approximate NN search:
- KD-trees
- LSH (locality sensitive hashing)

[Figure: 2-D space of #awesome vs. #awful counts, partitioned by Lines 1-3 into bins with indices [0 0 0], [0 1 0], [1 1 0], and [1 1 1]]
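The bin indices in the figure can be sketched with LSH's random-line trick: each line contributes one bit saying which side of the line a point falls on, and points sharing a bin become candidate near neighbors. The lines and documents below are hypothetical stand-ins, not the ones in the figure.

```python
import numpy as np

# Three random lines through the origin in the 2-D (#awesome, #awful) space
rng = np.random.default_rng(0)
lines = rng.normal(size=(3, 2))

def lsh_bin_index(x, lines):
    """One bit per line: which side of each line the point x falls on."""
    return tuple((lines @ x > 0).astype(int))

# Documents with similar counts tend to land in the same bin, so search
# can be restricted to candidates sharing the query's bin index
doc_a = np.array([4.0, 1.0])
doc_b = np.array([3.5, 1.2])
print(lsh_bin_index(doc_a, lines), lsh_bin_index(doc_b, lines))
```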

Module 2: k-means and MapReduce
Discover clusters of related documents

Cluster 1 Cluster 2

Cluster 3 Cluster 4
Module 2: k-means and MapReduce
k-means aims to minimize the sum of squared distances to cluster centers

- Makes hard assignments of data points to clusters
- Unsupervised learning task
Module 2: k-means and MapReduce
Map Phase → Shuffle Phase → Reduce Phase

- Split Big Data across multiple machines (M1, M2, …, M1000)
- Map: each machine emits (key, value) pairs, e.g. (k1', v1')
- Shuffle: assign tuple (ki, vi) to machine h[ki]
- Reduce: parallelize operations across machines
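The three phases above can be sketched as a single-process word count; this is only a toy model of the pattern (the "machines" are dictionary buckets, and `hash(k)` stands in for the routing function h[ki]), not a distributed implementation.

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (key, value) pair per word
    return [(word, 1) for word in doc.split()]

def shuffle_phase(pairs, n_machines=3):
    # Shuffle: route tuple (k, v) to "machine" h[k]
    machines = defaultdict(list)
    for k, v in pairs:
        machines[hash(k) % n_machines].append((k, v))
    return machines

def reduce_phase(machines):
    # Reduce: each machine aggregates the values for its keys
    counts = defaultdict(int)
    for pairs in machines.values():
        for k, v in pairs:
            counts[k] += v
    return dict(counts)

docs = ["big data big compute", "big ideas"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle_phase(pairs)))  # {'big': 3, 'data': 1, ...}
```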
Module 3: Mixture Models
Probabilistic clustering model: captures uncertainty in clustering

[Figure: Clusters 1-4]
Module 3: Mixture Models
Learn user topic preferences

[Figure: Clusters 1-4]
Module 3: Mixture Models
Assignments of docs to clusters based on location and shape, not just location

- Makes soft assignments of docs to each cluster

Module 3: Mixture Models
EM algorithm: Data → soft assignments
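The soft-assignment step of EM can be sketched for a 1-D mixture of Gaussians: each datapoint receives a responsibility from every cluster, proportional to cluster weight times Gaussian density. The two clusters and the data below are hypothetical, and this is only the E-step, not the full algorithm.

```python
import numpy as np

def soft_assignments(x, weights, means, variances):
    """E-step of EM for a 1-D mixture of Gaussians: the responsibility
    of each cluster for each datapoint (each row sums to 1)."""
    x = np.asarray(x, dtype=float)[:, None]
    # Unnormalized responsibility: cluster weight * Gaussian density
    dens = (weights * np.exp(-0.5 * (x - means) ** 2 / variances)
            / np.sqrt(2 * np.pi * variances))
    return dens / dens.sum(axis=1, keepdims=True)

r = soft_assignments([0.0, 5.0, 2.5],
                     weights=np.array([0.5, 0.5]),
                     means=np.array([0.0, 5.0]),
                     variances=np.array([1.0, 1.0]))
print(np.round(r, 3))  # the point at 2.5 gets ~[0.5, 0.5]
```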

Module 4: Latent Dirichlet Allocation
Based on science-related words, maybe the doc is in cluster 4

[Figure: first page of an example article, "Modeling the Complex Dynamics and Changing Correlations of Epileptic Events" (Wulsin, Fox, and Litt), with a bar chart showing the cluster assignment]
Module 4: Latent Dirichlet Allocation
Or maybe cluster 2 (technology) is a better fit

[Figure: the same example article page, with a bar chart showing the alternative cluster assignment]
Module 4: Latent Dirichlet Allocation
Really, it's about science and technology

[Figure: the same example article page, with a bar chart of mixed cluster proportions]
Module 4: Latent Dirichlet Allocation
Each cluster/topic is defined by a probability distribution over words in the vocab

SCIENCE            TECH              SPORTS
experiment   0.1   develop    0.18   player  0.15
test         0.08  computer   0.09   score   0.07
discover     0.05  processor  0.032  team    0.06
hypothesize  0.03  user       0.027  goal    0.03
climate      0.01  internet   0.02   injury  0.01
…                  …                 …
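A table like this can be used directly: score a document under each topic's word distribution and pick the best fit. This is a naive sketch with a subset of the words above and an assumed floor probability for out-of-topic words; real LDA inference is far more involved.

```python
# Hypothetical topic-word probabilities, in the style of the table above
topics = {
    "SCIENCE": {"experiment": 0.10, "test": 0.08, "discover": 0.05},
    "TECH":    {"develop": 0.18, "computer": 0.09, "user": 0.027},
    "SPORTS":  {"player": 0.15, "score": 0.07, "team": 0.06},
}

def best_topic(doc_words, topics, floor=1e-6):
    """Score each topic by the product of its word probabilities over
    the document; words outside a topic get a small floor probability."""
    scores = {}
    for name, dist in topics.items():
        score = 1.0
        for w in doc_words:
            score *= dist.get(w, floor)
        scores[name] = score
    return max(scores, key=scores.get)

print(best_topic(["experiment", "test", "discover"], topics))  # → SCIENCE
```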

Topic vocab distributions:
- TOPIC 1: Word 1 ?, Word 2 ?, Word 3 ?, Word 4 ?, Word 5 ?, …
- TOPIC 2: Word 1 ?, Word 2 ?, Word 3 ?, Word 4 ?, Word 5 ?, …
- TOPIC 3: Word 1 ?, Word 2 ?, Word 3 ?, Word 4 ?, Word 5 ?, …

Document topic proportions: ? ? ? ?

Unsupervised learning task

[Figure: the example article page (arXiv:1402.6951v2 [stat.ML] 14 Jul 2014) with a bar chart of inferred topic proportions]
Assumed background

Courses 1, 2, & 3 in this ML Specialization
•  Course 1: Foundations
-  Overview of ML case studies
-  Black-box view of ML tasks
-  Programming & data
manipulation skills

•  Course 2: Regression
-  Data representation (input, output, features)
-  Basic statistical concepts: mean/variance
-  Basic ML concepts:
•  ML algorithm
•  Coordinate ascent
•  Overfitting
•  Regularization

•  Course 3: Classification
-  Distributions and conditional distributions
-  Maximum likelihood estimation
-  References to:
•  Linear classifier
•  Multiclass classification
•  Boosting
Math background
•  Basic linear algebra
- Vectors
- Matrices
- Matrix multiply
•  Basic probability
- Fundamental laws
- Distribution and
conditional distribution

Programming experience
•  Basic Python used
- Can be picked up along the way if you know another language

Reliance on GraphLab Create
•  SFrames will be used, though not required
-  open source project of Dato
(creators of GraphLab Create)
-  can use pandas and numpy instead
•  Assignments will:
1.  Use GraphLab Create to
explore high-level concepts
2.  Ask you to implement most
algorithms without GraphLab Create
•  Net result:
-  learn how to code methods in Python
Computing needs
•  Using your own computer:
-  Basic desktop or laptop
•  64-bit required if using SFrame
-  Access to internet
-  Ability to:
•  Install and run Python (and Numpy, GraphLab Create,…)
•  Store a few GB of data
•  Will also provide an alternative, pre-configured
machine in the cloud

Let’s get started!
