
Clustering &

Retrieval:
A machine learning perspective
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2016 Emily Fox & Carlos Guestrin, Machine Learning Specialization
Part of a specialization

This course is part of the
Machine Learning Specialization

1. Foundations
2. Regression
3. Classification
4. Clustering & Retrieval
5. Recommender Systems
6. Capstone

What is the course about?

What is retrieval?
Search for related items

Data → ML Method (Nearest Neighbor) → Intelligence

- Input x, {x'}: features for the query point + features of all other datapoints
- Method: compute distances to all other points x'
- Output x^NN: "nearest" point (or set of points) to the query
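The pipeline above can be sketched as a brute-force search in a few lines of Python. This is a minimal illustration with hypothetical data, not the course's own implementation; Euclidean distance stands in for whichever distance measure is chosen later.

```python
import numpy as np

def nearest_neighbor(query, corpus):
    """Brute-force NN search: compute the distance from the query
    to every other datapoint and return the index of the closest."""
    dists = np.linalg.norm(corpus - query, axis=1)
    return int(np.argmin(dists))

# Toy 2-D feature vectors (hypothetical data)
corpus = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
query = np.array([0.9, 1.2])
print(nearest_neighbor(query, corpus))  # → 1 (the point [1.0, 1.0])
```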
Retrieve “nearest neighbor” article
Space of all articles,
organized by similarity of text

query article

nearest neighbor
Or set of nearest neighbors
Space of all articles,
organized by similarity of text

query article

set of nearest neighbors


Retrieval applications
Just about everything…
- Products
- Images
- Streaming content: songs, movies, TV shows, …
- Social networks (people you might want to connect with)
- News articles
What is clustering?
Discover groups of similar inputs

Data → ML Method (Clustering) → Intelligence

- Input {x}: features for the points in the dataset
- Method: separate points into disjoint sets
- Output {z}: cluster labels per datapoint
Case study:
Clustering documents by "topic"

Data → ML Method (Clustering) → Intelligence

Example clusters: SPORTS, WORLD NEWS, ENTERTAINMENT, SCIENCE
Just like retrieval, clustering has applications
almost everywhere

Clustering images
For search, group as:
- Ocean
- Pink flower
- Dog
- Sunset
- Clouds
- …

Or Coursera learners…
Discover groups of
learners for better
targeting of courses

Impact of retrieval & clustering

Impact of retrieval & clustering
•  Foundational ideas
•  Lots of information can be extracted using these tools
(exploring user interests and uncovering interpretable structure
that relates groups of users based on observed behavior)

Course overview

Course philosophy: Always use case studies & …

Core concept → Visual → Algorithm → Practical → Implement

Advanced topics (OPTIONAL)
Overview of content

Models                       Algorithms                   Core ML
Nearest neighbors            KD-trees                     Distance metrics
Clustering                   Locality sensitive hashing   Approximation algorithms
Mixture of Gaussians         k-means                      Unsupervised learning
Latent Dirichlet allocation  MapReduce                    Probabilistic modeling
                             Expectation Maximization     Data parallel problems
                             Gibbs sampling               Bayesian inference
Course outline

Overview of content

Models                                  Algorithms                               Core ML
Nearest neighbors (Module 1)            KD-trees (Module 1)                      Distance metrics (Module 1)
Clustering (Modules 2, 3)               Locality sensitive hashing (Module 1)    Approximation algorithms (Module 1)
Mixture of Gaussians (Module 3)         k-means (Module 2)                       Unsupervised learning (Module 2)
Latent Dirichlet allocation (Module 4)  MapReduce (Module 2)                     Probabilistic modeling (Modules 2, 3, 4)
                                        Expectation Maximization (Module 3)      Data parallel problems (Module 2)
                                        Gibbs sampling (Module 4)                Bayesian inference (Module 4)
Module 1: Nearest neighbor search
Reading a doc and want to find related docs

Module 1: Nearest neighbor search
Compute distances to all other documents and return the closest

Critical elements:
-  Doc representation
-  Distance measure

query article
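Both critical elements appear in even a tiny sketch: represent each doc as a bag-of-words count vector, then pick a distance measure. Cosine distance is one common choice because it ignores document length; the vocabulary and counts below are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity between two word-count vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical bag-of-words counts over a 3-word vocabulary
doc_query = np.array([3.0, 0.0, 1.0])
doc_other = np.array([6.0, 0.0, 2.0])  # same direction, twice as long
print(cosine_distance(doc_query, doc_other))  # ≈ 0.0: length is ignored
```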
Module 1: Nearest neighbor search
Efficient and approximate NN search:
- KD-trees
- LSH (locality sensitive hashing)

[Figure: 2-D space of #awesome vs. #awful counts, partitioned by Lines 1-3 into bins with indices [0 0 0], [0 1 0], [1 1 0], and [1 1 1]]
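The bin indices in the figure can be sketched with LSH's random-line trick: each line contributes one bit saying which side of the line a point falls on, and points sharing a bin become candidate near neighbors. The lines and documents below are hypothetical stand-ins, not the ones in the figure.

```python
import numpy as np

# Three random lines through the origin in the 2-D (#awesome, #awful) space
rng = np.random.default_rng(0)
lines = rng.normal(size=(3, 2))

def lsh_bin_index(x, lines):
    """One bit per line: which side of each line the point x falls on."""
    return tuple((lines @ x > 0).astype(int))

# Documents with similar counts tend to land in the same bin, so search
# can be restricted to candidates sharing the query's bin index
doc_a = np.array([4.0, 1.0])
doc_b = np.array([3.5, 1.2])
print(lsh_bin_index(doc_a, lines), lsh_bin_index(doc_b, lines))
```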

Module 2: k-means and MapReduce
Discover clusters of related documents

Cluster 1 Cluster 2

Cluster 3 Cluster 4
Module 2: k-means and MapReduce
k-means aims to minimize the sum of squared distances to cluster centers

- Makes hard assignments of data points to clusters
- Unsupervised learning task
Module 2: k-means and MapReduce
Map Phase → Shuffle Phase → Reduce Phase

- Split Big Data across multiple machines (M1, M2, …, M1000)
- Map: each machine emits (key, value) pairs, e.g. (k1', v1')
- Shuffle: assign tuple (ki, vi) to machine h[ki]
- Reduce: parallelize operations across machines
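The three phases above can be sketched as a single-process word count; this is only a toy model of the pattern (the "machines" are dictionary buckets, and `hash(k)` stands in for the routing function h[ki]), not a distributed implementation.

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (key, value) pair per word
    return [(word, 1) for word in doc.split()]

def shuffle_phase(pairs, n_machines=3):
    # Shuffle: route tuple (k, v) to "machine" h[k]
    machines = defaultdict(list)
    for k, v in pairs:
        machines[hash(k) % n_machines].append((k, v))
    return machines

def reduce_phase(machines):
    # Reduce: each machine aggregates the values for its keys
    counts = defaultdict(int)
    for pairs in machines.values():
        for k, v in pairs:
            counts[k] += v
    return dict(counts)

docs = ["big data big compute", "big ideas"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle_phase(pairs)))  # {'big': 3, 'data': 1, ...}
```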
Module 3: Mixture Models
Probabilistic clustering model: captures uncertainty in clustering

[Figure: Clusters 1-4]
Module 3: Mixture Models
Learn user topic preferences

[Figure: Clusters 1-4]
Module 3: Mixture Models
Assignments of docs to clusters based on location and shape, not just location

- Makes soft assignments of docs to each cluster

Module 3: Mixture Models
EM algorithm: Data → soft assignments
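The soft-assignment step of EM can be sketched for a 1-D mixture of Gaussians: each datapoint receives a responsibility from every cluster, proportional to cluster weight times Gaussian density. The two clusters and the data below are hypothetical, and this is only the E-step, not the full algorithm.

```python
import numpy as np

def soft_assignments(x, weights, means, variances):
    """E-step of EM for a 1-D mixture of Gaussians: the responsibility
    of each cluster for each datapoint (each row sums to 1)."""
    x = np.asarray(x, dtype=float)[:, None]
    # Unnormalized responsibility: cluster weight * Gaussian density
    dens = (weights * np.exp(-0.5 * (x - means) ** 2 / variances)
            / np.sqrt(2 * np.pi * variances))
    return dens / dens.sum(axis=1, keepdims=True)

r = soft_assignments([0.0, 5.0, 2.5],
                     weights=np.array([0.5, 0.5]),
                     means=np.array([0.0, 5.0]),
                     variances=np.array([1.0, 1.0]))
print(np.round(r, 3))  # the point at 2.5 gets ~[0.5, 0.5]
```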

Module 4: Latent Dirichlet Allocation
Based on science-related words, maybe the doc is in cluster 4

[Figure: first page of an example article, "Modeling the Complex Dynamics and Changing Correlations of Epileptic Events" (Wulsin, Fox, and Litt), with a bar chart showing the cluster assignment]
Module 4: Latent Dirichlet Allocation
Or maybe cluster 2 (technology) is a better fit

[Figure: the same example article page, with a bar chart showing the alternative cluster assignment]
Module 4: Latent Dirichlet Allocation
Really, it's about science and technology

[Figure: the same example article page, with a bar chart of mixed cluster proportions]
Module 4: Latent Dirichlet Allocation
Each cluster/topic is defined by a probability distribution over words in the vocab

SCIENCE            TECH              SPORTS
experiment   0.1   develop    0.18   player  0.15
test         0.08  computer   0.09   score   0.07
discover     0.05  processor  0.032  team    0.06
hypothesize  0.03  user       0.027  goal    0.03
climate      0.01  internet   0.02   injury  0.01
…                  …                 …
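A table like this can be used directly: score a document under each topic's word distribution and pick the best fit. This is a naive sketch with a subset of the words above and an assumed floor probability for out-of-topic words; real LDA inference is far more involved.

```python
# Hypothetical topic-word probabilities, in the style of the table above
topics = {
    "SCIENCE": {"experiment": 0.10, "test": 0.08, "discover": 0.05},
    "TECH":    {"develop": 0.18, "computer": 0.09, "user": 0.027},
    "SPORTS":  {"player": 0.15, "score": 0.07, "team": 0.06},
}

def best_topic(doc_words, topics, floor=1e-6):
    """Score each topic by the product of its word probabilities over
    the document; words outside a topic get a small floor probability."""
    scores = {}
    for name, dist in topics.items():
        score = 1.0
        for w in doc_words:
            score *= dist.get(w, floor)
        scores[name] = score
    return max(scores, key=scores.get)

print(best_topic(["experiment", "test", "discover"], topics))  # → SCIENCE
```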

Topic vocab distributions:
- TOPIC 1: Word 1 ?, Word 2 ?, Word 3 ?, Word 4 ?, Word 5 ?, …
- TOPIC 2: Word 1 ?, Word 2 ?, Word 3 ?, Word 4 ?, Word 5 ?, …
- TOPIC 3: Word 1 ?, Word 2 ?, Word 3 ?, Word 4 ?, Word 5 ?, …

Document topic proportions: ? ? ? ?

Unsupervised learning task

[Figure: the example article page (arXiv:1402.6951v2 [stat.ML] 14 Jul 2014) with a bar chart of inferred topic proportions]
Assumed background

Courses 1, 2, & 3 in this ML Specialization
•  Course 1: Foundations
-  Overview of ML case studies
-  Black-box view of ML tasks
-  Programming & data
manipulation skills

•  Course 2: Regression
-  Data representation (input, output, features)
-  Basic statistical concepts: mean/variance
-  Basic ML concepts:
•  ML algorithm
•  Coordinate ascent
•  Overfitting
•  Regularization

•  Course 3: Classification
-  Distributions and conditional distributions
-  Maximum likelihood estimation
-  References to:
•  Linear classifier
•  Multiclass classification
•  Boosting
Math background
•  Basic linear algebra
- Vectors
- Matrices
- Matrix multiply
•  Basic probability
- Fundamental laws
- Distribution and
conditional distribution

Programming experience
•  Basic Python used
- Can be picked up along the way if you know another language

Reliance on GraphLab Create
•  SFrames will be used, though not required
-  open source project of Dato
(creators of GraphLab Create)
-  can use pandas and numpy instead
•  Assignments will:
1.  Use GraphLab Create to
explore high-level concepts
2.  Ask you to implement most
algorithms without GraphLab Create
•  Net result:
-  learn how to code methods in Python
Computing needs
•  Using your own computer:
-  Basic desktop or laptop
•  64-bit required if using SFrame
-  Access to internet
-  Ability to:
•  Install and run Python (and Numpy, GraphLab Create,…)
•  Store a few GB of data
•  Will also provide an alternative, pre-configured
machine in the cloud

Let’s get started!
