Clustering & Retrieval:
A machine learning perspective
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2016 Emily Fox & Carlos Guestrin, Machine Learning Specialization
Part of a specialization
1. Foundations
2. Regression
3. Classification
4. Clustering & Retrieval
5. Recommender Systems
6. Capstone
Nearest Neighbor Method (ML pipeline: Data → Method → Intelligence)
- Input x, {x'}: features for query point + features of all other datapoints
- Method: compute distances to all other x'
- Output xNN: "nearest" point, or set of points, to query
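As a concrete illustration of the pipeline above, a brute-force nearest neighbor scan can be sketched in a few lines of Python; the toy 2-D feature vectors here are invented, and the choice of distance metric is a topic the course returns to:

```python
import math

def nearest_neighbor(query, corpus):
    """Return the index of the corpus point closest to the query
    under Euclidean distance (brute-force scan over all datapoints)."""
    best_i, best_d = None, math.inf
    for i, x in enumerate(corpus):
        d = math.dist(query, x)  # Euclidean distance (Python 3.8+)
        if d < best_d:
            best_i, best_d = i, d
    return best_i

docs = [(0.0, 1.0), (2.0, 2.0), (5.0, 0.0)]  # toy feature vectors
print(nearest_neighbor((1.9, 2.1), docs))    # -> 1
```

The brute-force scan is O(n) per query; Module 1 is about doing better than this with KD-trees and locality sensitive hashing.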
Retrieve "nearest neighbor" article
[Figure: space of all articles, organized by similarity of text; a query article and its nearest neighbor]
Or set of nearest neighbors
[Figure: space of all articles, organized by similarity of text; a query article and its set of nearest neighbors]
Applications:
- Streaming content: songs, movies, TV shows, …
- Social networks (people you might want to connect with)
- News articles
Clustering Method (ML pipeline: Data → Method → Intelligence)
- Input {x}: features for points in dataset
- Method: separate points into disjoint sets
- Output {z}: cluster labels per datapoint
Case Study:
Clustering documents by "topic"
[Figure: documents grouped into topic clusters such as ENTERTAINMENT and SCIENCE]
Just like retrieval, clustering has applications almost everywhere
For each method, the course moves from visual concept to core algorithm to practical implementation, with advanced topics marked OPTIONAL.
Overview of content
- Models: nearest neighbors, mixture of Gaussians (Module 3)
- Algorithms: KD-trees, k-means (Module 2), Gibbs sampling
- Core ML: distance metrics, unsupervised learning, Bayesian inference
Critical elements:
- Doc representation
- Distance measure
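The two critical elements above can be illustrated with a minimal sketch, assuming a bag-of-words document representation and cosine distance; these are one common pairing among the several representations and distance measures the course considers:

```python
import math
from collections import Counter

def bag_of_words(text):
    """One simple doc representation: raw word counts."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

q  = bag_of_words("epilepsy seizure dynamics")
d1 = bag_of_words("seizure dynamics in epilepsy patients")
d2 = bag_of_words("movie reviews and tv shows")
print(cosine_distance(q, d1) < cosine_distance(q, d2))  # -> True
```

Note that cosine distance ignores document length, which is often what retrieval wants; Euclidean distance on raw counts would penalize long documents.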
Module 1: Nearest neighbor search
Efficient and approximate NN search:
- KD-trees
- LSH (locality sensitive hashing)
[Figure: LSH in the (#awesome, #awful) plane; three lines split the points into bins indexed by bit vectors such as [0 0 0], [0 1 0], [1 1 0], and [1 1 1]]
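The LSH idea in the figure can be sketched as follows: each random line contributes one bit (which side of the line the point falls on), and the bits form the bin index. The random-line construction below is illustrative, not the course's exact scheme:

```python
import random

def bin_index(point, lines):
    """Bit vector: for each line (w1, w2, b), which side the point falls on."""
    return tuple(int(w1 * point[0] + w2 * point[1] + b > 0)
                 for (w1, w2, b) in lines)

# Three random lines partition the (#awesome, #awful) plane into bins.
random.seed(0)
lines = [(random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(-1, 1))
         for _ in range(3)]

# Nearby points usually land in the same bin, so an approximate NN search
# only needs to scan that bin (and perhaps adjacent bins).
print(bin_index((4.0, 1.0), lines), bin_index((4.1, 1.1), lines))
```

The search is approximate because a true nearest neighbor can occasionally fall on the wrong side of a line; using several independent sets of lines reduces that risk.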
Module 2: k-means and MapReduce
- k-means aims to minimize the sum of squared distances to cluster centers
- Unsupervised learning task
- Big data: split the data, and the operations on it, across multiple machines
[Figure: MapReduce schematic; (key, value) pairs are hashed by h[key] and routed across machines M1 … M1000]
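The k-means objective above can be illustrated with a small sketch of Lloyd's algorithm in plain Python. The initialization here (first k points) is a naive choice made for determinism; the course discusses smarter initialization:

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternately assign each point to its nearest
    center, then move each center to the mean of its assigned points.
    Both steps can only decrease the sum of squared distances."""
    centers = list(points[:k])  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        centers = [tuple(sum(coord) / len(c) for coord in zip(*c))
                   if c else centers[j]          # keep empty clusters' centers
                   for j, c in enumerate(clusters)]
    return centers

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(sorted(kmeans(pts, 2)))  # centers near (1/3, 1/3) and (28/3, 28/3)
```

In the MapReduce version of this algorithm, the assignment step is the map (each machine assigns its shard of points) and the center update is the reduce (sum and count per cluster label).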
Module 3: Mixture Models
- Probabilistic clustering model captures uncertainty in clustering
[Figure: four overlapping clusters, Cluster 1–4]
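How a probabilistic clustering model captures uncertainty can be sketched with soft assignments (responsibilities) under a toy one-dimensional Gaussian mixture; the mixture parameters below are invented for illustration:

```python
import math

def gauss(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Soft assignment: posterior probability of each cluster given x.
    Each component is (weight, mean, stddev)."""
    scores = [w * gauss(x, mu, s) for (w, mu, s) in components]
    total = sum(scores)
    return [s / total for s in scores]

mix = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]  # two equally weighted clusters
print(responsibilities(0.0, mix))  # near a center: confident assignment
print(responsibilities(2.0, mix))  # halfway between: -> [0.5, 0.5]
```

Unlike k-means' hard labels, these fractional assignments express exactly the uncertainty the slide refers to: a point midway between clusters is genuinely ambiguous.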
Module 3: Mixture Models (cont.)
- Learn user topic preferences
[Figure: the same four clusters, Cluster 1–4]
Module 3: Mixture Models (cont.)
- Assignments of docs to clusters based on location and shape, not just location
- Example: a doc with science-related words is maybe assigned to cluster 4
[Figure: example document, the abstract of the epilepsy paper used throughout:]
"Patients with epilepsy can manifest short, sub-clinical epileptic “bursts” in addition to full-blown clinical seizures. We believe the relationship between these two classes of events—something not previously studied quantitatively—could yield important insights into the nature and intrinsic dynamics of seizures. A goal of our work is to parse these complex epileptic events into distinct dynamic regimes. A challenge posed by the intracranial EEG (iEEG) data we study is the fact that the number and placement of electrodes can vary between patients. We develop a Bayesian nonparametric Markov switching process that allows for (i) shared dynamic regimes between a variable number of channels, (ii) asynchronous regime-switching, and (iii) an unknown dictionary of dynamic regimes. We encode a sparse and changing set of dependencies between the channels using a Markov-switching Gaussian graphical model for the innovations process driving the channel dynamics and demonstrate the importance of this model in parsing and out-of-sample predictions of iEEG data. We show that our model produces intuitive state assignments that can help automate clinical analysis of seizures and enable the comparison of sub-clinical bursts and full clinical seizures."
Keywords: Bayesian nonparametric, EEG, factorial hidden Markov model, graphical model, time series
Module 4: Latent Dirichlet Allocation
[Figure: first page of the example paper "Modeling the Complex Dynamics and Changing Correlations of Epileptic Events" by Drausin F. Wulsin (Dept. of Bioengineering, University of Pennsylvania), Emily B. Fox (Dept. of Statistics, University of Washington), and Brian Litt (Depts. of Bioengineering and Neurology, University of Pennsylvania)]
Module 4: Latent Dirichlet Allocation (cont.)
- Each cluster/topic defined by probability of words in vocab
- Each doc described by its topic proportions
- Unsupervised learning task
[Figure: per-topic word probability tables (TOPIC 1, TOPIC 2, TOPIC 3, each listing Word 1 … Word 5 with unknown probabilities "?") shown alongside the example abstract and a bar chart of its topic proportions]
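LDA's generative story, topics as word distributions and documents as topic proportions, can be sketched with the standard library alone. The topics and vocabulary below are hypothetical stand-ins for the slide's "?" entries:

```python
import random

def sample_dirichlet(rng, alphas):
    """Dirichlet sample via normalized Gamma draws (stdlib only)."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_doc(rng, topic_word_probs, n_words, alpha=0.1):
    """LDA's generative process: draw the doc's topic proportions,
    then for each word draw a topic, then a word from that topic."""
    props = sample_dirichlet(rng, [alpha] * len(topic_word_probs))
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(props)), weights=props)[0]
        vocab = list(topic_word_probs[z].keys())
        w = rng.choices(vocab, weights=[topic_word_probs[z][v] for v in vocab])[0]
        words.append(w)
    return props, words

# Hypothetical topics: word -> probability (the slide's topic tables, filled in).
topics = [{"seizure": 0.6, "eeg": 0.4},   # a "science" topic
          {"movie": 0.7, "tv": 0.3}]      # an "entertainment" topic
rng = random.Random(1)
props, doc = generate_doc(rng, topics, 8)
print(props, doc)
```

Learning inverts this story: given only the documents, infer the topic word distributions and per-document proportions, e.g. with the Gibbs sampling covered in this module.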
Assumed background
• Course 2: Regression
- Data representation (input, output, features)
- Basic statistical concepts: mean/variance
- Basic ML concepts:
• ML algorithm
• Coordinate ascent
• Overfitting
• Regularization
• Course 3: Classification
- Distributions and conditional distributions
- Maximum likelihood estimation
- References to:
• Linear classifier
• Multiclass classification
• Boosting
Math background
• Basic linear algebra
- Vectors
- Matrices
- Matrix multiply
• Basic probability
- Fundamental laws
- Distribution and conditional distribution