Active Learning
Data Mining Guest Lecture, 2018
Georg Krempl
Algorithmic Data Analysis
Information & Computing Sciences Department
Utrecht University, The Netherlands
Lecture Outline
▶ Lecture’s Context
▶ Semi-Supervised Learning
▶ Active Learning
  ▶ Motivation
  ▶ Approaches
    ▶ Uncertainty Sampling
    ▶ Expected Error Reduction
    ▶ Ensemble / Query by Committee
    ▶ Probabilistic Active Learning
  ▶ Evaluation
Recommended Literature on Active and Semi-Supervised Learning
▶ E.g. [Settles, 2012], [Chapelle et al., 2006], and [Zhu, 2008]; see the bibliography
Context of Active Learning
within Data Mining / Machine Learning
[Figure: spectrum from supervised over semi-supervised (SSL) to unsupervised learning, each with an active and a passive variant]
Supervised (Passive) Classification: Recapitulation
▶ Historical data, e.g. previous customers’ records
  Credit scoring: will a new order be paid?
▶ Generate a training sample L with feature variables (e.g. order value X) and class label (e.g. default Y):

  amount X1   ...   default?
  2.2         ...   no
  1.7         ...   yes
  1.3         ...   yes
  3.1         ...   no
  ...         ...   ...

▶ Aim: a classifier function f : X → Y, e.g. using Bayes’ theorem:

$\Pr(Y \mid X) = \frac{\Pr(X \mid Y) \cdot \Pr(Y)}{\Pr(X)}$

[Figure: the resulting classifier function, shown as class-conditional densities for + and −]
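To make the recap concrete, here is a minimal Bayes-classifier sketch in Python. The Gaussian class-conditional model and the toy numbers are assumptions for illustration, not part of the lecture.

```python
# Minimal Bayes classifier sketch for the credit-scoring example.
# Assumption: Gaussian class-conditional densities Pr(X|Y); toy data.
import numpy as np
from scipy.stats import norm

X = np.array([2.2, 1.7, 1.3, 3.1])        # order value (feature)
y = np.array(["no", "yes", "yes", "no"])  # default? (class label)

def fit_bayes(X, y):
    """Estimate Pr(Y) and a Gaussian Pr(X|Y) for each class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(), Xc.std(ddof=1))
    return model

def predict(model, x):
    """argmax_y Pr(X=x|Y=y) * Pr(Y=y); Pr(X) cancels in the argmax."""
    return max(model, key=lambda c: model[c][0] * norm.pdf(x, model[c][1], model[c][2]))

model = fit_bayes(X, y)
print(predict(model, 1.5))  # a small order value yields "yes" (default) here
```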
Supervised Classification: Limitations¹

¹ See [Krempl et al., 2014b] in SIGKDD Explorations.
Label Paucity
▶ Completeness of information is not met: only some labels are available
▶ Semi-Supervised Classification

[Figure: feature space with only a few labelled instances (+/−) among many unlabelled ones]
Semi-Supervised Learning

Basic situation
▶ Not all data are labelled

Objective
▶ Use unlabelled data to improve the classification model

Assumptions
Basic assumption: there is some structure in the data
▶ Smoothness assumption
  ▶ Neighbouring instances are similar w.r.t. the class label, i.e. the closer two instances are, the more likely they share a class label
  ▶ The decision boundary is placed in low-density regions, so we need to find those low-density regions
▶ Cluster assumption
  ▶ Instances form clusters (subpopulations)
  ▶ Within one cluster, a similar label is more likely
  ▶ In the extreme case, the label Y is conditionally independent of X given the cluster Z
▶ Manifold assumption
  ▶ Data lie on a manifold of much lower dimension than the feature space
  ▶ Learning the manifold eases classification (a remedy against the curse of dimensionality)

Methods
▶ Self-Learning, Co-Learning (a self-training sketch follows below)
▶ Generative Models
▶ Low-density Separation
▶ Graph-Based Methods
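As a concrete illustration of the self-learning method above, here is a minimal self-training sketch. The classifier choice, the confidence threshold, and the stopping rule are illustrative assumptions, not the lecture’s prescription.

```python
# Self-training sketch: repeatedly adopt the most confident pseudo-labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    clf = LogisticRegression()
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)              # confidence of the predicted class
        sure = conf >= threshold              # adopt only confident pseudo-labels
        if not sure.any():                    # nothing confident left: stop
            break
        pseudo = clf.classes_[proba[sure].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[sure]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~sure]
    return clf
```

Note the connection to the smoothness assumption: if it is violated, confident but wrong pseudo-labels can reinforce a wrong hypothesis.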
Active Learning: Motivation⁴

Motivation
▶ Some labels are expensive to obtain

Selection is important
▶ Even (or in particular!) in “Big Data” applications
▶ Efficient allocation of limited resources
▶ Sample where we expect something interesting

⁴ See, e.g., [Settles, 2012, Fu et al., 2012, Zliobaitė et al., 2013].
Active Learning: Context & History
History
▶ Optimal experimental design [Fedorov, 1972]
▶ Learning with queries / query synthesis [Angluin, 1988]
▶ Selective sampling [Cohn et al., 1990]
Selective Data Acquisition Tasks⁵

⁵ Own categorization, inspired by [Attenberg et al., 2011, Saar-Tsechansky et al., 2009, Settles, 2009].
Active Learning Strategies: Motivating Example
Example
▶ Uniformly distributed instances

[Figure: unlabelled instances spread uniformly over the feature space (x1, x2)]
Active Learning Strategies: Motivating Example
[Figure: two labelled instances in the feature space (x1, x2)]

Based on these two labels, what are the possible true label distributions?

[Figure: several candidate label distributions, all consistent with the two labels, “and many others ...”]
Active Learning Strategies: Motivating Example
Problem
▶ Many distributions could have generated the observed labels

[Figure: possible hypotheses about the decision boundary]

Core ideas:
▶ Exploration of the feature space! (exploration vs. exploitation tradeoff)
▶ Furthermore, consider density: sample in high-density areas rather than focusing on potentially irrelevant outliers
Active Learning Strategies: Motivating Example (2)
Example
▶ Multimodally distributed instances

[Figure: instances forming several clusters in the feature space (x1, x2)]
Active Learning Approaches
Uncertainty-Based Strategy
Idea
Select those instances where we are least certain about the label

Approach:
▶ 3 labels preselected
▶ Linear classifier
▶ Use the distance to the decision boundary as uncertainty measure

[Figure: linear decision boundary in the feature space; the most uncertain candidates lie close to the boundary]
Uncertainty Measures
Three families of measures have been proposed (all three are sketched in code below):

Margin
▶ Uncertainty is inversely proportional to the distance to the decision boundary
▶ the closer to the boundary, the less certain

Confidence (Posterior)
▶ Difference of the posteriors of the different classes: $|P(y = + \mid x) - P(y = - \mid x)|$
▶ the higher the difference, the more certain

Entropy
▶ $-\sum_{y \in \{+,-\}} p(y \mid x) \log p(y \mid x)$
▶ the higher the entropy, the less certain
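A minimal sketch of the three measures for the binary case; the function names are mine, and `posterior` denotes the classifier’s estimate of p(y = + | x):

```python
import numpy as np

def confidence_uncertainty(posterior):
    """1 - |p(+|x) - p(-|x)|: maximal when both posteriors are equal."""
    return 1.0 - abs(posterior - (1.0 - posterior))

def entropy_uncertainty(posterior):
    """-sum_y p(y|x) log p(y|x): maximal at p = 0.5."""
    p = np.clip(np.array([posterior, 1.0 - posterior]), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def margin_uncertainty(distance_to_boundary):
    """Inversely proportional to the distance to the decision boundary."""
    return 1.0 / (1.0 + abs(distance_to_boundary))

# All three rank an instance on the boundary (p = 0.5) as most uncertain:
print(confidence_uncertainty(0.5), entropy_uncertainty(0.5), margin_uncertainty(0.0))
```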
[Figure: labels plotted by their observed label distribution p̂ (uniform vs. non-uniform) against the number of labels n (low vs. high), quadrants I–IV]

▶ A label’s value depends on the label information in its neighbourhood
▶ Label information:
  ▶ number of labels
  ▶ share of classes
▶ Uncertainty sampling ignores the number of similar labels
Uncertainty Measures: Discussion
[Figure: the same quadrant plot (observed label distribution p̂ vs. number of labels n), indicating where each measure requests labels]

Margin
▶ Requests instances close to the decision boundary, regardless of the number of labels already there

Confidence
▶ Requests instances with low posterior difference
▶ Results in requests close to the decision boundary
▶ Again ignores the number of already requested labels

Entropy
▶ As above.

▶ All those approaches ignore the number of previously requested labels!
Error-Reduction Strategy
Observation
Valuable are those instances that help us to improve.

Idea
Select the instances that reduce the error the most

Observation
▶ Given the data on the left, generated by two Gaussian clusters: where will we need many labels?

[Figure: two Gaussian clusters in the feature space (x1, x2)]

Questions
1. How to compute the outcome weights wŷ?
2. How to compute the new error rate eŷ? Specifically: on which data?
Error-Reduction Strategy
Outcome-Weights Computation
1. Average: equal weights wŷ = 0.5 for all ŷ
2. Most likely: weight the most likely outcome (based on the current classifier’s prediction) with one, the other outcome with zero
3. Posterior-based: use the current classifier’s posterior estimates as weights: wŷ = p̂(y = ŷ | x)

Error-Rate Computation
▶ Use the current classifier to label the unlabelled instances (similar to self-training)
▶ Good estimate of the generalisation capability
▶ But: might again lead to getting stuck in a wrong hypothesis

(A sketch combining both steps follows below.)
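A hedged sketch of posterior-weighted expected error reduction, assuming a scikit-learn-style classifier with `fit`/`predict_proba` that is already fitted on (X_lab, y_lab); the self-labelled pool serves as the evaluation set, as described above.

```python
import numpy as np
from sklearn.base import clone

def expected_error(clf, X_lab, y_lab, X_pool, candidate_idx):
    """Posterior-weighted error after hypothetically labelling one candidate."""
    x = X_pool[candidate_idx:candidate_idx + 1]
    weights = clf.predict_proba(x)[0]        # w_yhat = p(y = yhat | x)
    pseudo = clf.predict(X_pool)             # self-labelled evaluation set
    err = 0.0
    for w, y_hat in zip(weights, clf.classes_):
        if w == 0.0:
            continue
        clf2 = clone(clf).fit(np.vstack([X_lab, x]), np.append(y_lab, y_hat))
        err += w * np.mean(clf2.predict(X_pool) != pseudo)   # w_yhat * e_yhat
    return err

def select_query(clf, X_lab, y_lab, X_pool):
    """Query the candidate with the smallest expected error."""
    errs = [expected_error(clf, X_lab, y_lab, X_pool, i) for i in range(len(X_pool))]
    return int(np.argmin(errs))
```

The nested retraining per candidate and per outcome is what makes this strategy computationally expensive, as noted in the discussion below.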
Ensemble-Based Strategy
Idea
Use the disagreement between base classifiers

[Figure: two base classifiers (Classifier 1, Classifier 2) with different decision boundaries; instances in their region of disagreement are the query candidates]

Approach
1. Get an initial set of labels
2. Split that set into (overlapping) subsets
3. On each subset, train a different base classifier
4. Repeat:
5.   For each unlabelled instance:
6.     Apply all base classifiers
7.     Request the label if the base classifiers disagree
8.   Update all base classifiers
9. Until convergence
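A minimal sketch of the disagreement test in steps 6–7, using vote entropy as one common disagreement measure; the slide’s pseudo-code leaves the concrete measure open.

```python
import numpy as np

def vote_entropy(committee, x):
    """Entropy of the committee's vote distribution on instance x."""
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in committee]
    _, counts = np.unique(votes, return_counts=True)
    freqs = counts / len(committee)
    return float(-(freqs * np.log(freqs)).sum())

def most_disputed(committee, X_pool):
    """Return the pool index the committee disagrees on most."""
    return int(np.argmax([vote_entropy(committee, x) for x in X_pool]))
```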
Probabilistic Active Learning: Illustrative Example
[Figure: a few labelled instances (+/−) and one candidate instance (?)]
Principles of Probabilistic Active Learning⁷

[Figure: the labelled instances (+/−) and the candidate instance (?) from the example above]

⁷ See [Krempl et al., 2014a].
Principles of Probabilistic Active Learning
▶ The candidate’s label y is + with probability p and − with probability (1 − p):

$y \sim \begin{cases} + & \text{with probability } p \\ - & \text{with probability } 1 - p \end{cases}$   (1)

▶ These labels are the realisation of n Bernoulli trials with success probability p (cf. urn model):

$n_+ \sim \mathrm{Binomial}_{n,p} = \mathrm{Bin}_{n,p}$   (2)

▶ The likelihood of p given the label statistics ls, normalised under a uniform prior g(p), yields a Beta distribution:

$L(p \mid ls) = \mathrm{Bin}_{n_+ + n_-,\,p}(n_+) = \frac{\Gamma(n_+ + n_- + 1)}{\Gamma(n_+ + 1) \cdot \Gamma(n_- + 1)} \cdot p^{n_+} \cdot (1 - p)^{n_-}$   (3)

$\omega_{ls}(p) = \frac{L(p \mid ls)\, g(p)}{\int_0^1 L(\psi \mid ls)\, g(\psi)\, d\psi} = (n_+ + n_- + 1) \cdot L(p \mid ls) = \mathrm{Beta}_{\alpha,\beta}(p)$   (4)
Probabilistic Active Learning – Excursus: Beta Prior
▶ Uniform prior: prior to the first label’s arrival, all values of p are equally likely
▶ Plot: normalised likelihoods for different values of α, β, i.e. no labels (n = 0), few labels (n = 2, p̂ = 0.5), and more labels (n = 3, p̂ = 2/3)
▶ The peak of this function becomes more distinct as more labels are obtained.

[Figure: the three normalised likelihoods (Beta densities) over the true posterior p]
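Eq. (4) can be checked numerically: with a uniform prior, the normalised likelihood for n₊ positive and n₋ negative labels is the Beta(n₊ + 1, n₋ + 1) density. A short sketch reproducing the three cases from the plot:

```python
# Numerical check of Eq. (4): omega_ls(p) = Beta(n+ + 1, n- + 1).
from scipy.stats import beta

cases = {"no labels (n=0)":           (0, 0),
         "few labels (n=2, p^=1/2)":  (1, 1),
         "more labels (n=3, p^=2/3)": (2, 1)}

for name, (n_pos, n_neg) in cases.items():
    w = beta(n_pos + 1, n_neg + 1)   # the normalised likelihood of p
    mode = (n_pos / (n_pos + n_neg)) if n_pos + n_neg else "n/a (flat)"
    print(name, "| mode at p =", mode, "| density at p=0.5:", round(w.pdf(0.5), 3))
```

The mode sits at p̂ = n₊/n, and the density’s peak sharpens with every additional label, exactly as the plot describes.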
Probabilistic Gain
Probabilistic Gain – GOPAL
[Figure: probabilistic gain as a function of the posterior and the number of labels n; uncertainty sampling’s score, in contrast, is constant w.r.t. n]
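The probabilistic gain itself is not spelled out on this slide. As a rough, simplified sketch of the idea in [Krempl et al., 2014a], one can average the accuracy gain of one additional label over both the Beta-weighted true posterior p (Eq. 4) and the label outcome; the performance measure and tie-handling here are my simplifications, not the paper’s exact formulation.

```python
import numpy as np
from scipy.stats import beta

def accuracy(p, p_hat):
    """Accuracy at true posterior p when predicting the majority class;
    ties (p_hat = 0.5) arbitrarily predict '-'. A simplification."""
    return p if p_hat > 0.5 else 1.0 - p

def probabilistic_gain(n_pos, n_neg, grid=1000):
    n = n_pos + n_neg
    ps = np.linspace(1e-6, 1 - 1e-6, grid)
    w = beta.pdf(ps, n_pos + 1, n_neg + 1)      # omega_ls(p), Eq. (4)
    p_hat_now = n_pos / n if n else 0.5
    gains = []
    for p in ps:
        now = accuracy(p, p_hat_now)
        # one more label is + with probability p, - with probability 1 - p
        after = (p * accuracy(p, (n_pos + 1) / (n + 1))
                 + (1 - p) * accuracy(p, n_pos / (n + 1)))
        gains.append(after - now)
    return float(np.trapz(w * np.array(gains), ps))

print(probabilistic_gain(1, 1))    # few, balanced labels: noticeable gain
print(probabilistic_gain(10, 10))  # many labels: little left to gain
```

Unlike uncertainty sampling, this score shrinks as n grows, which is the point of considering the label statistics.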
Active Learning Strategies - Discussion
▶ Uncertainty-Based:
  fast and simple, but wastes labels in regions with a high Bayesian error rate
▶ Error-Reduction:
  good quality, but high computational cost
▶ Ensemble-Based:
  good quality; convergence avoids unnecessary sampling in regions with a high Bayesian error rate; conceptually easy to understand; multiple classifiers needed
▶ Probabilistic Approach:
  good quality; considering label statistics avoids unnecessary sampling in regions with a high Bayesian error rate
Evaluating Active Learning Approaches
Learning Curves
▶ Plot performance against the number of sampled instances

[Figure: learning curves of two error-reduction strategies (ER 1, ER 2) and random sampling; performance on the y-axis]
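A minimal sketch of producing such learning curves, comparing random with uncertainty sampling; the dataset, classifier, and budget are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
X_pool, y_pool = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

def learning_curve(strategy, budget=60, seed=1):
    rng = np.random.default_rng(seed)
    # start with one labelled instance per class so fitting is possible
    labelled = [int(np.where(y_pool == c)[0][0]) for c in np.unique(y_pool)]
    scores = []
    for _ in range(budget):
        clf = LogisticRegression().fit(X_pool[labelled], y_pool[labelled])
        scores.append(clf.score(X_test, y_test))
        unlabelled = [i for i in range(len(X_pool)) if i not in labelled]
        if strategy == "random":
            labelled.append(int(rng.choice(unlabelled)))
        else:  # uncertainty sampling: query the instance closest to p = 0.5
            proba = clf.predict_proba(X_pool[unlabelled])[:, 1]
            labelled.append(unlabelled[int(np.argmin(np.abs(proba - 0.5)))])
    return scores

curve_rand = learning_curve("random")
curve_unc = learning_curve("uncertainty")
```

Plotting both score lists against the number of labels gives curves like the ones sketched above.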
Evaluation Example: Multi-Class Probabilistic AL (McPAL), Selection⁸

[Figure: six panels of learning curves, error vs. number of sampled instances (0–60), comparing McPAL with competitors including random sampling (Rand) and VoI]

Learning curves of mean misclassification cost (including standard deviation as error bars) of McPAL and its competitors on all six datasets using the pKNN classifier.

⁸ For more results, please consult the ECAI paper [Kottke et al., 2016].
Summary and Questions
Summary
▶ Motivation & aim: not all labels are available
▶ General idea: use the structure of the data!
▶ Semi-Supervised Learning:
  ▶ Assumptions: smoothness, clusters, or manifold
  ▶ Approaches: Self-Learning, Generative Models, Low-Density Separation, Graph-Based Models
▶ Active Learning:
  ▶ Scenarios: query synthesis, pool, stream
  ▶ Approaches: Uncertainty Sampling, Expected Error Reduction, Query by Committee, Probabilistic Active Learning
  ▶ Evaluation: learning curves

Questions?
Bibliography

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.

Attenberg, J., Melville, P., Provost, F., and Saar-Tsechansky, M. (2011). Selective data acquisition for machine learning. In Krishnapuram, B., Yu, S., and Rao, R. B., editors, Cost-Sensitive Machine Learning. CRC Press, Boca Raton, FL, USA, 1st edition.

Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press.

Cohn, D., Atlas, L., Ladner, R., El-Sharkawi, M., Marks, R., Aggoune, M., and Park, D. (1990). Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press.

Fu, Y., Zhu, X., and Li, B. (2012). A survey on instance selection for active learning. Knowledge and Information Systems, 35(2):249–283.

Kottke, D., Krempl, G., Lang, D., Teschner, J., and Spiliopoulou, M. (2016). Multi-class probabilistic active learning. In Fox, M., Kaminka, G., Hüllermeier, E., and Bouquet, P., editors, Proc. of the 22nd Europ. Conf. on Artificial Intelligence (ECAI 2016), Frontiers in Artificial Intelligence and Applications. IOS Press.

Krempl, G., Kottke, D., and Spiliopoulou, M. (2014a). Probabilistic active learning: Towards combining versatility, optimality and efficiency. In Dzeroski, S., Panov, P., Kocev, D., and Todorovski, L., editors, Proceedings of the 17th Int. Conf. on Discovery Science (DS), Bled, volume 8777 of Lecture Notes in Computer Science, pages 168–179. Springer.

Krempl, G., Zliobaitė, I., Brzeziński, D., Hüllermeier, E., Last, M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou, M., and Stefanowski, J. (2014b). Open challenges for data stream mining research. SIGKDD Explorations, 16(1):1–10. Special Issue on Big Data.

Saar-Tsechansky, M., Melville, P., and Provost, F. (2009). Active feature-value acquisition. Management Science, 55(4):664–684.

Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, Madison, Wisconsin, USA.

Settles, B. (2012). Active Learning. Number 18 in Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers.

Zhu, X. (2008). Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin.

Zliobaitė, I., Bifet, A., Pfahringer, B., and Holmes, G. (2013). Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems, 25(1):27–39.
SSL: Approaches
▶ Self-Learning, Co-Learning
▶ Generative Models
▶ Low-density Separation
▶ Graph-Based Methods
Generative Models
▶ Assumption:
  P(X, Y) = P(Y) · P(X|Y), with P(X|Y) being an identifiable mixture distribution
▶ Objective:
  Identify the generative model that produces the data
▶ Basic idea:
  Learn the distribution P(X|Y) of each class y ∈ Y; this requires some assumptions about the generative model
▶ Probabilistic approach:
  Find the likelihood-maximising parameter set Θ, e.g. using expectation-maximisation
▶ Algorithmic approach:
  Cluster-and-Label (a sketch follows below)
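A minimal Cluster-and-Label sketch, assuming a Gaussian mixture as the generative model and that each mixture component corresponds to one class; the function name and parameters are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture

def cluster_and_label(X_lab, y_lab, X_unlab, n_components=2, seed=0):
    # Fit the mixture on ALL instances: the unlabelled data shape the clusters.
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(np.vstack([X_lab, X_unlab]))
    z_lab = gmm.predict(X_lab)              # cluster of each labelled point
    # Majority label per cluster (None if a cluster has no labelled member).
    cluster_label = {}
    for z in range(n_components):
        members = y_lab[z_lab == z]
        cluster_label[z] = (Counter(members.tolist()).most_common(1)[0][0]
                            if len(members) else None)
    return np.array([cluster_label[z] for z in gmm.predict(X_unlab)])
```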
Generative Models - Example
[Figure: generative-model example⁹]

⁹ Illustrations from Zhu 2005 [Zhu, 2008].
Generative Models - Example (2)
[Figure: second generative-model example¹⁰]

¹⁰ Illustrations from Zhu 2005 [Zhu, 2008].
Generative Models - Challenges
▶ Identifiability
  It may happen that the solution is not unique, even if the type of the involved distributions is known.
  Example from Zhu 2005:
  ▶ Given P(X) ∼ U(0, 1) and two labelled instances (x1 = 0.1, y1 = +), (x2 = 0.9, y2 = −)
  ▶ We know that P(X|Y = +) ∼ U(0, τ) and P(X|Y = −) ∼ U(τ, 1), but we do not know the value of τ
  ▶ Different values of τ result in competing models: we cannot identify the true model given the two instances
▶ Model correctness
  ▶ If the assumptions in the model are met, using unlabelled data increases the quality of the prediction model
  ▶ But: otherwise, the model can be worse than a model based solely on labelled data
▶ Model fitting
  The fitted parameter set might not be the global optimum
Low-Density Separation
▶ Assumption:
  Classes are separated by low-density regions
▶ Objective:
  Placement of the decision boundary in regions of low density
▶ Approach:
  Transductive Support Vector Machines (TSVMs)
  ▶ Conventional SVM: maximize the margin between labelled instances
  ▶ Transductive SVM: maximize the margin between all instances (i.e. labelled and unlabelled ones), as written out below
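For reference, the TSVM objective as it is commonly written (the notation here is mine, following the standard soft-margin formulation): unlabelled instances get self-assigned labels ŷ_j and their own slack variables, so the margin is maximized over labelled and unlabelled points alike.

```latex
\min_{w,\,b,\,\hat{y},\,\xi,\,\xi^{*}} \;
  \frac{1}{2}\lVert w\rVert^{2}
  + C \sum_{i\ \text{labelled}} \xi_i
  + C^{*} \sum_{j\ \text{unlabelled}} \xi_j^{*}
\quad\text{s.t.}\quad
  y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad
  \hat{y}_j\,(w^{\top}x_j + b) \ge 1 - \xi_j^{*},\qquad
  \xi_i,\ \xi_j^{*} \ge 0.
```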
Graph-Based Methods
▶ Assumption:
  Manifold assumption, smoothness over the graph
▶ Approach:
  ▶ Represent the data by a graph
  ▶ Vertices/nodes are the labelled and unlabelled instances
  ▶ Edges represent similarities (e.g. connect an instance to its k nearest neighbours, or to all neighbours within some ε-neighbourhood)
  ▶ Estimate a function f on the graph that is close to the given labels and smooth over the whole graph
  ▶ Use f to propagate labels to neighbouring instances in the graph (a sketch follows below)
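A minimal iterative label-propagation sketch on a weighted kNN graph; the Gaussian similarity, the choice of k, and the iteration count are illustrative assumptions, and n must exceed k.

```python
import numpy as np

def label_propagation(X, y, k=5, n_iter=100):
    """y holds +1/-1 for labelled instances and 0 for unlabelled ones."""
    # Gaussian similarity between all pairs of instances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-d**2)
    np.fill_diagonal(W, 0.0)
    # keep only the k strongest edges per node, then symmetrise
    for i in range(len(X)):
        W[i, np.argsort(W[i])[:-k]] = 0.0
    W = np.maximum(W, W.T)
    f = y.astype(float)
    labelled = y != 0
    for _ in range(n_iter):
        f = W @ f / np.maximum(W.sum(axis=1), 1e-12)  # smooth f over the graph
        f[labelled] = y[labelled]                      # clamp the given labels
    return np.sign(f)   # 0 remains for instances no label ever reached
```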
Evaluation of Influence Factors
▶ Original McPAL compared to variants:
  ▶ ignoring reliability: normalising the label-count vector k⃗ such that n = 1
  ▶ ignoring posterior: using the uniform frequency estimate ki = n/C, 1 ≤ i ≤ C
  ▶ ignoring density: setting it to a constant value

[Figure: six panels of learning curves, error vs. number of sampled instances (0–60), comparing McPAL with the three ablated variants (w/o reliability, w/o posterior, w/o density) on six datasets including ecoli, glass, and iris, using the pKNN classifier]

Learning curves in mean misclassification cost (including standard deviation as error bars).