Active Learning
Data Mining Guest Lecture, 2018
Georg Krempl
Algorithmic Data Analysis
Information & Computing Sciences Department
Utrecht University, The Netherlands
Lecture Outline
▶ Lecture’s Context
▶ Semi-Supervised Learning
▶ Active Learning
  ▶ Motivation
  ▶ Approaches
    ▶ Uncertainty Sampling
    ▶ Expected Error Reduction
    ▶ Ensemble / Query by Committee
    ▶ Probabilistic Active Learning
  ▶ Evaluation
Recommended Literature on Active and Semi-Supervised Learning
▶ E.g. [Settles, 2012], [Chapelle et al., 2006], and [Zhu, 2008]; see the bibliography
Context of Active Learning
within Data Mining / Machine Learning
[Figure: spectrum from supervised over semi-supervised (SSL) to unsupervised learning, each with an active and a passive variant]
Supervised (Passive) Classification: Recapitulation
▶ Historical data, e.g. previous customers’ records
  Credit scoring: will a new order be paid?
▶ Generate a training sample L with feature variables (e.g. order value X) and class label (e.g. default Y):

  amount X1   ...   default?
  2.2         ...   no
  1.7         ...   yes
  1.3         ...   yes
  3.1         ...   no
  ...         ...   ...

▶ Aim: a classifier function f : X → Y, e.g. using Bayes’ theorem:

$\Pr(Y \mid X) = \frac{\Pr(X \mid Y) \cdot \Pr(Y)}{\Pr(X)}$

[Figure: the resulting classifier function, shown as class-conditional densities for + and −]
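To make the recap concrete, here is a minimal Bayes-classifier sketch in Python. The Gaussian class-conditional model and the toy numbers are assumptions for illustration, not part of the lecture.

```python
# Minimal Bayes classifier sketch for the credit-scoring example.
# Assumption: Gaussian class-conditional densities Pr(X|Y); toy data.
import numpy as np
from scipy.stats import norm

X = np.array([2.2, 1.7, 1.3, 3.1])        # order value (feature)
y = np.array(["no", "yes", "yes", "no"])  # default? (class label)

def fit_bayes(X, y):
    """Estimate Pr(Y) and a Gaussian Pr(X|Y) for each class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(), Xc.std(ddof=1))
    return model

def predict(model, x):
    """argmax_y Pr(X=x|Y=y) * Pr(Y=y); Pr(X) cancels in the argmax."""
    return max(model, key=lambda c: model[c][0] * norm.pdf(x, model[c][1], model[c][2]))

model = fit_bayes(X, y)
print(predict(model, 1.5))  # a small order value yields "yes" (default) here
```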
Supervised Classification: Limitations¹

¹ See [Krempl et al., 2014b] in SIGKDD Explorations.
Label Paucity
▶ Completeness of information is not met: only some labels are available
▶ Semi-Supervised Classification

[Figure: feature space with only a few labelled instances (+/−) among many unlabelled ones]
Semi-Supervised Learning

Basic situation
▶ Not all data are labelled

Objective
▶ Use unlabelled data to improve the classification model

Assumptions
Basic assumption: there is some structure in the data
▶ Smoothness assumption
  ▶ Neighbouring instances are similar w.r.t. the class label, i.e. the closer two instances are, the more likely they share a class label
  ▶ The decision boundary is placed in low-density regions, so we need to find those low-density regions
▶ Cluster assumption
  ▶ Instances form clusters (subpopulations)
  ▶ Within one cluster, a similar label is more likely
  ▶ In the extreme case, the label Y is conditionally independent of X given the cluster Z
▶ Manifold assumption
  ▶ Data lie on a manifold of much lower dimension than the feature space
  ▶ Learning the manifold eases classification (a remedy against the curse of dimensionality)

Methods
▶ Self-Learning, Co-Learning (a self-training sketch follows below)
▶ Generative Models
▶ Low-density Separation
▶ Graph-Based Methods
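As a concrete illustration of the self-learning method above, here is a minimal self-training sketch. The classifier choice, the confidence threshold, and the stopping rule are illustrative assumptions, not the lecture’s prescription.

```python
# Self-training sketch: repeatedly adopt the most confident pseudo-labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    clf = LogisticRegression()
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)              # confidence of the predicted class
        sure = conf >= threshold              # adopt only confident pseudo-labels
        if not sure.any():                    # nothing confident left: stop
            break
        pseudo = clf.classes_[proba[sure].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[sure]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~sure]
    return clf
```

Note the connection to the smoothness assumption: if it is violated, confident but wrong pseudo-labels can reinforce a wrong hypothesis.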
Active Learning: Motivation⁴

Motivation
▶ Some labels are expensive to obtain

Selection is important
▶ Even (or in particular!) in “Big Data” applications
▶ Efficient allocation of limited resources
▶ Sample where we expect something interesting

⁴ See, e.g., [Settles, 2012, Fu et al., 2012, Zliobaitė et al., 2013].
Active Learning: Context & History
History
▶ Optimal experimental design [Fedorov, 1972]
▶ Learning with queries / query synthesis [Angluin, 1988]
▶ Selective sampling [Cohn et al., 1990]
Selective Data Acquisition Tasks⁵

⁵ Own categorization, inspired by [Attenberg et al., 2011, Saar-Tsechansky et al., 2009, Settles, 2009].
Active Learning Strategies: Motivating Example
Example
▶ Uniformly distributed instances

[Figure: unlabelled instances spread uniformly over the feature space (x1, x2)]
Active Learning Strategies: Motivating Example
[Figure: two labelled instances in the feature space (x1, x2)]

Based on these two labels, what are the possible true label distributions?

[Figure: several candidate label distributions, all consistent with the two labels, “and many others ...”]
Active Learning Strategies: Motivating Example
Problem
▶ Many distributions could have generated the observed labels

[Figure: possible hypotheses about the decision boundary]

Core ideas:
▶ Exploration of the feature space! (exploration vs. exploitation tradeoff)
▶ Furthermore, consider density: sample in high-density areas rather than focusing on potentially irrelevant outliers
Active Learning Strategies: Motivating Example (2)
Example
▶ Multimodally distributed instances

[Figure: instances forming several clusters in the feature space (x1, x2)]
Active Learning Approaches
Uncertainty-Based Strategy
Idea
Select those instances where we are least certain about the label

Approach:
▶ 3 labels preselected
▶ Linear classifier
▶ Use the distance to the decision boundary as uncertainty measure

[Figure: linear decision boundary in the feature space; the most uncertain candidates lie close to the boundary]
Uncertainty Measures
Three families of measures have been proposed (all three are sketched in code below):

Margin
▶ Uncertainty is inversely proportional to the distance to the decision boundary
▶ the closer to the boundary, the less certain

Confidence (Posterior)
▶ Difference of the posteriors of the different classes: $|P(y = + \mid x) - P(y = - \mid x)|$
▶ the higher the difference, the more certain

Entropy
▶ $-\sum_{y \in \{+,-\}} p(y \mid x) \log p(y \mid x)$
▶ the higher the entropy, the less certain
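A minimal sketch of the three measures for the binary case; the function names are mine, and `posterior` denotes the classifier’s estimate of p(y = + | x):

```python
import numpy as np

def confidence_uncertainty(posterior):
    """1 - |p(+|x) - p(-|x)|: maximal when both posteriors are equal."""
    return 1.0 - abs(posterior - (1.0 - posterior))

def entropy_uncertainty(posterior):
    """-sum_y p(y|x) log p(y|x): maximal at p = 0.5."""
    p = np.clip(np.array([posterior, 1.0 - posterior]), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def margin_uncertainty(distance_to_boundary):
    """Inversely proportional to the distance to the decision boundary."""
    return 1.0 / (1.0 + abs(distance_to_boundary))

# All three rank an instance on the boundary (p = 0.5) as most uncertain:
print(confidence_uncertainty(0.5), entropy_uncertainty(0.5), margin_uncertainty(0.0))
```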
[Figure: labels plotted by their observed label distribution p̂ (uniform vs. non-uniform) against the number of labels n (low vs. high), quadrants I–IV]

▶ A label’s value depends on the label information in its neighbourhood
▶ Label information:
  ▶ number of labels
  ▶ share of classes
▶ Uncertainty sampling ignores the number of similar labels
Uncertainty Measures: Discussion
[Figure: the same quadrant plot (observed label distribution p̂ vs. number of labels n), indicating where each measure requests labels]

Margin
▶ Requests instances close to the decision boundary, regardless of the number of labels already there

Confidence
▶ Requests instances with low posterior difference
▶ Results in requests close to the decision boundary
▶ Again ignores the number of already requested labels

Entropy
▶ As above.

▶ All those approaches ignore the number of previously requested labels!
Error-Reduction Strategy
Observation
Valuable are those instances that help us to improve.

Idea
Select the instances that reduce the error the most

Observation
▶ Given the data on the left, generated by two Gaussian clusters: where will we need many labels?

[Figure: two Gaussian clusters in the feature space (x1, x2)]

Questions
1. How to compute the outcome weights wŷ?
2. How to compute the new error rate eŷ? Specifically: on which data?
Error-Reduction Strategy
Outcome-Weights Computation
1. Average: equal weights wŷ = 0.5 for all ŷ
2. Most likely: weight the most likely outcome (based on the current classifier’s prediction) with one, the other outcome with zero
3. Posterior-based: use the current classifier’s posterior estimates as weights: wŷ = p̂(y = ŷ | x)

Error-Rate Computation
▶ Use the current classifier to label the unlabelled instances (similar to self-training)
▶ Good estimate of the generalisation capability
▶ But: might again lead to getting stuck in a wrong hypothesis

(A sketch combining both steps follows below.)
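A hedged sketch of posterior-weighted expected error reduction, assuming a scikit-learn-style classifier with `fit`/`predict_proba` that is already fitted on (X_lab, y_lab); the self-labelled pool serves as the evaluation set, as described above.

```python
import numpy as np
from sklearn.base import clone

def expected_error(clf, X_lab, y_lab, X_pool, candidate_idx):
    """Posterior-weighted error after hypothetically labelling one candidate."""
    x = X_pool[candidate_idx:candidate_idx + 1]
    weights = clf.predict_proba(x)[0]        # w_yhat = p(y = yhat | x)
    pseudo = clf.predict(X_pool)             # self-labelled evaluation set
    err = 0.0
    for w, y_hat in zip(weights, clf.classes_):
        if w == 0.0:
            continue
        clf2 = clone(clf).fit(np.vstack([X_lab, x]), np.append(y_lab, y_hat))
        err += w * np.mean(clf2.predict(X_pool) != pseudo)   # w_yhat * e_yhat
    return err

def select_query(clf, X_lab, y_lab, X_pool):
    """Query the candidate with the smallest expected error."""
    errs = [expected_error(clf, X_lab, y_lab, X_pool, i) for i in range(len(X_pool))]
    return int(np.argmin(errs))
```

The nested retraining per candidate and per outcome is what makes this strategy computationally expensive, as noted in the discussion below.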
Ensemble-Based Strategy
Idea
Use the disagreement between base classifiers

[Figure: two base classifiers (Classifier 1, Classifier 2) with different decision boundaries; instances in their region of disagreement are the query candidates]

Approach
1. Get an initial set of labels
2. Split that set into (overlapping) subsets
3. On each subset, train a different base classifier
4. Repeat:
5.   For each unlabelled instance:
6.     Apply all base classifiers
7.     Request the label if the base classifiers disagree
8.   Update all base classifiers
9. Until convergence
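A minimal sketch of the disagreement test in steps 6–7, using vote entropy as one common disagreement measure; the slide’s pseudo-code leaves the concrete measure open.

```python
import numpy as np

def vote_entropy(committee, x):
    """Entropy of the committee's vote distribution on instance x."""
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in committee]
    _, counts = np.unique(votes, return_counts=True)
    freqs = counts / len(committee)
    return float(-(freqs * np.log(freqs)).sum())

def most_disputed(committee, X_pool):
    """Return the pool index the committee disagrees on most."""
    return int(np.argmax([vote_entropy(committee, x) for x in X_pool]))
```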
Probabilistic Active Learning: Illustrative Example
[Figure: a few labelled instances (+/−) and one candidate instance (?)]
Principles of Probabilistic Active Learning⁷

[Figure: the labelled instances (+/−) and the candidate instance (?) from the example above]

⁷ See [Krempl et al., 2014a].
Principles of Probabilistic Active Learning
▶ The candidate’s label y is + with probability p and − with probability (1 − p):

$y \sim \begin{cases} + & \text{with probability } p \\ - & \text{with probability } 1 - p \end{cases}$   (1)

▶ These labels are the realisation of n Bernoulli trials with success probability p (cf. urn model):

$n_+ \sim \mathrm{Binomial}_{n,p} = \mathrm{Bin}_{n,p}$   (2)

▶ The likelihood of p given the label statistics ls, normalised under a uniform prior g(p), yields a Beta distribution:

$L(p \mid ls) = \mathrm{Bin}_{n_+ + n_-,\,p}(n_+) = \frac{\Gamma(n_+ + n_- + 1)}{\Gamma(n_+ + 1) \cdot \Gamma(n_- + 1)} \cdot p^{n_+} \cdot (1 - p)^{n_-}$   (3)

$\omega_{ls}(p) = \frac{L(p \mid ls)\, g(p)}{\int_0^1 L(\psi \mid ls)\, g(\psi)\, d\psi} = (n_+ + n_- + 1) \cdot L(p \mid ls) = \mathrm{Beta}_{\alpha,\beta}(p)$   (4)
Probabilistic Active Learning – Excursus: Beta Prior
▶ Uniform prior: prior to the first label’s arrival, all values of p are equally likely
▶ Plot: normalised likelihoods for different values of α, β, i.e. no labels (n = 0), few labels (n = 2, p̂ = 0.5), and more labels (n = 3, p̂ = 2/3)
▶ The peak of this function becomes more distinct as more labels are obtained.

[Figure: the three normalised likelihoods (Beta densities) over the true posterior p]
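Eq. (4) can be checked numerically: with a uniform prior, the normalised likelihood for n₊ positive and n₋ negative labels is the Beta(n₊ + 1, n₋ + 1) density. A short sketch reproducing the three cases from the plot:

```python
# Numerical check of Eq. (4): omega_ls(p) = Beta(n+ + 1, n- + 1).
from scipy.stats import beta

cases = {"no labels (n=0)":           (0, 0),
         "few labels (n=2, p^=1/2)":  (1, 1),
         "more labels (n=3, p^=2/3)": (2, 1)}

for name, (n_pos, n_neg) in cases.items():
    w = beta(n_pos + 1, n_neg + 1)   # the normalised likelihood of p
    mode = (n_pos / (n_pos + n_neg)) if n_pos + n_neg else "n/a (flat)"
    print(name, "| mode at p =", mode, "| density at p=0.5:", round(w.pdf(0.5), 3))
```

The mode sits at p̂ = n₊/n, and the density’s peak sharpens with every additional label, exactly as the plot describes.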
Probabilistic Gain
Probabilistic Gain – GOPAL
[Figure: probabilistic gain as a function of the posterior and the number of labels n; uncertainty sampling’s score, in contrast, is constant w.r.t. n]
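The probabilistic gain itself is not spelled out on this slide. As a rough, simplified sketch of the idea in [Krempl et al., 2014a], one can average the accuracy gain of one additional label over both the Beta-weighted true posterior p (Eq. 4) and the label outcome; the performance measure and tie-handling here are my simplifications, not the paper’s exact formulation.

```python
import numpy as np
from scipy.stats import beta

def accuracy(p, p_hat):
    """Accuracy at true posterior p when predicting the majority class;
    ties (p_hat = 0.5) arbitrarily predict '-'. A simplification."""
    return p if p_hat > 0.5 else 1.0 - p

def probabilistic_gain(n_pos, n_neg, grid=1000):
    n = n_pos + n_neg
    ps = np.linspace(1e-6, 1 - 1e-6, grid)
    w = beta.pdf(ps, n_pos + 1, n_neg + 1)      # omega_ls(p), Eq. (4)
    p_hat_now = n_pos / n if n else 0.5
    gains = []
    for p in ps:
        now = accuracy(p, p_hat_now)
        # one more label is + with probability p, - with probability 1 - p
        after = (p * accuracy(p, (n_pos + 1) / (n + 1))
                 + (1 - p) * accuracy(p, n_pos / (n + 1)))
        gains.append(after - now)
    return float(np.trapz(w * np.array(gains), ps))

print(probabilistic_gain(1, 1))    # few, balanced labels: noticeable gain
print(probabilistic_gain(10, 10))  # many labels: little left to gain
```

Unlike uncertainty sampling, this score shrinks as n grows, which is the point of considering the label statistics.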
Active Learning Strategies - Discussion
▶ Uncertainty-Based:
  fast and simple, but wastes labels in regions with a high Bayesian error rate
▶ Error-Reduction:
  good quality, but high computational cost
▶ Ensemble-Based:
  good quality; convergence avoids unnecessary sampling in regions with a high Bayesian error rate; conceptually easy to understand; multiple classifiers needed
▶ Probabilistic Approach:
  good quality; considering label statistics avoids unnecessary sampling in regions with a high Bayesian error rate
Evaluating Active Learning Approaches
Learning Curves
▶ Plot performance against the number of sampled instances

[Figure: learning curves of two error-reduction strategies (ER 1, ER 2) and random sampling; performance on the y-axis]
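A minimal sketch of producing such learning curves, comparing random with uncertainty sampling; the dataset, classifier, and budget are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
X_pool, y_pool = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

def learning_curve(strategy, budget=60, seed=1):
    rng = np.random.default_rng(seed)
    # start with one labelled instance per class so fitting is possible
    labelled = [int(np.where(y_pool == c)[0][0]) for c in np.unique(y_pool)]
    scores = []
    for _ in range(budget):
        clf = LogisticRegression().fit(X_pool[labelled], y_pool[labelled])
        scores.append(clf.score(X_test, y_test))
        unlabelled = [i for i in range(len(X_pool)) if i not in labelled]
        if strategy == "random":
            labelled.append(int(rng.choice(unlabelled)))
        else:  # uncertainty sampling: query the instance closest to p = 0.5
            proba = clf.predict_proba(X_pool[unlabelled])[:, 1]
            labelled.append(unlabelled[int(np.argmin(np.abs(proba - 0.5)))])
    return scores

curve_rand = learning_curve("random")
curve_unc = learning_curve("uncertainty")
```

Plotting both score lists against the number of labels gives curves like the ones sketched above.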
Evaluation Example: Multi-Class Probabilistic AL (McPAL), Selection⁸

[Figure: six panels of learning curves, error vs. number of sampled instances (0–60), comparing McPAL with competitors including random sampling (Rand) and VoI]

Learning curves of mean misclassification cost (including standard deviation as error bars) of McPAL and its competitors on all six datasets using the pKNN classifier.

⁸ For more results, please consult the ECAI paper [Kottke et al., 2016].
Summary and Questions
Summary
▶ Motivation & aim: not all labels are available
▶ General idea: use the structure of the data!
▶ Semi-Supervised Learning:
  ▶ Assumptions: smoothness, clusters, or manifold
  ▶ Approaches: Self-Learning, Generative Models, Low-Density Separation, Graph-Based Models
▶ Active Learning:
  ▶ Scenarios: query synthesis, pool, stream
  ▶ Approaches: Uncertainty Sampling, Expected Error Reduction, Query by Committee, Probabilistic Active Learning
  ▶ Evaluation: learning curves

Questions?
Bibliography

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.

Attenberg, J., Melville, P., Provost, F., and Saar-Tsechansky, M. (2011). Selective data acquisition for machine learning. In Krishnapuram, B., Yu, S., and Rao, R. B., editors, Cost-Sensitive Machine Learning. CRC Press, Boca Raton, FL, USA, 1st edition.

Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press.

Cohn, D., Atlas, L., Ladner, R., El-Sharkawi, M., Marks, R., Aggoune, M., and Park, D. (1990). Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press.

Fu, Y., Zhu, X., and Li, B. (2012). A survey on instance selection for active learning. Knowledge and Information Systems, 35(2):249–283.

Kottke, D., Krempl, G., Lang, D., Teschner, J., and Spiliopoulou, M. (2016). Multi-class probabilistic active learning. In Fox, M., Kaminka, G., Hüllermeier, E., and Bouquet, P., editors, Proc. of the 22nd Europ. Conf. on Artificial Intelligence (ECAI 2016), Frontiers in Artificial Intelligence and Applications. IOS Press.

Krempl, G., Kottke, D., and Spiliopoulou, M. (2014a). Probabilistic active learning: Towards combining versatility, optimality and efficiency. In Dzeroski, S., Panov, P., Kocev, D., and Todorovski, L., editors, Proceedings of the 17th Int. Conf. on Discovery Science (DS), Bled, volume 8777 of Lecture Notes in Computer Science, pages 168–179. Springer.

Krempl, G., Zliobaitė, I., Brzeziński, D., Hüllermeier, E., Last, M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou, M., and Stefanowski, J. (2014b). Open challenges for data stream mining research. SIGKDD Explorations, 16(1):1–10. Special Issue on Big Data.

Saar-Tsechansky, M., Melville, P., and Provost, F. (2009). Active feature-value acquisition. Management Science, 55(4):664–684.

Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, Madison, Wisconsin, USA.

Settles, B. (2012). Active Learning. Number 18 in Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers.

Zhu, X. (2008). Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin.

Zliobaitė, I., Bifet, A., Pfahringer, B., and Holmes, G. (2013). Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems, 25(1):27–39.
SSL: Approaches
▶ Self-Learning, Co-Learning
▶ Generative Models
▶ Low-density Separation
▶ Graph-Based Methods
Generative Models
▶ Assumption:
  P(X, Y) = P(Y) · P(X|Y), with P(X|Y) being an identifiable mixture distribution
▶ Objective:
  Identify the generative model that produces the data
▶ Basic idea:
  Learn the distribution P(X|Y) of each class y ∈ Y; this requires some assumptions about the generative model
▶ Probabilistic approach:
  Find the likelihood-maximising parameter set Θ, e.g. using expectation-maximisation
▶ Algorithmic approach:
  Cluster-and-Label (a sketch follows below)
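A minimal Cluster-and-Label sketch, assuming a Gaussian mixture as the generative model and that each mixture component corresponds to one class; the function name and parameters are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture

def cluster_and_label(X_lab, y_lab, X_unlab, n_components=2, seed=0):
    # Fit the mixture on ALL instances: the unlabelled data shape the clusters.
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(np.vstack([X_lab, X_unlab]))
    z_lab = gmm.predict(X_lab)              # cluster of each labelled point
    # Majority label per cluster (None if a cluster has no labelled member).
    cluster_label = {}
    for z in range(n_components):
        members = y_lab[z_lab == z]
        cluster_label[z] = (Counter(members.tolist()).most_common(1)[0][0]
                            if len(members) else None)
    return np.array([cluster_label[z] for z in gmm.predict(X_unlab)])
```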
Generative Models - Example
[Figure: generative-model example⁹]

⁹ Illustrations from Zhu 2005 [Zhu, 2008].
Generative Models - Example (2)
[Figure: second generative-model example¹⁰]

¹⁰ Illustrations from Zhu 2005 [Zhu, 2008].
Generative Models - Challenges
▶ Identifiability
  It may happen that the solution is not unique, even if the type of the involved distributions is known.
  Example from Zhu 2005:
  ▶ Given P(X) ∼ U(0, 1) and two labelled instances (x1 = 0.1, y1 = +), (x2 = 0.9, y2 = −)
  ▶ We know that P(X|Y = +) ∼ U(0, τ) and P(X|Y = −) ∼ U(τ, 1), but we do not know the value of τ
  ▶ Different values of τ result in competing models: we cannot identify the true model given the two instances
▶ Model correctness
  ▶ If the assumptions in the model are met, using unlabelled data increases the quality of the prediction model
  ▶ But: otherwise, the model can be worse than a model based solely on labelled data
▶ Model fitting
  The fitted parameter set might not be the global optimum
Low-Density Separation
▶ Assumption:
  Classes are separated by low-density regions
▶ Objective:
  Placement of the decision boundary in regions of low density
▶ Approach:
  Transductive Support Vector Machines (TSVMs)
  ▶ Conventional SVM: maximize the margin between labelled instances
  ▶ Transductive SVM: maximize the margin between all instances (i.e. labelled and unlabelled ones), as written out below
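For reference, the TSVM objective as it is commonly written (the notation here is mine, following the standard soft-margin formulation): unlabelled instances get self-assigned labels ŷ_j and their own slack variables, so the margin is maximized over labelled and unlabelled points alike.

```latex
\min_{w,\,b,\,\hat{y},\,\xi,\,\xi^{*}} \;
  \frac{1}{2}\lVert w\rVert^{2}
  + C \sum_{i\ \text{labelled}} \xi_i
  + C^{*} \sum_{j\ \text{unlabelled}} \xi_j^{*}
\quad\text{s.t.}\quad
  y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad
  \hat{y}_j\,(w^{\top}x_j + b) \ge 1 - \xi_j^{*},\qquad
  \xi_i,\ \xi_j^{*} \ge 0.
```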
Graph-Based Methods
▶ Assumption:
  Manifold assumption, smoothness over the graph
▶ Approach:
  ▶ Represent the data by a graph
  ▶ Vertices/nodes are the labelled and unlabelled instances
  ▶ Edges represent similarities (e.g. connect an instance to its k nearest neighbours, or to all neighbours within some ε-neighbourhood)
  ▶ Estimate a function f on the graph that is close to the given labels and smooth over the whole graph
  ▶ Use f to propagate labels to neighbouring instances in the graph (a sketch follows below)
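A minimal iterative label-propagation sketch on a weighted kNN graph; the Gaussian similarity, the choice of k, and the iteration count are illustrative assumptions, and n must exceed k.

```python
import numpy as np

def label_propagation(X, y, k=5, n_iter=100):
    """y holds +1/-1 for labelled instances and 0 for unlabelled ones."""
    # Gaussian similarity between all pairs of instances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-d**2)
    np.fill_diagonal(W, 0.0)
    # keep only the k strongest edges per node, then symmetrise
    for i in range(len(X)):
        W[i, np.argsort(W[i])[:-k]] = 0.0
    W = np.maximum(W, W.T)
    f = y.astype(float)
    labelled = y != 0
    for _ in range(n_iter):
        f = W @ f / np.maximum(W.sum(axis=1), 1e-12)  # smooth f over the graph
        f[labelled] = y[labelled]                      # clamp the given labels
    return np.sign(f)   # 0 remains for instances no label ever reached
```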
Evaluation of Influence Factors
▶ Original McPAL compared to variants:
  ▶ ignoring reliability: normalising the label-count vector k⃗ such that n = 1
  ▶ ignoring posterior: using the uniform frequency estimate ki = n/C, 1 ≤ i ≤ C
  ▶ ignoring density: setting it to a constant value

[Figure: six panels of learning curves, error vs. number of sampled instances (0–60), comparing McPAL with the three ablated variants (w/o reliability, w/o posterior, w/o density) on six datasets including ecoli, glass, and iris, using the pKNN classifier]

Learning curves in mean misclassification cost (including standard deviation as error bars).