Set Covering Machines
Learning with Data-Dependent Balls
Agenda
Introductions
Life Before SCM
SCM Description
Examples
Intuitive
Algorithm-Based
With Errors
Performance Results
Conclusion
Who We Are
Stan James
Cognitive Science Masters student
stan@wanderingstan.com
Edward Dale
1-Semester Exchange student
scompt@scompt.com
Who They Are
Mario Marchand
marchand@site.uottawa.ca
John Shawe-Taylor
jst@cs.rhul.ac.uk
What’s The Problem?
Given a set of training examples labeled positive or negative, construct a logical expression that can predict the labels of future cases. (Like SVMs, Decision Trees, and so on...)
Examples:
Disease diagnosis
Loan applications
Chess
Mushrooms
What is Possible
Positive Error: misclassifying a negative data point as positive
Negative Error: misclassifying a positive data point as negative
SVM Refresher
Separates input vectors using a hyperplane
Separating hyperplane has the largest
possible margin
If the function to learn depends on only a very small subset of a large set of given features, a better solution may exist.
SCM History
Two-step algorithm proposed by Valiant and
Haussler
‘Greedy Set Cover Algorithm’
Two Problems
Defined only on boolean attributes
No accuracy-complexity tradeoff
Set Covering Machines extend this algorithm
to solve these problems
SCM Basic Concept
Use ‘Data-Dependent Balls’ to create boolean
features from training points
Create a function that is the smallest
conjunction of positive features (balls)
Also possible to create a disjunction of
negative features
The Input Space
The input space can be n-dimensional
All that is required is that there exist a suitable distance function
Our examples:
2-dimensional
The distance function is the Euclidean distance
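As a minimal sketch (Python; the function names are my own), any metric can serve as the distance function, with Euclidean distance fitting our 2-dimensional examples:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

# Any other metric could be plugged in instead, e.g. Manhattan distance:
def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))
```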
Data Dependent Balls Defined
¹
´
¦
s
÷
=
otherwise
) , ( if
,
µ
µ
i
i
i
i
x x d
y
y
h
Balls defined by a training example and a
distance.
A training example x is contained inside a
ball y
i
if the distance between x and x
i
is less
than or equal to µ.
y
1
µ
x
i
d(x,x
i
)
x
y
i
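The definition above can be sketched in code (a hypothetical helper of my own, using 0/1 labels so the complemented label ȳ is written 1 − y):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def ball_feature(center, label, radius, dist=euclidean):
    """Data-dependent ball h_i: outputs the center's label y_i when
    d(x, x_i) <= radius, and the complement label otherwise."""
    def h(x):
        return label if dist(x, center) <= radius else 1 - label
    return h

# A positive ball of radius 2 around the origin:
h = ball_feature(center=(0.0, 0.0), label=1, radius=2.0)
```

A point at (1, 1) falls inside the ball and gets the center's label 1; a point at (3, 3) falls outside and gets the complement 0.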
Balls Everywhere
Before running the algorithm, the set of all possible balls H is created
H consists of every ball centered at a training example having another training example at its border
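Constructing H can be sketched as follows (helper names are my own): one candidate ball for every ordered pair of training examples, centered at the first with the second on its border.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def build_ball_set(points):
    """H: every (center_index, radius) pair where the ball is centered
    at one training example and has another example at its border."""
    return [(i, euclidean(points[i], points[j]))
            for i in range(len(points))
            for j in range(len(points)) if i != j]

# Three training points yield 3 * 2 = 6 candidate balls:
H = build_ball_set([(0, 0), (3, 4), (6, 8)])
```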
A Quick Explanation
The Party
[Scatter plot over ‘Time at Party’: points marked ‘Had fun’ (positive) and ‘Didn’t like it’ (negative)]
Ball Covering
[Positive balls covering the ‘Had fun’ points]
Negative Balls
[Negative balls around the ‘Didn’t like it’ points]
What we learned
f(x) = b_1 · ¬b_2 · ¬b_3
[The three balls b_1, b_2, b_3 drawn on the party plot]
Divide the data set
The algorithm (step 1 highlighted):
1. Divide the data set into the positive examples P and the negative examples N; start with R = ∅.
2. For each feature h_i, let Q_i be the subset of N covered by h_i, and let E_i be the subset of P for which h_i makes an error.
3. Let h_k be the feature with the largest usefulness value |Q_k| − p·|E_k|.
4. Let R ← R ∪ {h_k}, let N ← N − Q_k, and let P ← P − E_k.
5. For all i do: Q_i ← Q_i − Q_k and E_i ← E_i − E_k.
6. If N = ∅ or |R| ≥ s, then stop. Else repeat from step 2.
7. Return f(x) = h_1(x) ∧ h_2(x) ∧ … ∧ h_|R|(x), where h_i ∈ R.
[Party plot: the full data set]
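The steps above can be sketched in Python. This is a simplified, hypothetical implementation of the greedy procedure (not the authors' code), for a conjunction of balls centered on positive examples: a ball "covers" the negatives it places outside and errs on the positives it places outside.

```python
import math

def dist(a, b):
    """Euclidean distance; any suitable metric would do."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def build_scm(P, N, p=1.0, s=float("inf")):
    """Greedy SCM sketch: repeatedly pick the ball maximizing
    |Q_k| - p*|E_k| until all negatives are covered or |R| >= s."""
    # Step 1: candidate balls, each centered on a positive example
    # with another training example on its border.
    H = [(c, dist(c, x)) for c in P for x in P + N if x != c]
    R, N_left, P_left = [], list(N), list(P)
    while N_left and len(R) < s:
        # Steps 2-3: usefulness of each ball on the remaining points.
        def usefulness(ball):
            c, r = ball
            Q = sum(1 for x in N_left if dist(c, x) > r)  # negatives excluded
            E = sum(1 for x in P_left if dist(c, x) > r)  # positive errors
            return Q - p * E
        c, r = max(H, key=usefulness)
        remaining = [x for x in N_left if dist(c, x) <= r]
        if len(remaining) == len(N_left):
            break  # best ball covers no new negative: stop
        # Steps 4-5: keep the ball, drop the points it has dealt with.
        N_left = remaining
        P_left = [x for x in P_left if dist(c, x) <= r]
        R.append((c, r))
    # Step 7: f(x) is the conjunction "x lies inside every chosen ball".
    return R

def predict(R, x):
    return all(dist(c, x) <= r for c, r in R)

# Toy run: three positives near the origin, two negatives far away.
P = [(0, 0), (1, 0), (0, 1)]
N = [(5, 5), (6, 5)]
R = build_scm(P, N)
```

On this toy data a single ball around the positives suffices: every positive lands inside it and both negatives outside.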
Select the best ball
If no errors are allowed, a positive ball must include all positive data points.
(Algorithm steps 2–3 highlighted: compute Q_i and E_i for every ball, then choose the ball h_k with the largest usefulness value |Q_k| − p·|E_k|.)
[Party plot: the chosen ball h_3 and its covered set Q_3]
Remove covered points
Remember the ball we chose, and remove the data points that are now covered.
(Algorithm step 4 highlighted: R ← R ∪ {h_k}, N ← N − Q_k, and P ← P − E_k.)
[Party plot: R now contains the chosen ball h_k]
Remove unneeded balls
(Algorithm step 5 highlighted: for all i, Q_i ← Q_i − Q_k and E_i ← E_i − E_k.)
[Party plot: remaining candidate balls]
...and repeat
Repeat until all data points are covered.
(Algorithm step 6 highlighted: if N = ∅ or |R| ≥ s, then stop; else repeat from step 2.)
[Party plot]
We're done
(Algorithm step 7 highlighted: return f(x) = h_1(x) ∧ … ∧ h_|R|(x), where h_i ∈ R.)
The learned function is
f(x) = b_1 · ¬b_2 · ¬b_3
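The learned party function can be evaluated directly. In this sketch the balls are stood in by hypothetical one-dimensional interval tests on "time at party" (the interval bounds are invented for illustration):

```python
def make_f(b1, b2, b3):
    """f(x) = b1(x) AND NOT b2(x) AND NOT b3(x): inside the positive
    ball b1, but outside the negative balls b2 and b3."""
    return lambda x: b1(x) and not b2(x) and not b3(x)

# Hypothetical balls as intervals on "time at party":
b1 = lambda t: 2.0 <= t <= 10.0   # broad positive ball
b2 = lambda t: 2.5 <= t <= 3.0    # negative ball carved out of b1
b3 = lambda t: 8.0 <= t <= 9.0    # negative ball carved out of b1
f = make_f(b1, b2, b3)
```

A time of 5.0 is classified positive (had fun); 2.7 falls in the negative ball b2, and 11.0 falls outside b1, so both are classified negative.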
Generalize
[Party plot]
Learn the same data, but...
• Another person at the party
• A maximum of 3 conjunctions
Limit our solution size
(Algorithm step 6 highlighted: the parameter s limits the size of the solution, stopping once |R| ≥ s.)
[Party plot]
Tolerate some errors
[Party plot with one noisy data point labeled ‘Noise’]
(Algorithm shown without the stopping parameter s: it stops only when N = ∅.)
Our existing algorithm will try to choose more balls.
[Party plot]
Tolerate some errors
Our existing algorithm will try to choose more, smaller balls to avoid the noisy data point.
[Party plot]
Tolerate some errors
(The usefulness value |Q_k| − p·|E_k| lets the large ball h_3 cover the noisy point.)
It is better to allow some errors and get a more general solution.
[Party plot: the large ball h_3 covering the noisy point]
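The trade-off shows up in the usefulness score itself. A small numeric sketch (the counts are invented): with a moderate penalty p, one large ball that covers many negatives but errs on the noisy positive still beats a small, error-free ball.

```python
def usefulness(covered, errors, p):
    """|Q_k| - p * |E_k|: negatives newly covered, minus penalized
    errors on positive examples."""
    return covered - p * errors

# A big ball covering 10 negatives at the cost of 1 noisy positive
# beats a small, error-free ball covering only 3 negatives:
big_ball   = usefulness(covered=10, errors=1, p=2.0)   # 8.0
small_ball = usefulness(covered=3,  errors=0, p=2.0)   # 3.0
# A huge penalty flips the preference toward error-free balls:
strict     = usefulness(covered=10, errors=1, p=20.0)  # -10.0
```

A small p tolerates errors in exchange for a more general solution; a very large p forces error-free, and therefore smaller, balls.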
Performance Comparison
[Bar chart ‘Function Size’: # of Features (0–600), SCM vs. SVM]
[Bar chart ‘Error’: # of Errors (0–300), SCM vs. SVM]
Performance Tuning of SCM
[Bar chart ‘SCM Size’: # of Features (0–150), infinite s vs. finite s]
[Bar chart ‘SCM Error’: # of Errors (0–250), infinite s vs. finite s]
Conclusion
The Set Covering Machine uses ‘Data-Dependent Balls’ to group training examples into features
The algorithm selects the balls that are most useful in classifying the data
http://www.inf.uos.de/barbara/lectures/ml/papers/set.pdf
Edward Dale (scompt@scompt.com)
Stan James (stan@wanderingstan.com)
Questions?