# Set Covering Machines

Learning with Data Balls
Agenda
• Introductions
• Life Before SCM
• SCM Description
• Examples
  • Intuitive
  • Algorithm-Based
  • With Errors
• Performance Results
• Conclusion
Who We Are
• Stan James
  • Cognitive Science Master's student
  • stan@wanderingstan.com
• Edward Dale
  • 1-semester exchange student
  • scompt@scompt.com
Who They Are
Mario Marchand
marchand@site.uottawa.ca
John Shawe-Taylor
jst@cs.rhul.ac.uk
What’s The Problem?
• Given a set of training examples labeled positive or negative, construct a logical expression that can predict future cases (the same task addressed by SVMs, decision trees, and so on).
• Examples:
  • Disease diagnosis
  • Loan applications
  • Chess
  • Mushrooms
What is Possible
• Positive error: misclassifying a negative data point as positive
• Negative error: misclassifying a positive data point as negative
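As a minimal sketch of these two definitions, the following Python function counts both kinds of error for a list of predictions. The function name `count_errors` and the sample labels are illustrative assumptions, not part of the original slides.

```python
def count_errors(y_true, y_pred):
    """Return (positive_errors, negative_errors) as defined on the slide.

    A positive error is a negative point predicted as positive;
    a negative error is a positive point predicted as negative.
    """
    positive_errors = sum(1 for t, q in zip(y_true, y_pred) if not t and q)
    negative_errors = sum(1 for t, q in zip(y_true, y_pred) if t and not q)
    return positive_errors, negative_errors


# Made-up labels, for illustration only.
y_true = [True, True, False, False]
y_pred = [True, False, True, True]
errors = count_errors(y_true, y_pred)
print(errors)  # (2, 1)
```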
SVM Refresher
• Separates input vectors using a hyperplane
• The separating hyperplane has the largest possible margin
• If the function to learn depends only on a very small subset of a large set of given features, there may exist a better solution.
SCM History
• Two-step algorithm proposed by Valiant and Haussler
  • ‘Greedy Set Cover Algorithm’
• Two Problems
  • Defined only on boolean attributes
• Set Covering Machines extend this algorithm to solve these problems
SCM Basic Concept
• Use ‘Data-Dependent Balls’ to create boolean features from training points
• Create a function that is the smallest conjunction of positive features (balls)
• Also possible to create a disjunction of negative features
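The idea of combining boolean features into one classifier can be sketched as follows. This is only an illustration: the helper names `conjunction` and `disjunction` are assumptions, and the two threshold features stand in for balls to keep the example short.

```python
def conjunction(features):
    """Build f(x) that is True only if every feature outputs True on x."""
    return lambda x: all(h(x) for h in features)


def disjunction(features):
    """Build f(x) that is True if at least one feature outputs True on x."""
    return lambda x: any(h(x) for h in features)


# Hypothetical one-dimensional features, used here instead of balls.
h1 = lambda x: x > 1.0
h2 = lambda x: x < 5.0

f_and = conjunction([h1, h2])
f_or = disjunction([h1, h2])
print(f_and(3.0), f_and(7.0))  # True False
print(f_or(7.0))               # True
```

A smaller conjunction uses fewer features, which is the sense in which the slides ask for the "smallest" one.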
The Input Space
• Input space can be n-dimensional
• All that is required is that there exist a suitable distance function
• Our examples
  • 2-dimensional
  • Distance function is the Euclidean distance
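Data Dependent Balls Defined

$$
h_{i,\mu}(x) =
\begin{cases}
y_i & \text{if } d(x, x_i) \le \mu \\
\bar{y}_i & \text{otherwise}
\end{cases}
$$

Here $\bar{y}_i$ denotes the complement of the label $y_i$.

• Balls are defined by a training example and a distance.
• A training example $x$ is contained inside the ball $h_{i,\mu}$ if the distance between $x$ and $x_i$ is less than or equal to $\mu$.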
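The definition above can be sketched in Python as a function that builds one ball feature. The names `make_ball` and `euclidean` and the sample coordinates are assumptions made for illustration; only the rule (center label inside the radius, its complement outside) comes from the slide.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(a, b)


def make_ball(center, label, radius, dist=euclidean):
    """Return h(x): the center's label inside the ball, its complement outside."""
    def h(x):
        return label if dist(x, center) <= radius else (not label)
    return h


# A ball around a positive example and one around a negative example.
positive_ball = make_ball(center=(0.0, 0.0), label=True, radius=1.0)
negative_ball = make_ball(center=(0.0, 0.0), label=False, radius=1.0)

print(positive_ball((0.5, 0.5)))  # inside: True
print(positive_ball((2.0, 0.0)))  # outside: False
print(negative_ball((0.5, 0.5)))  # inside: False
```

Passing the distance as a parameter reflects the earlier point that any suitable distance function can be used, not only the Euclidean one.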
[Figure: a ball of radius µ centered at x_i with label y_i, and a point x at distance d(x, x_i) from the center.]
Balls Everywhere
• Before running the algorithm, the set of all possible balls H is created
• H consists of every ball centered at a training example having another training example at its border
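A minimal sketch of how such a set H could be enumerated, assuming each training example is a `(point, label)` pair and each ball is stored as a `(center, label, radius)` triple. The function name `all_balls` and the three sample points are hypothetical.

```python
import math
from itertools import permutations


def all_balls(examples):
    """Enumerate one ball per ordered pair of distinct training examples.

    The first example of the pair is the center; the second lies on the
    border, so the radius is the distance between the two points.
    """
    balls = []
    for (center, label), (border_point, _) in permutations(examples, 2):
        balls.append((center, label, math.dist(center, border_point)))
    return balls


# Made-up training set with three examples.
train = [((0.0, 0.0), True), ((3.0, 4.0), False), ((6.0, 8.0), True)]
H = all_balls(train)
print(len(H))  # 6
print(H[0])    # ((0.0, 0.0), True, 5.0)
```

With m training examples this produces m(m - 1) candidate balls, which is why H is built once before the selection loop starts.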
A Quick Explanation
The Party
[Figure: the party data set. Axis label: "Time at Party". Label: "Didn’t like it".]
Ball Covering
Negative Balls
What we learned
$f(x) = b_1 \wedge \neg b_2 \wedge \neg b_3$
[Figure: the balls b1, b2, and b3 drawn over the party data.]
Divide the data set
1. For each feature $h_i$, let $Q_i$ be the subset of $N$ covered by $h_i$, and let $E_i$ be the subset of $P$ for which $h_i$ makes an error.
2. Let $h_k$ be the feature with the largest usefulness value $|Q_k| - p \cdot |E_k|$.
3. Let $R \leftarrow R \cup \{h_k\}$. Let $N \leftarrow N - Q_k$ and let $P \leftarrow P - E_k$.
4. For all $i$ do: $Q_i \leftarrow Q_i - Q_k$ and $E_i \leftarrow E_i - E_k$.
5. If ($N = \emptyset$ or $|R| > s$) then stop. Else repeat from Step 2.
6. Return $f(x)$, where $f(x) = h_0 \wedge h_1 \wedge \dots \wedge h_n$, $h_n \in R$.
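The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: examples are represented by integer ids, each feature in `H` is represented by the set of ids on which it outputs True, and the early exit when no remaining negative can be covered is an added guard that is not on the slide. The names `build_scm` and `usefulness` and the sample data are hypothetical.

```python
def usefulness(covered, errors, p):
    """|Q| - p*|E|; the product is skipped when E is empty so p may be infinite."""
    return len(covered) - (p * len(errors) if errors else 0)


def build_scm(P, N, H, p, s):
    """Greedy selection of features whose conjunction rejects the negatives.

    P, N: ids of positive and negative training examples.
    H:    dict mapping a feature name to the ids on which it outputs True.
    p:    penalty for each positive example a feature rejects.
    s:    size limit used in the stopping test.
    """
    P, N = set(P), set(N)
    # Step 1: negatives each feature rejects (covers), positives it rejects by mistake.
    Q = {i: N - accepted for i, accepted in H.items()}
    E = {i: P - accepted for i, accepted in H.items()}
    R = []
    while N:  # Step 5, first half: stop when N is empty.
        # Step 2: pick the feature with the largest usefulness.
        k = max(Q, key=lambda i: usefulness(Q[i], E[i], p))
        if not Q[k]:
            break  # Added guard (assumption): nothing left that this choice covers.
        # Step 3: remember the feature and drop the examples it settles.
        R.append(k)
        Qk, Ek = Q[k], E[k]
        N -= Qk
        P -= Ek
        # Step 4: update every feature's bookkeeping.
        for i in Q:
            Q[i] = Q[i] - Qk
            E[i] = E[i] - Ek
        # Step 5, second half: size test written with ">" as on the slide.
        if len(R) > s:
            break
    return R  # Step 6: f(x) is the conjunction of the features in R.


# Made-up data: positives {1, 2}, negatives {3, 4, 5}.
H = {"a": {1, 2, 3}, "b": {1, 2, 4, 5}, "c": {1, 4}}
chosen = build_scm({1, 2}, {3, 4, 5}, H, p=1, s=10)
print(chosen)  # ['a', 'b']
```

In this run feature "a" rejects negatives 4 and 5 without error, so it is chosen first; "b" then rejects the remaining negative 3, and the loop stops because N is empty.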
Select the best ball
If no errors are allowed, a positive ball must include all positive data points.
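One way to read this in terms of the usefulness value from the algorithm: forbidding errors corresponds to letting the penalty $p$ grow without bound. The symbol $U_k$ is introduced here only for illustration, and the limit is a sketch of the reasoning rather than a formula taken from the slides.

```latex
U_k = |Q_k| - p\,|E_k|, \qquad
\lim_{p \to \infty} U_k =
\begin{cases}
|Q_k| & \text{if } |E_k| = 0 \\
-\infty & \text{if } |E_k| \ge 1
\end{cases}
```

Under this reading, any ball that misclassifies even one positive point can never be the most useful one, so only error-free balls are selected.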
Remove covered points
Remember the ball we chose, and remove the data points that are now covered.
Remove unneeded balls
...and repeat
Repeat until all data points are covered.
We're done
The learned function is
$f(x) = b_1 \wedge \neg b_2 \wedge \neg b_3$
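To make the learned expression concrete, the short sketch below enumerates all eight combinations of ball outputs and keeps those the function accepts. The function name `f` follows the slide; treating each ball output as a plain boolean is an assumption made for illustration.

```python
from itertools import product


def f(b1, b2, b3):
    """Learned function from the party example: b1 and not b2 and not b3."""
    return b1 and not b2 and not b3


# Keep every combination of ball outputs that the function classifies as positive.
accepted = [bits for bits in product([False, True], repeat=3) if f(*bits)]
print(accepted)  # [(True, False, False)]
```

Only one combination is accepted: the point must lie inside the positive ball and outside both negative balls.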

Generalize
Learn the same data, but...
• Another person at party
• Maximum of 3 conjunctions
Limit our solution size
The stopping condition $|R| > s$ in the algorithm bounds the number of balls in the solution.
Tolerate some errors
[Figure: the party data with a data point labeled "Noise". Axis label: "Time at Party".]
Here the algorithm stops only when $N = \emptyset$.
Our existing algorithm will try to choose more, smaller balls to avoid the noisy data point.
Better to allow some errors and get a more general solution.
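How the penalty p decides between one large ball with an error and a smaller error-free ball can be sketched with the usefulness value from the algorithm. The counts below (5 negatives covered with 1 error versus 3 covered with none) are made-up numbers for illustration, and `preferred` is a hypothetical helper.

```python
def usefulness(n_covered, n_errors, p):
    """Usefulness value: negatives covered minus p times positives misclassified."""
    return n_covered - p * n_errors


# Hypothetical candidates as (negatives covered, positives misclassified).
big_ball = (5, 1)
small_ball = (3, 0)


def preferred(p):
    """Name the candidate with the strictly larger usefulness for penalty p."""
    if usefulness(*big_ball, p) > usefulness(*small_ball, p):
        return "big"
    return "small"


for p in (0.5, 1.0, 3.0):
    print(p, preferred(p))
```

With a small penalty the large ball wins despite its error, which gives the more general solution; once p exceeds the extra coverage, the algorithm falls back to the smaller error-free ball.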
Performance Comparison
[Figure: two bar charts comparing SCM and SVM. "Function Size": number of features (axis from 0 to 600). "Error": number of errors (axis from 0 to 300).]
Performance Tuning of SCM
[Figure: two bar charts comparing infinite s and finite s. "SCM Size": number of features (axis from 0 to 150). "SCM Error": number of errors (axis from 0 to 250).]
Conclusion
• Set Covering Machine uses ‘Data-Dependent Balls’ to group training examples into features
• Algorithm selects balls that are the most useful in classifying data
• http://www.inf.uos.de/barbara/lectures/ml/papers/set.pdf
• Edward Dale (scompt@scompt.com)
• Stan James (stan@wanderingstan.com)
Questions?