1
The background
• Duke
– open source data matching engine (Java)
– can find near-duplicate database records
– probabilistic configuration
– http://code.google.com/p/duke/
• People find making configurations difficult
– can we help them?

Field      Record 1    Record 2     Probability
Name       acme inc    acme inc     0.9
Assoc no   177477707                0.5
Zip code   9161        9161         0.6
Country    norway      norway       0.51
Address 1  mb 113      mailbox 113  0.49
Address 2                           0.5
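The per-field probabilities in the table have to be merged into a single score for the record pair. A minimal sketch of one way to do that, assuming the probabilities are combined with the naive Bayes rule (an assumption here; the slide does not say how). The function names `combine` and `match_probability` are illustrative, not Duke's API.

```python
def combine(p1, p2):
    """Combine two independent match probabilities with naive Bayes."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(field_probabilities):
    """Fold the per-field probabilities into one overall probability."""
    overall = 0.5  # neutral starting point: no evidence either way
    for p in field_probabilities:
        overall = combine(overall, p)
    return overall


# Per-field probabilities from the table: Name, Assoc no, Zip code,
# Country, Address 1, Address 2.
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
overall = match_probability(fields)
print(round(overall, 3))  # prints 0.931
```

Under this rule a field scored 0.5 changes nothing, and the 0.51 and 0.49 fields cancel each other, so the result is driven by the name and zip code matches.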
2
The idea
• Given
– a test file showing the correct linkages
• can we
– evolve a configuration
• using
– genetic algorithms?
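To evolve anything, each configuration needs a score computed from the test file. A minimal sketch of one plausible score, the F-measure of the links a configuration finds against the known-correct links. That the scores are F-measures is an assumption, and `normalize` and `f_measure` are hypothetical names, not part of Duke.

```python
def normalize(pair):
    """Treat (a, b) and (b, a) as the same link."""
    return tuple(sorted(pair))


def f_measure(found_links, correct_links):
    """Harmonic mean of precision and recall over sets of record-id pairs."""
    found = {normalize(p) for p in found_links}
    correct = {normalize(p) for p in correct_links}
    true_positives = len(found & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / float(len(found))
    recall = true_positives / float(len(correct))
    return 2 * precision * recall / (precision + recall)


# Made-up example: three correct links, of which the configuration finds
# two (one written in the opposite order) plus one wrong link.
correct = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
found = [("b1", "a1"), ("a2", "b2"), ("a4", "b9")]
score = f_measure(found, correct)
```

A score like this rewards configurations that find many correct links without also producing many wrong ones, which is what the search needs in order to rank candidates.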
3
What a configuration looks like
4
The hill-climbing problem
5
How it works
6
Actual code
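# population, best and highest are set up before this loop (not shown here)
for generation in range(POPULATIONS):  # POPULATIONS: number of generations to run
    print("===== GENERATION %s ================================" % generation)
    # score every configuration and remember the best one seen so far
    for c in population:
        f = evaluate(c)
        if f > highest:
            best = c
            highest = f
            show_best(best, False)

    # mutate: every configuration produces its successor
    population = [c.make_new(population) for c in population]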
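The loop above depends on `evaluate` and `make_new`, which the slide does not show. A self-contained toy version, as a sketch only: here a configuration is just a list of numbers, `evaluate` is a made-up fitness function, and `make_new` is a guess at the kind of step `c.make_new(population)` might perform (pick the fitter of two candidates, copy it, nudge one value).

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

TARGET = [0.8, 0.3, 0.6]  # made-up optimum, unknown to the search


def evaluate(config):
    """Stand-in fitness: 1.0 at TARGET, lower the further away."""
    error = sum(abs(a - b) for a, b in zip(config, TARGET)) / len(TARGET)
    return 1.0 - error


def make_new(config, population):
    """Hypothetical successor step: copy the fitter of two, mutate one value."""
    rival = rng.choice(population)
    parent = config if evaluate(config) >= evaluate(rival) else rival
    child = list(parent)  # copy, so the parent is left untouched
    i = rng.randrange(len(child))
    child[i] = min(1.0, max(0.0, child[i] + rng.uniform(-0.1, 0.1)))
    return child


population = [[rng.random() for _ in TARGET] for _ in range(20)]
initial_best = max(evaluate(c) for c in population)

best, highest = None, 0.0
for generation in range(30):
    for c in population:
        f = evaluate(c)
        if f > highest:
            best, highest = c, f
    population = [make_new(c, population) for c in population]
```

Because `best` is tracked outside the population, a good configuration is never lost even if all of its descendants turn out worse.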
7
Actual code #2
class GeneticConfiguration:
    def __init__(self):
        self._props = []
        self._threshold = 0.0

    def _copy(self):
        c = GeneticConfiguration()
        c.set_threshold(self._threshold)
        for prop in self.get_properties():
            if prop.getName() == "ID":
                c.add_property(Property(prop.getName()))
            else:
                c.add_property(Property(prop.getName(),
                                        prop.getComparator(),
                                        prop.getLowProbability(),
                                        prop.getHighProbability()))
        return c
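The point of `_copy` is that a child configuration can be changed without touching its parent. A minimal sketch of copy-then-mutate using plain Python stand-ins: `Prop`, `Config` and `mutated_copy` are hypothetical names, and the choice of what to mutate (the threshold or one property's probabilities) is an assumption, not necessarily what the real code does.

```python
import copy
import random


class Prop(object):
    """Hypothetical stand-in for a property: a name and two probabilities."""

    def __init__(self, name, low, high):
        self.name, self.low, self.high = name, low, high


class Config(object):
    def __init__(self, threshold, props):
        self.threshold = threshold
        self.props = props

    def mutated_copy(self, rng):
        """Return a changed deep copy; the original stays as it was."""
        child = copy.deepcopy(self)
        choice = rng.randrange(len(child.props) + 1)
        if choice == len(child.props):
            child.threshold = rng.random()  # mutate the threshold
        else:
            prop = child.props[choice]  # mutate one property
            prop.low = rng.uniform(0.0, 0.5)
            prop.high = rng.uniform(0.5, 1.0)
        return child


rng = random.Random(1)
parent = Config(0.6, [Prop("NAME", 0.1, 0.9), Prop("ZIP", 0.4, 0.6)])
child = parent.mutated_copy(rng)
```

A shallow copy would share the property objects between parent and child, so mutating the child would silently change the parent as well.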
8
But ... does it work?!?
9
Linking countries
Id http://dbpedia.org/resource/Samoa Id 17019
10
The actual configuration
Threshold 0.6
Confusing.
11
Semantic dogfood
Threshold 0.91
PersonNameComparator?!?
Otherwise as expected.
13
Hafslund
• 1st generation
– best scores: 0.47, 0.43, 0.3
• 2nd generation
– mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
– best scores: 0.49, 0.467, 0.4, and 0.38
• 3rd generation
– mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
– best scores: 0.49, 0.46, 0.45, and 0.42
• 4th generation
– we hit 0.525 (modified from 0.21)
15
The progress of evolution #2
• 5th generation
– we hit 0.568 (modified from 0.479)
• 6th generation
– 0.602
• 7th generation
– 0.702
• ...
• 60th generation
– 0.765
– I’d done no better than 0.64 manually
16
Evaluation
• We don’t know
• The experts say genetic algorithms tend to get stuck at local maxima
– they also point out that well-known techniques for dealing with this are described in the literature
• Rerunning tends to produce similar configurations
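One of the simplest textbook remedies for local maxima is random restarts: run the search several times from different starting points and keep the best result. A sketch under stated assumptions: this is not necessarily what the experts had in mind, the search below is a plain mutation-only climb rather than the genetic algorithm from these slides, and the fitness landscape is made up.

```python
import math
import random


def fitness(x):
    """Made-up bumpy landscape on [0, 1] with several local maxima."""
    return x * abs(math.sin(6 * math.pi * x))


def climb(rng, steps=200):
    """Mutation-only search from a random start; accepts only improvements."""
    x = rng.random()
    for _ in range(steps):
        candidate = min(1.0, max(0.0, x + rng.uniform(-0.02, 0.02)))
        if fitness(candidate) > fitness(x):
            x = candidate
    return x


def best_of_restarts(seeds):
    """Run one climb per seed and keep the fittest end point."""
    results = [climb(random.Random(seed)) for seed in seeds]
    return max(results, key=fitness)


single = climb(random.Random(0))
restarted = best_of_restarts(range(10))
```

Since the restarted run includes the single run's seed, it can only match or beat it. If every restart lands on roughly the same answer, that is weak evidence the maximum found is not just a local one.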
18
The literature
http://www.cleveralgorithms.com/
http://www.gp-field-guide.org.uk/
19
Conclusion
• Easy to implement
– you don’t need a GP library
• Requires reliable test data
• It actually works
• Configurations may not be very tweakable
– because they don’t necessarily make any sense
• This is a big field, with lots to learn
20 http://www.garshol.priv.no/blog/225.html