Duke - Genetic Algorithm

Experiments in genetic programming
Bouvet BigOne, 2012-03-29

Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

The background
• Duke
– open source data matching engine (Java)
– can find near-duplicate database records
– probabilistic configuration
– http://code.google.com/p/duke/
• People find making configurations difficult
– can we help them?

Field      Record 1   Record 2     Probability
Name       acme inc   acme inc     0.9
Assoc no   177477707  (empty)      0.5
Zip code   9161       9161         0.6
Country    norway     norway       0.51
Address 1  mb 113     mailbox 113  0.49
Address 2  (empty)    (empty)      0.5
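How the per-field probabilities in the table above could turn into one match decision can be sketched as follows. This is a minimal sketch that assumes the probabilities are combined with a naive Bayes rule; the slides do not state the formula, and the function names `combine` and `overall_probability` are invented for illustration.

```python
def combine(p1, p2):
    """Combine two probabilities with a naive Bayes rule (an assumption, see above)."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))

def overall_probability(field_probabilities):
    """Fold the per-field probabilities into one, starting from a neutral 0.5."""
    result = 0.5
    for p in field_probabilities:
        result = combine(result, p)
    return result

# Probability column from the table above
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
p = overall_probability(fields)
print("overall probability: %.3f" % p)
```

Under this rule a field scoring 0.5 leaves the result unchanged, and the 0.51 and 0.49 fields cancel each other out, so only Name and Zip code move the result. The record pair would be accepted as a match if the result is above the configured threshold.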

The idea
• Given
– a test file showing the correct linkages
• can we
– evolve a configuration
• using
– genetic algorithms?
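To evolve anything, each configuration needs a score computed against the test file. The slides report "accuracy" without defining it, so the sketch below assumes an F-measure over pairs of record ids; the function name, the ids, and the pair sets are all hypothetical.

```python
def f_measure(found, correct):
    """Harmonic mean of precision and recall over sets of record-id pairs."""
    hits = len(found & correct)
    if hits == 0:
        return 0.0
    precision = hits / float(len(found))
    recall = hits / float(len(correct))
    return 2 * precision * recall / (precision + recall)

# Hypothetical test file: pairs of ids that really are the same entity.
# frozenset makes ("a1", "b1") and ("b1", "a1") the same pair.
correct = {frozenset(pair) for pair in [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]}

# Hypothetical output of one configuration: two right pairs, one wrong pair
found = {frozenset(pair) for pair in [("b1", "a1"), ("a2", "b2"), ("a4", "b9")]}

score = f_measure(found, correct)
print("score: %.3f" % score)
```

Any score works as long as a higher value means a better configuration, since the algorithm only uses it to rank the population.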

What a configuration looks like
• Threshold for accepting matches
– a number between 0.0 and 1.0
• For each property
– a comparator function (Exact, Levenshtein, numeric...)
– a low probability (0.0-0.5)
– a high probability (0.5-1.0)
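The structure above is small enough to generate at random. The following is a minimal sketch of one way to do that, not Duke's own data model: the dictionary layout and the function names are assumptions, and the comparator list contains only the three names mentioned above.

```python
import random

# Comparator names from the slide; the real list is longer
COMPARATORS = ["Exact", "Levenshtein", "Numeric"]

def random_property(name, rng):
    """One property: a comparator plus a low and a high probability."""
    return {
        "name": name,
        "comparator": rng.choice(COMPARATORS),
        "low": rng.uniform(0.0, 0.5),
        "high": rng.uniform(0.5, 1.0),
    }

def random_configuration(property_names, rng):
    """A threshold between 0.0 and 1.0 plus one random entry per property."""
    return {
        "threshold": rng.uniform(0.0, 1.0),
        "properties": [random_property(n, rng) for n in property_names],
    }

rng = random.Random(42)  # fixed seed so that runs are repeatable
config = random_configuration(["NAME", "CAPITAL", "AREA"], rng)
print(config["threshold"], [p["comparator"] for p in config["properties"]])
```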

The hill-climbing problem

How it works
1. Generate a population of 100 random configurations
2. Evaluate the population
3. Throw away the 25 worst, duplicate the 25 best
4. Randomly modify the entire population
5. Go back to 2

Actual code
# population, best, highest and index are set up before this loop (not shown)
for generation in range(POPULATIONS):
    print("===== GENERATION %s ================================" % generation)

    # evaluate every configuration, remembering the best one seen so far
    for c in population:
        f = evaluate(c)
        if f > highest:
            best = c
            highest = f
    show_best(best, False)

    # make new generation: sort best-first
    # (index maps configuration -> score; presumably filled in by evaluate)
    population = sorted(population, key = lambda c: 1.0 - index[c])
    # ditch lower quartile
    population = population[ : -25]
    # double upper quartile
    population = population[ : 25] + population
    # mutate
    population = [c.make_new(population) for c in population]
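The selection step in the loop above keeps the population size constant: dropping the 25 worst leaves 75, and prepending a second copy of the 25 best brings it back to 100. A self-contained sketch of just that step, using plain integers as stand-in configurations and a hypothetical helper name:

```python
def next_generation_pool(population, score):
    """Sort best-first, drop the worst 25, then prepend a second copy of the best 25."""
    ranked = sorted(population, key=score, reverse=True)
    ranked = ranked[:-25]
    return ranked[:25] + ranked

# Stand-in population: the integers 0..99, where a higher number is "fitter"
population = list(range(100))
pool = next_generation_pool(population, score=lambda c: c)

print(len(pool))       # size after selection
print(min(pool))       # worst survivor
print(pool.count(99))  # how often the best configuration appears
```

After this step the best quarter appears twice, so it gets two chances to produce a better offspring when the whole pool is mutated.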

Actual code #2
import random

# aspects is a module-level list defined elsewhere (not shown on the slides)

class GeneticConfiguration:
    def __init__(self):
        self._props = []
        self._threshold = 0.0

    # set/get threshold, add/get properties

    def make_new(self, population):
        # either we make a number of random modifications, or we mate.
        # draw a number, if 0 modifications, we mate.
        mods = random.randint(0, 3)
        if mods:
            return self._mutate(mods)
        else:
            return self._mate(random.choice(population))

    def _mutate(self, mods):
        # copy this configuration, then modify 'mods' randomly chosen aspects
        c = self._copy()
        for ix in range(mods):
            aspect = random.choice(aspects)
            aspect.modify(c)
        return c

    def _mate(self, other):
        # copy this configuration, then take each aspect from a random parent
        c = self._copy()
        for aspect in aspects:
            aspect.set(c, aspect.get(random.choice([self, other])))
        return c

    def _copy(self):
        c = GeneticConfiguration()
        c.set_threshold(self._threshold)
        for prop in self.get_properties():
            if prop.getName() == "ID":
                # the ID property is copied by name only
                c.add_property(Property(prop.getName()))
            else:
                c.add_property(Property(prop.getName(), prop.getComparator(),
                                        prop.getLowProbability(), prop.getHighProbability()))
        return c
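The code above relies on a list called `aspects` that the slides do not show. From the calls made on it, each aspect needs `get`, `set` and `modify` methods. The following is a guess at what one such aspect could look like, using a hypothetical `Config` stand-in that holds only a threshold; a real version would presumably also have aspects for each property's comparator, low probability and high probability.

```python
import random

class Config(object):
    """Hypothetical stand-in for GeneticConfiguration: just a threshold."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

class ThresholdAspect(object):
    """One tunable part of a configuration: the match threshold."""
    def get(self, config):
        return config.threshold

    def set(self, config, value):
        config.threshold = value

    def modify(self, config):
        # nudge the value by a small random amount, clamped to 0.0-1.0
        delta = random.uniform(-0.1, 0.1)
        self.set(config, min(1.0, max(0.0, self.get(config) + delta)))

aspects = [ThresholdAspect()]

parent_a, parent_b = Config(0.6), Config(0.9)
child = Config()
for aspect in aspects:
    # mating: each aspect is inherited from a randomly chosen parent
    aspect.set(child, aspect.get(random.choice([parent_a, parent_b])))

# mutation: modify one randomly chosen aspect
random.choice(aspects).modify(child)
print(child.threshold)
```

Keeping each tunable value behind the same three methods means mutation and mating never need to know which part of the configuration they are touching.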

But ... does it work?!?

Linking countries
• Linking countries from DBpedia and Mondial
– no common identifiers
• Manually I manage 95.4% accuracy
– genetic script manages 95.7% in first generation
– then improves to 98.9%
– this was too easy...

Field    DBPEDIA                            MONDIAL
Id       http://dbpedia.org/resource/Samoa  17019
Name     Samoa                              Western Samoa
Capital  Apia                               Apia, Samoa
Area     2831                               2860

The actual configuration
Threshold 0.6

PROPERTY  COMPARATOR  LOW   HIGH
NAME      Exact       0.19  0.91
CAPITAL   Exact       0.25  0.86
AREA      Numeric     0.36  0.72

Confusing.
Why exact name comparisons?
Why is area comparison given such weight?
Who knows. There’s nobody to ask.

Semantic dogfood
• Data about papers presented at semantic web conferences
– has duplicate speakers
– about 7,000 records, many long string values
• Manually I get 88% accuracy
– after two weeks, the script gets 82% accuracy
– but it’s only half-way

Field        Record 1                                  Record 2
Name         Grigorios Antoniou                        Grigoris Antoniou
Homepage     http://www.ics.forth.gr/~antoniou         http://www.ics.forth.gr/~antoniou
Mbox_Sha1    f44cd7769f416e96864ac43498b082155196829e  f44cd7769f416e96864ac43498b082155196829e
Affiliation  (empty)                                   http://data.semanticweb.org/organization/forth-ics

The configuration
Threshold 0.91

PROPERTY     COMPARATOR            LOW   HIGH
NAME         JaroWinklerTokenized  0.2   0.9
AFFILIATION  DiceCoefficient       0.49  0.61
HOMEPAGE     Exact                 0.09  0.67
MBOX_HASH    PersonNameComparator  0.42  0.87

Some strange choices of comparator.
PersonNameComparator?!?
DiceCoefficient is essentially the same as Exact, for those values.
Otherwise as expected.

Hafslund
• I took a subset of customer data from Hafslund
– roughly 3000 records
– then made a difficult manual test file, where different parts of organizations are treated as different
– so NSB Logistikk != NSB Bane
– then made another subset for testing
• Manually I can do no better than 64% on this data set
– interestingly, on the test data set I score 84%
• With a cut-down data set, I could run the script overnight, and have a result in the morning

The progress of evolution
• 1st generation
– best scores: 0.47, 0.43, 0.3
• 2nd generation
– mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
– best scores: 0.49, 0.467, 0.4, and 0.38
• 3rd generation
– mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
– best scores: 0.49, 0.46, 0.45, and 0.42
• 4th generation
– we hit 0.525 (modified from 0.21)

The progress of evolution #2
• 5th generation
– we hit 0.568 (modified from 0.479)
• 6th generation
– 0.602
• 7th generation
– 0.702
• ...
• 60th generation
– 0.765
– I’d done no better than 0.64 manually

Evaluation
CONFIGURATION  TRAINING  TEST
Genetic #1     0.766     0.881
Genetic #2     0.776     0.859
Manual #1      0.57      0.838
Manual #2      0.64      0.803

Threshold: 0.98
PROPERTY        COMPARATOR       LOW   HIGH
NAME            Levenshtein      0.17  0.95
ASSOCIATION_NO  Exact            0.06  0.69
ADDRESS1        Numeric          0.02  0.92
ADDRESS2        PersonName       0.18  0.76
ZIP_CODE        DiceCoefficient  0.47  0.79
COUNTRY         Levenshtein      0.12  0.64

Threshold: 0.95
PROPERTY        COMPARATOR       LOW   HIGH
NAME            Levenshtein      0.42  0.96
ASSOCIATION_NO  DiceCoefficient  0.0   0.67
ADDRESS1        Numeric          0.1   0.61
ADDRESS2        Levenshtein      0.03  0.8
ZIP_CODE        DiceCoefficient  0.35  0.69
COUNTRY         JaroWinklerT.    0.44  0.68

Does it find the best configuration?
• We don’t know
• The experts say genetic algorithms tend to get stuck at local maxima
– they also point out that well-known techniques for dealing with this are described in the literature
• Rerunning tends to produce similar configurations
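What getting stuck at a local maximum means can be shown on a toy landscape. The sketch below is only an illustration and has nothing to do with Duke's real fitness function: it uses plain hill climbing over the integers 0 to 20, with a small peak at 5 and a higher one at 15, and then restarts from several starting points, which is one of the standard remedies.

```python
def fitness(x):
    """Toy landscape on 0..20: a small peak at x=5 and a higher one at x=15."""
    return max(0, 3 - abs(x - 5)) + max(0, 6 - abs(x - 15))

def hill_climb(x):
    """Move to the better neighbour until neither neighbour improves."""
    while True:
        neighbours = [n for n in (x - 1, x + 1) if 0 <= n <= 20]
        candidate = max(neighbours, key=fitness)
        if fitness(candidate) <= fitness(x):
            return x
        x = candidate

# Starting near the small peak, the climber never finds the higher one
stuck = hill_climb(4)

# Restarting from several points and keeping the best result avoids that
results = [hill_climb(start) for start in (2, 9, 18)]
best = max(results, key=fitness)

print(stuck, fitness(stuck))
print(best, fitness(best))
```

A single run only finds the peak nearest to where it starts. Restarts help only if at least one of them begins on the slope of a better peak.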

The literature
http://www.cleveralgorithms.com/
http://www.gp-field-guide.org.uk/

Conclusion
• Easy to implement
– you don’t need a GP library
• Requires reliable test data
• It actually works
• Configurations may not be very tweakable
– because they don’t necessarily make any sense
• This is a big field, with lots to learn
http://www.garshol.priv.no/blog/225.html
