
Experiments in genetic programming

Bouvet BigOne, 2012-03-29


Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

1
The background

• Duke
– open source data matching engine (Java)
– can find near-duplicate database records
– probabilistic configuration
– http://code.google.com/p/duke/
• People find making configurations difficult
– can we help them?

Field       Record 1    Record 2      Probability
Name        acme inc    acme inc      0.9
Assoc no    177477707                 0.5
Zip code    9161        9161          0.6
Country     norway      norway        0.51
Address 1   mb 113      mailbox 113   0.49
Address 2                             0.5

2
The idea

• Given
– a test file showing the correct linkages
• can we
– evolve a configuration
• using
– genetic algorithms?

3
What a configuration looks like

• Threshold for accepting matches
– a number between 0.0 and 1.0
• For each property
– a comparator function (Exact, Levenshtein, numeric...)
– a low probability (0.0-0.5)
– a high probability (0.5-1.0)
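
A configuration is thus just a threshold plus a comparator and two probabilities per property. A minimal sketch of generating one at random (plain Python; this is not Duke's actual API, and the comparator names are illustrative):

import random

COMPARATORS = ["Exact", "Levenshtein", "JaroWinkler", "Numeric"]  # illustrative

def random_config(property_names):
    threshold = random.random()                  # 0.0-1.0
    props = [{"name": name,
              "comparator": random.choice(COMPARATORS),
              "low": random.uniform(0.0, 0.5),   # probability when values differ
              "high": random.uniform(0.5, 1.0)}  # probability when values agree
             for name in property_names]
    return (threshold, props)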

4
The hill-climbing problem

5
How it works

1. Generate a population of 100 random configurations
2. Evaluate the population
3. Throw away the 25 worst, duplicate the 25 best
4. Randomly modify the entire population
5. Go back to 2

6
Actual code
for generation in range(POPULATIONS):
    print "===== GENERATION %s ================================" % generation

    for c in population:
        f = evaluate(c)   # computes fitness; also recorded in index
        if f > highest:
            best = c
            highest = f
            show_best(best, False)

    # make new generation: sort by fitness, descending
    population = sorted(population, key = lambda c: 1.0 - index[c])

    # ditch lower quartile
    population = population[ : -25]

    # double upper quartile
    population = population[ : 25] + population

    # mutate
    population = [c.make_new(population) for c in population]
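
evaluate() and index are defined elsewhere in the script; index maps each configuration to its fitness. A plausible sketch, assuming fitness is the f-measure against the test file (load_test_file and run_duke are hypothetical helpers returning sets of links):

index = {}   # configuration -> fitness

def evaluate(config):
    correct = load_test_file(TEST_FILE)   # hypothetical helper
    found = run_duke(config)              # hypothetical: run matching with this config
    both = len(found & correct)
    precision = both / float(len(found)) if found else 0.0
    recall = both / float(len(correct))
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    index[config] = f
    return f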

7
Actual code #2
class GeneticConfiguration:
    def __init__(self):
        self._props = []
        self._threshold = 0.0

    # set/get threshold, add/get properties

    def make_new(self, population):
        # either we make a number of random modifications, or we mate.
        # draw a number; if 0 modifications, we mate.
        mods = random.randint(0, 3)
        if mods:
            return self._mutate(mods)
        else:
            return self._mate(random.choice(population))

    def _mutate(self, mods):
        c = self._copy()
        for ix in range(mods):
            aspect = random.choice(aspects)
            aspect.modify(c)
        return c

    def _mate(self, other):
        c = self._copy()
        for aspect in aspects:
            aspect.set(c, aspect.get(random.choice([self, other])))
        return c

    def _copy(self):
        c = GeneticConfiguration()
        c.set_threshold(self._threshold)
        for prop in self.get_properties():
            if prop.getName() == "ID":
                c.add_property(Property(prop.getName()))
            else:
                c.add_property(Property(prop.getName(),
                                        prop.getComparator(),
                                        prop.getLowProbability(),
                                        prop.getHighProbability()))
        return c
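
The aspects list is defined elsewhere; each aspect knows how to get, set, and randomly modify one dimension of a configuration. A sketch of what the threshold aspect might look like (illustrative, not the actual code):

class ThresholdAspect:
    def get(self, conf):
        return conf.get_threshold()

    def set(self, conf, value):
        conf.set_threshold(value)

    def modify(self, conf):
        # replace the threshold with a new random value
        conf.set_threshold(random.uniform(0.0, 1.0))

aspects = [ThresholdAspect()]   # plus similar aspects for each property parameter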
8
But ... does it work?!?

9
Linking countries

• Linking countries from DBpedia and Mondial
– no common identifiers
• Manually I manage 95.4% accuracy
– genetic script manages 95.7% in first generation
– then improves to 98.9%
– this was too easy...

         DBPEDIA                             MONDIAL
Id       http://dbpedia.org/resource/Samoa   Id       17019
Name     Samoa                               Name     Western Samoa
Capital  Apia                                Capital  Apia, Samoa
Area     2831                                Area     2860

10
The actual configuration

Threshold 0.6

PROPERTY   COMPARATOR   LOW    HIGH
NAME       Exact        0.19   0.91
CAPITAL    Exact        0.25   0.86
AREA       Numeric      0.36   0.72
Confusing.

Why exact name comparisons?

Why is area comparison given such weight?

Who knows. There’s nobody to ask.

11
Semantic dogfood

• Data about papers presented at semantic web conferences
– has duplicate speakers
– about 7,000 records, many long string values
• Manually I get 88% accuracy
– after two weeks, the script gets 82% accuracy
– but it’s only half-way

Field        Record 1                                   Record 2
Name         Grigorios Antoniou                         Grigoris Antoniou
Homepage     http://www.ics.forth.gr/~antoniou          http://www.ics.forth.gr/~antoniou
Mbox_Sha1    f44cd7769f416e96864ac43498b082155196829e   f44cd7769f416e96864ac43498b082155196829e
Affiliation                                             http://data.semanticweb.org/organization/forth-ics
12
The configuration

Threshold 0.91

PROPERTY     COMPARATOR             LOW    HIGH
NAME         JaroWinklerTokenized   0.2    0.9
AFFILIATION  DiceCoefficient        0.49   0.61
HOMEPAGE     Exact                  0.09   0.67
MBOX_HASH    PersonNameComparator   0.42   0.87

Some strange choices of comparator.

PersonNameComparator?!?

DiceCoefficient is essentially the same as Exact for those values.

Otherwise as expected.

13
Hafslund

• I took a subset of customer data from Hafslund
– roughly 3000 records
– then made a difficult manual test file, where different parts of organizations are treated as different
– so NSB Logistikk != NSB Bane
– then made another subset for testing
• Manually I can do no better than 64% on this data set
– interestingly, on the test data set I score 84%
• With a cut-down data set, I could run the script overnight, and have a result in the morning
14
The progress of evolution

• 1st generation
– best scores: 0.47, 0.43, 0.3
• 2nd generation
– mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
– best scores: 0.49, 0.467, 0.4, and 0.38
• 3rd generation
– mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
– best scores: 0.49, 0.46, 0.45, and 0.42
• 4th generation
– we hit 0.525 (modified from 0.21)

15
The progress of evolution #2

• 5th generation
– we hit 0.568 (modified from 0.479)
• 6th generation
– 0.602
• 7th generation
– 0.702
• ...
• 60th generation
– 0.765
– I’d done no better than 0.64 manually

16
Evaluation

CONFIGURATION   TRAINING   TEST
Genetic #1      0.766      0.881
Genetic #2      0.776      0.859
Manual #1       0.57       0.838
Manual #2       0.64       0.803

Threshold: 0.98
PROPERTY         COMPARATOR        LOW    HIGH
NAME             Levenshtein       0.17   0.95
ASSOCIATION_NO   Exact             0.06   0.69
ADDRESS1         Numeric           0.02   0.92
ADDRESS2         PersonName        0.18   0.76
ZIP_CODE         DiceCoefficient   0.47   0.79
COUNTRY          Levenshtein       0.12   0.64

Threshold: 0.95
PROPERTY         COMPARATOR        LOW    HIGH
NAME             Levenshtein       0.42   0.96
ASSOCIATION_NO   DiceCoefficient   0.0    0.67
ADDRESS1         Numeric           0.1    0.61
ADDRESS2         Levenshtein       0.03   0.8
ZIP_CODE         DiceCoefficient   0.35   0.69
COUNTRY          JaroWinklerT.     0.44   0.68

17
Does it find the best configuration?

• We don’t know
• The experts say genetic algorithms tend to get stuck at local maxima
– they also point out that well-known techniques for dealing with this are described in the literature
• Rerunning tends to produce similar configurations

18
The literature

http://www.cleveralgorithms.com/
http://www.gp-field-guide.org.uk/

19
Conclusion

• Easy to implement
– you don’t need a GP library
• Requires reliable test data
• It actually works
• Configurations may not be very tweakable
– because they don’t necessarily make any sense
• This is a big field, with lots to learn

http://www.garshol.priv.no/blog/225.html

20

Linking data without common identifiers

Semantic Web Meetup, 2011-08-23


Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

1
About me

• Lars Marius Garshol, Bouvet consultant
• Worked with semantic technologies since 1999
– mostly Topic Maps
• Before that I worked with XML
• Also background from
– Opera Software (Unicode support)
– open source development (Java/Python)

2
Agenda

• Linking data – the problem
• Record linkage theory
• Duke
• A real-world example
• Usage at Hafslund

3
The problem

• Data sets from different sources generally don’t share identifiers
– names tend to be spelled every which way
• So how can they be linked?

4
A real-world example
               DBPEDIA                              MONDIAL
Id             http://dbpedia.org/resource/Samoa    Id            17019
Name           Samoa                                Name          Western Samoa
Founding date  1962-01-01                           Independence  01 01 1962
Capital        Apia                                 Capital       Apia, Samoa
Currency       Tala                                 Population    214384
Area           2831                                 Area          2860
Leader name    Tuilaepa Aiono Sailele Malielegaoi   GDP           415

5
A difficult problem

• It requires n² comparisons for n records
– a million comparisons for 1000 records, 100 million for 10,000, ...
• Exact string comparison is not enough
– must handle misspellings, name variants, etc.
• Interpreting the data can be difficult even for a human being
– is the address different because there are two different people, or because the person moved?
• ...
6
Statisticians to the rescue!

Help from an unexpected quarter

7
Record linkage

• Statisticians very often must connect data sets from different sources
• They call it “record linkage”
– term coined in 1946 [1]
– mathematical foundations laid in 1959 [2]
– formalized in 1969 as the “Fellegi-Sunter” model [3]
• A whole subject area has been developed with well-known techniques, methods, and tools
– these can of course be applied outside of statistics

[1] http://ajph.aphapublications.org/cgi/reprint/36/12/1412
[2] http://www.sciencemag.org/content/130/3381/954.citation
[3] http://www.jstor.org/pss/2286061

8
Other terms for the same thing

• Has been independently invented many times
• Under many names
– entity resolution
– identity resolution
– merge/purge
– deduplication
– ...
• This makes Googling for information an absolute nightmare

9
Application areas

• Statistics (obviously)
• Data cleaning
• Data integration
• Conversion
• Fraud detection / intelligence / surveillance

10
Mathematical model

11
Model, simplified

• We work with records, each of which has fields with values
– before doing record linkage we’ve processed the data so that they have the same set of fields
– (fields which exist in only one of the sources get discarded, as they are no help to us)
• We compare values for the same fields one by one, then estimate the probability that the records represent the same entity given these values
– probability depends on which field and what values
– estimation can be very complex
• Finally, we can compute the overall probability

12
Example

Field       Record 1    Record 2      Probability
Name        acme inc    acme inc      0.9
Assoc no    177477707                 0.5
Zip code    9161        9161          0.6
Country     norway      norway        0.51
Address 1   mb 113      mailbox 113   0.49
Address 2                             0.5

13
String comparisons

• Must handle spelling errors and name variants
– must estimate the probability that values belong to the same entity despite differences
• Examples
– John Smith ≈ Jonh Smith
– J. Random Hacker ≈ James Random Hacker
– ...
• Many well-known algorithms have been developed
– no one best choice, unfortunately
– some are eager to merge, others less so
• Many are computationally expensive
– O(n²) where n is the length of the string is common
– this gets very costly when the number of record pairs to compare is already high
14
Examples of algorithms
• Levenshtein
– edit distance
– i.e. the number of edits needed to change string 1 into string 2
– not so eager to merge
• Jaro-Winkler
– specifically for short person names
– very promiscuous
• Soundex
– essentially a phonetic hashing algorithm
– very fast, but also very crude
• Longest common subsequence / q-grams
– haven’t tried this yet
• Monge-Elkan
– one of the best, but computationally expensive
• TFIDF
– based on term frequency
– studies often find this performs best, but it’s expensive
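
As an illustration of the genre, here is a textbook Levenshtein distance in Python (a minimal dynamic-programming version, not Duke's implementation):

def levenshtein(s1, s2):
    # prev[j] holds the edit distance between the current prefix of s1 and s2[:j]
    prev = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        cur = [i + 1]
        for j, c2 in enumerate(s2):
            cur.append(min(prev[j + 1] + 1,        # deletion
                           cur[j] + 1,             # insertion
                           prev[j] + (c1 != c2)))  # substitution
        prev = cur
    return prev[len(s2)]

# levenshtein("jonh smith", "john smith") -> 2: a transposition costs two edits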

15
Existing record linkage tools

• Commercial tools
– big, sophisticated, and expensive
– have found little information on what they actually do
– presumably also effective
• Open source tools
– generally made by and for statisticians
– nice user interfaces and rich configurability
– architecture often not as flexible as it could be

16
Standard algorithm

• n² comparisons for n records is unacceptable
– must reduce the number of direct comparisons
• Solution
– produce a key from field values,
– sort records by the key,
– for each record, compare with the n nearest neighbours
– sometimes several different keys are produced, to increase the chances of finding matches
• Downsides
– requires coming up with a key
– difficult to apply incrementally
– sorting is expensive
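
A minimal sketch of this sorted-neighbourhood idea, assuming records are dicts; the key (name prefix plus zip code) is purely illustrative:

def key(record):
    # hypothetical blocking key: first four letters of the name plus the zip code
    return (record["name"][:4], record["zip"])

def candidate_pairs(records, window=5):
    # sort by the key, then compare each record with its nearest neighbours only
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + window]:
            yield (rec, other)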

17
Good research papers

• Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
– http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
• Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
– http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
• Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al
– http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf

18
Duke

DUplicate KillEr

19
Context

• Doing a project for Hafslund where we integrate data from many sources
• Entities are duplicated both inside systems and across systems

[Diagram: suppliers, customers, and companies duplicated across the ERP, CRM, and Billing systems]

20
Requirements

• Must be flexible and configurable
– no way to know in advance exactly what data we will need to deduplicate
• Must scale to large data sets
– CRM alone has 1.4 million customer records
– that’s 2 trillion comparisons with the naïve approach
• Must have an API
– project uses the SDshare protocol everywhere
– must therefore be able to build connectors
• Must be able to work incrementally
– process data as it changes and update conclusions on the fly

21
Reviewed existing tools...

• ...but didn’t find anything matching the criteria


• Therefore...

22
Duke

• Java deduplication engine
– released as open source
– http://code.google.com/p/duke/
• Does not use the key approach
– instead indexes data with Lucene
– does Lucene searches to find potential matches
• Still a work in progress, but
– high performance (1M records in 10 minutes),
– several data sources and comparators,
– being used for real in real projects,
– flexible architecture
23
How it works

• XML configuration file defines the setup
– lists data sources, properties, probabilities, etc.
• Data sources provide streams of records
• Steps:
– normalize values so they are ready for comparison
– collect a batch of records, then index it with Lucene
– compare records one by one against the Lucene index
– do detailed comparisons against the search results
– pairs with probability above a configurable threshold are considered matches
• For now, only API and command-line
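
In outline, the flow is something like the sketch below, with a simple in-memory token index standing in for Lucene. tokens() and compare() are hypothetical helpers (normalization plus tokenization, and the detailed probabilistic comparison); this is not Duke's actual API:

from collections import defaultdict

def find_matches(records, threshold):
    # index the batch of records by their normalized tokens (the Lucene stand-in)
    index = defaultdict(list)
    for rec in records:
        for token in tokens(rec):
            index[token].append(rec)

    # for each record, look up candidates sharing a token, then compare in detail
    for rec in records:
        seen = set()
        for token in tokens(rec):
            for cand in index[token]:
                if cand is rec or id(cand) in seen:
                    continue
                seen.add(id(cand))
                if compare(rec, cand) > threshold:
                    yield (rec, cand)   # each match is reported from both sides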
24
Record comparisons in detail

• Comparator compares field values
– returns a number in the range 0.0-1.0
– the probability for the field is the high probability if the number is 1.0, the low probability if it is 0.0, and otherwise in between the two
• Probability for the entire record is computed using Bayes’s formula
– approach known as “naïve Bayes” in the research literature
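
Read concretely, this is a linear interpolation per field followed by the standard naive Bayes combination. A sketch of that reading (not Duke's actual code, but it reproduces the Samoa example later in the deck, where fields at 0.3, 0.6, and 0.88 combine to roughly 0.824):

def field_probability(similarity, low, high):
    # comparator output 0.0-1.0, mapped linearly onto the low-high interval
    return low + similarity * (high - low)

def combine(probabilities):
    # naive Bayes: multiply the evidence for and against a match
    p_match = p_nonmatch = 1.0
    for p in probabilities:
        p_match *= p
        p_nonmatch *= 1.0 - p
    return p_match / (p_match + p_nonmatch)

# combine([0.3, 0.6, 0.88]) -> 0.825, the slide's 0.824 up to rounding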

25
Components

Data sources:
• CSV
• JDBC
• Sparql
• NTriples
• <plug in your own>

Comparators:
• ExactComparator
• NumericComparator
• SoundexComparator
• TokenizedComparator
• Levenshtein
• JaroWinkler
• Dice coefficient

26
Features

• Fairly complete command-line tool
– with debugging support
• API for embedding the tool
• Two modes:
– deduplication (all records matched against each other)
– linking (only matching across groups of records)
• Pluggable data sources, comparators, and cleaners
• Framework for composing cleaners
• Incremental processing
• High performance
27
Matching Mondial and DBpedia

A real-world example

28
Finding properties to match

• Need properties providing identity evidence
• Matching on the properties in bold below (NAME, CAPITAL, and AREA)
• Extracted data to CSV for ease of use

               DBPEDIA                              MONDIAL
Id             http://dbpedia.org/resource/Samoa    Id            17019
Name           Samoa                                Name          Western Samoa
Founding date  1962-01-01                           Independence  01 01 1962
Capital        Apia                                 Capital       Apia, Samoa
Currency       Tala                                 Population    214384
Area           2831                                 Area          2860
Leader name    Tuilaepa Aiono Sailele Malielegaoi   GDP           415

29
Configuration – data sources
DBpedia:

<group>
  <csv>
    <param name="input-file" value="dbpedia.csv"/>
    <param name="header-line" value="false"/>
    <column name="1" property="ID"/>
    <column name="2"
            cleaner="no.priv...CountryNameCleaner"
            property="NAME"/>
    <column name="3"
            property="AREA"/>
    <column name="4"
            cleaner="no.priv...CapitalCleaner"
            property="CAPITAL"/>
  </csv>
</group>

Mondial:

<group>
  <csv>
    <param name="input-file" value="mondial.csv"/>
    <column name="id" property="ID"/>
    <column name="country"
            cleaner="no.priv...examples.CountryNameCleaner"
            property="NAME"/>
    <column name="capital"
            cleaner="no.priv...LowerCaseNormalizeCleaner"
            property="CAPITAL"/>
    <column name="area"
            property="AREA"/>
  </csv>
</group>

30
Configuration – matching
<schema>
  <threshold>0.65</threshold>

  <property type="id">
    <name>ID</name>
  </property>
  <property>
    <name>NAME</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.3</low>
    <high>0.88</high>
  </property>
  <property>
    <name>AREA</name>
    <comparator>AreaComparator</comparator>
    <low>0.2</low>
    <high>0.6</high>
  </property>
  <property>
    <name>CAPITAL</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.4</low>
    <high>0.88</high>
  </property>
</schema>

<object class="no.priv.garshol.duke.NumericComparator"
        name="AreaComparator">
  <param name="min-ratio" value="0.7"/>
</object>

Duke analyzes this setup and decides only NAME and CAPITAL need to be searched on in Lucene.

31
Result

• Correct links found: 206 / 217 (94.9%)
• Wrong links found: 0 / 12 (0.0%)
• Unknown links found: 0
• Percent of links correct 100.0%, wrong 0.0%, unknown 0.0%
• Records with no link: 25
• Precision 100.0%, recall 94.9%, f-number 0.974
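
For reference, the f-number is the harmonic mean of precision and recall, which reproduces the figure above:

precision = 1.0    # 100.0%
recall = 0.949     # 94.9%
f = 2 * precision * recall / (precision + recall)   # = 0.974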

32
Examples
Field        DBpedia         Mondial          Field        DBpedia         Mondial
Name         albania         albania          Name         kazakhstan      kazakstan
Area         28748           28750            Area         2724900         2717300
Capital      tirana          tirane           Capital      astana          almaty
Probability  0.980                            Probability  0.838

Field        DBpedia         Mondial          Field        DBpedia         Mondial
Name         côte d'ivoire   cote divoire     Name         grande comore   comoros
Area         322460          322460           Area         1148            2170
Capital      yamoussoukro    yamoussoukro     Capital      moroni          moroní
Probability  0.975                            Probability  0.440

Field        DBpedia         Mondial          Field        DBpedia         Mondial
Name         samoa           western samoa    Name         serbia          serbia and mont
Area         2831            2860             Area         102350          88361
Capital      apia            apia             Capital      sarajevo        sarajevo
Probability  0.824                            Probability  0.440

33
Choosing the right match

Field        DBpedia   Mondial          Probability
Name         samoa     western samoa    0.3
Area         2831      2860             0.6
Capital      apia      apia             0.88
Probability                             0.824

Field        DBpedia   Mondial          Probability
Name         samoa     american samoa   0.3
Area         2831      199              0.4
Capital      apia      pago pago        0.4
Probability                             0.067

34
An example of failure
Field        DBpedia     Mondial
Name         kazakhstan  kazakstan
Area         2724900     2717300
Capital      astana      almaty
Probability  0.838

• Duke doesn’t find this match
– no tokens matching exactly
– the Lucene search finds nothing
• Detailed comparison gives the correct result
– so the only problem is the Lucene search
• Lucene does have Levenshtein search, but
– in Lucene 3.x it’s very slow
– therefore not enabled now
– thinking of adding an option to enable it where needed
– Lucene 4.x will fix the performance problem
35
The effects of value matching

Case                         Precision   Recall   F
Optimal                      100%        94.9%    97.4%
Cleaning, exact compare      99.3%       73.3%    84.4%
No cleaning, good compare    100%        93.4%    96.6%
No cleaning, exact compare   ?           ?        ?
Cleaning, JaroWinkler        97.6%       96.7%    97.2%
Cleaning, Soundex            95.9%       97.2%    96.6%

36
Usage at Hafslund

Duke in real life

37
The SESAM project

• Building a new archival system
– does automatic tagging of documents based on knowledge about the data
– knowledge extracted from backend systems
• Search engine-based frontend
– using Recommind for this
• Very flexible architecture
– extracted data stored in a triple store (Virtuoso)
– all data transfer based on the SDshare protocol
– data extracted from RDBMSs with Ontopia’s DB2TM

38
The big picture
[Diagram: ERP, CRM, and Billing (the sources containing the duplicates) feed Virtuoso over SDshare; Recommind and 360 consume data over SDshare; Duke reads from and writes back to Virtuoso over SDshare, and the triple store contains owl:sameAs and haf:possiblySameAs links]
39
Experiences so far

• Incremental processing works fine
– links added and retracted as data changes
• Performance not an issue at all
– but then only ~50,000 records so far...
• Matching works well, but not perfectly
– data are very noisy and messy
– also, matching is hard

40
Duke roadmap

• 0.3
– clean up the public API and document properly
– maybe some more comparators
– support for writing owl:sameAs to Sparql endpoint
• 0.4
– add a web service interface
• 0.5 and onwards
– more comparators
– maybe some parallelism

41
Comments/questions?

Slides will appear on http://slideshare.net/larsga

42
