Duke Linking Data
1
About me
2
Agenda
3
The problem
4
A real-world example
DBPEDIA: Id http://dbpedia.org/resource/Samoa
MONDIAL: Id 17019
5
A difficult problem
7
Record linkage
1) http://ajph.aphapublications.org/cgi/reprint/36/12/1412
2) http://www.sciencemag.org/content/130/3381/954.citation
3) http://www.jstor.org/pss/2286061
8
Other terms for the same thing
9
Application areas
• Statistics (obviously)
• Data cleaning
• Data integration
• Conversion
• Fraud detection / intelligence / surveillance
10
Mathematical model
11
Model, simplified
12
Example
13
String comparisons
15
Existing record linkage tools
• Commercial tools
– big, sophisticated, and expensive
– have found little information on what they actually do
– presumably also effective
• Open source tools
– generally made by and for statisticians
– nice user interfaces and rich configurability
– architecture often not as flexible as it could be
16
Standard algorithm
17
Good research papers
18
Duke
DUplicate KillEr
19
Context
Suppliers
Companies
20
Requirements
21
Reviewed existing tools...
22
Duke
25
Components
26
Features
A real-world example
28
Finding properties to match
DBPEDIA: Id http://dbpedia.org/resource/Samoa
MONDIAL: Id 17019
29
Configuration – data sources
DBpedia source:
<group>
  <csv>
    <param name="input-file" value="dbpedia.csv"/>
    <param name="header-line" value="false"/>
    <column name="1" property="ID"/>
    <column name="2"
            cleaner="no.priv...CountryNameCleaner"
            property="NAME"/>
    <column name="3"
            property="AREA"/>
    <column name="4"
            cleaner="no.priv...CapitalCleaner"
            property="CAPITAL"/>
  </csv>
</group>

Mondial source:
<group>
  <csv>
    <param name="input-file" value="mondial.csv"/>
    <column name="id" property="ID"/>
    <column name="country"
            cleaner="no.priv...examples.CountryNameCleaner"
            property="NAME"/>
    <column name="capital"
            cleaner="no.priv...LowerCaseNormalizeCleaner"
            property="CAPITAL"/>
    <column name="area"
            property="AREA"/>
  </csv>
</group>
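To make the column mapping concrete, here is a minimal Python sketch of what such a data source configuration roughly describes: read a header-less CSV file, map numbered columns to properties, and run a cleaner over selected values. It is not Duke's implementation. The `lower_case_normalize` function is a hypothetical stand-in for the cleaner classes named in the configuration, and the sample row is illustrative only.

```python
import csv
import io


def lower_case_normalize(value):
    """Hypothetical cleaner: trim, lowercase, and collapse inner whitespace."""
    return " ".join(value.lower().split())


# (zero-based column index, property name, optional cleaner).
# The configuration numbers columns from 1; this sketch uses Python's 0-based indices.
DBPEDIA_COLUMNS = [
    (0, "ID", None),
    (1, "NAME", lower_case_normalize),
    (2, "AREA", None),
    (3, "CAPITAL", lower_case_normalize),
]


def read_records(text, columns):
    """Turn header-less CSV text into a list of property dictionaries."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        record = {}
        for index, prop, cleaner in columns:
            value = row[index]
            record[prop] = cleaner(value) if cleaner else value
        records.append(record)
    return records


# Illustrative row, not taken from the real dbpedia.csv file.
sample = "http://dbpedia.org/resource/Albania,Albania,28748, Tirana \n"
records = read_records(sample, DBPEDIA_COLUMNS)
print(records[0])
```

Cleaning before comparison matters because the comparators then see normalized values such as "tirana" rather than raw, inconsistently formatted strings.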
30
Configuration – matching
<object class="no.priv.garshol.duke.NumericComparator"
        name="AreaComparator">
  <param name="min-ratio" value="0.7"/>
</object>

<schema>
  <threshold>0.65</threshold>
  <property type="id">
    <name>ID</name>
  </property>
  <property>
    <name>NAME</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.3</low>
    <high>0.88</high>
  </property>
  <property>
    <name>AREA</name>
    <comparator>AreaComparator</comparator>
    <low>0.2</low>
    <high>0.6</high>
  </property>
  <property>
    <name>CAPITAL</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.4</low>
    <high>0.88</high>
  </property>
</schema>

Duke analyzes this setup and decides only NAME and CAPITAL need to be
searched on in Lucene.
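The two comparator types in this configuration can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not Duke's code: the normalization in `string_similarity` (one minus distance over the longer string's length) and the reading of `min-ratio` in `numeric_similarity` (ratio of smaller to larger value, cut to zero below the minimum) are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def numeric_similarity(x, y, min_ratio=0.7):
    """Assumed reading of min-ratio: smaller/larger, or 0.0 below the cutoff."""
    smaller, larger = sorted((abs(x), abs(y)))
    if larger == 0:
        return 1.0
    ratio = smaller / larger
    return ratio if ratio >= min_ratio else 0.0


print(string_similarity("tirana", "tirane"))
print(numeric_similarity(28748, 28750))
```

Both functions return a value between 0 and 1, which is what lets a schema weigh very different kinds of evidence (names, areas, capitals) on one scale.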
31
Result
32
Examples
Field        DBpedia     Mondial
Name         albania     albania
Area         28748       28750
Capital      tirana      tirane
Probability  0.980

Field        DBpedia     Mondial
Name         kazakhstan  kazakstan
Area         2724900     2717300
Capital      astana      almaty
Probability  0.838
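One way to arrive at a single probability per record pair is to turn each property's similarity into a probability between that property's low and high values, then combine the results with naive Bayes. The Python sketch below illustrates that idea using the low, high, and threshold values from the matching configuration. The linear mapping in `property_probability` and the 0.5 starting prior are assumptions for illustration, so the numbers it produces are not expected to reproduce the probabilities shown on the slide.

```python
def property_probability(similarity, low, high):
    """Assumed mapping: interpolate linearly from low (0.0) to high (1.0)."""
    return low + (high - low) * similarity


def combine(p1, p2):
    """Naive Bayes combination of two probabilities treated as independent."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(similarities, schema):
    """Fold every property's evidence into one overall probability."""
    probability = 0.5  # assumed neutral starting point
    for name, similarity in similarities.items():
        low, high = schema[name]
        probability = combine(
            probability, property_probability(similarity, low, high))
    return probability


# (low, high) per property and the threshold, as in the matching configuration.
SCHEMA = {"NAME": (0.3, 0.88), "AREA": (0.2, 0.6), "CAPITAL": (0.4, 0.88)}
THRESHOLD = 0.65

# Hypothetical similarity scores for one record pair.
similarities = {"NAME": 1.0, "AREA": 1.0, "CAPITAL": 0.8}
probability = match_probability(similarities, SCHEMA)
print(probability, probability >= THRESHOLD)
```

A high value above 0.5 means agreement on that property counts as evidence for a match, and a low value below 0.5 means disagreement counts against it. AREA, with a high of only 0.6, can nudge the result but never decide it alone.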
33
Choosing the right match
34
An example of failure
• Duke doesn’t find this match
– no tokens matching exactly

Field    DBpedia     Mondial
Name     kazakhstan  kazakstan
Area     2724900     2717300
Capital  astana      almaty
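The failure can be reproduced with a small Python sketch of exact-token candidate lookup. It assumes, as a simplification of what the slide describes, that a record is only retrieved as a candidate when it shares at least one identical token with the query record on a searched property (NAME or CAPITAL). The function name `shares_exact_token` is hypothetical and this is not how Lucene itself is implemented.

```python
def shares_exact_token(a, b, properties=("NAME", "CAPITAL")):
    """True if the records share an identical token on any searched property."""
    for prop in properties:
        if set(a[prop].split()) & set(b[prop].split()):
            return True
    return False


# The failing pair from the slide.
kazakhstan_dbpedia = {"NAME": "kazakhstan", "CAPITAL": "astana"}
kazakhstan_mondial = {"NAME": "kazakstan", "CAPITAL": "almaty"}

# The Albania pair, where the names are identical.
albania_dbpedia = {"NAME": "albania", "CAPITAL": "tirana"}
albania_mondial = {"NAME": "albania", "CAPITAL": "tirane"}

print(shares_exact_token(kazakhstan_dbpedia, kazakhstan_mondial))
print(shares_exact_token(albania_dbpedia, albania_mondial))
```

The Kazakhstan records differ by one letter in the name and have different capitals, so an exact-token lookup never pairs them up and the comparators never get to score them.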
36
Usage at Hafslund
37
The SESAM project
38
The big picture
[Diagram: 360, CRM, Billing, and Duke, connected by SDshare links. Callout: "DUPLICATES!". Label: "contains owl:sameAs and haf:possiblySameAs".]
39
Experiences so far
40
Duke roadmap
• 0.3
– clean up the public API and document properly
– maybe some more comparators
– support for writing owl:sameAs to Sparql endpoint
• 0.4
– add a web service interface
• 0.5 and onwards
– more comparators
– maybe some parallelism
41
Comments/questions?
42