
Similarity-Aware Indexing for Real-Time Entity Resolution

Peter Christen, Ross Gayler, David Hawking


Introduction

Entity resolution is the task of identifying and matching records that refer to the same entities from different databases. Traditionally, this task is applied in batch mode and on static databases, for example to find records that relate to the same patient in different health databases, or duplicate records in bibliographic databases.

Most research in entity resolution has concentrated on improving the matching quality and scalability to large databases, or reducing the manual efforts required in the entity resolution process. Many organisations are however increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include identity verification for online services and benefits, national security databases, and digital libraries.

The aim of real-time entity resolution is to match query records containing entities as fast as possible to one or several databases that contain records about existing entities. The approach must facilitate approximate matching; scale efficiently to large databases that contain many million records; and generate a match score that indicates the likelihood that a matched record in the database refers to the same entity as the one of the query record.

Indexing for Real-time Entity Resolution

The idea of our similarity-aware index is to calculate similarities between unique attribute values once during the build phase, so they don't need to be calculated for every query record. Different to other approximate query matching techniques, our approach allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be used.

The index consists of three data structures: the block index BI contains encodings and their corresponding attribute values (known as 'blocks' in record linkage); the similarity index SI holds the pre-calculated similarities within each block; and the record identifier index RI contains attribute values and their identifiers.

Query records can either refer to an entity stored in the index or to a new, unknown entity. It is assumed that query records can contain variations, errors, out-of-date and missing values. For values not stored in the index, the similarities between attribute values need to be calculated at query time.

For each query record, the matching process returns a ranked list of potential matches and their similarities with the query record. A true match is achieved if one of the top ranked records refers to the same entity as the query record.

Experimental Evaluation

We compared the similarity-aware index approach with standard blocking (basic inverted index) as traditionally used for entity resolution. Both were implemented in Python, and experiments were conducted on an idle Linux server with two quad-core 2.33 GHz 64-bit CPUs and 8 GBytes of memory.

The experiments were conducted using a data set of nearly 7 million records containing surnames, postcodes and suburb (town) names sourced from an Australian telephone directory from 2002 (see footnote 1). A given name attribute was added to this data set based on a list of around 80,000 unique given names and their frequencies.

To evaluate scalability, we created test data sets of four different sizes containing 10%, 40%, 70% and 100% of the full data set. Both index approaches were queried with five sets of query records containing zero to four manual modifications per record.
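
The three index structures and the three query steps described in the indexing section can be sketched in Python, the language the experiments report using. This is a minimal single-attribute illustration on the surnames of Figure 1, not the authors' implementation. The function names (soundex, similarity, build_index, query) are hypothetical, and the ratio from difflib's SequenceMatcher is an assumed stand-in for the similarity function, so the scores will differ from the illustrative values shown in Figure 1.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Soundex digit for each consonant; vowels and y have no code.
_CODES = {ch: str(digit)
          for digit, letters in enumerate(
              ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
          for ch in letters}


def soundex(value):
    """Four-character Soundex code in lower case (assumed encoding function)."""
    letters = [ch for ch in value.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded, prev = letters[0], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":  # h and w do not separate equal codes
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]


def similarity(a, b):
    """Assumed stand-in similarity in [0, 1]; any comparison function fits."""
    return SequenceMatcher(None, a, b).ratio()


def build_index(records, encode=soundex, sim=similarity):
    """Build BI, SI and RI for one attribute from {record_id: value}."""
    bi, si, ri = defaultdict(set), defaultdict(dict), defaultdict(set)
    for rec_id, value in records.items():
        ri[value].add(rec_id)          # RI: value -> record identifiers
        bi[encode(value)].add(value)   # BI: encoding -> unique values
    for block in bi.values():
        # SI: each pair of unique values in a block is compared once, at build time.
        for v1, v2 in combinations(sorted(block), 2):
            si[v1][v2] = si[v2][v1] = sim(v1, v2)
    return bi, si, ri


def query(value, bi, si, ri, encode=soundex, sim=similarity):
    """Return (record_id, similarity) pairs ranked by decreasing similarity."""
    scores = {}
    for other in bi.get(encode(value), ()):   # (1) values from the same block
        if other == value:
            s = 1.0
        elif value in ri:
            s = si[value][other]              # (2) pre-calculated similarity
        else:
            s = sim(value, other)             # value not in the index: compare now
        for rec_id in ri[other]:              # (3) accumulate per record identifier
            scores[rec_id] = scores.get(rec_id, 0.0) + s
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Example database from Figure 1.
records = {"r1": "smith", "r2": "miller", "r3": "peter", "r4": "myler",
           "r5": "smyth", "r6": "millar", "r7": "smith", "r8": "miller"}
bi, si, ri = build_index(records)
ranked = query("miller", bi, si, ri)   # exact matches r2 and r8 rank first
unseen = query("mylar", bi, si, ri)    # value not in the index, same block m460
```

Because encode and sim are passed as arguments, domain-specific functions can be swapped in without changing the rest of the sketch, which mirrors the statement that any encoding and comparison function can be used. With a single attribute the accumulation in step (3) reduces to one similarity per record; summing over several attributes would be a natural extension, but that is not shown here.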

Record ID   Surname   Soundex encoding
r1          smith     s530
r2          miller    m460
r3          peter     p360
r4          myler     m460
r5          smyth     s530
r6          millar    m460
r7          smith     s530
r8          miller    m460

Example query record: surname miller, Soundex encoding m460

(1) Use encoding to get values from same block

BI
m460   millar, miller, myler
p360   peter
s530   smith, smyth

(2) Get all relevant pre-calculated similarities

SI
millar   miller 0.9, myler 0.7
miller   millar 0.9, myler 0.8
myler    miller 0.8, millar 0.7
peter    (no entries)
smith    smyth 0.9
smyth    smith 0.9

(3) Accumulate similarities for matching record identifiers

RI
millar   r6
miller   r2, r8
myler    r4
peter    r3
smith    r1, r7
smyth    r5

Figure 1: Example database and corresponding similarity-aware index.

[Plots not reproduced. Four panels compare Standard Blocking with the Sim-Aware Index: build time (log seconds), average query time (log seconds) and memory usage (log MBytes), each against the number of records in the data set (691,751; 2,767,006; 4,842,260; 6,917,514), and accuracy against the number of modifications per record (0 to 4) for the data set with 6,917,514 records.]

Figure 2: Summary of experimental results.

Conclusion and Future Work

We presented a novel index approach for real-time entity resolution, and evaluated it experimentally on a real-world data set. The experiments showed that this approach can match query records more than two orders of magnitude faster than a basic index approach traditionally used for entity resolution.

Future work includes improving the accuracy of the proposed approach, a proper analysis of its time and space complexity, improving scalability and query matching time, and conducting experiments on various other large data sets.

Peter Christen: School of Computer Science, The Australian National University, Canberra ACT 0200, Australia; peter.christen@anu.edu.au
Ross Gayler: Scoring Solutions, Veda Advantage, Melbourne VIC 3000, Australia; ross.gayler@vedaadvantage.com
David Hawking: Funnelback Pty Ltd, Dickson ACT 2601, Australia; david.hawking@acm.org
1 http://www.australiaondisc.com
