The idea of our similarity-aware index is to calculate similarities between unique attribute values once during the build phase, so they don't need to be calculated for every query record. Unlike other approximate query matching techniques, our approach allows any similarity comparison function, and any `blocking' (encoding) function, both possibly domain specific, to be used.

A given name attribute was added to this data set based on a list of around 80,000 unique given names and their frequencies. To evaluate scalability, we created test data sets of four different sizes containing 10%, 40%, 70% and 100% of the full data set. Both index approaches were queried with five sets of query records containing zero to four manual modifications per record.
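The build-phase idea described above (computing similarities between unique attribute values once, within each block) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `soundex`, `similarity`, `build_index` and `query` are hypothetical, the Soundex variant is the common American one, and `difflib.SequenceMatcher` is only a stand-in for whatever comparison function an application would plug in, so its scores will differ from any values reported in the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Letter-to-digit table of standard American Soundex (assumed variant).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Encode a non-empty alphabetic name, e.g. 'smith' -> 's530'."""
    name = name.lower()
    digits = []
    prev = _CODES.get(name[0], "")
    for ch in name[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = code
    return (name[0] + "".join(digits) + "000")[:4]


def similarity(a, b):
    """Stand-in comparison function (an assumption, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()


def build_index(records, encode=soundex, sim=similarity):
    """Build phase: compare each new unique value with its block once."""
    record_ids = defaultdict(list)  # attribute value -> record identifiers
    blocks = defaultdict(set)       # encoding -> unique attribute values
    sims = defaultdict(dict)        # value -> {other value: similarity}
    for rid, value in records:
        record_ids[value].append(rid)
        block = blocks[encode(value)]
        if value not in block:
            for other in block:
                score = sim(value, other)
                sims[value][other] = score
                sims[other][value] = score
            block.add(value)
    return record_ids, blocks, sims


def query(value, index, encode=soundex, sim=similarity):
    """Query phase: look up stored similarities; compute only if unseen."""
    record_ids, blocks, sims = index
    if value in record_ids:
        candidates = {value: 1.0, **sims.get(value, {})}
    else:
        candidates = {v: sim(value, v) for v in blocks.get(encode(value), ())}
    scores = {rid: s for v, s in candidates.items() for rid in record_ids[v]}
    return sorted(scores.items(), key=lambda item: -item[1])


records = [("r1", "smith"), ("r2", "miller"), ("r3", "peter"),
           ("r4", "myler"), ("r5", "smyth")]
index = build_index(records)
print(query("smyth", index))   # known value: answered by look-ups only
print(query("millar", index))  # unseen value: compared within its block
```

In this sketch a query value that was already seen at build time needs no similarity calculations at all, while an unseen value is compared only with the unique values that share its encoding.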
Record ID   Surname   Soundex encoding
r1          smith     s530
r2          miller    m460
r3          peter     p360
r4          myler     m460
r5          smyth     s530

[Figure: example similarity-aware index with an example query record. Recoverable entries: block p360 contains peter; stored similarities are myler: miller 0.8, millar 0.7; smith: smyth 0.9; smyth: smith 0.9. Recoverable step label: (3) Accumulate similarities for matching record identifiers.]

Figure 2: Summary of experimental results.
[Panels: Build time; Average query time; Memory usage; Accuracy for data set with 6,917,514 records. Series: Standard Blocking, Sim-Aware Index. Axes: log seconds; Number of records in data set (691,751; 2,767,006; 4,842,260; 6,917,514); Number of modifications per record (0 to 4).]