
Unsupervised Graph-Based Entity Resolution

for Complex Entities

NISHADI KIRIELLE, PETER CHRISTEN, and THILINA RANBADUGE,


School of Computing, The Australian National University, Australia

Entity resolution (ER) is the process of linking records that refer to the same entity. Traditionally, this process
compares attribute values of records to calculate similarities and then classifies pairs of records as referring
to the same entity or not based on these similarities. Recently developed graph-based ER approaches combine
relationships between records with attribute similarities to improve linkage quality. Most of these approaches
only consider databases containing basic entities that have static attribute values and static relationships, such
as publications in bibliographic databases. In contrast, temporal record linkage addresses the problem where
attribute values of entities can change over time. However, neither existing graph-based ER nor temporal
record linkage can achieve high linkage quality on databases with complex entities, where an entity (such as
a person) can change its attribute values over time while having different relationships with other entities at
different points in time. In this article, we propose an unsupervised graph-based ER framework that is aimed
at linking records of complex entities. Our framework provides five key contributions. First, we propagate
positive evidence encountered when linking records for use in subsequent link decisions by carrying forward
attribute values that have changed. Second, we employ negative evidence by applying temporal and link constraints to
restrict which candidate record pairs to consider for linking. Third, we leverage the ambiguity of attribute
values to disambiguate similar records that nevertheless belong to different entities. Fourth, we adaptively exploit
the structure of relationships to link records that have different relationships. Fifth, using graph measures,
we refine matched clusters of records by removing likely wrong links between records. We conduct extensive
experiments on seven real-world datasets from different domains showing that on average our unsupervised
graph-based ER framework can improve precision by up to 25% and recall by up to 29% compared to several
state-of-the-art ER techniques.
CCS Concepts: • Theory of computation → Data integration; • Information systems → Entity resolution;
Additional Key Words and Phrases: Record linkage, data linkage, data cleaning, dependency graph, temporal
data, ambiguity
ACM Reference format:
Nishadi Kirielle, Peter Christen, and Thilina Ranbaduge. 2023. Unsupervised Graph-Based Entity Resolution
for Complex Entities. ACM Trans. Knowl. Discov. Data. 17, 1, Article 12 (February 2023), 30 pages.
https://doi.org/10.1145/3533016

Authors’ address: N. Kirielle, P. Christen, and T. Ranbaduge, School of Computing, The Australian National University,
Canberra, ACT 2600, Australia; emails: {nishadi.kirielle, peter.christen, thilina.ranbaduge}@anu.edu.au.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.
© 2023 Copyright held by the owner/author(s).
1556-4681/2023/02-ART12
https://doi.org/10.1145/3533016


1 INTRODUCTION
Entity resolution (ER) is the process used in data integration to identify and group records
into clusters that refer to the same entity, where records can be sourced from one or multiple
databases [7, 41]. Generally, records used in ER have multiple attributes (commonly known as
quasi-identifiers [10]) that describe an entity. For example, a person entity can have a birth record
with attributes such as the baby’s name, sex, place of birth, date of birth, as well as the details of
the parents. The process of integrating different databases is required in domains such as health
analytics, national censuses, e-commerce, crime and fraud detection, national security, and the
social sciences [7, 15, 41].
Traditional ER approaches only consider the similarities between attribute values of each com-
pared record pair individually to identify matches [16], while graph-based collective ER techniques
make use of the relationships between entities to improve match decisions [2, 14, 28, 33].1
Most research in ER has only focused on entities that have static attribute values, where these
values can contain variations, abbreviations, and errors, or be missing. Such entities also only
have static relationships that are the same in all records that represent the same entity [2, 14]. We
refer to such entities as basic entities. Basic entities are, for example, publications or consumer
products. When linking publication records across two bibliographic databases, for example, an
article published in a conference or journal has the same title, venue, and a single author or a group
of coauthors across both databases, potentially with some data errors, variations, or missing values
in these attributes. These attribute values (unless being corrected after publication), however, do
not change for a given publication record. Similarly, the relationship of being an author in a given
publication is fixed and also does not change over time. This is assuming the ER task is to link
publications across two databases; the task of linking authors will involve complex entities (as
described next) because the details of authors, such as their names and affiliations, can change
over time.
Research in temporal record linkage has explored the effect of temporally changing attribute
values, such as address or name changes when people move or get married, in the ER process [35].
However, these approaches are limited to adjusting the attribute similarities between individual
records based on their temporal distances and the likelihood that an attribute value can change
over time. For example, address values generally change more often than surname values [27] as
people are more likely to change their address in a given period of time than their surname. These
approaches, however, do not consider that the relationships encountered between certain types of
entities, such as people, can also be different at various points in time. We refer to types of entities
that can have changing attribute values as well as different types of relationships at various points
in time as complex entities. As we show in Section 6, existing graph-based collective ER techniques
fail to achieve high linkage quality for situations where complex entities need to be resolved [2]
due to the changing nature of attribute values and different relationships.
As an example, if we consider a set of birth, marriage, and death certificates as a set of databases
of complex entities, then these databases will contain records of people at different stages of their
life. For instance, the same person can appear as a baby in a birth certificate, a bride in a marriage
certificate, and then as a mother in the birth certificates of her own children. The structure of
relationships in these certificates, within the same or across different databases, can be complex
because the same entity can play a different role in each relationship and can have different types
of relationships at different points in their lives. As a baby, an entity has a childOf relationship with

1 As we describe formally in Section 3, throughout this article we refer to a set of matched records that supposedly
correspond to the same entity as a cluster of records, while we name a set of records that are relationally connected a
group of records.


Table 1. Frequencies (Minimum, Average, and Maximum) of the Number of Entities that Share an
Attribute Value in Databases that Contain Complex or Basic Entities, where we Only Show the Most
Unique Attribute (with the Lowest Average Frequency)
Dataset | Domain | Entity Type | Attribute with Least Ambiguity | Entity Count | Attribute Value Frequencies (Minimum / Average / Maximum)
Isle of Skye [45] | Demographic | Complex | First name (birth babies) | 12,285 | 1 / 21.80 / 1,089
Kilmarnock [45] | Demographic | Complex | First name (birth babies) | 23,715 | 1 / 5.82 / 1,837
IPUMS [47] | Census | Complex | First name | 21,828 | 1 / 5.68 / 1,009
NCVR [47] | Census | Complex | Middle name | 8,214,211 | 1 / 18.31 / 181,839
Scholar [13] | Bibliographic | Basic | Publication title | 64,263 | 1 / 1.02 / 51
Million Songs [13] | Songs | Basic | Song title | 35,463 | 1 / 1.14 / 110
IMDB [13] | Movies | Basic | Movie name | 6,407 | 1 / 1.15 / 3

her parents from the birth record, while as a married bride she then has a spouseOf relationship
with her husband, and when she has a baby herself will have a motherOf relationship in her baby’s
birth certificate. Using a motivating example, in Section 2, we describe the different challenges that
can occur with complex entities.
Furthermore, ambiguity in attribute values is a common problem in the ER process that involves
both basic and complex entities. Entities such as people seem to have higher levels of ambiguity in
their attribute values compared to entities such as publications. It is common for many individuals,
for example, to share the same first name or surname, the same city and postcode values, or the
same occupation. On the other hand, publication titles are generally rather unique. In Table 1,
we illustrate this issue by showing the least ambiguous attribute (where values are shared by the
smallest numbers of entities based on ground truth data) from a variety of datasets as commonly
used in ER research. Publication titles in Scholar, song titles in Million Songs, and movie titles in
IMDB stand out, with an average of only slightly more than one entity having a given attribute
value. On the other hand, for the Isle of Skye (IOS) dataset [45] (which we use in our evaluation
in Section 6), the values of first names are on average shared by more than twenty individuals (at
least five individuals for the other datasets that contain complex entities). This higher ambiguity
makes the ER process more challenging.
Although there exists collective ER work that studies disambiguation [2] and changing attribute
values [14], none of it has explored how to address the problem of disambiguation in a context
where attribute values as well as relationships can change over time. For example, if we have two
person records of a woman before and after her marriage in which she changes her surname, and
both surnames before and after her marriage are ambiguous (such as “Miller” and “Smith”), then
we still need to be able to identify that these two records refer to the same woman.
Another important aspect is that many practical ER applications suffer from missing, incomplete,
or biased ground truth data (known true matches and non-matches). Particularly in the context
of databases that contain complex entities such as person records, ground truth data are often
not available, or if available, they might be limited to manually curated, biased, and incomplete
matches [10]. Therefore, in spite of the growing interest in applying supervised techniques such
as deep learning [31, 37, 39, 53], unsupervised techniques are still required in many practical ER
applications.
Being able to link complex entities is highly important in domains such as medical research,
where linking patient records of individuals and families over time can help detect patterns of
how diseases spread through households and communities, and even facilitate novel genealogy
studies [33]; in national census analyses that help governments to better understand patterns
of education, migration, fertility, and social mobility over time [10, 18]; in social network anal-
ysis to identify the interests and connections of individuals; and in the domain of population
reconstruction [3] that intends to link databases of whole populations to reconstruct family trees


Table 2. Sample Records from One Birth Certificate and Two Death Certificates, as Discussed
in the Example in Section 2
Certificate Type | ID | Event Year | Baby / Deceased Person | Mother | Father | Spouse
Birth | B1 | 1767 | Mary Smith (r1) | Margeret Smith (r2) | John Smith (r3) | –
Death | D1 | 1827 | Mary Taylor (r4) | Margery Smyth (r5) | John Smith (r6) | Nichol Taylor (r7)
Death | D2 | 1777 | Anne Smith (r8) | Maria Smith (r9) | Jonn Smith (r10) | Duncan Hunter (r11)
For simplicity, we only show the name attribute of each record. However, in real data each such record will have
various other attributes including an address, an occupation, and a date of birth, marriage or death, respectively, to
name a few.

over time that can be used for data analysis in demography, sociology, and genealogy [17]. Of
current interest, reconstructing a historical population from 1918 would allow the analysis of how
the Spanish flu spread [48]. Better understanding such historical pandemics at the scale of
a full population can help public health researchers and governments when dealing with health
crises, such as the current COVID-19 pandemic, and to be better prepared for future outbreaks of
infectious diseases.
Our aim in this work is to provide an unsupervised ER framework that can link records of
complex (as well as basic) entities while addressing the challenges current graph-based ER and
temporal linkage cannot handle adequately. We address five challenges in ER which are funda-
mental in linking records about complex entities, where we elaborate on these challenges with a
motivating example in the following section. We conduct extensive experiments on four datasets
that contain complex entities and three datasets containing basic entities to illustrate how our
proposed framework outperforms state-of-the-art ER approaches.
Contribution: We propose a novel unsupervised graph-based ER framework that is focused
on addressing the challenges associated with resolving complex entities (referred to as RELATER,
which stands for propagation of constRaints and attributE values, reLationships, Ambiguity, and
refinemenT for Entity Resolution, reflecting the main contributions of our work). We propose a global
method of propagating attribute values and constraints to capture changing attribute values and
different relationships, a method for leveraging ambiguity in the ER process, an adaptive method of
incorporating relationship structure, and a dynamic refinement step to improve clusters of records
by removing likely wrong links between records. RELATER can be employed to resolve records of
both basic and complex entities, as we will show using extensive experiments in Section 6.

2 MOTIVATING EXAMPLE
As shown in Table 2, let us consider a set of complex entities where we are interested in resolving
eleven person records (r1 to r11) from one birth certificate and two death certificates. We assume
a birth (B) certificate describes a birth baby (Bb) and its mother (Bm) and father (Bf), while a
death (D) certificate describes a deceased person (Dd), their mother (Dm) and father (Df), and
possibly their spouse (Ds). Similarly, a marriage (M) certificate would describe a bride (Mb) and a
groom (Mg), the bride's mother (Mbm) and father (Mbf), and the groom's mother (Mgm) and
father (Mgf).
Given the three example certificates in Table 2, we are interested in finding which person entities
are associated with these eleven records, hence which records need to be linked such that each
resulting cluster of records represents one entity. As an initial step, we need to extract the records
from the certificates, where B1 will contribute three person records, Mary Smith (r1), Margeret
Smith (r2), and John Smith (r3), and likewise for the other certificates. We then need to determine
if Mary Smith in B1 is the deceased person in D1 or D2, or if she is the mother on either/both of


these two death certificates. Similarly, the other records in this example have different roles and
relationships.
Assume that Mary Smith (r1) in B1 is the deceased in D1, Mary Taylor (r4), and that the deceased in
D2, Anne Smith (r8), could be a sibling of Mary Smith (r1). Therefore, the goal of our ER process is
to find the following clusters of records, which correspond to six different person entities: (r1, r4),
(r2, r5, r9), (r3, r6, r10), (r7), (r8), and (r11).
There exist different challenges in this example that are of interest particularly for resolving
complex entities. The primary challenge is the identification problem as defined by Bhattacharya
and Getoor [2], where we need to figure out the set of records that refer to each entity. While this
problem has been explored in the collective ER literature [2, 14, 28, 33], as we discuss next, some
aspects in our example have either not been investigated so far, or improvements are required
because existing methods fail to obtain high linkage quality for complex entities (as we will show
in Section 6).
The second challenge is how to resolve entities with changing attribute values. Mary Smith in
r1 has a different surname in her death certificate D1 (record r4), which is likely due to the change of her
surname when she got married. Assume we have linked r1 with Mary Smith's marriage certificate
(not shown), where this link states that her surname has changed to Taylor. In such a
scenario, if we can propagate the link decision (of her birth record with her marriage record) to
the link decision of her birth and death records, then we can easily identify that Mary Smith is
the same person as Mary Taylor based on her linked birth and marriage records. While existing
temporal record linkage approaches [27, 35] address the challenge of changing attribute values by
applying techniques such as temporal decays of attribute weights to capture temporal changes,
these solutions do not address the problem of different relationships of the same entity found in
records at various points in time.
The third challenge is how to incorporate the different relationships into the ER process to dis-
cover positive or negative evidence to guide the ER process. Assume r1 and r4 in the example in
Table 2 have been linked, and now we are interested in knowing if r4 and r9 refer to the same
entity, as we still do not know if r1 and r8 are siblings. Here, even though both r1 and r4 have relationships
with their mothers and fathers, r9 has different relationships, namely with her baby Anne (r8)
and her spouse, John (r10). These different relationships occurring in records at different points in
time can provide negative evidence for any subsequent link decisions, for instance in the form of
constraints. For example, in order to decide if r4 refers to the same entity as r9, we can propagate
temporal information from the link decision of r4 with r1 (Bb) discussed above. In the temporal
domain, biological constraints become relevant: in our example, for a birth baby to become a
mother there should be a gap of at least around 15 years. Therefore, we can decide that
r1 and r9 cannot be linked (refer to the same entity) as they are only 10 years apart.
In a context where relationships are considered, Dong et al. [14] propagated link decisions by
considering attribute value changes and applying constraints. They performed an exhaustive search
to find all record pairs associated with any of the linked records, and then merged attributes and
used the transitive closure property to remove any additional record pairs [14]. However, no exist-
ing graph-based ER work has explored how to efficiently propagate attribute value changes and
apply constraints. As we discuss in Section 4.1, we propose an efficient method that avoids an
exhaustive search to propagate link decisions. Furthermore, no research has so far explored how
this propagation of link decisions is affected when the attribute values of entities are ambiguous.
This disambiguation problem, as we showed in Table 1, is where a given attribute value is shared
by multiple (possibly many) entities. Values that are shared by only a small number of entities
provide stronger evidence that two records refer to the same entity. For example, if we look at


Table 3. Main Notation used Throughout the Article


Notation | Description
r, R, B | A record, a set of records, a set of blocks each consisting of a set of records
o, O | An entity, a set of entities
m, M | A matched cluster of records, a set of matched clusters of records
a, A | An attribute, a set of attributes
v, va, Va | An attribute value, a value of attribute a, the set of attribute values of attribute a
GO = (NO, EO) | An entity graph with a set of nodes and edges
GD = (ND, ED) | A dependency graph with a set of nodes and edges
nA, NA | An atomic node, a set of atomic nodes
nR, NR | A relational node, a set of relational nodes
g, Q | A group of nodes, a priority queue of node groups
C = CA ∪ CR | A set of adjacent nodes consisting of adjacent atomic and adjacent relational nodes
ρ, P | A role type, a set of role types
T, L | Sets of temporal and link constraints
sim, sima, simd | Total, atomic, and disambiguation similarity scores
γ | The weight distribution for the two similarity components, sima and simd
simM, simC, simE | Average similarities of Must, Core, and Extra attributes
wM, wC, wE | Weights given to the Must, Core, and Extra attribute categories
tb, tm, ta | Thresholds for bootstrapping, merging, and atomic node similarity
tn | Threshold for the minimum number of nodes in a cluster to split by bridges
td | Threshold for the minimum density of a cluster to refine
Mof, Fof, Cof, Sof | motherOf, fatherOf, childOf, and spouseOf relationships

the attribute values of the records in Table 2, we can see that the surname Smith occurs more
often than the surname Hunter. As a result, if we have two records, such as r1 and r8, which both
have the surname Smith, then this shared value does not provide sufficient evidence to link those
records because Smith is ambiguous. However, if we find a new record with the surname Hunter,
it is more likely that the new record represents the same entity as r11 because Hunter is unique
in our example. Bhattacharya and Getoor [2] have explored ambiguity of static attribute values
in relational clustering for collective ER. However, they have not investigated how to incorporate
disambiguation while propagating link decisions, or when attribute values can change over time.
In collective ER, we are interested in linking records that are relationally connected with other
records. For example, consider the two connected record groups of B1 and D2 in Table 2. If we
assume that Mary Smith and Anne Smith are siblings, then we should not link them. However,
the parent record pairs in that group, (r2, r9) and (r3, r10), need to be linked as they refer to the
same parents. We refer to this challenge as the partial match group problem, where only a subset of
relationally connected records correspond to the same entities while others do not. While recent
ER approaches take relationships into account by either incorporating relationship information
into the similarity calculation [2, 14] or by making a group link decision [19, 40], these approaches
would fail to properly link the parent records in this example because the overall similarity drops
due to the different sibling first names.
The final challenge is the one of incorrect link decisions. Because the process of linking two
records is no longer independent from linking other records in the context of collective ER [2, 14],
a single wrong link might propagate into other link decisions and result in an increase in the
number of false matches as well as missed true matches. For example, assume in Table 2 that
we have incorrectly linked r1 with r8, given that their parents' first names are similar and their
surnames are the same. However, as a deceased person can only be linked to a single birth baby, r8
will then not be linked to its correct birth record, and similarly r4 might get linked to a wrong birth
record. To the best of our knowledge, this challenge has not yet been addressed in the literature.


Fig. 1. The challenges of resolving complex entities defined in Section 3. We use vSN to show values from
a surname attribute. The direction of arrows corresponds to the relationships between records as well as
attribute values. Attributes are shown in green, records in light blue, entities in squares, and record clusters
with a shaded box.

3 PROBLEM DEFINITION AND OVERVIEW


We now define the problem of ER for databases of complex entities. We show the main notation we
use throughout the article in Table 3, where we use bold letters for lists and sets (with upper-case
bold letters for lists of sets and lists), and normal type letters for numbers and text.
Let R be a set of records from a database and O be a set of real-world entities (such as people).
We assume each record r ∈ R has a reference r.o to the entity o ∈ O that is represented by r, with
O = {r.o : ∀r ∈ R}, and where these r.o are unknown at the beginning of the ER process. Each
entity o ∈ O is represented by a set of records, m ⊂ R, that describe the entity. We denote such a
record set m as a cluster of records, and the set of all such clusters with M. Each record r ∈ R can
have a set of other records R′ ⊂ R, with r ∉ R′, that are connected to r by relationships such as
motherOf (Mof), fatherOf (Fof), childOf (Cof), spouseOf (Sof), or authorOf. We refer to such a set
R′ as a record group. Each record r ∈ R contains values, v, for a set of attributes, A, that provide
information such as the name, address, and gender for a person; or the author name, venue, and title
for a publication. Each record r also has a timestamp, r.t, that stands for the point in time (usually
a date) when the event that corresponds to r occurred. Similarly, each record r is associated with
a role, r.ρ ∈ P, where P is the set of all possible roles, such as a mother, a child, an author, or a
publication.
We first describe the challenges that are specific for resolving complex entities, illustrating them
in Figure 1. We then formally define the problem of ER to resolve records of complex entities.
(a) Changing attribute values: Let ri, rj, rk ∈ R be three records that represent the same
entity (i.e., ri.o = rj.o = rk.o). Assume ri.va = rj.va and rj.va ≠ rk.va, where a ∈ A, and
ri.t < rj.t < rk.t. If rj.va and rk.va are not variations of the same attribute value (i.e., are
not different values due to typographical errors but actual changed values [7, 35], such as
surname Smith to Taylor), then this is the challenge of changing attribute values, where an
entity has different values for the same attribute, a, at the timestamps rj.t and rk.t.
(b) Different relationships: Let ri, rj, ru, rv ∈ R be four records where ri and rj are relationally
connected with ri.o ≠ rj.o and ri.t = rj.t. Similarly, ru and rv are also relationally connected
with ru.o ≠ rv.o and ru.t = rv.t. Assume ri.o = ru.o and ri.t ≠ ru.t (and therefore rj.t ≠ rv.t).
There can be situations where rj.o ≠ rv.o because at different timestamps ri and ru have
different relationships with rj and rv. This is the challenge where we encounter different
relationships for the same entity in its records at different points in time, such as a baby to
mother relationship at birth versus a bride to groom relationship at marriage.
(c) Disambiguation problem: Let Rα ⊂ R be a set of records having the same value for a
given attribute, a ∈ A (∀ri, rj ∈ Rα : ri.va = rj.va ∧ i ≠ j). Assume each of these records
represents a different entity in O, such that ∀ri, rj ∈ Rα : ri.o ≠ rj.o. We refer to the challenge
of distinguishing such entities having the same attribute value (for example, many people
having the common surname Smith) as the disambiguation problem.


Fig. 2. Pipeline of RELATER where blue coloured boxes are the key techniques described in Section 4 and
white coloured boxes represent the three main steps described in Section 5.

(d) Partial match group problem: Let Rα ⊂ R and Rβ ⊂ R be two groups of records, with
Rα ∩ Rβ = ∅, where we assume the records in each group are relationally connected with
each other. When two such groups are compared for linking, in the set of paired records,
{(ri, rk), (rj, rl)}, where ri, rj ∈ Rα and rk, rl ∈ Rβ, if ∃(ri, rk), (rj, rl) : ri.o ≠ rk.o ∧ rj.o =
rl.o, we define such a group as a partial match group. We refer to this challenge of having
some record pairs that refer to different entities while other record pairs refer to the
same entity in relationally connected record groups (such as linking parents across the birth
records of siblings, but not linking the siblings) as the partial match group problem.
(e) Incorrect link problem: Let M be the set of record clusters in the record set R that have
been linked. Assume mk ⊂ R where mk ∈ M and ∃(ri, rj) ∈ mk : ri.o ≠ rj.o. This challenge is
the incorrect link problem, where we have records representing different entities in the same
cluster of records, such as an entity cluster that represents a certain individual containing a
record of a sibling.
Definition 3.1 (ER of Complex Entities). Given a set of records, R, the ER problem of resolving
records of complex entities is to link records ri ∈ R into clusters of records mk such that R =
{ri : ∀ri ∈ mk, ∀mk ∈ M} (all records in R have been inserted into a cluster) with M = ∪{mk}
and ∀mi, mj ∈ M : mi ∩ mj = ∅ (each record has been inserted into only one record cluster);
O = {ri.o : ∀ri ∈ mk, ∀mk ∈ M} and ∀ri ∈ mk : ri.o = oj, ∀mk ∈ M (every entity in O is
represented by one record cluster); and ∀mk ∈ M : |mk| ≥ 1 (each record cluster contains one
or more records, where records that were not linked form clusters of size 1), in a context where
relationally connected record groups can contain partial match groups and the records of an entity
can have changing attribute values, different relationships in different records, and ambiguous
attribute values shared with other entities.
Figure 2 shows the pipeline of our framework where the input is the groups of relationally
connected records extracted from one or more databases, and the output is a set of entities repre-
sented as clusters of records. We now provide an overview of the three main steps of RELATER as
described in detail in Section 5 (the white coloured boxes in Figure 2). In Section 4, we then discuss
how each key technique (the blue coloured boxes in Figure 2) contributes to the pipeline.
(1) Dependency Graph Generation: To resolve records, we need to represent them in a data
structure that can capture the relationships among records. Hence, we generate a depen-
dency graph [14] defined as follows.


Fig. 3. Dependency graph generation from a birth certificate and two death certificates, as discussed in
Section 2. Relationships between records are derived from the structure of certificates (such as a birth cer-
tificate containing a baby Bb, a mother Bm, and a father Bf). Each double ended arrow corresponds to two
single directed edges in the dependency graph, while a single directed edge means that the node at the head
(arrow) depends on the similarity of the node at the tail. Atomic nodes are shown in green while active
relational nodes are shown in blue.

Definition 3.2. A dependency graph is a directed graph, GD , that consists of a set of nodes,
ND , where these nodes represent pairs of attribute values or pairs of records; and a set of
edges, ED , that represent relationships between nodes. ND consists of atomic nodes, NA , that
represent pairs of attribute values, and relational nodes, NR , that represent pairs of candidate
records that possibly refer to the same entity, where ND = NA ∪ NR .

To generate the dependency graph, we potentially first have to extract records representing
individual entities (unless the input dataset already contains such individual records).
For example, as shown in Table 2 and Figure 3, to generate the dependency graph for person
data, we first extract individual records from birth and death certificates. Then, as we
describe in Section 5.1, for each pair of similar values, vi and vj, in an attribute a ∈ A (with
similarities greater than a threshold ta), we add a node (vi, vj) ∈ NA to GD. We repeat this
process for a selected set of quasi-identifying attributes that represent an entity. For each
pair of records, (ri, rj) ∈ R, that possibly refer to the same entity (based on blocking, as we
elaborate on in Section 5.1), we add a node (ri, rj) ∈ NR to GD.
A directed edge in GD represents that the similarity of the destination node depends on the
similarity of the source node. Edges between nodes in NR represent relationships between
records, such as motherOf or authorOf. For each node in NR, the set of its adjacent nodes
with incoming edges is denoted by C = CA ∪ CR, where CA and CR are the sets of adjacent
atomic and relational nodes of the specified node, respectively. For each ri.vi and rj.vj, if the
node (vi, vj) ∈ NA, then there is a directed edge (vi, vj) → (ri, rj), with (vi, vj) ∈ CA for the
relational node (ri, rj). For each pair of nodes ni, nj ∈ NR, if there is a relationship between
these nodes then there exist two directed edges between them: ni → nj (where ni ∈ Cj for
nj) and nj → ni (where nj ∈ Ci for ni). For example, there will be two edges between a
mother node and a child node. To keep it simple, we show these as double-arrowed edges in
the example figures.
Figure 3(b) shows an example of a small dependency graph. Since each relational node
in this graph is associated with two records, we refer to linking the two records in a node as
merging the node. Each node in GD is also associated with a node state [14], where this state
changes throughout the running of our framework. The possible states are active (considered
for merging), inactive (failed merging due to insufficient evidence such as low similarity),
merged (the two records in the node are linked), and non-merge (not considered for merging due
to constraint violations); a data structure sketch of these node types and states is given at the
end of this section.
(2) Bootstrapping: In this step, described in more detail in Section 5.2, we merge highly
similar groups of nodes in GD that have an average similarity greater than a predefined
bootstrapping threshold, tb. As we then propagate link decisions, it is important to
bootstrap the framework only with highly confident record pairs.
(3) Iterative Merging and Entity Graph Generation: In this step, we iteratively merge candidate
nodes in GD considering their relationship structure and the ambiguity of their attribute
values. We also propagate link decisions to account for changing attribute values and different
relationships. In this process, we generate an entity graph (defined below) to capture the
relationships between entities. Finally, we dynamically refine the record
clusters to remove likely wrong links, as we describe in Section 4.5.
Definition 3.3. An entity graph is a directed graph, GO, that consists of a set of nodes, NO,
where each node represents an entity o ∈ O; and a set of edges, EO, that represent the
relationships between these entities.
After describing the key techniques next, in Section 5, we then discuss these three steps of
RELATER in detail.
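The node types and states of Definition 3.2 can be pictured as a small data structure. The following is a minimal sketch in Python, where the class and field names (State, AtomicNode, RelationalNode) and the example similarity value are our own illustrative assumptions, not the article's implementation:

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    ACTIVE = "active"        # considered for merging
    INACTIVE = "inactive"    # failed merging due to insufficient evidence
    MERGED = "merged"        # the two records in the node are linked
    NON_MERGE = "non-merge"  # excluded due to constraint violations

@dataclass
class AtomicNode:
    """A pair of similar attribute values (vi, vj) and their similarity."""
    values: tuple
    sim: float

@dataclass
class RelationalNode:
    """A pair of candidate records (ri, rj) that possibly match."""
    records: tuple
    state: State = State.ACTIVE
    atomic: list = field(default_factory=list)      # adjacent atomic nodes CA
    relational: list = field(default_factory=list)  # adjacent relational nodes CR

# The node (r1, r4) of Figure 3(b), attached to its surname atomic node.
node = RelationalNode(("r1", "r4"),
                      atomic=[AtomicNode(("Smith", "Taylor"), 0.7)])
```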

4 KEY TECHNIQUES
In this section, we describe the key techniques, including all novel contributions, underlying the
RELATER framework that solve the five challenges described in the previous section. These tech-
niques help our framework to achieve high linkage quality specifically for complex entities when
compared to existing ER approaches.

4.1 Global Propagation of Attribute Values (PROP-A)


As we defined in Section 3, the first challenge is changing attribute values where values such as
names and addresses can change over time. To solve this problem, we propagate those changing
attribute values through the ER process in the bootstrapping and iterative merging steps of our
framework. When an attribute value changes over time, this change makes it difficult to decide if
two records refer to the same entity. Therefore, with this technique, we first check if any of the
records is associated with a record cluster. If there are associated entities, we check the attribute
values of all records associated with these entities to identify how their attribute values have changed
over time.
We use such changing attribute values in the ER process as positive evidence for any subsequent
links. For this, we maintain clusters of records, M, that aggregate all linked records and their
attribute values. Let us assume we merge a node containing two records, (ri, rj). To add these
two linked records to M, we consider three different cases. First, if neither ri nor rj is associated
with a record cluster, then we create a new cluster, mk, add ri and rj into mk, and then add mk
to M. Second, if only one of ri or rj is associated with a record cluster, for instance ri is already
associated with a cluster mk based on a previous link while rj is not, then we add rj to mk. Third,
if both records are associated with two different record clusters, we merge those two clusters. For
example, in Figure 4, we can see that the merged node (r1, r12) is associated with the record cluster
m1, which contains all attribute values of r1 and r12.
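The three cases above amount to simple bookkeeping over the record clusters in M. The following is a minimal sketch under assumed data structures (a record-to-cluster-id map and a cluster-id-to-record-set map), not the article's actual implementation:

```python
def add_link(ri, rj, cluster_of, clusters):
    """Add a merged record pair (ri, rj) to the record clusters M."""
    ci, cj = cluster_of.get(ri), cluster_of.get(rj)
    if ci is None and cj is None:       # case 1: create a new cluster mk
        ck = max(clusters, default=-1) + 1
        clusters[ck] = {ri, rj}
        cluster_of[ri] = cluster_of[rj] = ck
    elif ci is None or cj is None:      # case 2: extend the existing cluster
        ck = ci if ci is not None else cj
        new = rj if ci is not None else ri
        clusters[ck].add(new)
        cluster_of[new] = ck
    elif ci != cj:                      # case 3: merge the two clusters
        for r in clusters.pop(cj):
            clusters[ci].add(r)
            cluster_of[r] = ci

cluster_of, clusters = {}, {}
add_link("r1", "r12", cluster_of, clusters)  # creates m1 = {r1, r12}
add_link("r1", "r4", cluster_of, clusters)   # r4 joins m1, as in Figure 4
print(clusters)                              # {0: {'r1', 'r12', 'r4'}}
```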
In order to propagate attribute values, when linking two records, ri, rj ∈ R, we find the most
similar attribute value pair of the two records by considering the associated record cluster(s), if
there are any. For example, when we consider the node (r1, r4) in Figure 4, because r1 is part of
the associated record cluster m1, we compare all attributes of r4 with the corresponding attribute
values of m1 to find the best matching atomic nodes with the most similar values. As the surname
of r4 is Taylor, the node (r1, r4) is already associated with the atomic node (Smith, Taylor). When
we compare Taylor with the surnames of m1, and assuming the similarities sim(Tayler, Taylor) >
sim(Smith, Taylor), we remove the edge from the atomic node (Smith, Taylor) and add a new edge

Fig. 4. Global propagation of attribute values. As r1 is associated with a record cluster, m1, we replace the
atomic node of surnames from node (r1, r4) with the surnames that have the highest similarity between
m1 and r4, (Tayler, Taylor). Atomic nodes are shown in green while active and merged relational nodes are
shown in blue and yellow, respectively.

from the (Tayler, Taylor) node to the relational node (r1, r4). In this way, even if an individual
changes their name or address over time, our framework can still identify them based on previous
links or highly similar attribute values. With this attribute propagation step, as connected atomic
nodes change, the similarity of each relational node can change throughout the ER process,
which is a significant improvement over previous collective ER approaches that do not consider
such attribute value changes.
In the context of collective ER, the idea of propagating link decisions was first proposed
by Dong et al. [14]. However, our propagation method differs from this previous approach,
as we make a global propagation of attribute values using a unified view of all record clusters,
M, that represent entities. On the other hand, Dong et al. [14] propagated link decisions with an
exhaustive search that merges relational nodes in the graph.

4.2 Global Propagation of Constraints (PROP-C)


The purpose of this technique is to make use of the different relationships that we encounter in
records at different points in time. Since different relationships correspond to different entities,
we cannot directly compare them in making link decisions. However, relationships can be used as
negative evidence for any subsequent links. In our framework, we model such negative evidence
as constraints and apply these constraints in the steps of dependency graph generation, bootstrap-
ping, and iterative merging. For example, for a birth baby (Bb) to become a birth mother (Bm),
biologically, there should be a time gap of at least around 15 and at most around 55 years [45]. For
other role pairs, there are different constraints that can be applied based on domain knowledge [8].
We refer to those constraints that are associated with temporal aspects as temporal constraints.
In addition to that, there can be constraints that are associated with the properties of certain
relationships. For example, in Figure 4, after the node (r1, r4) is merged, where we assume r1 and r4
refer to a baby and a deceased person, respectively, r1 cannot be linked with any other death record
because a birth baby (Bb) can only be linked to one deceased person (Dd), and vice versa. We refer
to constraints that are associated with such relationships as link constraints. Link constraints are
one-to-one and one-to-many constraints that can be applied to pairs of entity roles.
As these constraints are domain dependent, they need to be manually specified by domain ex-
perts or learned from training data to be used by our framework. As we defined in Section 3, each
record, r ∈ R, is associated with a role, r .ρ, where ρ ∈ P and P is the set of all roles for a given
domain, and a timestamp, r .t. We define the temporal and link constraints that we use as follows.

Definition 4.1. Temporal constraints apply for databases with complex entities, where such
constraints restrict (for specific role pairs) whether two records should be considered for linking or not.
We model temporal constraints as a set T = ∪ρ1,ρ2∈P Tρ1,ρ2 of time periods where records can and
cannot be linked.


Definition 4.2. Link constraints restrict, for a given role pair, how many links a record can be
involved in. We model link constraints as a set L = ∪ρ1,ρ2∈P Lρ1,ρ2 of one-to-one or one-to-many
constraints that determine how many records can be involved in a specific relationship for this
role pair.
For example, the temporal constraint between the roles of birth baby and birth mother, TBb,Bm,
can be represented as (ri.ρ = Bb) ∧ (rj.ρ = Bm) ∧ (15 ≤ YearTimeGap(ri, rj) ≤ 55) ⟹
ValidMerge(ri, rj). Similarly, the one-to-one link constraint between the roles of birth baby and
deceased person, LBb,Dd, can be represented as (ru.ρ = Bb) ∧ (rv.ρ = Dd) ∧ (|Links(ru, Dd)| =
0) ∧ (|Links(rv, Bb)| = 0) ⟹ ValidMerge(ru, rv), which means both records ru and rv cannot
be involved in any other links to a deceased person and a birth baby, respectively.
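To make the two constraint types concrete, the following is a minimal sketch; the Record class, the constraint tables, and the bookkeeping of existing links are our own illustrative assumptions rather than the article's implementation:

```python
from dataclasses import dataclass

@dataclass
class Record:
    rid: str    # record identifier, e.g., "r1"
    role: str   # role r.rho, e.g., "Bb" (birth baby), "Bm" (birth mother)
    year: int   # event year derived from the timestamp r.t

# Temporal constraints T: allowed year gaps for selected role pairs.
TEMPORAL = {("Bb", "Bm"): (15, 55)}
# Link constraints L: role pairs restricted to one-to-one links.
ONE_TO_ONE = {("Bb", "Dd")}

def valid_merge(ri, rj, existing_links):
    """Return True only if no temporal or link constraint is violated.

    existing_links is a set of (record id, partner role) pairs recording
    the role pairs under which a record is already linked."""
    pair = (ri.role, rj.role)
    if pair in TEMPORAL:
        low, high = TEMPORAL[pair]
        if not low <= rj.year - ri.year <= high:
            return False   # temporal constraint violated
    if pair in ONE_TO_ONE:
        if (ri.rid, rj.role) in existing_links or \
           (rj.rid, ri.role) in existing_links:
            return False   # one-to-one link constraint violated
    return True

# Section 2 example: r1 (a baby born in 1767) and r9 (a mother in 1777) are
# only 10 years apart, so the temporal constraint TBb,Bm rejects the pair.
print(valid_merge(Record("r1", "Bb", 1767), Record("r9", "Bm", 1777), set()))
```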

4.3 Leverage Ambiguity of Attribute Values (AMB)


An interesting, yet sometimes overlooked, aspect of ER is the ambiguity of attribute values, where
potentially many entities can share the same value, such as a common surname or city name [7].
Such values can therefore become ambiguous, and possibly lead to record pairs with high at-
tribute similarities yet referring to different entities. To solve this challenge, we therefore propose
a method in the dependency graph generation step of our framework to calculate the overall
similarity, sim, of a relational node that incorporates ambiguity, where we combine two
components, an atomic similarity, sima, and a disambiguation similarity, simd, defined as

sima(ri, rj) = (wM · simM(ri, rj) + wC · simC(ri, rj) + wE · simE(ri, rj)) / (wM + wC + wE),   (1)

simd(ri, rj) = log2(|O| / (ri.f + rj.f)) / log2(|O|),   (2)

sim(ri, rj) = γ · sima(ri, rj) + (1 − γ) · simd(ri, rj),   (3)

where 0 ≤ γ ≤ 1 is the weight distribution for the two similarity components, sima and simd, as
we describe next and illustrate in Figure 5.
To calculate the initial atomic similarity, sima, of a relational node, we consider its set of adjacent
atomic nodes with incoming edges, CA. The similarities between attribute values in atomic nodes
are assumed to be always between 0 (completely different values) and 1 (same values). Similarities
are generally calculated using approximate string comparison functions, such as Jaro-Winkler or
edit-distance [7], as appropriate to the values in an attribute. The importance of different attributes
towards the calculation of sima also varies. For example, in databases with complex entities, at-
tributes such as first names are more important because generally they are more complete and
also more stable over time, whereas attributes such as occupation or address can be missing and
they can change over time [8, 45]. In bibliographic data, for pairs of authors, their first names
and surnames can be considered more important than the venue of a publication, which can be
considered as a less important attribute that provides additional information because many publi-
cations share the same venue.
To this end, we group attributes into three categories: Must, Core, and Extra, based on their
importance in the ER process determined using domain knowledge or data characteristics, such as
completeness [10]. For two records to be classified similar, they need to have highly similar values
in the Must attributes (such as first name), but they can have a comparatively lower similarity in
Core attributes (like surname). Extra attributes (such as the occupation of a person or the venue of
a publication) provide further evidence of similarities between records.
We calculate an initial atomic similarity as shown in Equation (1), where simM, simC, and simE
represent the averages of the atomic node similarities of the Must, Core, and Extra attribute categories,

Fig. 5. Similarity calculation of node (r1, r4). Assume we set wM, wC, and wE to 0.5, 0.3, and 0.2, respectively
(determined via domain knowledge), and consider the first name pair (Mary, Mary) as a Must attribute, the
surname pair (Tayler, Taylor) as a Core attribute, and the city pair (Klmor, Kilmore) as an Extra attribute.
Then, sima(r1, r4) can be calculated using Equation (1) as (0.5 · 1.0 + 0.3 · 0.9 + 0.2 · 0.9) / (0.5 + 0.3 + 0.2) = 0.95.
Similarly, assuming r1.f = 45, r4.f = 12, and |O| = 100, using Equation (2) we can calculate simd(r1, r4) as
log2(100/(45 + 12)) / log2(100) ≈ 0.12.

while wM, wC, and wE represent their corresponding weights, which can be learnt from a training
dataset or determined via domain knowledge [8, 45]. As Extra attributes are subsidiary, the
presence of an Extra attribute provides positive evidence for a match, while its absence does not
necessarily provide negative evidence. This holds because we add atomic nodes only if the
similarity of the two attribute values in a node is above a pre-defined threshold, ta. Therefore,
we set wE = 0.0 if all Extra attributes are absent.
If the pair of records in a relational node has attribute values that occur frequently in the set of
records R, then a high attribute similarity of that record pair is less important compared with a pair
of records that have rare attribute values and the same attribute similarity [16]. In our example
in Section 2, Smith occurs seven times whereas Hunter occurs only once. Two records having the
surname Hunter therefore have a higher likelihood of referring to the same real-world entity
than two records having the surname Smith. As the link decisions in our framework are
dependent on each other, we need to prioritise record pairs with unique or rare attribute values
such that they are processed before record pairs with ambiguous attribute values. As this is similar
to the concept of inverse document frequency as used in information retrieval, we use a normalised
inverse document frequency score [46] as the disambiguation similarity score, simd. Assume
aα, aβ ∈ A are the attributes that we consider for calculating ambiguity. Then, the frequency r.f
for a record r is calculated as the frequency of the attribute value combination, vaα and vaβ, in one
of the duplicate-free datasets that we aim to link. For two records ri and rj in a relational node,
let ri.f and rj.f be the frequencies calculated as described. If the number of unique records in the
dataset (i.e., the number of entities) is |O|, we define simd as shown in Equation (2), where we can
estimate |O| using the same duplicate-free dataset we used to calculate the frequencies.
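As a sketch, Equations (1) to (3) translate into a few lines of Python; the attribute similarities and weights below follow the worked example in Figure 5, while the value of γ is our own illustrative assumption:

```python
import math

W_M, W_C, W_E = 0.5, 0.3, 0.2   # weights for Must, Core, and Extra attributes
GAMMA = 0.8                     # assumed weight between sima and simd

def atomic_sim(sim_must, sim_core, sim_extra=None):
    """Equation (1): weighted average over the three attribute categories;
    the Extra weight is dropped when all Extra attributes are absent."""
    w_e = W_E if sim_extra is not None else 0.0
    total = W_M * sim_must + W_C * sim_core + w_e * (sim_extra or 0.0)
    return total / (W_M + W_C + w_e)

def disamb_sim(f_i, f_j, num_entities):
    """Equation (2): normalised inverse document frequency of the values."""
    return math.log2(num_entities / (f_i + f_j)) / math.log2(num_entities)

def overall_sim(sim_a, sim_d, gamma=GAMMA):
    """Equation (3): weighted combination of the two components."""
    return gamma * sim_a + (1 - gamma) * sim_d

sim_a = atomic_sim(1.0, 0.9, 0.9)   # 0.95, as in Figure 5
sim_d = disamb_sim(45, 12, 100)     # ~0.12, as in Figure 5
print(round(sim_a, 2), round(sim_d, 2), round(overall_sim(sim_a, sim_d), 2))
```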

4.4 Adaptive Leveraging of Relationship Structure (REL)


The purpose of this technique is to leverage the relationship structure to link records in the boot-
strapping and iterative merging steps. Recent approaches have incorporated relational similarities
between nodes into the similarity score in different ways [2, 14, 19, 28]. A limitation of directly
incorporating relational similarities into overall similarity scores is that this can affect the link
decisions in partial match groups, as we defined in Section 3. In our example, as illustrated in
Figure 6, for linking the two mother records, r2 (Bm) and r9 (Dm), basically there are two different
methods in the literature that incorporate the similarity of relational nodes. The first method [40]
considers the average similarity of the group of nodes (r2, r9), (r3, r10), and (r1, r8). The second
method [2, 14] has a component in the similarity calculation that provides a separate weight to
the similarities of the relational nodes (r3, r10) and (r1, r8). In both of these methods, the average
similarity or the similarity score of the node (r2, r9) gets lowered because of the low similarity of
(r1, r8), where this node has a low similarity because its two records represent two separate entities
(two siblings).


Fig. 6. Adaptive leveraging of relationship structure. This group of nodes corresponds to the parents and
siblings in the birth (B1) and death (D2) certificates discussed in Section 2 (see Table 2). In the linking process,
we iteratively remove the node with the lowest similarity, here the sibling node (r1, r8), while the parent
nodes, (r2, r9) and (r3, r10), proceed to merging.

To overcome this problem, RELATER provides a novel adaptive method to exploit the relational
structure of entities. As GD is a dependency graph, a connected component of a group of relational
nodes represents the structure of relationships between records. In order to decide if a pair of
records in a relational node needs to be linked, we consider the average similarity of the relationally
connected node group. Then, if that average similarity is less than a predefined threshold, tm, we
adaptively remove the node with the lowest similarity from the group and recalculate the average
similarity.

As per the previous example of siblings, and as illustrated in Figure 6, GD will have a group of
three relational nodes (a triangle) representing the two mothers (r2, r9), the two fathers (r3, r10), and
the two siblings (r1, r8). To leverage the relational structure, we consider the average similarity (0.63)
of all three nodes in the first iteration. If this average similarity is less than tm, then we remove the
node with the lowest similarity and continue to consider the remaining nodes. Therefore, in the
example in Figure 6, we ignore the lowest-similarity sibling node (r1, r8) (as the two records refer
to two different individuals), and continue with the remaining pair of parent nodes, (r2, r9) and
(r3, r10), which now have an average similarity of 0.85, to proceed with the merging and to solve
the partial match group problem we discussed in Section 3.
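A minimal sketch of this adaptive group linking follows; the node similarities are illustrative values chosen to be consistent with the averages quoted in Figure 6 (0.63 over all three nodes, 0.85 over the two parent nodes):

```python
def group_merge_candidates(nodes, t_m):
    """Iteratively drop the lowest-similarity node from a relationally
    connected node group until the group average reaches the merging
    threshold t_m; returns the surviving nodes (possibly none)."""
    nodes = sorted(nodes, key=lambda n: n[1], reverse=True)
    while nodes:
        if sum(sim for _, sim in nodes) / len(nodes) >= t_m:
            return nodes
        nodes.pop()   # remove the node with the lowest similarity
    return []

# The triangle of Figure 6: the sibling node (r1, r8) is dropped, and the
# parent nodes proceed to merging with an average similarity of 0.85.
group = [(("r2", "r9"), 0.90), (("r3", "r10"), 0.80), (("r1", "r8"), 0.19)]
print(group_merge_candidates(group, t_m=0.8))
```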

4.5 Dynamic Refining of Record Clusters (REF)


In collective ER, the propagation of link decisions to the subsequent linking of records can lead
to poor linkage quality if two records that refer to two different entities have been linked in a
previous iteration. However, if we can detect such incorrect links, removing them can facilitate
the correct pairs to be linked in following iterations.
To solve this problem, we propose a novel method to dynamically refine record clusters, mk ∈ M,
by removing wrongly linked records from such clusters. This step is conducted after each of the
bootstrapping and merging steps, as we show in Figure 2. Each time a node is merged, we add
the pair of records associated with the node to the corresponding record cluster, mk, and the node
corresponding to mk in the entity graph GO is updated. As we need to keep track of how records
are added into record clusters, we create a separate undirected graph for each such record cluster,
where the nodes indicate records and edges are added between each linked record pair. We utilise
this graph structure of relations formed in each record cluster to identify likely wrong links.
Based on the hypothesis that loosely connected record clusters (such as chains) are more likely
to contain errors compared to densely connected record clusters (such as cliques), we apply the
graph measure based error identification method proposed by Randall et al. [44] on the graph
generated from each record cluster. As a link decision in our framework can be propagated into
future link decisions, early identification of wrongly linked record pairs allows correct record pairs
to be linked in the next iteration.
We use the graph measures of density and bridges (illustrated in Figure 7) to identify loosely
connected record clusters. A bridge is an edge that will disconnect the graph if removed, and
density, d, is measured as the number of edges out of the total number of possible edges in a
graph [44], calculated as d = 2|E| / (|N| · (|N| − 1)), where E and N are the sets of edges and nodes

Fig. 7. (a) A graph with a bridge in red colour (dotted), (b) a graph with high density, and (c) a graph with
low density.

of the undirected graph generated from a record cluster, mk. For such a cluster having at least
three records, |mk| ≥ 3, we calculate the density, and if it is less than a predefined threshold, td,
we remove the node with the lowest degree. Similarly, for a record cluster having more than tn
records, we split the record cluster by any existing bridges. In Section 6, we discuss how we set the
parameters td and tn in our experimental evaluation.
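The refinement step can be sketched with standard graph routines; here we use the networkx library, with illustrative threshold values, and the ordering of the bridge and density rules is our own simplification of the procedure described above:

```python
import networkx as nx

def refine_cluster(g, t_d=0.5, t_n=4):
    """Refine one record cluster graph: split large clusters by bridges,
    then drop the lowest-degree record from sparse components."""
    if g.number_of_nodes() > t_n:
        g.remove_edges_from(list(nx.bridges(g)))   # split by bridges
    refined = []
    for comp in nx.connected_components(g):
        sub = g.subgraph(comp).copy()
        # density d = 2|E| / (|N| * (|N| - 1)), as computed by nx.density.
        if sub.number_of_nodes() >= 3 and nx.density(sub) < t_d:
            victim = min(sub.degree, key=lambda nd: nd[1])[0]
            sub.remove_node(victim)                # drop loosely linked record
        refined.append(set(sub.nodes))
    return refined

# A chain r1-r2-r3-r4-r5: every edge is a bridge, so this loosely connected
# (and likely erroneous) cluster falls apart into singleton records.
chain = nx.Graph([("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r5")])
print(refine_cluster(chain))
```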

5 ENTITY RESOLUTION WITH RELATER


In this section, we present our ER framework, as shown in Figure 2, which consists of three main
steps (dependency graph generation, bootstrapping, and iterative merging and entity graph gen-
eration). We detail how we utilise the five key techniques we discussed before in this framework.
Finally, we provide the time complexity of our framework.

5.1 Dependency Graph Generation


As we are representing records in a dependency graph, GD , if all possible record pairs and attribute
pairs are added into this graph, then it can get very large. Therefore, we apply a blocking technique
to reduce the comparison space by removing likely non-matching record pairs and grouping po-
tentially matching pairs [42]. We employ a locality sensitive hashing based blocking technique
that maps similar attribute value pairs to the same hash value to group likely matches [42]. After
blocking, GD is generated in two phases considering only record pairs in the generated blocks. In
the first phase, only attribute pairs that have a similarity of at least a threshold ta are added to the
graph from each block as atomic nodes, NA , along with their similarities.
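To illustrate this style of blocking, the sketch below implements a simple banded MinHash LSH
over character q-grams in plain Python; the numbers of hash functions and bands are illustrative
choices and not the exact settings of the technique surveyed in [42]:

def qgrams(value, q=2):
    # Character q-grams of an attribute value (the value itself as a fallback).
    value = value.lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)} or {value}

def minhash_signature(shingles, seeds):
    # One minimum per seeded hash function; agreeing minima across two sets
    # approximate their Jaccard similarity.
    return [min(hash((seed, s)) for s in shingles) for seed in seeds]

def lsh_blocks(records, num_hashes=32, bands=8):
    # records: iterable of (record_id, attribute_value) pairs. Records whose
    # signatures agree on all rows of at least one band share a block.
    rows = num_hashes // bands
    seeds = range(num_hashes)
    blocks = {}
    for rid, value in records:
        signature = minhash_signature(qgrams(value), seeds)
        for b in range(bands):
            key = (b, tuple(signature[b * rows:(b + 1) * rows]))
            blocks.setdefault(key, set()).add(rid)
    # Only blocks that group two or more records yield candidate record pairs.
    return [ids for ids in blocks.values() if len(ids) > 1]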
In the second phase, we add relational nodes, NR , from each block to GD based on several
filtering methods. First, we consider only possible pairs of entity role types, such as pairs of two
authors or pairs of two birth mothers, while ignoring pairs with different genders, such as a birth
mother paired with a groom. Second, we filter record pairs by temporal constraints, T, for databases
having complex entities where such constraints can be applied. Then, we filter record pairs based
on the number of atomic nodes available that correspond to the three attribute categories Must,
Core, and Extra, as we discussed in Section 4.3. For example, we can define a rule such that at least
one atomic node from the Must and two from the Core attributes need to exist in NA for a record
pair to be added to NR . Our framework provides the flexibility to select such rules to obtain a high-
quality initial dependency graph, GD . These rules can be determined from domain knowledge or
can be learned using a training dataset containing known matches.
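Such a rule can be expressed as a simple predicate over the number of atomic nodes per attribute
category; the following sketch encodes the example rule above (the category labels and minimum
counts are illustrative assumptions):

def satisfies_category_rule(atomic_nodes, min_must=1, min_core=2):
    # atomic_nodes: the (category, similarity) pairs that exist in NA for a
    # candidate record pair, with category one of 'must', 'core', or 'extra'.
    counts = {'must': 0, 'core': 0, 'extra': 0}
    for category, _ in atomic_nodes:
        counts[category] += 1
    # Example rule: at least one Must and two Core atomic nodes must exist.
    return counts['must'] >= min_must and counts['core'] >= min_core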
Relational nodes, NR , are added to GD in the active state with directed edges from their cor-
responding atomic and relative relational nodes. For example, in Figure 3(b), the node (r1, r4) has
(r2, r5) and (r3, r6) as relational nodes. The edges from (r2, r5) to (r1, r4) and from (r1, r4)
to (r2, r5), which indicate the motherOf and childOf relationships, respectively, are added between
those two nodes. Similarly, relationship edges are added for the father's relational node as
well. Then, for all relational nodes, we calculate the similarity score, sim, using Equation (3), where
we leverage the ambiguity of attribute values (AMB) as we discussed in Section 4.3.
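Equation (3) is defined in Section 4.3; based on the role of γ described in Section 6.4 (a higher γ
gives more weight to the atomic similarity, a lower γ to the disambiguation similarity), the
combination can be sketched as follows. Treat this as a schematic reconstruction, not a verbatim
copy of the published formula:

def relational_node_sim(sim_atomic, sim_disambig, gamma=0.6):
    # gamma weights the atomic (attribute) similarity against the
    # disambiguation similarity (AMB); gamma = 1.0 ignores disambiguation.
    return gamma * sim_atomic + (1.0 - gamma) * sim_disambig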

ACM Transactions on Knowledge Discovery from Data, Vol. 17, No. 1, Article 12. Publication date: February 2023.
12:16 N. Kirielle et al.

5.2 Bootstrapping
As an initial step of the merging process, we first bootstrap the dependency graph, GD , by merging
highly similar relational nodes. In collective ER, any subsequent links always depend upon the
previous links [2]. Therefore, it is important to bootstrap this graph by only linking record pairs
for which we have a high confidence of them to be a correct match.
After GD is generated, we have groups of relational nodes of different sizes. In the bootstrapping
step, we consider only nodes in groups (leaving the singletons), where the average atomic
similarity, following Equation (1), of all nodes in a group must be at least the bootstrap threshold,
tb. We set tb = 0.95 in our evaluation in Section 6 (based on a set of initial experiments) so that
the graph is bootstrapped by linking only highly similar record pairs. We consider groups of nodes
connected by relationships at this stage, rather than singleton nodes that are not connected to any
other nodes, because such groups provide more relationship evidence than singletons.
While linking such highly similar node groups, we also propagate the attribute values (PROP-
A) and constraints (PROP-C) and adaptively leverage relationship structures (REL), as shown in
Figure 2. After bootstrapping the dependency graph, GD , we dynamically refine record clusters
(REF) to remove any incorrectly linked record pairs, as we discussed in Section 4.5.
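In code, the bootstrap selection reduces to a filter over relational node groups; in the sketch below,
the node attribute atomic_sim, holding the Equation (1) similarity, is an assumption for
illustration:

def bootstrap_groups(node_groups, t_b=0.95):
    # node_groups: groups of relationally connected nodes; singletons have
    # already been excluded. Only groups whose average atomic similarity
    # reaches the bootstrap threshold t_b are linked in this step.
    selected = []
    for group in node_groups:
        avg_sim = sum(node.atomic_sim for node in group) / len(group)
        if avg_sim >= t_b:
            selected.append(group)
    return selected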

5.3 Iterative Merging and Entity Graph Generation


The merging process used in RELATER, as outlined in Algorithm 1, iteratively merges nodes in
the dependency graph, GD , to find the entities that correspond to the records in nodes in this
graph. In the bootstrapping step, as we only linked the highly similar record pairs, we obtain the
bootstrapped dependency graph, GD , and the corresponding set of record clusters M. Therefore,
GD and M become the input to the iterative merging process. Additionally, we provide the set of
temporal T and link L constraints and the thresholds for merging tm , graph bridges tn , and graph
density td , as input to the algorithm.
The merging algorithm starts in line 1 by initialising the entity graph, GO , with the record
clusters generated in the bootstrapping step. Then, the algorithm proceeds by generating a priority
queue, Q, of node groups. These are relationally connected nodes with relationship edges that are
in the active state in GD , as we described in Section 5.1. This queue is initialised in line 2 using all
active relational node groups, where precedence is given to larger groups and then to groups with
high average similarity of nodes, sim, as calculated using Equation (3).
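This precedence can be realised, for example, with Python's heapq module (a min-heap, so both
ordering keys are negated; the node attribute sim is an assumption for illustration):

import heapq

def init_priority_queue(node_groups):
    # Order groups by size (larger first), then by average node similarity
    # (higher first), as calculated using Equation (3).
    heap = []
    for i, group in enumerate(node_groups):
        avg_sim = sum(node.sim for node in group) / len(group)
        # The counter i breaks ties so heapq never compares groups directly.
        heapq.heappush(heap, (-len(group), -avg_sim, i, group))
    return heap

# The top node group is then obtained with:
# _, _, _, g = heapq.heappop(heap)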
In every iteration, we perform merging on the top relational node group, g, in Q. For each
node in g in line 6, we check if the node is valid to be merged based on the temporal, T, and link
constraints, L. This is where we make use of the link decisions made in previous iterations to vali-
date the temporal and link constraints, as we detailed in the description of our global propagation
of constraints (PROP-C) in Section 4.2. Previous link decisions may or may not have associated
a record cluster, mk ∈ M (which represents an entity), with each of the records in a node. Based on these as-
sociated clusters, there can be three different cases. First, if both records in a node are associated
with two different record clusters, we validate the node by applying constraints on every possible
record pair between the two clusters. Second, if only one record is associated with a record cluster,
we validate all possible records in the cluster against the record that does not have an associated
record cluster. Third, if none of the records is associated with record clusters, we apply constraints
between the original records to validate the node.
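These three cases translate directly into code; in the sketch below, clusters maps a record to its
associated record cluster (if any), and check applies all temporal and link constraints, T and L, to
one record pair (both names are assumptions for illustration):

def is_valid_node(node, clusters, check):
    r1, r2 = node  # the candidate record pair held by the relational node
    c1, c2 = clusters.get(r1), clusters.get(r2)
    if c1 and c2:
        # Case 1: both records have clusters; apply the constraints to every
        # possible record pair between the two clusters.
        return all(check(a, b) for a in c1 for b in c2)
    if c1 or c2:
        # Case 2: one record has a cluster; validate all records in that
        # cluster against the record without an associated cluster.
        cluster, single = (c1, r2) if c1 else (c2, r1)
        return all(check(a, single) for a in cluster)
    # Case 3: neither record has a cluster; validate the original records.
    return check(r1, r2)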
For each valid node, we propagate the attribute values as detailed in Section 4.1 (PROP-A)
(line 7 in Algorithm 1) and find the set of atomic nodes with the highest attribute similarities
corresponding to the node. With those atomic nodes, we then calculate the new similarity of the
node in line 8 while setting its state to inactive as it has been processed (line 9). Any node that

ACM Transactions on Knowledge Discovery from Data, Vol. 17, No. 1, Article 12. Publication date: February 2023.
Unsupervised Graph-Based Entity Resolution for Complex Entities 12:17

ALGORITHM 1: The Merging Process in RELATER


Input:
- GD : Bootstrapped dependency graph
- M: Set of record clusters from bootstrapping
- T: Set of temporal constraints of all role pairs
- L: Set of link constraints of all role pairs
- tm : Similarity threshold for merging
- tn : Threshold for minimum number of nodes in a cluster to split by bridges
- td : Threshold for minimum density of a cluster to refine
Output:
- GO : Entity graph

1: GO = InitGraph(M) // Initialise entity graph, GO, with M
2: Q = InitPriorityQueue(GD.NR) // Initialise priority queue
3: while Q ≠ ∅ do:
4:   g = Q.dequeue() // Get top node group
5:   for n ∈ g do: // Iterate through the nodes in the node group
6:     if IsValidNode(n, M, T, L) then: // Validate a node's temporal and link constraints (PROP-C)
7:       UpdateAtomicNodes(n, M) // Propagate attribute values (PROP-A)
8:       UpdateSimilarity(n, M) // Update node similarity (AMB)
9:       n.state = inactive // Set node state to inactive
10:    else:
11:      g.removeNode(n) // Remove invalid node from the node group
12:      n.state = non-merge // Set node state to non-merge
13:  while |g| ≥ 2 do: // Loop until the node group becomes a pair (REL)
14:    simg = GetAverageGroupSim(g) // Calculate node group average similarity
15:    if simg ≥ tm then: // Check if average similarity is greater than or equal to tm
16:      M = MergeNodes(g, M) // Merge and add to M
17:      GO.Update(M) // Update GO with new record clusters M
18:      n.state = merged // Set node state to merged
19:      break // Break and continue to the next group (go to line 3)
20:    g = RemoveLowSimNode(g) // Remove node with the lowest similarity
21: GO = RefineRecordClusters(GO, tn, td) // Remove wrong links from record clusters (REF)
22: return GO

violates any constraints is removed from the group in line 11 and its state is updated to non-merge
in line 12.
We then calculate the average similarity, simg, of all nodes in line 14. If simg reaches the merge
threshold, tm , then in lines 16 to 18, we merge all nodes in g, add the records to the corresponding
record cluster mk , update the entity graph, GO , with the updated record cluster, update the state
of the merged nodes to merged, and continue to the next group in the queue Q. Otherwise, we
remove the node with the lowest similarity from the group g and check the possibility to merge
the group until it is reduced to a pair (in lines 13 to 20) by adaptively leveraging the relationship
structure (REL), as we described in Section 4.4.
After we have processed all the nodes in the priority queue, Q, we finally refine (REF) the
entities in the entity graph, GO , in line 21. In order to identify and remove the likely wrong links
in the entities associated with the record clusters in M, we utilise the measures of graph bridges
and graph density as we described in Section 4.5. Finally, in line 22, we return the generated entity

ACM Transactions on Knowledge Discovery from Data, Vol. 17, No. 1, Article 12. Publication date: February 2023.
12:18 N. Kirielle et al.

Table 4. Characteristics of the Datasets Used in the Experiments


Dataset Domain Entity Role Interpretation (Links between) Num. of Records Record True
Type Pair Role-1 Role-2 Pairs Matches
Isle of Skye Demographic Complex Bp-Bp Birth parents in birth certificates 34,272 34,272 436,518 83,132
(IOS) [45] Bp-Dp Parents in birth and death certificates 34,272 23,938 628,141 38,662
Kilmarnock Demographic Complex Bp-Bp Birth parents in birth certificates 74,948 74,948 1,571,991 135,346
(KIL) [45] Bp-Dp Parents in birth and death certificates 74,948 45,186 2,357,625 80,819
North Brabant Demographic Complex Bp-Bp Birth parents in birth certificates 906,710 906,710 11,012,512 N/A
(BHIC) [45] Bp-Dp Parents in birth and death certificates 906,710 1,436,217 861,041 N/A
IPUMS [47] Census Complex F-F Fathers in households 10,914 10,914 1,169,048 10,914
M-M Mothers in households 10,908 10,908 1,225,887 10,908
C-C Children in households 16,875 16,875 3,042,050 16,875
DBLP-ACM [13] Bibliographic Basic P-P Publications 2,616 2,294 38,998 2,220
DBLP-Scholar [13] Bibliographic Basic P-P Publications 2,616 64,263 772,911 5,347
Million Songs Music Basic S-S Songs 35,463 35,463 33,120 19,070
(MSD) [13]

graph, GO . The reason why we generate an entity graph as the end result of the framework is
that such a graph can capture all direct and indirect relationships between entities. Each entity
represented by a node in GO by now consists of a set of records. This set of records allows us to
infer all possible links to all related entities, which in turn enumerates all the indirect relationships
between entities.

5.4 Complexity Analysis


If we assume Ψ(·) is the worst-case time complexity of the attribute value similarity calculation
function, then the atomic node generation time complexity is O(|A| · Ψ(|va′|^2)), where |va′| is
the number of values of the attribute a′ ∈ A with the largest number of unique attribute values.
Assuming B is the set of blocks after blocking records, then the relational node generation time
complexity is O(|R|^2/|B|) [42], where R is the set of records being processed. The bootstrapping,
iterative merging, and entity graph generation steps all have a complexity of O (|NR |), assuming
each n ∈ NR has a small number of neighbouring nodes. In the datasets used for our experimental
evaluation, the average number of relationships per node was no more than 3.

6 EXPERIMENTAL EVALUATION
We conduct an extensive set of experiments to address the following questions: (1) How does
RELATER compare to other state-of-the-art ER baselines? (2) How does RELATER scale to large
datasets? (3) How do parameter values affect linkage quality? (4) How does each key technique in
our framework affect linkage quality?

6.1 Experimental Setup


We first describe the setup we used for our evaluation, including the datasets, baselines,
and settings.

6.1.1 Datasets. We evaluate RELATER on real datasets from different domains as detailed in
Table 4. To resolve complex entities, we use three demographic datasets, where the goal is to link
person records across birth, marriage, and death certificates; and one census dataset where the
interest is to link person records across several census snapshots. Furthermore, we resolve basic
entities in three publicly available datasets from the bibliographic and music domains to show that
our framework can obtain high linkage quality for both complex and basic entities.


The demographic datasets include two proprietary datasets from Scotland [45], one from the
remote IOS and the other from the town of Kilmarnock (KIL). Both contain civil certificates
(birth, marriage, and death) of their population over the period from 1861 to 1901. A third dataset
is from the publicly available Brabant Historical Information Center (BHIC) [4]. It contains
civil certificates from North Brabant, a province of the Netherlands, in the period from 1759 to
1969. Demographers with expertise in linking such data have curated and linked the IOS and KIL
datasets [45]. Their semi-automatic approach is heavily biased towards certain types of links, such
as Bp-Bp (links between birth parents), as their interests were in identifying siblings of the same
mother to facilitate the analysis of child mortality. Therefore, we show results of Bp-Bp for which
we have directly curated ground truth links, along with the results of Bp-Dp (links between birth
and death parents) for which we have inferred ground truth links (where in Table 4 p represents
both mother, m, and father, f ). We utilise the BHIC dataset for evaluating the scalability of RELATER
since it is a significantly larger dataset compared to IOS and KIL. However, we cannot show the
linkage quality of the BHIC dataset as we do not have ground truth links. As census data, we used
the 1870 and 1880 census snapshots of families from the US census data (IPUMS) publicly available
from IPUMS [47].
To evaluate RELATER for resolving basic entities, we selected datasets with different data charac-
teristics and levels of difficulty to match records. We use a music dataset, Million Songs (MSD) [13],
and two bibliographic datasets, DBLP-ACM [13], and DBLP-Scholar [13]. As DBLP-ACM consists
of two well-structured datasets, it can be considered as a simple dataset to resolve [32]. However,
DBLP-Scholar is more challenging because the publications in Google Scholar have many quality
problems, such as misspellings and different representations of authors and venues [32].
6.1.2 Baselines. To compare RELATER with existing (state-of-the-art) ER approaches, we se-
lected five baselines where each of them represents a different ER approach. The first baseline,
Attr-Sim, provides a basic pairwise similarity approach, as used in traditional pairwise
linkage techniques [7]. Second, Dep-Graph is an implementation of the collective ER approach
proposed by Dong et al. [14] that propagates link decisions in the ER process. To allow a fair com-
parison, we used the same dependency graph and the same set of temporal and link constraints we
used to evaluate RELATER. Third, Rel-Cluster is an implementation of the collective ER approach
proposed by Bhattacharya and Getoor [2] that employs ambiguity of attribute values in the ER pro-
cess. In Rel-Cluster, we apply the same set of temporal and link constraints applied to RELATER for
a fair comparison. Fourth, ZeroER [50] is a recent unsupervised approach that employs generative
modelling for learning match and non-match distributions to resolve entities. Finally, Magellan is
a state-of-the-art supervised ER system available as an open-source library [31]. As training data,
we used the record pairs generated in the blocking step as we will describe in Section 6.1.3. For
our experiments, we selected four classifiers from Magellan (a support vector machine, a random
forest, a logistic regression, and a decision tree) and averaged their linkage quality results.
6.1.3 Settings. We implemented our framework and baselines in Python 2.7, except for Magellan,
which is implemented in Python 3.6, and we conducted all experiments on a server running
Ubuntu 18.04 with 64-bit Intel Xeon 2.10 GHz CPUs and 512 GBytes of memory. The code of our
framework is available in an online repository at https://github.com/nishadi/RELATER to facilitate
repeatability.
Table 5. Attribute Categorisation of the Datasets used in the Experiments

Dataset | Must Attributes | Core Attributes | Extra Attributes
Isle of Skye (IOS) [45] | First name, Surname | Address, Parish | Occupation, Birth Year, Birth Place, Marriage Year, Marriage Place
Kilmarnock (KIL) [45] | First name, Surname | Address, Parish | Occupation, Birth Year, Birth Place, Marriage Year, Marriage Place
North Brabant (BHIC) [45] | – | First name, Surname | Birth Year, Birth Place, Marriage Year, Marriage Place
IPUMS [47] | First name, Race, Birth Place | Surname, Birth Year | Occupation, State, City
DBLP-ACM [13] | Publication Name | Publication Year | Publication Venue
DBLP-Scholar [13] | Publication Name | – | Publication Year
Million Songs (MSD) [13] | Title | Artist, Year | Length, Album

For all baselines and RELATER, in the blocking step, we grouped potential matches by employing
a locality sensitive hashing based indexing technique that maps records with similar attribute
pairs to the same block [42]. In the record pair comparison step, we then employed similarity
functions such as the Jaro-Winkler similarity for names and the Jaccard similarity for other textual
strings [7] to compare attribute values between records. For numerical comparisons, we used the
maximum absolute difference [7], and for comparing addresses in the IOS dataset, we geocoded
address strings [30] and calculated similarities based on the distance between two locations.
However, due to the absence or low quality of addresses, we did not consider geocoding for the
other datasets.
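For reference, the Jaccard and numerical comparisons take the following simple forms (pure-Python
sketches of these standard comparison functions; for Jaro-Winkler we rely on an external
string-matching library):

def jaccard_sim(a_tokens, b_tokens):
    # Jaccard similarity of two token sets, used for general textual strings.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def numeric_sim(x, y, d_max):
    # Maximum absolute difference comparison: the similarity decreases
    # linearly with |x - y| and reaches 0.0 at the tolerated maximum d_max.
    return max(0.0, 1.0 - abs(x - y) / d_max)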
For RELATER and all unsupervised baselines, we use the same set of attributes for Must, Core,
and Extra attributes (as shown in Table 5) for calculating the attribute similarity for a fair com-
parison. In the presence and absence of the Extra attributes, we set wM, wC, and wE to 0.6, 0.2, 0.2 and
0.7, 0.3, 0.0, respectively. As there can be several Extra attributes, wE is always lower than wM for
a single attribute.
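Under these settings, the attribute similarity of a record pair is a weighted sum over the three
category similarities; a sketch of this weighting (the renormalisation when Extra attributes are
absent follows the 0.7/0.3/0.0 setting above) is:

def weighted_attribute_sim(sim_must, sim_core, sim_extra=None):
    # Combine the Must, Core, and Extra category similarities; when no Extra
    # attributes are available, the weights 0.7/0.3/0.0 are used instead.
    if sim_extra is None:
        w_m, w_c, w_e, sim_extra = 0.7, 0.3, 0.0, 0.0
    else:
        w_m, w_c, w_e = 0.6, 0.2, 0.2
    return w_m * sim_must + w_c * sim_core + w_e * sim_extra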
We set the default merging threshold as tm = 0.85, the atomic node similarity threshold as
ta = 0.9, the weighting distribution in similarity scores as γ = 0.6, the graph measures tn = 15
(bridges), and td = 30% (density), based on the parameter sensitivity analysis in Section 6.4. We do
not show results varying td as it has a fairly small influence on precision and recall across
different thresholds, because the record clusters we obtain are not very big.

6.2 Linkage Quality Evaluation


We now compare the linkage quality of RELATER with the aforementioned baselines with respect
to three different evaluation measures: precision (P), recall (R), and the F∗ measure [24]. To describe
what each evaluation measure represents, we consider TP, FP, and FN as the number of true
matches, false matches, and false non-matches, respectively [7]. Precision is the number of true
matches against the total number of matches generated by a particular method, P = TP/(TP + FP),
while recall is the number of true matches against the total number of matches in the linked ground
truth data, R = TP/(TP + FN) [23]. We do not use the F-measure for evaluation as recent research
has found that it is not suitable for measuring linkage quality in ER [23], because the relative
importance given to precision and recall in the F-measure depends on the number of predicted
matches. Therefore, we use a more interpretable alternative to the F-measure, the F∗-measure [24],
calculated as F∗ = TP/(TP + FP + FN), which corresponds to the number of true matches against
the number of record pairs that are either misclassified or are correctly classified matches.
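All three measures are straightforward to compute from the match counts:

def linkage_quality(tp, fp, fn):
    # Precision, recall, and the F* measure [24] from true matches (tp),
    # false matches (fp), and false non-matches (fn).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_star = tp / (tp + fp + fn)
    return precision, recall, f_star

# For example, tp = 80, fp = 10, and fn = 20 gives
# P = 0.889, R = 0.800, and F* = 0.727.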
Table 6 provides the precision, recall, and F ∗ results for RELATER and the baselines evaluated.
Based on the average results, we can see that RELATER outperforms all baselines. We first discuss
how RELATER behaves with regard to resolving complex and basic entities. The databases with
complex entities involve several types of role pairs to resolve entities. Since we do not have a
complete set of ground truth links for the IOS and KIL datasets, we only show role pairs for which


Table 6. Precision (P), Recall (R), and F∗ Measure Results of RELATER Compared to the Baselines (Averages ± Standard Deviations)

Dataset | Role Pair | Measure | RELATER | Attr-Sim | Dep-Graph | Rel-Cluster | ZeroER | Magellan
IOS | Bp-Bp | P | 98.73 | 63.67 | 90.87 | 93.59 | 60.53 | 77.9 ± 33.4
IOS | Bp-Bp | R | 94.70 | 88.41 | 65.26 | 63.72 | 70.75 | 72.9 ± 35.1
IOS | Bp-Bp | F∗ | 93.56 | 58.76 | 61.25 | 61.06 | 48.41 | 60.4 ± 38.6
IOS | Bp-Dp | P | 86.44 | 43.05 | 0.00 | 80.91 | 58.37 | 67.8 ± 37.9
IOS | Bp-Dp | R | 92.87 | 72.32 | 0.00 | 49.19 | 14.02 | 62.2 ± 41.4
IOS | Bp-Dp | F∗ | 81.06 | 36.96 | 0.00 | 44.07 | 12.74 | 46.1 ± 40.4
KIL | Bp-Bp | P | 97.81 | 30.26 | 54.81 | 71.81 | 79.45 | 69.6 ± 40.1
KIL | Bp-Bp | R | 89.52 | 89.13 | 74.93 | 71.92 | 90.82 | 62.7 ± 46.7
KIL | Bp-Bp | F∗ | 87.76 | 29.18 | 46.32 | 56.09 | 73.54 | 51.6 ± 45.9
KIL | Bp-Dp | P | 74.36 | 11.05 | 28.95 | 30.35 | 45.71 | 63.9 ± 36.1
KIL | Bp-Dp | R | 89.57 | 90.49 | 70.69 | 43.18 | 15.67 | 61.8 ± 44.1
KIL | Bp-Dp | F∗ | 68.44 | 10.93 | 25.85 | 21.69 | 13.21 | 45.6 ± 39.4
IPUMS | F-F | P | 100.0 | 99.86 | 99.98 | 95.70 | 99.99 | 84.0 ± 32.6
IPUMS | F-F | R | 96.33 | 63.84 | 76.86 | 60.58 | 71.17 | 84.5 ± 32.7
IPUMS | F-F | F∗ | 96.33 | 63.78 | 76.85 | 58.98 | 71.16 | 81.1 ± 32.0
IPUMS | M-M | P | 100.0 | 99.85 | 99.96 | 93.68 | 99.97 | 80.0 ± 35.1
IPUMS | M-M | R | 95.88 | 60.17 | 70.98 | 57.86 | 71.19 | 76.8 ± 38.8
IPUMS | M-M | F∗ | 95.88 | 60.11 | 70.97 | 55.68 | 71.17 | 74.2 ± 37.8
IPUMS | C-C | P | 89.68 | 99.60 | 99.33 | 77.55 | 99.96 | 81.9 ± 32.2
IPUMS | C-C | R | 93.89 | 58.16 | 77.16 | 50.18 | 90.09 | 72.2 ± 39.6
IPUMS | C-C | F∗ | 84.73 | 58.02 | 76.76 | 43.81 | 90.06 | 68.1 ± 37.8
DBLP-ACM | P-P | P | 98.94 | 71.90 | 98.89 | 81.04 | 99.45 | 96.8 ± 00.9
DBLP-ACM | P-P | R | 96.49 | 96.71 | 96.67 | 96.44 | 98.60 | 97.8 ± 01.6
DBLP-ACM | P-P | F∗ | 95.50 | 70.19 | 95.63 | 78.68 | 98.07 | 94.7 ± 02.2
DBLP-Scholar | P-P | P | 77.89 | 54.65 | 69.71 | 78.54 | 98.47 | 88.0 ± 03.3
DBLP-Scholar | P-P | R | 80.10 | 79.60 | 80.94 | 49.41 | 44.72 | 87.5 ± 04.0
DBLP-Scholar | P-P | F∗ | 65.26 | 47.93 | 59.88 | 43.53 | 44.41 | 78.1 ± 03.9
MSD | S-S | P | 99.99 | 99.49 | 99.99 | 92.97 | 99.93 | 99.5 ± 00.3
MSD | S-S | R | 99.26 | 99.77 | 95.20 | 99.79 | 91.81 | 98.2 ± 02.2
MSD | S-S | F∗ | 99.24 | 99.26 | 95.19 | 92.79 | 91.75 | 97.7 ± 02.4
Averages | | P | 92.4 ± 9.3 | 67.3 ± 30.9 | 74.2 ± 33.9 | 79.6 ± 18.2 | 84.2 ± 21.5 | 80.9 ± 11.2
Averages | | R | 92.9 ± 5.1 | 79.9 ± 14.6 | 70.9 ± 25.5 | 64.2 ± 18.7 | 65.9 ± 31.1 | 77.7 ± 13.2
Averages | | F∗ | 86.8 ± 11.4 | 53.5 ± 23.0 | 60.9 ± 28.5 | 55.6 ± 18.8 | 61.5 ± 30.9 | 69.8 ± 17.8

Best results in each row are shown in bold font.

we have manually curated or inferred ground truth links [45]. For both these datasets, RELATER
obtains both high precision and recall values for the role pair Bp-Bp, whereas for Bp-Dp we can
see a drop in precision and recall. This is to be expected as we have an incomplete (inferred or
biased) set of ground truth links for Bp-Dp [45]. In the IPUMS dataset, the F-F (father to father)
and M-M (mother to mother) role pairs have high precision, recall, and F ∗ results whereas we can
see a small drop for the C-C (children to children) role pair because the set of ground truth links
from IPUMS are more focused on linking couples than children [47]. In the context of resolving
basic entities, RELATER provides high precision and recall results for both DBLP-ACM and MSD.
We can see that the DBLP-Scholar dataset is challenging to resolve because for all the baselines
there is a drop in linkage quality. However, we can see that RELATER outperforms all the other
unsupervised baselines even for the challenging DBLP-Scholar dataset.
The Attr-Sim baseline does not achieve acceptable linkage quality on any of the
datasets with complex entities. This indicates that traditional pairwise linkage approaches are
insufficient for linking databases with complex entities because these approaches do not address
the challenges associated with resolving complex entities. With respect to the datasets with basic


entities, good linkage quality can be achieved when resolving easy datasets such as MSD, even
for the Attr-Sim baseline. For this dataset, Attr-Sim achieves slightly higher recall and F∗ values
than RELATER. However, for more challenging datasets with basic entities, such
as DBLP-Scholar, the Attr-Sim baseline does not provide good results.
Dep-Graph [14] and Rel-Cluster [2] are two unsupervised collective ER baselines. Although they
exploit relationship information to resolve entities, they have poor performance compared to RE-
LATER when resolving complex entities. The Dep-Graph baseline addresses the problems of chang-
ing attribute values and different relationships by propagating attribute values and constraints.
However, as it does not address the problems of disambiguation, partial match groups, or incor-
rect links that RELATER addresses, for some role pairs (such as IOS Bp-Dp) Dep-Graph cannot
identify any true matches. Similarly, the drop in results in the Rel-Cluster baseline is because it
only addresses the disambiguation problem. However, as we can see both of these baselines per-
form better in resolving basic entities, the type of entities these two techniques were developed
for. Dep-Graph achieves slight improvements in recall and F∗ results for the DBLP-ACM dataset
pair. However, for all other datasets, RELATER performs better than Dep-Graph.
ZeroER [50] is a recently proposed unsupervised ER baseline that exploits the bi-modal nature
of ER problems to resolve entities. Based on the observation that the similarity vectors for matches
are different from those of non-matches, ZeroER employs generative models to learn the match and
non-match distributions. Therefore, when the features are well separated in simple basic entities
such as the DBLP-ACM dataset, ZeroER achieves the best results compared to all other baselines
and RELATER. However, when the datasets become challenging (even for basic entities), such as
DBLP-Scholar and MSD, we can see a drop of linkage quality results due to the absence of well
separated features for the match and non-match classes. Interestingly, none of the datasets with
complex entities achieve acceptable linkage quality with ZeroER compared to RELATER, because
ZeroER is unable to address the challenges associated with complex entities such as changing
attribute values and relationships, ambiguity, and the partial match group problem.
The precision, recall, and F∗ values for Magellan [31] are presented as averages with standard de-
viations because we use four different classifiers and two different settings to generate the training
and testing datasets. Since datasets with complex entities have different role pair types, we trained
Magellan in two different settings, whereas in the first we trained it only on record pairs of the
specific role pair that is being tested, and in the second we trained it on the full dataset. As most of
the datasets with complex entities have incomplete ground truth data, in practical scenarios one
likely will have to train on record pairs of all role pair types, for which Magellan obtains fairly
poor results. However, as expected in the first setting Magellan provides better results compared
to RELATER because Magellan is a supervised learning approach. For simple datasets with basic
entities, such as DBLP-ACM and MSD, we can see that RELATER outperforms Magellan. However,
for challenging datasets such as DBLP-Scholar, Magellan provides better results because it is a
supervised approach.

6.3 Scalability
Table 7 shows the runtimes of RELATER compared with the baselines. Attr-Sim shows the best
runtimes for most of the datasets because it simply links records without considering any rela-
tionships. The next best runtimes alternate between RELATER and Dep-Graph [14]. For datasets
with complex entities RELATER takes more time compared to Dep-Graph because it addresses all
problems specified in Section 3, whereas Dep-Graph addresses only the problems of changing at-
tribute values and different relationships. However, for datasets with basic entities, RELATER runs
faster than Dep-Graph because basic entities do not have most of the challenges complex entities
have. Rel-Cluster has higher runtimes compared to both RELATER and Dep-Graph because of the


Table 7. Runtime Results (in Seconds) for RELATER and Baselines


Dataset |NA | |NR | RELATER Attr-Sim Dep-Graph Rel-Cluster ZeroER Magellan
IOS 74,851 2,992,834 372 50 176 358 9,355 10,059
KIL 1,565,730 11,190,176 1,945 178 1,207 7,663 8,868 9,632
IPUMS 72,974 5,436,985 386 97 509 53,928 363 1,297
DBLP-ACM 43,726 52,200 3 3 6 20 254 15
DBLP-Scholar 1,259,501 1,093,444 152 70 172 3,837 6,063 30
MSD 70,136 33,120 3 1 4 105 88 86
Best results in each row are shown in bold font.

Table 8. Runtimes of RELATER for Different Graph Sizes of the BHIC Dataset Generated
for Different Time Periods
Time Period | Nodes | Edges | Generate NA (s) | Generate NR (s) | Bootstrap (s) | Iterative Merging and Entity Graph Generation (s) | Linkage time per node (ms) | Linkage time per edge (ms)
1900–1935 | 22,928,967 | 41,121,771 | 20,642 | 1,438 | 896 | 23,155 | 1.0 | 0.6
1890–1935 | 42,398,382 | 80,524,946 | 28,881 | 2,172 | 1,685 | 113,143 | 2.7 | 1.4
1880–1935 | 68,739,033 | 134,057,215 | 36,033 | 3,910 | 3,013 | 299,123 | 4.4 | 2.3
1870–1935 | 100,907,697 | 199,588,456 | 39,113 | 6,062 | 5,423 | 660,896 | 6.6 | 3.3
Linkage time is the total of bootstrapping, merging, inferring, and refining using the default settings described in
Section 6.

iterative clustering method employed. ZeroER shows worse runtimes compared to all other unsu-
pervised baselines because it involves a time consuming feature generation process. The worst
performing baseline is Magellan (these runtimes are averages for the four supervised classifiers
and two different settings we described in Section 6.2), as it spends considerable time training the
supervised classification models. Overall, the runtimes of RELATER are comparatively better than
the other baselines.
Next, we evaluate the scalability by comparing the runtimes of RELATER on different sized
datasets. For that, we vary the time periods of records considered for generating the graph with
the BHIC dataset. Table 8 provides an overview of runtimes for the atomic node and relational
node generation, bootstrapping, iterative merging, and entity graph generation steps of RELATER.
These runtimes indicate that the iterative merging and entity graph generation step accounts for
the largest component of the overall runtime because it is the most time consuming step that
involves most of the key techniques. We use total linkage time to measure the scalability of our
framework. Considering the values of linkage time per node and per edge, we can see that our
proposed framework has near-linear scalability in both the number of nodes and the number of
edges, which indicates that RELATER can scale to large graphs.

6.4 Parameter Sensitivity Analysis


We now show that RELATER is robust to the various parameter values used for ta, tm, γ, and
tn using linkage quality results of two datasets with complex entities and one dataset with basic
entities. We vary each parameter while keeping the other parameters at their default values, as we
discuss below.
In Figure 8, we vary ta in the range of [0.8, 0.95] to explore its sensitivity. For all datasets,
we can see that ta is robust in the range of [0.8, 0.9], but results drop as we increase ta further.
A high ta value keeps only highly similar atomic nodes in the initial graph, resulting in a graph
that does not include many of the ground truth links. Lower ta values add more atomic nodes to
GD resulting in a larger graph. Therefore, we set ta = 0.9 as the default choice that works well for
all datasets in our experiments.


Fig. 8. Precision, Recall and F∗ sensitivity results of RELATER for different ta, tm, γ, and tn values, as discussed
in Section 6.4.

We then vary the merging threshold, tm . For lower values of tm we can see that precision drops
for the IOS dataset because many false matches are being linked. However, in the other datasets, we
cannot see a drop in linkage quality results with lower tm values because the blocking technique
we used already provides an initial graph of high precision. We can see the precision, recall, and F∗
results are robust in the [0.8, 0.85] range for all datasets. When we further increase tm , recall drops
because we miss many true matches. Therefore, without a loss of generality, we set tm = 0.85 as
the default value that works well for all datasets.
Next, we discuss the impact on the linkage quality results of the value of γ that defines the
weight distribution of similarity components in Equation (3). When γ is lower, a higher weight is
assigned to the disambiguation similarity simd, and unambiguous record pairs that refer to different
entities can get linked. Therefore, we can see a drop in results at lower γ values. When we increase
γ, recall drops because a higher weight is given to atomic similarity and disambiguation is ignored.
When we do not disambiguate (γ = 1.0), recall drops because ambiguous record pairs with high
similarity get linked and these links along with link constraints restrict true matches with lower
similarity getting linked. However, for the DBLP-ACM dataset, we cannot see a drop in recall when
we do not involve disambiguation similarity because the attribute values of basic entities are not
as ambiguous as complex entities, as we showed in Table 1. We, therefore, set γ = 0.6 as this value
provides a good balance between precision and recall for all datasets.
As we show in Figure 8, RELATER is fairly robust to tn, the threshold for the
minimum number of nodes in a cluster to split by bridges, in the range of [10, 20] for the IOS Bp-Bp
dataset. If we further increase tn there is a small drop in precision and F∗ because wrong links
are not removed from small clusters for high tn values. This drop is fairly small due to the small
clusters generated for this dataset. For datasets with larger cluster sizes the effect of tn will be
more significant. In the other datasets, we cannot see any effect of tn because, as we link only two
datasets, the clusters have at most two records. We set tn = 15 as this value works well with
all datasets.


Table 9. Ablation Analysis for RELATER that Shows How Each Key Component
in the Framework Affects Linkage Quality
Dataset | Role Pair | Measure | RELATER | without PROP-A and PROP-C | without AMB | without REL | without REF
IOS | Bp-Bp | P | 98.73 | 86.79 | 99.22 | 99.88 | 98.02
IOS | Bp-Bp | R | 94.70 | 95.20 | 93.56 | 61.58 | 94.87
IOS | Bp-Bp | F∗ | 93.56 | 83.15 | 92.89 | 61.53 | 93.08
IOS | Bp-Dp | P | 86.44 | 72.56 | 89.72 | 0.00 | 85.28
IOS | Bp-Dp | R | 92.87 | 93.24 | 88.62 | 0.00 | 93.14
IOS | Bp-Dp | F∗ | 81.06 | 68.93 | 80.45 | 0.00 | 80.24
IPUMS | F-F | P | 100.0 | 92.75 | 100.0 | 0.00 | 100.0
IPUMS | F-F | R | 96.33 | 96.33 | 87.16 | 0.00 | 96.33
IPUMS | F-F | F∗ | 96.33 | 89.59 | 87.16 | 0.00 | 96.33
IPUMS | M-M | P | 100.0 | 92.19 | 100.0 | 0.00 | 100.0
IPUMS | M-M | R | 95.88 | 95.89 | 86.77 | 0.00 | 95.88
IPUMS | M-M | F∗ | 95.88 | 88.69 | 86.77 | 0.00 | 95.88
IPUMS | C-C | P | 89.68 | 47.96 | 90.75 | 0.00 | 89.68
IPUMS | C-C | R | 93.89 | 93.93 | 86.71 | 0.00 | 93.89
IPUMS | C-C | F∗ | 84.73 | 46.52 | 79.67 | 0.00 | 84.73
Best results in each row are shown in bold font.

6.5 Ablation Analysis


Table 9 shows the contributions of each key technique in the RELATER framework for different
role pairs in two datasets with complex entities. In Section 4, we described the key techniques that
we use to address the challenges related to complex entities, including PROP-A, PROP-C, AMB,
REL, and REF. As both PROP-A and PROP-C propagate link decisions through the ER process,
we consider them as a single component in this analysis, and show results without PROP-A and
PROP-C, without AMB, without REL, and without REF, separately.
When we remove PROP-A and PROP-C from the RELATER framework, then we neither prop-
agate negative nor positive evidence throughout the ER process. We, therefore, do not consider
attribute value changes and we do not enforce any constraints while linking records. We can see
that for all datasets precision is reduced, along with a drop of F∗ of up to 41%, because we link record
pairs that refer to different entities in the absence of constraints. Similarly, as we do not propagate
attribute value changes, record pairs that are true links do not get linked due to lower similarity.
We incorporated ambiguity in RELATER by including a disambiguation component (AMB) in the
similarity calculation, as we discussed in Section 4. Without AMB corresponds to RELATER when
similarities are solely calculated based on attribute similarities by ignoring the disambiguation
similarity. We can see a drop in recall when disambiguation similarity is not involved. This is
because ambiguous record pairs with high attribute similarity get linked wrongly, and the enforced
constraints prevent correct record pairs from being linked.
Next, we remove the adaptive leveraging of relationship structure (REL). Interestingly, except
for the IOS Bp-Bp dataset, we can see all other datasets provide zero results for all linkage quality
measures. For example, with the IPUMS dataset, we know that most families have siblings, and
therefore almost every group that we consider to link in RELATER is a partial match group as we
defined in Section 3. As a result none of the correct record pairs gets linked.
Finally, when we remove the dynamic refining of record clusters from RELATER (without REF),
we can see that precision drops for both role pairs in the IOS dataset. This is because, without
refinement, likely wrong links are no longer removed from the record clusters. The improvement
in linkage results is small here because we have small clusters in the IOS dataset. However, we
cannot see any differences in results for the IPUMS dataset because the cluster sizes in this dataset
are limited to two, as we link only two census snapshots. Therefore, those clusters cannot be refined.


An important aspect to note in the ablation study is that whenever we remove a key technique
from RELATER, the overall results always drop. In some scenarios precision increases while
compromising recall and F∗, while in other scenarios recall increases while compromising precision
and F∗. However, in none of the scenarios are the overall linkage quality results (as indicated by F∗)
higher than RELATER with all its key techniques included.

7 RELATED WORK
Various ER approaches have been developed since the 1950s to link records in databases [7, 21,
26, 41]. Most recent ER methods are based on supervised learning and deep learning approaches,
and the majority of recent works that aim at overcoming the lack of ground truth data are using
active learning or transfer learning. We now describe ER approaches related to ours that exploit
the relationships among entities to achieve high linkage quality.
On et al. [40] used relationship information for group linkage using weighted bipartite match-
ing. However, because groups are linked independently from each other, this approach does not
propagate relationship information. Fu et al. [19] pioneered the use of group linkage for person
records by linking individuals within households in census data (only considering relationships
within households), thereby substantially reducing the number of ambiguous links.
In contrast to pairwise classification based approaches, graph-based collective ER approaches
provide more accurate results by exploiting relational information [21]. Kalashnikov et al. [28]
proposed an approach for reference disambiguation based on random walks, aiming to identify
the entity to which each record refers. Bhattacharya and Getoor [2] also used relational informa-
tion between different types of entities by employing an iterative cluster merging process using a
relationship graph. However, these approaches focused on basic entities that have static attribute
values and static relationships, and they have mostly been evaluated on bibliographic data. In our
work, we address the problems associated with complex entities which have changing attribute
values and diverse relationships at different points in time. Kouki et al. [33] proposed a collective
ER approach for building familial networks based on probabilistic soft logic. Although the predi-
cates in their probabilistic soft logic capture relationships, they do not capture diverse relationships
encountered at different points in time. Similarly, they do not capture attribute values that change
over time.
Dong et al. [14] proposed a dependency graph-based approach to propagate link decisions
among multiple types of entities through the linkage process. We consider their approach as a
baseline (named Dep-Graph) because they also propagate link decisions to capture changing at-
tribute values and apply constraints. However, as we showed in the experimental evaluation, their
approach is not successful in addressing the problems associated with complex entities, such as
the disambiguation problem, the partial match group problem, or the incorrect link problem.
The ambiguity of attribute values in ER has been discussed since the development of probabilis-
tic record linkage by Fellegi and Sunter [16] in 1969. Li et al. [36] discussed the problem of am-
biguity in entities that occur in unstructured textual documents. In their approach, Kalashnikov
et al. [28] employed relationship analysis to enhance feature-based similarities between ambiguous
reference entity choices. This approach is applicable when the set of entities are known prior to
linking, and the task is to match records to entities. In our context, however, the set of entities is ini-
tially unknown. The approach developed by Bhattacharya and Getoor [2] incorporated ambiguity
in neighbours into the calculation of relational similarities. As this is the closest approach to ours,
we consider this as a baseline (named Rel-Cluster). However, as our experiments indicate, we can
see that this approach is not providing good linkage results because besides the disambiguation
problem it does not consider the other challenges our framework addresses.


Recently efforts have been made to identify ER errors using graph theory measures [12, 44].
However, none of them has been proposed in the context of collective ER. We utilise simple graph
theory measures such as bridges and density [44] in our RELATER framework. The clusters of
records we obtain are small and, therefore, more sophisticated “repair” operations such as those
proposed by Croset et al. [12] cannot be applied.
In the context of ER, several approaches have been proposed to incorporate temporal constraints
into the linkage process. Li et al. [35] and Chiang et al. [6] were the first to use temporal information
to improve supervised record pair classification in the bibliographic domain. More
recent work on ER for person records provides strong evidence for the improvement of linkage
quality when temporal constraints are applied [11, 38]. Although they address the problem of
changing attribute values over time, none of these temporal record linkage solutions [5, 9, 27, 34]
address the challenges associated with complex entities, such as that different relationships can
occur at different points in time, or the disambiguation or partial match group problems.
A growing body of research has studied supervised techniques in the context of ER. Magellan is
one such framework that supports an end-to-end ER pipeline with supervised techniques [31]. Re-
cently, deep learning techniques have also been proposed that provide good linkage results [37, 39].
However, as we discussed in the experiments in Section 6, databases with complex entities gen-
erally suffer from a lack of ground truth links that makes it challenging if not impossible to
use supervised techniques. Similarly, semi-supervised ER techniques including active learning ap-
proaches [29, 43] that query external sources to resolve challenging training cases, as well as crowd-
based approaches [1, 22, 49] that employ hybrid machine and human-based systems for resolving
entities, are challenging with person data due to privacy and confidentiality issues [10]. Further-
more, as transfer learning ER approaches such as [53] use pre-trained models, it is questionable
how to incorporate temporal and relational aspects into the linkage process.
Recent advancements in the ER literature have influenced unsupervised approaches towards
self-supervised learning methods. One recently proposed state-of-the-art unsupervised ER ap-
proach is ZeroER [50], which employs generative modelling to learn match and non-match distri-
butions to resolve entities. However, as we show in the experimental evaluation, ZeroER performs
well only when the features representing similarities are well separated, such as in datasets that
contain basic entities. When the datasets have complex entities, ZeroER does not perform well as it
is unable to distinguish match and non-match distributions due to the challenges associated with
complex entities.
The problem of graph-based ER is related to the graph alignment or graph matching prob-
lem [25, 51, 52], where the aim is to identify nodes that correspond to the same entity in two
graphs. Similarly, ER is also related to link mining [20], which is a research area that focuses on
classification, clustering, prediction, and modelling of links in graphs. However, these techniques
are not suitable for resolving complex entities because they do not address the challenges in resolv-
ing complex entities, including the propagation of link decisions or the disambiguation problem.

8 CONCLUSION AND FUTURE WORK


We have presented a novel unsupervised graph-based ER framework for resolving entities in
datasets that contain complex (as well as basic) entities. Our framework, RELATER, addresses five
challenges in resolving complex entities. First, we propagate positive evidence through the ER
process to account for the attribute values of entities that change over time. Second, we consider
diverse relationships encountered by an entity at different points in time by propagating negative
evidence such as temporal and link constraints throughout the ER process. RELATER achieves an
average improvement of 18% precision and 22% recall over the ER approach of Dong et al. [14]
that propagates link decisions locally. Third, we address the ambiguities of attribute values by


introducing a disambiguation similarity. Our framework achieves an average improvement of 13%
precision and 29% recall over the ER method proposed by Bhattacharya and Getoor [2], which also
leverages ambiguity. Fourth, the novel technique we propose to leverage the relationship structure
is the first to link partial match groups. Fifth, we are the first to address the problem of likely wrong
links in the context of collective ER by dynamically refining record clusters.
We show that our framework outperforms several state-of-the-art ER baselines on seven
datasets from different domains, where it is one of the most efficient ER methods among the com-
pared baselines. Moreover, we show that our framework is robust to parameter settings and each
key component substantially contributes to improving linkage quality. While in our work, we
considered demographic and census datasets, for other types of person data, such as publication
records containing author affiliations, a user will need to define suitable parameter settings for RE-
LATER, including any temporal and linkage constraints, and categorise attributes into must, core,
and extra. These settings can be defined through domain knowledge.
In future work, we plan to improve the scalability of RELATER by developing parallel versions
of our framework, explore how to use other graph measures to identify any likely wrong links in
record clusters, and examine how we can utilise transfer learning [53] when using existing biased
ground truth datasets.

REFERENCES
[1] Asma Abboura, Soror Sahrl, Mourad Ouziri, and Salima Benbernou. 2015. CrowdMD: Crowdsourcing-based ap-
proach for deduplication. In Proceedings of the International Conference on Big Data. IEEE, 2621–2627.
[2] Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. Transactions on Knowledge
Discovery from Data 1, 1 (2007), 5–es.
[3] Gerrit Bloothooft, Peter Christen, Kees Mandemakers, and Marijn Schraagen. 2015. Population Reconstruction.
Springer, Cham.
[4] Brabant Historical Information Center. 2021. Genealogie. Retrieved June 29, 2021 from https://opendata.picturae.com/organization/bhic.
[5] Yueh-Hsuan Chiang, AnHai Doan, and Jeffrey F. Naughton. 2014. Modeling entity evolution for temporal record
matching. In Proceedings of the SIGMOD International Conference on Management of Data. ACM, 1175–1186.
[6] Yueh-Hsuan Chiang, AnHai Doan, and Jeffrey F. Naughton. 2014. Tracking entities in the dynamic world: A fast
algorithm for matching temporal records. VLDB Endowment 7, 6 (2014), 469–480.
[7] Peter Christen. 2012. Data Matching—Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate
Detection. Springer.
[8] Peter Christen. 2016. Application of advanced record linkage techniques for complex population reconstruction.
arXiv:1612.04286. Retrieved from https://arxiv.org/abs/1612.04286.
[9] Peter Christen and Ross W. Gayler. 2013. Adaptive temporal entity resolution on dynamic databases. In Proceedings
of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 558–569.
[10] Peter Christen, Thilina Ranbaduge, and Rainer Schnell. 2020. Linking Sensitive Data. Springer.
[11] Victor Christen, Anika Groß, Jeffrey Fisher, Qing Wang, Peter Christen, and Erhard Rahm. 2017. Temporal group
linkage and evolution analysis for census data. In Proceedings of the International Conference on Extending Database
Technology. 620–631.
[12] Samuel Croset, Joachim Rupp, and Martin Romacker. 2015. Flexible data integration and curation using a graph-based
approach. Bioinformatics 32, 6 (2015), 918–925.
[13] Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen.
2021. The Magellan Data Repository. Retrieved May 05, 2021 from https://sites.google.com/site/anhaidgroup/useful-
stuff/data.
[14] Xin Luna Dong, Alon Halevy, and Jayant Madhavan. 2005. Reference reconciliation in complex information spaces.
In Proceedings of the SIGMOD International Conference on Management of Data. ACM, 85–96.
[15] Xin Luna Dong and Divesh Srivastava. 2015. Big Data Integration. Morgan and Claypool Publishers.
[16] Ivan P. Fellegi and Alan B. Sunter. 1969. A theory for record linkage. Journal of the American Statistical Association
64, 328 (1969), 1183–1210.
[17] Tyler Folkman, Rey Furner, and Drew Pearson. 2018. GenERes: A genealogical entity resolution system. In Proceed-
ings of the International Conference on Data Mining Workshops (ICDMW’18). IEEE, 495–501.


[18] Zhichun Fu, Peter Christen, and Jun Zhou. 2014. A graph matching method for historical census household linkage.
In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 485–496.
[19] Zhichun Fu, Jun Zhou, Peter Christen, and Mac Boot. 2012. Multiple instance learning for group record linkage. In
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 171–182.
[20] Lise Getoor and Christopher P. Diehl. 2005. Link mining: A survey. SIGKDD Explorations 7, 2 (2005), 3–12.
[21] Lise Getoor and Ashwin Machanavajjhala. 2013. Entity resolution for big data. In Proceedings of the SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining. ACM, 1527–1527.
[22] Yash Govind, Erik Paulson, Palaniappan Nagarajan, Paul Suganthan G. C., AnHai Doan, Youngchoon Park, Glenn M.
Fung, Devin Conathan, Marshall Carter, and Mingju Sun. 2018. Cloudmatcher: A hands-off cloud/crowd service for
entity matching. VLDB Endowment 11, 12 (2018), 2042–2045.
[23] David J. Hand and Peter Christen. 2018. A note on using the f-measure for evaluating record linkage algorithms.
Statistics and Computing 28, 3 (2018), 539–547.
[24] David J. Hand, Peter Christen, and Nishadi Kirielle. 2021. F*: An interpretable transformation of the f-measure. Ma-
chine Learning 110, 3 (2021), 451–456.
[25] Mark Heimann, Haoming Shen, Tara Safavi, and Danai Koutra. 2018. Regal: Representation learning-based graph
alignment. In Proceedings of the International Conference on Information and Knowledge Management. ACM, 117–126.
[26] Thomas Herzog, Fritz Scheuren, and William Winkler. 2007. Data Quality and Record Linkage Techniques. Springer.
[27] Yichen Hu, Qing Wang, Dinusha Vatsalan, and Peter Christen. 2017. Improving temporal record linkage using regres-
sion classification. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer,
561–573.
[28] Dmitri V. Kalashnikov and Sharad Mehrotra. 2006. Domain-independent data cleaning via analysis of entity-
relationship graph. Transactions on Database Systems 31, 2 (2006), 716–767.
[29] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution
with transfer and active learning. In Proceedings of the Association for Computational Linguistics. ACL, 5851–5861.
[30] Nishadi Kirielle, Peter Christen, and Thilina Ranbaduge. 2019. Outlier detection based accurate geocoding of histor-
ical addresses. In Proceedings of the Australasian Conference on Data Mining. Springer, 41–53.
[31] Pradap Konda, Sanjib Das, Paul Suganthan G.C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah
Panahi, Haojun Zhang, Jeff Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra. 2016. Mag-
ellan: Toward building entity matching management systems. VLDB Endowment 9, 12 (2016), 1197–1208.
[32] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world
match problems. VLDB Endowment 3, 1–2 (2010), 484–493.
[33] Pigi Kouki, Jay Pujara, Christopher Marcum, Laura Koehly, and Lise Getoor. 2019. Collective entity resolution in
multi-relational familial networks. Knowledge and Information Systems 61, 3 (2019), 1547–1581.
[34] Furong Li, Mong Li Lee, Wynne Hsu, and Wang-Chiew Tan. 2015. Linking temporal records for profiling entities. In
Proceedings of the SIGMOD International Conference on Management of Data. ACM, 593–605.
[35] Pei Li, Xin Luna Dong, Andrea Maurino, and Divesh Srivastava. 2011. Linking temporal records. VLDB Endowment
4, 11 (2011), 956–967.
[36] Xin Li, Paul Morie, and Dan Roth. 2005. Semantic integration in text: From ambiguous names to identifiable entities.
AI Magazine 26, 1 (2005), 45–58.
[37] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep,
Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In
Proceedings of the SIGMOD International Conference on Management of Data. ACM, 19–34.
[38] Charini Nanayakkara, Peter Christen, and Thilina Ranbaduge. 2019. Robust temporal graph clustering for group
record linkage. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer,
526–538.
[39] Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. 2019. Deep sequence-
to-sequence entity matching for heterogeneous entity resolution. In Proceedings of the International Conference on
Information and Knowledge Management. ACM, 629–638.
[40] Byung-Won On, Nick Koudas, Dongwon Lee, and Divesh Srivastava. 2007. Group linkage. In Proceedings of the International
Conference on Data Engineering. IEEE, 496–505.
[41] George Papadakis, Ekaterini Ioannou, Emanouil Thanos, and Themis Palpanas. 2021. The Four Generations of Entity
Resolution. Morgan and Claypool Publishers.
[42] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and filtering tech-
niques for entity resolution: A survey. Computing Surveys 53, 2 (2020), 1–42.
[43] Kun Qian, Lucian Popa, and Prithviraj Sen. 2017. Active learning for large-scale entity resolution. In Proceedings of
the International Conference on Information and Knowledge Management. ACM, 1379–1388.
[44] Sean M. Randall, James H. Boyd, Anna M. Ferrante, Jacqueline K. Bauer, and James B. Semmens. 2014. Use of graph
theory measures to identify errors in record linkage. Computer Methods and Programs in Biomedicine 115, 2 (2014),
55–63.
[45] Alice Reid, Ros Davies, and Eilidh Garrett. 2002. Nineteenth-century Scottish demography from linked censuses and
civil registers: A ‘sets of related individuals’ approach. History and Computing 14, 1–2 (2002), 61–86.
[46] Stephen Robertson. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of
Documentation 60, 5 (2004), 503–520.
[47] Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler, and Matthew Sobek.
2021. IPUMS USA: Version 11.0 [dataset]. https://doi.org/10.18128/D010.V11.0
[48] Laura Spinney. 2017. Pale Rider: The Spanish Flu of 1918 and How it Changed the World. PublicAffairs, New York.
[49] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2012. CrowdER: Crowdsourcing entity resolution.
VLDB Endowment 5, 11 (2012), 1483–1494.
[50] Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, and Saravanan Thirumuruganathan. 2020. ZeroER: Entity res-
olution using zero labeled examples. In Proceedings of the SIGMOD International Conference on Management of Data.
ACM, 1149–1164.
[51] Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, and
Kuansan Wang. 2019. OAG: Toward linking large-scale heterogeneous entity graphs. In Proceedings of the SIGKDD
International Conference on Knowledge Discovery and Data Mining. ACM, 2585–2595.
[52] Jing Zhang, Bo Chen, Xianming Wang, Hong Chen, Cuiping Li, Fengmei Jin, Guojie Song, and Yutao Zhang. 2018.
MEgo2Vec: Embedding matched ego networks for user alignment across social networks. In Proceedings of the Inter-
national Conference on Information and Knowledge Management. ACM, 327–336.
[53] Chen Zhao and Yeye He. 2019. Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and
transfer learning. In Proceedings of the World Wide Web Conference. ACM, 2413–2424.
Received 22 July 2021; revised 14 February 2022; accepted 20 April 2022