Entity resolution (ER) is the process of linking records that refer to the same entity. Traditionally, this process
compares attribute values of records to calculate similarities and then classifies pairs of records as referring
to the same entity or not based on these similarities. Recently developed graph-based ER approaches combine
relationships between records with attribute similarities to improve linkage quality. Most of these approaches
only consider databases containing basic entities that have static attribute values and static relationships, such
as publications in bibliographic databases. In contrast, temporal record linkage addresses the problem where
attribute values of entities can change over time. However, neither existing graph-based ER nor temporal
record linkage can achieve high linkage quality on databases with complex entities, where an entity (such as
a person) can change its attribute values over time while having different relationships with other entities at
different points in time. In this article, we propose an unsupervised graph-based ER framework that is aimed
at linking records of complex entities. Our framework provides five key contributions. First, we propagate
positive evidence encountered when linking records to use in subsequent links by propagating attribute val-
ues that have changed. Second, we employ negative evidence by applying temporal and link constraints to
restrict which candidate record pairs to consider for linking. Third, we leverage the ambiguity of attribute values to disambiguate similar records that, however, belong to different entities. Fourth, we adaptively exploit
the structure of relationships to link records that have different relationships. Fifth, using graph measures,
we refine matched clusters of records by removing likely wrong links between records. We conduct extensive
experiments on seven real-world datasets from different domains showing that on average our unsupervised
graph-based ER framework can improve precision by up to 25% and recall by up to 29% compared to several
state-of-the-art ER techniques.
CCS Concepts: • Theory of computation → Data integration; • Information systems → Entity reso-
lution;
Additional Key Words and Phrases: Record linkage, data linkage, data cleaning, dependency graph, temporal
data, ambiguity
ACM Reference format:
Nishadi Kirielle, Peter Christen, and Thilina Ranbaduge. 2023. Unsupervised Graph-Based Entity Resolution
for Complex Entities. ACM Trans. Knowl. Discov. Data. 17, 1, Article 12 (February 2023), 30 pages.
https://doi.org/10.1145/3533016
Authors’ address: N. Kirielle, P. Christen, and T. Ranbaduge, School of Computing, The Australian National University,
Canberra, ACT 2600, Australia; emails: {nishadi.kirielle, peter.christen, thilina.ranbaduge}@anu.edu.au.
ACM Transactions on Knowledge Discovery from Data, Vol. 17, No. 1, Article 12. Publication date: February 2023.
12:2 N. Kirielle et al.
1 INTRODUCTION
Entity resolution (ER) is the process used in data integration to identify and group records
into clusters that refer to the same entity where records can be sourced from one or multiple
databases [7, 41]. Generally, records used in ER have multiple attributes (commonly known as
quasi-identifiers [10]) that describe an entity. For example, a person entity can have a birth record
with attributes such as the baby’s name, sex, place of birth, date of birth, as well as the details of
the parents. The process of integrating different databases is required in domains such as health
analytics, national censuses, e-commerce, crime and fraud detection, national security, and the
social sciences [7, 15, 41].
Traditional ER approaches only consider the similarities between attribute values of each com-
pared record pair individually to identify matches [16], while graph-based collective ER techniques
make use of the relationships between entities to improve match decisions [2, 14, 28, 33].1
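The traditional pairwise approach described above can be sketched in a few lines; the attribute names, the similarity function, and the 0.8 threshold below are illustrative assumptions, not the configuration used in this article.

```python
from difflib import SequenceMatcher

def attr_sim(v1, v2):
    # Normalised string similarity in [0, 1]; real ER systems use
    # measures such as Jaro-Winkler or q-gram similarity instead.
    return SequenceMatcher(None, v1, v2).ratio()

def classify_pair(rec1, rec2, attrs, threshold=0.8):
    # Traditional ER: average the attribute similarities of one record
    # pair and classify it independently of all other pairs.
    sims = [attr_sim(rec1[a], rec2[a]) for a in attrs]
    return sum(sims) / len(sims) >= threshold

r1 = {"first": "mary", "surname": "smith"}
r2 = {"first": "mary", "surname": "smyth"}
print(classify_pair(r1, r2, ["first", "surname"]))  # True (a likely match)
```

Graph-based collective ER extends exactly this pairwise decision with relational evidence, which is what the rest of this article develops.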
Most research in ER has only focused on entities that have static attribute values, where these
values can contain variations, abbreviations, and errors, or be missing. Such entities also only
have static relationships that are the same in all records that represent the same entity [2, 14]. We
refer to such entities as basic entities. Basic entities are, for example, publications or consumer
products. When linking publication records across two bibliographic databases, for example, an
article published in a conference or journal has the same title, venue, and a single author or a group
of coauthors across both databases, potentially with some data errors, variations, or missing values
in these attributes. These attribute values (unless being corrected after publication), however, do
not change for a given publication record. Similarly, the relationship of being an author in a given
publication is fixed and also does not change over time. This is assuming the ER task is to link
publications across two databases; the task of linking authors will involve complex entities (as
described next) because the details of authors, such as their names and affiliations, can change
over time.
Research in temporal record linkage has explored the effect of temporally changing attribute
values, such as address or name changes when people move or get married, in the ER process [35].
However, these approaches are limited to adjusting the attribute similarities between individual
records based on their temporal distances and the likelihood that an attribute value can change
over time. For example, address values generally change more often than surname values [27] as
people are more likely to change their address in a given period of time than their surname. These
approaches, however, do not consider that the relationships encountered between certain types of
entities, such as people, can also be different at various points in time. We refer to types of entities
that can have changing attribute values as well as different types of relationships at various points
in time as complex entities. As we show in Section 6, existing graph-based collective ER techniques
fail to achieve high linkage quality for situations where complex entities need to be resolved [2]
due to the changing nature of attribute values and different relationships.
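The temporal adjustment used by temporal record linkage approaches of the kind cited above can be sketched as a decay of attribute agreement weights over the temporal gap between two records; the exponential form and the per-attribute change rates below are illustrative assumptions only.

```python
import math

# Assumed per-attribute yearly change rates: addresses change far
# more often than surnames, so address agreement decays faster.
CHANGE_RATE = {"surname": 0.02, "address": 0.20}

def decayed_weight(attr, years_apart):
    # The longer the temporal gap, the less agreement (or
    # disagreement) on a frequently changing attribute counts.
    return math.exp(-CHANGE_RATE[attr] * years_apart)

# After 20 years, surname evidence retains most of its weight,
# while address evidence has decayed almost completely.
print(round(decayed_weight("surname", 20), 2))  # 0.67
print(round(decayed_weight("address", 20), 2))  # 0.02
```

Note that such a decay only rescales pairwise similarities; it does not model the changing relationships that complex entities exhibit, which is the gap this article targets.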
As an example, if we consider a set of birth, marriage, and death certificates as a set of databases
of complex entities, then these databases will contain records of people at different stages of their
life. For instance, the same person can appear as a baby in a birth certificate, a bride in a marriage
certificate, and then as a mother in the birth certificates of her own children. The structure of
relationships in these certificates, within the same or across different databases, can be complex
because the same entity can play a different role in each relationship and can have different types
of relationships at different points in their lives. As a baby, an entity has a childOf relationship with
1 As we describe formally in Section 3, throughout this article, we refer to a set of matched records that supposedly correspond to the same entity as a cluster of records, while we name a set of records that are relationally connected as a group of records.
Unsupervised Graph-Based Entity Resolution for Complex Entities 12:3
Table 1. Frequencies (Minimum, Average, and Maximum) of the Number of Entities that Share an
Attribute Value in Databases that Contain Complex or Basic Entities, where we Only Show the Most
Unique Attribute (with the Lowest Average Frequency)

Dataset             Domain         Entity   Attribute with             Entity     Attribute Value Frequencies
                                   Type     Least Ambiguity            Count      Minimum  Average  Maximum
Isle of Skye [45]   Demographic    Complex  First name (birth babies)  12,285     1        21.80    1,089
Kilmarnock [45]     Demographic    Complex  First name (birth babies)  23,715     1        5.82     1,837
IPUMS [47]          Census         Complex  First name                 21,828     1        5.68     1,009
NCVR [47]           Census         Complex  Middle name                8,214,211  1        18.31    181,839
Scholar [13]        Bibliographic  Basic    Publication title          64,263     1        1.02     51
Million Songs [13]  Songs          Basic    Song title                 35,463     1        1.14     110
IMDB [13]           Movies         Basic    Movie name                 6,407      1        1.15     3
her parents from the birth record, while as a married bride she then has a spouseOf relationship with her husband, and when she has a baby herself, she will have a motherOf relationship in her baby’s birth certificate. In Section 2, we use a motivating example to describe the different challenges that can occur with complex entities.
Furthermore, ambiguity in attribute values is a common problem in the ER process that involves
both basic and complex entities. Entities such as people seem to have higher levels of ambiguity in
their attribute values compared to entities such as publications. It is common for many individuals,
for example, to share the same first name or surname, the same city and postcode values, or the
same occupation. On the other hand, publication titles are generally rather unique. In Table 1,
we illustrate this issue by showing the least ambiguous attribute (where values are shared by the
smallest numbers of entities based on ground truth data) from a variety of datasets as commonly
used in ER research. Publication titles in Scholar, song titles in Million Songs, and movie titles in
IMDB, stand out with an average of only slightly more than one entity having a given attribute
value. On the other hand, for the Isle of Skye (IOS) dataset [45] (which we use in our evaluation
in Section 6), the values of first names are on average shared by more than twenty individuals (at
least five individuals for the other datasets that contain complex entities). This higher ambiguity
makes the ER process more challenging.
Although collective ER work exists that studies disambiguation [2] and changing attribute values [14], none of it has explored how to address the problem of disambiguation in a context
where attribute values as well as relationships can change over time. For example, if we have two
person records of a woman before and after her marriage in which she changes her surname, and
both surnames before and after her marriage are ambiguous (such as “Miller” and “Smith”), then
we still need to be able to identify that these two records refer to the same woman.
Another important aspect is that many practical ER applications suffer from missing, incomplete,
or biased ground truth data (known true matches and non-matches). Particularly in the context
of databases that contain complex entities such as person records, ground truth data are often
not available, or if available, they might be limited to manually curated, biased, and incomplete
matches [10]. Therefore, in spite of the growing interest in applying supervised techniques such
as deep learning [31, 37, 39, 53], unsupervised techniques are still required in many practical ER
applications.
Being able to link complex entities is highly important in domains such as medical research,
where linking patient records of individuals and families over time can help detect patterns of
how diseases spread through households and communities, and even facilitate novel genealogy
studies [33]; in national census analyses that help governments to better understand patterns
of education, migration, fertility, and social mobility over time [10, 18]; in social network anal-
ysis to identify the interests and connections of individuals; and in the domain of population
reconstruction [3] that intends to link databases of whole populations to reconstruct family trees
Table 2. Sample Records from One Birth Certificate and Two Death Certificates, as Discussed
in the Example in Section 2

Certificate ID  Event Type  Year  Birth Baby/Deceased Person  Mother               Father            Spouse
B1              Birth       1767  Mary Smith (r1)             Margeret Smith (r2)  John Smith (r3)   –
D1              Death       1827  Mary Taylor (r4)            Margery Smyth (r5)   John Smith (r6)   Nichol Taylor (r7)
D2              Death       1777  Anne Smith (r8)             Maria Smith (r9)     Jonn Smith (r10)  Duncan Hunter (r11)

For simplicity, we only show the name attribute of each record. However, in real data each such record will have
various other attributes including an address, an occupation, and a date of birth, marriage, or death, respectively, to
name a few.
over time that can be used for data analysis in demography, sociology, and genealogy [17]. Of current interest, reconstructing a historical population from 1918 would allow analysing how the Spanish flu spread [48]. Better understanding such historical pandemics at the scale of
a full population can help public health researchers and governments when dealing with health
crises, such as the current COVID-19 pandemic, and to be better prepared for future outbreaks of
infectious diseases.
Our aim in this work is to provide an unsupervised ER framework that can link records of
complex (as well as basic) entities while addressing the challenges current graph-based ER and
temporal linkage cannot handle adequately. We address five challenges in ER that are fundamental in linking records about complex entities, and we elaborate on these challenges with a motivating example in the following section. We conduct extensive experiments on four datasets
that contain complex entities and three datasets containing basic entities to illustrate how our
proposed framework outperforms state-of-the-art ER approaches.
Contribution: We propose a novel unsupervised graph-based ER framework that is focused on addressing the challenges associated with resolving complex entities (referred to as RELATER, which stands for propagation of constRaints and attributE values, reLationships, Ambiguity, and refinemenT for Entity Resolution, reflecting the main contributions of our work). We propose a global
method of propagating attribute values and constraints to capture changing attribute values and
different relationships, a method for leveraging ambiguity in the ER process, an adaptive method of
incorporating relationship structure, and a dynamic refinement step to improve clusters of records
by removing likely wrong links between records. RELATER can be employed to resolve records of
both basic and complex entities, as we will show using extensive experiments in Section 6.
2 MOTIVATING EXAMPLE
As shown in Table 2, let us consider a set of complex entities where we are interested in resolving eleven person records (r1 to r11) from one birth certificate and two death certificates. We assume a birth (B) certificate describes a birth baby (Bb) and its mother (Bm) and father (Bf), while a death (D) certificate describes a deceased person (Dd), their mother (Dm) and father (Df), and possibly their spouse (Ds). Similarly, a marriage (M) certificate would describe a bride (Mb) and a groom (Mg), the bride’s mother (Mbm) and father (Mbf), and the groom’s mother (Mgm) and father (Mgf).
Given the three example certificates in Table 2, we are interested in finding which person entities
are associated with these eleven records, hence which records need to be linked such that each
resulting cluster of records represents one entity. As an initial step, we need to extract the records
from the certificates, where B1 will contribute three person records, Mary Smith (r1), Margeret Smith (r2), and John Smith (r3), and likewise for the other certificates. We then need to determine if Mary Smith in B1 is the deceased person in D1 or D2, or if she is the mother on either/both of
these two death certificates. Similarly, the other records in this example have different roles and
relationships.
Assume that Mary Smith (r1) in B1 is the deceased in D1, Mary Taylor (r4), and that the deceased in D2, Anne Smith (r8), could be the sibling of Mary Smith (r1). Therefore, the goal of our ER process is to find the following clusters of records, which correspond to six different person entities: (r1, r4), (r2, r5, r9), (r3, r6, r10), (r7), (r8), and (r11).
There exist different challenges in this example that are of interest particularly for resolving
complex entities. The primary challenge is the identification problem as defined by Bhattacharya
and Getoor [2], where we need to figure out the set of records that refer to each entity. While this
problem has been explored in the collective ER literature [2, 14, 28, 33], as we discuss next, some
aspects in our example have either not been investigated so far, or improvements are required
because existing methods fail to obtain high linkage quality for complex entities (as we will show
in Section 6).
The second challenge is how to resolve entities with changing attribute values. Mary Smith in r1 has a different surname (r4) in her death certificate, D1, which is likely due to the change of her surname when she got married. Assume we have linked r1 with Mary Smith’s marriage certificate (not shown), where this link of records states that her surname has changed to Taylor. In such a scenario, if we can propagate the link decision (of her birth record with her marriage record) to the link decision of her birth and death records, then we can easily identify that Mary Smith is the same person as Mary Taylor based on her linked birth and marriage records. While existing temporal record linkage approaches [27, 35] address the challenge of changing attribute values by applying techniques such as temporal decays of attribute weights to capture temporal changes, these solutions do not address the problem of different relationships of the same entity found in records at various points in time.
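One possible sketch of such propagation is to let a cluster accumulate the attribute values of its linked records, so that later comparisons use the best-matching known value rather than the value on a single record; the helper names, values, and similarity function here are illustrative assumptions.

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Placeholder string similarity; a real system would use a
    # dedicated approximate string comparison function.
    return SequenceMatcher(None, a, b).ratio()

# A cluster accumulates all attribute values seen in its linked
# records; here, Mary's cluster holds both her maiden and her
# married surname after her birth and marriage records were linked.
cluster_surnames = {"smith", "taylor"}

def propagated_sim(cluster_values, new_value):
    # Compare a candidate record against the *best* matching value
    # known for the cluster, not only the value on one record.
    return max(sim(v, new_value) for v in cluster_values)

# Her death record's surname now matches perfectly via propagation.
print(round(propagated_sim(cluster_surnames, "taylor"), 2))  # 1.0
```

Without the propagated value "taylor", the comparison against "smith" alone would score far too low to link her birth and death records.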
The third challenge is how to incorporate the different relationships into the ER process to discover positive or negative evidence that guides linking decisions. Assume r1 and r4 in the example in Table 2 have been linked, and now we are interested in knowing if r4 and r9 refer to the same entity, as we still do not know if r1 and r8 are siblings. Here, even though both r1 and r4 have relationships with their mothers and fathers, r9 has different relationships, namely her baby Anne (r8) and her spouse, John (r10). These different relationships occurring in records at different points in time can provide negative evidence for any subsequent link decisions, for instance in the form of constraints. For example, in order to decide if r4 refers to the same entity as r9, we can propagate temporal information from the link decision of r4 with r1 (Bb) discussed above. In the temporal domain, biological constraints become relevant: in our example, for a birth baby to become a mother there should be a gap of at least around 15 years. Therefore, we can decide that r1 and r9 cannot be linked (refer to the same entity) as they are only 10 years apart.
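This biological constraint can be sketched as a simple rule over role pairs; the 15-55 year range follows the example above, while the function and table names are our own illustration, not the framework's actual interface.

```python
# Temporal constraints per role pair as (minimum, maximum) year
# gaps; a birth baby can plausibly reappear as a birth mother only
# 15 to 55 years later, per the biological argument in the text.
TEMPORAL = {("Bb", "Bm"): (15, 55)}

def valid_merge(role1, year1, role2, year2):
    lo, hi = TEMPORAL.get((role1, role2), (None, None))
    if lo is None:
        return True  # no constraint registered for this role pair
    gap = year2 - year1
    return lo <= gap <= hi

# r1 (birth baby, 1767) vs r9 (birth mother, 1777): only 10 years
# apart, so the pair is excluded from linking.
print(valid_merge("Bb", 1767, "Bm", 1777))  # False
```

Such checks prune candidate pairs before any similarity is computed, turning negative relational evidence into hard restrictions.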
In a context where relationships are considered, Dong et al. [14] propagated link decisions by considering attribute value changes and applying constraints. They performed an exhaustive search to find all record pairs associated with any of the linked records, and then merged attributes and used the transitive closure property to remove any additional record pairs [14]. However, no existing graph-based ER work has explored how to efficiently propagate attribute value changes and apply constraints. As we discuss in Section 4.1, we propose an efficient method that avoids an exhaustive search to propagate link decisions. Furthermore, no research has so far explored how this propagation of link decisions is affected when the attribute values of entities are ambiguous.
This disambiguation problem, as we showed in Table 1, is where a given attribute value is shared by multiple (possibly many) entities. Values that are shared by only a small number of entities provide stronger evidence that two records refer to the same entity. For example, if we look at
the attribute values of the records in Table 2, we can see that the surname Smith occurs more often than the surname Hunter. As a result, if we have two records, such as r1 and r8, which both have the surname Smith, then this shared value does not provide sufficient evidence to link those records because Smith is ambiguous. However, if we find a new record with the surname Hunter, it is more likely that the new record represents the same entity as r11 because Hunter is unique in our example. Bhattacharya and Getoor [2] have explored ambiguity of static attribute values in relational clustering for collective ER. However, they have not investigated how to incorporate disambiguation while propagating link decisions, or when attribute values can change over time.
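One plausible way to quantify such ambiguity is an inverse-frequency weight over attribute values, sketched below with illustrative counts mirroring the Smith-versus-Hunter example; this is not necessarily the exact measure used by RELATER.

```python
from collections import Counter
import math

# Illustrative value frequencies for the surname attribute,
# mirroring the example above: Smith is common, Hunter unique.
surname_counts = Counter({"smith": 6, "taylor": 2, "hunter": 1})

def ambiguity_weight(value, counts):
    # Rarer values give stronger linking evidence, akin to inverse
    # document frequency: a unique value weighs more than a common one.
    total = sum(counts.values())
    return math.log(total / counts[value])

# Agreement on Hunter is much stronger evidence than on Smith.
print(ambiguity_weight("hunter", surname_counts) >
      ambiguity_weight("smith", surname_counts))  # True
```

A weight of this kind can scale each atomic similarity so that agreement on an ambiguous value contributes little toward a merge decision.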
In collective ER, we are interested in linking records that are relationally connected with other
records. For example, consider the two connected record groups of B1 and D2 in Table 2. If we assume that Mary Smith and Anne Smith are siblings, then we should not link them. However, the parent record pairs in that group, (r2, r9) and (r3, r10), need to be linked as each pair refers to the same parent. We refer to this challenge as the partial match group problem, where only a subset of
relationally connected records correspond to the same entities while others do not. While recent
ER approaches take relationships into account by either incorporating relationship information
into the similarity calculation [2, 14] or by making a group link decision [19, 40], these approaches
would fail to properly link the parent records in this example because the overall similarity drops
due to the different sibling first names.
The final challenge is the one of incorrect link decisions. Because the process of linking two
records is no longer independent from linking other records in the context of collective ER [2, 14],
a single wrong link might propagate into other link decisions and result in an increase in the
number of false matches as well as missed true matches. For example, assume in Table 2 that
we have incorrectly linked r1 with r8 given that both their parents’ first names are similar and their surnames are the same. However, as a deceased person can only be linked to a single birth baby, r8 will then not be linked to its correct birth record, and similarly r4 might get linked to a wrong birth record. To the best of our knowledge, this challenge has not yet been addressed in the literature.
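One graph measure that can expose such a wrong link is whether an intra-cluster edge is the lone bridge between two otherwise well-connected record groups. The following standalone sketch (our own illustration, not RELATER's exact refinement step) flags such edges using only the standard library.

```python
from collections import defaultdict

def components(nodes, edges):
    # Connected components via iterative depth-first search.
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def likely_wrong_links(nodes, edges):
    # A link is suspicious if removing it splits the cluster into two
    # non-trivial parts, i.e. it is the lone bridge between two
    # otherwise well-connected record groups.
    suspects = []
    for e in edges:
        rest = [f for f in edges if f != e]
        parts = components(nodes, rest)
        if len(parts) == 2 and all(len(p) > 1 for p in parts):
            suspects.append(e)
    return suspects

# Two tightly linked triangles joined by a single weak link:
nodes = ["r1", "r4", "rX", "r8", "rY", "rZ"]
edges = [("r1", "r4"), ("r4", "rX"), ("rX", "r1"),   # Mary's records
         ("r8", "rY"), ("rY", "rZ"), ("rZ", "r8"),   # sibling's records
         ("r1", "r8")]                                # likely wrong link
print(likely_wrong_links(nodes, edges))  # [('r1', 'r8')]
```

In practice, measures such as edge betweenness generalise this bridge test to clusters that are not perfectly separable.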
Fig. 1. The challenges of resolving complex entities defined in Section 3. We use vSN to show values from a surname attribute. The direction of arrows corresponds to the relationships between records as well as attribute values. Attributes are shown in green, records in light blue, entities in squares, and record clusters with a shaded box.
Fig. 2. Pipeline of RELATER where blue coloured boxes are the key techniques described in Section 4 and
white coloured boxes represent the three main steps described in Section 5.
(c) Disambiguation problem: We refer to the challenge of distinguishing entities having the same attribute value (for example, many people having the common surname Smith) as the disambiguation problem.
(d) Partial match group problem: Let Rα ⊂ R and Rβ ⊂ R be two groups of records, with Rα ∩ Rβ = ∅, where we assume the records in each group are relationally connected with each other. When two such groups are compared for linking, in the set of paired records, {(ri, rk), (rj, rl)}, where ri, rj ∈ Rα and rk, rl ∈ Rβ, if ∃(ri, rk), (rj, rl) : ri.o ≠ rk.o ∧ rj.o = rl.o, we define such a group as a partial match group. We refer to this challenge of having some record pairs that refer to different entities while other record pairs refer to the same entity in relationally connected record groups (such as linking parents across the birth records of siblings, but not linking the siblings) as the partial match group problem.
(e) Incorrect link problem: Let M be the set of record clusters in the record set R that have been linked. Assume mk ⊆ R, where mk ∈ M and ∃ri, rj ∈ mk : ri.o ≠ rj.o. This challenge is the incorrect link problem, where we have records representing different entities in the same cluster of records, such as an entity cluster that represents a certain individual containing a record of a sibling.
Definition 3.1 (ER of Complex Entities). Given a set of records, R, the ER problem of resolving records of complex entities is to link records ri ∈ R into clusters of records mk such that R = {ri : ∀ri ∈ mk, ∀mk ∈ M} (all records in R have been inserted into a cluster) with M = ∪{mk} and ∀mi, mj ∈ M : mi ∩ mj = ∅ (each record has been inserted into only one record cluster); O = {ri.o : ∀ri ∈ mk, ∀mk ∈ M} and ∀ri ∈ mk : ri.o = oj, ∀mk ∈ M (every entity in O is represented by one record cluster); and ∀mk ∈ M : |mk| ≥ 1 (each record cluster contains one or more records, where records that were not linked are clusters of size 1), in a context where relationally connected record groups can contain partial match groups and the records of an entity can have changing attribute values, different relationships in different records, and ambiguous attribute values shared with other entities.
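The set conditions of Definition 3.1 can be checked mechanically for a candidate clustering; the following sketch (our own illustration) validates coverage, disjointness, and non-empty clusters on part of the motivating example.

```python
def is_valid_clustering(records, clusters):
    # Checks the set conditions of Definition 3.1: every record lies
    # in exactly one cluster, and every cluster is non-empty
    # (unlinked records form singleton clusters).
    covered = [r for m in clusters for r in m]
    return (sorted(covered) == sorted(records)       # full coverage
            and len(covered) == len(set(covered))    # pairwise disjoint
            and all(len(m) >= 1 for m in clusters))  # |m_k| >= 1

# The target clustering of the motivating example (Section 2),
# restricted to a few records for brevity:
records = ["r1", "r2", "r4", "r5", "r7", "r9"]
clusters = [["r1", "r4"], ["r2", "r5", "r9"], ["r7"]]
print(is_valid_clustering(records, clusters))  # True

# A record appearing in two clusters violates disjointness:
print(is_valid_clustering(records, [["r1", "r4"], ["r4", "r7"]]))  # False
```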
Figure 2 shows the pipeline of our framework where the input is the groups of relationally
connected records extracted from one or more databases, and the output is a set of entities repre-
sented as clusters of records. We now provide an overview of the three main steps of RELATER as
described in detail in Section 5 (the white coloured boxes in Figure 2). In Section 4, we then discuss
how each key technique (the blue coloured boxes in Figure 2) contributes to the pipeline.
(1) Dependency Graph Generation: To resolve records, we need to represent them in a data
structure that can capture the relationships among records. Hence, we generate a depen-
dency graph [14] defined as follows.
Fig. 3. Dependency graph generation from a birth certificate and two death certificates, as discussed in Section 2. Relationships between records are derived from the structure of certificates (such as a birth certificate containing a baby Bb, a mother Bm, and a father Bf). Each double-ended arrow corresponds to two single directed edges in the dependency graph, while a single directed edge means that the node at the head (arrow) depends on the similarity of the node at the tail. Atomic nodes are shown in green while active relational nodes are shown in blue.
Definition 3.2. A dependency graph is a directed graph, GD, that consists of a set of nodes, ND, where these nodes represent pairs of attribute values or pairs of records; and a set of edges, ED, that represent relationships between nodes. ND consists of atomic nodes, NA, that represent pairs of attribute values, and relational nodes, NR, that represent pairs of candidate records that possibly refer to the same entity, where ND = NA ∪ NR.
To generate the dependency graph, we potentially first have to extract records represent-
ing individual entities (unless the input dataset already contains such individual records).
For example, as shown in Table 2 and Figure 3, to generate the dependency graph for person data, we first extract individual records from birth and death certificates. Then, as we describe in Section 5.1, for each pair of similar values in an attribute a ∈ A (with similarities greater than a threshold ta), vi and vj, we add a node (vi, vj) ∈ NA to GD. We repeat this process for a selected set of quasi-identifying attributes that represent an entity. For each pair of records, (ri, rj) ∈ R, that possibly refer to the same entity (based on blocking, as we elaborate on in Section 5.1), we add a node (ri, rj) ∈ NR to GD.
A directed edge in GD represents that the similarity of the destination node depends on the
similarity of the source node. Edges between nodes in NR represent relationships between
records, such as motherOf or authorOf. For each node in NR, the set of its adjacent nodes with incoming edges is denoted by C = CA ∪ CR, where CA and CR are the sets of adjacent atomic and relational nodes of the specified node, respectively. For each ri.vi and rj.vj, if the node (vi, vj) ∈ NA, then there is a directed edge (vi, vj) → (ri, rj), with (vi, vj) ∈ CA for the relational node (ri, rj). For each pair of nodes ni, nj ∈ NR, if there is a relationship between these nodes, then there exist two directed edges between them: ni → nj (where ni ∈ Cj for nj) and nj → ni (where nj ∈ Ci for ni). For example, there will be two edges between a mother node and a child node. To keep it simple, we show these as double-arrowed edges in the example figures.
Figure 3(b) shows an example of a small dependency graph. Since each relational node
in this graph is associated with two records, we refer to linking two records in a node as
merging the node. Each node in GD is also associated with a node state [14], where this state
changes throughout the running of our framework. The possible states are active (considered
for merging), inactive (failed merging due to insufficient evidence such as low similarity),
merged (two records in the node are linked), and non-merge (not considered for merging due
to constraint violations).
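The node and edge construction of this step can be sketched as follows; the class layout, the threshold, and the similarity function are illustrative assumptions, not the framework's actual implementation.

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Placeholder string similarity; the framework would use
    # dedicated approximate string comparison functions.
    return SequenceMatcher(None, a, b).ratio()

class DependencyGraph:
    """Directed graph of atomic nodes (attribute-value pairs) and
    relational nodes (candidate record pairs), per Definition 3.2."""

    def __init__(self):
        self.atomic = {}       # (v_i, v_j) -> similarity
        self.relational = {}   # (r_i, r_j) -> node state
        self.edges = []        # (source node, destination node)

    def add_atomic(self, vi, vj, ta=0.7):
        # Only value pairs above the threshold t_a become nodes.
        s = sim(vi, vj)
        if s >= ta:
            self.atomic[(vi, vj)] = s

    def add_relational(self, ri, rj):
        # Candidate record pairs (from blocking) start as 'active'.
        self.relational[(ri, rj)] = "active"

    def add_edge(self, src, dst):
        # The destination node's similarity depends on the source.
        self.edges.append((src, dst))

g = DependencyGraph()
g.add_atomic("smith", "smyth")    # similar surnames: node added
g.add_atomic("smith", "hunter")   # too dissimilar: no node
g.add_relational("r1", "r4")
g.add_edge(("smith", "smyth"), ("r1", "r4"))
```

The node states (active, inactive, merged, non-merge) would then be updated by the bootstrapping and linking steps described next.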
(2) Bootstrapping: In this step, described in more detail in Section 5.2, we merge highly
similar groups of nodes in GD that have an average similarity greater than a predefined
4 KEY TECHNIQUES
In this section, we describe the key techniques, including all novel contributions, underlying the
RELATER framework that solve the five challenges described in the previous section. These tech-
niques help our framework to achieve high linkage quality specifically for complex entities when
compared to existing ER approaches.
Fig. 4. Global propagation of attribute values. As r1 is associated with a record cluster, m1, we replace the atomic node of surnames from node (r1, r4) with the surnames that have the highest similarity between m1 and r4, (Tayler, Taylor). Atomic nodes are shown in green while active and merged relational nodes are shown in blue and yellow, respectively.
from the (Tayler, Taylor) node to the relational node (r1, r4). In this way, even if an individual changes their name or address over time, our framework can still identify them based on previous links or highly similar attribute values. With this attribute propagation step, as connected atomic nodes are changing, the similarity of each relational node can change through the ER process, which is a significant improvement over previous collective ER approaches that do not consider such attribute value changes.
In the context of collective ER, the idea of propagating linking decisions was first proposed by Dong et al. [14]. However, our propagation method differs from this previous approach
as we make a global propagation of attribute values using a unified view of all record clusters,
M, that represent entities. On the other hand, Dong et al. [14] propagated link decisions with an
exhaustive search that merges relational nodes in the graph.
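To make this concrete, the propagation step can be sketched as follows (an illustrative sketch, not the actual RELATER implementation; the function names are ours, and `sim` is a placeholder for a string similarity such as Jaro-Winkler). Given the surnames accumulated in a record cluster m1 and the surname of record r4, we replace the atomic node of the relational node (r1, r4) with the best-matching value pair, as in Figure 4:

```python
# Sketch (not the authors' implementation) of global attribute value
# propagation: when record r1 already belongs to a cluster m1, the atomic
# node of relational node (r1, r4) is replaced by the value pair with the
# highest similarity between the cluster and r4.

from difflib import SequenceMatcher

def sim(a, b):
    # Placeholder string similarity; RELATER uses measures such as Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()

def propagate_surname(cluster_values, r4_value):
    """Return the (cluster_value, record_value) pair with the highest similarity."""
    best = max(cluster_values, key=lambda v: sim(v, r4_value))
    return best, r4_value

# Cluster m1 holds the surnames seen so far for this entity.
m1 = ['Tayler', 'Smith']
print(propagate_surname(m1, 'Taylor'))  # ('Tayler', 'Taylor'), as in Figure 4
```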
Definition 4.1. Temporal constraints apply to databases with complex entities, where such
constraints restrict (for specific role pairs) whether two records should be considered for linking
or not. We model temporal constraints as a set T = ∪_{(ρ1, ρ2) ∈ P} T_{ρ1, ρ2} of time periods
where records can and cannot be linked.
Definition 4.2. Link constraints restrict, for a given role pair, how many links a record can be
involved in. We model link constraints as a set L = ∪_{(ρ1, ρ2) ∈ P} L_{ρ1, ρ2} of one-to-one or
one-to-many constraints that determine how many records can be involved in a specific
relationship for this role pair.
For example, the temporal constraint between the roles of birth baby and birth mother, T_{Bb, Bm},
can be represented as (ri.ρ = Bb) ∧ (rj.ρ = Bm) ∧ (15 ≤ YearTimeGap(ri, rj) ≤ 55) ⟹
ValidMerge(ri, rj). Similarly, the one-to-one link constraint between the roles of birth baby and
deceased person, L_{Bb, Dd}, can be represented as (ru.ρ = Bb) ∧ (rv.ρ = Dd) ∧ (|Links(ru, Dd)| =
0) ∧ (|Links(rv, Bb)| = 0) ⟹ ValidMerge(ru, rv), which means that records ru and rv cannot
be involved in any other links to a deceased person and a birth baby, respectively.
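These two example constraints can be checked with simple predicates, as in the following hedged sketch (helper names, signatures, and the link-count representation are ours, not RELATER's):

```python
# Illustrative encoding of the example temporal and link constraints of
# Definitions 4.1 and 4.2; role codes (Bb, Bm, Dd) follow the text.

def valid_temporal(role_i, role_j, year_gap, low=15, high=55):
    """Temporal constraint T_{Bb,Bm}: a birth baby and a birth mother can
    only be linked if the gap between the two events is 15-55 years."""
    if (role_i, role_j) == ('Bb', 'Bm'):
        return low <= year_gap <= high
    return True  # no temporal constraint registered for this role pair

def valid_link(links_u_to_Dd, links_v_to_Bb):
    """One-to-one link constraint L_{Bb,Dd}: a merge is only valid if
    neither record is already involved in a link for this role pair."""
    return links_u_to_Dd == 0 and links_v_to_Bb == 0

print(valid_temporal('Bb', 'Bm', 30))      # True
print(valid_temporal('Bb', 'Bm', 60))      # False
print(valid_link(0, 0), valid_link(1, 0))  # True False
```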
Fig. 5. Similarity calculation of node (r1, r4). Assuming we set wM, wC, and wE to 0.5, 0.3, and 0.2, respectively
(determined via domain knowledge), and consider first name (Mary, Mary) as a Must attribute, surname
(Tayler, Taylor) as a Core attribute, and city (Klmor, Kilmore) as an Extra attribute, then sima(r1, r4) can
be calculated as (0.5·1.0 + 0.3·0.9 + 0.2·0.9) / (0.5 + 0.3 + 0.2) = 0.95 using Equation (1). Similarly, assuming
r1.f = 45, r4.f = 12, and |O| = 100, using Equation (2) we can calculate simd(r1, r4) as
log2(100/(45 + 12)) / log2(100) = 0.12.
while wM, wC, and wE represent their corresponding weights, which can be learnt from a training
dataset or determined via domain knowledge [8, 45]. As Extra attributes are subsidiary, the
presence of an Extra attribute provides positive evidence for a match, while its absence does not
necessarily provide negative evidence. This holds because we add atomic nodes only if the
similarity of the two attribute values in a node is above a pre-defined threshold, ta. Therefore,
we set wE = 0.0 if all Extra attributes are absent.
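Under our reading of Equation (1), the weighted attribute similarity of Figure 5 can be reproduced with the following sketch (illustrative code, not the RELATER implementation; the per-class similarities are passed in as lists):

```python
# Sketch of the weighted attribute similarity in the spirit of Equation (1):
# must/core/extra hold the atomic node similarities per attribute class.

def sim_a(must, core, extra, w_m=0.5, w_c=0.3, w_e=0.2):
    avg = lambda xs: sum(xs) / len(xs)
    if not extra:
        w_e = 0.0  # all Extra attributes absent: no negative evidence
    num = w_m * avg(must) + w_c * avg(core) + (w_e * avg(extra) if extra else 0.0)
    return num / (w_m + w_c + w_e)

# Worked example from Figure 5: Must 1.0, Core 0.9, Extra 0.9.
print(round(sim_a([1.0], [0.9], [0.9]), 2))  # 0.95
```

Note that when the Extra attribute is absent, the remaining weights are renormalised rather than the similarity being penalised.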
If the pair of records in a relational node has attribute values that occur frequently in the set of
records R, then a high attribute similarity of that record pair is less important compared with a pair
of records that have rare attribute values and the same attribute similarity [16]. In our example
in Section 2, Smith occurs seven times whereas Hunter occurs only once. Two records having the
surname Hunter, therefore, have a higher likelihood of referring to the same real-world entity
compared to two records having the surname Smith. As the link decisions in our framework
depend on each other, we need to prioritise record pairs with unique or rare attribute values
such that they are processed before record pairs with ambiguous attribute values. As this is similar
to the concept of inverse document frequency as used in information retrieval, we use a normalised
inverse document frequency score [46] as the disambiguation similarity score, simd. Assume
aα, aβ ∈ A are the attributes that we consider for calculating ambiguity. Then, the frequency r.f
for a record r is calculated as the frequency of its attribute value combination, vaα and vaβ, in one
of the duplicate-free datasets that we aim to link. For two records ri and rj in a relational node,
let ri.f and rj.f be the frequencies calculated as described. If the number of unique records in the
dataset (i.e., the number of entities) is |O|, we define simd as shown in Equation (2), where we can
estimate |O| using the same duplicate-free dataset we used to calculate the frequencies.
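This IDF-style disambiguation score can be sketched as follows (our own code, reproducing the worked example of Figure 5; rarer attribute value combinations score closer to 1):

```python
# Sketch of the normalised IDF-style disambiguation similarity (Equation (2)).
import math

def sim_d(f_i, f_j, n_entities):
    """f_i, f_j: attribute value combination frequencies of the two records;
    n_entities: the estimated number of unique entities |O|."""
    return math.log2(n_entities / (f_i + f_j)) / math.log2(n_entities)

# Worked example from Figure 5: r1.f = 45, r4.f = 12, |O| = 100.
print(round(sim_d(45, 12, 100), 2))  # 0.12
```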
Fig. 6. Adaptive leveraging of relationship structure. This group of nodes corresponds to the parents and
siblings in the birth (B1) and death (D2) certificates discussed in Section 2 (see Table 2). In the linking process,
we iteratively remove the node with the lowest similarity, here the sibling node (r1, r8), while the parent
nodes, (r2, r9) and (r3, r10), proceed to merging.
To overcome this problem, RELATER provides a novel adaptive method to exploit the relational
structure of entities. As GD is a dependency graph, a connected component of a group of relational
nodes represents the structure of relationships between records. In order to decide if a pair of
records in a relational node needs to be linked, we consider the average similarity of the relationally
connected node group. Then, if that average similarity is less than a predefined threshold tm , we
adaptively remove the node with the lowest similarity from the group and recalculate the average
similarity.
As per the previous example of siblings, and as illustrated in Figure 6, GD will have a group of
three relational nodes (a triangle) representing the two mothers (r2, r9), two fathers (r3, r10), and
two siblings (r1, r8). To leverage the relational structure, we consider the average similarity (0.63)
of all three nodes in the first iteration. If this average similarity is less than tm, then we remove the
node with the lowest similarity and continue to consider the remaining nodes. Therefore, in the
example in Figure 6, we ignore the lowest similarity sibling node (r1, r8) (as the two records refer
to two different individuals), and continue with the remaining pair of parent nodes, (r2, r9) and
(r3, r10), which now have an average similarity of 0.85, allowing us to proceed with the merging
and to solve the partial match group problem we discussed in Section 3.
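The adaptive removal described above can be sketched as follows (illustrative code; the similarities are chosen to reproduce the 0.63 and 0.85 averages of the example, and all names are ours):

```python
# Sketch of adaptively leveraging relationship structure (REL): iteratively
# drop the lowest-similarity node from a connected group until the average
# similarity reaches tm, or only a single pair remains.

def prune_group(nodes, t_m=0.85):
    """nodes: dict mapping relational node -> similarity."""
    group = dict(nodes)
    while len(group) > 1:
        avg = sum(group.values()) / len(group)
        if avg >= t_m:
            return group  # merge the remaining nodes
        group.pop(min(group, key=group.get))
    return group  # reduced to a single pair

# Figure 6 example: two parent nodes plus a sibling node that belongs to
# a different individual (average 0.63, then 0.85 after pruning).
triangle = {('r2', 'r9'): 0.86, ('r3', 'r10'): 0.84, ('r1', 'r8'): 0.19}
print(sorted(prune_group(triangle)))  # the two parent nodes survive
```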
Fig. 7. (a) A graph with a bridge in red colour (dotted), (b) a graph with high density, and (c) a graph with
low density.
of the undirected graph generated from a record cluster, mk. For such a cluster having at least
three records, |mk| ≥ 3, we calculate the density, and if it is less than a predefined threshold, td,
we remove the node with the lowest degree. Similarly, for a record cluster having more than tn
records, we split the record cluster at any existing bridges. In Section 6, we discuss how to set the
parameters td and tn in our experimental evaluation.
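As a rough illustration of these graph measures (a pure-Python sketch under our own naming, not the RELATER implementation), density-based removal and bridge detection over a record cluster can be written as:

```python
# Sketch of cluster refinement (REF): density-based removal of the weakest
# record, and detection of bridges at which large clusters can be split.

def density(nodes, edges):
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))

def connected(nodes, edges):
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == set(nodes)

def bridges(nodes, edges):
    """An edge is a bridge if removing it disconnects the cluster."""
    return [e for e in edges
            if not connected(nodes, [f for f in edges if f != e])]

def refine(nodes, edges, t_d=0.3):
    """Remove the lowest-degree record if the cluster density is below t_d."""
    if len(nodes) >= 3 and density(nodes, edges) < t_d:
        deg = {v: 0 for v in nodes}
        for a, b in edges:
            deg[a] += 1; deg[b] += 1
        drop = min(deg, key=deg.get)
        nodes = [v for v in nodes if v != drop]
        edges = [e for e in edges if drop not in e]
    return nodes, edges

# A 4-record cluster with one bridge ('c', 'd'), as in Figure 7(a).
nodes = ['a', 'b', 'c', 'd']
edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')]
print(round(density(nodes, edges), 2), bridges(nodes, edges))  # 0.67 [('c', 'd')]
```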
5.2 Bootstrapping
As an initial step of the merging process, we first bootstrap the dependency graph, GD, by merging
highly similar relational nodes. In collective ER, any subsequent links always depend upon the
previous links [2]. Therefore, it is important to bootstrap this graph by only linking record pairs
for which we have high confidence that they are a correct match.
After GD is generated, we have groups of relational nodes of different sizes. In the bootstrapping
step, we consider only nodes in groups (leaving out the singletons), where the average atomic
similarities, following Equation (1), of all nodes in a group must be at least the bootstrap threshold,
which we set to tb = 0.95 in our evaluation in Section 6 (based on a set of initial experiments).
We only consider groups of nodes connected with relationships at this stage, rather than singleton
nodes that are not connected with any other nodes by relationships, because groups provide more
relationship evidence than singletons.
While linking such highly similar node groups, we also propagate the attribute values (PROP-
A) and constraints (PROP-C) and adaptively leverage relationship structures (REL), as shown in
Figure 2. After bootstrapping the dependency graph, GD , we dynamically refine record clusters
(REF) to remove any incorrectly linked record pairs, as we discussed in Section 4.5.
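A minimal sketch of this bootstrap filter (our own code; each group is represented simply by the list of its nodes' atomic similarities, and singletons are lists of length one):

```python
# Sketch of the bootstrapping step: only groups of relationally connected
# nodes whose average atomic similarity reaches tb = 0.95 are merged;
# singleton nodes are left for the later iterative merging.

def bootstrap(groups, t_b=0.95):
    """groups: list of lists of atomic similarities, one list per group."""
    return [g for g in groups
            if len(g) > 1 and sum(g) / len(g) >= t_b]

groups = [[0.98, 0.96], [0.99, 0.80], [0.97]]
print(bootstrap(groups))  # only the first group qualifies
```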
violates any constraints is removed from the group in line 11 and its state is updated to non-merge
in line 12.
We then calculate the average similarity, simg, of all nodes in line 14. If simg exceeds the merge
threshold, tm, then in lines 16 to 18, we merge all nodes in g, add the records to the corresponding
record cluster mk, update the entity graph, GO, with the updated record cluster, update the state
of the merged nodes to merged, and continue to the next group in the queue Q. Otherwise, we
remove the node with the lowest similarity from the group g and check the possibility of merging
the group until it is reduced to a pair (in lines 13 to 20) by adaptively leveraging the relationship
structure (REL), as we described in Section 4.4.
After we have processed all the nodes in the priority queue, Q, we finally refine (REF) the
entities in the entity graph, GO , in line 21. In order to identify and remove the likely wrong links
in the entities associated with the record clusters in M, we utilise the measures of graph bridges
and graph density as we described in Section 4.5. Finally, in line 22, we return the generated entity
graph, GO . The reason why we generate an entity graph as the end result of the framework is
that such a graph can capture all direct and indirect relationships between entities. Each entity
represented by a node in GO by now consists of a set of records. This set of records allows us to
infer all possible links to all related entities, which in turn enumerates all the indirect relationships
between entities.
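Our reading of this merging loop can be sketched as follows (a simplified, illustrative version, not the paper's algorithm verbatim; the priority values, the `violates` hook, and all names are ours):

```python
# Sketch of the iterative merging step: process node groups from a priority
# queue, drop constraint violators, merge whole groups whose average
# similarity reaches tm, and otherwise shrink the group adaptively (REL).

import heapq

def merge_groups(groups, t_m=0.85, violates=lambda node: False):
    """groups: list of (priority, {relational node: similarity}) tuples."""
    queue = [(-p, i, g) for i, (p, g) in enumerate(groups)]
    heapq.heapify(queue)  # highest priority first
    merged = []
    while queue:
        _, _, group = heapq.heappop(queue)
        g = {n: s for n, s in group.items() if not violates(n)}
        while g:
            if sum(g.values()) / len(g) >= t_m:
                merged.append(sorted(g))  # merge all remaining nodes
                break
            if len(g) == 1:
                break  # reduced to a single pair that stays unmerged
            g.pop(min(g, key=g.get))  # adaptive removal (REL)
    return merged

groups = [(0.9, {('r2', 'r9'): 0.86, ('r3', 'r10'): 0.84, ('r1', 'r8'): 0.19}),
          (0.5, {('r5', 'r6'): 0.70})]
print(merge_groups(groups))  # only the parent nodes of the first group merge
```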
6 EXPERIMENTAL EVALUATION
We conduct an extensive set of experiments to address the following questions: (1) How does
RELATER compare to other state-of-the-art ER baselines? (2) How does RELATER scale to large
datasets? (3) How do parameter values affect linkage quality? (4) How does each key technique in
our framework affect linkage quality?
6.1.1 Datasets. We evaluate RELATER on seven real-world datasets from different domains, as
detailed in Table 4. To resolve complex entities, we use three demographic datasets, where the goal is to link
person records across birth, marriage, and death certificates; and one census dataset where the
interest is to link person records across several census snapshots. Furthermore, we resolve basic
entities in three publicly available datasets from the bibliographic and music domains to show that
our framework can obtain high linkage quality for both complex and basic entities.
The demographic datasets include two proprietary datasets from Scotland [45], one from the
remote IOS and the other from the town of Kilmarnock (KIL). Both contain civil certificates
(birth, marriage, and death) of their population over the period from 1861 to 1901. A third dataset
is from the publicly available Brabant Historical Information Center (BHIC) [4]. It contains
civil certificates from North Brabant, a province of the Netherlands, in the period from 1759 to
1969. Demographers with expertise in linking such data have curated and linked the IOS and KIL
datasets [45]. Their semi-automatic approach is heavily biased towards certain types of links, such
as Bp-Bp (links between birth parents), as their interests were in identifying siblings of the same
mother to facilitate the analysis of child mortality. Therefore, we show results of Bp-Bp for which
we have directly curated ground truth links, along with the results of Bp-Dp (links between birth
and death parents) for which we have inferred ground truth links (where in Table 4 p represents
both mother, m, and father, f ). We utilise the BHIC dataset for evaluating the scalability of RELATER
since it is a significantly larger dataset compared to IOS and KIL. However, we cannot show the
linkage quality of the BHIC dataset as we do not have ground truth links. As census data, we used
the 1870 and 1880 census snapshots of families from the US census, publicly available from
IPUMS [47].
To evaluate RELATER for resolving basic entities, we selected datasets with different data charac-
teristics and levels of difficulty to match records. We use a music dataset, Million Songs (MSD) [13],
and two bibliographic datasets, DBLP-ACM [13], and DBLP-Scholar [13]. As DBLP-ACM consists
of two well-structured datasets, it can be considered as a simple dataset to resolve [32]. However,
DBLP-Scholar is more challenging because the publications in Google Scholar have many quality
problems, such as misspellings and different representations of authors and venues [32].
6.1.2 Baselines. To compare RELATER with existing (state-of-the-art) ER approaches, we
selected five baselines, each representing a different ER approach. The first baseline,
Attr-Sim, provides a basic pairwise similarity approach as used in traditional pairwise
linkage techniques [7]. Second, Dep-Graph is an implementation of the collective ER approach
proposed by Dong et al. [14] that propagates link decisions in the ER process. To allow a fair
comparison, we used the same dependency graph and the same set of temporal and link constraints we
used to evaluate RELATER. Third, Rel-Cluster is an implementation of the collective ER approach
proposed by Bhattacharya and Getoor [2] that employs ambiguity of attribute values in the ER pro-
cess. In Rel-Cluster, we apply the same set of temporal and link constraints applied to RELATER for
a fair comparison. Fourth, ZeroER [50] is a recent unsupervised approach that employs generative
modelling for learning match and non-match distributions to resolve entities. Finally, Magellan is
a state-of-the-art supervised ER system available as an open-source library [31]. As training data,
we used the record pairs generated in the blocking step as we will describe in Section 6.1.3. For
our experiments, we selected four classifiers from Magellan (a support vector machine, a random
forest, a logistic regression, and a decision tree) and averaged their linkage quality results.
6.1.3 Settings. We implemented our framework and baselines in Python 2.7, except for
Magellan, which is implemented in Python 3.6, and conducted all experiments on a server running
Ubuntu 18.04 with 64-bit Intel Xeon 2.10 GHz CPUs and 512 GBytes of memory. The code of our
framework is available in an online repository2 to facilitate repeatability.
For all baselines and RELATER, in the blocking step, we grouped potential matches by employ-
ing a locality sensitive hashing based indexing technique that maps records with similar attribute
pairs to the same block [42]. In the record pair comparison step, we then employed similarity func-
tions such as the Jaro-Winkler similarity for names and the Jaccard similarity for other textual
2 https://github.com/nishadi/RELATER.
strings [7] to compare attribute values between records. For numerical comparisons, we used the
maximum absolute difference [7], and for comparing addresses in the IOS dataset, we geocoded
address strings [30] and calculated similarities based on the distance between two locations. How-
ever, due to the absence or low quality of addresses, we did not consider geocoding for the other
datasets.
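For illustration, the token-based Jaccard and the normalised maximum-absolute-difference comparisons can be sketched as follows (our own minimal versions; Jaro-Winkler is typically taken from a string-matching library and is omitted here):

```python
# Minimal sketches of two of the comparison functions named above.

def jaccard(s1, s2):
    """Token-set Jaccard similarity for textual strings."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def num_sim(x, y, max_abs_diff):
    """Numerical similarity: 1.0 for identical values, falling linearly
    to 0.0 at the maximum tolerated absolute difference."""
    return max(0.0, 1.0 - abs(x - y) / max_abs_diff)

print(jaccard('data mining and ml', 'data mining'))  # 0.5
print(num_sim(1861, 1863, 10))                       # 0.8
```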
For RELATER and all unsupervised baselines, we use the same set of Must, Core, and Extra
attributes (as shown in Table 5) for calculating the attribute similarity, to allow a fair
comparison. In the presence and absence of Extra attributes, we set wM, wC, and wE to 0.6, 0.2, 0.2 and
0.7, 0.3, 0.0, respectively. As there can be several Extra attributes, wE is always lower than wM for
a single attribute.
We set the default merging threshold to tm = 0.85, the atomic node similarity threshold to
ta = 0.9, the weighting distribution in similarity scores to γ = 0.6, and the graph measure
thresholds to tn = 15 (bridges) and td = 30% (density), based on the parameter sensitivity analysis
in Section 6.4. We do not show results varying td as it has a fairly small influence on precision and
recall across different thresholds, because the record clusters we obtain are not very big.
Table 6. Precision (P), Recall (R), and F∗ Measure Results of RELATER Compared
to the Baselines (Averages ± Standard Deviations)
Dataset Role Pair RELATER Attr-Sim Dep-Graph Rel-Cluster ZeroER Magellan
P 98.73 63.67 90.87 93.59 60.53 77.9 ± 33.4
IOS Bp-Bp R 94.70 88.41 65.26 63.72 70.75 72.9 ± 35.1
F∗ 93.56 58.76 61.25 61.06 48.41 60.4 ± 38.6
P 86.44 43.05 0.00 80.91 58.37 67.8 ± 37.9
IOS Bp-Dp R 92.87 72.32 0.00 49.19 14.02 62.2 ± 41.4
F∗ 81.06 36.96 0.00 44.07 12.74 46.1 ± 40.4
P 97.81 30.26 54.81 71.81 79.45 69.6 ± 40.1
KIL Bp-Bp R 89.52 89.13 74.93 71.92 90.82 62.7 ± 46.7
F∗ 87.76 29.18 46.32 56.09 73.54 51.6 ± 45.9
P 74.36 11.05 28.95 30.35 45.71 63.9 ± 36.1
KIL Bp-Dp R 89.57 90.49 70.69 43.18 15.67 61.8 ± 44.1
F∗ 68.44 10.93 25.85 21.69 13.21 45.6 ± 39.4
P 100.0 99.86 99.98 95.70 99.99 84.0 ± 32.6
IPUMS F-F R 96.33 63.84 76.86 60.58 71.17 84.5 ± 32.7
F∗ 96.33 63.78 76.85 58.98 71.16 81.1 ± 32.0
P 100.0 99.85 99.96 93.68 99.97 80.0 ± 35.1
IPUMS M-M R 95.88 60.17 70.98 57.86 71.19 76.8 ± 38.8
F∗ 95.88 60.11 70.97 55.68 71.17 74.2 ± 37.8
P 89.68 99.60 99.33 77.55 99.96 81.9 ± 32.2
IPUMS C-C R 93.89 58.16 77.16 50.18 90.09 72.2 ± 39.6
F∗ 84.73 58.02 76.76 43.81 90.06 68.1 ± 37.8
P 98.94 71.90 98.89 81.04 99.45 96.8 ± 00.9
DBLP-ACM P-P R 96.49 96.71 96.67 96.44 98.60 97.8 ± 01.6
F∗ 95.50 70.19 95.63 78.68 98.07 94.7 ± 02.2
P 77.89 54.65 69.71 78.54 98.47 88.0 ± 03.3
DBLP-Scholar P-P R 80.10 79.60 80.94 49.41 44.72 87.5 ± 04.0
F∗ 65.26 47.93 59.88 43.53 44.41 78.1 ± 03.9
P 99.99 99.49 99.99 92.97 99.93 99.5 ± 00.3
MSD S-S R 99.26 99.77 95.20 99.79 91.81 98.2 ± 02.2
F∗ 99.24 99.26 95.19 92.79 91.75 97.7 ± 02.4
P 92.4 ± 9.3 67.3 ± 30.9 74.2 ± 33.9 79.6 ± 18.2 84.2 ± 21.5 80.9 ± 11.2
Averages R 92.9 ± 5.1 79.9 ± 14.6 70.9 ± 25.5 64.2 ± 18.7 65.9 ± 31.1 77.7 ± 13.2
F∗ 86.8 ± 11.4 53.5 ± 23.0 60.9 ± 28.5 55.6 ± 18.8 61.5 ± 30.9 69.8 ± 17.8
Best results in each row are shown in bold font.
we have manually curated or inferred ground truth links [45]. For both these datasets, RELATER
obtains both high precision and recall values for the role pair Bp-Bp, whereas for Bp-Dp we can
see a drop in precision and recall. This is to be expected as we have an incomplete (inferred or
biased) set of ground truth links for Bp-Dp [45]. In the IPUMS dataset, the F-F (father to father)
and M-M (mother to mother) role pairs have high precision, recall, and F∗ results, whereas we can
see a small drop for the C-C (children to children) role pair because the set of ground truth links
from IPUMS are more focused on linking couples than children [47]. In the context of resolving
basic entities, RELATER provides high precision and recall results for both DBLP-ACM and MSD.
We can see that the DBLP-Scholar dataset is challenging to resolve because for all the baselines
there is a drop in linkage quality. However, we can see that RELATER outperforms all the other
unsupervised baselines even for the challenging DBLP-Scholar dataset.
The results of the Attr-Sim baseline do not show acceptable linkage quality on any of the
datasets with complex entities. This indicates that traditional pairwise linkage approaches are
insufficient for linking databases with complex entities because these approaches do not address
the challenges associated with resolving complex entities. With respect to the datasets with basic
entities, good linkage quality can be achieved when resolving easy datasets such as MSD even
for the Attr-Sim baseline. For this dataset, the recall and F∗ measure have a slight improvement
when compared with RELATER. However, for more challenging datasets with basic entities, such
as DBLP-Scholar, the Attr-Sim baseline does not provide good results.
Dep-Graph [14] and Rel-Cluster [2] are two unsupervised collective ER baselines. Although they
exploit relationship information to resolve entities, they have poor performance compared to RE-
LATER when resolving complex entities. The Dep-Graph baseline addresses the problems of chang-
ing attribute values and different relationships by propagating attribute values and constraints.
However, as it does not address the problems of disambiguation, partial match groups, or incor-
rect links that RELATER addresses, for some role pairs (such as IOS Bp-Dp) Dep-Graph cannot
identify any true matches. Similarly, the drop in results in the Rel-Cluster baseline is because it
only addresses the disambiguation problem. However, as we can see both of these baselines per-
form better in resolving basic entities, the type of entities these two techniques were developed
for. Dep-Graph achieves slight improvements in recall and F∗ results for the DBLP-ACM dataset
pair. However, for all other datasets, RELATER performs better than Dep-Graph.
ZeroER [50] is a recently proposed unsupervised ER baseline that exploits the bi-modal nature
of ER problems to resolve entities. Based on the observation that the similarity vectors for matches
are different from those of non-matches, ZeroER employs generative models to learn the match and
non-match distributions. Therefore, when the features are well separated in simple basic entities
such as the DBLP-ACM dataset, ZeroER achieves the best results compared to all other baselines
and RELATER. However, when the datasets become challenging (even for basic entities), such as
DBLP-Scholar and MSD, we can see a drop of linkage quality results due to the absence of well
separated features for the match and non-match classes. Interestingly, none of the datasets with
complex entities achieve acceptable linkage quality with ZeroER compared to RELATER, because
ZeroER is unable to address the challenges associated with complex entities such as changing
attribute values and relationships, ambiguity, and the partial match group problem.
The precision, recall, and F∗ values for Magellan [31] are presented as averages with standard de-
viations because we use four different classifiers and two different settings to generate the training
and testing datasets. Since datasets with complex entities have different role pair types, we trained
Magellan in two different settings, where in the first we trained it only on record pairs of the
specific role pair that is being tested, and in the second we trained it on the full dataset. As most of
the datasets with complex entities have incomplete ground truth data, in practical scenarios one
likely will have to train on record pairs of all role pair types, for which Magellan obtains fairly
poor results. However, as expected in the first setting Magellan provides better results compared
to RELATER because Magellan is a supervised learning approach. For simple datasets with basic
entities, such as DBLP-ACM and MSD, we can see that RELATER outperforms Magellan. However,
for challenging datasets such as DBLP-Scholar, Magellan provides better results because it is a
supervised approach.
6.3 Scalability
Table 7 shows the runtimes of RELATER compared with the baselines. Attr-Sim shows the best
runtimes for most of the datasets because it simply links records without considering any rela-
tionships. The next best runtimes alternate between RELATER and Dep-Graph [14]. For datasets
with complex entities RELATER takes more time compared to Dep-Graph because it addresses all
problems specified in Section 3, whereas Dep-Graph addresses only the problems of changing at-
tribute values and different relationships. However, for datasets with basic entities, RELATER runs
faster than Dep-Graph because basic entities do not have most of the challenges complex entities
have. Rel-Cluster has higher runtimes compared to both RELATER and Dep-Graph because of the
Table 8. Runtimes of RELATER for Different Graph Sizes of the BHIC Dataset Generated
for Different Time Periods

Time Period | Number of Nodes | Number of Edges | Generate NA time (s) | Generate NR time (s) | Bootstrap time (s) | Iterative Merging and Entity Graph Generation time (s) | Linkage time (ms) per node | Linkage time (ms) per edge
1900–1935 | 22,928,967 | 41,121,771 | 20,642 | 1,438 | 896 | 23,155 | 1.0 | 0.6
1890–1935 | 42,398,382 | 80,524,946 | 28,881 | 2,172 | 1,685 | 113,143 | 2.7 | 1.4
1880–1935 | 68,739,033 | 134,057,215 | 36,033 | 3,910 | 3,013 | 299,123 | 4.4 | 2.3
1870–1935 | 100,907,697 | 199,588,456 | 39,113 | 6,062 | 5,423 | 660,896 | 6.6 | 3.3

Linkage time is the total of bootstrapping, merging, inferring, and refining using the default settings described in
Section 6.
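As a rough consistency check of the first row of Table 8 (under the simplifying assumption that the linkage time is dominated by the bootstrap and iterative merging columns shown, with inferring and refining contributing the remainder):

```python
# Approximate per-node and per-edge linkage times for the 1900-1935 row.
nodes, edges = 22_928_967, 41_121_771
linkage_s = 896 + 23_155  # bootstrap + iterative merging, in seconds

print(round(linkage_s * 1000 / nodes, 1))  # ~1.0 ms per node
print(round(linkage_s * 1000 / edges, 1))  # ~0.6 ms per edge
```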
iterative clustering method employed. ZeroER shows worse runtimes compared to all other unsu-
pervised baselines because it involves a time consuming feature generation process. The worst
performing baseline is Magellan (these runtimes are averages for the four supervised classifiers
and two different settings we described in Section 6.2) as it consumes much time for training the
supervised classification models. Overall, the runtimes of RELATER are comparatively better than
the other baselines.
Next, we evaluate scalability by comparing the runtimes of RELATER on different sized
datasets. For that, we vary the time periods of records considered for generating the graph with
the BHIC dataset. Table 8 provides an overview of the runtimes for the atomic node and relational
node generation, bootstrapping, iterative merging, and entity graph generation steps of RELATER.
These runtimes indicate that the iterative merging and entity graph generation step accounts for
the largest component of the overall runtime, as it is the most time consuming step and involves
most of the key techniques. We use the total linkage time to measure the scalability of our
framework. Considering the values of linkage time per node and per edge, we can see that our
proposed framework has near linear scalability with both, which indicates that RELATER can
scale to large graphs.
Fig. 8. Precision, Recall and F∗ sensitivity results of RELATER for different ta , tm , γ , and tn values as discussed
in Section 6.4.
We then vary the merging threshold, tm . For lower values of tm we can see that precision drops
for the IOS dataset because many false matches are being linked. However, in the other datasets, we
cannot see a drop in linkage quality results with lower tm values because the blocking technique
we used already provides an initial graph of high precision. We can see the precision, recall, and F∗
results are robust in the [0.8, 0.85] range for all datasets. When we further increase tm , recall drops
because we miss many true matches. Therefore, without a loss of generality, we set tm = 0.85 as
the default value that works well for all datasets.
Next, we discuss the impact of the value of γ, which defines the weight distribution of the
similarity components in Equation (3), on the linkage quality results. When γ is lower, a higher
weight is assigned to the disambiguation similarity simd, and unambiguous record pairs that refer
to different entities can get linked. Therefore, we can see a drop in results at lower γ values. When
we increase γ, recall drops because a higher weight is given to the atomic similarity and
disambiguation is ignored. When we do not disambiguate at all (γ = 1.0), recall drops because
ambiguous record pairs with high similarity get linked, and these links, together with the link
constraints, prevent true matches with lower similarity from getting linked. However, for the
DBLP-ACM dataset, we cannot see a drop in recall when we do not involve the disambiguation
similarity because the attribute values of basic entities are not as ambiguous as those of complex
entities, as we showed in Table 1. We, therefore, set γ = 0.6 as this value provides a good balance
between precision and recall for all datasets.
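Under our reading of Equation (3), the two similarity components are combined as follows (illustrative sketch, using the worked values of Figure 5):

```python
# Sketch of the combined similarity controlled by gamma (Equation (3)):
# gamma weights the atomic similarity against the disambiguation similarity.

def combined_sim(sim_a, sim_d, gamma=0.6):
    return gamma * sim_a + (1 - gamma) * sim_d

# Figure 5 values: sim_a = 0.95, sim_d = 0.12.
print(round(combined_sim(0.95, 0.12), 3))  # 0.618
```

With γ = 1.0 the disambiguation component is ignored entirely, which matches the behaviour discussed above.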
As we show in Figure 8, we can see that RELATER is fairly robust to tn , the threshold for the
minimum number of nodes in a cluster to split by bridges in the range of [10,20] for the IOS Bp-Bp
dataset. If we further increase tn there is a small drop in precision and F∗ because wrong links
are not removed from small clusters for high tn values. This drop is fairly small due to the small
clusters generated for this dataset. For datasets with larger cluster sizes the effect of tn will be
more significant. In other datasets, we cannot see any effect of tn because the clusters have at
most two records, since we link only two datasets. We set tn = 15 as this value works well with
all datasets.
Table 9. Ablation Analysis for RELATER that Shows How Each Key Component
in the Framework Affects Linkage Quality
Dataset Role Pair RELATER without PROP-A and PROP-C without AMB without REL without REF
P 98.73 86.79 99.22 99.88 98.02
IOS Bp-Bp R 94.70 95.20 93.56 61.58 94.87
F∗ 93.56 83.15 92.89 61.53 93.08
P 86.44 72.56 89.72 0.00 85.28
IOS Bp-Dp R 92.87 93.24 88.62 0.00 93.14
F∗ 81.06 68.93 80.45 0.00 80.24
P 100.0 92.75 100.0 0.00 100.0
IPUMS F-F R 96.33 96.33 87.16 0.00 96.33
F∗ 96.33 89.59 87.16 0.00 96.33
P 100.0 92.19 100.0 0.00 100.0
IPUMS M-M R 95.88 95.89 86.77 0.00 95.88
F∗ 95.88 88.69 86.77 0.00 95.88
P 89.68 47.96 90.75 0.00 89.68
IPUMS C-C R 93.89 93.93 86.71 0.00 93.89
F∗ 84.73 46.52 79.67 0.00 84.73
Best results in each row are shown in bold font.
An important aspect to notice in the ablation study is that whenever we remove a key technique
from RELATER, the overall results always drop. In some scenarios precision increases at the
expense of recall and F∗, while in others recall increases at the expense of precision and F∗.
However, in none of the scenarios are the overall linkage quality results (as indicated by F∗)
higher than those of RELATER with all its key techniques included.
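The F∗ values in Table 9 follow directly from precision P and recall R: since F∗ = TP/(TP + FP + FN) [24], it can be rewritten as F∗ = P·R / (P + R − P·R). A minimal sketch (the function name is ours) that reproduces, for example, the RELATER entry for IOS Bp-Bp:

```python
def f_star(p, r):
    """F* = (P*R) / (P + R - P*R), with P and R as fractions in [0, 1].
    Equivalent to TP / (TP + FP + FN); returns 0 when P = R = 0."""
    denom = p + r - p * r
    return (p * r / denom) if denom > 0 else 0.0

# IOS Bp-Bp, RELATER column of Table 9: P = 98.73, R = 94.70
print(round(100 * f_star(0.9873, 0.9470), 2))  # 93.56, as in the table
```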
7 RELATED WORK
Various ER approaches have been developed since the 1950s to link records in databases [7, 21,
26, 41]. Most recent ER methods are based on supervised learning and deep learning approaches,
and the majority of recent works that aim at overcoming the lack of ground truth data are using
active learning or transfer learning. We now describe ER approaches related to ours that exploit
the relationships among entities to achieve high linkage quality.
On et al. [40] used relationship information for group linkage based on weighted bipartite match-
ing. However, because groups are linked independently of each other, this approach does not
propagate relationship information. Fu et al. [19] pioneered the use of group linkage for person
records by linking individuals within households in census data (only considering relationships
within households), thereby substantially reducing the number of ambiguous links.
In contrast to pairwise classification based approaches, graph-based collective ER approaches
provide more accurate results by exploiting relational information [21]. Kalashnikov et al. [28]
proposed an approach for reference disambiguation based on random walks, aiming to identify
the entity to which each record refers. Bhattacharya and Getoor [2] also used relational
information between different types of entities by employing an iterative cluster merging process using a
relationship graph. However, these approaches focused on basic entities that have static attribute
values and static relationships, and they have mostly been evaluated on bibliographic data. In our
work, we address the problems associated with complex entities which have changing attribute
values and diverse relationships at different points in time. Kouki et al. [33] proposed a collective
ER approach for building familial networks based on probabilistic soft logic. Although the predi-
cates in their probabilistic soft logic capture relationships, they do not capture diverse relationships
encountered at different points in time. Similarly, they do not capture attribute values that change
over time.
Dong et al. [14] proposed a dependency graph-based approach to propagate link decisions
among multiple types of entities through the linkage process. We consider their approach as a
baseline (named Dep-Graph) because they also propagate link decisions to capture changing at-
tribute values and apply constraints. However, as we showed in the experimental evaluation, their
approach is not successful in addressing the problems associated with complex entities, such as
the disambiguation problem, the partial match group problem, or the incorrect link problem.
The ambiguity of attribute values in ER has been discussed since the development of probabilis-
tic record linkage by Fellegi and Sunter [16] in 1969. Li et al. [36] discussed the problem of am-
biguity in entities that occur in unstructured textual documents. In their approach, Kalashnikov
et al. [28] employed relationship analysis to enhance feature-based similarities between ambiguous
reference entity choices. This approach is applicable when the set of entities are known prior to
linking, and the task is to match records to entities. In our context, however, the set of entities is ini-
tially unknown. The approach developed by Bhattacharya and Getoor [2] incorporated ambiguity
in neighbours into the calculation of relational similarities. As this is the closest approach to ours,
we consider this as a baseline (named Rel-Cluster). However, as our experiments indicate, this
approach does not provide good linkage results because, besides the disambiguation problem, it
does not consider the other challenges our framework addresses.
Recently, efforts have been made to identify ER errors using graph theory measures [12, 44].
However, none of them has been proposed in the context of collective ER. We utilise simple graph
theory measures, such as bridges and density [44], in our RELATER framework. The clusters of
records we obtain are small and, therefore, more sophisticated “repair” operations such as those
proposed by Croset et al. [12] cannot be applied.
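The density measure mentioned above has a simple closed form: for an undirected cluster with n records and m links, density is m divided by the n(n−1)/2 possible links, and a low value suggests records held together by few, possibly wrong, links. The snippet below is a generic definition of this measure, not RELATER-specific code:

```python
def cluster_density(num_nodes, num_edges):
    """Density of an undirected cluster: actual links over possible links,
    i.e., 2m / (n(n-1)). Trivial clusters (n < 2) are treated as dense."""
    if num_nodes < 2:
        return 1.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))

# A chain of 4 records (3 links) is sparse; a clique of 4 (6 links) is dense:
print(cluster_density(4, 3))  # 0.5
print(cluster_density(4, 6))  # 1.0
```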
In the context of ER, several approaches have been proposed to incorporate temporal constraints
into the linkage process. Li et al. [35] and Chiang et al. [6] were the first to use temporal information
for improved supervised classification of record pairs in the bibliographic domain. More
recent work on ER for person records provides strong evidence for the improvement of linkage
quality when temporal constraints are applied [11, 38]. Although they address the problem of
changing attribute values over time, none of these temporal record linkage solutions [5, 9, 27, 34]
address the challenges associated with complex entities, such as that different relationships can
occur at different points in time, or the disambiguation or partial match group problems.
A growing body of research has studied supervised techniques in the context of ER. Magellan is
one such framework that supports an end-to-end ER pipeline with supervised techniques [31]. Re-
cently, deep learning techniques have also been proposed that provide good linkage results [37, 39].
However, as we discussed in the experiments in Section 6, databases with complex entities
generally suffer from a lack of ground truth links, which makes it challenging, if not impossible,
to use supervised techniques. Similarly, semi-supervised ER techniques, including active learning
approaches [29, 43] that query external sources to resolve challenging training cases, as well as
crowd-based approaches [1, 22, 49] that employ hybrid machine and human-based systems for
resolving entities, are challenging with person data due to privacy and confidentiality issues [10].
Furthermore, as transfer learning ER approaches such as [53] use pre-trained models, it is
questionable how to incorporate temporal and relational aspects into the linkage process.
Recent advances in the ER literature have shifted unsupervised approaches towards
self-supervised learning methods. One recently proposed state-of-the-art unsupervised ER
approach is ZeroER [50], which employs generative modelling to learn match and non-match
distributions to resolve entities. However, as we show in the experimental evaluation, ZeroER performs
well only when the features representing similarities are well separated, such as in datasets that
contain basic entities. When the datasets have complex entities, ZeroER does not perform well as it
is unable to distinguish match and non-match distributions due to the challenges associated with
complex entities.
The problem of graph-based ER is related to the graph alignment or graph matching prob-
lem [25, 51, 52], where the aim is to identify nodes that correspond to the same entity in two
graphs. Similarly, ER is also related to link mining [20], which is a research area that focuses on
classification, clustering, prediction, and modelling of links in graphs. However, these techniques
are not suitable for resolving complex entities because they do not address the challenges in resolv-
ing complex entities, including the propagation of link decisions or the disambiguation problem.
REFERENCES
[1] Asma Abboura, Soror Sahri, Mourad Ouziri, and Salima Benbernou. 2015. CrowdMD: Crowdsourcing-based ap-
proach for deduplication. In Proceedings of the International Conference on Big Data. IEEE, 2621–2627.
[2] Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. Transactions on Knowledge
Discovery from Data 1, 1 (2007), 5–es.
[3] Gerrit Bloothooft, Peter Christen, Kees Mandemakers, and Marijn Schraagen. 2015. Population Reconstruction.
Springer, Cham.
[4] Brabant Historical Information Center. 2021. Genealogie. Retrieved June 29, 2021 from https://opendata.picturae.com/
organization/bhic.
[5] Yueh-Hsuan Chiang, AnHai Doan, and Jeffrey F. Naughton. 2014. Modeling entity evolution for temporal record
matching. In Proceedings of the SIGMOD International Conference on Management of Data. ACM, 1175–1186.
[6] Yueh-Hsuan Chiang, AnHai Doan, and Jeffrey F. Naughton. 2014. Tracking entities in the dynamic world: A fast
algorithm for matching temporal records. VLDB Endowment 7, 6 (2014), 469–480.
[7] Peter Christen. 2012. Data Matching—Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate
Detection. Springer.
[8] Peter Christen. 2016. Application of advanced record linkage techniques for complex population reconstruction.
arXiv:1612.04286. Retrieved from https://arxiv.org/abs/1612.04286.
[9] Peter Christen and Ross W. Gayler. 2013. Adaptive temporal entity resolution on dynamic databases. In Proceedings
of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 558–569.
[10] Peter Christen, Thilina Ranbaduge, and Rainer Schnell. 2020. Linking Sensitive Data. Springer.
[11] Victor Christen, Anika Groß, Jeffrey Fisher, Qing Wang, Peter Christen, and Erhard Rahm. 2017. Temporal group
linkage and evolution analysis for census data. In Proceedings of the International Conference on Extending Database
Technology. 620–631.
[12] Samuel Croset, Joachim Rupp, and Martin Romacker. 2015. Flexible data integration and curation using a graph-based
approach. Bioinformatics 32, 6 (2015), 918–925.
[13] Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen.
2021. The Magellan Data Repository. Retrieved May 05, 2021 from https://sites.google.com/site/anhaidgroup/useful-
stuff/data.
[14] Xin Luna Dong, Alon Halevy, and Jayant Madhavan. 2005. Reference reconciliation in complex information spaces.
In Proceedings of the SIGMOD International Conference on Management of Data. ACM, 85–96.
[15] Xin Luna Dong and Divesh Srivastava. 2015. Big Data Integration. Morgan and Claypool Publishers.
[16] Ivan P. Fellegi and Alan B. Sunter. 1969. A theory for record linkage. Journal of the American Statistical Association
64, 328 (1969), 1183–1210.
[17] Tyler Folkman, Rey Furner, and Drew Pearson. 2018. GenERes: A genealogical entity resolution system. In Proceed-
ings of the International Conference on Data Mining Workshops (ICDMW’18). IEEE, 495–501.
[18] Zhichun Fu, Peter Christen, and Jun Zhou. 2014. A graph matching method for historical census household linkage.
In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 485–496.
[19] Zhichun Fu, Jun Zhou, Peter Christen, and Mac Boot. 2012. Multiple instance learning for group record linkage. In
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 171–182.
[20] Lise Getoor and Christopher P. Diehl. 2005. Link mining: A survey. SIGKDD Explorations 7, 2 (2005), 3–12.
[21] Lise Getoor and Ashwin Machanavajjhala. 2013. Entity resolution for big data. In Proceedings of the SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining. ACM, 1527–1527.
[22] Yash Govind, Erik Paulson, Palaniappan Nagarajan, Paul Suganthan G. C., AnHai Doan, Youngchoon Park, Glenn M.
Fung, Devin Conathan, Marshall Carter, and Mingju Sun. 2018. Cloudmatcher: A hands-off cloud/crowd service for
entity matching. VLDB Endowment 11, 12 (2018), 2042–2045.
[23] David J. Hand and Peter Christen. 2018. A note on using the f-measure for evaluating record linkage algorithms.
Statistics and Computing 28, 3 (2018), 539–547.
[24] David J. Hand, Peter Christen, and Nishadi Kirielle. 2021. F*: An interpretable transformation of the f-measure. Ma-
chine Learning 110, 3 (2021), 451–456.
[25] Mark Heimann, Haoming Shen, Tara Safavi, and Danai Koutra. 2018. Regal: Representation learning-based graph
alignment. In Proceedings of the International Conference on Information and Knowledge Management. ACM, 117–126.
[26] Thomas Herzog, Fritz Scheuren, and William Winkler. 2007. Data Quality and Record Linkage Techniques. Springer.
[27] Yichen Hu, Qing Wang, Dinusha Vatsalan, and Peter Christen. 2017. Improving temporal record linkage using regres-
sion classification. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer,
561–573.
[28] Dmitri V. Kalashnikov and Sharad Mehrotra. 2006. Domain-independent data cleaning via analysis of entity-
relationship graph. Transactions on Database Systems 31, 2 (2006), 716–767.
[29] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution
with transfer and active learning. In Proceedings of the Association for Computational Linguistics. ACL, 5851–5861.
[30] Nishadi Kirielle, Peter Christen, and Thilina Ranbaduge. 2019. Outlier detection based accurate geocoding of histor-
ical addresses. In Proceedings of the Australasian Conference on Data Mining. Springer, 41–53.
[31] Pradap Konda, Sanjib Das, Paul Suganthan G.C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah
Panahi, Haojun Zhang, Jeff Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016.
Magellan: Toward building entity matching management systems. VLDB Endowment 9, 12 (2016), 1197–1208.
[32] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world
match problems. VLDB Endowment 3, 1–2 (2010), 484–493.
[33] Pigi Kouki, Jay Pujara, Christopher Marcum, Laura Koehly, and Lise Getoor. 2019. Collective entity resolution in
multi-relational familial networks. Knowledge and Information Systems 61, 3 (2019), 1547–1581.
[34] Furong Li, Mong Li Lee, Wynne Hsu, and Wang-Chiew Tan. 2015. Linking temporal records for profiling entities. In
Proceedings of the SIGMOD International Conference on Management of Data. ACM, 593–605.
[35] Pei Li, Xin Luna Dong, Andrea Maurino, and Divesh Srivastava. 2011. Linking temporal records. VLDB Endowment
4, 11 (2011), 956–967.
[36] Xin Li, Paul Morie, and Dan Roth. 2005. Semantic integration in text: From ambiguous names to identifiable entities.
AI Magazine 26, 1 (2005), 45–45.
[37] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep,
Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In
Proceedings of the SIGMOD International Conference on Management of Data. ACM, 19–34.
[38] Charini Nanayakkara, Peter Christen, and Thilina Ranbaduge. 2019. Robust temporal graph clustering for group
record linkage. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer,
526–538.
[39] Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. 2019. Deep sequence-
to-sequence entity matching for heterogeneous entity resolution. In Proceedings of the International Conference on
Information and Knowledge Management. ACM, 629–638.
[40] Byung-Won On, Nick Koudas, Dongwon Lee, and Divesh Srivastava. 2007. Group linkage. In Proceedings of the International
Conference on Data Engineering. IEEE, 496–505.
[41] George Papadakis, Ekaterini Ioannou, Emanouil Thanos, and Themis Palpanas. 2021. The Four Generations of Entity
Resolution. Morgan and Claypool Publishers.
[42] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and filtering tech-
niques for entity resolution: A survey. Computing Surveys 53, 2 (2020), 1–42.
[43] Kun Qian, Lucian Popa, and Prithviraj Sen. 2017. Active learning for large-scale entity resolution. In Proceedings of
the Conference on Information and Knowledge Management. ACM, 1379–1388.
[44] Sean M. Randall, James H. Boyd, Anna M. Ferrante, Jacqueline K. Bauer, and James B. Semmens. 2014. Use of graph
theory measures to identify errors in record linkage. Computer Methods and Programs in Biomedicine 115, 2 (2014),
55–63.
[45] Alice Reid, Ros Davies, and Eilidh Garrett. 2002. Nineteenth-century scottish demography from linked censuses and
civil registers: A ‘sets of related individuals’ approach. History and Computing 14, 1–2 (2002), 61–86.
[46] Stephen Robertson. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of
Documentation 60, 5 (2004), 503–520.
[47] Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler, and Matthew Sobek.
2021. IPUMS USA: Version 11.0 [dataset]. DOI: https://doi.org/10.18128/D010.V11.0
[48] Laura Spinney. 2017. Pale Rider: The Spanish Flu of 1918 and How it Changed the World. PublicAffairs, New York.
[49] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2012. CrowdER: Crowdsourcing entity resolution.
VLDB Endowment 5, 11 (2012), 1483–1494.
[50] Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, and Saravanan Thirumuruganathan. 2020. ZeroER: Entity res-
olution using zero labeled examples. In Proceedings of the SIGMOD International Conference on Management of Data.
ACM, 1149–1164.
[51] Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, and
Kuansan Wang. 2019. OAG: Toward linking large-scale heterogeneous entity graphs. In Proceedings of the SIGKDD
International Conference on Knowledge Discovery and Data Mining. ACM, 2585–2595.
[52] Jing Zhang, Bo Chen, Xianming Wang, Hong Chen, Cuiping Li, Fengmei Jin, Guojie Song, and Yutao Zhang. 2018.
MEgo2Vec: Embedding matched ego networks for user alignment across social networks. In Proceedings of the Inter-
national Conference on Information and Knowledge Management. ACM, 327–336.
[53] Chen Zhao and Yeye He. 2019. Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and
transfer learning. In Proceedings of the World Wide Web Conference. ACM, 2413–2424.