
Data Profiling in Property Graph Databases

SOFÍA MAIOLO, LORENA ETCHEVERRY, and ADRIANA MAROTTA,


Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Uruguay

Property Graph databases are being increasingly used within industry as a powerful and flexible way to
model real-world scenarios. With this flexibility comes a great challenge regarding profiling tasks, which
must be adapted to these new models while taking advantage of the Property Graphs' particularities.
This article proposes a set of data profiling tasks, integrating existing methods and techniques, and a
taxonomy to classify them. In addition, an application pipeline is provided, and a formal specification of
some of the tasks is defined.
CCS Concepts: • Information systems → Data cleaning; Graph-based database models;
Additional Key Words and Phrases: Property Graph, data profiling
ACM Reference format:
Sofía Maiolo, Lorena Etcheverry, and Adriana Marotta. 2020. Data Profiling in Property Graph Databases. J.
Data and Information Quality 12, 4, Article 20 (October 2020), 27 pages.
https://doi.org/10.1145/3409473
1 INTRODUCTION
The popularity of graph data management systems has risen dramatically over the past few years.
Graphs are a powerful, yet intuitive, abstraction to represent complex real-world processes and
provide a more flexible alternative to the relational model. Nowadays, many application domains
where relationships among concepts are first-class citizens use graph data management systems
to collect, query, and store data [9, 27].
Graph databases use different models for representing graphs. The Property Graph (PG) data
model, which is currently quite popular, is a directed graph with labels on nodes and edges, as well as
(property, value) pairs associated with both. Cypher is one of the most popular query languages
nowadays: it is part of the Neo4j graph database and is also used by other commercial database
products and by researchers [15].
However, there is a tradeoff between the high flexibility and power of graph databases and their data
management costs. In particular, the absence of a rigid schema with established constraints and
restrictions makes profiling tasks essential to retrieve graph metadata. These metadata can then be
used in different tasks that range from learning the graph content to improving query performance
or assisting in data integration tasks [1, 25].

All authors contributed equally to this research.


Authors’ address: S. Maiolo, L. Etcheverry, and A. Marotta, Instituto de Computación, Facultad de Ingeniería, Universidad
de la República, Montevideo, Uruguay; emails: {sofia.maiolo, lorenae, amarotta}@fing.edu.uy.


Data Profiling (DP) comprises a widely used range of methods and techniques designed to
analyze datasets efficiently [25, 35]. In most scenarios, DP consists of a series of tasks that are
executed over a data source to gather statistics, metadata, and descriptive information about the
underlying data.
Typically, DP tasks are performed as preparation steps for further tasks, such as query optimization,
data cleansing, data integration, and data analytics. Furthermore, such activities apply
not only to relational and tabular data. In big data scenarios, data in different data models coexist,
and DP across these models represents a significant challenge. The design of DP tasks should
consider the specific features and possibilities of each data model, taking advantage of them to gather
useful metadata.
In the case of graph databases, existing techniques and methods from different research areas can
be applied. Besides adapting classic DP tasks from the relational database world, it is also possible
to integrate analysis tools from graph theory and network science. In this work, we collect, organize,
and implement a set of methods and techniques to perform DP in Property Graph databases.
The main contributions of this work are the following:
• a compilation of DP tasks for Property Graph databases, integrating a wide set of existing
techniques and methods
• a taxonomy of DP tasks over graph databases, in particular for Property Graph databases,
that we use to organise our proposal
• a formal specification, based on the Property Graph model, of a subset of the proposed tasks
• a data profiling pipeline to guide the application of the proposed tasks
• an application example of the proposed tasks
To the best of our knowledge, no previous work has compiled, categorized, and formalized
data profiling tasks for PGs. In this work, we make an effort toward a
specification of an organized and wide set of data profiling tasks for this general graph data model.
This document is organized as follows. Section 2 presents the related work in the area, while
Section 3 focuses on the proposed set of data profiling tasks and techniques for PGs, a taxonomy to
organize them, and our application example; Section 3.3 presents a pipeline toward the practical
application of the tasks. In Section 4, the formal specification of a subset of the presented tasks is
established, while in Section 5 we present an implementation example in Cypher.
Finally, Section 6 presents conclusions and future work.

2 RELATED WORK
Data Profiling is a very broad concept, since there exist many different approaches and strategies
for addressing the problem of analyzing the main characteristics of a dataset. Each approach
focuses on different particular aspects of the dataset, and approaches also vary greatly according to
the data model. Therefore, DP may consist of very simple tasks, such as value counting or
null detection, or very complex ones, such as finding functional dependencies or extracting data
structure summaries.
In the following, we first present the related work organized according to the data model, then
we comment on some works for huge graphs, which apply summarization, sampling, and extraction
strategies, and, finally, we mention some existing tools for DP.
Relational Model. In Reference [1], the authors propose a classification that includes single-column
tasks, multi-column tasks, and dependency detection. Single-column profiling is the study
of the values in a single column and may vary from simple analyses, such as counts, to the discovery
of patterns in the data values. Multi-column profiling extends the activities that can be applied
to a single column to an analysis across different columns, in order to find
association rules and correlations. The dependencies category refers to tasks describing relationships
among the columns of a single table, such as keys and functional dependencies, and relationships
across multiple tables, such as foreign keys and inclusion dependencies. Figure 1 shows the task
categorization proposed in Reference [1].

Fig. 1. A classification of traditional data profiling tasks. From Reference [1].
RDF. DP has been thoroughly studied for RDF datasets. In Reference [4], the authors provide a
comprehensive survey of the RDF dataset profile features, methods, tools, and vocabularies. They
also propose a feature taxonomy that provides a categorisation system for all those features
identified by the survey. The taxonomy not only organizes the dataset profile features in categories
(general, qualitative, provenance, statistical, and dynamics) but also provides feature extracting
systems for each of them. These extracting systems comprise a broad range of tools and approaches
for assessing and extracting such features from RDF datasets.
The survey presented in Reference [4] was specifically conducted for RDF datasets, with their own
characteristics and challenges. Some of the identified features may apply to PG databases, but
most of them take advantage of particular aspects of the RDF representation, for example,
ontology completeness or the stability of URIs. In addition, most of the presented extraction
systems cannot be used for PGs.
In Reference [16], the authors study the completeness of RDF databases by proposing a systematic
analysis of completeness oracles (binary assertions on entities and relationships). They also
show how completeness assertions can be learned through an automatic mining system that can
be trained through crowd-sourcing.
Regarding the study of functional dependencies, in Reference [6] the problem of transferring
functional dependencies from relational data to RDF data is investigated. In addition, in Reference
[21], the authors show that SPARQL, the most popular query language for RDF, can be used as a
constraint language to express functional dependencies.
In Reference [36], the problem of detecting abnormal data and outliers in RDF graphs is studied
by introducing value-clustered graph functional dependencies.


Other kinds of dependencies, such as inclusion dependencies, have been researched, too. In Reference
[20], conditional inclusion dependency discovery was applied to RDF datasets. An inclusion
dependency describes the inclusion of a set of values or elements in another set.
Linked Open Data. While the work in Reference [4] focuses on Data Profiling in RDF, in Reference
[5] a tool called ProLOD is presented, with the goal of profiling Linked Open Data. ProLOD
is a web-based tool containing a suite of methods ranging from clustering and labeling to data type
and pattern detection and value distribution analysis.
Data Quality in Linked Open Data has also been exhaustively studied. For example, in Reference
[38], a comprehensive list of the dimensions and metrics related to data quality is presented. In
addition, in Reference [10], a framework called Luzzu is described, with the aim of being a scalable,
extensible, interoperable, and customisable approach for assessing Data Quality in Linked Open Data.
Luzzu is accompanied by a set of ontologies for capturing quality-related information.
Moreover, in Reference [11], the authors propose an ontology, called Dataset Quality Ontology
(daQ), as an extensible vocabulary for attaching the results of quality benchmarking of a linked
open dataset to that dataset.
Graph models. Regarding the study of DP in graphs, there is a wide set of papers analyzing
different methods and techniques that can be applied to create metadata about a dataset. Most of
those works focus on solving the computational and algorithmic challenges of working with big
graph databases.
In Reference [2], a brief analysis of DP techniques for non-relational data such as XML, RDF, and
graphs is presented. For graphs, they propose different statistical measures to understand the
dataset structure and mention advanced techniques to gather further information.
In Reference [18], a large-scale graph processing engine is proposed (and implemented on an RDF
graph) to improve the quality of shortest path estimations, while in Reference [19] the authors
consider the problems of pattern matching for RDF, Property Graphs, and the relational model.
Another important problem when working with graph databases, especially in social network
analysis, is the discovery of influential nodes within the graph. In Reference [29], the authors
use graph entropy to determine the most prominent people in an organisation by considering the
email interactions among the members of the company. In that scenario, important nodes are those
that have the greatest effect on the graph entropy when they are removed from the graph. In the field
of social network analysis, another major challenge is to infer which new interactions among its
members are likely to occur in the near future. In Reference [24], the authors survey and test a
wide set of methods adapted from graph theory and computer and social sciences.
Discovering functional dependencies has been widely examined for graphs. In Reference [13],
key discovery in graphs was studied in depth. The authors propose a class of keys that are recursively
defined in terms of graph patterns and are interpreted with sub-graph isomorphism. They
also determine that entity matching with keys is NP-complete and experimentally verify the
effectiveness of the proposed parallel scalable algorithms. In addition, in Reference [32], the entity
matching problem, which is to find all pairs of entities in a graph that are identified by a given set
of keys, was studied.
In Reference [30], the authors study the mining of conditional keys in large databases. A conditional
key is an axiom stating that, under particular conditions, no two distinct entities can have the same
values on a specific set of properties. Moreover, an algorithm that discovers conditional keys
efficiently, combining key mining techniques with techniques from rule mining, is presented.
Graph Functional Dependency (GFD) notions are introduced in Reference [14], where the authors
formalize the discovery problem for GFDs. They also develop algorithms for discovering
GFDs and test them on real-life data to verify the algorithms' effectiveness and scalability.


In Reference [12], the author presents an overview of the problem of graph functional dependencies
and how it has been addressed so far; he also identifies existing open problems and proposes new
research lines and future work.
Big datasets. When working with huge datasets, summarization and sketches are essential to
manage the huge computational cost of some algorithms. In Reference [7], a comprehensive
survey of summarization methods for semantic RDF graphs is provided. The authors
delimit the scope of an RDF summary as compact information extracted from the original RDF
graph, or a graph that some applications can exploit instead of the original RDF graph. In addition, a
classification of RDF summarization methods is proposed, according to the main algorithmic idea
behind each method. The focus of the article is on RDF datasets, but the authors also
provide an overview of generic (non-RDF) graph summarization approaches.
Furthermore, summaries can be a vital tool when performing structure or schema extraction.
In Reference [17], structure extraction is performed using a summarization of the original
graph, so that the computational complexity is reduced and the user can dynamically interact
with the algorithm and thus train it. The method, called SEuS, proposes a three-phase process.
During the first phase, the given dataset is preprocessed and a summary of the original graph is
generated. In the second phase, the method interacts with a human expert to search
for frequent structures. Using a summary is key, because it enables a real-time interaction with the
user that is not possible when working with the original graph. Finally, an accurate count of the
number of occurrences of each potential structure is produced. Schema extraction by summarizing
labeled directed graphs was also studied in Reference [34], where an approximation approach
based on an incremental clustering method is proposed.
A sampling method is used to efficiently estimate graph properties by consulting a sample
of the whole population [3]. In Reference [22], the authors provide a systematic evaluation of sampling
algorithms, answering what is a good sampling method, what is a good sample size,
and how to measure the goodness of a single sample and of the method itself. Moreover,
in Reference [3], a generic stream sampling framework for big-graph analytics, called Graph
Sample and Hold (gSH), is presented. This framework samples from massive graphs sequentially,
in a single pass, one edge at a time, while maintaining a small state in memory.
DP tools. Another important area regarding data profiling is the automation of the profiling pro-
cess. When working with relational databases or tabular data, there are many tools, with different
complexities and features, that provide automatic profiling. For example, Talend Open Studio
for Data Quality [31] and Data Cleaner [8] are very popular tools that provide a wide spectrum of
profiling tasks.
In the field of network analysis, in Reference [23], the authors propose a general-purpose,
high-performance system, SNAP, that provides high-level operations for the analysis and manipulation of
large networks. SNAP provides implementations of common traditional algorithms for graph and
network analysis, such as community detection, important node discovery, triangle detection,
and so on.
In the context of automatic profiling of Property Graph databases, in particular when working
with Neo4j, the Neo4j Database Analyzer is an automatic tool designed to provide a quick
understanding of the data structures in a Neo4j database [33]. In addition, yFiles Neo4j Explorer is a free
online tool to visually explore Neo4j databases [37]. These tools are quite simple, but they provide
a quick first approximation to the graph structure.
Table 1 compares the presented proposals for DP in graph models, according to the tasks they
include. The proposals are ordered according to the considered data model (General Graphs (GG),
Labeled directed graph (LDG), Linked Open Data (LOD), Object Exchange Model (OEM), and RDF).


Table 1. Comparison of DP Proposals According to Data Model and DP Tasks

The DP tasks compared in the table are: completeness; key discovery; GFD discovery; other functional
dependencies; patterns, data types & domain; value distribution; important nodes; triangle detection;
shortest path; schema extraction; summarization & sampling; and clustering and outliers. The compared
proposals, ordered by data model, are: Fan et al. [13] (GG); Fan et al. [14] (GG); Ahmed et al. [3] (GG);
Leskovec and Faloutsos [22] (GG); Shetty and Adibi [29] (GG); Tian [32] (GG); Fan [12] (GG); Kruse et
al. [20] (GG); Robinson et al. [28] (GG); Symeonidou et al. [30] (GG); Ghazizadeh and Chawathe [17]
(LDG); Böhm et al. [5] (LOD); Debattista et al. [11] (LOD); Debattista et al. [10] (LOD); Zaveri et al. [38]
(LOD); Wang et al. [34] (OEM); Galárraga et al. [16] (RDF); Gubichev and Then [19] (RDF); Yu and
Heflin [36] (RDF); and Gubichev et al. [18] (RDF).

We can observe that, in the studied set of proposals, none of the GG proposals include data
pattern or data type detection, value distribution, schema extraction, clustering and outlier detection,
or completeness evaluation. Besides, only one of them includes triangle and shortest-path detection,
and only one includes important node detection. Regarding the LOD and OEM proposals, none of
them include triangle detection, shortest paths, important nodes, summarization and sampling tasks,
clustering and outliers, completeness, or any kind of dependency treatment. Regarding the RDF
proposals, none of them include data pattern or data type detection, value distribution, important
node detection, summarization and sampling, schema extraction, or any kind of dependency treatment.
From this table, we can conclude that most of the proposals address very few kinds of tasks, focusing
on only two or three of them. Moreover, many tasks are not addressed at all for certain data models.
In addition, from this literature review we deduce that, although much work has been done
regarding DP tasks over graph databases, none of them focuses specifically on the PG model and
its challenges.

Fig. 2. Application example: Women’s World Cup graph.

Our work addresses the problem of DP in PGs, proposing a wide set of DP tasks organized in a
taxonomy. It intends to cover the most popular DP tasks, considering and using as a reference
the existing works for other data models.

3 A TAXONOMY TO ORGANISE DATA PROFILING TASKS


As presented in Section 2, there is a wide range of methods and approaches that can be
used when profiling graphs. These techniques vary from simple statistical analyses applied
to a single property to complex algorithms meant to be executed over the whole graph. Understanding the
scope of those tasks is vital to design an appropriate DP process.
To facilitate that analysis, in this section we present a taxonomy to organize and classify DP tasks
over PGs. The taxonomy is inspired by the classification proposed by Naumann et al. presented
in Section 2, and it is defined considering the specific characteristics of graph databases. In
addition, we propose several DP tasks for each of the categories within the taxonomy. These tasks
are a compilation of various methods and techniques originally designed for other graph models
that can be adapted to PGs. We first introduce the running example used to
illustrate DP tasks whenever possible.

3.1 Running Application Example


Our example is based on the Women's World Cup dataset published by Neo4j [26]. The entities in
this dataset are Persons, Matches, Tournaments, Teams, and Squads. As usual in Property Graphs,
nodes represent entities while edges represent relationships. Since in Property Graphs the notion
of schema and strong typing is absent, it is common to use node labels and edge types to represent
entity and relationship types. In this dataset, the node labels are Person, Match,
Tournament, Team, and Squad, while the relationship types are PLAYED_IN, SCORED_GOAL, COACH_FOR,
IN_SQUAD, NAMED, REPRESENTS, FOR, PARTICIPATED_IN, and IN_TOURNAMENT.
The sample graph consists of eight nodes, identified with numeric ids from 101 to 108. Figure 2
depicts our running example, where triangles represent nodes labelled Tournament, circles nodes
labelled Person, squares nodes labelled Match, pentagons nodes labelled Squad, and hexagons
nodes labelled Team.

Fig. 3. Our proposal of a taxonomy to classify DP tasks over PG datasets.
Also, there are 13 edges, identified with numeric ids from 201 to 213. Figure 2 presents each
edge identifier with its relationship type; for example, edges 201 and 210 are relationships of type
PLAYED_IN.
In addition, nodes and edges may have properties containing further information about the entity
or the relationship they represent. The node with id 101 has two properties: name and phone,
while the edge with id 213 has just one property: score.

3.2 Proposed Taxonomy


Figure 3 shows our proposed taxonomy, which organizes DP tasks into three categories: (i) tasks
that involve a single property, either in nodes or relationships; (ii) tasks that involve graph patterns;
and (iii) tasks that deal with graph dependencies.
Table 2 includes some of the examples presented by Naumann et al. [1], together with analogous
tasks in the context of PG databases. As can be intuitively visualised in the examples, referring
to a column in a relational database is analogous to working with a certain property in a
graph node or relationship. Moreover, working with multiple columns is analogous to considering
a graph pattern.
Table 3 shows the correspondence between the categories proposed in our taxonomy and
the ones proposed by Naumann et al. For this comparison, we consider our application example
and a relational database consisting of the tables Person(id, name, phone) and
Match(id, date, stage), and a table PLAYED_IN representing a relationship between the tuples
of tables Person and Match. In the following, we describe these categories in detail.
The first category corresponds to the analysis of an individual property with a certain label,
throughout nodes or relationships. This category is similar to the Single column category presented
in Reference [1], as focusing the analysis on a given property of a specific node or relationship
can be considered analogous to analysing a column of a given table in a relational database. The
metadata gathered in this category can vary from the uniqueness of the values of a property to more
complex metadata, such as histograms.


Table 2. Examples of DP Tasks in Relational and PG Databases

• uniqueness. Relational database example [1]: how many "cities" appear in the column "City" in table "Address". PG example: how many "cities" appear in the property "city" of nodes labelled "Address".
• data completeness. Relational: disguised missing values in the column "State" in table "Address". PG: disguised missing values in the property "state" of nodes labelled "Address".
• patterns. Relational: find a pattern for the column "Phone" in table "Customer". PG: find a pattern for the property "phone" in nodes labelled "Customer".
• semantic domain. Relational: label a column as "phone number" in table "Customer". PG: label a property as "phone number" in nodes labelled "Customer".
• outlier detection. Relational: find single-character values in the column "First Name" in table "Customer". PG: find single-character values in the property "first name" of nodes labelled "Customer".
• cardinality. Relational: number of rows in table "Customer". PG: number of nodes labelled "Customer".
• associations and correlations. Relational: correlation between the columns "Position" and "Allowance" in table "Employees". PG: correlation between nodes labelled "Employee", nodes labelled "Position", and the property "allowance".

Table 3. Correspondence between Categories in Relational and PG Databases

• A single column in a table (e.g., attribute "age" in table "Person") corresponds to a single property (e.g., property "age" in nodes labelled "Person").
• Multiple columns in a table (e.g., tuples of table "Person" where the attribute "name" is "Elise Thorsnes", joined with tuples from tables "Match" and "PLAYED_IN") correspond to a sub-graph represented by a graph pattern (e.g., the sub-graph containing nodes labelled "Person" where the property "name" is "Elise Thorsnes", nodes labelled "Match", and edges of type "PLAYED_IN").

The second category corresponds to the analysis of a specific graph pattern, which may eventually
be the whole graph. This category matches the Multiple columns category defined by Naumann
et al., where a set of columns from a single table is considered. To refer to a set of properties
of a node or relationship, the most natural approach is to define a graph pattern. In this way, a
specific sub-graph can be considered to perform the required analysis. By defining a graph pattern,
the whole graph can also be considered, so the tasks can be easily generalized when needed.
In this category, we include all those tasks regarding the analysis of specific metadata that can
be retrieved from a graph by taking advantage of the PG's topology. In addition, tasks with the
purpose of generating structural information, such as the number of nodes within the graph pattern,
are considered. Furthermore, analyses related to the generation of summaries, samples, and sketches
are included. Finally, tasks based on clustering and outlier detection are considered.
In the third category, we include tasks that aim to identify correlations between the different
nodes and relationships. One of the most important goals is to discover keys expressed as recursive
graph patterns [13]. Besides key discovery, other kinds of graph functional dependencies can be
identified through the graph. This category is equivalent to the Dependencies category established
by Naumann et al., as the goal of defining dependencies in relational databases extends to
PG databases, too.


Table 4. Overview of Single-property DP Tasks (Cardinalities Subcategory)

• t1 - occurrences of values: number of occurrences of each possible value. Example: occurrences of values in property minuteOn in relationships of type PLAYED_IN. Result: there is one occurrence of value "73′" and one occurrence of value "102′+2′".
• t2 - uniqueness: percentage of items with distinct values. Example: uniqueness of property name in nodes labelled Person. Result: the uniqueness is 100%.
• t3 - nullability: percentage of items for which the property is not defined. Example: nullability of property stage in nodes labelled Match. Result: the nullability is 0%.

Table 5. Overview of Single-property DP Tasks (Patterns, Datatypes, and Domains Subcategory)

• t4 - regular expressions: discovery of regular expressions in property values. Example: discover regular expressions in property phone in nodes labelled Person. Result: the three phone numbers match the regular expression ^(\+)?([0-9]){10,16}$.
• t5 - semantic domains: use values to identify a semantic domain. Example: which is the semantic domain of property stage in nodes labelled Match? Result: the values of property stage represent the moment of the tournament at which the match was played.

3.2.1 Data Profiling over Single Properties. This kind of analysis is the most straightforward
one for reusing techniques and methods from DP in relational databases. The goal is to analyze
the values of a certain property in nodes or relationships. These analysis tasks can be further
organised into a set of subcategories that we present next.
Cardinalities. This subcategory corresponds to simple metadata discovery that can be useful
to understand the stored data. For example, calculating the occurrences of values in property
minuteOn for relationships of type PLAYED_IN provides an overview of the distribution of values
of the minuteOn property. The analogy in a relational database is to analyze the occurrences
of values of the column minuteOn in the table that stores the players who participated in a match.
Other possible tasks in this subcategory include calculating the uniqueness or nullability of a given
property in a node or relationship with a given label/type. Table 4 shows examples of relevant
tasks in this subcategory.
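For illustration, task t1 can be rendered in Cypher (the language we use for the experiments in Section 5); the following is only a minimal sketch over the running example, assuming a Neo4j-style engine.

// t1 - occurrences of values: count how many times each value of
// property minuteOn appears in relationships of type PLAYED_IN
MATCH ()-[p:PLAYED_IN]->()
RETURN p.minuteOn AS value, count(*) AS occurrences
ORDER BY occurrences DESC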
Patterns, data types, and domains. Identifying frequent patterns and data types is especially useful
when preparing the dataset for data cleansing procedures. For example, identifying a pattern
in the phone property of nodes labelled Person can be used to detect data quality problems such as
malformed phone numbers. Finding semantic domains is also very important when no meaningful
information about a property is given. For instance, in relationships of type PLAYED_IN, the
meaning of the type property is unknown. By analyzing that property and identifying that it stores
values such as Start or Subbed on, we can go on to a further analysis and determine that those
values correspond to the moment of the match at which the player joined the game. Table 5 shows
examples of relevant tasks in this subcategory.
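A discovered pattern can be checked with Cypher's =~ regular-expression operator; the sketch below validates the phone property of Person nodes against the expression of Table 5 and is one possible rendering of task t4, not the only one.

// t4 - patterns: count how many phone values match (or do not match)
// the candidate regular expression discovered for the property
MATCH (n:Person)
WHERE n.phone IS NOT NULL
RETURN n.phone =~ '^(\\+)?([0-9]){10,16}$' AS matchesPattern,
       count(*) AS persons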
Value distribution. Investigating the value distribution within a property is essential to understand
how the data may evolve. In our application example, creating a frequency histogram of property
date in nodes labelled Match can be used to identify outlier values within the dataset. Calculating
quartiles and other statistics is also important to describe the values of the property. Table 6 shows
examples of relevant tasks in this subcategory.
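Quartiles, in particular, are directly available through Cypher's percentileCont aggregation; the following minimal sketch profiles the score property of PLAYED_IN relationships from the running example.

// value distribution: quartiles of property score in relationships
// of type PLAYED_IN
MATCH ()-[p:PLAYED_IN]->()
RETURN percentileCont(p.score, 0.25) AS q1,
       percentileCont(p.score, 0.50) AS median,
       percentileCont(p.score, 0.75) AS q3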


Table 6. Overview of Single-property DP Tasks (Value Distribution Subcategory)

• t6 - summary statistics: calculate min, max, average, or standard deviation. Example: compute the average value of property minuteOn in relationships of type PLAYED_IN. Result: the two values are 73′ and 120′+2′; the average value is 97.5′.
• t7 - outliers: identify outliers. Example: check for outliers in property score in relationships of type PLAYED_IN. Result: the value "31" is an outlier; most values range from 0 to 6.

Table 7. Overview of Graph Pattern DP Tasks (Structure Metadata Subcategory)

• t8 - number of nodes or relationships: count nodes or relationships. Example: count the number of nodes and relationships considering the whole graph. Result: the graph contains 8 nodes and 13 relationships.
• t9 - list types: list node labels or relationship types. Example: list node labels and relationship types considering the whole graph. Result: the node labels are Person, Match, Tournament, Team, and Squad, while the relationship types are PLAYED_IN, SCORED_GOAL, COACH_FOR, IN_SQUAD, NAMED, REPRESENTS, FOR, PARTICIPATED_IN, and IN_TOURNAMENT.
• t10 - number of nodes or relationships per type: count nodes or relationships per label/type. Example: count the number of nodes per label considering the whole graph. Result: Person: 3 nodes, Match: 2 nodes, Team: 1 node, Squad: 1 node, and Tournament: 1 node.
• t11 - properties of a node or relationship: list all the properties of a certain node or edge. Example: list the properties of node 101. Result: phone, name.

3.2.2 Data Profiling over Graph Patterns. Profiling tasks in this category include a wide set of
techniques that vary from simple metrics, such as the number of nodes of the graph, to complex
graph-theoretic methods or sampling and summarization strategies. The tasks included in this section take
advantage of the particular topology of the PG representation, and so they are quite specific to
PG databases. The tasks and methods considered can be applied to the whole graph or to a specific graph
pattern, according to the scope selected for the required metadata, and can be further organised into
a set of subcategories that we present next.
Structure metadata. Because of the lack of a defined schema, generating metadata from the
topological structure of the graph is very relevant to perform meaningful queries and a proper
manipulation of the stored data. Collecting basic information, such as the number of nodes or
relationships, is one of the simplest tasks that should be performed. In addition, listing the labels/types and
the number of nodes or edges of each type can be very important to understand the domain
represented by the graph. In our example, an interesting task is to calculate the number of nodes of
type Team and relationships of type PARTICIPATED_IN. Table 7 shows examples of relevant tasks
in this subcategory.
Paths metadata. In this subcategory, we include all those tasks that aim to gather relevant metadata
regarding the relationships and the paths they form within the graph. Computing shortest
paths between two given nodes is a fundamental operation over graphs, especially when working
with social or biological networks. Classic graph theory techniques, such as Dijkstra's algorithm,
can be applied, but due to their computational complexity, their implementation on real-world datasets
is a major challenge. In Reference [18], the authors propose techniques to improve shortest-path
estimations. Triangles, also referred to as Triadic Closures, are another key concept from graph theory,
frequently used in social graphs to predict when two people are likely to become connected:
if two nodes are connected by a path involving a third node, then there is an increased likelihood
that the two nodes will become directly connected at some point in the future [28]. Triangle
discovery is thus an interesting analysis because of the possibility of inferring future links between
data. Discovering important or relevant nodes is another key task when trying to find hidden
organizational structure. Finding important nodes within a graph is challenging, as the definition
of what makes a node important is relative to each domain. In our example, a relevant person
can be defined as a person that has many outgoing SCORED_GOAL relationships. However, this is
not the only possible definition: we could also define a relevant person as a person who PLAYED_IN
a lot of matches. In Reference [29], advanced methods are applied to systematize this problem.
Finally, finding those nodes that have no relationships at all is also a key task when analyzing the
graph. Identifying orphan nodes is especially useful when trying to determine the completeness
of the graph, as there may be missing edges, and also to find specific facts according to the
modelled domain. Table 8 shows examples of relevant tasks in this subcategory.

Table 8. Overview of Graph Pattern DP Tasks (Paths Metadata Subcategory)

• t12 - orphan nodes: discover nodes with no edges. Example: find nodes of type Person with no edges. Result: there are no orphan nodes.
• t13 - shortest path: find the shortest path between two nodes [18, 28]. Example: find the shortest path between Ingrid Moe Wold (107) and Martin Sjogren (106). Result: the shortest path between these nodes has length 2, reaching 106 from 107 via edges 203 and 211.
• t14 - triangle detection: detect triangles formed by nodes and edges [19, 28]. Example: detect triangles that involve node 101 and edges of type SCORED_GOAL and REPRESENTS. Result: the triangle formed by nodes 101, 102, and 104.
• t15 - important nodes: discover important nodes according to some definition of importance [29]. Example: find nodes of type Person with the maximum number of edges of type SCORED_GOAL. Result: node 101, Elise Thorsnes, has 1 edge of type SCORED_GOAL.
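Some of these tasks map directly onto Cypher constructs; the sketch below shows a possible rendering of t13, via the built-in shortestPath function, and of t14, as a generic triangle pattern (names and ids are those of the running example, and id ordering is just one way to avoid duplicate triangles).

// t13 - shortest path between two Person nodes, up to 6 hops
MATCH (a:Person {name: 'Ingrid Moe Wold'}),
      (b:Person {name: 'Martin Sjogren'})
MATCH path = shortestPath((a)-[*..6]-(b))
RETURN length(path) AS pathLength;

// t14 - generic triangle detection
MATCH (a)-[r1]-(b)-[r2]-(c)-[r3]-(a)
WHERE id(a) < id(b) AND id(b) < id(c)
RETURN a, b, c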
Summaries and sketches. Summaries and sketches constitute another way to describe a dataset.
In Reference [7], summarization methods are surveyed for both RDF and non-RDF graphs. For
example, when a report of the players and their scored goals is needed, a summarized
version of the original graph can be used to increase the performance of the process; of course,
this makes sense as the graph grows. Summaries should also be considered when performing
schema extraction [17, 34]. Besides summarization, sampling is an essential research area
when working with large graphs. Most of the techniques previously mentioned become impractical
for huge real graphs, and so sampling is crucial. The goal of a sampling method is
to efficiently estimate graph properties by consulting a sample of the whole population [3]. In
Reference [22], the authors provide a systematic evaluation of sampling algorithms. Table 9 shows
examples of relevant tasks in this subcategory.
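As a simple illustration, a naive uniform node sample can be drawn in Cypher with the rand() function; note that this is only a sketch, not the stream sampling of gSH [3] or the methods evaluated in Reference [22].

// naive node sampling: retain each node with probability 0.1;
// the relationships between retained nodes form the induced sample
MATCH (n)
WHERE rand() < 0.1
RETURN n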
Table 9. Overview of Graph Pattern DP Tasks (Summaries and Sketches Subcategory)

• t17 - graph summarization: extract information from the graph, representing its content as faithfully as possible [17]. Example: extract a sub-graph with only the nodes labelled Squad and Tournament, and a new relationship type TOTAL_SCORED_GOAL with a new property total_score containing the consolidated information of the goals scored by each squad through the tournament. Result: a sub-graph consisting of the nodes with ids 103 and 105 and a new edge of type TOTAL_SCORED_GOAL between them, with the new property total_score: 32.
• t18 - schema extraction: infer a schema from the graph [11, 34, 38]. Example: is the name property a candidate key for nodes of type Person? Result: according to the uniqueness of this property (100%), it can be a good candidate.

Clustering and outlier detection. Another useful profiling task is to segment the data into
homogeneous groups using a clustering algorithm. Clustering is especially interesting when dealing
with large and heterogeneous datasets. In Reference [5], a tool for Linked Open Data, called
ProLOD, is presented. This tool clusters the data into semantically correlated data (sub)sets, for which
meaningful and insightful profiling results are possible. Table 10 shows examples of relevant tasks
in this subcategory.

Table 10. Overview of Graph Pattern DP Tasks (Clustering and Outlier Detection Subcategory)

• t19 - clustering: segment the nodes into homogeneous groups [5]. Example: segment the Match nodes according to the month (a part of property date). Result: only one segment, containing matches 102 and 108, as they both correspond to dates in October.
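A lightweight version of this segmentation can be expressed directly in Cypher by grouping on the month component of the date; the sketch below assumes, for illustration only, that date is stored as a 'YYYY-MM-DD' string.

// t19 - segment Match nodes by month (characters 5-6 of the date string)
MATCH (m:Match)
RETURN substring(m.date, 5, 2) AS month,
       collect(id(m)) AS matchesInSegment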
3.2.3 Graph Dependencies in Data Profiling. Finding graph dependencies corresponds to an
advanced stage of a data profiling process. As mentioned in Section 2, graph dependencies
have been widely studied and, in particular, graph dependency discovery is a current research
field. In this section, we identify three tasks of interest during data profiling. A deeper analysis
of these tasks is suggested as future work.
Key discovery. As in the relational world, keys are essential to provide a connection
between real-world entities and their representation in the graph. However, when working with graph
databases, keys are more challenging because they are defined in terms of graph patterns, specifying
constraints on both the topological structure and value bindings. In Reference [32], key discovery
in graphs was studied in detail.
Graph Functional Dependencies discovery. GFDs provide a primitive form of integrity constraints
to specify a fundamental part of the semantics of the data. Thus, they are an essential tool when
specifying the integrity of graph entities, optimizing graph queries, and, in particular, performing
consistency checking [14].
Other graph dependencies. Other types of dependencies may be considered as well. For example,
inclusion dependencies can be particularly useful; this area was studied in References [14] and [20].

3.3 Data Profiling Pipeline


Defining which tasks should be applied to a given dataset is not a straightforward process. It
depends on the nature of the graph, the queries that will be executed and need to be optimised,
the available resources, and the amount of existing metadata. Moreover, determining a proper flow
that establishes in which order to apply the tasks is a challenge, too.


Table 11. Data Profiling Pipeline

• Structure metadata discovery. Suggested tasks: amount of nodes, relationships, or paths; list node, relationship, or path labels; number of nodes, relationships, or paths per label. Expected output: overview of the graph topology and estimation of the graph size; relevant nodes, relationships, and paths and their labels; properties of interest in specific nodes, relationships, or paths; need for summaries, sampling, or sketches.
• Cardinalities and value distribution discovery. Suggested tasks: occurrences of values; uniqueness; null values; summary statistics; outliers. Expected output: detailed analysis of relevant properties; relevant properties to analyze in future stages.
• Patterns, data types & domains. Suggested tasks: regular expression discovery; identification of semantic domains. Expected output: exhaustive metadata for specific properties.
• Paths metadata discovery. Suggested tasks: shortest paths between two nodes; orphan nodes; important nodes. Expected output: summaries, sampling procedures, or sketches; exhaustive characterization of the graph; relevant metadata from the domain.
• Functional discovery. Suggested tasks: key and GFD discovery. Expected output: exhaustive functional characterization of the graph.

In Table 11, we propose a guide pipeline that can be applied to a general PG. It can be used
as a guideline or framework to start the data profiling process when no metadata about the graph
are known.
The guideline starts with the most general tasks, whose purpose is to gather basic information
about the dataset, and moves toward more specific techniques that are useful to gain a deeper insight
into the graph. In addition, the implementation/computational cost of the tasks is considered: the
first stages include tasks that are relatively easy to apply, while more complex
methods are included in the last stages. The guideline suggests the following stages:
(1) Structure metadata discovery
(2) Cardinalities and value distribution discovery
(3) Patterns, data types and domains
(4) Paths metadata discovery
(5) Key and GFD discovery
First, we propose the "Structure metadata discovery" stage, where tasks should be applied
to obtain a first approximation to the data source. The main goal of this step is to get an
overview of the graph in order to plan and design the necessary tasks. Understanding the labels, nodes, and
relationships of the dataset is essential to plan further tasks. For example, having an estimation
of the graph size (by counting the number of nodes or relationships) is essential to determine
whether other tasks can be handled at a reasonable computational cost. With this output, further
procedures, such as sketches or summarizations, can be considered.
As a second step, we suggest including some of the tasks in the single-property category. By
executing those tasks, simple metadata regarding cardinalities and value distribution are collected.
As many of these tasks can be generalized at a reasonable cost, this is a quite straightforward way to
understand the nature of the graph and to obtain quick and useful information about its data. For
instance, as a standardised step, the nullability of some properties within a node or relationship
with a certain label (defined from the metadata retrieved in the first stage) can be calculated, as
sketched below.
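For instance, the following Cypher sketch computes t3 for the stage property of Match nodes, as in Table 4; the 100.0 factor forces a floating-point percentage.

// t3 - nullability: percentage of Match nodes where stage is null
MATCH (m:Match)
RETURN 100.0 * sum(CASE WHEN m.stage IS NULL THEN 1 ELSE 0 END)
       / count(m) AS nullabilityPercentage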


Table 12. Data Profiling Pipeline for the Application Example

• Structure metadata discovery: t8 - amount of nodes or relationships over the full graph; t9 - list labels/types over the full graph; t10 - number of nodes or relationships per label/type over the full graph; t11 - list all properties of a node/relationship over the full graph.
• Cardinalities and value distribution discovery: t1 - occurrences of values in property "minuteOn" in relationships of type "PLAYED_IN"; t2 - uniqueness of property "name" in nodes labelled "Person"; t3 - nullability of property "stage" in nodes labelled "Match"; t6 - average value of property "score" in relationships of type "PLAYED_IN"; t7 - identify outlier values in property "score" in relationships of type "PLAYED_IN".
• Patterns, data types & domains: t4 - discover regular expressions in property "stage" in relationships of type "IN_TOURNAMENT"; t5 - identify the semantic domain of property "stage" in nodes labelled "Match".
• Paths metadata discovery: t13 - find shortest paths between two nodes; t12 - orphan nodes: discover players that have not participated in a match; t14 - triangle detection; t15 - discover important nodes.
• Functional discovery: t20 - key and GFD discovery.

Then, we propose performing more advanced tasks, such as the discovery of patterns, data types, and
domains. To define for which properties these tasks should be applied, it is essential to keep in mind
the output retrieved from stage 1.
Once all the necessary tasks regarding properties have been performed, the "Paths metadata" tasks and
methods should be applied. Defining which of these tasks should be applied depends on the nature
of the graph, the data being modelled, and the requirements. For example, a triangle discovery
analysis can be really useful in a social network scenario, while it may be of no interest in
other kinds of domains. As these kinds of analyses imply a huge computational cost in terms of
programming, processing, and execution time, they should be properly designed. Furthermore, to
execute some of those methods on real-world datasets, summaries and sketches may have to be generated
first. Thus, deciding which methods will be applied, and over which dataset (the
original one or a reduced version of it), is a major decision.
Once the mentioned tasks have been performed, useful metadata are generated, so a proper characterization
of the dataset can be done. However, more advanced techniques can be executed to obtain functional
information. In this direction, methods for key and functional dependency discovery can be
applied. In Table 12, this pipeline is instantiated for our application example to illustrate how to guide
the data profiling tasks in a real PG dataset.

4 FORMAL SPECIFICATION OF DATA PROFILING TASKS


In this section, we provide a formal specification of a subset of the tasks compiled in this work,
based on the Property Graph model presented in Reference [15]. We focus on the tasks that can
be expressed in terms of the PG model definition, independently of a specific query language, to
abstract the specification from any particular PG engine. This is especially relevant as there is no
standard query language across PG databases yet. Specifically, we omit the tasks that require
specific methods from the database engine and/or an expressive power beyond typical graph
query languages, like the discovery of shortest paths or the generation of summaries. The implementation
of those tasks strongly depends on the selected environment (database engine, programming
language, etc.).

4.1 The Property Graph Data Model


To make the article self-contained, we first introduce the main constructs in the model.
Definition 4.1 (Values [15]). Let us consider three disjoint sets K of property keys, N of node
identifiers, and R of relationship identifiers. These sets are all assumed to be infinite. We assume
two base types: the integers Z and the type of finite strings over a finite alphabet Σ (this does not
really affect the semantics of queries; these two types are chosen purely for illustration purposes).
The set V of values is inductively defined as follows:
• Identifiers (i.e., elements of N and R) are values;
• Base types (elements of Z and Σ*) are values;
• true, false, and null are values;
• list() is a value (the empty list), and if v1, . . . , vm are values, for m > 0, then list(v1, . . . , vm) is
a value;
• map() is a value (the empty map), and if k1, . . . , km are distinct property keys and v1, . . . , vm
are values, for m > 0, then map((k1, v1), . . . , (km, vm)) is a value;
• If n is a node identifier, then path(n) is a value. If n1, . . . , nm are node ids and r1, . . . , rm−1
are relationship ids, for m > 1, then path(n1, r1, n2, . . . , nm−1, rm−1, nm) is a value.
Definition 4.2 (Property Graph [15]). Let L and T be countable sets of node labels and relationship
types, respectively. A Property Graph is a tuple G = (N, R, src, tgt, ι, δ, λ, τ), where
(1) N is a finite subset of the node identifiers N, whose elements are referred to as the nodes of G;
(2) R is a finite subset of the relationship identifiers R, whose elements are referred to as the relationships of G;
(3) src : R → N is a function that maps each relationship to its source node;
(4) tgt : R → N is a function that maps each relationship to its target node;
(5) ι : (N ∪ R) × K → V is a finite partial function that maps a (node or relationship) identifier
and a property key to a value;
(6) λ : N → 2^L is a function that maps each node id to a finite (possibly empty) set of labels;
(7) τ : R → T is a function that maps each relationship identifier to a relationship type.

4.2 Formalization of Single-property Tasks


We first present, in Definitions 4.3, 4.4, 4.5, and 4.6, the formal specification of tasks in the single-
property category of our proposed taxonomy, discussed in Section 3.2.1.
Definition 4.3 (t1 - Occurrences of Values). Given a property k ∈ K, return a set Ω of pairs (v, nv),
where v ∈ V is a value of the property and nv is the number of instances that have this value. Then
Ω = {(v, card({x ∈ N ∪ R / ι(x, k) = v})) / v ∈ V}.
Definition 4.4 (t2 - Uniqueness). Given a property k ∈ K, return the uniqueness t2 as the ratio between
c1, the number of distinct values of property k, and c2, the number of nodes and edges for which
property k is defined and not null. Then, c1 = card({v ∈ V / ∃x ∈ N ∪ R, ι(x, k) = v}),
c2 = card({y ∈ N ∪ R / ι(y, k) ≠ null}), and t2 = 100 · c1 ÷ c2.
Definition 4.5 (t3 - Nullability). Given a property k ∈ K, return the nullability t3 as the ratio between
the number of nodes and edges for which property k has a null value and the number of nodes
and edges where property k is not null. Then, t3 = card({x ∈ N ∪ R / ι(x, k) = null}) ÷ card({y ∈
N ∪ R / ι(y, k) ≠ null}).


Definition 4.6 (t6 - Summary Statistics). Given a property k ∈ K, return different measures, such as
the minimum and maximum values, the average, and the standard deviation of the set of values. Let
Ω = {ι(x, k) / x ∈ N ∪ R ∧ ι(x, k) ≠ null}; then max = Max(Ω), min = Min(Ω), avg = Avg(Ω), and
std = StDev(Ω).

4.3 Formalization of Graph Patterns Tasks


To formalize the tasks included in the graph patterns category, we assume the following. Given
a graph G and a graph pattern χ, Gχ is the sub-graph of G that satisfies this pattern, according
to the satisfaction relation defined in Reference [15]. Then, the graph Gχ is a tuple Gχ =
(Nχ, Rχ, srcχ, tgtχ, ιχ, δχ, λχ, τχ) as defined in Definition 4.2.
Definition 4.7 (t8 - Number of Nodes and Number of Edges). Given the resulting sub-graph Gχ, then
t8-nodes = card(Nχ) and t8-edges = card(Rχ).
Definition 4.8 (t9 - Types of Nodes and Types of Edges). Given the resulting sub-graph Gχ, then
t9-nodes = {λχ(n) / n ∈ Nχ} and t9-edges = {τχ(r) / r ∈ Rχ}.
Definition 4.9 (t10 - Number of Nodes and Number of Edges per Type). Given the resulting sub-graph
Gχ, then t10-nodes = {(l, card({n ∈ Nχ / l ∈ λχ(n)})) / l ∈ L ∧ ∃n ∈ Nχ, l ∈ λχ(n)} and t10-edges =
{(t, card({r ∈ Rχ / τχ(r) = t})) / t ∈ T ∧ ∃r ∈ Rχ, τχ(r) = t}.
Definition 4.10 (t11 - List of Properties). Given the resulting sub-graph Gχ, let n ∈ Nχ be a node
of interest. Then, t11-n = {k ∈ K / ιχ(n, k) ≠ null}. Let now r ∈ Rχ be a relationship of interest; then
t11-r = {k ∈ K / ιχ(r, k) ≠ null}.
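In a Neo4j-style engine, t11 maps onto the built-in keys() function; the sketch below lists the property keys of the node with internal id 101 from the running example (one possible rendering, complementing the queries of Section 5).

// t11 - list the properties of the node with internal id 101
MATCH (n)
WHERE id(n) = 101
RETURN keys(n) AS propertyKeys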

5 EXPERIMENTATION IN CYPHER
In this section, some of the tasks proposed in the previous sections are implemented in a specific query
language, Cypher, in order to execute them on a real-world dataset.
We implemented every task that was formalized in Section 4, as well as other tasks from Section 3, by
taking advantage of Cypher functions and methods. We did not implement tasks, such as summarization
and graph dependency discovery, for which special algorithms or techniques need to be developed.

5.1 Experimentation Environment


The selected real-world dataset is an extended version of the running application example presented
in Section 3. The graph was loaded into a Neo4j engine by following the instructions provided
in Reference [26]. The dataset has a total store size of 4.91 MB. The developed queries were
performed over Neo4j Desktop 1.2.2, running on a Windows 10.0.18 machine with an Intel Core
i7-7500, 2.70 GHz, and 8.00 GB of RAM. The average execution time of the queries was 73.14 ms.
The maximum execution time was 378 ms, while the fastest query was executed in 2 ms. The script
to load the graph database and the developed queries are available in a public repository.1
The tasks were executed following the pipeline specified in Section 3.3 to validate its usefulness.
To compare the results provided by the execution of our pipeline, two automatic tools
were run over the application example's graph: Neo4j Database Analyzer and yFiles Neo4j
Explorer. The Neo4j Database Analyzer is integrated into Neo4j Desktop; it was installed and used
from the Neo4j Desktop development environment. In contrast, yFiles Neo4j Explorer is a standalone
tool that was connected to the Neo4j database and displayed the results in its own web application.
In Table 13, we compare in detail which tasks are provided by Neo4j Database Analyzer
and yFiles Neo4j Explorer and which are not.

1 https://gitlab.fing.edu.uy/sofia.maiolo/data-profiling-in-property-graph-databases


Table 13. Comparison between Our Proposal Tasks and Automatic Tools Tasks

Stage | Suggested Tasks | Our proposal | Neo4j Database Analyzer | yFiles Neo4j Explorer
Single property: Cardinalities and value distribution | t1 - occurrences of values | ✓ | × | ×
 | t2 - uniqueness | ✓ | × | ×
 | t3 - nullability | ✓ | × | ×
 | t6 - statistic measures | ✓ | × | ×
 | t7 - outliers | ✓ | × | ×
Graph patterns: Structure metadata | t8 - amount of nodes or relationships | ✓ | ✓ | ✓
 | t9 - list labels/types | ✓ | ✓ | ✓
 | t10 - number of nodes or relationships per label/type | ✓ | ✓ | ×
 | t11 - list all properties of a specific node/relationship | ✓ | ✓ | ×
Graph patterns: Paths metadata | t12 - orphan nodes | ✓ | × | ×
 | t13 - shortest paths between two nodes | ✓ | × | ✓
 | t14 - triangle detection | ✓ | × | ×
 | t15 - important nodes | ✓ | × | ✓

5.2 Queries
We present below a subset of the executed queries, with their results and a brief discussion. In
Appendix A, we present each query together with its results.
Number of nodes per label
MATCH (n)
RETURN labels(n) AS NodeLabel, count(n) AS NumberOfNodes

In this task, we calculate the number of nodes for each label. The query completes in 5 ms. As
visualized in Figure 4, there are 2,022 nodes labelled “Person” while only 8 nodes are labelled
“Tournament.” This task was also included in the Neo4j Database Analyzer and yFiles Neo4j
Explorer default analyses, with exactly the same results.
Number of relationships per type
MATCH ()-[p]-()
RETURN Type(p) AS RelationshipType, count(p) AS NumberOfRelationships

In this task, we calculate the number of relationships for each type. The query completes in
54 ms. As visualized in Figure 4, the query reports 16,504 relationships of type “PLAYED_IN” and
only 272 of type “PARTICIPATED_IN.” This task was also included in the Neo4j Database
Analyzer and yFiles Neo4j Explorer default analyses, with exactly the same results.
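Note that the undirected pattern ()-[p]-() matches every relationship twice, once from each endpoint, which is why these per-type counts (e.g., 16,504 for “PLAYED_IN”) add up to exactly twice the 14,799 total relationships reported in Appendix A.1. If each relationship should instead be counted once, a directed variant of the query can be used (a sketch):
MATCH ()-[p]->()
RETURN Type(p) AS RelationshipType, count(p) AS NumberOfRelationships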
Uniqueness
MATCH (p:Person)
RETURN count(DISTINCT p.name) AS DifferentNames,
count(p.name) AS TotalPerson,
100*count(DISTINCT p.name)/count(p.name) AS Uniqueness


Fig. 4. Experimental results.

In this task, we calculate the uniqueness of the property “name” within nodes labelled “Person.”
The query completes in 15 ms. The returned Uniqueness value is 99%, with 2,008 distinct
names among 2,022 nodes labelled “Person.” This task is included in neither the Neo4j Database
Analyzer nor the yFiles Neo4j Explorer analysis.
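A caveat on this query: since both aggregates return integers, 100*count(DISTINCT p.name)/count(p.name) is evaluated with integer division, so the 99% shown is the truncation of roughly 99.3%. A sketch that preserves the fractional part using Cypher's standard toFloat() function:
MATCH (p:Person)
RETURN 100 * count(DISTINCT p.name) / toFloat(count(p.name)) AS UniquenessPct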
Statistic measures
MATCH ()-[p:PLAYED_IN]-()
RETURN min(p.score) AS MinScore,
max(p.score) AS MaxScore,
avg(p.score) AS AvgScore,
stDev(p.score) AS StdScore

In this task, we calculate the minimum, maximum, average, and standard deviation of the “score”
property for relationships of type “PLAYED_IN.” The query completes in 69 ms. As visualized
in Figure 4, the average value of the “score” property is 1.614, while the maximum value is 13. This
task is included in neither the Neo4j Database Analyzer nor the yFiles Neo4j Explorer analysis.
Outliers
MATCH ()-[p:PLAYED_IN]-()
WITH avg(p.score) - stDev(p.score)*3 AS min,
avg(p.score) + stDev(p.score)*3 AS max
MATCH ()-[p:PLAYED_IN]-()
RETURN DISTINCT p.score , (p.score < min or p.score> max) as isOutlier
ORDER BY p.score


In this task, we look for outliers in the “score” property for relationships of type “PLAYED_IN.”
The query completes in 166 ms. Values from 7 to 13 are considered outliers. This task is
included in neither the Neo4j Database Analyzer nor the yFiles Neo4j Explorer analysis.
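This query applies a three-standard-deviations rule, which implicitly assumes the values are roughly normally distributed. For a skewed property such as “score,” an interquartile-range rule is often more robust; a possible sketch using Cypher's percentileCont() aggregation (1.5 × IQR fences):
MATCH ()-[p:PLAYED_IN]-()
WITH percentileCont(p.score, 0.25) AS q1, percentileCont(p.score, 0.75) AS q3
WITH q1, q3, q3 - q1 AS iqr
MATCH ()-[p:PLAYED_IN]-()
RETURN DISTINCT p.score,
(p.score < q1 - 1.5*iqr OR p.score > q3 + 1.5*iqr) AS isOutlier
ORDER BY p.score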
Orphan nodes
MATCH (p:Person)
WITH p,size((p)-[]->()) as edges
WHERE edges = 0
RETURN p.name AS OrphanPlayer

In this task, we find nodes labelled “Person” with no edges. The query completes in 219 ms
and returns no orphan nodes. This task is included in neither the Neo4j Database Analyzer nor the
yFiles Neo4j Explorer analysis.
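Note that the pattern (p)-[]->() in this query counts only outgoing relationships, so a node whose only relationships are incoming would be reported as orphan even though it is connected. A variant that checks both directions (a sketch):
MATCH (p:Person)
WHERE size((p)--()) = 0
RETURN p.name AS OrphanPlayer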
Node rank
MATCH (u:Person)
WITH u,size( (u)-[:SCORED_GOAL]->()) AS AmountGoals
ORDER BY AmountGoals DESC
RETURN u.name, AmountGoals LIMIT 10

In this task, we find the nodes labelled “Person” with the most occurrences of relationships of type
“SCORED_GOAL.” The query completes in 11 ms. As visualized in Figure 4, it returns the
top 10 players by number of goals scored. “Marta” is the top scorer, with 18 goals. This task is
included in neither the Neo4j Database Analyzer nor the yFiles Neo4j Explorer analysis.

6 CONCLUSIONS AND FUTURE WORK


Data Profiling in PGs is essential to properly handle these databases and take full advantage
of their features and capabilities, especially when working with a dataset that is not known in advance.
Defining which tasks should be applied to a given PG dataset is a major challenge. In this article,
we compiled and classified a wide set of data profiling methods and techniques with the aim of
organizing the existing work in the area. In this compilation, we reused proposals from
different data models, adapting them to Property Graphs. Wherever feasible, we provided a formal
specification of tasks based on the Property Graph model that can be used to implement the tasks
independently of the selected engine. Moreover, we presented a practical application with a guide
pipeline to give an example of how to apply the proposed tasks. In addition, we developed some
of the tasks in Cypher.
As future work, it would be interesting to implement some of the advanced methods (sampling,
key discovery, etc.) for our PG application example.
Moreover, a deeper analysis of graph dependencies would be useful to compile existing
methods and algorithms.
Another open field is the schema extraction problem, which was mentioned in this work but
deserves further research.
Data Profiling is closely related to Data Quality assessment, so another relevant area for future
study is working on the definition of a Data Quality model for PG datasets by defining specific
dimensions, metrics, and granularities. This Data Quality model can be used together with the
proposed tasks and pipeline to create a more ambitious framework for PG datasets management.
Due to the continuous evolution and increasing popularity of graph databases, the compilation
and classification proposed may evolve, but in this work we seek to propose a base framework that
can be reused and extended.


APPENDIX
A EXPERIMENTATION: QUERIES AND RESULTS
In this section, we present the Cypher queries developed for each DP task and the results from
their execution in the running application example.

A.1 Structure Metadata Discovery


Amount of nodes
MATCH (n)
RETURN count(n) as AmountNodes
The query was executed in 2 ms. It returned 2,486 as the amount of nodes.
Amount of relationships
MATCH ()-[r]->()
RETURN count(*) as AmountRelationships
The query was executed in 14 ms. It returned 14,799 as the amount of relationships.
List node’s labels
CALL db.labels()
The query was executed in 8 ms. It returned the following labels:
• Tournament
• Match
• Person
• Squad
• Team
List relationship’s types
CALL db.relationshipTypes()
The query was executed in 5 ms. It returned the following types:
• PARTICIPATED_IN
• FOR
• IN_SQUAD
• COACH_FOR
• PARTICIPATED_IN
• IN_TOURNAMENT
• PLAYED_IN
• SCORED_GOAL
• REPRESENTS
Number of nodes per label
MATCH (n)
RETURN labels(n) AS NodeLabel, count(n) AS NumberOfNodes
The query was executed in 29 ms. It returned the following values:
• Tournament: 8
• Match: 284


• Person: 2022
• Squad: 136
• Team: 36
Number of relationships per type
MATCH ()-[p]-()
RETURN Type(p) AS RelationshipType, count(p) AS NumberOfRelationship
The query was executed in 113 ms. It returned the following values:
• PARTICIPATED_IN: 272
• FOR: 272
• IN_SQUAD: 5760
• COACH_FOR: 280
• PARTICIPATED_IN: 272
• IN_TOURNAMENT: 568
• PLAYED_IN: 16504
• SCORED_GOAL: 1814
• REPRESENTS: 3856
List all properties of a node
MATCH (m:Match)
RETURN DISTINCT keys(m)
The query was executed in 34 ms. It returned the following properties:
• stage
• date
List all properties of a relationship
MATCH (t:Team)-[p:PLAYED_IN]-()
RETURN DISTINCT keys(p)
The query was executed in 378 ms. It returned the following properties:
• score
• penaltyScore

A.2 Cardinalities and Value Distribution Discovery Subcategory


Occurrences of values
MATCH ()-[p:PLAYED_IN]-()
RETURN p.minuteOn AS Minute, count(p.minuteOn) AS CountOfOccurrences
ORDER BY CountOfOccurrences
The query was executed in 413 ms. There are 106 different values within the “minuteOn” prop-
erty. Table 14 presents the 10 values with the most occurrences; the value with the most
occurrences is “45.” Table 15 presents the 10 values with the fewest occurrences.
Uniqueness
MATCH (p:Person)
RETURN count(DISTINCT p.name) AS DifferentNames, count(p.name) AS TotalPerson,
100*count(DISTINCT p.name)/count(p.name) AS Uniqueness


Table 14. Occurrences of Values: Ten Values with the Most Occurrences

Minute | Count of occurrences
“45” | 178
“65” | 82
“76” | 78
“78” | 78
“46” | 74
“71” | 74
“75” | 74
“74” | 72
“85” | 72
“84” | 68

Table 15. Occurrences of Values: Ten Values with the Fewest Occurrences

Minute | Count of occurrences
“24” | 2
“20” | 2
“114” | 2
“90+2” | 2
“90+3” | 2
“11” | 2
“112” | 2
“94” | 2
“91” | 2
“9” | 2

The query was executed in 40 ms. It returned the following values:


• DifferentNames: 2,008
• TotalPerson: 2022
• Uniqueness: 99%
Nullability
MATCH (m:Match)
WHERE m.stage IS null
RETURN count(m) as TotalNulls

The query was executed in 101 ms. It returned TotalNulls = 0. There are no null values in property
“stage.”
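The formalization of t3 in Section 4 specifies a ratio between null and non-null occurrences rather than a plain null count. Since aggregates in Cypher skip null values, that ratio could be computed with a sketch such as the following (a hypothetical variant, not part of the executed query set):
MATCH (m:Match)
WITH sum(CASE WHEN m.stage IS NULL THEN 1 ELSE 0 END) AS Nulls,
count(m.stage) AS NonNulls
RETURN Nulls, NonNulls, toFloat(Nulls) / NonNulls AS NullRatio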
Summary statistics
MATCH ()-[p:PLAYED_IN]-()
RETURN min(p.score) AS MinScore, max(p.score) AS MaxScore,
avg(p.score) AS AvgScore, stDev(p.score) AS StdScore


Table 16. Outliers

Score | Is Outlier?
1 | False
2 | False
3 | False
4 | False
5 | False
6 | False
7 | True
8 | True
9 | True
10 | True
11 | True
12 | True
13 | True

The query was executed in 162 ms. It returned the following values:
• MinScore: 0
• MaxScore: 13
• AvgScore: 1.614
• StdScore: 1.754
Outliers
MATCH ()-[p:PLAYED_IN]-()
WITH avg(p.score) - stDev(p.score)*3 AS min,
avg(p.score) + stDev(p.score)*3 AS max
MATCH ()-[p:PLAYED_IN]-()
RETURN DISTINCT p.score , (p.score < min or p.score> max) as isOutlier
ORDER BY p.score
The query was executed in 166 ms. Table 16 presents the results.

A.3 Paths Metadata Discovery Subcategory


Orphan nodes
MATCH (p:Person)
WITH p,size( (p)-[:PLAYED_IN]->(:Match)) as played
WHERE played = 0
RETURN p.name AS OrphanPlayer
The query was executed in 46 ms. It returned 457 nodes labelled “Person” that did not play any
match. Table 17 presents the first 10 results.
Shortest path nodes
MATCH path = allShortestPaths((u:Person)-[*]-(p2:Person))
WHERE u.name="Ingrid Moe Wold" and p2.name="Martin Sjogren"
RETURN path
The query was executed in 57 ms. Figure 5 presents the returned path.
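Since the variable-length pattern [*] in allShortestPaths is unbounded, this query can become expensive on larger graphs; bounding the maximum path length (here an assumed limit of 6 hops) keeps the search tractable, as in the following sketch:
MATCH path = allShortestPaths((u:Person)-[*..6]-(p2:Person))
WHERE u.name="Ingrid Moe Wold" and p2.name="Martin Sjogren"
RETURN path, length(path) AS Hops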


Table 17. Orphan Nodes: First 10 Players

Orphan Player
“Michele”
“Thais”
“Barcellos Jorge”
“Ursula Holl”
“Hannah Bromley”
“Stephanie Puckrin”
“Priscilla Duncan”
“Rachel Howard”
“Herdman John”
“Patricia Ofori”

Fig. 5. Shortest path between Ingrid Moe Wold and Martin Sjogren.

Table 18. Node Rank: First 10 Players with the Most Goals

Player | Goals
Marta | 18
Abby Wambach | 15
Birgit Prinz | 14
Michelle Akers | 12
Sun Wen | 11
Bettina Wiegmann | 11
Cristiane | 11
Carli Lloyd | 10
Ann Kristin Aarones | 10
Christine Sinclair | 10

Triangle detection
MATCH (p:Person)-[p1:SCORED_GOAL]-(x1),(p)-[p2:REPRESENTS]-(x2),(x1)-[r1]-(x2)
WHERE x1 <> x2
RETURN count(p1) AS QtyTriangles
The query was executed in 115 ms. It detected 907 triangles within the full graph.
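This count reflects pattern matches rather than distinct node triples: because the patterns are undirected and parallel relationships may exist (e.g., several SCORED_GOAL relationships between the same player and match), the same triangle of nodes can be matched more than once. A sketch that counts each node triple only once:
MATCH (p:Person)-[:SCORED_GOAL]-(x1), (p)-[:REPRESENTS]-(x2), (x1)--(x2)
WHERE x1 <> x2
RETURN count(DISTINCT [id(p), id(x1), id(x2)]) AS DistinctNodeTriples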


Node rank

MATCH (u:Person)
WITH u,size( (u)-[:SCORED_GOAL]->()) AS AmountGoals
ORDER BY AmountGoals DESC
RETURN u.name as Player, AmountGoals

The query was executed in 46 ms. It returned 2,022 nodes labelled “Person,” with their number of
scored goals. Table 18 presents the first 10 results.

REFERENCES
[1] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2017. Data profiling: A tutorial. In Proceedings of the 2017 ACM
International Conference on Management of Data (SIGMOD’17). ACM, New York, NY, 1747–1751. DOI:https://doi.
org/10.1145/3035918.3054772
[2] Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock. 2018. Data profiling. Synth. Lect. Data
Manage. 10, 4 (2018), 1–154.
[3] Nesreen K. Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. 2014. Graph sample and hold: A framework
for big-graph analytics. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD’14). ACM, New York, NY, 1446–1455. DOI:https://doi.org/10.1145/2623330.2623757
[4] Mohamed Ben Ellefi, Zohra Bellahsene, John Breslin, Elena Demidova, Stefan Dietze, Julian Szymanski, and
Konstantin Todorov. 2017. RDF dataset profiling—A survey of features, methods, vocabularies and applications. Se-
mant. Web 9, 08 (2017). DOI:https://doi.org/10.3233/SW-180294
[5] Christoph Böhm, Felix Naumann, Ziawasch Abedjan, Dandy Fenz, Toni Grütze, Daniel Hefenbrock, Matthias Pohl,
and David Sonnabend. 2010. Profiling linked open data with ProLOD. In Proceedings of the 2010 IEEE 26th Interna-
tional Conference on Data Engineering Workshops (ICDEW’10). IEEE, 175–178. DOI:https://doi.org/10.1109/ICDEW.
2010.5452762
[6] Diego Calvanese, Wolfgang Fischl, Reinhard Pichler, Emanuel Sallinger, and Mantaš Simkus. 2014. Capturing rela-
tional schemas and functional dependencies in RDFS. In Proceedings of the National Conference on Artificial Intelligence
2.
[7] Šejla Čebirić, François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos, Ioana Manolescu, Georgia Troullinou,
and Mussab Zneika. 2019. Summarizing semantic graphs: A survey. VLDB J. 28, 3 (01 June 2019), 295–327. DOI:
https://doi.org/10.1007/s00778-018-0528-3
[8] Data Cleaner. [n.d.]. DataCleaner | Better data for better business decisions. Retrieved from https://datacleaner.org/.
[9] Sarah Cohen, Werner Nutt, and Yehoshua Sagic. 2012. Comparative analysis of relational and graph databases. Int. J.
Soft Comput. Eng. 2, 2 (May 2012), 509–512.
[10] Jeremy Debattista, Sören Auer, and Christoph Lange. 2016. Luzzu—A methodology and framework for linked data
quality assessment. J. Data Inf. Qual. 8, 1, Article 4 (October 2016), 32 pages. DOI:https://doi.org/10.1145/2992786
[11] Jeremy Debattista, Christoph Lange, and Sören Auer. 2014. daQ, an ontology for dataset quality information. In CEUR
Workshop Proceedings 1184.
[12] Wenfei Fan. 2019. Dependencies for graphs: Challenges and opportunities. J. Data Inf. Qual. 11, 2, Article 5 (February
2019), 12 pages. DOI:https://doi.org/10.1145/3310230
[13] Wenfei Fan, Zhe Fan, Chao Tian, and Xin Luna Dong. 2015. Keys for graphs. Proc. VLDB Endow. 8, 12 (August 2015),
1590–1601. DOI:https://doi.org/10.14778/2824032.2824056
[14] Wenfei Fan, Chunming Hu, Xueli Liu, and Ping Lu. 2018. Discovering graph functional dependencies. In Proceedings
of the 2018 International Conference on Management of Data (SIGMOD’18). ACM, New York, NY, 427–439. DOI:https://
doi.org/10.1145/3183713.3196916
[15] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plan-
tikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An evolving query language for property
graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD’18). Association for
Computing Machinery, New York, NY, 1433–1445. DOI:https://doi.org/10.1145/3183713.3190657
[16] Luis Galárraga, Simon Razniewski, Antoine Amarilli, and Fabian M. Suchanek. 2017. Predicting completeness in
knowledge bases. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM’17),
Maarten de Rijke, Milad Shokouhi, Andrew Tomkins, and Min Zhang (Eds.). ACM, 375–383. DOI:https://doi.org/10.
1145/3018661.3018739
[17] Shayan Ghazizadeh and Sudarshan Chawathe. 2002. SEuS: Structure extraction using summaries. In Discovery Science,
Steffen Lange, Ken Satoh, and Carl H. Smith (Eds.). Springer, Berlin, 71–85.


[18] Andrey Gubichev, Srikanta Bedathur, Stephan Seufert, and Gerhard Weikum. 2010. Fast and accurate estimation of
shortest paths in large graphs. In Proceedings of the 19th ACM International Conference on Information and Knowledge
Management (CIKM’10). ACM, New York, NY, 499–508. DOI:https://doi.org/10.1145/1871437.1871503
[19] Andrey Gubichev and Manuel Then. 2014. Graph pattern matching: Do we have to reinvent the wheel? In Proceed-
ings of the Workshop on GRAph Data Management Experiences and Systems (GRADES’14). Association for Computing
Machinery, New York, NY, 1–7. DOI:https://doi.org/10.1145/2621934.2621944
[20] Sebastian Kruse, Anja Jentzsch, Thorsten Papenbrock, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann.
2016. RDFind: Scalable conditional inclusion dependency discovery in RDF datasets. In Proceedings of the 2016 In-
ternational Conference on Management of Data (SIGMOD’16). ACM, New York, NY, 953–967. DOI:https://doi.org/10.
1145/2882903.2915206
[21] Georg Lausen, Michael Meier, and Michael Schmidt. 2008. SPARQLing constraints for RDF. In Proceedings of the 11th
International Conference on Extending Database Technology: Advances in Database Technology (EDBT’08). Association
for Computing Machinery, New York, NY, 499–509. DOI:https://doi.org/10.1145/1353343.1353404
[22] Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD’06). ACM, New York, NY, 631–636. DOI:
https://doi.org/10.1145/1150402.1150479
[23] Jure Leskovec and Rok Sosič. 2016. SNAP: A general-purpose network analysis and graph-mining library. ACM Trans.
Intell. Syst. Technol. 8, 1, Article 1 (July 2016), 20 pages. DOI:https://doi.org/10.1145/2898361
[24] David Liben-Nowell and Jon Kleinberg. 2003. The link prediction problem for social networks. In Proceedings of the
12th International Conference on Information and Knowledge Management (CIKM’03). ACM, New York, NY, 556–559.
DOI:https://doi.org/10.1145/956863.956972
[25] Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Rec. 42, 02 (2014), 40–49. DOI:https://doi.org/10.1145/
2590989.2590995
[26] Mark Needham and Karin Wolok. 2019. This Week in Neo4j—Women’s World Cup Graph, Bloom Sandbox, Explor-
ing Shell Companies with Graph Algorithms. Retrieved February 8, 2020 from https://neo4j.com/blog/this-week-in-
neo4j-womens-world-cup-graph-bloom-sandbox-exploring-shell-companies-with-graph-algorithms/.
[27] Renzo Angles, Pablo Barceló, and Marcelo Arenas. 2018. G-CORE: A core for future graph query languages. In Pro-
ceedings of the 2018 International Conference on Management of Data (SIGMOD’18). ACM, New York, NY. DOI:https://doi.org/10.1145/3183713.3190654
[28] Ian Robinson, Jim Webber, and Emil Eifrem. 2013. Graph Databases. O’Reilly Media, Inc.
[29] Jitesh Shetty and Jafar Adibi. 2005. Discovering important nodes through graph entropy the case of enron email
database. In Proceedings of the 3rd International Workshop on Link Discovery (LinkKDD’05). ACM, New York, NY,
74–81. DOI:https://doi.org/10.1145/1134271.1134282
[30] Danai Symeonidou, Luis Galárraga, Nathalie Pernelle, Fatiha Saïs, and Fabian M. Suchanek. 2017. VICKEY: Mining
conditional keys on knowledge bases. In Proceedings of the 16th International Semantic Web Conference (ISWC’17),
Claudia d’Amato, Miriam Fernández, Valentina A. M. Tamma, Freddy Lécué, Philippe Cudré-Mauroux, Juan F.
Sequeda, Christoph Lange, and Jeff Heflin (Eds.), Lecture Notes in Computer Science, Vol. 10587. Springer, 661–677.
DOI:https://doi.org/10.1007/978-3-319-68288-4_39
[31] Talend. [n.d.]. Talend Open Studio for Data Quality: Documentation and Installation Guides. Retrieved from
https://www.talend.com/download/data-quality-open-studio/.
[32] Chao Tian. 2017. Towards effective analysis of big graphs: From scalability to quality. Ph.D. Dissertation. The University of Edinburgh.
[33] Kees Vegter. 2019. Introducing the Neo4j Database Analyzer. Retrieved January 29, 2020 from https://medium.com/
neo4j/introducing-the-neo4j-database-analyzer-a989b85e4026.
[34] Qiu Yue Wang, Jeffrey X. Yu, and Kam-Fai Wong. 2000. Approximate graph schema extraction for semi-structured
data. In Proceedings of the 7th International Conference on Extending Database Technology: Advances in Database Tech-
nology (EDBT’00). Springer-Verlag, London, UK, 302–316. http://dl.acm.org/citation.cfm?id=645339.650133
[35] Fanghua Yu. 2018. Data Profiling: A Holistic View of Data using Neo4j. Retrieved August 19, 2019 from https://neo4j.
com/blog/data-profiling-holistic-view-neo4j/.
[36] Yang Yu and Jeff Heflin. 2011. Extending functional dependency to detect abnormal data in RDF graphs. In Proceed-
ings of the International Semantic Web Conference (ISWC’11), Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor,
Abraham Bernstein, Lalana Kagal, Natasha Noy, and Eva Blomqvist (Eds.). Springer, Berlin, 794–809.
[37] yWorks. 2017. yFiles Neo4j Explorer: Advanced Node Styling. Retrieved January 29, 2020 from https://www.yworks.
com/blog/neo4j-node-design.
[38] Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, and Sören Auer. 2016. Quality
assessment for linked data: A survey. Semantic Web 7, 1 (2016), 63–93.

Received October 2019; revised May 2020; accepted June 2020
