03 Propertygraphdatabases 2

Property Graph Databases
2016-11-08 Dr. Asja Fischer, Prof. Jens Lehmann

Motivation
Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 2

Trends in Data
Trend 1: Big Data
The amount of information that we generate and transfer has increased

dramatically in the past few years.
1Source: IDCs Digital Universe Study, sponsored by EMC, December 2012

http://www.emc.com/collateral/analyst- reports/idc- the- digital- universe- in- 2020.pdf

Trends in Data
Trend 2: Connectedness
Data is evolving to be more interlinked and connected. Hypertext has links, blogs
have pingback, tagging groups all related data.
2 Figure taken from [?]

Why Relational Databases are not Enough
(Provocative) claim: Relational databases are not good at handling relationships!
3 Table taken from [?]

In this social network database, it is easy to find people Bob considers his friends:
SELECT p1 . Person
FROM Person p1 JOIN PersonFriend
ON PersonFriend . FriendID = p1 . ID
JOIN Person p2
ON PersonFriend . PersonID = p2 . ID
WHERE p2 . Person = Bob
Result:
In this social network database, it is easy to find people Bob considers his friends:
SELECT p1 . Person
JOIN Person p2
Result: Alice and Zach

B Sad but true: friendship is not always symmetric!

B What if we want to know who is friends with Bob?
SELECT p1 . Person
JOIN Person p2
B User side: still easy to implement

B Database side: Database has to consider all the rows in the PersonFriend
table more expensive!

Why Relational Databases are not Enough?
What if we query deeper into the social network?
SELECT p1 . Person AS PERSON , p2 . Person AS F RI E N D _ O F _ F R I E N D

FROM PersonFriend pf1 JOIN Person p1
ON pf1 . PersonID = p1 . ID
JOIN PersonFriend pf2
ON pf2 . PersonID = pf1 . FriendID
JOIN Person p2
ON pf2 . FriendID = p2 . ID
WHERE p1 . Person = Alice AND pf2 . FriendID <> p1 . ID
B Uses recursive joins increase in syntactic, computational, and space

complexity of the query

Experimental results: Given any two persons chosen at random, is there a path
that connects them that is at most 5 relationships long?
Depth RDBMS execution time (s) Neo4j execution time (s) Records returned
2 0.016 0.01 2500
3 30.267 0.168 110,000
4 1543.505 1.359 600,000
5 Unfinished 2.132 800,000
B Comparing relational store and a popular graph database Neo4j
B Social network consisting of 1,000,000 people
B Each person with approximately 50 friends
Conclusion: Graph databases outperform relational databases when we are

dealing with connected data!

Relational Databases: What exactly is the Problem?
B Relationships exist only between tables problematic for highly connected

domains
B Relationship traversal can become very expensive
B Flat, disconnected data structures - data processing and relationship
construction happens in the application layer
B Bound by a previously defined schema

Graph Databases using the Labeled Property
Graph Model

Graph Databases: Advantages
B Explicit graph structure: semantic dependencies between entities are made

explicit
B New nodes and new relationships can be easily added into the database
without having to migrate data or restructure the existing network
B Relationships correspond to paths; querying the database = traversing the
graph this makes certain types of queries simpler
B Schema-free
B Index-free adjacency

Index-Free Adjacency
B Each node knows (i.e. directly points to) its adjacent nodes - index-free
adjacency
B No explicit index each node acts as micro-index of nearby nodes much
cheaper than global indices
B Because of this, query times are less/not dependent of the size of the graph
they depend only on the part of the graph that has been searched
B Precomputes and stores bidirectional joins as relationships (i.e. no need to
compute them)
B Enables bidirectional traversal of the database: we can traverse both
incoming and outgoing relationships

Graph Databases: When to Use
When to use:
B Complex, highly-connected data
B Dynamic systems with topology that is difficult to predict
B Dynamic requirements that change over time
B Cases where relationships in data are themselves meaningful
When not to use:

B Large offline batch analysis tasks of static data
B Simple, unconnected data structures
B For certain types of queries:
Graph fishing (no small set of start nodes)
Bulk Scans (generic queries not applying to any indexed subset)

Labeled Property Graph Model
Building blocks of a Labeled Property Graph:

B Nodes
Can be tagged with one or more labels
Can contain properties
B Relationships:
Connect nodes and structure the graph
Directed
Always have a single name, a start node and an end node no
dangling relationships
Can have properties
B Properties (= key-value pairs)
B Labels:
Group nodes together

Properties
Properties are named values:

B Key is the name of the property
Always a string
B Values can be:
Numeric values
String values
Boolean values
Lists of any other type of value

Labels
B Used to assign types to nodes

B Groups nodes into sets: all nodes with the
same label belong to the same set
B Database queries can be constrained to these
sets instead of the whole graph more effi-
cient queries that are easier to write
B Each node has any number of labels (includ-
ing none)

Nodes
B Together with relationships, fundamen-

tal unit of property graph model
the simplest possible graph is a single
node
B Are often used to represent entities
B Can have zero or more properties

Relationships
B Organize the nodes by connect-

ing them
B Always connects a start node to
the end node
B A key part of a graph database:
allow us to find related data
B Always has a direction

Paths
B One or more nodes with connecting

relationships
B Typically is a result of a query or a
traversal
B Length of a path = number of rela-
tionships on that path

Labeled Property Graph Model: Example
3 Illustration taken from [?]





Creating Graph Databases vs. Creating
Relational Databases

Creating a Relational Database
Step 1: Acquire and improve understanding of data: a whiteboard sketch step
3 Diagram taken from [?]

Step 2: Construct an entity-relations (E-R) diagram
Note the complexity of the model before any data has even been added
3 E-R diagram taken from [?]
Step 3: Map the E-R diagram into tables and relations and normalize the data
3 Diagram taken from [?]

Creating a Graph Database
B What you sketch on the whiteboard = what you store in the database
B No normalization, denormalization or conversion to tables-relations structure
necessary
B No conceptual-relational dissonance: the physical layout of the data is same
as it is conceptualized
B Domain modeling = further graph modeling:
Makes sure that every node has the appropriate properties
Ensures that every node is in correct semantic context

Creating a Graph Database
3 Image taken from [?]

Graph DBs: Good Design Principles
B Important to review and evaluate the domain model to make sure it is

suitable for answering most likely queries.
B Ensures that any later changes are driven by changes in application
requirements and not by the need to mitigate bad design decisions
B There are two techniques to do this:
Check that the graph reads well
B Pick a node
B Follow relationships to other nodes, reading each nodes role and
each relationships name as you go
B This should form sensible sentences
Design for queryability
B Understand end-users goals: understand the use cases
B Try to craft sample queries for the use cases

Design Principles: Reading the Graph
App 1 runs on VM 10, which is hosted by Server 1 in Rack 1.

NEO4J & Other Graph Databases

Existing Graph Databases
Source: OReilly Graph Databases [?]

Neo4j
B Probably the most popular graph database

today
B Based on a schema-free labeled property
graph model
https://neo4j.com/ B Scales to billions of nodes, relationships and
properties
Consists of:
B Nodes, Relationships, Properties, Labels, Paths, Traversal, and Schema
(index and constraints)

Neo4j: Learn more!
B Open-source
B Extensive support and learning material
B Check out https://neo4j.com/developer/ for further resources (including a
sandbox for testing!)
B International events and meetup groups
(Neo4j)-[:LOVES]-(Developers)

Cypher

Cypher: Pattern Matching
A pattern matching query language for graphs



Cypher: ASCII Art
Uses ASCII art representation

Cypher: ASCII Art
Directed relationship

Cypher: ASCII Art
Undirected relationship

Cypher: ASCII Art
Specific relationships

Cypher: ASCII Art
Joined paths

Cypher: ASCII Art
Multiple paths

Cypher: Start & Return clauses
B START specifies starting point(s) in the graph

Note : these can be both nodes or relationships
B START is done via index lookup or directly using relationship IDs
B RETURN specifies nodes, relationships, and properties that should be
returned
START - RETURN example

START n = node(0) RETURN n
reads lookup node with id 0; return that node
START n=node:Person(name=Andreas) RETURN n
reads lookup node in index Person with a name property whose value is
Andreas; return that node

Cypher: MATCH Clause
B Specification by example: draw data we are looking for

B Used to define a pattern of nodes and relationships that we want to find
MATCH - example
START a=node:user(name = Michael)
MATCH (a)-[:KNOWS] >(b)-[:KNOWS] >(c), (a)-[:KNOWS] >(c)
RETURN b, c
Reads: Find user with name Michael and starting from that node in the
graph, find nodes b and c that fit the pattern specified by MATCH example.
Return those two nodes.
3 Example from [?]

Cypher: Return Options
B Cypher also provides various options to enrich the returned results

B They include options to aggregate, order, filter, and limit the returned data
B Example: count allows us to return only the number of matched instances

RETURN Options - Example
RETURN Options - example

MATCH (theater:Venue {name:Theatre Royal}),
(bard:Author lastname:Shakespeare),
(theater)< [:VENUE]-()-[p:PERFORMANCE OF] >()
-[:PRODUCTION OF] >(play)< [:WROTE PLAY]-(bard)
RETURN play.title AS play, count(p) AS performance count
ORDER BY performance count DESC
This query counts the number of PERFORMANCE OF relationships using the

identifier p (which is bound to the PERFORMANCE OF relationships in the
MATCH clause). It saves the result as performance count. The results are then
ordered (using the clause ORDER BY) by the count in decreasing order. The
query then returns this result.

RETURN Options - Example
RETURN Options - example

MATCH (theater:Venue {name:Theatre Royal}),
(bard:Author lastname:Shakespeare),
(theater)< [:VENUE]-()-[p:PERFORMANCE OF] >()
-[:PRODUCTION OF] >(play)< [:WROTE PLAY]-(bard)
RETURN play.title AS play, count(p) AS performance count
ORDER BY performance count DESC
LIMIT 1
Analogously, adding LIMIT clause tells this query to limit the results to the first
instance

Cypher: WHERE Clause
B Constrains graph matches by eliminating matched subgraphs by requiring

one or more of the following:
That certain paths must be present (or absent) in the matched
subgraphs
That nodes must have certain labels or relationships with certain names
That specific properties on matched nodes and relationships must be
present (or absent)
That certain properties on matched nodes and relationships must have
specific values
That other predicates must be satisfied (e.g., that performances must
have occurred on or before a certain date)

WHERE - Example
For example, we can query specifically for Shakespeare plays that were written
after 1608
WHERE - example
MATCH (bard:Author {lastname:Shakespeare}),
(play)< [w:WROTE PLAY]-(bard)
WHERE w.year > 1608
RETURN DISTINCT play.title AS play

Other Clauses
Cypher supports a variety of clauses:

B CREATE and CREATE UNIQUE
B DELETE
B SET
B FOREACH
B UNION
B WITH

Cypher - Design for Queryability Example
Remember the design for queryability design goal?

Goal: Design a query to find the cause behind an unresponsive application or
service in our example graph.
Example Query
MATCH (user:User)-[*1..5]-(asset:Asset)
WHERE user.name = User3 AND asset.status = down
RETURN DISTINCT asset

The sample query we would need to define is:
Example Query

Example Query
B Describes a variable length path between one and five relationships long
B There is no colon or relationship name between the square brackets the
relationships are unnamed
B There are no arrow-tips relationships are undirected

Example Query

Example Query
B We start with the user who reported a problem

B We add asset nodes that have a status property with a value of down
B Nodes which do not have a status property will not be added to the results

Example Query

Example Query
B Ensures that unique assets are returned in the results, no matter how many
times they are matched

Sample Queries
Example 1
MATCH (p:Product)
RETURN p.productName, p.unitPrice
ORDER BY p.unitPrice DESC
LIMIT 10;

Sample Queries
Example 1
MATCH (p:Product)
RETURN p.productName, p.unitPrice
ORDER BY p.unitPrice DESC
LIMIT 10;
B Returns only a subset of attributes, in this case: ProductName and UnitPrice

B Orders by price
B Returns 10 most expensive items

Sample Queries
Example 2
MATCH (p:Product {productName:Chocolade})
RETURN p.productName, p.unitPrice;

Sample Queries
Example 2
MATCH (p:Product {productName:Chocolade})
RETURN p.productName, p.unitPrice;
B A shortcut for MATCH (p:Product) WHERE p.productName =

Chocolade

Sample Queries
Example 3
MATCH (p:Product {productName:Chocolade})< [:PRODUCT]-
(:Order)< [:PURCHASED]-(c:Customer)
RETURN distinct c.companyName;

Sample Queries
Example 3
MATCH (p:Product {productName:Chocolade})< [:PRODUCT]-
(:Order)< [:PURCHASED]-(c:Customer)
RETURN distinct c.companyName;
B Names of everyone who bought Chocolade

Sample Queries
Example 4
MATCH (bob:User {username:Bob})-[:SENT] >(email)-[:CC] >(alias),
(alias)-[:ALIAS OF] >(bob)
RETURN email.id
3 Example and figure taken from [?]

Sample Queries
Example 4
MATCH (bob:User {username:Bob})-[:SENT] >(email)-[:CC] >(alias),
(alias)-[:ALIAS OF] >(bob)
RETURN email.id
B Returns all emails that Bob has sent where hes CCd one of his own aliases
B Returns 1 result: id:1, content:email content

Sample Queries
Example 5
MATCH (email:Email {id:11})< [f:FORWARD OF*]-(:Forward)
RETURN count(f)

Sample Queries
Example 5
MATCH (email:Email {id:11})< [f:FORWARD OF*]-(:Forward)
RETURN count(f)
B This returns the number of times a particular email is forwarded

B The answer is 2 (count the number of FORWARD OF relationships bound
to f)

Serialisation

Cypher as Serialisation Format
Cypher - Example
CREATE (:Person {name:Ian})-[:EMPLOYMENT] >
(employment:Job {start date:2011-01-05})
-[:EMPLOYER] >(:Company {name:Neo}),
(employment)-[:ROLE] >(:Role {name:engineer})
3 Image taken from [?]
Geoff
B Based on Cypher
B A Geoff document consists of a one or more subgraphs, each of which
contains one or more paths
B Properties in JSON syntax
( alice {" name ":" Alice "})

( bob {" name ":" Bob "})
( carol {" name ":" Carol "})
( alice ) < -[: KNOWS ] - >( bob ) < -[: KNOWS ] - >( carol ) < -[: KNOWS ] - >( alice )

GraphML
B XML-based file format for graphs

B Common format for exchanging graph structure data
B Generic (not limited to graph databases)
<? xml version = " 1.0 " encoding = " UTF -8 " ? >
< graphml xmlns = " http: // graphml . graphdrawing . org / xmlns "
[...]
< graph id = " G " edgedefault = " undirected " >
< node id = " n0 " >
< data key = " d0 " > green </ data >
</ node >
< node id = " n1 " / >
< edge id = " e0 " source = " n0 " target = " n1 " >
< data key = " d1 " > 1.0 </ data >
</ edge >
</ graph >
</ graphml >

Serialisation
B Essentially no standardisation in property graph community

B Serialisation less important than in semantic technologies:
Not much emphasis on data exchange
Data publishing not commonly considered (in contrast to Linked Data)
Embedding in other formats usually not considered (in contrast to
RDFa)

Hypergraphs

Graph Shortcomings
Graphs pairwise relationships.

Pairwise relationships are often insufficient - the world is more complex than that.
Example Problem
Given: A collection of articles together with their authors.
Goal: Group articles into different topics.

Graph Shortcomings
Example Problem
Given: a collection of articles and their authors.
Want: Group articles into different topics.
Possible solution: undirected graph
B Nodes = articles
B If two articles have at least one author in common
add an edge between them
B Add weights to edges equal to the number of
authors the two corresponding articles have in
common
Group articles by topic using clustering approaches

Example Problem
What is wrong with this solution?

Example Problem
What is wrong with this solution?
B Loss of information: No way to tell whether the same person joined in

writing three or more articles.
B This is a problem because the articles from the same person are more likely
to belong to the same topic.

Hypergraphs
Natural solution: a hypergraph!
Definition: Hypergraph
Let V = (v1 , v2 , . . . , vn ) be a set of vertices. A hypergraph on V is a family of
H = (E1 , E2 , . . . , Em ) of subsets of V s.t.:
B For i = 1, 2, . . . , m, Ei 6=
Sm
B Ei = V
i=1
Intuitively: A hypergraph is a graph in which an edge (called hyperedge) can

connect more than two vertices.

Hypergraphs
Models the example problem:

B Each article matches to a vertex
B Each hyperedge contains set of articles written
by the its corresponding author
Note that hyperedges are represented by
dotted lines around the set of vertices they
include

Hypergraphs: Summary
B Each hyperedge can connect to an arbitrary number of nodes

B Hypergraphs can be useful for data with a large number of many-to-many
relationships
B Hypergraphs are particularly good at capturing meta-intent
Disadvantage:
B Can be more confusing than the more explicit data modeling techniques

Converting and Comparing the Graph Models

RDF to PGM Transformation
Distinguish between two types of RDF triples:

B attribute triples
= triples whose object is a literal
B relationship triples
= triples whose object is an IRI or a blank node
The transformation:
B Every relationship triple an edge
B Every attribute triple a property of the vertex for the subject of that triple

RDF to PGM Transformation
Simple transformation example:
ex : alice foaf : name " Alice " .

ex : bob foaf : name " Bob " .
<< ex : alice foaf : knows ex : bob > > ex : certainty 0.5 .
<< ex : bob foaf : age 23 > > ex : certainty 0.9 .
Note: << and >> enclose metadata tripes in Turtle syntax. The last line cannot
be converted using this transformation approach. See
https://arxiv.org/pdf/1409.3288v2.pdf for details.

PGM to RDF Transformation
B Edges (incl. their labels) an ordinary RDF triple

B Vertices (incl. their labels) an ordinary RDF triple
B Edge properties metadata triple whose subject is the triple for the
corresponding edge
B Patterns for generating IRIs that denote edge labels and properties can be
chosen freely

Hypergraphs to Property Graphs
Labeled hypergraphs can be powerful when modelling many-to-many

relationships:

Hypergraphs to Property Graphs
All relations need to be made explicit when converting to property graphs:
Why can it still make sense to use property graphs?

B More familiar and explicit modelling technique
B Model can be fine-tuned with properties such as primary driver (for
insurance purposes)
Knowledge Graph Formats
Characteristics Relational RDF PGM

Standardised yes yes no
Traversal performance - o +
Large analytical queries + o -
Query language SQL SPARQL Cypher & more
Data Publication & Differentiability - + -
Global Identifiers & Cross dataset fusion - + -

Summary
Covered in the lecture:
B Motivation for Model Graph Databases (Big Data, Connectivity)

B Comparing Relational and Property Graph Databases
B Labeled Property Graph Model
B Cypher Query Language
B Unifying RDF, Property Graphs and Hypergraphs

References

03 Propertygraphdatabases 2

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

03 Propertygraphdatabases 2

Uploaded by

Copyright:

Available Formats

Property Graph Databases

2016-11-08 Dr. Asja Fischer, Prof. Jens Lehmann

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 2

Trend 1: Big Data

The amount of information that we generate and transfer has increased

1Source: IDCs Digital Universe Study, sponsored by EMC, December 2012

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 3

2 Figure taken from [?]

(Provocative) claim: Relational databases are not good at handling relationships!

3 Table taken from [?]

(Provocative) claim: Relational databases are not good at handling relationships!

(Provocative) claim: Relational databases are not good at handling relationships!

Result: Alice and Zach

B Sad but true: friendship is not always symmetric!

B User side: still easy to implement

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 6

What if we query deeper into the social network?

SELECT p1 . Person AS PERSON , p2 . Person AS F RI E N D _ O F _ F R I E N D

B Uses recursive joins increase in syntactic, computational, and space

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 7

Conclusion: Graph databases outperform relational databases when we are

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 8

B Relationships exist only between tables problematic for highly connected

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 9

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 10

B Explicit graph structure: semantic dependencies between entities are made

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 11

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 12

When not to use:

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 13

Building blocks of a Labeled Property Graph:

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 14

Properties are named values:

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 15

B Used to assign types to nodes

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 16

B Together with relationships, fundamen-

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 17

B Organize the nodes by connect-

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 18

B One or more nodes with connecting

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 19

3 Illustration taken from [?]

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 21

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 22

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 23

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 24

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 25

Step 1: Acquire and improve understanding of data: a whiteboard sketch step

3 Diagram taken from [?]

Step 2: Construct an entity-relations (E-R) diagram

3 Diagram taken from [?]

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 29

3 Image taken from [?]

B Important to review and evaluate the domain model to make sure it is

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 31

App 1 runs on VM 10, which is hosted by Server 1 in Rack 1.

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 32

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 33

Source: OReilly Graph Databases [?]

B Probably the most popular graph database

Dr. Asja Fischer, Prof. Jens Lehmann Property Graph Databases 35