Neo4j, July 2019
Data Science
Graph analytics
Daniela Grigori
(Pr, Université Paris-Dauphine)
Acknowledgement
Slide sources: Max de Marzi, Gagan Agrawal, the Neo4j site; based on the book Graph Data Management
https://www.lamsade.dauphine.fr/~grigori/Certificat/
High Level view of Graph Space
Ø Graph Databases - Technologies used primarily for transactional online graph persistence (OLTP).
Use cases
• Graph database
– Low latency (1 ms),
– high volumes are typically handled with vertical scaling on a single server,
– well fitted for navigation inside a graph, or solving a shortest path problem.
• Graph processing framework
– Scaling by distributing the processing across several commodity servers,
– higher latency (seconds),
– will address clustering problems, betweenness computations or a global search over all paths.
Graph Databases
Ø Online database management system with Create, Read, Update, Delete (CRUD) methods that expose a graph data model.
Ø Built for use with transactional (OLTP) systems.
Ø Used for richly connected data.
Ø Querying is performed through traversals.
Ø Can perform millions of traversal steps per second.
Ø A traversal step resembles a join in an RDBMS.
Graph Database Properties
Graph DB – The Underlying Storage
Ø Native Graph Storage – Optimized and designed for storing and managing graphs.
Graph DB – The processing Engine
Ø Native graph DB :
Ø Index free adjacency – Connected nodes physically point to each other in the database
Power of Graph Databases
Ø Performance
Ø Flexibility
Ø Agility
Comparison
Ø Relational Databases
Ø NoSQL Databases
Ø Graph Databases
Agenda
• Trends in Data
• NOSQL
• What is a Graph?
• What is a Graph Database?
• What is Neo4j?
• Query language for graphs?
Trends in Data
Data is getting bigger:
“Every 2 days we create as much information as we did up to 2003”
– Eric Schmidt, Google
Trend 2: Connectedness
[Figure: information connectivity rising from 1990 to 2020 across web 1.0, web 2.0 and “web 3.0”: Text, Documents, Hypertext, Feeds, Blogs, UGC, Wikis, Tagging, RDFa, Social networks, GGG]
Data is more connected:
• Text (content)
• HyperText (added pointers)
• RSS (joined those pointers)
• Blogs (added pingbacks)
• Tagging (grouped related data)
• RDF (described connected data)
• GGG (content + pointers + relationships + descriptions)
Data is more Semi-Structured:
• If you tried to collect all the data of every
movie ever made, how would you model it?
• Actors, Characters, Locations, Dates, Costs, Ratings, Showings, Ticket Sales, etc.
NOSQL
Not Only SQL
Less than 10% of the NOSQL Vendors
Four NOSQL Categories
Key Value Stores
• Most Based on Dynamo: Amazon Highly
Available Key-Value Store
• Data Model:
– Global key-value mapping
– Big scalable HashMap
– Highly fault tolerant (typically)
• Examples:
– Redis, Riak, Voldemort
Key Value Stores: Pros and Cons
• Pros:
– Simple data model
– Scalable
• Cons
– Create your own “foreign keys”
– Poor for complex data
Column Family
• Most Based on BigTable: Google’s Distributed
Storage System for Structured Data
• Data Model:
– A big table, with column families
– Map Reduce for querying/processing
• Implementation: a Bigtable is a sparse, distributed, persistent multidimensional sorted map.
• Examples:
– HBase, HyperTable, Cassandra
Big Tables vs key value and RDBMS
Column Family: Pros and Cons
• Pros:
– Supports Semi-Structured Data
– Naturally Indexed (columns)
– Scalable
• Cons
– Poor for interconnected data
Document Databases
• Data Model:
– A collection of documents
– A document is a key value collection
– Index-centric, lots of map-reduce
• Example :
{ "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
  "Nom": "DUMOND",
  "Prénom": "Jean",
  "Âge": 43 },
{ "_id": ObjectId("4efa8d2b7d284dad101e4bc8"),
  "Nom": "PELLERIN",
  "Prénom": "Franck",
  "Adresse": "1 chemin des Loges",
  "Ville": "VERSAILLES" }
– Query example:
db.clients.find( { Ville: "VERSAILLES" } )
• Examples:
– CouchDB, MongoDB
Document Databases: Pros and Cons
• Pros:
– Simple, powerful data model
– Scalable
• Cons
– Poor for interconnected data
– Query model limited to keys and indexes
– Map reduce for larger queries
Graph Databases
• Data Model:
– Nodes and Relationships
• Examples:
– Neo4j, OrientDB, InfiniteGraph, AllegroGraph
Graph Databases: Pros and Cons
• Pros:
– Powerful data model, as general as RDBMS
– Connected data locally indexed
– Easy to query
• Cons
– Sharding ( lots of people working on this)
• Scales UP reasonably well
– Requires rewiring your brain
Compared to Relational Databases
Relational databases: optimized for aggregation. Graph databases: optimized for connections.
Join problem
• all JOINs are executed every time you query (traverse) the relationship
• executing a JOIN means searching for a key in another table
• with indices, executing a JOIN means looking up a key
– B-Tree Index: O(log(n))
• more entries => more lookups => slower JOINs
Relational model
People
| User_Id | name          | Gender | Age |
| 00      | Neo           | M      | 29  |
| 01      | Trinity       | F      | 28  |
| 02      | Reagan        | M      | 32  |
| 03      | Agent Smith   | M      | 30  |
| 04      | The Architect | M      | 35  |
| 05      | Morpheus      | M      | 69  |
Friendship
| ID | Friend since | relation | User1_Id | User2_Id |
| 1  | 1990         | knows    | 00       | 05       |
| 2  | 1991         | Loves    | 01       | 00       |
| 3  | 1992         | knows    | 02       | 01       |
| 4  | 1993         | Coded by | 03       | 02       |
| 5  | 1995         | knows    | 05       | 06       |
| 6  | 1996         | knows    | 05       | 04       |
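The join cost described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class RelationalJoinSketch and the method namesRelatedTo are hypothetical names, and a TreeMap stands in for a B-tree index on User_Id. The rows are the ones from the People and Friendship tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: names are illustrative, not part of any database API.
class RelationalJoinSketch {

    // One row of the Friendship table: (User1_Id, relation, User2_Id).
    static final class Friendship {
        final String user1Id;
        final String relation;
        final String user2Id;

        Friendship(String user1Id, String relation, String user2Id) {
            this.user1Id = user1Id;
            this.relation = relation;
            this.user2Id = user2Id;
        }
    }

    // People table behind a sorted index on User_Id (O(log n) per lookup, like a B-tree).
    static final Map<String, String> PEOPLE = new TreeMap<>();

    static final List<Friendship> FRIENDSHIPS = Arrays.asList(
            new Friendship("00", "knows", "05"),
            new Friendship("01", "Loves", "00"),
            new Friendship("02", "knows", "01"),
            new Friendship("03", "Coded by", "02"),
            new Friendship("05", "knows", "06"),
            new Friendship("05", "knows", "04"));

    static {
        PEOPLE.put("00", "Neo");
        PEOPLE.put("01", "Trinity");
        PEOPLE.put("02", "Reagan");
        PEOPLE.put("03", "Agent Smith");
        PEOPLE.put("04", "The Architect");
        PEOPLE.put("05", "Morpheus");
    }

    // Inner join: Friendship rows with User1_Id = userId, joined to People on User2_Id.
    static List<String> namesRelatedTo(String userId) {
        List<String> names = new ArrayList<>();
        for (Friendship f : FRIENDSHIPS) {            // scan of the join table
            if (f.user1Id.equals(userId)) {
                String name = PEOPLE.get(f.user2Id);  // one index lookup per matching row
                if (name != null) {                   // user 06 has no People row, so it drops out
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesRelatedTo("05"));     // prints [The Architect]
    }
}
```

Every additional hop repeats the same scan and lookup, which is why deeper traversals get slower as the tables grow.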
Graph model
Relational Databases Lack Relationships
Ø Initially designed to codify paper forms and tabular structures.
Ø Deal poorly with relationships.
Ø The rise in connectedness translates into increased joins.
Ø Lower performance.
Ø Difficult to cater for changing business needs.
NoSQL Databases also lack Relationships
Ø NOSQL databases (e.g. key-value, document or column-oriented) store sets of disconnected values/documents/columns.
Ø This makes it difficult to use them for connected data and graphs.
Ø One solution is to embed an aggregate's identifier inside a field belonging to another aggregate.
Ø This effectively introduces foreign keys.
Ø It requires joining aggregates at the application level.
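A minimal sketch of what "joining aggregates at the application level" can look like, assuming an in-memory HashMap as a stand-in for the aggregate store. The class AggregateStoreJoinSketch, the UserDoc type, the keys and the sample users are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a HashMap plays the role of a key-value or document store.
class AggregateStoreJoinSketch {

    // An aggregate that embeds the identifiers of other aggregates.
    static final class UserDoc {
        final String name;
        final List<String> friendIds;   // embedded identifiers acting as foreign keys

        UserDoc(String name, List<String> friendIds) {
            this.name = name;
            this.friendIds = friendIds;
        }
    }

    static final Map<String, UserDoc> STORE = new HashMap<>();

    static {
        STORE.put("user:1", new UserDoc("Alice", Arrays.asList("user:2", "user:3")));
        STORE.put("user:2", new UserDoc("Bob", Arrays.asList("user:3")));
        STORE.put("user:3", new UserDoc("Carol", new ArrayList<String>()));
    }

    // The "join" is done by the application: one extra store lookup per embedded identifier.
    static List<String> friendNames(String userId) {
        List<String> names = new ArrayList<>();
        UserDoc user = STORE.get(userId);
        if (user == null) {
            return names;
        }
        for (String friendId : user.friendIds) {
            UserDoc friend = STORE.get(friendId);
            if (friend != null) {
                names.add(friend.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(friendNames("user:1"));   // prints [Alice's friends] as [Bob, Carol]
    }
}
```

The store itself knows nothing about the link between users; keeping the identifiers consistent is left to application code.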
Compared to Key Value Stores
Key-value stores: optimized for simple look-ups. Graph databases: optimized for traversing connected data.
Compared to document Stores
Document stores: optimized for “trees” of data. Graph databases: optimized for seeing the forest and the trees, and the branches, and the trunks.
Reifying relationships in an aggregate store
A small social network encoded in an aggregate store
Living in a NOSQL World
[Figure: database families positioned by data size and data complexity: Key-Value Stores, BigTable Clones, Document Databases, Graph Databases; Relational Databases (RDBMS) marked as covering 90% of use cases]
What is a Graph?
• An abstract representa$on of a set of objects
where some pairs are connected by links.
• Pseudo Graph
• Multi Graph
• Hyper Graph
More Kinds of Graphs
• Weighted Graph
• Labeled Graph
• Property Graph
What is a Graph Database?
• A database with an explicit graph structure
• Each node knows its adjacent nodes
• As the number of nodes increases, the cost of
a local step (or hop) remains the same
• Plus an Index for lookups
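The points above can be sketched in a few lines, assuming a purely in-memory model. The class AdjacencySketch and its methods are hypothetical and do not reflect Neo4j's storage format: each node holds direct references to its neighbours, and an index is used only to find the starting node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "each node knows its adjacent nodes, plus an index for lookups".
class AdjacencySketch {

    static final class Node {
        final String name;
        final List<Node> neighbours = new ArrayList<>();   // direct references, no key lookup per hop

        Node(String name) {
            this.name = name;
        }
    }

    // The index is consulted once, to locate the starting node.
    private final Map<String, Node> index = new HashMap<>();

    Node node(String name) {
        return index.computeIfAbsent(name, Node::new);
    }

    void connect(String from, String to) {
        node(from).neighbours.add(node(to));
    }

    // Two hops from the start node; the cost depends on the neighbours visited,
    // not on how many nodes the whole graph contains.
    Set<String> twoHopsFrom(String startName) {
        Set<String> result = new LinkedHashSet<>();
        Node start = index.get(startName);
        if (start == null) {
            return result;
        }
        for (Node friend : start.neighbours) {
            for (Node friendOfFriend : friend.neighbours) {
                if (friendOfFriend != start) {
                    result.add(friendOfFriend.name);
                }
            }
        }
        return result;
    }
}
```

For example, after connect("Neo", "Morpheus") and connect("Morpheus", "The Architect"), calling twoHopsFrom("Neo") follows two references and never touches unrelated nodes.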
Example: Internet
Example: Social Network
Protein Interaction
What is Neo4j?
The Neo4j Graph Database
• Neo4j is the leading graph database in
the world today!
• Most widely deployed: 500,000+
downloads!
• Largest ecosystem: active forums,
code contributions, etc!
• Most mature product: in development
since 2000, in 24/7 production since
2003
ACID
Java http://www.neotechnology.com/customers/
What is Neo4j?
• A Graph Database + Lucene Index
• Property Graph
• Full ACID (atomicity, consistency, isolation, durability)
• High Availability (with Enterprise Edition)
• 32 Billion Nodes, 32 Billion Relationships, 64 Billion Properties
• Embedded Server
• REST API
Good For
• Highly connected data (social networks)
• Recommendations (e-commerce)
• Path Finding (how do I know you?)
Property Graph Model
[Figure: three nodes connected by TRAVELS_WITH and LOVES relationships]
• name: the Doctor, age: 907, species: Time Lord
• first name: Rose, last name: Tyler
• vehicle: tardis, model: Type 40
Property Graph
// then traverse to find results
start n=(people-index, name, "Andreas")
match (n)--()--(foaf) return foaf
Cypher
Pattern Matching Query Language (like SQL for graphs)
// get node 0
g.v(0)
g.v(0).in
g.v(0).out("KNOWS")
If you’ve ever
• Joined more than 7 tables together
• Modeled a graph in a table
• Written a recursive query
• Tried to write some crazy stored procedure with multiple recursive self and inner joins
[Figure: graph linking characters to the episodes “A Good Man Goes to War” and “Victory of the Daleks” through “appeared in” and “enemy” relationships]
Graphs are very whiteboard-friendly
Schema-less Databases
• Graph databases don’t excuse you from
design
• Good design still requires effort
• But less difficult than RDBMS because you
don’t need to normalise
– And then de-normalise!
Design errors
[Figure: “What you know” versus “What you end up with”]
http://talent-dynamics.com/tag/sqaure-peg-round-hole/
More on Neo4j
• Neo4j is stable
– In 24/7 operation since 2003
• Neo4j is under active development
• High performance graph operations
– Traverses 1,000,000+ relationships / second on commodity hardware
Neo4j Logical Architecture
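Transaction tx = db.beginTx();
try {
    // susan and theDoctor are nodes obtained earlier (that part of the slide is missing)
    susan.createRelationshipTo(theDoctor,
        DynamicRelationshipType.withName("COMPANION_OF"));
    tx.success();
} finally {
    tx.finish();
}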
Indexing a Graph?
• Graphs are their own indexes!
• But sometimes we want short-cuts to well-known nodes
• Can do this in our own code
– Just keep a reference to any interesting nodes
• Indexes offer more flexibility in what constitutes an “interesting node”
No free lunch
Indexes trade read performance for write cost
Just like any database, even RDBMS
GraphDatabaseService db = …
Index<Node> planets = db.index().forNodes("planets");
Callouts: forNodes() creates or retrieves the index; "planets" is the index name; Index<Node> is the type.
Creating a Relationship Index
GraphDatabaseService db = …
Index<Relationship> enemies = db.index().forRelationships("enemies");
Callouts: forRelationships() creates or retrieves the index; "enemies" is the index name; Index<Relationship> is the type.
Exact Matches
GraphDatabaseService db = …
Value to match
First match only
Query Matches
GraphDatabaseService db = …
Key
Query
Must use transactions to mutate indexes
• Mutating access is still protected by transactions
– Which cover both index and graph
GraphDatabaseService db = …
Transaction tx = db.beginTx();
try {
    Node nixon = db.createNode();
    nixon.setProperty("character", "Richard Nixon");
    db.index().forNodes("characters").add(nixon,
        "character",
        nixon.getProperty("character"));
    tx.success();
} finally {
    tx.finish();
}
Explicit indexes - Deprecated
• Create an index
IndexManager index = graphDb.index();
Index<Node> actors = index.forNodes( "actors" );
Index<Node> movies = index.forNodes( "movies" );
RelationshipIndex roles = index.forRelationships( "roles" );
• Add
• Each index supports associating any number of key-value pairs with any number of entities (nodes or relationships), where each association between entity and key-value pair is performed individually
Node bellucci = graphDb.createNode();
bellucci.setProperty( "name", "Monica Bellucci" );
actors.add( bellucci, "name", bellucci.getProperty( "name" ) );
actors.add( bellucci, "name", "La Bellucci" );
• Get
IndexHits<Node> hits = actors.get( "name", "Monica Bellucci" );
Mixing Core API and Indexes
• Indexes are typically used only to provide starting points
• Then the heavy work is done by traversing the graph
• Can happily mix index operations with graph operations to great effect
Comparing APIs
Core
• Basic (nodes, relationships)
• Fast
• Imperative
• Flexible
– Can easily intermix mutating operations
Traversers
• Expressive
• Fast
• Declarative (mostly)
• Opinionated
(At least) Two Traverser APIs
• Neo4j has declarative traversal frameworks
– What, not how
• There are two of these
– And development is active
– No “one framework to rule them all” yet
Traverser API
• In the org.neo4j.graphdb.traversal package
• http://wiki.neo4j.org/content/Traversal_Framework
() --> ()
A -[:LOVES]-> B
A --> B --> C
Examples
• Get all nodes with a label
MATCH (movie:Movie)
RETURN movie
Result
Movie
Node[3]{name:"TheAmericanPresident",title:"The American President"}
Node[4]{name:"WallStreet",title:"Wall Street"}
2 rows
Example - related nodes
MATCH (director { name:'Oliver Stone' })--(movie)
RETURN movie.title
1 row
Example – match with labels
MATCH (charlie:Person { name:'Charlie Sheen' })--(movie:Movie)
RETURN movie
5 rows
Where clause
MATCH (n)
WHERE n:Swedish AND n.age < 30 AND n.name =~ 'Tob.*' AND
(n)-[:KNOWS]-({ name:'Peter' })
RETURN n
Result
1 row
Count
- count(*) counts the number of matching rows
MATCH (n { name: 'A' })-->(x)
RETURN n, count(*)
This returns the start node and the count of related nodes.
Result
n Count(*)
Node[1]{name:"A",property:13} 3
1 row
Count
- to count the groups of relationship types, return the types and count them with count(*).
MATCH (n { name: 'A' })-[r]->()
RETURN type(r), count(*)
Result
type(r) Count(*)
"KNOWS" 3
1 row
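The grouping behind RETURN type(r), count(*) can be mimicked outside the database. The sketch below is an analogy only, not how Cypher is executed: the non-aggregated expression (the relationship type) acts as the grouping key, and count(*) counts the rows in each group. The class CountByTypeSketch and its input list are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: one list element per matched relationship, holding its type.
class CountByTypeSketch {

    // Group the matched rows by relationship type and count the rows of each group.
    static Map<String, Long> countByType(List<String> relationshipTypes) {
        return relationshipTypes.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(),      // grouping key: type(r)
                        TreeMap::new,             // sorted map, for stable output order
                        Collectors.counting()));  // aggregate: count(*)
    }

    public static void main(String[] args) {
        // Three outgoing KNOWS relationships, as in the slide's result.
        System.out.println(countByType(Arrays.asList("KNOWS", "KNOWS", "KNOWS")));  // prints {KNOWS=3}
    }
}
```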
Count
- COUNT(<identifier>), which counts the number of non-NULL values in <identifier>.
MATCH (n { name: 'A' })-->(x)
RETURN count(x)
This returns the number of nodes connected to the start node.
Result
Count(x)
3
1 row
Statistics
MATCH (n:Person)
RETURN sum(n.property)
This returns the sum of all the values of the property named property.
Result
sum(n.property)
90
1 row
Writing clauses
- Create a node with a label
CREATE (n:Person { name : 'Andres', title : 'Developer' })
- To return the node, add RETURN n
Writing clauses
• Create a relationship and set properties
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
Result
r
:RELTYPE[0]{name:"Node A<->Node B"}
1 row
Relationships created: 1
Properties set: 1
Writing clauses
• DELETE - deleting graph elements: nodes and relationships
• REMOVE - removing properties and labels from graph elements.
• MERGE - a combination of MATCH and CREATE that additionally allows you to specify what happens if the data was matched or created (by using ON CREATE and ON MATCH).
Merge single node with a label
MERGE (robert:Critic)
RETURN robert, labels(robert)
A new node is created because there are no nodes labeled Critic in the database.
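The match-or-create behaviour of MERGE on a single label can be sketched as follows. This is an analogy under stated assumptions, not Neo4j code: the class MergeSketch is hypothetical, nodes are plain property maps, and the comments mark where ON CREATE and ON MATCH actions would run.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of MERGE (n:Label): match the existing nodes, or create one if none exists.
class MergeSketch {

    // label -> nodes carrying that label; a node is just a property map here
    private final Map<String, List<Map<String, Object>>> nodesByLabel = new HashMap<>();

    int created = 0;

    List<Map<String, Object>> merge(String label) {
        List<Map<String, Object>> nodes =
                nodesByLabel.computeIfAbsent(label, k -> new ArrayList<Map<String, Object>>());
        if (nodes.isEmpty()) {
            nodes.add(new HashMap<String, Object>());   // nothing matched: create (ON CREATE would run here)
            created++;
        }
        // otherwise the existing nodes are the match (ON MATCH would run here)
        return nodes;
    }

    public static void main(String[] args) {
        MergeSketch db = new MergeSketch();
        db.merge("Critic");                 // no Critic node yet, so one is created
        db.merge("Critic");                 // now matches the existing node
        System.out.println(db.created);     // prints 1
    }
}
```

Running the same MERGE twice therefore leaves a single node, which is the property that makes it convenient for idempotent loading scripts.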
Other Cypher Clauses
Ø FOREACH
Ø Performs an updating action for each graph element in a list (a path, ...).
Ø UNION
Ø Merges results from two or more queries.
Ø WITH
Ø Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in UNIX.
WITH example
Filter on aggregate function results
MATCH (david { name: 'David' })-->(otherPerson)-->()
WITH otherPerson, count(*) AS foaf
WHERE foaf > 1
RETURN otherPerson.name
Result
otherPerson.name
"Anders"
1 row
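The two-stage shape of this WITH query (aggregate first, then filter on the aggregate) can be sketched in plain Java. The edge list below is an assumed sample graph chosen so that the result matches the slide; it is not given in the deck. WithPipelineSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: WITH behaves like a pipe between two query stages.
class WithPipelineSketch {

    // Assumed sample data: directed edges (from, to).
    static final String[][] EDGES = {
            {"David", "Anders"},
            {"Anders", "Bossman"}, {"Anders", "Caesar"},
            {"Bossman", "Emil"}, {"Bossman", "David"},
            {"Caesar", "Emil"}
    };

    static List<String> outgoing(String from) {
        List<String> targets = new ArrayList<>();
        for (String[] edge : EDGES) {
            if (edge[0].equals(from)) {
                targets.add(edge[1]);
            }
        }
        return targets;
    }

    static List<String> query(String startName) {
        // Stage 1: MATCH (start)-->(otherPerson)-->()  WITH otherPerson, count(*) AS foaf
        Map<String, Long> foaf = new TreeMap<>();
        for (String otherPerson : outgoing(startName)) {
            for (String ignored : outgoing(otherPerson)) {
                foaf.merge(otherPerson, 1L, Long::sum);   // one matched row per second hop
            }
        }
        // Stage 2: WHERE foaf > 1  RETURN otherPerson.name
        return foaf.entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(query("David"));   // prints [Anders]
    }
}
```

The filter cannot be expressed in the first stage because the count only exists once the rows have been grouped, which is exactly what WITH makes possible.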
WITH example
limit the number of
entries for 2nd match
MATCH (n { name: 'Anders' })--(m)
WITH m
ORDER BY m.name DESC LIMIT 1
MATCH (m)--(o)
RETURN o.name
Result
o.name
"Bossman"
"Anders"
2 rows
Example Query
• The top 5 most frequently appearing companions:
match (doctor:characters { name: 'Doctor' }),       // start node from index
      (doctor)<-[:COMPANION_OF]-(companion)
      -[:APPEARED_IN]->(episode)                    // subgraph pattern
return companion.name, count(episode)               // accumulates rows by episode
order by count(episode) desc
limit 5                                             // limit returned rows
Results
+-----------------------------------+
| companion.name | count(episode) |
+-----------------------------------+
| Rose Tyler | 30 |
| Sarah Jane Smith | 22 |
| Jamie McCrimmon | 21 |
| Amy Pond | 21 |
| Tegan Jovanka | 20 |
+-----------------------------------+
| 5 rows, 49 ms |
+-----------------------------------+
Execute Cypher queries from Java
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( databaseDirectory );
Scalability
Ø Capacity (Graph Size)
Capacity
Ø The 1.9 release of Neo4j can support single graphs having tens of billions of nodes, relationships and properties.
Ø Starting with 3.0 :
Ø Dynamic pointer compression expands Neo4j’s available address space as needed, making it possible to store graphs of any size. (<1024 , Facebook’s largest graph is in the single-digit trillions.)
Latency
Ø RDBMS – more data in tables/indexes results in longer join operations.
Ø A graph DB doesn't suffer the same latency problem.
Ø An index is used to find the starting node.
Ø Traversal uses a combination of pointer chasing and pattern matching to search the data.
Ø Performance does not depend on the total size of the dataset.
Ø It depends only on the data being queried.
Throughput
Ø Constant performance irrespective of graph size.
Operations and Big Data
A comparison of relational and graph modelling
Simplified snapshot of application deployment within a data center
An entity-relationship diagram for the data center domain
Graph for the data center application
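Other example
[Figures: modelling languages and countries]
• Relational tables: Language, LanguageCountry, Country
• Graph: (Language { name, code, word_count })-[:IS_SPOKEN_IN { as_primary }]->(Country { name, code, flag_uri })
• Link stored as a property: Country { name: “Canada”, languages_spoken: “[ ‘English’, ‘French’ ]” }
• Link stored as relationships: language “English” and language “French” connected by spoken_in to country nodes (name: “USA”, name: “Canada”, name: “France”)
• Everything in one node: Country { name, flag_uri, language_name, number_of_words, yes_in_language, no_in_language, currency_code, currency_name }
• Split into nodes: (Country { name, flag_uri })-[:SPEAKS]->(Language { name, number_of_words, yes, no }) and (Country)-[:USES_CURRENCY]->(Currency { code, name })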
Common modeling pitfalls
Ø Example: email provenance problem domain
Ø Analysis of email communications to detect suspicious patterns of email communication
Ø Users and aliases :
Common modeling pitfalls
Ø Exchanged emails
Avoiding loss of information
Ø Introducing an email node
A graph of email interactions
Representing facts as nodes
Timeline trees
Representing time
Graph Databases popularity ranking
27 systems in ranking, November 2017
Source: http://db-engines.com/en/ranking/graph+dbms
Common Use Cases
Ø Social
Ø Recommendations
Ø Geo
Ø Logistics Networks : for package routing, finding shortest path
Ø Financial Transaction Graphs : for fraud detection
Ø Master Data Management
Ø Bioinformatics : Era7 to relate complex web of information that includes genes, proteins and enzymes
Emergent Graph in Other Industries
(Actual Neo4j Graphs)
• Content Management & Access Control
• Insurance Risk Analysis
• Geo Routing (Public Transport)
• Network Asset Management
• Network Cell Analysis
• BioInformatics
E-commerce fraud detection
Credit Card Fraud Detection
• the Post Office clerks copied the credit card information of some of their customers while processing transactions;
• they then located these customers' home addresses;
• using the credit card numbers, they would place orders online for goods or gift cards, to be delivered at their victims' home address;
• with goods ordered online, an accomplice would wait at the address to intercept the deliveries;
eBay case study
• “Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code. At the same time, Neo4j allowed us to add functionality that was previously not possible.”
Source: https://neo4j.com/case-studies/ebay/
Use Cases
• NASA (https://neo4j.com/blog/david-meza-chief-knowledge-architect-nasa/)
– “recording comments from astronauts as they come off the space station. This allows us to connect their experiences with those of astronauts over the last 10 or 15 years, which we can then tie to subsequent actions and projects.”
• Airbnb (https://neo4j.com/users/airbnb/)
– As Airbnb grows (3500 employees, 200K tables), so do the challenges around the volume, complexity and obscurity of data, which is often siloed and lacks context.
– data users struggle to find the right data
– an internal data tool, which empowers Airbnb employees to be data-informed by aiding with data exploration, discovery and trust