
2017 International Conference on Advanced Computing and Communication Systems (ICACCS-2017), Jan. 06-07, 2017, Coimbatore, INDIA

A Cypher Query based NoSQL Data Mining on Protein Datasets using Neo4j Graph Database

Johnpaul C I∗, Tojo Mathew†

∗Department of Information Science & Engineering
†Department of Computer Science & Engineering
The National Institute of Engineering, Mysore, India
Affiliated to VTU, Belagavi
Email: johnpaul.ci@gmail.com, tojomathew@gmail.com

Abstract—Graph data analysis is one of the upcoming methodologies in various niches of computer science. Traditionally, for storing, retrieving and experimenting with test data, researchers start with a MySQL database, which is approachable and easy to build a test experimentation platform on. These test-bed MySQL databases store data in the form of rows and columns, over which various SQL queries are performed. When the structure and size of the dataset change, these traditional MySQL databases become inefficient at storing and retrieving data. When the structure of a dataset changes from row-column to graph representation, MySQL-based querying and analysis become inefficient. The internal representation of the data changes to key-value pairs, often in an unstructured format, which has prompted researchers to consider other databases that can achieve faster retrieval and mining over the dataset. This paper explores the approach of NoSQL query design and the analysis of different datasets, particularly a proteome-protein dataset, over a renowned graph database, Neo4j. The experiments involve the evaluation of NoSQL query execution on datasets varying in the number of nodes and the relationships between them. The paper also emphasises the process of mining large graphs with meaningful queries based on the NoSQL query language Cypher.

Keywords: NoSQL, Neo4j, unstructured data, key-value, graph dataset, data mining.

I. INTRODUCTION

Data mining is regarded as a traditional collection of methods to formulate meanings of different dimensions from a given dataset. The success of these methods depends not only on the efficiency of the algorithm, but also on the structure of the dataset. The dataset can be a physical group of numbers which are the values of some significant attributes; the aim of any data mining method is then to find the relations between these bare numbers. Different algorithms consume these numbers and compute various parameters such as the mean, median, mode, correlation, variance and coefficients to establish a relation between them [1]. With the help of further equational proof and the property constants of the dataset, the algorithms bring out ample dimensions of conclusion which proclaim the relationships of the data in the dataset. These conclusions finally lead to the identification of new data, its nature and other prediction properties. This is the general routine work-flow of any data scientist who works on data mining. In short, data mining comprises a sequence of steps, a combination of methods, mathematical solutions or machine learning procedures to establish a relationship between the raw data (numbers) in a given dataset, with a view to providing a global meaning to the dataset.

When a researcher thinks about the relations between the attributes of a dataset, the most common method adopted is the relational database model, which has significant methods of establishing relationships between attributes. In an RDBMS, the concept of the foreign key helps to link tables of different attributes [2]. Even the systematic modeling of RDBMS tables corresponding to a dataset requires a disciplined thought process, adopting the different rules of RDBMS: normalization, which makes the tables free from the consequences of repeated values; partial dependencies, which deal with the relationship between a composite key and the other attributes; key management, which includes the possibility of defining super keys; and the database schema, which describes the global view of the tables and, in turn, the database. This is the context in which the importance of a new family of databases, particularly databases whose base design is entirely different from an RDBMS, comes into the picture.

In this new era of data mining, experimentation on databases other than MySQL has a profound significance. The subject domain of data mining includes key methods of machine learning, where the training and test datasets have to be prepared from the original experimental dataset. The proportion of data in the test and training datasets accounts for the accuracy of some algorithms. The relationships of the values in the dataset have to be established with a low probability of erroneous conclusions. If the database structure itself helps to establish relationships between the attributes of the dataset, it is a major milestone for the efficiency of the algorithms running over it. The properties of graph databases which are of prime focus are listed below. Even though one can define a larger set of key points, the properties mentioned below form the major attraction of graph databases.

• Data representation in the form of a graph.
• Relationship establishment.
• Simplicity in query formulation.
• Data visualization modules.
• Interoperability.
• Acceptance of different standard forms of data.
• Real programming experience on a conceptual database design.
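The contrast between the foreign-key model and the graph model can be made concrete with a small illustrative sketch. The data, names and relationship type below are invented for this example and are not taken from the paper's datasets:

```python
# Illustrative only: the same two facts stored relationally and as a graph.

# Relational style: two "tables", linked by a foreign key (protein_id).
proteins = {1: {"name": "P1"}, 2: {"name": "P2"}}
interactions = [{"protein_id": 1, "partner_id": 2, "weight": 0.5}]

def partners_relational(pid):
    # A join: scan the interaction table for matching foreign keys.
    return [proteins[row["partner_id"]]["name"]
            for row in interactions if row["protein_id"] == pid]

# Graph style: each node carries its relationships directly.
graph = {
    "P1": {"connectedto": [("P2", 0.5)]},
    "P2": {"connectedto": []},
}

def partners_graph(name):
    # Relationship traversal is a direct lookup, no join needed.
    return [dst for dst, _w in graph[name]["connectedto"]]

print(partners_relational(1))  # ['P2']
print(partners_graph("P1"))    # ['P2']
```

In the graph form the relationship is a property of the node itself, which is the first two bullet points above in miniature.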

978-1-5090-4559-4/17/$31.00©2017IEEE
The first three key points form the major breakthrough of graph databases compared to traditional MySQL databases. Graph representation of attributes has three major advantages. Firstly, in traditional MySQL database design the tables should not contain multivalued attributes, but in graph databases we can have a disciplined number of multivalued attributes: a node name, a node number, or even an array of different attributes for a single node. The second advantage is that each node can have a fixed number of incoming and outgoing edges, otherwise called the indegree and outdegree; these are analogous to the connections between different tables in MySQL. The third advantage is a futuristic one: the possibility of applying different graph algorithms over the representation, which can bring unexpectedly quick convergence to the result for various experiments in data mining.

The second key point suggests the importance of the foreign key in a native MySQL table. The relationships, which can be established physically between the nodes using typical query syntax, are analogous to foreign keys in MySQL. Simplicity in query formulation helps database developers, programmers and academicians to experience the liveliness of their query thoughts. Graph databases have different plugins for bridging the gap between a query-based language and a traditional high-level programming language. Different data visualization modules or packages are available which can be easily integrated with the work-flow of graph databases. Graph data visualization is one of the essential properties of a graph database; it portrays the real structure of the nodes according to the programmer's view. It is analogous to the select * query in a MySQL database.

The remainder of the paper is organized as follows. Section II contains a literature survey remarking on the different graph-based frameworks used in the computing world. It describes the ideas of graph representation used not only in the database domain, but also in distributed computing, machine learning etc. A background study conducted on programming languages which are more compatible with graph-like structure programming is also included. Section III comprises the proposed experiments carried out over different datasets, namely proteome-protein datasets, the query design and formulation methods, and significant relationship-based queries. Section IV discusses the results of the various experiments done over the above-mentioned datasets. Section V concludes the paper with a vision of the future scope of graph databases.

II. RELATED WORK

The ground work conducted to explore graph databases includes the following action plan. Graph theory concepts helped in mastering and designing queries. With a view to understanding different graph representations, a background study was performed on various graph theory based computing platforms: a graph theory based in-memory distributed computing framework called Spark, the Google graph processing framework Pregel, and the Neo4j graph database. This base exploration ends with a description of two corporate streams which use the Neo4j graph database in full swing for their operations related to big data analysis and visualization.

A. Graph Theory Fundamentals and Related Concepts

Graph theory and its associated concepts are assimilated in designing a graph database for handling data which emphasises the importance of relationship establishment between attributes. There are different interesting results from graph theory which can be extracted to evaluate the graph representations. Some of the concepts of interest include the in-degree and out-degree counts of nodes, the count of cut-edges, the existence of cut-vertices, shortest path algorithms in the database view, network flows and analysis, prominent graph traversal methods etc. [3][4]. Euler's formula, which establishes a relationship between the numbers of vertices and edges, the Havel-Hakimi theorem for establishing valid graphic sequences, the four coloring theorem, the matching property in graphs, subgraph properties, bipartite and complete bipartite graphs formed by the vertex partitioning process, connected components of a graph, directed acyclic graphs (DAG), and finding the centers and bi-centers of a tree are some of the concepts widely used in data representation. Though these results and concepts seem basic, they have significant impact on data representation in various graph frameworks, where the distribution of nodes and processing steps takes place across a network spreading out to millions of nodes and edges [5].

B. Spark: Graph theory based In-Memory Cluster Computing Framework

After the reign of Hadoop [6], which served as one of the base frameworks for a large number of distributed system applications, graph theory based frameworks started heading up for distributed data processing applications. Since its setup and other properties are user-friendly, Spark is a well-accepted package for designing clusters with an adequate number of systems. Considering the physical properties of a large cluster, it is always advisable to think about disk access. It is obvious that as the number of worker nodes in the cluster grows, the disk access also grows linearly. Rippling from this, the communication overhead between the systems grows drastically. Hence all these post-establishment features have an ample impact on the performance and scaling of the framework. Data scientists sought methods which could also utilize the memory space (total RAM) of the cluster to process, store, retrieve and analyze data for applications which work under strict time bounds. This aggressive research gave rise to the in-memory cluster computing framework called Spark [7].

The Spark in-memory cluster computing framework is an open source package that helps in quick data analytics. It evolved from a group of research scholars at the University of California, Berkeley, under the leadership of Matei Zaharia. When analytics comes into the scene, it involves both read and write operations on the data source. To provide a faster runtime execution environment, Spark makes room for all the prerequisites of establishing an in-memory cluster. The most striking feature of the Spark framework is that the user can load data directly into the memory of the cluster (total RAM space) and perform queries repeatedly, much faster than traditional disk-based systems like Hadoop MapReduce [6].
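One of the graph-theory results cited in Section II-A, the Havel-Hakimi test for valid graphic sequences, is compact enough to sketch directly. This minimal Python illustration is ours and not part of the paper's experiments:

```python
def is_graphic(sequence):
    """Havel-Hakimi test: can this degree sequence be realized
    by a simple undirected graph?"""
    seq = sorted(sequence, reverse=True)
    while seq:
        d = seq.pop(0)          # largest remaining degree
        if d == 0:
            return True         # all remaining degrees are zero
        if d > len(seq):
            return False        # not enough vertices to connect to
        # Connect this vertex to the d next-highest-degree vertices.
        for i in range(d):
            seq[i] -= 1
            if seq[i] < 0:
                return False
        seq.sort(reverse=True)
    return True

print(is_graphic([3, 3, 2, 2, 2]))  # True
print(is_graphic([4, 1, 1, 1]))     # False
```

A check like this can validate a degree sequence before any attempt to materialize it as a graph dataset.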
C. Graph Processing Frameworks

Fig. 1: Visualization of Neo4j Nodes and the Relationships between Them

Graphs gather attention from the research community when they project real-world data. They are intuitive and flexible in terms of elementary data operations. The supporting graph theory forms a strong foundation for nurturing graph processing frameworks. There are various standard algorithms in graph theory: shortest path, route finding, loop detection in a network, matching, subgraph generation etc. [8].
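The first of those standard algorithms, shortest path, can be sketched for an unweighted graph with a plain breadth-first search. This is an illustrative snippet with invented node names, not code from the frameworks discussed:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """BFS shortest path in an unweighted graph given as an adjacency dict."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no route between start and goal

g = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"], "n4": []}
print(shortest_path(g, "n1", "n4"))  # ['n1', 'n2', 'n4']
```

On a single machine this is trivial; the sections below describe what changes when the graph no longer fits on one machine.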
Problems arise when very large graph representations are required. The communication overhead is large when visiting highly connected vertices. In this scenario the graph cannot be accommodated on a single physical machine, and the implementation spreads over a cluster of machines. Algorithms typically trace the vertices through the edges of the graph, which increases the overhead due to machine-to-machine communication. For dividing the graph optimally over the cluster, the Bulk-Synchronous Parallel (BSP) algorithm is commonly used [8][9].

1) Pregel: Google Graph Processing System: Pregel processes a massive graph with the help of an algorithm that explores the graph along its edges. Originating from a set of vertices, one can travel from one vertex to another and transmit the execution of the algorithm across a group of active vertices. This is the BSP model of running an algorithm iteratively by transferring messages [10]. The Pregel system proceeds further in a master-slave mode of graph processing.

2) Apache Giraph: Giraph is yet another graph processing package, from the Apache foundation, following the development of Pregel. It is still undergoing research to extend its capabilities. Fault tolerance is one of the major issues taken care of by Giraph [11].

3) Apache HAMA: HAMA originated from the inspiring ideas of Google Pregel. Data reaches HAMA through HDFS APIs, in the form of adjacency lists stored in text files where every line starts with a vertex ID, followed by the list of IDs of the vertices connected to it by outgoing edges. Synchronization is performed by ZooKeeper, the most widely used synchronization component in distributed system frameworks [12].

D. Neo4j: The emerging graph database

Neo4j is one of the emerging and most used graph databases in this computing era. There is a wide acceptance of Neo4j apart from MySQL due to various reasons. For quite a long time, the database world enjoyed all the privileges of MySQL's row-based data storage. For software applications, academics and research, MySQL played an immense role in establishing trust in data storage. In the course of time, the structure and scale of data changed, and it turned out to be inefficient to store data in the form of rows and columns. As a result, people started thinking of migrating to other databases [13].

Neo4j provides a convenient way to visualize data with its inherent graph structure. Users can declare data attributes in the form of nodes, and the relationships between these nodes as edges, during the declaration itself. Querying is another property of all databases; in Neo4j the data is retrieved in the form of queries written in the Cypher query language, which is famous for its simplicity in writing queries that retrieve and declare relationships between the nodes.

Another striking feature of Neo4j is the availability of a graphical user interface to track all the nodes, relationships, queries and performance. It provides the user a well organized GUI for declaring nodes, and relationships between the nodes, graphically. It is also equipped with a performance analysis graph which updates every three seconds with the most recent details of the data. Moreover, there is not much hurdle in establishing a Neo4j database on a server, thereby increasing users' trust and acceptance. Neo4j can also be integrated with Python using the Py2neo plugin, where all the queries can be written as programming statements. Thus the emerging graph database has gained a commendable acceptance in the developer community [13][14].

III. PROPOSED WORK

This quarter of the paper contains an in-depth description of the experiments conducted over datasets on Neo4j. It starts with a basic modeling of the dataset parsing methods, proceeds further to Neo4j, and reaches the query formulation. The whole process of the experiments can be inferred from Figure 2.

Fig. 2: Work-flow Diagram of Query Formulation of Graph Datasets

The parser program is written in Perl and contains specific functions to identify the source and destination nodes and formulate the necessary queries. To generate a basic graph model, it is essential to define two types of queries, viz. create and relationship. A general description of the algorithm for generating these queries is shown in Algorithm 1.
Algorithm 1 Query Generation Algorithm

Data: Graph dataset
Result: Create queries, relationship queries
open the dataset file; initialize line counter FILEDES = 0
while a line remains in the file do
    line = readline()
    output = split(line)
    source = output[0]
    createquery(source, nodename, properties)
    destinations = output[1..]
    destinationcount = length(destinations)
    iter = 0
    while iter != destinationcount do
        relationquery(source, destinations[iter])
        iter = iter + 1
    end
    FILEDES = FILEDES + 1
end

The algorithm depends directly on the count of lines, that is, of source nodes, which we trace with the line counter FILEDES. If there are m nodes in the graph dataset, the maximum value of FILEDES is m. From the algorithm above it is clear that for each source node, say i, there can be an arbitrary number of destination nodes, say n. The running time of this algorithm is therefore close to O(mn), under the assumption that the average out-degree of any node does not scale up to m.

As an initial step towards the formulation of create and relationship queries, the graph dataset selected contains 20000 nodes and 100000 edges, with an average degree of 5 for each node. The whole dataset is of the form sourcenode dest1, dest2, dest3 ... and so on. On running the query formulation algorithm for Neo4j, the node-creation queries are as shown in Figure 3, and the queries for establishing the relationships between the nodes can be viewed in Figure 4.

Fig. 3: Generation of Node Creation Queries

Fig. 4: Generation of Relationship Queries

Taking the first create query from Figure 3, it states that a node of name node0 and type largenode has to be created as n0; here nodename is a property of the node n0. The first relation query from Figure 4 advances as a combination of match and create queries [15][16]. With the help of the node creation queries, all the respective nodes would have been created, and with those existing nodes the relationship queries are formulated. The first step in creating a relationship is matching: the match sub-query identifies the nodes between which the relationship is to be established, in accordance with the specification of the dataset. The match query will find two nodes, viz. n1 and n3255, and will create a relationship between n1 and n3255 with the relationship construct -[:connectedto]->, where the term connectedto is the type of the relationship. The meaning of this query formulation is left to the designer.

The various experiments conducted over the yeast dataset and the other datasets differ basically in query formulation. An instance of a match query is as follows:

match (a)-[:connectedto]->(b) return a,b limit 200

This query returns the nodes named a and b with the relationship connectedto. Since the result set contains as many as 20000 nodes, we limit the result to 200.

match (n0:yeastnode)-[link:linkedproteinto]->(n1:yeastnode) where link.edgeweight = 0.003721 return link limit 40

This query retrieves the yeastnode links having an edgeweight equal to 0.003721, in blocks of 40. The query formulation gives rise to different meanings which can be utilized for complex graph data analysis.

match testnode=(begin)-[link:linkedproteinto]->(end) where begin.yeastlabel='ylow34' and end.name='ylowe43' foreach (n in yeastnodes(link) | set n.edgeweight = n.edgeweight + 1)

The above complex query helps the programmer find a path from the starting yeast node labeled ylow34 to the ending node labeled ylowe43. After finding the path using the existing linkedproteinto relationship, it increments the current edgeweight by one.

match p = (a)-[*1..3]->(b) where a.yeastlabel = 'y0w5rt' and b.yeastlabel = 'ymly98' and all (x in yeastnodes(p) where x.edgeweight > 0.001675) return p

This query returns all the nodes on paths from the yeastnode labeled y0w5rt through the yeastnode labeled ymly98 whose relationships have an edgeweight greater than 0.001675.

match p=(a)-->(b)-->(c) where a.yeastlabel='ylwdd43' and b.yeastlabel='ywerll3' and c.yeastlabel = 'yylw3c' return extract(n in yeastnodes(p) | n.yeastlabel) as extracted

The graph framework Neo4j also contains inbuilt functions for processing the output of match statements [16]. The above Cypher query is one of the instances where these functions are used. Here the inbuilt function extract is used to filter out only the labels from the path specified by the match query. The function extract works with the help of an iterator variable over the result of the match and filters out the result required by the user.
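Algorithm 1 is straightforward to sketch in code. The paper's parser is written in Perl; the Python version below is our illustration, and the exact query strings (label and relationship names) are invented to resemble, not reproduce, those in Figures 3 and 4:

```python
def generate_queries(lines, label="largenode", reltype="connectedto"):
    """Sketch of Algorithm 1: one create query per source node,
    one relationship query per (source, destination) pair."""
    creates, relations = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        source, destinations = fields[0], fields[1:]
        creates.append(
            f"create ({source}:{label} {{nodename:'{source}'}})")
        for dest in destinations:
            relations.append(
                f"match (a {{nodename:'{source}'}}), (b {{nodename:'{dest}'}}) "
                f"create (a)-[:{reltype}]->(b)")
    return creates, relations

# Each input line is of the form: sourcenode dest1 dest2 ...
dataset = ["n1 n2 n3255", "n2 n3255"]
creates, relations = generate_queries(dataset)
print(len(creates), len(relations))  # 2 3
```

As the O(mn) estimate above suggests, the work is one create query per line plus one relationship query per destination.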

IV. RESULTS AND ANALYSIS

The experiments performed over Neo4j can be broadly classified into two sections. Firstly, the aim of the experiments involves the design of various queries using the Cypher language; different use-case queries were discussed in the previous section. The results of such queries have to keep pace with the programmer's view. For each query, the output is recorded in terms of the number of nodes and relationships. This data is taken by executing the query under a threshold limit, to restrict the number of nodes appearing in the webadmin of Neo4j. Some of the prominent query result visuals obtained can be viewed in Figures 5, 6 and 7.

Fig. 5: Match Query Visualization on the connectedto Relationship

Figure 5 was produced as the result of a match query on the 20000-node, 100000-edge dataset over the relationship connectedto. The query returns a large number of nodes; for better visualization, the view is limited to 200 nodes.

Fig. 6: Match Query Visualization on the Yeast Dataset based on the linkedproteinto Relationship

Figure 6 is formed as the result of a match query on the yeast protein dataset. The protein nodes of this dataset are linked by the linkedproteinto relationship, and each link is associated with an edgeweight. The query returns the nodes whose relationship is equal to the given edge weight.

Fig. 7: The Webadmin Window which gives the Accurate Statistics of neo4j Contents

Figure 7 reveals the capacity of the Neo4j database during the various experiments. It gives accurate statistics of the nodes, relationships and properties of the infrastructure. The activity graph shows the progress of node, relation and property creation at specific times.

Table I gives the counts of nodes and relationships retrieved when the match query was executed. The match query was performed over the 20000-node, 100000-edge dataset at discrete limit intervals.

TABLE I: Node and Relationship Count based on Limit based Match Query Execution

Limit   Number of Nodes   Number of Relationships
50      39                86
100     73                166
150     105               253
200     126               299
250     184               355
300     241               417
350     300               474
400     359               529
450     417               588
500     476               654

The statistical details of the match query experiments over the yeast dataset under discrete limits can be viewed in Table II.

TABLE II: Node and Relationship Count of Yeast Dataset based on Limit based Match Query Execution

Limit   Number of Nodes   Number of Relationships
50      59                66
100     108               133
150     155               233
200     196               302
250     250               388
300     280               492
350     330               583
400     364               700
450     398               792
500     433               894

The two graphs represented by Figures 8 and 9 throw light on the following conclusions.
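The trend in Tables I and II can also be checked numerically. The small sketch below (our illustration; the counts are transcribed from the two tables in this paper) computes the relationship-to-node ratio at each limit, showing that the yeast dataset's ratio climbs while the synthetic dataset's falls:

```python
# Limit values and counts transcribed from Tables I and II.
limits = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
table1_nodes = [39, 73, 105, 126, 184, 241, 300, 359, 417, 476]
table1_rels  = [86, 166, 253, 299, 355, 417, 474, 529, 588, 654]
table2_nodes = [59, 108, 155, 196, 250, 280, 330, 364, 398, 433]
table2_rels  = [66, 133, 233, 302, 388, 492, 583, 700, 792, 894]

def ratios(nodes, rels):
    # Relationships per node at each limit value.
    return [round(r / n, 2) for n, r in zip(nodes, rels)]

r1 = ratios(table1_nodes, table1_rels)
r2 = ratios(table2_nodes, table2_rels)
# The yeast ratio keeps climbing, matching the near-exponential
# relationship growth noted in the discussion of Figure 9.
print(r1[0], r1[-1])  # 2.21 1.37
print(r2[0], r2[-1])  # 1.12 2.06
```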
Fig. 8: Node and Relationship Graph of Limit based Match Query Execution

Fig. 9: Node and Relationship Graph of Limit based Match Query Execution on Yeast Dataset

From Figure 8, it is obvious that the nodes and relationships keep a safe distance between them in their growth. This is due to the nature of the 20000-node, 100000-edge dataset, whose nodes have an average outdegree of five. In the graph represented by Figure 9, on the other hand, the internal structure of the dataset is not known until the graphs are created using the create and relationship queries. It can be noticed that as the number of nodes increases, the growth of relationships increases tremendously, approaching an exponential curve.

V. CONCLUSION AND FUTURE SCOPE

The experiments with the graph database Neo4j reveal a new dimension of data mining research. In modern times the dataset formats change, and MySQL database experiments seem inadequate when the data cannot be represented in the form of rows and columns. NoSQL queries are more powerful in performing retrieval from a graph database without the hurdles of constraints. Moreover, a graph database also supports the storage of unstructured data in the form of defined properties and relationships. It can also be integrated with modern forms of data interchange using JSON and REST interfaces. The only challenge in using a graph database is the design of efficient queries in the Cypher language. Programmers and developers have to invest their effort in designing meaningful queries on a dataset, based on the existing properties of that dataset. Successful design of queries will provide new dimensions of research results which can be scaled over a vast number of datasets, thereby drilling out various meaningful conclusions and predictions from those datasets. The Neo4j graph database is widely accepted by the research community and the corporate world. In this era of unformatted data and methods like big data analytics, Neo4j will be one of the base platforms providing solutions to various data mining requirements.

REFERENCES

[1] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh, "Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps," in ISMB 2005 Proceedings, Thirteenth International Conference on Intelligent Systems for Molecular Biology, 2005, pp. 1302–1310.
[2] M. Di Giacomo, "MySQL: lessons learned on a digital library," IEEE Software, vol. 22, no. 3, pp. 10–13, July 2005.
[3] J.A. Bondy and U.S.R. Murty, Graph Theory with Applications, O'Reilly Media, 2nd edition, January 2013.
[4] Maarten van Steen, Graph Theory and Complex Networks: An Introduction, Altera Corporation, 1st edition, January 2010.
[5] Harith A. Dawood, "Graph theory and cyber security," in IEEE Third International Conference on Advanced Computer Science Applications and Technologies (ACSAT), 2014, pp. 90–96.
[6] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters," in OSDI, 2004, pp. 137–150.
[7] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012, pp. 2–20.
[8] Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal, "A high-level framework for distributed processing of large-scale graphs," in 12th International Conference on Distributed Computing and Networking, 2011, pp. 155–166.
[9] Ian Robinson, Jim Webber, and Emil Eifrem, Graph Databases, O'Reilly Media, 1st edition, June 2013.
[10] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, "Pregel: A system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010, pp. 135–146.
[11] Arne Koschel, Felix Heine, Irina Astrova, Fred Korte, Thomas Rossow, and Sebastian Stipkovic, "Efficiency experiments on Hadoop and Giraph with PageRank," in IEEE 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Heraklion, 2016, pp. 328–331.
[12] Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, and Seungryoul Maeng, "HAMA: An efficient matrix computation with the MapReduce framework," in IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 721–726.
[13] David Hoksza and Jan Jelínek, "Using Neo4j for mining protein graphs: A case study," in 26th International Workshop on Database and Expert Systems Applications (DEXA), Valencia, 2015, pp. 230–234.
[14] Hongcheng Huang and Ziyu Dong, "Research on architecture and query performance based on distributed graph database Neo4j," in 3rd International Conference on Consumer Electronics, Communications and Networks (CECNet), Xianning, 2013, pp. 533–536.
[15] "Comparison of graph processing frameworks," http://blog.octo.com/en/introduction-to-large-scale-graph-processing, accessed January 28, 2015.
[16] "Neo4j user manual," http://neo4j.com/, accessed January 11, 2016.