Abstract: Graph data analysis is one of the upcoming methodologies in various niches of computer science. Traditionally, for storing, retrieving and experimenting with test data, researchers start with a MySQL database, which is approachable and makes it easy to build a test experimentation platform. These test-bed MySQL databases store data in the form of rows and columns, over which various SQL queries are performed. When the structure and size of the dataset change, these traditional MySQL databases become inefficient at storing and retrieving data. In particular, when the structure of the dataset changes from a row-column to a graph representation, MySQL based querying and analysis become inefficient. The internal representation of the data changes to key-value pairs, more often with the data in an unstructured format, which has prompted researchers to consider other databases that can achieve faster retrieval and mining over the dataset. This paper explores NoSQL query design and the analysis of different datasets, particularly a proteome-protein dataset, over a renowned graph database, Neo4j. The experiments evaluate NoSQL query execution on datasets that vary in the number of nodes and the relationships between them. The paper also emphasises the process of mining large graphs with meaningful queries written in a NoSQL query language called Cypher.

Keywords: NoSQL, Neo4j, unstructured data, key-value, graph dataset, data mining.

I. INTRODUCTION

Data mining is regarded as a traditional collection of methods for deriving meaning along different dimensions from a given dataset. The success of these methods depends not only on the efficiency of the algorithm, but also on the structure of the dataset. The dataset can be a physical group of numbers that are the values of some significant attributes. The aim of any data mining method is then to find the relation between these bare numbers. Different algorithms consume these numbers and compute parameters such as mean, median, mode, correlation, variance and coefficients to establish a relation between them [1]. With the help of further equational proof and property constants of the dataset, the algorithms bring out conclusions that proclaim the relationship of the data in the dataset. These conclusions finally lead to the identification of new data, its nature and other predictive properties. This is the routine workflow of any data scientist who works on data mining. In short, data mining comprises a sequence of steps, a combination of methods, mathematical solutions or machine learning procedures that establish a relationship between the raw data (numbers) in a given dataset, with a view to giving a global meaning to that dataset.

When a researcher thinks about the relation between the attributes of a dataset, the most common choice is the relational database model, which has well-established methods for relating attributes. In an RDBMS, the concept of a foreign key helps to link tables of different attributes [2]. Even the systematic modeling of RDBMS tables for a dataset requires a disciplined thought process that applies different rules of RDBMS design, including normalization, which frees a table from the effects of repeated values; partial dependencies, which concern the relationship between a composite key and other attributes; key management, which includes the possibility of defining super keys; and the database schema, which describes the global view of the table and in turn the database. This is the context in which a new family of databases, particularly those whose base design is entirely different from RDBMS, comes into the picture.

In this new era of data mining, experimentation on databases other than MySQL has profound significance. The subject domain of data mining includes key methods of machine learning, where the training and test datasets have to be prepared from the original experimental dataset. The proportion of data in the test and training datasets accounts for the accuracy of some algorithms. The relationship among the values in the dataset has to be established with a low probability of erroneous conclusions. If the database structure itself helps to establish relationships between the attributes of the dataset, it is a major step forward for the efficiency of the algorithm running over it. Although one can define a large set of key points, the properties listed below form the major attraction of graph databases.

• Data representation in the form of a graph.
• Relationship establishment.
• Simplicity in query formulation.
• Data visualization modules.
• Interoperability.
• Acceptance of different standard forms of data.
• Real programming experience on a conceptual database design.
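The first three properties in the list can be made concrete with a minimal sketch. The plain-Python model below is an illustration only and is not Neo4j's API: the class names, the Protein label, the INTERACTS_WITH relationship type and the sample protein names are all hypothetical. It shows nodes that carry properties (including a multivalued one), relationships stored as first-class links between nodes rather than as foreign keys, and a short query over a relationship type with a row limit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int      # id of the source node
    end: int        # id of the target node
    rel_type: str   # relationship type, e.g. "INTERACTS_WITH" (hypothetical)


class PropertyGraph:
    """Toy in-memory property graph; an assumption-laden stand-in, not a database."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def create_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), dict(properties))

    def create_relationship(self, start, end, rel_type):
        # A relationship links two existing nodes directly, playing the role
        # that a foreign key plays between relational tables.
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.relationships.append(Relationship(start, end, rel_type))

    def match(self, rel_type, limit=None):
        # Roughly what a Cypher query of the following shape would express:
        #   MATCH (a)-[:INTERACTS_WITH]->(b) RETURN a.name, b.name LIMIT 2
        rows = [
            (self.nodes[r.start].properties["name"],
             self.nodes[r.end].properties["name"])
            for r in self.relationships
            if r.rel_type == rel_type
        ]
        return rows if limit is None else rows[:limit]


graph = PropertyGraph()
# "aliases" is a multivalued property, which a normalized table would avoid.
graph.create_node(1, ["Protein"], name="P1", aliases=["alpha", "a1"])
graph.create_node(2, ["Protein"], name="P2")
graph.create_node(3, ["Protein"], name="P3")
graph.create_relationship(1, 2, "INTERACTS_WITH")
graph.create_relationship(1, 3, "INTERACTS_WITH")
graph.create_relationship(2, 3, "INTERACTS_WITH")

print(graph.match("INTERACTS_WITH", limit=2))  # [('P1', 'P2'), ('P1', 'P3')]
```

In a relational design the same data would need a protein table, a separate table for aliases and a join table for interactions; here the structure of the data and the structure of the query coincide.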
2017 International Conference on Advanced Computing and Communication Systems (ICACCS -2017), Jan. 06 – 07, 2017, Coimbatore, INDIA
The first three key points form the major breakthrough of graph databases compared to traditional MySQL databases. Graph representation of attributes has three major advantages. First, in traditional MySQL database design, tables should not contain multivalued attributes, whereas in graph databases we can have a disciplined number of multivalued attributes. Such an attribute can be a node name, a node number or even an array of different attributes for a single node. The second advantage is that each node has a definite number of incoming and outgoing edges, called its indegree and outdegree. These are analogous to the connections between different tables in MySQL. The third advantage is forward-looking: different graph algorithms can be applied over the representation, which may bring unexpectedly quick convergence to the result in various data mining experiments.

The second key point corresponds to the role of the foreign key in a native MySQL table. The relationships that can be established physically between nodes using typical query syntax are analogous to foreign keys in MySQL. Simplicity in query formulation lets database developers, programmers and academicians see their query ideas come to life quickly. Graph databases have different plugins for bridging the gap between a query language and a traditional high-level programming language. Different data visualization modules or packages are available that can be easily integrated with the workflow of graph databases. Graph data visualization is one of the essential properties of a graph database, as it portrays the real structure of the nodes as the programmer established them. It is analogous to the select * query in a MySQL database.

The remainder of the paper is organized as follows. Section II contains a literature survey of the different graph based frameworks used in the computing world. It describes the ideas of graph representation that are used not only in the database domain but also in distributed computing, machine learning, etc. A background study of programming languages that are more compatible with graph-like structure programming is also included. Section III comprises the proposed experiments carried out over different datasets, including proteome-protein datasets, the query design formulation methods and significant relationship based queries. Section IV discusses the results of the experiments done over these datasets. Section V concludes the paper with a vision of future scope for graph databases.

II. RELATED WORK

The groundwork conducted to explore graph databases followed this plan. Graph theory concepts helped in mastering and designing queries. To understand different graph representations, a background study was performed on various graph theory based computing platforms: a graph theory based in-memory distributed computing framework called Spark, Google's graph processing framework Pregel, and the Neo4j graph database. The exploration ends with a description of two corporate streams that use the Neo4j graph database in full swing for their operations related to big data analysis and visualization.

A. Graph Theory Fundamentals and Related Concepts

Graph theory and its associated concepts are assimilated in designing a graph database for handling data that emphasises the importance of establishing relationships between attributes. Several interesting results from graph theory can be used to evaluate graph representations. Concepts of interest include the indegree and outdegree counts of nodes, the count of cut-edges, the existence of cut-vertices, shortest path algorithms in a database view, network flows and their analysis, and prominent graph traversal methods [3][4]. Euler's formula, which relates the numbers of vertices, edges and faces of a connected planar graph; the Havel-Hakimi theorem for establishing valid graphic sequences; the four color theorem; matching properties in graphs; subgraph properties; bipartite and complete bipartite graphs, which are formed by a vertex partitioning process; connected components of a graph; directed acyclic graphs (DAG); and finding the centers and bi-centers of a tree are some of the concepts widely used in data representation. Though these results and concepts may seem basic, they have a significant impact on data representation in various graph frameworks, where the distribution of nodes and processing steps takes place across a network spanning millions of nodes and edges [5].

B. Spark: Graph theory based In-Memory Cluster Computing Framework

After the reign of Hadoop [6], which served as one of the base frameworks for a large number of distributed system applications, graph theory based frameworks started to lead distributed data processing applications. Since its setup and other properties are user-friendly, Spark is a well accepted package for designing clusters with an adequate number of systems. Given the physical properties of a large cluster, it is always advisable to consider disk access. As the number of worker nodes in the cluster grows, disk access also grows linearly. As a ripple effect, the communication overhead between the systems also grows drastically. All of these post-setup factors have a considerable impact on the performance and scaling of the framework. Data scientists therefore looked for methods that also use the memory space (total RAM) of the cluster to process, store, retrieve and analyze data for applications that work under strict time bounds. This research gave rise to the in-memory cluster computing framework called Spark [7].

The Spark in-memory cluster computing framework is an open source package that helps in quick data analytics. It originated with a group of research scholars at the University of California, Berkeley, under the leadership of Matei Zaharia. Analytics involves both read and write operations on the data source. To provide a faster runtime execution environment, Spark makes room for all the prerequisites to establish an in-memory cluster. The most striking feature of the Spark framework is that the user can load data directly into the memory of the cluster (total RAM space) and run queries repeatedly, much faster than with traditional disk based systems like Hadoop MapReduce [6].
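The benefit of loading data into memory once and querying it repeatedly can be illustrated without a cluster. The following single-machine Python sketch is not Spark and uses none of its API; the file name, the edge list and the helper names are assumptions made for the illustration. It answers the same outdegree query three times, first by re-reading an edge file for every query and then from a copy held in memory, and it counts file reads instead of measuring time so that the outcome is deterministic.

```python
import os
import tempfile

disk_reads = 0  # counts how often the edge file is opened and parsed


def read_edges(path):
    """Parse 'source,target' lines from disk into a list of integer pairs."""
    global disk_reads
    disk_reads += 1
    with open(path, encoding="utf-8") as handle:
        return [tuple(int(part) for part in line.split(","))
                for line in handle if line.strip()]


def out_degree_from_disk(path, node):
    # Disk based style: every query goes back to the file.
    return sum(1 for source, _ in read_edges(path) if source == node)


class InMemoryEdges:
    """Loads the edge list once and answers all later queries from RAM."""

    def __init__(self, path):
        self.edges = read_edges(path)

    def out_degree(self, node):
        return sum(1 for source, _ in self.edges if source == node)


def run_demo():
    global disk_reads
    edges = [(1, 2), (1, 3), (2, 3), (3, 1), (3, 2), (3, 4)]  # hypothetical data
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "edges.csv")
        with open(path, "w", encoding="utf-8") as handle:
            handle.writelines(f"{s},{t}\n" for s, t in edges)

        disk_reads = 0
        from_disk = [out_degree_from_disk(path, n) for n in (1, 2, 3)]
        reads_disk = disk_reads

        disk_reads = 0
        cached = InMemoryEdges(path)
        from_memory = [cached.out_degree(n) for n in (1, 2, 3)]
        reads_memory = disk_reads
    return from_disk, reads_disk, from_memory, reads_memory


results = run_demo()
print(results)  # ([2, 1, 3], 3, [2, 1, 3], 1)
```

Both approaches return the same answers, but the disk based version performs one file read per query while the in-memory version performs a single read in total. In Spark the corresponding step is to mark a dataset as cached so that later actions reuse the in-memory copy across the cluster.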
REFERENCES

[1] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh, “Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps,” in ISMB 2005 Proceedings, Thirteenth International Conference on Intelligent Systems for Molecular Biology, 2005, pp. 1302–1310.
[2] M. Di Giacomo, “MySQL: lessons learned on a digital library,” IEEE Computer Society, IEEE Software, vol. 22, no. 3, pp. 10–13, July 2005.
[3] J.A. Bondy and U.S.R. Murty, Graph Theory with Applications, O’Reilly Media, 2nd edition, January 2013.
[4] Maarten Van Steen, Graph Theory and Complex Networks: An Introduction, Altera Corporation, 1st edition, January 2010.
[5] Harith A. Dawood, “Graph theory and cyber security,” in IEEE Third International Conference on Advanced Computer Science Applications and Technologies (ACSAT), 2014, pp. 90–96.
[6] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified data processing on large clusters,” in OSDI, 2004, pp. 137–150.
[7] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012, pp. 2–20.
[8] Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal, “A high-level framework for distributed processing of large-scale graphs,” in 12th International Conference on Distributed Computing and Networking, 2011, pp. 155–166.
[9] Ian Robinson, Jim Webber, and Emil Eifrem, Graph Databases, O’Reilly Media, 1st edition, June 2013.
[10] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, “Pregel: A system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010, pp. 135–146.
[11] Arne Koschel, Felix Heine, Irina Astrova, Fred Korte, Thomas Rossow, and Sebastian Stipkovic, “Efficiency experiments on Hadoop and Giraph with PageRank,” in IEEE 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Heraklion, 2016, pp. 328–331.
[12] Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, and Seungryoul Maeng, “HAMA: An efficient matrix computation with the MapReduce framework,” in IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 721–726.
[13] David Hoksza and Jan Jelínek, “Using Neo4j for mining protein graphs: A case study,” in 26th International Workshop on Database and Expert Systems Applications (DEXA), Valencia, 2015, pp. 230–234.
[14] Hongcheng Huang and Ziyu Dong, “Research on architecture and query performance based on distributed graph database Neo4j,” in 3rd International Conference on Consumer Electronics, Communications and Networks (CECNet), Xianning, 2013, pp. 533–536.
[15] “Comparison of graph processing frameworks,” http://blog.octo.com/en/introduction-to-large-scale-graph-processing, accessed on January 28, 2015.
[16] “Neo4j user manual,” http://neo4j.com/, accessed on January 11, 2016.

Fig. 8: Node and Relationship Graph of Limit based Match Query Execution

Fig. 9: Node and Relationship Graph of Limit based Match Query Execution on Yeast Dataset

following conclusions. In Fig. 8, the numbers of nodes and relationships keep a steady distance from each other as they grow. This follows from the nature of the dataset with 20000 nodes and 100000 edges, whose nodes have an average outdegree of five. For the graph in Fig. 9, in contrast, the internal structure of the dataset is not known until the graph is created using create and relationship queries. There, as the number of nodes increases, the number of relationships grows far more rapidly and approaches an exponential curve.

V. CONCLUSION AND FUTURE SCOPE

Experiments on the graph database Neo4j reveal a new dimension of data mining research. In modern times the format of datasets changes, and MySQL database experiments seem inadequate when the data cannot be represented in the form of rows and columns. NoSQL queries are more powerful in performing retrieval from a graph database without the hurdles of constraints. Moreover, a graph database also supports the storage of unstructured data in the form of defined properties and relationships. It can also be integrated with modern forms of data interchange using JSON and REST interfaces. The only
challenge on using the graph database is the design of efficient