Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 215 (2022) 8–23
www.elsevier.com/locate/procedia

4th International Conference on Innovative Data Communication Technology and Application

Insights into NoSQL databases using financial data: A comparative analysis

Ashish Rao1, Dhruvi Khankhoje1, Uditi Namdev1, Chetashri Bhadane1, Deepika Dongre1

1 Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, Maharashtra, India.

Abstract

The volume of data and the ways in which it is generated, accumulated, stored, and utilized have changed significantly
over time. Additionally, the nature of data has changed throughout the years, transforming from structured to more
unstructured data. This brings about a need for efficient storage and management of such data, which cannot be
handled by traditional RDBMS methods. Hence, NoSQL databases have gained popularity and have become pivotal in
database management. Finance is one domain where large amounts of dynamic data are produced on a daily basis, thereby
making NoSQL databases an ideal choice for data management. This paper compares these types of NoSQL databases based
on certain metrics like data model, indexing methods, atomicity, integrity and several more, and demonstrates the
implementation of three NoSQL databases, namely MongoDB, Cassandra and Redis, using financial data. Experiments were
performed to compare the performance of the aforementioned databases when using fundamental READ queries to retrieve
the complete dataset and complex READ queries to retrieve a specific section. Aggregation operations were also
implemented on the data. Fundamental WRITE queries to load the entire dataset and complex WRITE queries to update
particular parts of it were also performed.

© 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Innovative Data
Communication Technologies and Application

Keywords: NoSQL (not only SQL), databases, Finance, MongoDB, Cassandra, Redis

1. Introduction
A database is a collection of data that is kept electronically and can be accessed whenever needed through the use of
a database management system, or DBMS. Data can be stored, retrieved, manipulated, and managed using this
system. SQL, or a relational database, and NoSQL, or a non-relational database, are the two primary kinds of
databases based on the connection of the data stored. SQL databases store data in tables, use queries to analyze the
data and are useful for the data which can undergo normalization and requires transactional integrity [1]. They were
the most used databases and were incorporated in major companies in the late 2000s.

1877-0509 © 2023 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2022.12.002

The way web applications deal with data has changed considerably over the past years. Schema-based relational
databases suffer in terms of performance and scalability as a result of the massive expansion of data in online and
mobile apps. As more data is accumulated, more users are simultaneously accessing it, which has led to the
development of newer database technologies that can handle complex data. The NoSQL database, also known as
‘Not only SQL’, overcomes most of the serious limitations that a relational database faces while handling large
volumes of data [2]. First, complexity is high when the data cannot be easily encapsulated in a table and SQL is used
to work with unstructured data. Second, as the volume of data grows, relational databases are partitioned across
multiple servers, which causes problems because joining tables across distributed servers is difficult [2]. NoSQL
databases are commonly described in terms of the CAP theorem, which was proposed by Eric Brewer in the year 2000.
CAP stands for Consistency, Availability and Partition Tolerance. Consistency means that once data is written to a
distributed system, the same data can be read at any moment from any node in the system. Availability means that the
data remains available and accessible even if there is a failure in the system. Partition Tolerance means that the
system keeps functioning even if there is a partition between two nodes. Since distributed systems must tolerate
partitions, a system can guarantee only one of Availability or Consistency while a partition lasts, not both [3].
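
The trade-off between Consistency and Availability can be illustrated with a toy model. The sketch below is not taken
from any of the databases discussed; the `Replica` class and the `write_cp` and `write_ap` helpers are hypothetical
names used only for illustration. Two in-memory replicas are separated by a simulated partition: a consistency-first
write is refused, while an availability-first write succeeds on the reachable replica and leaves the two replicas
temporarily divergent.

```python
class Replica:
    """A toy in-memory replica holding key-value pairs (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


def write_cp(replicas, partitioned, key, value):
    """Consistency-first write: refuse unless every replica is reachable."""
    if partitioned:
        return False  # unavailable during the partition, but never inconsistent
    for replica in replicas:
        replica.data[key] = value
    return True


def write_ap(replicas, partitioned, key, value):
    """Availability-first write: always accept on the reachable replicas."""
    # Assumption: during a partition only the first replica can be reached.
    reachable = replicas[:1] if partitioned else replicas
    for replica in reachable:
        replica.data[key] = value
    return True


a, b = Replica("a"), Replica("b")
cp_ok = write_cp([a, b], partitioned=True, key="AAPL", value=150.0)
ap_ok = write_ap([a, b], partitioned=True, key="AAPL", value=150.0)
diverged = a.data != b.data  # replica b has not seen the AP write yet

print(cp_ok, ap_ok, diverged)
```

In a real system the divergent replicas would be reconciled once the partition heals, which is what eventual
consistency refers to.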

Businesses use these databases to increase their competitive advantage and their overall ROI. Because leading
NoSQL databases such as MongoDB, Couchbase, CouchDB and DynamoDB offer cloud-based options for processing big data,
they are affordable and well suited to smaller organizations with a limited budget. These databases are designed to
be flexible and to respond to the data management demands of various institutions. NoSQL databases therefore handle
semi-structured and unstructured data more efficiently than relational databases: their non-tabular structure matches
the lack of tabular structure in such data. This allows flexible storage and retrieval of information and broadens
the types of data sources that can be used. Figure 1 shows the architecture diagram for MongoDB and Cassandra.

Fig. 1 Architecture Diagram of the NoSQL databases



Additionally, these databases are divided into four types based on the kind of data they handle: Document databases,
Key-value stores, Column-oriented databases, and Graph databases. These types are compared later in the paper on the
basis of specific metrics. We also highlight the usage of NoSQL databases in the financial sector.

NoSQL databases are used in major financial companies, where huge volumes of data are collected and analyzed. These
institutions must be able to store such growing amounts of data efficiently. Financial data refers to any kind of
data associated with financial institutions, markets, companies and performance. It can include the expenses,
profits, income and revenues of any big entity, as well as foreign exchange data, equities market data, options,
futures, and commodities like gold [4]. Real-time stock data is generated continuously, and an enormous amount of
financial data is produced in a short period. To handle such data, NoSQL databases are preferred over traditional
relational databases because API systems can be linked with them through third-party drivers, which bridge the gap
between the company providing the data and its storage in the NoSQL database. NoSQL follows “BASE” (basically
available, soft-state, eventual consistency), which is utilized for its speed, flexibility, and scalability [5]. A
NoSQL database should support multiple data models and should be easily scalable and flexible. This study analyses
the various NoSQL databases in terms of data formats and functionality, as well as the utilization of each type of
NoSQL database. In addition, this paper emphasizes the need to use the right type of NoSQL database for financial
data to address the numerous issues that financial organizations confront.

The paper discusses the related work in Section 2, the various types of databases and an overview of a few databases
in Section 3. Section 4 compares 5 different types of databases on various attributes like security, integrity and basic
characteristics. Section 5 describes the dataset utilized in the experiment as well as the experimental setup required
to carry out the implementation. Section 6 displays the outcomes of implementing financial data on MongoDB,
Cassandra, and Redis. Finally, Section 7 concludes the paper and discusses potential future work.

2. Comparison Between All NoSQL Databases


This section compares NoSQL databases on the basis of their primary attributes, integrity and distribution attributes
and security measures. The paper includes one popular database from each of the four types mentioned in section 3,
namely Key-value stores, Column Family stores, Document Stores and Graph databases (Redis, Cassandra,
MongoDB and Neo4j respectively). In addition, we have also added three more commonly used NoSQL databases
(CouchDB, HBase, DynamoDB) for comparison with the aforementioned databases.

2.1 Comparison of the Databases on Primary Attributes


The primary attributes for comparison of databases are the year of the initial release, the developers of the database,
the type of database as explained in Section 3, the data model or the construct used for creation of the database, the
data storage system, the data structure used for indexing of the data, available APIs for the accessing of the database
and stored procedures or server-side scripts in the database. Lastly, the characteristics of the CAP theorem which are
delivered by the database are also mentioned. Comparisons of databases on the basis of their primary attributes are
shown in table 1.

| Attribute | CouchDB | Neo4j | Redis | MongoDB | Cassandra | HBase | DynamoDB |
|---|---|---|---|---|---|---|---|
| Initial Release | 2005 | 2007 | 2009 | 2009 | 2008 | 2008 | 2012 |
| Developed By | Apache | Neo4j, Inc | Salvatore Sanfilippo | MongoDB Inc | Apache | Apache | Amazon |
| Language developed in | Erlang | Java | ANSI C | C++ | Java | Java | Various |
| Database Type | Document Oriented | Graph based | Key-Value Store | Document Oriented | Column Database | Wide Column Store | Key-value, Document Store |
| Data Model | BSON based document store | Graph based | Key-Value | JSON documents | Keyspaces, tables, and Data columns. | Components such as Tables, Rows, Column Families, Columns, Cells and Versions. | Limited key value store with JSON support |
| Data Storage | Disk | File System, Volatile Memory | File System, Volatile Memory | Disk | Disk | HDFS | SSD |
| Indexing Data Structure | B-Tree | Graph | Hashes and Sorted Sets | B-Tree | Hash indexes | LSM trees | B-Tree |
| APIs and other access methods | RESTful HTTP API/JSON API | Bolt protocol, Cypher query language, Java API, Neo4j-OGM, RESTful HTTP API, Spring Data Neo4j, TinkerPop 3 | Proprietary protocol | Proprietary protocol using JSON | Proprietary protocol, Thrift, CQL | Java API, RESTful, Thrift, Avro | RESTful HTTP API |
| Server-side Scripts (Stored Procedures) | View functions in JavaScript | Yes | Lua | JavaScript | No | No | Yes |
| CAP Theorem | AP | CA | CP | AP/CP | AP | AP | AP |
| Citations | [26] | [27, 28] | [27] | [26] | [8, 26] | [8, 26, 29] | [26, 29] |

Table 1 Comparison of databases on the basis of their primary attributes

2.2 Comparison between databases based on Integrity and Distribution Attributes

This section compares the databases on their integrity and distribution attributes. The integrity model refers to the
set of principles or philosophy used to design a database. Consistency is the requirement that each transaction
affects only the parts of the data that are intended to be changed, while the rest of the data remains consistent and
unchanged by the transaction. Atomicity means that an operation performed on the data is either executed completely
or not executed at all (operations cannot be partially executed). Durability ensures that the data is not lost and
stays in the database permanently; in case of a database or system failure, the data remains intact. A transaction is
a logical unit of work which allows operations to be performed on the data stored in the database, and each
transaction is atomic in nature. Table 2 states whether each database supports transactions. The replication model is
mainly of two types: Master-Slave replication and Peer-to-peer replication. In Master-Slave replication, one node has
authoritative control to replicate data across multiple nodes, while slaves synchronize with the master and usually
handle reads. In peer-to-peer replication, all replicas have equal authority and can accept writes as well as reads,
and the loss of any one of them does not affect access to the data store. Concurrency refers to the support for
concurrent data manipulation, that is, simultaneous WRITE operations carried out on a database. Since databases need
to scale to accommodate a large number of requests, features must be provided to queue and process requests made on a
global scale; this is done through concurrency control. Partitioning methods are the methodologies used to divide a
big database containing data metrics and indexes into partitions. Comparisons of databases on the basis of their
integrity and distribution attributes are shown in Table 2.

| Attribute | CouchDB | Neo4j | Redis | MongoDB | Cassandra | HBase | DynamoDB |
|---|---|---|---|---|---|---|---|
| Integrity Model | MVCC | ACID | - | BASE | BASE | Log Replication | ACID |
| Consistency | Eventual | Default configuration is Strong, among replica is Causal | Default configuration is Eventual | Strict, or Eventual | Eventually consistent | Strongly consistent reads and writes | Eventually consistent and strongly consistent reads |
| Atomicity | Yes | Yes | Yes | Conditional | Yes | Yes | Yes |
| Durability | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Transactions | No | Yes | Yes | No | No | Yes | No |
| Replication Model | Master-Slave | Master-Slave | Master-Slave | Master-Slave | Multi-Master | Master-Slave | Peer to Peer |
| Concurrency (Support for concurrent manipulation of data) | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Partitioning Methods | Sharding | Using Neo4J Fabric | Sharding | Sharding | Sharding | Sharding | Sharding |
| Citations | [8, 26] | [26, 28, 30] | [26, 27, 28, 30] | [8, 15, 26] | [8, 15, 26] | [26, 31] | [26, 31, 32] |

Table 2 Comparison of databases on the basis of their integrity and distribution attributes
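
As a rough illustration of the sharding entry in Table 2, the sketch below assigns keys to partitions with a stable
hash. It is a simplification under stated assumptions: the function names `shard_for` and `partition`, the use of an
MD5 digest, and the ticker keys are all hypothetical, and the databases compared here use their own, more elaborate
partitioning schemes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they would be stored on."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


tickers = ["AAPL", "MSFT", "GOOG", "AMZN", "TSLA", "JPM"]  # hypothetical keys
layout = partition(tickers, num_shards=3)
print(layout)
```

Because the hash is deterministic, a read for a given key can be routed to the same shard that received the write.
A plain modulo scheme like this one reshuffles most keys when the number of shards changes, which is one reason
production systems tend to prefer consistent hashing or range-based partitioning.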

2.3 Comparison of the Databases on Security


This section compares the databases on their security attributes. The first step in protecting the data is to
authenticate users. Authentication is the process of validating the identity of a specific individual or service
request. Access control is the process that ensures that only authorized users have access to system resources.
Depending on the database administrator's configurations, access control can be applied at the system, database,
object, and content levels. Secure configurations are universally applied configurations that system administrators
can use to secure databases at both the physical and logical levels. The key categories of secure configurations
include patch and update security, services, protocols, roles and accounts, files and directories, backups, ports,
and registries. In a database system, data and applications are encrypted to ensure their confidentiality; this
encompasses both data encryption at rest and data encryption in transit over the network. Database auditing is the
process of tracking and recording individual and group actions taken by database users. Auditing aids in detecting
footprints or suspected password-cracking efforts before an attack occurs. The comparison of databases on the basis
of their security features is shown in Table 3.

| Attribute | CouchDB | Neo4j | Redis | MongoDB | Cassandra | HBase | DynamoDB |
|---|---|---|---|---|---|---|---|
| Authentication | Medium | Medium | Low | Medium | Low | Medium | High |
| Access Control | Low | Medium | High | High | Low | Medium | Medium |
| Secure Configuration | Low | Medium | Low | Medium | Low | Low | Medium |
| Data Encryption | Medium | Medium | Low | Medium | Medium | Low | High |
| Auditing | Medium | Medium | Low | Low | Low | Medium | Medium |
| Citations | [31] | [33] | [31] | [31] | [31] | [31] | [34] |

Table 3 Comparison of databases on the basis of their security features

Criteria:
High: Provides complete support for all necessary features for data security
Medium: Only provides a limited range of security protections; it is recommended that missing aspects be implemented
Low: Offers only the most basic security features or none at all

3. Methodology
This section describes the dataset used for the experiment as well as the experimental setup used to perform the
tests.

3.1 Data Description


We have used two datasets to assess and compare the performance of different NoSQL databases on financial data: the
first is historical stock data [35, 36] and the second is provided by a financial institution, the Federal Deposit
Insurance Corporation (FDIC) [37]. The first dataset is obtained through the APIs of [35] and [36], which provide
stock data for various companies. The sizes of the JSON files used from [35] are 1.35 MB, 3.97 MB and 5.07 MB,
whereas the sizes of the JSON files used from [36] are 3.21 MB, 9.29 MB and 11.12 MB. The second dataset consists of
data for three quarters of the year 2021, with 64 CSV files for each quarter. It includes data on the Total Deposit
Present, Bank Assets Sold and Securitized, Changes in Bank Equity Capital, Maximum Amount of Credit Exposure Retained
and many more. The total size of the dataset is 300.2 MB. The three NoSQL databases used for comparison are MongoDB,
Cassandra and Redis. They were selected so that each belongs to a different kind of NoSQL database: Document-based
database (MongoDB), Column family store database (Cassandra) and Key-Value database (Redis). The latency, defined as
the read or write response time of the databases, was captured in milliseconds for READ and WRITE operations and
compared for both data sources. Two experiments were designed to calculate the latency. The first experiment uses
data from [35, 36] to perform READ and WRITE operations. The second experiment similarly uses data from [37][38] to
perform READ and WRITE operations.
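
To make the latency definition concrete, the following is a minimal sketch of how a response time in milliseconds
could be measured on the client side. It is an assumption for illustration only: the paper reports latencies gathered
from the databases' in-built analysis modules, whereas this sketch wraps an arbitrary callable with
`time.perf_counter`. The names `measure_latency_ms` and `fake_write` are hypothetical, and `fake_write` writes to a
Python dictionary rather than to a real database.

```python
import time


def measure_latency_ms(operation, *args, repeats=5):
    """Run `operation` several times and return the mean wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)


def fake_write(store, records):
    """Stand-in for a database WRITE; a real test would call a driver here."""
    for i, record in enumerate(records):
        store[i] = record


store = {}
records = [{"symbol": "AAPL", "close": 150.0 + i} for i in range(1000)]
latency = measure_latency_ms(fake_write, store, records)
print(f"mean write latency: {latency:.3f} ms")
```

Client-side timings of this kind include driver and network overhead, so they are generally not directly comparable
with server-reported execution times.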

3.2 Experimental Setup


The experiments were run on a machine with 64-bit Ubuntu 20.04 as the operating system, 4 GB RAM and an Intel(R)
Core(TM) i5-9300H CPU @ 2.40GHz. The versions of the MongoDB, Redis and Cassandra database systems used for
performing the experiments were 4.4, 6.2.6 and 4.0 respectively.

4. Purpose
This paper aims to compare different NoSQL databases and their characteristics relevant to the financial domain and
to show the applicability of various NoSQL databases to different use cases. It highlights the importance of NoSQL
databases in a world where the volume of data is increasing at a rapid rate. Given the volume of data generated by
the financial markets and the banking industry, it was important to emphasize data management in this context.

5. Major Research Findings


This section presents and analyzes the results of the experiments. The analysis is discussed in the subsections, and
the ideal choice in each test case is identified. The Python drivers for each of the databases are used to load the
data due to the large size of the dataset. Furthermore, the in-built analysis modules provided by each database are
used to capture the latency, that is, the execution time of the read or write operation, in milliseconds in each
experiment. These in-built modules, paired with the Python drivers, are used to efficiently collect and analyze the
large volume of data in the experiments.

5.1 First Experiment


This experiment uses the data from [35, 36] to compare the NoSQL databases on the basis of READ and WRITE
operations.

5.1.1 Comparison on the basis of WRITE operation


The data fetched from the APIs of [35] and [36] is added to each of the three databases, and the time taken to write
it into the databases is recorded.

Fig. 2 Write Latency for first experiment using data from [35]

Figure 2 shows the results for the WRITE operation, that is, the loading of the data, using the historical stock
dataset from the Financial Modeling Prep API [35]. The WRITE operation was performed on the data and the response
time in milliseconds was recorded as reported by the databases.

Fig. 3 Write Latency for first experiment using data from [36]

Figure 3 shows the results for the loading of data using the historical stock data from the Alpha Vantage API [36].
The WRITE operation was performed on the data and the response time in milliseconds was recorded as reported by the
databases.

From Figures 2 and 3, it is observed that MongoDB writes the given data considerably faster than Cassandra and Redis.
It can thus be inferred that, for small datasets such as those used in the experiment, MongoDB is the preferred
database. Redis can also be used where the data is naturally key-value in form, such as authentication details or
hashmaps. Cassandra, though a valuable tool for handling tabular data, may have a higher WRITE latency than MongoDB
and Redis for data of this size.

5.1.2 Comparison on the basis of READ operation

On storing the fetched data into these databases, the entire dataset was extracted using the respective query
languages of each of the three databases.

Fig. 4 Read Latency for first experiment using data from [35]

Figure 4 shows the results for the READ operation using the historical stock dataset from the Financial Modeling
Prep API [35]. The READ operation was performed on the data and the response time to fetch the data was recorded
in milliseconds as reported by the databases.

Fig. 5 Read Latency for first experiment using data from [36]

Figure 5 shows the results for the READ operation using the historical stock dataset from the Alpha Vantage API
[36]. The READ operation was performed on the data and the response time to fetch the data was recorded in
milliseconds as reported by the databases.

From Figures 4 and 5, it is observed that MongoDB is yet again faster than both Redis and Cassandra in fetching
data, although with the larger datasets the read latencies of the three databases do not differ greatly.
Additionally, it is observed that Cassandra is faster than Redis when performing read operations, and the difference
between their read latencies increases as the size of the data increases.

The faster performance of Cassandra relative to Redis can be attributed to its hash indexing, which performs well for
smaller data sizes. Redis’ main drawback is its limited key-value pair data structure, which may not accommodate
various types of data easily and can be cumbersome for storing data with a higher number of features. However,
MongoDB provides better results than both Cassandra and Redis and can be ideal for smaller amounts of data and
simple queries.

5.2 Second Experiment


This experiment uses the data from [37] to compare the NoSQL databases on the basis of READ and WRITE
operations. Comparison according to aggregation functions is also carried out.

5.2.1 Comparison on the basis of WRITE operation


The data fetched from [37] is added to each of the three databases and the time taken to write them into the
databases is recorded. The WRITE operation used in this experiment is an update operation where the second
occurrence of the string ‘New York’ in the city column of the Total Fiduciary and Related Assets table is set as
‘NYC’.
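
The logic of this update can be sketched independently of any database. The example below is an illustration under
stated assumptions, not the query used in the experiment: rows are held in memory as Python dictionaries, the column
name `city` follows the description above, and the function name `update_nth_occurrence` and the sample rows are
hypothetical.

```python
def update_nth_occurrence(rows, column, old, new, n=2):
    """Replace the n-th occurrence of `old` in `column`; return its row index or -1."""
    seen = 0
    for index, row in enumerate(rows):
        if row.get(column) == old:
            seen += 1
            if seen == n:
                row[column] = new
                return index
    return -1


rows = [
    {"bank": "A", "city": "New York"},
    {"bank": "B", "city": "Boston"},
    {"bank": "C", "city": "New York"},
    {"bank": "D", "city": "New York"},
]
updated_index = update_nth_occurrence(rows, "city", "New York", "NYC")
print(updated_index, [r["city"] for r in rows])
```

The update therefore has two parts, locating the second matching record and then modifying it, and the measured
WRITE latency in each database covers both.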

Fig. 6 Write Latency for second experiment using data from [37]

It can be observed from Figure 6 that the write latency of MongoDB is the lowest, outperforming both Redis and
Cassandra in this experiment. This can be attributed to the data storage structure (JSON-based documents) and
indexing using B-Trees, which help to locate the second occurrence of the string ‘New York’ relatively faster. Redis
comes in a close second with its key-value structure of data storage, where searching through a list data structure
is performed through a quick ‘LINDEX’ query. Cassandra follows the Column-family keyspaces as its data model, which
makes the retrieval of data more time-consuming. Though it uses hash indexes for indexing of data, the searching and
updating of data may not be time efficient.

It is inferred that MongoDB, due to its simple data structure, performs WRITE operations faster. Redis can also be a
preferred option, but its data model can easily become too simple for complex data. Cassandra can be preferred in the
case of data which needs to be stored as tables.

5.2.2 Comparison on the basis of READ operation


The fetched data is stored into the databases used for the experiment. Here, two test cases are considered. The first
test case involves retrieval of data from all csv files belonging to the first quarter of the year, for which the results
are shown in figure 7. The second test case involves retrieval of the whole dataset by each of the databases, for
which the results are shown in figure 8.

Fig. 7 Read Latency for first quarter of the [37] dataset

After analyzing the results of the first test case, it can be observed that MongoDB outperforms the other two
databases. MongoDB’s B-Tree indexing data structure helps to carry out complex queries faster. It can also be noted
that the JSON-document based data model enables quicker searching of values for keys in the documents. Redis’
key-value data model helps to retrieve keys according to a certain pattern. There is, however, a disadvantage to the
Redis indexing methodology: Redis cannot perform complex queries if the metadata is not present in the key (e.g.
metadata involving the quarter and the table name). Redis’ key-value data structure might provide a lower latency but
may not be well-suited for more complex queries, which may require multiple operations to be performed. Cassandra has
a higher latency than Redis, but its indexing method helps to execute complex queries. Furthermore, with the
introduction of secondary queries through third-party additions, complex queries can be executed with a single
command.
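
The point about key metadata can be made concrete with a small sketch. It does not talk to Redis: a Python dictionary
stands in for the key space, and `fnmatchcase` from the standard library approximates glob-style key matching. The key
layout `fdic:<quarter>:<table>:<row>` and the helper names `make_key` and `keys_matching` are assumptions made for
illustration, not the layout used in the experiment.

```python
from fnmatch import fnmatchcase


def make_key(quarter, table, row_id):
    """Build a key that carries the quarter and table name as metadata."""
    return f"fdic:{quarter}:{table}:{row_id}"


def keys_matching(store, pattern):
    """Return the keys matching a glob-style pattern, in sorted order."""
    return sorted(k for k in store if fnmatchcase(k, pattern))


store = {
    make_key("q1", "deposits", 1): "row-a",
    make_key("q1", "equity", 1): "row-b",
    make_key("q2", "deposits", 1): "row-c",
}
first_quarter = keys_matching(store, "fdic:q1:*")
print(first_quarter)
```

If the quarter were not encoded in the key, selecting the first-quarter records would require reading every value and
filtering on the client, which is the kind of multi-step operation described above.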

It can thus be inferred that MongoDB is well-suited for complex READ queries with a high number of conditions for
data retrieval, due to its low latency and convenient querying syntax. Redis provides a lower latency than Cassandra
according to the experiments but can be syntactically more complex. Cassandra offsets its higher latency by providing
easier-to-use indexing queries.

Fig. 8 Read Latency for the whole [37] dataset



The second test case provides interesting results: Redis outperforms MongoDB even though MongoDB showed lower
latency in the first test case. It can be inferred that, as the size of the data increases, Redis’ key-value data model
shows better results than MongoDB. However, it should be noted that Redis’ list data structure was used to store
the data in this experiment, which might become cumbersome as the number of features in the data increases. This
can be offset by using a JSON-based data structure for storage in Redis. However, since the indexing structure for
Redis would remain the same, MongoDB would eventually provide a convenient yet fast solution for retrieval of
large amounts of data.
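
The trade-off between a list-based and a JSON-based record can be illustrated with a minimal Python sketch. The column names and values below are hypothetical placeholders rather than fields of the dataset [37]; the list stands in for a Redis list value and the JSON string for a JSON-encoded value.

```python
import json
from typing import List

# Hypothetical column order; a list-based record is only meaningful alongside it.
COLUMNS: List[str] = ["cert", "quarter", "total_assets", "net_income"]

# Positional storage, as with a list value: fields are identified by index only.
row_as_list: List[str] = ["12345", "Q1", "1000", "50"]
net_income_from_list = row_as_list[COLUMNS.index("net_income")]

# JSON storage: each value carries its field name, so lookups do not depend on order.
row_as_json: str = json.dumps(dict(zip(COLUMNS, row_as_list)))
net_income_from_json = json.loads(row_as_json)["net_income"]

print(net_income_from_list, net_income_from_json)
```

With the list form, every reader must know the column order, and adding a feature shifts or extends the positions; the JSON form avoids this at the cost of storing the field names with every record.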

To conclude the analysis of the second experiment, MongoDB provides a convenient and fast solution and keeps the
database design simple as the amount of data grows. Redis can be used if the number of features is small, as it
proves to be faster. However, if the number of features or columns grows, MongoDB and Cassandra are the
preferred options.

5.2.3 Comparison on the basis of Aggregation functions


Aggregation queries lie at the heart of financial analysis. Hence, a comparison has been conducted using an
aggregation query that might be commonplace at a financial institution. In this query, the average of the dividends
paid on common and preferred stock for the second quarter in the dataset is calculated, and the latency of
processing the query is recorded.

Cassandra and MongoDB are better suited to aggregation functions than Redis. One of the main reasons is that
Redis has a key-value data model, which is more appropriate for data retrieval and manipulation. Another reason is
that the standard way to perform aggregation on a list data structure in Redis is through a driver for a preferred
programming language (for example, Lua). On the other hand, Cassandra’s column-family structure allows
aggregation functions to be used at the column level. Similarly, MongoDB’s query language can apply aggregation
functions to numerical data in documents. Hence, the comparison focuses on Cassandra and MongoDB.
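
To make the form of such an aggregation concrete, the sketch below expresses the second-quarter average in three ways: as a MongoDB aggregation pipeline (using the $match, $group and $avg operators), as a CQL statement using Cassandra's built-in AVG function, and as a plain Python reference calculation. The field name dividends, the table name financials, the assumption that quarter is part of the partition key, and the sample rows are all hypothetical; they are not the queries or data used in the experiments.

```python
from typing import Any, Dict, List, Optional

# MongoDB: pipeline as it could be passed to a collection's aggregate() call.
mongo_pipeline: List[Dict[str, Any]] = [
    {"$match": {"quarter": "Q2"}},
    {"$group": {"_id": None, "avg_dividends": {"$avg": "$dividends"}}},
]

# Cassandra: CQL statement, assuming `quarter` is part of the partition key.
cql_query = "SELECT AVG(dividends) FROM financials WHERE quarter = 'Q2';"


def average_dividends(rows: List[Dict[str, Any]], quarter: str) -> Optional[float]:
    """Reference calculation: mean of `dividends` over the rows of one quarter."""
    values = [row["dividends"] for row in rows if row["quarter"] == quarter]
    return sum(values) / len(values) if values else None


# Made-up rows for illustration only.
sample_rows = [
    {"quarter": "Q1", "dividends": 99.0},
    {"quarter": "Q2", "dividends": 10.0},
    {"quarter": "Q2", "dividends": 20.0},
]

print(average_dividends(sample_rows, "Q2"))
```

The CQL form reads much like SQL, whereas the MongoDB form is a staged pipeline of operators.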

Fig. 9 Aggregation function latency using the dataset [37]

Figure 9 shows the query execution latency. MongoDB has a lower latency than Cassandra for executing the query.
Therefore, in terms of latency, MongoDB will be the preferred choice. However, for a financial institution shifting
from an RDBMS to a NoSQL database, aggregation queries in MongoDB might be cumbersome to use. Using
complex queries in MongoDB involves a learning curve, whereas Cassandra’s column-family structure is similar to
an RDBMS table structure. This makes the execution of aggregation functions straightforward and similar to
queries in an RDBMS querying language (like SQL). Hence, if convenience is taken into account as a parameter,
Cassandra can prove to be the preferred choice over MongoDB.

6. Practical Implications
With the financial markets and banking business producing so much data, there is a need to examine data
management in this area. After examining many publications and research papers on the type of data used in this
field, it was found that NoSQL databases are best suited to this data. One such publication that explores the use of
NoSQL databases in the banking sector is [38], where data related to customer segmentation and risk management
is used.

Any domain that produces data that is dynamic, has multiple formats, and may undergo structural changes on a daily
or periodic basis can use this work as a reference point. NoSQL databases provide this flexibility, variety in the
types of documents that can be stored, and horizontal scalability via sharding or server addition.

Hence, the work in this paper can be used as a point of reference by professionals working not only in the
financial domain but in any domain with dynamic data.

7. Conclusion & Future Research Work


With the growing complexity and size of the data that needs to be stored, NoSQL databases have gained popularity
in the industry. Across a wide range of applications, from data management to cache storage, NoSQL databases have
surpassed traditional database management systems in performance and storage efficiency. One such application is
the storage and retrieval of historical financial data, where data is added on a regular basis, making it necessary to
analyze and decide which database is best suited to the task. In the experiments performed, using historical stock
data and data provided by a financial institution, the Federal Deposit Insurance Corporation (FDIC), the execution
times of READ and WRITE queries were recorded. It can be concluded that MongoDB performed better than both
Cassandra and Redis with respect to the latency of both READ and WRITE queries, and that Cassandra, despite
being slow for WRITE queries, performs well for READ queries on both small and large databases. Another
conclusion that can be drawn from the experiments is that each NoSQL database has certain characteristics and
architectural differences that make it suitable for a specific use case. Neo4j, which is a graph NoSQL database, is
better suited to connected data. On the other hand, key-value databases such as Redis are better suited to key-value
mapped data and find better applications in caching systems.

References

[1] Győrödi, Cornelia, et al. "A Comparative Study of Databases with Different Methods of Internal Data
Management." Database 7.4 (2016)
[2] Jatana, Nishtha, et al. "A survey and comparison of relational and non-relational database." International Journal
of Engineering Research & Technology 1.6 (2012): 1-5.
[3] M. Borad, T. Chauhan, and H. Mehta, “A Review on NoSQL Databases,” 2017. [Online]. Available:
https://www.researchgate.net/publication/327883334.
[4] Hu, Pan. "Discussion and Sample Test of Big Financial Data Applied to NoSQL Database." (2013).
[5] Chen, Jeang-Kuo, and Wei-Zhe Lee. "An Introduction of NoSQL Databases based on their categories and
application industries." algorithms 12.5 (2019): 106.
[6] Lu, Hongjun, Hock Chuan Chan, and Kwok Kee Wei. "A Survey on Usage of SQL." ACM SIGMOD Record
22.4 (1993): 60-65.
[7] Batra, Rahul. "A history of SQL and relational databases." SQL Primer. Apress, Berkeley, CA, 2018. 183-187.
[8] Khasawneh, Tariq N., Mahmoud H. AL-Sahlee, and Ali A. Safia. "SQL, NewSQL, and NOSQL Databases: A
Comparative Survey." 2020 11th International Conference on Information and Communication Systems (ICICS).
IEEE, 2020.
[9] Malik, Ahsan, Aqil Burney, and Fawad Ahmed. "A Comparative Study of Unstructured Data with SQL and NO-
SQL Database Management Systems." Journal of Computer and Communications 8.4 (2020): 59-71.

[10] Sahib, Sri Guru Granth. "A review of non relational databases their types advantages and disadvantages."
International Journal of Engineering &Technology 2.2 (2013).
[11] Padhy, Rabi Prasad, Manas Ranjan Patra, and Suresh Chandra Satapathy. "RDBMS to NoSQL: reviewing some
next-generation non-relational database’s." International Journal of Advanced Engineering Science and
Technologies 11.1 (2011): 15-30.
[12] Angles, Renzo. "A comparison of current graph database models." 2012 IEEE 28th International Conference
on Data Engineering Workshops. IEEE, 2012.
[13] Aqel, Musbah J., Aya Al-Sakran, and Mohammad Hunaity. "A Comparative Study of NoSQL Databases."
[14] Sumalatha, V., and Suresh Pabboju. "Overview of NoSQL Databases and A Concise Description of MongoDB."
[15] Abramova, Veronika, and Jorge Bernardino. "NoSQL databases: MongoDB vs cassandra." Proceedings of the
international C* conference on computer science and software engineering. 2013.
[16] Saba, Amir Mohammad, et al. "A Comparative Analysis of XML Documents, XML Enabled Databases and
Native XML Databases." arXiv preprint arXiv:1707.08259 (2017).
[17] Rafique, Ansar. "Evaluating NOSQL Technologies for Historical Financial Data." (2013).
[18] Rodrigues, Romulo Alceu, et al. "Integrating NoSQL, relational database, and the hadoop ecosystem in an
interdisciplinary project involving big data and credit card transactions." Information technology-new generations.
Springer, Cham, 2018. 443-451.
[19] Holanda, Maristela. "Performance Analysis of Financial Institution Operations in a NoSQL Columnar
Database." 2020 15th Iberian Conference on Information Systems and Technologies (CISTI). IEEE, 2020.
[20] “Introduction - Comparing the Use of Amazon DynamoDB and Apache HBase for NoSQL,”
docs.aws.amazon.com. [Online]. Available:
https://docs.aws.amazon.com/whitepapers/latest/comparing-dynamodb-and-hbase-for-nosql/introduction.html.
[Accessed: 20-Jul-2021]
[21] A. Nayak, A. Poriya, and D. Poojary, “ISSN : 2249-0868 Foundation of Computer Science FCS,” Int. J. Appl.
Inf. Syst., vol. 5, no. 4, 2013, Accessed: Jul. 30, 2021. [Online]. Available: www.ijais.org.
[22] Y. H. Rasheed, M. H. Qutqut, F. Almasalha, and Y. Rasheed, “Overview of the Current Status of NoSQL
Database,” IJCSNS Int. J. Comput. Sci. Netw. Secur., vol. 19, no. 4, p. 47, 2019, Accessed: Jul. 30, 2021.
[Online]. Available: https://www.researchgate.net/publication/336746925.
[23] “Apache Cassandra,” Apache.org, 2016. [Online]. Available: http://cassandra.apache.org/
[24] M. U. Hassan, I. Yaqoob, S. Zulfiqar, and I. A. Hameed, “A Comprehensive Study of HBase Storage
Architecture-A Systematic Literature Review,” 2021, doi: 10.3390/sym13010109.
[25] H. Krishnan, M. Sudheep. Elayidom, and T. Santhanakrishnan, “MongoDB – a comparison with NoSQL
databases,” International Journal of Scientific and Engineering Research, vol. 7, no. 5, pp. 1035–1037, May 2016.
[26] “Elasticsearch vs. Neo4j vs. Redis Comparison,” db-engines.com. [Online]. Available:
https://db-engines.com/en/system/Elasticsearch%3BNeo4j%3BRedis. [Accessed: 19-Jul-2021]
[27] M. Shertil, “TRADITIONAL RDBMS TO NOSQL DATABASE: NEW ERA OF DATABASES FOR BIG
DATA,” in Journal of Humanities and Applied Science (JHAS), 2016, no. 29.
[28] A. B. M. Moniruzzaman and S. A. Hossain, ‘‘NoSQL database: New era of databases for big data analytics—
Classification, characteristics and comparison,’’ 2013, arXiv:1307.0191. [Online]. Available:
https://arxiv.org/abs/1307.0191
[29] “Introduction - Comparing the Use of Amazon DynamoDB and Apache HBase for NoSQL,”
docs.aws.amazon.com. [Online]. Available:
https://docs.aws.amazon.com/whitepapers/latest/comparing-dynamodb-and-hbase-for-nosql/introduction.html.
[Accessed: 20-Jul-2021]
[30] M. Diogo, B. Cabral, and J. Bernardino, “Consistency Models of NoSQL Databases,” Future Internet, vol. 11,
no. 2, p. 43, Feb. 2019, doi: 10.3390/fi11020043.
[31] A. Zahid, R. Masood and M. A. Shibli, "Security of sharded NoSQL databases: A comparative analysis," 2014
Conference on Information Assurance and Cyber Security (CIACS), 2014, pp. 1-8, doi:
10.1109/CIACS.2014.6861323.
[32] G. Decandia et al., Dynamo: Amazon’s Highly Available Key-value Store. 2007.

[33] Inc, Neo4j. “Security - Operations Manual.” Neo4j Graph Database Platform,
neo4j.com/docs/operations-manual/current/security/. [Accessed: 20-Jan-2022]
[34] Inc, Amazon Web Services. “What Is Amazon DynamoDB? - Amazon DynamoDB.” Docs.aws.amazon.com,
docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html. [Accessed: 20-Jan-2022]
[35] “Financial Modeling Prep - FinancialModelingPrep,” Financial Modeling Prep. [Online]. Available:
https://financialmodelingprep.com/. [Accessed: 23-Jul-2021]
[36] “Free Stock APIs in JSON & Excel | Alpha Vantage,” www.alphavantage.co. [Online]. Available:
https://www.alphavantage.co/. [Accessed: 23-Jul-2021]
[37] “FDIC: Prezipped Large Download files List,” www7.fdic.gov.
https://www7.fdic.gov/sdi/download_large_list_outside.asp. [Accessed: 25-Jan-2022]
[38] Shakya, Subarna, and S. Smys. "Big Data Analytics for Improved Risk Management and Customer Segregation
in Banking Applications." Journal of ISMAC 3, no. 03 (2021): 235-249.
