
Adama Science & Technology University

School of Electrical Engineering and Computing


Department of Computer Science and Engineering

MSc in Computer Science and Engineering

Course Title: Big Data Analytics & Visualization (CSEg6419)


Program: Weekend    Year: 2nd    Semester: 2nd
Assignment: Report on NoSQL Databases: HBase and MongoDB
Selected Title: Comparative Performance Evaluation of NoSQL
Databases: HBase and MongoDB

Name ID Number
Nuredin Abdellah…………………………… PGE/28278/15

Dita Kebede………………………………… PGE/28270/15

Semira Ahmed……………………………… PGE/28280/15

Submitted to: Getinet Yilma (PhD)

Date: 19/04/2024
1. Introduction
The rapid advancement of technology and the exponential increase in internet users worldwide
have resulted in a rapid increase in the number of data sources, which in turn produce a huge
amount of data that needs to be managed. The database is pivotal for an organization, as it
maintains the legacy data that can help it make better decisions, as well as handling the
day-to-day operational data. A relational database uses a relational data model organized in
tables with rows and columns, with predefined relationships that can be redefined in many
ways. To communicate with the database, it uses SQL (Structured Query Language), the
standard language for RDBMSs such as MySQL and Oracle. It has been used in industry for a long
time because of the advantages it offers, such as data security, easy extendibility, and the power
to work with multiple data requirements. As technologies advanced and the IoT emerged, the huge
number of people able to access the internet led to an increase in data, especially unstructured and
semi-structured data with multiple data types, which cannot be handled by relational
databases with their tabular structure; scalability is another limiting factor. Companies wanted to
invest in systems that require less maintenance yet provide productivity and performance, and to
identify alternatives to expensive proprietary software and hardware. Another inclining factor
was agility, as enterprises look for faster time to market and embrace agile development
methodologies [1].

Relational databases were not able to meet the storage requirements arising from this new demand
and from growing databases with high operation rates. NoSQL emerged not to replace relational
databases but as a backend to support the new demand and the growing data of Big Data
applications. Relational databases follow the ACID (Atomicity, Consistency, Isolation, and
Durability) properties, while NoSQL databases follow the BASE (Basically Available, Soft state,
Eventual consistency) properties [2]. NoSQL stands for "Not only SQL" and is useful for working
with large distributed data sets. Due to smooth, uncomplicated cloud deployment, schema-free
data models, the ability to manage large data, and built-in scalability, NoSQL is widely used on
cloud platforms. A wide variety of NoSQL databases have been developed to cater to Big Data
applications; major contributors include Google's BigTable, Amazon's Dynamo,
Facebook's Cassandra, Oracle's NoSQL DB, MongoDB, and Apache's HBase. The retrieval
time for information from big data should remain low and scale at the same rate as the data
increases [3], which can be achieved with NoSQL.

Motivations of NoSQL design:-

➢ Easier cloud deployment
➢ Large-scale data
➢ Meeting scalability requirements, i.e. scale-up and scale-out. Scale-up (vertical) means
the number of servers stays the same but the performance of the hardware is increased,
e.g. by adding RAM or increasing clock speed. Scale-out (horizontal) means adding more
servers to the existing cluster.
➢ It can be used as a caching layer for storing cached data [2].
➢ Zero downtime and flexibility.

At the early stage of NoSQL, the biggest hurdle was weak consistency. But as the technology
advanced and more development was done, new features led to many NoSQL databases offering
strong consistency, MongoDB among them. Strong consistency means that once data is written,
it is available for reading, and a consistent view of the data is returned when it is queried.
There are four main categories of NoSQL databases, based on the type of dataset and how it is
represented [4]:

a. Key-value store – Each value that is stored is combined with a unique key. Examples:
Amazon DynamoDB, Hazelcast.

b. Wide-column store – Column-oriented and effective for storing large volumes of data
of the same type. Examples: Cassandra, HBase.

c. Document store – Keys are associated with documents that are schema-less (no format
imposed). Examples: MongoDB, Microsoft Azure DocumentDB.

d. Graph store – Graph-oriented and used when relationships exist between data elements.
Examples: Neo4j, Titan.

Although NoSQL technology has developed quickly, the performance comparison between the
databases is still not very clear. Different databases have their own implementation structures,
storage qualities, designs, and optimization techniques, which makes it harder to decide which
NoSQL database to use. In this paper, we test NoSQL performance to help decide which NoSQL
database is better in which situation.

We selected two popular NoSQL databases, MongoDB and HBase. Two workloads are run
against both databases, Workload A (50% read and 50% update) and Workload D (95% read and
5% insert), to evaluate and compare their performance based on latency and overall throughput.
The benchmarking is done using the YCSB (Yahoo Cloud Serving Benchmark) tool, an open
system for evaluating NoSQL databases. We report the average latency and overall throughput
of the operations in each workload.

2. Related Work
Data is increasing at a massive rate and will continue to grow in the near future. It is not only
the "volume" and "velocity" but also the "variety" of data that is increasing due to the vast usage
of the internet. For working with Big Data, a major part of which is unstructured, NoSQL
databases such as HBase and MongoDB are used. They are schema-independent and hence can be
used for a variety of applications. Many NoSQL databases are available, but the choice depends
on the organization's requirements and on what kinds of operations will be performed. Testing
the performance of NoSQL databases is therefore a key factor in deciding which NoSQL database
an enterprise should adopt. Much research has been carried out in the past on both the
quantitative and the qualitative analysis of NoSQL databases.
Many studies have examined numerous NoSQL database solutions in an attempt to understand
their characteristics across different functional and operational requirements, thus supporting
the decision-making process of selecting a NoSQL database solution that meets the requirements
of a project.
The author in [5] compares several NoSQL databases: MongoDB, HBase, Cassandra, and Riak
are compared from different viewpoints. The tests were performed on the stated NoSQL
solutions with the help of the Yahoo Cloud Serving Benchmark (YCSB) by applying different
workloads. The author finds that every solution performs differently under different workloads
because of differences in their design models.
[6] experimented to guide the choice of a suitable NoSQL database for an application. The NoSQL
databases considered were Cassandra, HBase, and MongoDB, and the benchmarking tool used was
YCSB. They tested for defined workload conditions (A, B, C, D, E) and dataset sizes. For
Workload A, with 50% read and 50% write, it was observed that Cassandra had the better
throughput overall, but for medium or small data (1 GB to 4 GB), HBase performed better, with
20% more throughput than Cassandra. For Workload B, with 100% read, MongoDB was the clear
winner; the reason could be the format in which the data is stored (BSON documents), which
favors read-only operations. For Workload C, with 100% blind write, HBase was the obvious
choice irrespective of the size of the database and cluster, with much higher throughput than the
others, indicating that HBase handles huge amounts of write operations effectively. For
Workload D, with 100% read-modify-write, the results were similar to A: Cassandra was the best
choice overall, while HBase performed better for small data sets. For Workload E, with 100%
scan, Cassandra had the best throughput for large data sets while HBase performed better for
small data sets; MongoDB is not suitable for range queries.
[7] performed a comparative analysis of three databases, MongoDB, Cassandra, and HBase, using
YCSB. The databases were tested on three different workloads (50% read/50% update, 100%
read/0% update, and 0% read/100% update), and the overall result was that Cassandra
outperformed the other databases. The replication factor was also studied in this experiment:
as the replication factor increased, the performance of Cassandra and MongoDB decreased
while HBase gave better throughput.
[8] performed performance testing of NoSQL databases using the YCSB benchmark tool. The
databases considered were HamsterDB, BrightstarDB, LevelDB, RavenDB, and STSdb, and the
parameters taken into account were write speed and read speed. The overall conclusion was that,
for insertion speed, HamsterDB was much better than the others while BrightstarDB was the
worst in this scenario. For read speed, STSdb and LevelDB gave the highest performance, and
BrightstarDB's throughput was much lower than the others'.
[9] compared five NoSQL databases, HBase, Redis, Cassandra, MongoDB, and Couchbase, using
the YCSB tool. The aspects taken into account were data loading and workload execution. Redis
achieved the best throughput for data loading because of its semi-persistent approach; Cassandra
and HBase were almost 1.7 and 1.8 times slower, respectively, and Couchbase performed the
worst. Workload execution was performed using Workloads A, C, and H. Overall, Redis
overshadowed all the other databases, maintaining the highest throughput. HBase and Cassandra
were the slowest due to their column-family structure, while MongoDB and Couchbase gave
better performance due to their document-type storage technique, which provides asynchronous
reads and writes. When the operation count increased beyond a certain limit, MongoDB's overall
throughput decreased significantly.
[3] also performed a performance evaluation of the NoSQL databases Cassandra, MongoDB, and
HBase using YCSB. Factors such as the number of cores, the number of nodes, and replication
were considered. The number of cores on a single node was increased while keeping the number
of threads constant; HBase used the added cores wisely and showed low update latency. When the
databases were run on a virtual machine, Cassandra showed the highest latency and MongoDB
the lowest. With an increase in the number of nodes, the throughput of Cassandra increased; in
terms of absolute performance, HBase gave higher throughput. Replication is an impacting
factor, and increasing it degrades throughput, as in the case of MongoDB. For Cassandra, a
replication factor greater than 1 did not hamper performance, and there was no significant change
in either the throughput or the response time of HBase.
[4] also performed a comparative study of HBase versus MongoDB using YCSB, with results
broadly similar to the findings above. HBase gave better insertion times than MongoDB. In
read-heavy operations, MongoDB outperformed HBase, while in update-mostly or 100% update
operations HBase was much better than MongoDB. Overall, HBase performed better in write
operations, whether inserting or updating records, and MongoDB was better in read operations.
Scalability, availability, consistency, and request and response time are some of the important
metrics and benchmarks for evaluating the performance of NoSQL database systems [10]. An
experimental evaluation of NoSQL databases was given in [11]; this work compares the
performance of various NoSQL databases by storage and retrieval time metrics, and further
evaluates their performance with respect to cloud computing systems. Some benchmarks for
assessing the performance of NoSQL databases relative to relational databases are given in [12];
this work examines the performance of databases such as MongoDB, PostgreSQL, and SQLite3
using workload benchmarks such as the number of messages inserted, the size of the messages,
and the number of topics inserted, and categorizes the performance of the three databases with
respect to robotics logging applications. The metrics and benchmarks defined by this work
nevertheless form a basis for other big data applications. In [13], an analysis of various NoSQL
databases and their performance measures is described in detail; the performance of widely
adopted databases such as HBase, Cassandra, MongoDB, and Redis was comparatively evaluated
using the YCSB benchmark tool, with transaction processing time as the metric. Thus, through
the adoption of standard benchmarks and metrics, the performance of NoSQL databases can be
quickly evaluated.

The overall finding from the above research is that different NoSQL databases perform
differently under varying conditions; the choice of the appropriate database depends on the
requirements of the application.

3. Proposed Methods

The challenges of traditional RDBMS systems gave rise to the evolution of NoSQL databases.
NoSQL databases use a schema-less data model that supports scalable replication and
distribution. The shared-nothing architecture of NoSQL databases enables them to run across a
large number of nodes and provides higher performance per node than traditional database
systems. Further, NoSQL databases possess non-locking concurrency control mechanisms, so
that there are no conflicts between real-time read and write operations; NoSQL database systems
thereby preserve consistency properties, whereas in traditional systems consistency becomes the
bottleneck when dealing with scalability. In general, NoSQL databases fall into four major
categories: document stores, key-value stores, graph stores, and column family stores.

Key-Value Data Stores

This is a simple form of NoSQL database system that makes use of dictionary-like data
structures. It provides data access through keys, which act as unique identifiers for the data
contents. It maps every attribute to a separate key, and each key represents a value; the value, in
turn, represents a set of data. It does not adopt any structured query language and offers greater
flexibility. In a key-value data store, a client can get, add, or delete a value for a key; in general,
it stores and retrieves data using key-value pairs. For example, to represent the treatments
undergone by a particular patient, a key k1 can be mapped to the value of treatment T1; similarly,
the keys k2 and k3 are mapped to treatments T2 and T3 of the patient P1. Some well-known
examples of key-value data stores include Redis, Riak, Couchbase, Memcached, BerkeleyDB,
and upscaledb.

Document Databases

Document databases store multiple attributes in a single document rather than storing every
attribute under its own key. Upon the addition of documents, the system builds the data
structures required to support them. These database systems are highly flexible and differ from
traditional RDBMS systems. In general, document databases make use of JSON (JavaScript
Object Notation) and XML (Extensible Markup Language) for querying purposes. In document
database systems, a document may contain multiple embedded documents and lists of multiple
values within it. A major advantage of these databases is that they support querying by various
attributes. For example, to cluster the list of patients with low blood pressure, an identifier for
the list is stored in a document, and within that document all the patient documents with low
blood pressure are embedded; the identifiers allow easy reference to the patients' attributes.
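As an illustration of the embedding described above, a hypothetical JSON document for the low-blood-pressure example might look like this (the field names are assumptions for illustration, not taken from any particular schema):

```json
{
  "_id": "low_blood_pressure_patients",
  "condition": "low blood pressure",
  "patients": [
    { "patientId": "P1", "treatments": ["T1", "T2"] },
    { "patientId": "P2", "treatments": ["T3"] }
  ]
}
```

A query on an inner attribute such as patients.patientId can then retrieve the embedded patient documents directly.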

Column Family Stores

In column family data stores, data is stored in the form of columns, and a set of columns forms
a row. A row can have any number of columns corresponding to it. A column family represents
a group of related data contents that can be accessed together. In column family stores, a key
identifies a row, and a row can have multiple columns. Not all rows need to have the same
columns, and a column can be added to a row at any time without affecting other values. These
stores are designed for rows with many columns and can handle even millions of columns. For
example, patient name and ID may frequently be used together, so these two columns form part
of a row; similarly, patient disease and medication details are often used together, and all of
these are grouped into a collection called Patient Diagnosis (a column family).

Graph Stores

Graph data stores work with nodes and relationships. Nodes represent objects with a set of
identifiers, and relationships define the links between two nodes. Their applications are prevalent
across social media systems such as Facebook and Twitter. Nodes and relationships may form
complex structures. Well-known examples of graph data stores are Neo4j, InfiniteGraph, and
OrientDB.

3.1. HBase

HBase is based on HDFS (Hadoop Distributed File System) and the MapReduce framework. It
belongs to the wide-column store family of NoSQL and is schema-less. HBase is an
extensible-record type of data store. HBase has an architecture similar to Google's Bigtable and
is based on the same concepts, but it is an open-source column-oriented database solution that is
automatically distributed and scaled over Hadoop. To recover from failures automatically, it
uses replication, a write-ahead logging (WAL) approach, and distributed configuration. A query
from a client in HBase is redirected to a specific RegionServer after a lookup is performed in the
-ROOT- and .META. catalog tables [14]. An HBase cluster uses a single-master, multiple-slave
design, where each RegionServer holds multiple regions that store the data tables. If a single
table becomes too big, it is distributed among different regions [15].

3.1.1. Key features of HBase

HBase is also called the Hadoop database, as it is mainly installed on top of Hadoop; it thus has
all the advantages of a distributed file system and the MapReduce model. HBase can access
sparse data, i.e. small amounts of valuable data within the heap of the massive volume of
unstructured data used in Big Data analysis. HBase has abilities such as consistent reads and
writes, data replication across clusters, and automatic failure reporting.

Key features are [16]:

➢ It provides horizontal scalability by adding more servers or machines to the pool of
resources. Horizontal scalability is provided by the "Region" mechanism.
➢ It provides fault-tolerant storage for sparse data, i.e. data that is very small but immensely
valuable within a huge volume of unstructured data.
➢ It supports parallel processing of large-volume data (HDFS and MapReduce) and has a
highly adaptive data model.
➢ It can host large tables on top of clusters of inexpensive, widely available hardware with
unstructured data, and performs real-time lookups as it uses hash tables.
➢ It facilitates automatic load balancing of tables.
➢ It provides an easy Java API for clients, giving programmatic access.
➢ Data replication is possible across clusters.
➢ Consistent reads and writes are possible, which is required for high-speed performance
and high-speed counter aggregation.
➢ Automatic sharding is provided in HBase: when a region's size reaches a threshold value,
HBase splits it into smaller sub-regions as the data grows, reducing IO time and overhead.
➢ For optimization and real-time query processing, HBase provides a block cache and
Bloom filters.
➢ It facilitates high write throughput, along with strong security and easy management.

3.1.2. HBase structure and architecture

As stated earlier, HBase belongs to the wide-column family of NoSQL. It is based on the Hadoop
MapReduce framework and runs on top of HDFS (Hadoop Distributed File System). The HBase
table schema defines column families with key-value pairs.

Running on top of HDFS enables low-latency read and write operations. HBase stores tables as
multidimensional sparse maps with rows and columns, enabling random, real-time read and
write access. Each cell in HBase is identified uniquely by table, row, column family, and
timestamp. Tables in HBase can be used directly as targets for MapReduce jobs with the help of
the Java client APIs, as sketched below. The management of partial failures in the database is
handled by ZooKeeper, another open-source Apache project, which also provides capabilities
such as maintaining configuration information and distributed synchronization. ZooKeeper's
ability to track failures and network partitions comes from the ephemeral nodes that represent
each RegionServer.
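As a concrete illustration of this cell model, here is a minimal sketch using the HBase 1.x Java client API. The table name usertable and column family cf1 match the table created in Section 4, while the row key, column qualifier, and value are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("usertable"))) {
            // Write one cell: addressed by row key, column family, and
            // column qualifier (a timestamp is assigned automatically).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("field0"),
                          Bytes.toBytes("value0"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf1"),
                                           Bytes.toBytes("field0"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```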

Key components of Apache HBase [15]:

Hadoop: It provides the underlying framework (HDFS with MapReduce) that gives high
throughput and easy access to application data; it also governs replication.

Hadoop YARN: It provides a framework for job scheduling and cluster resource management.

Hadoop MapReduce: It is a YARN-based system that provides parallel processing of large
distributed data sets.

The three main node types are the Master (HMaster), the ZooKeeper cluster, and the
RegionServers that hold the data. The HMaster keeps track of which nodes are alive and
provides communication services to the ZooKeeper cluster, the clients, and the RegionServers.
Initially, regions are allocated to a node, and once they grow too large they are split [3]. The
Master maintains load balancing across the cluster by distributing the load, unloading busy
servers, and moving regions to less occupied servers; it is also accountable for schema changes.

Region Server: The tables are spread across the RegionServers in units known as regions, which
run on the HDFS data nodes. Clients interact with the RegionServers, which perform the CRUD
(Create, Read, Update, Delete) operations requested by clients; hence they are called slave
nodes. Their four main elements are:

a. Block Cache: the read cache. Recurring data is stored here, and when it is full the least
recently used data is evicted.

b. MemStore: the write cache, which stores new data that has not yet been persisted.

c. WAL (Write-Ahead Log): it also stores new data that is not yet in permanent storage.

d. HFile: the actual storage file, storing data as key-value pairs [16].

Zookeeper: For large distributed systems, ZooKeeper provides a distributed configuration and
synchronization service, and it stands between the client and the HMaster. It also manages and
keeps track of database and server failures. It is the first point of contact when clients want to
interact with a RegionServer, and it acts as a coordinator [16].

[Figure: the YCSB client contacts ZooKeeper to look up the -ROOT- catalog region and rarely
needs the HMaster; the HMaster registers with ZooKeeper, assigns regions to the RegionServers,
and checks their health; each RegionServer holds regions (including -ROOT- and -META-) with
per-column-family MemStores and an HLog, and serves the client's read/write requests.]

Figure 1: HBase Architecture [14]

3.2. MongoDB

MongoDB belongs to the document-store NoSQL family and provides automatic scaling, high
availability, and good performance [15]. It is a schema-less, document-oriented database
developed by 10gen and an open-source community, with high availability, scalability, and fault
tolerance. It is capable of sharding by implementing a sharded cluster. Each cluster in MongoDB
comprises three components: shards, configuration servers, and query routers. A client query is
redirected by the query routers to the appropriate shards after a look-up on the shard addresses
maintained in the configuration servers. Cluster balancing is also performed by the query router,
using two primary operations: chunk splitting and balancing.

3.2.1. Key features of MongoDB

Key features of MongoDB are:

➢ It is schema-less, with a dynamic schema and no DDL (Data Definition Language). In a
collection, every document can have different data; the collection is not rigid or strict
about what it stores. A database can have zero or more collections, and a collection can
have zero or more documents.
➢ It provides a secondary, flexible index feature, which needs to be kept in memory. An
index is automatically created on the primary key field. To improve the performance of
a query or to enforce unique values on a particular field, the user can create other indexes
as well; it supports both single-field and compound indexes.
➢ Queries can be issued via an API, with drivers in many languages such as Python, Java,
C++, and Erlang.
➢ If configured appropriately, it can provide atomic writes and fully consistent reads.
➢ It also provides master-slave replication with automated failover.
➢ It facilitates automated scaling via automated range-based (or hash-based) partitioning of
data, also known as automatic sharding. Sharding means that the data is broken into
smaller chunks once it grows beyond a threshold, and these chunks are stored on different
machines/shards. To handle growing data, more servers can be added horizontally
without downtime and without affecting the application.
➢ Data can be integrated easily and rapidly.
➢ It also provides MapReduce functionality. MapReduce is a framework for processing big
data in two phases, Map and Reduce, with key-value pairs as input and output.
➢ It provides replication of data, which ensures redundancy and enables automatic failover.
Replication happens via replica sets, with primary and secondary members. At a given
time, only one member can act as the primary and the others act as secondaries; if the
primary fails, the secondaries elect a new primary (master-slave architecture).

➢ MongoDB can search for a document in the database using find/findOne by an inner
attribute, and the attribute can be indexed so that the search operation is faster, as
sketched below.
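A minimal sketch of this feature with the MongoDB Java driver (3.x series); the testdb database and patients collection names are hypothetical, and the host/port match the default mongod listener mentioned in Section 4:

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoFindSketch {
    public static void main(String[] args) {
        // Connect to the default host/port that mongod listens on.
        MongoClient client = new MongoClient("127.0.0.1", 27017);
        try {
            MongoCollection<Document> patients =
                client.getDatabase("testdb").getCollection("patients");

            patients.insertOne(new Document("patientId", "P1")
                .append("diagnosis",
                        new Document("condition", "low blood pressure")));

            // Secondary index on an inner attribute to speed up searches.
            patients.createIndex(Indexes.ascending("diagnosis.condition"));

            // findOne equivalent: find(...).first() by the indexed inner attribute.
            Document doc = patients.find(
                new Document("diagnosis.condition", "low blood pressure")).first();
            System.out.println(doc.toJson());
        } finally {
            client.close();
        }
    }
}
```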

3.2.2. MongoDB structure and architecture

As stated earlier, MongoDB is a document-store NoSQL database. Data is stored as documents
in a binary-encoded form called BSON (Binary JSON). Zero or more key-value pairs are stored
as a single entity (a document), and a collection is a group of related documents sharing a
common index. The main elements of the system are [3]:

Mongod (shard nodes): the data nodes, used for storing and retrieving the data.

Mongos: the only instance that can communicate outside of the cluster. It interfaces with clients
and routes operations to the appropriate shards.

Config server: the container of the metadata (used in case of node failure) about the objects
stored in the Mongod nodes. The configuration server serves as a repository for the metadata of
sharded clusters; this metadata describes the state and organization of all the data within the
sharded cluster. A cluster can have one or three config-server instances. Each running component
constitutes one node in the MongoDB cluster.

[Figure: three shards, each a replica set of Mongod nodes, behind a Config Server and a Mongos
router that serves the YCSB client.]

Figure 2: MongoDB Architecture [3]

A shard is a group of one or more Mongod nodes, known as a replica set, which contains a copy
of the data. The replication factor of the system is determined by the number of data nodes in a
shard. It has a master-slave architecture: within a shard there is only one master, which can read
and write, while the others are slaves that can only perform read operations.

Read Preference: It determines where a read operation is routed. The default route is through
the primary node, but possible options include the secondary nodes, primary-preferred, etc. It
helps to improve latency, which eventually increases throughput.

Write Concern: It determines the guarantee that MongoDB provides on the success of a write
operation. The default value is "Acknowledged"; a weaker write concern implies faster writes [7].
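A minimal sketch of tuning both settings with the MongoDB Java driver; the database and collection names are hypothetical, and the choice of UNACKNOWLEDGED simply illustrates a weaker-than-default write concern:

```java
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ReadWriteTuningSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("127.0.0.1", 27017);
        try {
            // Route reads to secondaries when available, and relax the
            // write concern from the default ACKNOWLEDGED for faster writes.
            MongoCollection<Document> patients = client
                .getDatabase("testdb")
                .getCollection("patients")
                .withReadPreference(ReadPreference.secondaryPreferred())
                .withWriteConcern(WriteConcern.UNACKNOWLEDGED);
            patients.insertOne(new Document("patientId", "P2"));
        } finally {
            client.close();
        }
    }
}
```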

Write operation performance: Every index associated with a collection needs to be updated with
every write. When a document grows beyond its current allocation, it is relocated on disk.

3.3. Security Features in MongoDB and HBase


Security is one of the most critical aspects of a database. A database must have good built-in
security features so that it is not affected by ransomware attacks; this is critical for protecting
sensitive data.

3.3.1. MongoDB

MongoDB provides facilities that guard against, detect, and control unauthorized access to data.
The security features are [1]:

Authentication: MongoDB integrates with external access control mechanisms, providing
authentication via LDAP, Kerberos, and Windows Active Directory. The IP whitelisting feature
gives an additional advantage by configuring MongoDB to accept external connections only
from approved IP addresses. To enable authentication, the authentication parameter must be
turned on in the mongod.conf file.
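A sketch of the relevant fragment of mongod.conf (the YAML format used by MongoDB 3.x); exact option names can vary by server version, so treat this as an assumption to verify against the installed release:

```yaml
# mongod.conf (sketch): enabling authorization turns authentication on
security:
  authorization: enabled
net:
  bindIp: 127.0.0.1   # accept connections only from approved addresses
  port: 27017
```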

Authorization: The DevOps team can isolate access by assigning each user a role depending on
the type of job to be performed; this is called Role-Based Access Control. The access control can
be defined centrally in MongoDB or in the LDAP server. It can also hide certain sensitive data
from a group of users.

Auditing: MongoDB's native audit log can be used by security administrators to track any
changes in the database, in order to abide by governing compliance requirements.

Encryption: MongoDB provides an encrypted storage engine; it encrypts data at the network,
disk, and backup levels. This spares organizations the additional burden of deploying external
encryption mechanisms. Only a restricted set of people can access encrypted data, based on the
permissions granted to each user, which provides additional security.

Replication keyfile: Enabling the replication keyfile automatically enables authentication. Only
hosts with the keyfile installed can join the replica sets. The replication key is itself encrypted,
which provides an additional layer of security.

Non-standard ports: MongoDB runs on the standard ports 27017, 27018, 27019, and 2700X, and
attackers mostly scan the standard MongoDB ports, so it is advised to run MongoDB on a
non-standard port.

Firewall: Firewall rules on the hosts, or security groups with cloud hosting, can be enabled on
the servers.

Encryption at rest: MongoDB provides encryption at rest. It can encrypt pages at the application
level using OpenSSL, which boosts performance.

3.3.2. HBase

In the initial release (Hadoop 0.20.2) there was minimal built-in security in Hadoop, but security
was added in later versions. It now offers strong security features based on the Kerberos
authentication model and ACLs (Access Control Lists). The main features are explained
below [5]:

ACL: An ACL is a list of permissions given to an entity/user. It decides which users are granted
access and what kinds of operations each user is eligible to perform. It serves two different roles
in HBase: it provides the interface for granting, revoking, and listing assigned permissions, and
it performs the authorization checks for requests against table data.

RPC (Remote Procedure Call): The security implementation is built on top of RPC. HBase
provides a separately loadable SecureRPCEngine, which is optional so that organizations for
which security is not a concern can avoid the overhead. To enable it, we need to set
hbase.security.authentication and hadoop.security.authentication to kerberos (Kerberos is a
network authentication protocol) in the XML config files, as sketched below.
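A sketch of the two properties named above as they would appear in the XML configuration (hbase-site.xml, with the matching Hadoop setting normally placed in core-site.xml):

```xml
<!-- hbase-site.xml (sketch) -->
<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>

<!-- core-site.xml (sketch) -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
```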

SASL (Simple Authentication and Security Layer): Provides SASL authentication of clients
using Kerberos/GSSAPI authentication, or a DIGEST-MD5 implementation using signed tokens
for MapReduce.

Authorization: Permissions and roles are granted only to authorized users, with support for
user- and group-based management. The operations that can be performed, depending on the
scope, are READ, WRITE, EXECUTE, CREATE, and ADMIN, and permissions can be granted
at the global, namespace, table, column-family, and cell level.

Secure ZooKeeper: HBase relies heavily on ZooKeeper for its operation, so the security of
ZooKeeper is important. This is achieved via SASL-based authentication using Kerberos, or by
using ZooKeeper's existing ACLs.

To enable HBase authorization, we need to change the hbase-site.xml properties and then grant
permissions, as sketched below.
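Once authorization is enabled, permissions are typically granted from the HBase shell; a sketch with hypothetical user and namespace names:

```
hbase(main):001:0> grant 'alice', 'RW', 'usertable'     # read/write on one table
hbase(main):002:0> grant 'bob', 'RWXCA', '@analytics'   # all rights on a namespace
hbase(main):003:0> user_permission 'usertable'          # list assigned permissions
```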

On behalf of clients, the Thrift gateway can also be configured for authentication. Using a fixed
user, the Thrift client can be authenticated to HBase; this was implemented in HBASE-11349 and
HBASE-11474 [17].

NoSQL DB    Version       Authentication    Authorization
MongoDB     1.9.1         Y                 Y
HBase       0.20.203.0    Y (Kerberos)      Y

Table: HBase and MongoDB support for authentication and authorization
3.4. Comparative study of HBase and MongoDB

a. Scalability

Horizontal scalability in HBase: HBase is a Hadoop database that provides low latency for
random reads and writes while running on top of HDFS; it is designed to handle petabytes of
data.

A RegionServer can serve one or more regions. Regions are assigned to the RegionServers at
startup by the master, and to achieve load balancing the master can decide to move regions from
one RegionServer to another.

Horizontal scalability in MongoDB: MongoDB provides the capability of horizontal scale-out
for databases using low-cost commodity hardware or cloud infrastructure, through the technique
of sharding. Sharding automatically partitions the data and distributes it over different physical
instances known as shards, as sketched below.
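A sketch of enabling sharding from the mongo shell, with hypothetical database and collection names and a hashed shard key:

```javascript
// mongo shell (sketch): partition a collection across the shards
sh.enableSharding("testdb")
sh.shardCollection("testdb.patients", { patientId: "hashed" })
sh.status()   // inspect how chunks are distributed across shards
```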

b. Availability

The availability of a system means that every request gets a response, whether successful or
failed.

In MongoDB, when a failure leaves the replica set without a primary node, the secondaries elect
a new primary while the old primary steps down. The voting phase ideally takes only a few
seconds, but during this time the data on that node cannot be accessed.

In HBase, when a master failure occurs, all read and write requests are stopped until the
secondary master becomes active.

In the event of failure, both HBase and MongoDB choose consistency over availability, which
means stopping all write operations until the database decides the system is stable again, with
the only difference being that MongoDB has an automated process for electing a new primary
node.

c. Reliability

HBase provides a high level of reliability when configured properly with redundancy. HBase on
Hadoop is a distributed system, so its failure modes are quite different from those of other
database solutions. ZooKeeper is used to recover from crashes of RegionServers by moving their
regions to other operational RegionServers, and as data is replicated over different regions in
HBase, there are duplicate copies of the data to recover from.

In the event of a failure of the primary node in MongoDB, the secondary nodes vote among
themselves and elect a new primary node, which takes the place of the old primary.

4. Experimental configuration

In this section we discuss how we set up a test environment to perform the study. We used HBase
and MongoDB as our two NoSQL database systems for performance benchmarking. Database
performance testing is a key area where the performance of various databases can be tested using
open-source tools such as YCSB (Yahoo! Cloud Serving Benchmark), a workload generator
developed by Yahoo based on a common set of workloads. YCSB is by far the most popular
open-source tool for testing and comparing the performance of multiple databases. It generates
fabricated data, consisting of sets of arbitrary characters indexed by a primary identifier [6]. It
consists of a YCSB client, a load scheduler, and a set of workloads [8]. The YCSB client is a
Java program that generates the data to be added to the database; the client can have multiple
threads, which are managed by the load scheduler. It operates in two phases:

a. Data initialization phase: data is loaded into the data nodes, which are responsible for storing
the actual data.

b. Transaction/execution phase: the loaded data is exercised by generating random keys and
sending operations to the data nodes.

Workloads and Tools

Database performance was defined by the speed at which the database processed basic
operations. A basic operation is an action performed by a workload executor, which drives
multiple client threads. Each thread executes a sequential series of operations by making calls
to a database interface layer both to load a database (the load phase) and to execute a workload
(the transaction phase).

The threads throttle the rate at which they generate requests, so that we may directly control
the offered load against the database. In addition, the threads measure latency and the achieved
throughput of their operations and report these measurements to the statistics collection
module.

YCSB provides standard workloads with different types of operations and different distributions
of operation percentages. In this paper, Workload A and Workload D have been used.
The performance of each database was evaluated under the following workloads:

i. Workload A (update-heavy workload): 50% read and 50% update operations.

ii. Workload D (read-latest workload): 95% read and 5% insert operations, where newly
inserted records are read back shortly after insertion.
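For reference, the operation mixes above correspond to the proportions defined in the core workload property files shipped with YCSB (shown here in abbreviated form):

```properties
# workloads/workloada (update heavy)
readproportion=0.5
updateproportion=0.5
requestdistribution=zipfian

# workloads/workloadd (read latest)
readproportion=0.95
insertproportion=0.05
requestdistribution=latest
```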

We used the YCSB client as a worker, which consists of the following components:

➢ workload executor
➢ the YCSB client threads
➢ extensions
➢ statistics module

➢ database connectors

The workloads were tested under the following conditions:

➢ The data fits in memory.
➢ Durability is disabled.
➢ Replication is set to 1, signifying that just a single replica exists for each data set.

[Figure: the YCSB client takes command-line parameters (database to use, target throughput,
number of threads) and a workload parameter file (read/write mix, record size, dataset), and
drives the cloud database through a workload executor, client threads, a statistics module, and a
DB client layer; both the workloads and the database clients are extensible.]

Figure 3: YCSB architecture [18]

The following steps were performed to set up the environment for testing:

1. On OpenStack, create an Ubuntu instance. We installed Hadoop 2.9.2. Before installing
Hadoop, Java needs to be installed first; the Java path (/usr/lib/jvm/java-8-openjdk-amd64) needs
to be set in the hadoop-env.sh file.

2. Once Java and Hadoop are installed, HBase is installed over HDFS. The version of HBase
installed is hbase-1.4.8.

3. If everything is installed correctly, the Hadoop and HBase components should be up, which
can be verified using the jps command.

4. Also, by logging into the HBase shell, we can check the status of the server. Once logged into
the HBase shell, we created a table called "usertable" with the column family "cf1", as sketched
below.
5. After the successful installation of Hadoop and HBase, MongoDB
(mongodb-linux-x86_64-ubuntu1604-3.2.10) is installed. Note that for MongoDB to accept
connections, the mongod process must be started; by default, MongoDB listens on 127.0.0.1,
port 27017.

6. Next, YCSB needs to be installed. The version of YCSB used is ycsb-0.11.0. Download the
package and untar the file; inside are the various binding directories and the workloads.

7. A test harness tool was used to run the tests, after changing the necessary parameters in
runtest.sh, opcounts.txt, and workloadlist.txt. The underlying load and run invocations look as
sketched below.
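For reference, the load and transaction phases boil down to invocations like the following (a sketch for YCSB 0.11.0; the record counts, operation counts, thread counts, and the MongoDB URL are illustrative):

```bash
# HBase: load then run Workload A (HBase conf dir must be on the classpath)
bin/ycsb load hbase10 -P workloads/workloada \
    -p table=usertable -p columnfamily=cf1 -p recordcount=1000000
bin/ycsb run hbase10 -P workloads/workloada \
    -p table=usertable -p columnfamily=cf1 -p operationcount=1000000 -threads 8

# MongoDB: same workload against the local mongod
bin/ycsb load mongodb -P workloads/workloada \
    -p mongodb.url=mongodb://127.0.0.1:27017/ycsb -p recordcount=1000000
bin/ycsb run mongodb -P workloads/workloada \
    -p mongodb.url=mongodb://127.0.0.1:27017/ycsb -p operationcount=1000000 -threads 8
```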

5. Evaluation and Results


We benchmarked both NoSQL solutions, HBase and MongoDB, with the help of YCSB in four
stages of operation counts (125,000, 250,000, 500,000, and 1,000,000) over Workload A and
Workload D.

For Workload A (50% read and 50% update), we analyze the performance results of MongoDB
and HBase in three sections as follows:

a. Read operation average latency (us)

MongoDB performs much better than HBase for record-read operations. As the operation count
increased, the average latency also increased for both databases; at 1,000,000 operations, the
average latency of MongoDB was 1.4K us while the HBase value was 8.6K us, i.e. almost 6
times higher than MongoDB. For read operations, MongoDB is better than HBase (as in [6]).

Some observations about the data:

• HBase: as the total operation count increases, the read operation count also increases.

• HBase: as the total operation count increases, the average latency of read operations also
increases.

• MongoDB: as the total operation count increases, the read operation count also increases,
but not at the same rate as HBase.

• MongoDB: as the total operation count increases, the average latency of read operations
also increases, but not at the same rate as HBase.

b. Update operation average latency (us)

MongoDB performed better than HBase, but as the operation count increased beyond 250,000
the average latency increased drastically for MongoDB. At 1,000,000 operations, the average
latency of MongoDB was 1.41K us while HBase's was 0.85K us. This means that for smaller
update operation counts MongoDB is good, but for high operation counts HBase performs better;
we can conclude that, for update operations, HBase is better ([4], [6]).

• MongoDB appears to handle a higher total operation count before the update latency starts
to increase significantly, compared to HBase.

• Within each data set (HBase or MongoDB), the average latency appears to increase at a
similar rate as the total operation count increases.

c. Overall throughput (ops/sec) of HBase and MongoDB:

The overall throughput of MongoDB was always higher than that of HBase, making it the clear
winner. However, as the total operation count increased, the throughput decreased rapidly, from
4.2K to 0.9K ops/sec for MongoDB; in comparison, HBase showed a linear decrease in
throughput.

Overall for Workload A, MongoDB performed better in the read operations and HBase
performed better in the update operations.

For Workload D (95% read and 5% insert), we likewise analyze the performance results of
MongoDB and HBase in three sections as follows:

a. Read operation average latency (us)

MongoDB outperformed HBase for read operations, owing to the structure in which MongoDB
stores data (BSON documents). As the read operation count increased, the average latency of
both MongoDB and HBase increased significantly, but overall HBase suffers higher
latency ([4], [6]).

b. Insert operation average latency (us)

For the 5% insert operations, MongoDB was the clear winner compared to HBase. Its average
latency was almost constant across the operation counts, while HBase's average latency
increased as the number of insert operations increased.

Overall, MongoDB gave much better performance when the insert operation count was
low ([4], [6]).

c. Overall throughput (ops/sec) of HBase and MongoDB

The overall throughput of MongoDB was always higher than that of HBase. However, as the
total operation count increased, MongoDB's throughput decreased rapidly, from 9.3K to 3.7K
ops/sec; HBase's throughput also decreased steadily with the increase in total operation count.
Overall for Workload D, MongoDB was the clear winner compared to HBase.

6. Conclusion
Due to the limitations of relational databases, NoSQL is gaining popularity for handling data
efficiently. There are numerous NoSQL databases on the market, and performance evaluation
and comparison are therefore necessary for NoSQL users and developers to find the most
suitable database for their application. This paper provides benchmarks and models for two of
the most popular NoSQL databases, MongoDB and HBase, which we deployed on an Ubuntu
virtual machine, based on average latency (us) and overall throughput (ops/sec). Analyzing the
evaluation results, overall MongoDB outperformed HBase in read operations, while HBase
performed better in update operations. For the 5% insert workload, MongoDB performed better
than HBase; according to prior research, for insert-heavy workloads HBase performs better. The
overall throughput of MongoDB was higher than that of HBase in both workloads. The design
features of each database make it suitable for specific operations.

Future work on this report would include running the databases under much heavier workloads,
studying the effects of other parameters such as the replication factor, the number of nodes, and
the number of cores, and finding optimization approaches to provide a generic model framework.

References
[1] MongoDB (2018). MongoDB Security Architecture.
[2] Chandra, D. (2015). Future Generation Computer Systems. Elsevier.
[3] Gandini, A., Gribaudo, M., Knottenbelt, W., Osman, R. and Piazzolla, P. (2018).
Performance evaluation of NoSQL databases.
[4] Matallah, H., Belalem, G. and Bouamrane, K. (2017). Experimental comparative study of
NoSQL databases: HBase versus MongoDB by YCSB. ResearchGate.
[5] Cloudera (2018). Managing HBase Security. [online] Available at:
https://www.cloudera.com/documentation/enterprise/5-6-
x/topics/admin_hbase_security.html [Accessed 19 Dec. 2018].
[6] Swaminathan, S. and Elmasri, R. (2016). Quantitative Analysis of Scalable NoSQL
Databases. IEEE, pp. 323-327.
[7] Kumar, R. and Mary, R. (2017). Comparative Performance Analysis of various NoSQL
Databases: MongoDB, Cassandra, and HBase on Yahoo Cloud Server. Imperial Journal of
Interdisciplinary Research (IJIR), 3(4), pp. 265-269.
[8] Krstic, L. and Krstic, M. (2018). Testing the performance of NoSQL databases via the
Database Benchmark tool. Military Technical Courier, 66(3), pp. 614-639.
[9] Tang, E. and Fan, Y. (2017). Performance Comparison between Five NoSQL Databases.
International Conference on Cloud Computing and Big Data, pp. 105-110.
[10] Mohamed, M.A., Altrafi, O.G. and Ismail, M.O. (2014). Relational vs. NoSQL databases:
A survey. International Journal of Computer and Information Technology, 3(03),
pp. 598-601.
[11] Surendar, A. (2017). Evolution of gait biometric system and algorithms - A review.
Biomedical and Pharmacology Journal, 10(1), pp. 467-472.
[12] Vimal Kumar, M.N., Helenprabha, K. and Surendar, A. (2017). Classification of
mammographic image abnormalities based on EMO and LS-SVM techniques. Research
Journal of Biotechnology, 12(1), pp. 35-40.
[13] Fiannaca, A.J. and Huang, J. (2015). Benchmarking of Relational and NoSQL Databases
to Determine Constraints for Querying Robot Execution Logs. Computer Science &
Engineering, University of Washington, USA, pp. 1-8.
[14] DeZyre (2016). Overview of HBase Architecture and its Components. [online] Available
at: https://www.dezyre.com/article/overview-of-hbasearchitecture-and-its-
components/295 [Accessed 11 Dec. 2018].
[15] Kumar, L., Rajawat, S. and Joshi, K. (2015). Comparative analysis of NoSQL (MongoDB)
with MySQL Database. IJMTER, 2(5), pp. 120-127.
[16] Patel, H. (2017). [online] Available at:
https://www.researchgate.net/publication/317399857 [Accessed 17 Dec. 2018].
[17] Apache HBase (2018). Apache HBase Reference Guide. Version 3.0.0-SNAPSHOT.
[18] Grim, M. and Wiersma, A. (2018). Security and Performance Analysis of Encrypted
NoSQL Databases. Amsterdam, p. 13.
