Unit-2
1. Relational Database
RDBMS stands for Relational Database Management System. It is the most widely used type of database.
Data is stored in tables, where each row is a tuple. A database contains a number of tables, and data
can be accessed easily because of this tabular organization. The relational model was proposed by E. F. Codd.
2. NoSQL
NoSQL stands for non-SQL (or "not only SQL") database. Unlike a relational database, a NoSQL
database does not use tables to store data. It is used for storing and retrieving data and is
generally chosen for large amounts of data. Many NoSQL systems provide their own query language
and can give better performance for such workloads.
Difference between Relational database and NoSQL:
Relational database | NoSQL
1. It is used to handle data arriving at low velocity. | 1. It is used to handle data arriving at high velocity.
2. It gives only read scalability. | 2. It gives both read and write scalability.
4. Data arrives from one or a few locations. | 4. Data arrives from many locations.
10. It is difficult to make changes to the database once it is defined. | 10. It enables easy and frequent changes to the database.
11. A schema is mandatory to store the data. | 11. Schema design is not required.
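The last row of the comparison (mandatory schema versus no required schema) can be illustrated with a small sketch. It uses Python's built-in sqlite3 module for the relational side and a plain list of dictionaries as a hypothetical stand-in for a NoSQL collection. The table, field names and values are illustrative assumptions, not taken from any real system.

```python
import sqlite3

# Relational side: the table schema must be declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))

def insert_with_extra_column(connection):
    """Try to store a field the schema does not declare; return True on success."""
    try:
        connection.execute(
            "INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (2, "Ravi", "Pune")
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the row because the table has no column named "city".
        return False

# Schema-less side (hypothetical stand-in for a NoSQL collection): a list of dicts.
users_collection = []
users_collection.append({"id": 1, "name": "Asha"})
users_collection.append({"id": 2, "name": "Ravi", "city": "Pune"})  # new field, no schema change

relational_accepted = insert_with_extra_column(conn)
```

In the relational case the table would first need an ALTER TABLE statement, while the schema-less collection simply stores the extra field.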
Dept. of MCA 1
Unit-2 NOSQL Databases
{
    title: 'Geeksforgeeks',
    url: 'https://www.geeksforgeeks.org',
    type: 'NoSQL'
}
SQL databases store data in tabular format. The data is stored in a predefined data model, which is
not very flexible for today's fast-growing real-world applications. Modern applications are
more networked, social and interactive than ever. They store more and more data
and access it at higher rates.
A Relational Database Management System (RDBMS) is often not the best choice for handling
big data because, by design, it is hard to scale horizontally. If the database runs on a single
server, it will eventually reach a scaling limit. NoSQL databases are easier to scale out and can
provide better performance for such workloads. MongoDB is one such NoSQL database: it scales by
adding more servers, and its flexible document model can improve developer productivity.
RDBMS vs MongoDB:
RDBMS has a typical schema design that shows the number of tables and the relationships between
these tables, whereas MongoDB is document-oriented. By default it does not enforce a schema or
relationships between collections.
Complex multi-table joins and transactions are not MongoDB's strength: joins are limited (the
$lookup aggregation stage), and multi-document transactions were only added in version 4.0.
MongoDB allows a highly flexible and scalable document structure. For example, one data
document of a collection in MongoDB can have two fields whereas the other document in the same
collection can have four.
MongoDB can be faster than an RDBMS for many workloads because of its indexing and storage techniques.
A few terms correspond between the two kinds of database. What is called a Table in RDBMS is
called a Collection in MongoDB. Similarly, a Row is called a Document and a Column is called a
Field. MongoDB provides a default ‘_id’ (if not provided explicitly), a 12-byte ObjectId that is
usually shown as a 24-character hexadecimal string and ensures the uniqueness of every document.
It is similar to the Primary key in RDBMS.
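The following is a minimal sketch of the points above, using only the Python standard library rather than a real MongoDB driver. The collection is a plain list, `insert` and `new_object_id` are hypothetical helper names, and the identifier only imitates the shape of an ObjectId (a 4-byte timestamp followed by 8 further bytes, which are simply random here).

```python
import os
import time

def new_object_id() -> str:
    """Return a 24-character hex string shaped like a 12-byte ObjectId.

    Assumption: 4 timestamp bytes followed by 8 random bytes. A real ObjectId
    splits those 8 bytes into a 5-byte random value and a 3-byte counter.
    """
    timestamp = int(time.time()).to_bytes(4, "big")
    return (timestamp + os.urandom(8)).hex()

def insert(collection: list, document: dict) -> dict:
    """Store a copy of the document, adding a default _id if none was given."""
    stored = dict(document)
    stored.setdefault("_id", new_object_id())
    collection.append(stored)
    return stored

# One collection, two documents with different numbers of fields.
computers = []
small = insert(computers, {"brand": "Acme", "ram_gb": 16})
large = insert(computers, {"brand": "Acme", "ram_gb": 32, "cpu": "8-core", "disk": "1 TB"})
```

Both documents live in the same collection even though one has two fields and the other has four, and each receives its own `_id`.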
Features of MongoDB:
Document Oriented: MongoDB stores the main subject in the minimal number of documents
and not by breaking it up into multiple relational structures like RDBMS. For example, it stores
all the information of a computer in a single document called Computer and not in distinct
relational structures like CPU, RAM, Hard disk, etc.
Indexing: Without indexing, a database would have to scan every document of a collection to
select those that match the query, which would be inefficient. Indexing is therefore essential for
efficient searching, and MongoDB uses it to process huge volumes of data quickly.
Scalability: MongoDB scales horizontally using sharding (partitioning data across various
servers). Data is partitioned into data chunks using the shard key, and these data chunks are
evenly distributed across shards that reside across many physical servers. Also, new machines
can be added to a running database.
Replication and High Availability: MongoDB increases the data availability with multiple
copies of data on different servers. By providing redundancy, it protects the database from
hardware failures. If one server goes down, the data can be retrieved easily from other
active servers which also had the data stored on them.
Aggregation: Aggregation operations process data records and return computed results. They
are similar to the GROUP BY clause in SQL. A few aggregation expressions are sum, avg, min,
max, etc.
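The aggregation feature described above can be sketched in plain Python. This is not MongoDB's aggregation pipeline. It only shows the idea of grouping records by one field and computing sum, avg, min and max over another. The sample documents and the helper name `aggregate_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are illustrative only.
orders = [
    {"customer": "A", "amount": 120},
    {"customer": "B", "amount": 80},
    {"customer": "A", "amount": 30},
    {"customer": "B", "amount": 20},
]

def aggregate_by(documents, group_field, value_field):
    """Group documents by one field and compute sum, avg, min and max of another."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[group_field]].append(doc[value_field])
    return {
        key: {
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }

totals = aggregate_by(orders, "customer", "amount")
```

The result holds one entry per customer, much like one output row per group in a SQL GROUP BY query.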
Where do we use MongoDB?
MongoDB is preferred over RDBMS in the following scenarios:
Big Data: If you have a huge amount of data to store, consider MongoDB before an
RDBMS. MongoDB has a built-in solution for partitioning and sharding your database.
Unstable Schema: Adding a new column in an RDBMS is hard, whereas MongoDB is schema-less.
Adding a new field is easy and does not affect old documents.
Distributed data: Since multiple copies of data are stored across different servers, data can be
recovered quickly and safely even if there is a hardware failure.
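The partitioning and sharding mentioned above can be sketched as follows. This is a simplified, hypothetical illustration of hash-based placement by shard key. A real MongoDB cluster groups documents into chunks and balances those chunks across shards, which this sketch does not attempt.

```python
import hashlib

def shard_for(shard_key_value: str, num_shards: int) -> int:
    """Map a shard key value to a shard number using a stable hash."""
    digest = hashlib.md5(shard_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def distribute(documents, shard_key, num_shards):
    """Place each document on the shard chosen by its shard key."""
    shards = {n: [] for n in range(num_shards)}
    for doc in documents:
        shards[shard_for(str(doc[shard_key]), num_shards)].append(doc)
    return shards

# Hypothetical documents with "user_id" as the shard key.
docs = [{"user_id": i, "name": f"user{i}"} for i in range(100)]
shards = distribute(docs, "user_id", 4)
```

Because the hash is stable, the same shard key value always maps to the same shard, so a lookup by shard key needs to contact only one server.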
Language Support by MongoDB:
MongoDB provides drivers, maintained officially or by the community, for popular programming
languages such as C, C++, Rust, C#, Java, Node.js, Perl, PHP, Python, Ruby, Scala, Go, and Erlang.
Cassandra
Apache Cassandra is a highly scalable, high-performance distributed database designed to handle
large amounts of data across many commodity servers, providing high availability with no single point
of failure. It is a type of NoSQL database. Let us first understand what a NoSQL database does.
NoSQL Database
A NoSQL database (sometimes called Not Only SQL) is a database that provides a
mechanism to store and retrieve data other than the tabular relations used in relational databases.
These databases are schema-free, support easy replication, have simple APIs, are eventually consistent,
and can handle huge amounts of data.
The primary objective of a NoSQL database is to have
simplicity of design,
horizontal scaling, and
finer control over availability.
NoSQL databases use different data structures from relational databases, which makes some
operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must
solve.
NoSQL vs. Relational Database
The following table lists the points that differentiate a relational database from a NoSQL database.
Besides Cassandra, we have the following NoSQL databases that are quite popular −
Apache HBase − HBase is an open source, non-relational, distributed database modeled after
Google’s BigTable and is written in Java. It is developed as a part of Apache Hadoop project
and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.
MongoDB − MongoDB is a cross-platform document-oriented database system that avoids using
the traditional table-based relational database structure in favor of JSON-like documents with
dynamic schemas making the integration of data in certain types of applications easier and
faster.
What is Apache Cassandra?
Apache Cassandra is an open source, distributed and decentralized storage system
(database) for managing very large amounts of structured data spread out across the world. It
provides highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra −
It is scalable and fault-tolerant, and offers tunable consistency.
It is a column-oriented (wide-column) database.
Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but
adds a more powerful “column family” data model.
Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco,
Rackspace, eBay, Netflix, and more.
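The Dynamo-style replication mentioned above can be sketched as a hash ring: every node and every key is hashed to a position on the ring, and a key is stored on the first node clockwise from its position plus the next few nodes. This is a simplified illustration under stated assumptions (one token per node, MD5 as the hash, hypothetical node names). It is not Cassandra's actual partitioner or replication strategy.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Hash a node name or partition key to a position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

def replicas_for(key: str, nodes: list, replication_factor: int) -> list:
    """Return the nodes that store `key`: the first node clockwise from the
    key's token, followed by the next distinct nodes on the ring."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    start = bisect.bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

# Hypothetical four-node cluster with three copies of every key.
cluster = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("user:42", cluster, replication_factor=3)
```

No node is special in this scheme: any node can compute where a key lives, which is one way to avoid a single point of failure.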
Features of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below are some
of the features of Cassandra:
Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as required.
Always on architecture − Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.
Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput
as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
Flexible data storage − Cassandra accommodates all possible data formats including:
structured, semi-structured, and unstructured. It can dynamically accommodate changes to your
data structures according to your need.
Easy data distribution − Cassandra provides the flexibility to distribute data where you need
by replicating data across multiple data centers.
Transaction support − Cassandra does not provide full RDBMS-style ACID transactions. It offers
atomic, isolated and durable writes within a single partition, with tunable consistency.
Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read
efficiency.
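Cassandra lets each read and write choose how many replicas must respond. A commonly cited rule for Dynamo-style systems is that a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (N). The sketch below only does this arithmetic. The function names are hypothetical and no Cassandra API is involved.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1

def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor

n = 3
q = quorum(n)
quorum_reads_and_writes = overlaps(q, q, n)
single_replica_reads_and_writes = overlaps(1, 1, n)
```

With three replicas, quorum reads and quorum writes (2 + 2 > 3) always overlap, while reading and writing a single replica (1 + 1) does not, which is the trade-off between consistency and latency.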
History of Cassandra
Cassandra was developed at Facebook for inbox search.
It was open-sourced by Facebook in July 2008.
Cassandra was accepted into Apache Incubator in March 2009.
It became an Apache top-level project in February 2010.
Apache HBase
HBase is a data store modeled on Google’s Bigtable. It is an open source, distributed database
developed by the Apache Software Foundation and written in Java. HBase is an essential part of the Hadoop
ecosystem. HBase runs on top of HDFS (Hadoop Distributed File System). It can store massive amounts of
data, from terabytes to petabytes. It is column-oriented and horizontally scalable.
6. It supports a Block Cache and Bloom Filters for real-time queries and for high volume query
optimization.
7. HBase provides automatic failure support between Region Servers.
8. It supports exporting metrics to files through the Hadoop metrics subsystem.
9. It doesn’t enforce relationships within your data.
10. It is a platform for storing and retrieving data with random access.
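The Bloom filters mentioned in point 6 can be sketched as follows. A Bloom filter answers "possibly present" or "definitely absent" for a key, which lets a store skip files that cannot contain a requested row. This is a generic, minimal illustration with hypothetical sizes and row keys. It is not HBase's implementation.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter: it may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive several bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

row_keys = BloomFilter()
for key in ("row-001", "row-002", "row-003"):
    row_keys.add(key)
```

If `might_contain` returns False, the key was never added, so the corresponding file does not need to be read at all.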
Facebook's messaging platform originally used Apache Cassandra but moved to HBase in November
2010. Facebook was building a scalable and robust infrastructure to combine services such as
messages, email, chat and SMS into a real-time conversation, and it judged HBase to be best
suited for that.
RDBMS Vs HBase –
1. RDBMS is mostly Row Oriented whereas HBase is Column Oriented.
2. RDBMS has a fixed schema, but in HBase columns can be added at run time.
3. RDBMS is good for structured data whereas HBase is good for semi-structured data.
4. RDBMS is optimized for joins but HBase is not optimized for joins.
Apache HBase is a NoSQL, column-oriented database that is built on top of the Hadoop ecosystem. It is
designed to provide low-latency, high-throughput access to large-scale, distributed datasets. Here are
some of the advantages and disadvantages of using HBase:
Advantages Of Apache HBase:
1. Scalability: HBase can handle extremely large datasets that can be distributed across a cluster
of machines. It is designed to scale horizontally by adding more nodes to the cluster, which
allows it to handle increasingly larger amounts of data.
2. High-performance: HBase is optimized for low-latency, high-throughput access to data. It uses a
distributed architecture that allows it to process large amounts of data in parallel, which can
result in faster query response times.
3. Flexible data model: HBase’s column-oriented data model allows for flexible schema design
and supports sparse datasets. This can make it easier to work with data that has a variable or
evolving schema.
4. Fault tolerance: HBase is designed to be fault-tolerant by replicating data across multiple
nodes in the cluster. This helps ensure that data is not lost in the event of a hardware or network
failure.
Disadvantages Of Apache HBase:
1. Complexity: HBase can be complex to set up and manage. It requires knowledge of the
Hadoop ecosystem and distributed systems concepts, which can be a steep learning curve for
some users.
2. Limited query language: HBase has no SQL-like query language of its own; data is accessed through
the HBase Shell and client APIs, which are not as feature-rich as SQL. This can make it difficult
to perform complex queries and analyses.
3. Limited support for transactions: HBase guarantees atomic operations only on a single row and does
not support multi-row transactions, which can make it difficult to maintain data consistency in
some use cases.
4. Not suitable for all use cases: HBase is best suited for use cases where high-throughput,
low-latency access to large datasets is required. It may not be the best choice for applications
that require complex queries, joins, or multi-row transactional guarantees.
Neo4j Introduction
Neo4j: Neo4j is one of the most popular graph database management systems, and it is also a NoSQL
database system. Neo4j is different from MySQL or MongoDB; it has its own features that make it
special compared to other database management systems.
Neo4j structure: Neo4j stores and presents data in the form of a graph, not in a tabular or
JSON format. The whole data is represented by nodes, and you can create relationships
between nodes. The whole database therefore looks like a graph, which is what makes it
unique among database management systems. MS Access, SQL Server and other relational database
management systems use tables to store or present data with the help of columns and rows, but Neo4j
doesn’t use tables, rows or columns.
Neo4j Usage: If your data has many interconnecting relationships, Neo4j is a good choice.
Neo4j is well suited to storing data that contains multiple connections between nodes.
This is where Neo4j (a graph database) comes in: it is more comfortable to use with highly
connected data than a relational database. Neo4j doesn’t require a predefined schema; you just
need to load the data, and the data itself is the main structure. It is a schema-optional
database management system.
Some features may make you choose Neo4j over other database management systems.
Neo4j is built around relationships, but there is no need to set up primary key
or foreign key constraints on the data. You can add any relationship between any nodes you want.
That makes Neo4j extremely well suited to network-like data. Below is a list of areas where you
can use this database management system.
Social networks such as Facebook, Twitter or Instagram
Network Diagram
Fraud Detection
Graph-based search of digital assets
Data Management
Real-time product recommendation
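The node-and-relationship structure described above can be sketched with plain Python data structures. This is not Neo4j or its query language. The node ids, labels, relationship types and helper names are all hypothetical, chosen only to show that relationships are stored directly rather than through primary and foreign keys.

```python
# Nodes keyed by a hypothetical id, each with a label and free-form properties.
nodes = {
    1: {"label": "Person", "name": "Asha"},
    2: {"label": "Person", "name": "Ravi"},
    3: {"label": "Product", "title": "Laptop"},
}

# Relationships are (start node, type, end node); no foreign keys are declared.
relationships = [
    (1, "FRIEND_OF", 2),
    (1, "BOUGHT", 3),
    (2, "BOUGHT", 3),
]

def related(start_id, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [nodes[end] for start, kind, end in relationships
            if start == start_id and kind == rel_type]

def buyers_of(product_id):
    """Follow BOUGHT relationships backwards to find who bought a product."""
    return [nodes[start]["name"] for start, kind, end in relationships
            if kind == "BOUGHT" and end == product_id]

asha_friends = related(1, "FRIEND_OF")
laptop_buyers = buyers_of(3)
```

A product recommendation, one of the use cases listed above, is then a matter of following relationships, for example from a person to their friends and on to what those friends bought.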
Advantages:
1. Representation of connected data is very easy.
2. Retrieval or traversal or navigation of connected data is very fast.
3. It uses simple and powerful data model.
4. It can easily represent semi-structured data.
Disadvantages:
1. OLAP support for these types of databases is not well developed.
2. A lot of research is still going on in this area.
Difference between NoSQL and RDBMS
The following table highlights the major differences between NoSQL and RDBMS −
Basis of Comparison | NoSQL | RDBMS
Query | No declarative query language | SQL stands for Structured Query Language.
Design | NoSQL combines multiple database technologies. These databases were created in response to the application's requirements. | Traditional RDBMS systems use SQL syntax and queries to get insights from data. Different OLAP systems use them.
Maturity: RDBMS has been around for decades, so it is more mature, reliable and functional, and
growing demands are less likely to make it obsolete. Many NoSQL alternatives are still in the
pre-production stage and do not yet have all of their essential features implemented.
Database and node failures during backups and restores: Node failures are common in
NoSQL databases since they are designed to grow to hundreds of nodes. As a result, any
backup method must be able to account for data collection failures from down nodes and their
influence on quorum consistency. Restores, in turn, must take failed cluster nodes into
consideration and adjust data repopulation accordingly.
Analytics and business intelligence: NoSQL was created to fulfill the needs of Web 2.0
applications, and as a result, all of its characteristics are geared toward that goal. Other
commercial systems, on the other hand, necessitate moving beyond the insert-read-update-
delete cycle. Even the most basic queries need extensive programming knowledge, and
integrated BI tools are insufficient.
Performance: Consider the same system as before in terms of performance, measured by recovery
point objectives (RPOs) and recovery time objectives (RTOs). Because it is impossible to parallelize
database file processing beyond a certain point, it will be impossible to achieve customer SLAs
in terms of RTO if per-row processing is sluggish. Due to the
same parallelization limitations, recovery procedures are also impacted. This means that backup
and recovery processing for 100 billion rows over 10 million files must be quick, and a solution
must dramatically minimize per-row and per-file database file processing.
Data Integrity: Verifying data integrity at the block level is another issue with distributed
NoSQL databases. Checksums work for scale-up databases because the restored data is
physically identical to the backup data. The restored data in scale-out databases is semantically
comparable to the backup data, but it is not physically identical. In this situation, we’ll need to
come up with a unique method for identifying semantic equivalence between recovered and
backup data, which will allow us to spot data corruption issues that may arise throughout the
backup and restoration process.
Expertise: Although there are millions of developers throughout the world, most of them are
familiar with RDBMS architecture rather than NoSQL. Many NoSQL developers are still in the
early stages of learning and have yet to reach expert level. While this problem should ease
over time, finding the right developer is currently difficult.
De-duplication: In traditional database systems, de-duplication is simple: blocks with identical
bitwise values are duplicates. In modern NoSQL database storage systems, copies of the same data
are not always bitwise identical, because the sequencing and flushing of changes vary between
nodes. This creates a new challenge for de-duplication.
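The Data Integrity and De-duplication points above both turn on the difference between physical and semantic equality. One possible way to test semantic equivalence, sketched here under simple assumptions (records are small JSON-compatible dictionaries, and the helper names are hypothetical), is to hash a canonical form of each record and compare the sorted fingerprints.

```python
import hashlib
import json

def record_fingerprint(record: dict) -> str:
    """Hash a record's canonical JSON form, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def semantically_equal(backup: list, restored: list) -> bool:
    """Compare two record sets while ignoring record order and key order."""
    return sorted(map(record_fingerprint, backup)) == sorted(map(record_fingerprint, restored))

# Hypothetical data: same content, different physical layout.
backup_rows = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
restored_rows = [{"name": "Ravi", "id": 2}, {"name": "Asha", "id": 1}]
corrupted_rows = [{"name": "Ravi", "id": 2}, {"name": "Asha", "id": 9}]

layout_differs = json.dumps(backup_rows) != json.dumps(restored_rows)
```

A block-level checksum would flag `restored_rows` as different from `backup_rows`, while the fingerprint comparison accepts it and still rejects `corrupted_rows`.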
Because of their increased capabilities/capacity and economies, NoSQL databases are becoming more
popular. However, developers should exercise caution and be mindful of the system’s real limits.
Disadvantages:
Because key-value databases have no standard query language, queries cannot be ported from
one database to another.
The key-value store is not refined: you cannot query the database without a key.
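The limitation above, that data cannot be queried without a key, can be seen in a tiny in-memory sketch. The class and method names are hypothetical and do not correspond to any particular key-value product.

```python
class KeyValueStore:
    """A tiny in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # Direct lookup by key: the only access path the store offers.
        return self._data.get(key, default)

    def scan_for(self, predicate):
        """Without a key, every stored value has to be inspected in turn."""
        return [key for key, value in self._data.items() if predicate(value)]

store = KeyValueStore()
store.put("user:1", {"name": "Asha", "city": "Pune"})
store.put("user:2", {"name": "Ravi", "city": "Delhi"})

by_key = store.get("user:2")
pune_keys = store.scan_for(lambda value: value["city"] == "Pune")
```

The lookup by key is a single dictionary access, while the search by city has to walk through every entry, and it works only because the caller, not the store, knows what the values look like.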
Some examples of key-value databases:
Here are some popular key-value databases which are widely used:
Couchbase: It permits SQL-style querying and text search.
Amazon DynamoDB: One of the most widely used key-value databases, trusted by a large
number of users. It can easily handle a large number of requests every day and it also
provides various security options.
Riak: A distributed key-value database used to develop applications.
Aerospike: It is an open-source, real-time database that handles billions of transactions.
Berkeley DB: It is a high-performance and open-source database providing scalability.
Columnar Data Model of NoSQL
The Columnar Data Model is an important NoSQL data model. NoSQL databases differ from SQL
databases because they use a data model whose structure is different from the row-and-column
table model used by relational database management systems. NoSQL databases have a flexible
schema, are designed to scale horizontally across many servers, and are used for large volumes
of data.
Columnar Data Model of NoSQL :
A relational database stores data in rows and reads it row by row, whereas a column
store is organized as a set of columns. So if someone wants to run analytics on a small number of
columns, those columns can be read directly without consuming memory with unwanted data.
Values within a column are usually of the same type, so they compress more efficiently, which
makes reads faster. Examples of the Columnar Data Model: Cassandra and Apache HBase.
Working of Columnar Data Model:
In the Columnar Data Model, information is organized into columns instead of rows. The columns
work much like tables do in relational databases. This type of data model is more
flexible, as is typical of NoSQL databases. The example below will help in
understanding the Columnar data model:
Row-Oriented Table:
S.No.  Name       Course  Branch       ID
01.    Tanmay     B-Tech  Computer     2
02.    Abhishek   B-Tech  Electronics  5
03.    Samriddha  B-Tech  IT           7
04.    Aditi      B-Tech  E & TC       8
Column-Oriented Table:
S.No.  Name       ID
01.    Tanmay     2
02.    Abhishek   5
03.    Samriddha  7
04.    Aditi      8
S.No.  Course  ID
01.    B-Tech  2
02.    B-Tech  5
03.    B-Tech  7
04.    B-Tech  8
S.No.  Branch       ID
01.    Computer     2
02.    Electronics  5
03.    IT           7
04.    E & TC       8
Columnar Data Model uses the concept of keyspace, which is like a schema in relational models.
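The student example above can be sketched in Python to show why a column layout suits analytics. Each column is held in its own dictionary keyed by ID. This storage choice and the helper names are assumptions made for illustration, not how any particular column store lays out data.

```python
# Each column is stored separately, keyed by the student ID from the tables above.
name_column = {2: "Tanmay", 5: "Abhishek", 7: "Samriddha", 8: "Aditi"}
course_column = {2: "B-Tech", 5: "B-Tech", 7: "B-Tech", 8: "B-Tech"}
branch_column = {2: "Computer", 5: "Electronics", 7: "IT", 8: "E & TC"}

def count_by_value(column):
    """Aggregate over a single column without touching the other columns."""
    counts = {}
    for value in column.values():
        counts[value] = counts.get(value, 0) + 1
    return counts

def row_for(student_id):
    """Rebuild one full row by looking the ID up in every column."""
    return {
        "ID": student_id,
        "Name": name_column[student_id],
        "Course": course_column[student_id],
        "Branch": branch_column[student_id],
    }

course_counts = count_by_value(course_column)
row = row_for(7)
```

Counting students per course reads only the course column, whereas rebuilding a complete row needs one lookup per column, which is why row-at-a-time workloads fit this layout less well.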
Advantages of Columnar Data Model :
Well structured: Because these data models compress well, they are well organized in terms
of storage.
Flexibility: There is a large amount of flexibility, as the columns do not have to look like each
other, which means one can add new and different columns without disrupting the whole
database.
Aggregation queries are fast: Aggregation queries are quite fast because most of the
information they need is stored in a single column. An example would be adding up
the total number of students enrolled in one year.
Scalability: It can be spread across large clusters of machines, even numbering in thousands.
Load Times: Since a table can be loaded in a few seconds, load times are very good.
Disadvantages of Columnar Data Model:
Designing indexing schema: Designing an effective, working schema is difficult and
very time-consuming.
Suboptimal data loading: Incremental data loading is suboptimal and should be avoided where
possible, though this might not be an issue for some users.
Security vulnerabilities: If security is a priority, note that the columnar data model lacks
built-in security features; in that case, one should look into relational databases.
Online Transaction Processing (OLTP): OLTP applications are a poor fit for columnar data
models because of the way data is stored.
Applications of Columnar Data Model:
Columnar Data Model is widely used in various blogging platforms.
It is used in Content management systems like WordPress, Joomla, etc.
It is used in Systems that maintain counters.
It is used in Systems that require heavy write requests.
It is used in Services that have expiring usage.
Aggregate-Oriented Databases in NoSQL
An aggregate-oriented database is a NoSQL database that does not support full ACID
transactions across aggregates; it sacrifices one of the ACID properties. Aggregate-oriented
operations are different from relational database operations. We can perform OLAP operations on
an aggregate-oriented database. Its efficiency is high if the data transactions and interactions
take place within the same aggregate. Several fields of data can be put in an aggregate so that
they can be accessed together. We can manipulate only a single aggregate at a time atomically;
we cannot manipulate multiple aggregates atomically.
Aggregate-oriented databases are classified into four major data models. They are as follows:
Key-value
Document
Column family
Graph-based
Each of the Data models above has its own query language.
Key-value Data Model: Key-value and document databases are strongly aggregate-oriented.
The key-value data model contains the key or ID which is used to access the data of the
aggregate. The aggregate is opaque to the database: it is stored as a big blob of bits that can
be retrieved only with its key or ID, which also keeps its contents hidden from the database.
In the key-value data model, we can store data of any structure and data type. The
advantage of the key-value data model is that sensitive information can be kept inside the
aggregate. The disadvantage is that the database usually has some general size limit, so only
a limited amount of data can be stored in one aggregate.
Document Data Model: In the document data model we can access parts of an aggregate. The
data in this model can be accessed in a flexible manner: we can submit queries to the database
based on the fields in the aggregate. There is a restriction on the structure and data types of
the data to be placed in this data model, because the database must be able to see the
structure of the aggregate.
Column family Data Model: The column family model is also called a two-level map. However
we think about the structure, the Bigtable design has been a model that influenced later
databases such as HBase and Cassandra. Databases with a Bigtable-style data model are often
referred to as column stores. Column-family models divide the aggregate into column families.
The column-family model is a two-level aggregate structure. The first level consists of keys that
act as row identifiers and select the aggregate. The second-level values in the column family
data model are referred to as columns.
In the above example, the row key is 234 which selects the aggregate. Here the row key selects
the column families customer and orders. Each column family contains the columns of data. In the
orders column family, we have the orders placed by the customers.
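The two-level map can be sketched as nested dictionaries. The row key 234 and the column family names customer and orders come from the description above. Every column name and value inside them is a hypothetical placeholder.

```python
# First level: row key -> aggregate. Second level: column family -> columns.
# All column names and values below are hypothetical placeholders.
table = {
    234: {
        "customer": {"name": "Asha", "city": "Pune"},
        "orders": {"order-1": "Laptop", "order-2": "Mouse"},
    }
}

def get_column_family(row_key, family):
    """Select the aggregate by row key, then one column family inside it."""
    return table[row_key][family]

def get_column(row_key, family, column):
    """Read a single column value using both levels of the map."""
    return table[row_key][family][column]

orders = get_column_family(234, "orders")
city = get_column(234, "customer", "city")
```

Reading the orders family does not require touching the customer family, which is the practical benefit of splitting the aggregate into column families.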
Graph Data Model: In a graph data model, the data is stored in nodes that are connected by
edges. This model is preferred for storing a huge amount of complex aggregates and
multidimensional data with many interconnections between them. For example, we can store
Facebook user accounts in the nodes and find the friends of a particular user by following
the edges of the graph.
We can find the friends of a person by observing this graph data model. If there is an edge
between two nodes, then we can say they are friends. We can also consider the indirect links
between the nodes to determine friend suggestions.
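The friend lookup and friend suggestions described above can be sketched with an adjacency map. The user names and edges are hypothetical, and the suggestion rule (friends of friends who are not already friends) is one simple reading of "indirect links".

```python
# Undirected friendship edges between hypothetical users.
edges = [("Asha", "Ravi"), ("Ravi", "Meera"), ("Meera", "Kiran"), ("Asha", "Dev")]

def build_graph(edge_list):
    """Turn the edge list into an adjacency map: user -> set of friends."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def friends(graph, user):
    """Direct friends: nodes one edge away."""
    return graph.get(user, set())

def friend_suggestions(graph, user):
    """Friends of friends who are not already friends with the user."""
    direct = friends(graph, user)
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return candidates - direct - {user}

graph = build_graph(edges)
asha_friends = friends(graph, "Asha")
suggestions = friend_suggestions(graph, "Asha")
```

Here Meera is suggested to Asha because they share Ravi as a friend, while Kiran is two steps further away and is not suggested.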