You are on page 1of 15

CCS334 BIG DATA ANALYTICS

UNIT II NOSQL DATA MANAGEMENT


Introduction to NoSQL – aggregate data models – key-value and document data models –
relationships – graph databases – schemaless databases – materialized views – distribution models - master-
slave replication – consistency - Cassandra – Cassandra data model – Cassandra examples – Cassandra
clients

Introduction to NoSQL
What is NoSQL Database
NoSQL Database is used to refer a non-SQL or non relational database. It provides a mechanism for
storage and retrieval of data other than tabular relations model used in relational databases. NoSQL database
doesn't use tables for storing data. It is generally used to store big data and real-time web applications.
Why NoSQL?
NoSQL databases are used in nearly every industry. Use cases range from the highly critical (e.g.,
storing financial data and healthcare records) to the more fun and frivolous (e.g., storing IoT readings from a
smart kitty litter box).
When should NoSQL be used?
When deciding which database to use, decision-makers typically find one or more of the following factors
lead them to selecting a NoSQL database:
 Fast-paced Agile development
 Storage of structured and semi-structured data
 Huge volumes of data
 Requirements for scale-out architecture
 Modern application paradigms like microservices and real-time streaming
Key Features of NoSQL :
 Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing data
structures without the need for migrations or schema alterations.
 Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a database
cluster, making them well-suited for handling large amounts of data and high levels of traffic.
 Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where
data is stored in semi-structured format, such as JSON or BSON.
1
 Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data is
stored as a collection of key-value pairs.
 Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model, where data is
organized into columns instead of rows.
 Distributed and high availability: NoSQL databases are often designed to be highly available and to
automatically handle node failures and data replication across multiple nodes in a database cluster.
 Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic
manner, with support for multiple data types and changing data structures.
 Performance: NoSQL databases are optimized for high performance and can handle a high volume of
reads and writes, making them suitable for big data and real-time applications.
Types of NoSQL databases
Over time, four major types of NoSQL databases emerged: document databases, key-value databases,
wide-column stores, and graph databases.
 Document databases store data in documents similar to JSON (JavaScript Object Notation)
objects. Each document contains pairs of fields and values. The values can typically be a
variety of types including things like strings, numbers, booleans, arrays, or objects.
 Key-value databases are a simpler type of database where each item contains keys and values.
 Wide-column stores store data in tables, rows, and dynamic columns.
 Graph databases store data in nodes and edges. Nodes typically store information about people,
places, and things, while edges store information about the relationships between the nodes.
Advantages of NoSQL: There are many advantages of working with NoSQL databases such as MongoDB
and Cassandra. The main advantages are high scalability and high availability.
 High scalability : NoSQL databases use sharding for horizontal scaling. Partitioning of data and placing it
on multiple machines in such a way that the order of the data is preserved is sharding. Vertical scaling
means adding more resources to the existing machine whereas horizontal scaling means adding more
machines to handle the data. Vertical scaling is not that easy to implement but horizontal scaling is easy to
implement. Examples of horizontal scaling databases are MongoDB, Cassandra, etc. NoSQL can handle a
huge amount of data because of scalability, as the data grows NoSQL scale itself to handle that data in an
efficient manner.
 Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which means
that they can accommodate dynamic changes to the data model. This makes NoSQL databases a good fit
for applications that need to handle changing data requirements.
2
 High availability : Auto replication feature in NoSQL databases makes it highly available because in case
of any failure data replicates itself to the previous consistent state.
 Scalability: NoSQL databases are highly scalable, which means that they can handle large amounts of data
and traffic with ease. This makes them a good fit for applications that need to handle large amounts of
data or traffic
 Performance: NoSQL databases are designed to handle large amounts of data and traffic, which means
that they can offer improved performance compared to traditional relational databases.
 Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational databases, as
they are typically less complex and do not require expensive hardware or software.
 Agility: Ideal for agile development.
Disadvantages of NoSQL: NoSQL has the following disadvantages.
 Lack of standardization : There are many different types of NoSQL databases, each with its own unique
strengths and weaknesses. This lack of standardization can make it difficult to choose the right database
for a specific application
 Lack of ACID compliance : NoSQL databases are not fully ACID-compliant, which means that they do
not guarantee the consistency, integrity, and durability of data. This can be a drawback for applications
that require strong data consistency guarantees.
 Narrow focus : NoSQL databases have a very narrow focus as it is mainly designed for storage but it
provides very little functionality. Relational databases are a better choice in the field of Transaction
Management than NoSQL.
 Open-source : NoSQL is open-source database. There is no reliable standard for NoSQL yet. In other
words, two database systems are likely to be unequal.
 Lack of support for complex queries : NoSQL databases are not designed to handle complex queries,
which means that they are not a good fit for applications that require complex data analysis or reporting.
 Lack of maturity : NoSQL databases are relatively new and lack the maturity of traditional relational
databases. This can make them less reliable and less secure than traditional databases.
 Management challenge : The purpose of big data tools is to make the management of a large amount of
data as simple as possible. But it is not so easy. Data management in NoSQL is much more complex than
in a relational database. NoSQL, in particular, has a reputation for being challenging to install and even
more hectic to manage on a daily basis.
 GUI is not available : GUI mode tools to access the database are not flexibly available in the market.

3
 Backup : Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no
approach for the backup of data in a consistent manner.
 Large document size : Some database systems like MongoDB and CouchDB store data in JSON format.
This means that documents are quite large (BigData, network bandwidth, speed), and having descriptive
key names actually hurts since they increase the document size.
CAP Theorem
The CAP theorem, originally introduced as the CAP principle, can be used to explain some of the competing
requirements in a distributed system with replication. It is a tool used to make system designers aware of the
trade-offs while designing networked shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with replicated data:
consistency (among replicated copies), availability (of the system for read and write operations) and partition
tolerance (in the face of the nodes in the system being partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable properties – consistency,
availability, and partition tolerance at the same time in a distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support two of the following three
properties:
Consistency –
Consistency means that the nodes will have the same copies of a replicated data item visible for various
transactions. A guarantee that every node in a distributed cluster returns the same, most recent and a
successful write. Consistency refers to every client having the same view of the data. There are various types
of consistency models. Consistency in CAP refers to sequential consistency, a very strong form of
consistency.
Availability –
Availability means that each read or write request for a data item will either be processed successfully or will
receive a message that the operation cannot be completed. Every non-failing node returns a response for all
the read and write requests in a reasonable amount of time. The key word here is “every”. In simple terms,
every node (on either side of a network partition) must be able to respond in a reasonable amount of time.
Partition Tolerance –
Partition tolerance means that the system can continue operating even if the network connecting the nodes has
a fault that results in two or more partitions, where the nodes in each partition can only communicate among
each other. That means, the system continues to function and upholds its consistency guarantees in spite of

4
network partitions. Network partitions are a fact of life. Distributed systems guaranteeing partition tolerance
can gracefully recover from partitions once the partition heals.

The CAP theorem states that distributed databases can have at most two of the three properties: consistency,
availability, and partition tolerance. As a result, database systems prioritize only two properties at a time.

The following figure represents which database systems prioritize specific properties at a given time:

CAP theorem with databases examples

CA(Consistency and Availability)-


 The system prioritizes availability over consistency and can respond with possibly stale data.
 Example databases: Cassandra, CouchDB, Riak, Voldemort.

AP(Availability and Partition Tolerance)-


 The system prioritizes availability over consistency and can respond with possibly stale data.
 The system can be distributed across multiple nodes and is designed to operate reliably even in the
face of network partitions.
 Example databases: Amazon DynamoDB, Google Cloud Spanner.

CP(Consistency and Partition Tolerance)-


 The system prioritizes consistency over availability and responds with the latest updated data.
 The system can be distributed across multiple nodes and is designed to operate reliably even in the
face of network partitions.

5
 Example databases: Apache HBase, MongoDB, Redis.

It’s important to note that these database systems may have different configurations and settings that
can change their behavior with respect to consistency, availability, and partition tolerance. Therefore, the
exact behavior of a database system may depend on its configuration and usage.

Aggregate Data Models


What are Aggregate Data Models in NoSQL?

Aggregate means a collection of objects that are treated as a unit. In NoSQL Databases, an aggregate is a
collection of data that interact as a unit. Moreover, these units of data or aggregates of data form the
boundaries for the ACID operations.

Aggregate Data Models in NoSQL make it easier for the Databases to manage data storage over the clusters
as the aggregate data or unit can now reside on any of the machines. Whenever data is retrieved from the
Database all the data comes along with the Aggregate Data Models in NoSQL.

Aggregate Data Models in NoSQL don’t support ACID transactions and sacrifice one of the ACID properties.
With the help of Aggregate Data Models in NoSQL, you can easily perform OLAP operations on the
Database.

Types of Aggregate Data Models in NoSQL Databases

The Aggregate Data Models in NoSQL are majorly classified into 4 Data Models listed below:

Key-Value

 A key-value database consists of individual records organized via key-value pairs. In this model, keys
and values can be any type of data, ranging from numbers to complex objects. However, keys must be
unique. This means this type of database is best when data is attributed to a unique key, like an ID
number. Ideally, the data is also simple, and we are looking to prioritize fast queries over fancy
features.

Key features of the key-value store:


 Simplicity.
 Scalability.
 Speed.

Use Cases:

 These Aggregate Data Models in NoSQL Database are used for storing the user session data.
 Key Value-based Data Models are used for maintaining schema-less user profiles.
 It is used for storing user preferences and shopping cart data.
 For example, let’s say we wanted to store shopping cart information for customers who shop in an e-
commerce store. Our key-value database might look like this:
6
Key Value
customer-123 { “address”: “…”, name: “…”, “preferences”: {…} }
customer-456 { “address”: “…”, name: “…”, “preferences”: {…} }
 Amazon DynamoDB and Redis are popular options for developers looking to work with key-value
databases.

Document

 A document-based (also called document-oriented) database consists of data stored in hierarchical


structures. Some supported document formats include JSON, BSON, XML, and YAML. The
document-based model is considered an extension of the key-value database and provides querying
capabilities not solely based on unique keys. Documents are considered very flexible and can evolve
to fit an application’s needs. They can even model relationships!

Key features of documents database:


 Flexible schema: Documents in the database has a flexible schema. It means the documents in the
database need not be the same schema.
 Faster creation and maintenance: the creation of documents is easy and minimal maintenance is
required once we create the document.
 No foreign keys: There is no dynamic relationship between two documents so documents can be
independent of one another. So, there is no requirement for a foreign key in a document database.
 Open formats: To build a document we use XML, JSON, and others

Use Cases:

 Document Data Models are widely used in E-Commerce platforms


 It is used for storing data from content management systems.
 Document Data Models are well suited for Blogging and Analytics platforms.
 For example, let’s say we wanted to store product information for customers who shop in our e-
commerce store. A products document might look like this:

[
{
"product_title": "Codecademy SQL T-shirt",
"description": "SQL > NoSQL",
"link": "https://shop.codecademy.com/collections/student-swag/products/sql-tshirt"
"shipping_details": {
"weight": 350,
"width": 10,
"height": 10,
"depth": 1
7
},
"sizes": ["S", "M", "L", "XL"],
"quantity": 101010101010,
"pricing": {
"price": 14.99
}
}
]

 MongoDB is a popular option for developers looking to work with a document database.

Graph

 A graph database stores data using a graph structure. In a graph structure, data is stored in individual
nodes (also called vertices) and establishes relationships via edges (also called links or lines). The
advantage of the relationships built using a graph database as opposed to a relational database is that
they are much simpler to set up, manage, and query.

Key features of graph database:


 In a graph-based database, it is easy to identify the relationship between the data by using the links.
 The Query’s output is real-time results.
 The speed depends upon the number of relationships among the database elements.
 Updating data is also easy, as adding a new node or edge to a graph database is a straightforward
task that does not require significant schema changes.

Use Cases:

 Graph-based Data Models are used in social networking sites to store interconnections.
 It is used in fraud detection systems.
 This Data Model is also widely used in Networks and IT operations.
 For example, let’s say we wanted to build a recommendation engine for our e-commerce store. We
could establish relationships between similar items our customers searched for to create
recommendations.

8

 In the graph above, we can see that there are four nodes: “Neo”, “Hiking”, “Cameras”, and “Hiking
Camera Backpack”. Because the user, “Neo”, searched for “Hiking” and “Cameras”, there are edges
connecting all 3 nodes. More edges are created after the search, linking a new node, “Hiking Camera
Backpack”.
 Neo4j is a popular option for developers looking to work with a graph database.

Column Oriented
 A column-oriented NoSQL database stores data similar to a relational database. However, instead of
storing data as rows, it is stored as columns. Column-oriented databases aim to provide faster read
speeds by being able to quickly aggregate data for a specific column.

Key features of columnar oriented database:


 Scalability.
 Compression.
 Very responsive.

Use Cases:

 Column Family Data Models are used in systems that maintain counters.
 These Aggregate Data Models in NoSQL are used for services that have expiring usage.
 It is used in systems that have heavy write requests.
 For example, take a look at the following e-commerce database of products:

9

 If we wanted to analyze the total sales for all the products, all we would need to do is aggregate data
from the sales column. This is in contrast to a relational model that would have to pull data from each
row. We would also be pulling adjacent data (like size information in the above example) that isn’t
relevant to our query.

Schemaless Database
A schemaless database, like MongoDB, does not have these up-front constraints, mapping to a more ‘natural’
database. Even when sitting on top of a data lake, each document is created with a partial schema to aid
retrieval. Any formal schema is applied in the code of your applications; this layer of abstraction protects the
raw data in the NoSQL database and allows for rapid transformation as your needs change.

Any data, formatted or not, can be stored in a non-tabular NoSQL type of database. At the same time, using
the right tools in the form of a schemaless database can unlock the value of all of your structured and
unstructured data types.

How does a schemaless database work?

In schemaless databases, information is stored in JSON-style documents which can have varying sets of fields
with different data types for each field. So, a collection could look like this:
{
name : “Joe”, age : 30, interests : ‘football’ }
{
name : “Kate”, age : 25
}
As you can see, the data itself normally has a fairly consistent structure. With the schemaless MongoDB
database, there is some additional structure — the system namespace contains an explicit list of collections
and indexes. Collections may be implicitly or explicitly created — indexes must be explicitly declared.
What are the benefits of using a schemaless database?
 Greater flexibility over data types
By operating without a schema, schemaless databases can store, retrieve, and query any data type
— perfect for big data analytics and similar operations that are powered by unstructured data. Relational
databases apply rigid schema rules to data, limiting what can be stored.
 No pre-defined database schemas
The lack of schema means that your NoSQL database can accept any data type — including those
that you do not yet use. This future-proofs your database, allowing it to grow and change as your data-
driven operations change and mature.

10
 No data truncation
A schemaless database makes almost no changes to your data; each item is saved in its own
document with a partial schema, leaving the raw information untouched. This means that every detail is
always available and nothing is stripped to match the current schema. This is particularly valuable if your
analytics needs to change at some point in the future.
 Suitable for real-time analytics functions
With the ability to process unstructured data, applications built on NoSQL databases are better able
to process real-time data, such as readings and measurements from IoT sensors. Schemaless databases are
also ideal for use with machine learning and artificial intelligence operations, helping to accelerate
automated actions in your business.
 Enhanced scalability and flexibility
With NoSQL, you can use whichever data model is best suited to the job. Graph databases allow
you to view relationships between data points, or you can use traditional wide table views with an
exceptionally large number of columns. You can query, report, and model information however you
choose. And as your requirements grow, you can keep adding nodes to increase capacity and power.
When a record is saved to a relational database, anything (particularly metadata) that does not match the
schema is truncated or removed. Deleted at write, these details cannot be recovered at a later point in time.

Cassandra
Apache Cassandra is an open-source no SQL database that is used for handling big data. Apache
Cassandra has the capability to handle structure, semi-structured, and unstructured data. Apache Cassandra
was originally developed at Facebook after that it was open-sourced in 2008 and after that, it become one
of the top-level Apache projects in 2010.

Features of Cassandra:

1. it is scalable.
2. it is flexible (can accept structured , semi-structured and unstructured data).
3. it has transaction support as it follows ACID properties.
4. it is highly available and fault tolerant.
5. it is open source

5. it is open source.
Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is detected
that some of the nodes responded with an out-of-date value, Cassandra will return the most recent value to the
client. After returning the most recent value, Cassandra performs a read repair in the background to update
the stale values.
The following figure shows a schematic view of how Cassandra uses data replication among the nodes in a
cluster to ensure no single point of failure.

11
Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with each
other and detect any faulty nodes in the cluster.

CAP theorem of cassandra


Apache Cassandra is a highly scalable, distributed database that strictly follows the principle of the CAP
(Consistency Availability and Partition tolerance) theorem.

Figure-2: CAP Theorem


In Apache Cassandra, there is no master-client architecture. It has a peer-to-peer architecture. In Apache
Cassandra, we can create multiple copies of data at the time of keyspace creation. We can simply define
replication strategy and RF (Replication Factor) to create multiple copies of data.

12
Architecture of Apache Cassandra:
In this section we will describe the following component of Apache Cassandra.
Basic Terminology:
 Node
 Data center
 Cluster

Operations:
 Read Operation
 Write Operation

Storage Engine:
 CommitLog
 Memtables
 SSTables

Basic Terminology:
1. Node:
Node is the basic component in Apache Cassandra. It is the place where actually data is stored. For
Example:As shown in diagram node which has IP address 10.0.0.7 contain data (keyspace which contain
one or more tables).

Figure – Node
2. Data Center:
Data Center is a collection of nodes.
For example:
DC – N1 + N2 + N3 ….
DC: Data Center
N1: Node 1
N2: Node 2
N3: Node 3
3. Cluster:
It is the collection of many data centers.
For example:
C = DC1 + DC2 + DC3….
C: Cluster
DC1: Data Center 1
DC2: Data Center 2
DC3: Data Center 3

13
Figure – Node, Data center, Cluster
Operations:
1. Read Operation:
In Read Operation there are three types of read requests that a coordinator can send to a replica. The node
that accepts the write requests called coordinator for that particular operation.
 Step-1: Direct Request:
In this operation coordinator node sends the read request to one of the replicas.
 Step-2: Digest Request:
In this operation coordinator will contact to replicas specified by the consistency level. For Example:
CONSISTENCY TWO; It simply means that Any two nodes in data center will acknowledge.
 Step-3: Read Repair Request:
If there is any case in which data is not consistent across the node then background Read Repair
Request initiated that makes sure that the most recent data is available across the nodes.
2. Write Operation:
 Step-1:
In Write Operation as soon as we receives request then it is first dumped into commit log to make sure
that data is saved.
 Step-2:
Insertion of data into table that is also written in MemTable that holds the data till it’s get full.
 Step-3:
If MemTable reaches its threshold then data is flushed to SS Table.

14
Figure – Write Operation in Cassandra
Storage Engine:
1. Commit log:
Commit log is the first entry point while writing to disk or memTable. The purpose of commit log in
apache Cassandra is to server sync issues if a data node is down.
2. Mem-table:
After data written in Commit log then after that data is written in Mem-table. Data is written in Mem-
table temporarily.
3. SSTable:
Once Mem-table will reach a certain threshold then data will flushed to the SSTable disk file.

Note:
Refer these topics in technical publication book.
 Cassandra data models and Cassandra client
 Consistency
 Distribution Models

15

You might also like