
Big Data Storage Concepts

Big Data concepts Technology and Architecture

Raghad Joukhadar
2023-2024
Plan
• Introduction
• Cluster computing
• Types of cluster
• Cluster Structure

• Distribution Models
• Sharding
• Data Replication
• Sharding and Replication

• Distributed File System

• Relational and Non-Relational Databases


• RDBMS Databases
• NoSQL Databases
• NewSQL Databases

• Scaling Up and Scaling Out Storage


Introduction
• The big data revolution provides significant improvements to the
data storage architecture.
• Need for a framework for storing data on clusters of commodity
hardware

– Example : Hadoop
• open-source
• allows organizations to effectively
store and analyze large volumes of
data.
Cluster Computing

• A group of loosely coupled
computers that work together
closely, so it can be viewed as a
“single larger and more powerful
virtual computer”.
• The cluster components are
connected together through
local area networks (LANs).
Overview of Cluster computing
• The login node acts as the
gateway into the cluster.
• When users access the cluster
from a public network, they must
first log in to the login node.
• This prevents unauthorized
access to the cluster.
Cluster Benefits
• Scalability,
– by removing nodes or adding additional nodes as per the
demand without hindering the system
• Availability,
– As nodes within the cluster provide backup to each other in the
event of a failure
• Performance,
– Multiple computing resources are connected together in a
cluster increasing the performance
TYPES OF CLUSTER (purpose)
• High Availability Clusters
– Nodes in a highly available cluster must have access to a
shared storage
– If a node becomes inoperative, continuous service is
provided by failing over service from the inoperative cluster
node to another, without administrative intervention
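The failover behaviour described above can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular cluster manager works: the class `HACluster`, its methods, and the node names are all hypothetical, and a real cluster would detect failures with heartbeats rather than an explicit `mark_failed` call.

```python
class HACluster:
    """Toy high-availability cluster: one active node, the rest on standby."""

    def __init__(self, nodes):
        self.healthy = {name: True for name in nodes}  # node name -> is it up?
        self.active = nodes[0]
        self.shared_storage = {}  # storage that every node can reach

    def mark_failed(self, name):
        # Assumption: failure is reported explicitly instead of being detected.
        self.healthy[name] = False

    def handle(self, key, value):
        # Fail over automatically if the active node is inoperative.
        if not self.healthy[self.active]:
            standby = [n for n, up in self.healthy.items() if up]
            if not standby:
                raise RuntimeError("no healthy node left")
            self.active = standby[0]
        # Data goes to shared storage, so it survives the failover.
        self.shared_storage[key] = value
        return self.active


cluster = HACluster(["node-a", "node-b", "node-c"])
served_by_before = cluster.handle("order-1", "paid")
cluster.mark_failed("node-a")
served_by_after = cluster.handle("order-2", "paid")
```

Because both requests write to the shared storage, the data written before the failure is still reachable from the node that takes over.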
TYPES OF CLUSTER cont..
• Load Balancing Cluster
– Distributes incoming requests among multiple nodes running the
same programs or having the same content
– If a node in a load-balancing cluster goes down, the load from that
node is switched over to another node
– Optimize the use of resources, minimize response time
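One common way to distribute requests is round-robin, sketched below. Round-robin is only one possible policy and is an assumption here, as are the class name `RoundRobinBalancer` and the node names; the slides do not name a specific algorithm.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each request to the next node in turn, skipping nodes that are down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.down = set()
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        self.down.add(node)

    def route(self, request):
        # Try each node at most once per request.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node not in self.down:
                return node
        raise RuntimeError("no node available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_round = [balancer.route(f"req-{i}") for i in range(3)]
balancer.mark_down("web-2")
second_round = [balancer.route(f"req-{i}") for i in range(4)]
```

After `web-2` is marked down, its share of the requests is taken over by the remaining nodes.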
TYPES OF CLUSTER (Structure)
• Symmetric
– Each node functions as an
individual computer capable
of running applications.
– Additional machines can be
added as needed.
Cluster Structure
• Asymmetric
– Are a type of cluster
structure in which one
machine acts as the head
node
– it serves as the gateway
between the user and the
remaining nodes.
Distribution Models
• There are several distribution models
– Replication: placing the same set of data over multiple nodes.
– Sharding: placing different sets of data on different nodes
– Sharding & Replication: the two can be used alone or together
Replication

• Replication is the process of creating copies of the same set
of data across multiple servers.
• A copy of a block is called a replica.
• Replication overcomes issues such as:
– when a node crashes, the data stored on that node is lost
– when a node is down for maintenance, its data is not available until
the maintenance process is over.
Data Replication Example

Replication Advantages

• Replication makes the system fault tolerant since the data is
not lost when an individual node fails as the data is
redundant across the nodes.
• Replication increases the data availability as the same copy
of data is available across multiple nodes.
Replication Models (Master-slave)
• Master controls one or more
devices known as slaves
• The flow of control is only
from master to the slaves
• Incoming data are written on
the master node
• Read requests are handled by
slave nodes
• This architecture supports
intensive read requests
• The cluster still suffers from single
point of failure, if the master fails
• The writes are limited to the
maximum capacity that a master
can handle
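The write and read paths of the master-slave model can be sketched as follows. This is a minimal illustration: the class `MasterSlaveStore` is hypothetical, and it assumes synchronous replication (every write is copied to all slaves before returning), which real systems often relax.

```python
import random


class MasterSlaveStore:
    """Writes go to the master; reads are served by the slaves."""

    def __init__(self, slave_count, seed=0):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def write(self, key, value):
        # All incoming data is written on the master first...
        self.master[key] = value
        # ...then pushed to every slave (assumption: synchronous replication).
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Any slave can answer a read, which spreads read-heavy load.
        return self._rng.choice(self.slaves).get(key)


ms_store = MasterSlaveStore(slave_count=2)
ms_store.write("user:1", "Alice")
value = ms_store.read("user:1")
```

Adding slaves increases read capacity, but every write still passes through the single `master` dictionary, which mirrors the write limit and single point of failure noted above.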
Replication Models (Peer-Peer)
• All the nodes have the same
responsibility and are at the
same level
• Either of the devices involved
in the process can initiate
communication
• The nodes consume as well
as donate the resources
• Reliability is improved through
replication
Sharding
• Partitioning very large data sets into smaller and easily
manageable chunks called shards.
• The shards are stored by distributing them across multiple
machines called nodes.
• No two shards of the same file are stored in the same node
• Shards spread across multiple nodes collectively constitute the
data set.
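A simple way to decide which shard a record belongs to is to hash its key. The sketch below assumes hash-based sharding (range-based or directory-based schemes are also possible); the function names `shard_for` and `build_shards` and the sample records are hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def build_shards(records, shard_count):
    """Split one large data set into shard_count smaller dictionaries."""
    shards = [dict() for _ in range(shard_count)]
    for key, value in records.items():
        shards[shard_for(key, shard_count)][key] = value
    return shards


records = {f"customer-{i}": {"id": i} for i in range(100)}
shards = build_shards(records, shard_count=4)

# The shards together make up the original data set.
merged = {}
for shard in shards:
    merged.update(shard)
```

`hashlib.md5` is used here only because it gives the same result on every run; it is not a security choice. Each record lands in exactly one shard, and merging the shards gives back the full data set.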
Sharding Examples
Sharding Advantages

• Scalability where new shards can be added at runtime
without shutting down the application for maintenance
• Improves the fault tolerance of the system as the failure of a
node affects only the block of the data stored in that
particular node.
Sharding & Replication

• With sharding alone, when a node goes down, the data stored on that
node is lost.
• So sharding alone provides only limited fault tolerance.
• Sharding and replication can be combined to make the system
fault tolerant and highly available.
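The effect of combining the two models can be sketched by placing each shard on more than one node. The placement rule below (consecutive nodes in a ring) is an assumption chosen for simplicity, and the names `place_shards` and `readable_shards` are hypothetical.

```python
def place_shards(shard_ids, nodes, replication_factor):
    """Assign each shard to replication_factor distinct nodes."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, shard in enumerate(shard_ids):
        # Assumption: replicas go to consecutive nodes, wrapping around.
        placement[shard] = [
            nodes[(i + r) % len(nodes)] for r in range(replication_factor)
        ]
    return placement


def readable_shards(placement, failed_nodes):
    """Shards that still have at least one replica on a healthy node."""
    return {
        shard
        for shard, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node-a", "node-b", "node-c"]
shard_ids = ["shard-0", "shard-1", "shard-2"]

replicated = place_shards(shard_ids, nodes, replication_factor=2)
sharded_only = place_shards(shard_ids, nodes, replication_factor=1)

survivors = readable_shards(replicated, failed_nodes={"node-b"})
lost = set(shard_ids) - readable_shards(sharded_only, failed_nodes={"node-b"})
```

With two replicas per shard, losing `node-b` leaves every shard readable; with sharding alone, the shard that lived on `node-b` is gone.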
Sharding & Replication Example

Distributed File System (DFS)
• A file system is a way of storing and organizing data on storage devices
(HD, DVDs, ...) and keeping track of the files stored on them.
• The file is the smallest unit of storage defined by the file system to hold data.
• File systems store and retrieve data for the application to run effectively and
efficiently on the operating systems.
• A distributed file system stores the files across cluster nodes and allows the
clients to access the files from the cluster.
• Files are distributed across the nodes, but logically they appear as if they are
residing on the client's local machine.
• Since a DFS provides access to more than one client simultaneously, the
server organizes updates for the clients to access the current updated
version of the file, and no version conflicts arise.
• Big data widely adopts a distributed file system known as Hadoop Distributed
File System (HDFS)
DFS Key concepts

• Data replication where the copies of data are distributed on
multiple cluster nodes so that there is no single point of failure,
which increases the reliability.
• The client can communicate with any of the closest available
nodes to reduce latency and network traffic
• Fault tolerance is achieved through data replication as the data
will not be lost in case of node failure due to the redundancy in
the data across nodes.
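Choosing the closest available replica can be sketched as picking the reachable node with the lowest latency. This is an illustration only: the function `closest_replica`, the node names, and the latency figures are made up, and real distributed file systems typically use network topology rather than measured latency.

```python
def closest_replica(replica_latency_ms, available):
    """Return the available replica node with the lowest latency."""
    candidates = {
        node: latency
        for node, latency in replica_latency_ms.items()
        if node in available
    }
    if not candidates:
        raise LookupError("no replica available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies from one client to the nodes holding a block.
latency = {"node-a": 12.0, "node-b": 3.5, "node-c": 40.0}

chosen = closest_replica(latency, available={"node-a", "node-b", "node-c"})
fallback = closest_replica(latency, available={"node-a", "node-c"})
```

When the nearest node is unavailable, the client falls back to the next closest replica instead of failing, which is where replication and fault tolerance meet.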
Relational and Non-Relational Databases
Relational Databases
• Organize data into tables of rows (records) & columns (attributes | fields)
• Unsuitable when organizations collect vast amount of customer
databases, transactions, and other data, which may not be structured to
fit into relational databases.
Non-Relational
• This has led to the evolution of non-relational databases, which are
schema-less.
• NoSQL is a non-relational database
Properties of RDBMS Databases

• Is vertically scalable (by increasing server hardware power)
• Exhibits ACID (atomicity, consistency, isolation, durability) properties
• Support data that adhere to a specific schema
• Can no longer keep pace with the volume, velocity, and variety of data being
generated and consumed
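Atomicity, the "A" in ACID, can be seen with Python's built-in `sqlite3` module. This is a small sketch: the `accounts` table, its `CHECK` constraint, and the `transfer` helper are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the second transfer the credit to `bob` is executed first, but the debit violates the `CHECK` constraint, so the whole transaction is rolled back and `bob` does not keep the credit.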
Properties of NoSQL Databases

• Includes all non-relational databases
• Exhibits the BASE (basically available, soft state, eventually consistent) model
• Are not appropriate for implementing large transactions
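The "eventually consistent" part of BASE can be sketched with replicas that receive updates late. The class `EventuallyConsistentStore` and its explicit `sync` step are hypothetical simplifications; real NoSQL systems propagate updates in the background.

```python
class EventuallyConsistentStore:
    """Replica 0 accepts writes; other replicas catch up when sync() runs."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica index, key, value) not yet propagated

    def write(self, key, value):
        # The write is acknowledged as soon as one replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica):
        # Reads are always answered, even if the replica is out of date.
        return self.replicas[replica].get(key)

    def sync(self):
        # Assumption: propagation is triggered explicitly for the demo.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


base_store = EventuallyConsistentStore(replica_count=2)
base_store.write("cart", ["book"])
stale = base_store.read("cart", replica=1)  # replica 1 has not caught up yet
base_store.sync()
fresh = base_store.read("cart", replica=1)
```

The store stays available for reads throughout, at the cost of briefly returning stale data from the lagging replica.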
Properties of NewSQL Databases

• Aim to combine the scalability and performance benefits of NoSQL
databases with the familiar relational data model and ACID transaction
guarantees of traditional SQL databases
• Horizontally scalable
• Fault tolerant
• Support relational data model with three layers: the administrative,
transactional, and storage layer.
• Target applications: those that execute the same queries repeatedly with
different inputs and have a large number of transactions
NewSQL Databases comparison

          High performance  Fault tolerant  Distributed  In-memory  Scale-out
Clustrix  yes               yes             yes          -          -
NuoDB     -                 yes             yes          -          yes
VoltDB    yes               yes             yes          yes        yes
MemSQL    yes               yes             yes          yes        -

Scaling up vs. Scaling out

[Figure: scaling up (vertical) compared with scaling out (horizontal)]
THANK YOU

ANY
QUESTIONS?
