
COMP6725 - Big Data Technologies

TOPIC 5
HADOOP STORAGE LAYER
LEARNING OUTCOMES

At the end of this session, students will be able to:


o LO1. Describe big data architecture layers and processing concepts
OUTCOMES

Students are able to describe big data architecture layers and processing concepts
OUTLINE

1. Hadoop Distributed File System (HDFS)


2. NoSQL
3. NoSQL (MongoDB)
4. NoSQL (HBase)
5. NoSQL (GraphDB)
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HDFS DEFINITION

o Hadoop Distributed File System (HDFS) is a distributed file system (DFS) that
runs on large clusters and provides high-throughput access to data.
o HDFS stores each file as a sequence of blocks.
o The blocks of each file are replicated on multiple machines in a cluster to
provide fault tolerance.
HDFS CHARACTERISTICS

o Scalable Storage for Large Files


• HDFS is designed to store large files.
• HDFS breaks the large files into chunks or blocks.
o Replication
• Data chunks or blocks are replicated by HDFS to multiple machines in a
cluster.
o Streaming Data Access
• HDFS has been designed for streaming data access patterns and provides
high-throughput streaming reads and writes.
o File Appends
• Recent versions of HDFS have introduced the file append capability.
HDFS ARCHITECTURE

o HDFS has two types of nodes: Namenode and Datanode

Pic 5.1. HDFS architecture


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
HDFS ARCHITECTURE: NAMENODE

o Namenode manages the filesystem namespace.


o All the filesystem metadata is stored in the Namenode.
o Namenode executes the read and write operations while the data is transferred
directly to/from the Datanodes.
o Namenode stores the filesystem meta-data and the mapping of data blocks to
the Datanodes.
o Namenode stores this information in two files: the fsimage file and the edits file.
• fsimage contains a complete snapshot of the filesystem meta-data.
• edits stores the incremental updates to the meta-data.
HDFS ARCHITECTURE: SECONDARY
NAMENODE
o Secondary Namenode takes responsibility for applying the updates in the edits
file to the fsimage file, since the edits file on the Namenode keeps growing in
size over time.
o This process is called checkpointing.
o When checkpointing begins, the Secondary Namenode downloads the fsimage
and edits files from the Namenode to a checkpoint directory on the Secondary
Namenode. The Secondary Namenode then applies the edits to the fsimage file
and creates a new fsimage file.
o The new fsimage file is then uploaded by Secondary Namenode to the
Namenode.
HDFS ARCHITECTURE:
DATANODE
o Datanodes store the data blocks and serve read and write requests.
o Datanodes periodically send heartbeat messages and block reports to the
Namenode.
• Heartbeat messages tell the Namenode that a Datanode is still alive.
• Block reports contain information on the blocks within a Datanode.
HDFS ARCHITECTURE: DATA BLOCKS

o Data blocks are replicated on the Datanodes and by default three replicas are
created.
o The placement of those three replicas is determined by a rack-aware placement
policy, which ensures reliability and availability of the blocks.
• The first replica is placed on a node on the local rack.
• The second replica is placed on a node on a remote rack.
• The third replica is placed on a different node on the same remote rack.
o Placing two replicas on different nodes within the same rack also minimizes the
network traffic between racks.
HDFS READ PATH

Pic 5.2. HDFS read path


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
HDFS READ PATH (CONT)

o The read process begins with the client sending a request to Namenode to obtain
the location of the datablocks for a file.
o The Namenode checks if the file exists and whether the client has sufficient
permission.
o The Namenode responds with the data block locations sorted by distance to the
client to reduce the network traffic.
o During the read process, if a replica becomes unavailable, the client can read
another replica on different Datanode.
HDFS WRITE PATH

Pic 5.3. HDFS write path


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
HDFS WRITE PATH (CONT)

o The write process begins with the client sending a request to the Namenode to
create a new file in the filesystem namespace.
o The Namenode checks whether the client has sufficient permission and whether
the file already exists in the filesystem.
o The Namenode responds by sending an output stream object.
o The client writes the data to the output stream object, which splits the data into
packets.
o The data packets are consumed from a data queue by a separate thread, which
requests the Namenode to allocate new blocks on the Datanodes.
o The Namenode responds with the locations of the new blocks.
o The client then establishes direct connections to the Datanodes on which the
data are to be replicated, forming a replication pipeline.
HDFS WRITE PATH (CONT)

o The data packets consumed from the data queue are written to the first Datanode
in the pipeline, which writes the data to the second Datanode, and so on.
o Once the packets are successfully written, each Datanode in the pipeline sends an
acknowledgement.
o The client keeps track of which data packets have been acknowledged by the
Datanodes.
o This writing process continues until the block size is reached. Upon reaching it,
the client requests the Namenode to allocate a set of new blocks on the
Datanodes.
o The client then streams the packets to the Datanodes.
o The process repeats till all data packets are written and acknowledged. Finally,
the client closes the output stream and sends a request to the Namenode to
close the file.
HDFS: COMMAND LINE EXAMPLES
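
The original slides showed screenshots of command line sessions. As a rough sketch, typical interactions with HDFS use the hdfs dfs commands (the paths and file names here are hypothetical):

    # Create a directory on HDFS and copy a local file into it
    hdfs dfs -mkdir -p /user/hadoop/data
    hdfs dfs -put localfile.txt /user/hadoop/data/

    # List directory contents and display a file
    hdfs dfs -ls /user/hadoop/data
    hdfs dfs -cat /user/hadoop/data/localfile.txt

    # Copy a file from HDFS back to the local filesystem, then remove it from HDFS
    hdfs dfs -get /user/hadoop/data/localfile.txt copy.txt
    hdfs dfs -rm /user/hadoop/data/localfile.txt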
HDFS: PYTHON EXAMPLES
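
The original slide showed a screenshot. As a minimal sketch, HDFS can be accessed from Python via its WebHDFS REST interface, here using the third-party hdfs package (the namenode URL, port, user name, and paths are assumptions; the original slides may have used a different library):

    # Minimal sketch using the third-party "hdfs" package (WebHDFS client).
    # The namenode URL, user, and paths below are assumptions.
    from hdfs import InsecureClient

    client = InsecureClient('http://namenode:9870', user='hadoop')

    # Write a string to a new file on HDFS
    client.write('/user/hadoop/data/hello.txt', data='Hello HDFS\n', overwrite=True)

    # Read the file back
    with client.read('/user/hadoop/data/hello.txt') as reader:
        print(reader.read().decode('utf-8'))

    # List the files in a directory
    print(client.list('/user/hadoop/data'))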
HDFS: WEB INTERFACE EXAMPLES

Pic 5.4. Browsing files on HDFS using web interface


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
HDFS: WEB INTERFACE EXAMPLES

Pic 5.5. Download a file from HDFS using web interface


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
FEATURES OF HDFS

o Cost-Effective
HDFS is an open-source storage platform and does not require high-end
hardware for storage.
o Distributed Storage
HDFS splits input files into blocks, each of size 64 MB by default, and stores the
blocks across the Datanodes in the cluster. Unlike a disk filesystem, a file smaller
than a block does not occupy a full block's worth of underlying storage.
o Data Replication
HDFS by default makes three copies of all the data blocks and stores them in
different nodes in the cluster. If any node crashes, the nodes carrying the copies
of the lost data are identified and the data is retrieved from them.
NOSQL
NOSQL DEFINITION

o NoSQL databases are non-relational databases that have better horizontal
scaling capability and improved performance for big data, at the cost of less
rigorous consistency models.
o NoSQL is commonly used for applications in which the scale of data involved is
massive and the data may not be structured.
o Unlike relational databases, NoSQL databases don't have a strict schema. A
record can be in the form of key-value pairs or documents.
NOSQL TYPES

o NoSQL databases are classified based on the data storage model, or the type of
records they can store:
• Key-value Database
• Document Database
• Column Family / Store Database
• Graph Database
KEY-VALUE DATABASE

Definition
o Key-value databases are NoSQL databases that store data in the form of
key-value pairs. The keys uniquely identify the values stored in the database.
o Most key-value databases have distributed architectures comprising multiple
storage nodes. The data is partitioned across the storage nodes by key.
o Key-value databases use a hash function to determine the partition for each key
(see the sketch after this list).
o Key-value databases support many types of values, such as strings, integers,
floats, binary large objects (BLOBs), etc.
o Key-value databases don't have a fixed schema or constraints, unlike relational
databases.
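
As an illustration of how a hash function maps keys to storage nodes, here is a simplified sketch (not any particular product's implementation; real systems typically use consistent hashing so that adding a node does not remap most keys):

    # Simplified sketch of hash-based key partitioning across storage nodes.
    import hashlib

    NODES = ['node0', 'node1', 'node2']  # hypothetical storage nodes

    def partition(key):
        # Hash the key and map the digest onto one of the nodes
        digest = hashlib.md5(key.encode('utf-8')).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    print(partition('user:1001'))  # prints the node that stores this key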
KEY-VALUE DATABASE

Definition (cont.)
o Unlike relational databases, which provide a specialized query language (SQL),
key-value databases only provide basic querying and searching capabilities.
o Key-value databases are suited for applications for which the ability to store and
retrieve the data in a fast and efficient manner is more important than imposing
structure or constraint on data.
o For example, key-value databases can be used to store:
• Configuration data
• Transient or intermediate data
• Item-attributes data
• BLOBs
DOCUMENT DATABASE

Definition
o Document databases are NoSQL databases that store semi-structured data in
the form of documents, encoded in standards such as JSON, XML, BSON, or
YAML.
o Semi-structured data means that the documents stored are similar to each other
(similar keys, attributes, fields).
o Each document stored in a document database has a collection of named fields
and their values. Each document is identified by a unique key or ID.
o Document databases can efficiently query documents based on the attribute
values in the documents, unlike key-value databases.
o Unlike relational databases, document databases don't provide join
functionality between documents.
NOSQL
(MONGODB)
MONGODB

Definition
o MongoDB is a document-oriented non-relational database system.
o MongoDB is a powerful, flexible, and highly scalable database designed for web
applications.
o The basic unit of data stored by MongoDB is a document. A document includes a
JSON-like set of key-value pairs.
o Documents can be grouped together to form collections.
o Collections do not have a fixed schema, and different documents in one
collection can have different sets of key-value pairs.
o Collections are organized into databases, and there can be multiple databases
running on a single MongoDB instance.
MONGODB

MongoDB Setup
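
The original slide showed setup screenshots. As a rough sketch, a standalone MongoDB server can be started and connected to as follows (the data directory path is an assumption and varies by platform):

    # Start a standalone MongoDB server, storing data under /data/db
    mongod --dbpath /data/db

    # In another terminal, open the MongoDB shell to interact with the server
    mongo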
MONGODB

MongoDB Examples

Pic 5.6. Using document database for storing product records


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
MONGODB

MongoDB Command Line Examples

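The original slides showed screenshots of a shell session. As a minimal sketch, the following shell commands create and query product documents like those in Pic 5.6 (the database and collection names, fields, and values are hypothetical):

    > use productsdb
    > db.products.insertOne({ name: 'Laptop', price: 850, tags: ['electronics'] })
    > db.products.insertOne({ name: 'Phone', price: 420, tags: ['electronics'] })
    > db.products.find({ price: { $gt: 500 } })        // query by attribute value
    > db.products.updateOne({ name: 'Phone' }, { $set: { price: 399 } })
    > db.products.deleteOne({ name: 'Laptop' })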
MONGODB

MongoDB with Python

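The original slides showed screenshots. As a minimal sketch, the same operations can be performed from Python with the pymongo library (the host, port, and database/collection names are assumptions):

    # Minimal sketch using pymongo; assumes MongoDB runs on localhost:27017.
    # The database and collection names are hypothetical.
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    db = client['productsdb']
    products = db['products']

    # Insert a document (a JSON-like set of key-value pairs)
    products.insert_one({'name': 'Laptop', 'price': 850, 'tags': ['electronics']})

    # Query documents by attribute value
    for doc in products.find({'price': {'$gt': 500}}):
        print(doc['name'], doc['price'])

    # Update and delete documents
    products.update_one({'name': 'Laptop'}, {'$set': {'price': 799}})
    products.delete_one({'name': 'Laptop'})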
COLUMN FAMILY DATABASE

Definition
o In a column family database, the basic unit of data storage is the column, which
has a name and a value.
o A collection of columns makes up a row, which is identified by a row-key.
o Columns are grouped together into column families.
o Unlike relational databases, column family databases do not have a fixed
schema or a fixed number of columns for each row.
o Column family databases store data in denormalized form so that all relevant
information related to an entity required by the applications can be retrieved by
reading a single row.
NOSQL
(HBASE)
HBASE DEFINITION

o HBase is a scalable, non-relational, distributed, column-family database that
provides structured data storage for large tables.
o HBase can store both structured and unstructured data.
o HBase has been designed to work with commodity hardware and is a highly
reliable and fault-tolerant system.
o HBase allows fast random reads and writes.
HBASE – DATA MODEL

Pic 5.7. HBase table structure


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.

Pic 5.8. HBase key-value format


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
HBASE – DATA MODEL (CONT)

o An HBase table consists of rows, which are indexed by row keys.


o Each row includes multiple column families.
o Each column family has multiple columns.
o Each column includes multiple cells or entries which are timestamped.
o HBase column families are declared at the time of creation of the table and
cannot be changed later.
HBASE FEATURES

o Sparse
• HBase has sparse tables as each row doesn’t need to have all columns. Only columns
which are populated in the row are stored.
o Distributed
• HBase tables are partitioned based on row keys into regions.
• Each region contains a range of row keys.
o Persistent
• HBase works on top of HDFS, and all data stored in HBase tables is persisted on HDFS.
o Multi-dimensional
• HBase stores data as key-value pairs where the keys are multi-dimensional. A key includes:
(Table, RowKey, ColumnFamily, Column, Timestamp)
• For each entry, multiple versions are stored, which are timestamped.
o Sorted Map
• HBase rows are sorted by the row key in lexicographic order.
• Columns in a column family are sorted by the column key.
HBASE ARCHITECTURE

Pic 5.9. HBase architecture


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
HBASE ARCHITECTURE (CONT)

o HBase has a distributed architecture.


o An HBase deployment comprises multiple region servers.
o Each region server has multiple regions.
o HBase has a master-slave architecture, with one of the nodes acting as the
master node (HMaster) and the other nodes acting as slave nodes.
o The HMaster is responsible for maintaining the HBase meta-data and
assignment of regions to region servers.
o HBase uses Zookeeper for distributed state coordination.
o HBase has two special tables, ROOT and META, to identify which region server is
responsible for serving a read/write request for a specific row key.
HBASE - DATA STORAGE &
OPERATIONS

o Each region server stores two types of files: the store file (HFile) and the
write-ahead log (HLog).
o An HFile contains a variable number of data blocks, plus fixed blocks for file
information and the trailer.
o Each data block contains a magic number and multiple key-value pairs.
o The default size of a data block is 64 KB.
o Write-Ahead log (WAL), known as HLog, records the writes.
o Each region server has a Memstore and Block Cache.
o The memstore stores the recent edits to the data in memory.
o Block cache caches the data blocks.
HBASE - DATA STORAGE &
OPERATIONS

o HBase supports the following operations:


• Get (operation to return the value for a given row key)
• Scan (operation to return values for a range of row keys)
• Put (operation to add a new entry)
• Delete (operation to add a special marker called a Tombstone to an entry;
entries marked with Tombstones are removed during the compaction
process)
o The storage structure used by HBase is a Log Structured Merge (LSM) Tree.
HBASE - READ AND WRITE PATH

o When reading, the clients first contact Zookeeper to get the location of the ROOT
table.
o The client then checks the ROOT table for the correct META table that contains
the row key, and obtains the region server name.
o The client then contacts the region server directly to complete the read
operation.
o When writing, all write requests are first logged into HLog sequentially.
o Then it is also written to the Memstore.
o The Memstore stores the most recent updates to enable fast lookup.
HBASE - COMPACTION

o Over time, the Memstore starts filling up as new updates are stored.
o Once it is filled up, it is flushed to disk, creating a new store file (HFile); as a
result, many store files accumulate on HDFS.
o Compaction is a process that merges small store files into a single file.
o The compaction process improves read efficiency, as a large number of small
files no longer need to be looked up.
o There are two types of compaction:
• Minor (merges store files when their number exceeds a threshold)
• Major (merges all store files into a single file)
HBASE – BLOOM FILTERS

o A Bloom filter is a space-efficient probabilistic data structure used to test
whether an element is a member of a set.
o A Bloom filter can return false positives (an element may be reported as present
when it is not), but never false negatives (an element reported as absent is
definitely absent).
o HBase can store a Bloom filter in each store file (HFile), built over the row keys
(or row keys and columns) that the file contains.
o On a read, HBase consults the Bloom filters to skip store files that definitely do
not contain the requested key, reducing disk lookups and improving read
performance (see the sketch below).
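
The following is a simplified, self-contained sketch of a Bloom filter to illustrate the idea; HBase's actual implementation differs (it builds per-HFile filters when the Memstore is flushed):

    # Simplified Bloom filter sketch; illustrative only, not HBase's implementation.
    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size

        def _positions(self, key):
            # Derive num_hashes bit positions from salted hashes of the key
            for i in range(self.num_hashes):
                digest = hashlib.md5(('%d:%s' % (i, key)).encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            # False means definitely absent; True means possibly present
            return all(self.bits[pos] for pos in self._positions(key))

    bf = BloomFilter()
    bf.add('row-42')
    print(bf.might_contain('row-42'))   # True
    print(bf.might_contain('row-99'))   # almost certainly False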
HBASE

Command Line Example
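The original slides showed screenshots of an HBase shell session. As a minimal sketch, a typical session looks like this (the table name, column family, and values are hypothetical):

    hbase> create 'products', 'cf'
    hbase> put 'products', 'row1', 'cf:name', 'Laptop'
    hbase> put 'products', 'row1', 'cf:price', '850'
    hbase> get 'products', 'row1'
    hbase> scan 'products'
    hbase> disable 'products'
    hbase> drop 'products'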

HBASE

Python Example
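The original slides showed screenshots. As a minimal sketch, HBase can be accessed from Python with the third-party happybase library, which talks to the HBase Thrift server (the host, table name, and column family are assumptions):

    # Minimal sketch using happybase (HBase Thrift client); assumes the HBase
    # Thrift server runs on localhost. Table and column family are hypothetical.
    import happybase

    connection = happybase.Connection('localhost')
    connection.create_table('products', {'cf': dict()})
    table = connection.table('products')

    # Put: add a new entry (keys and values are bytes)
    table.put(b'row1', {b'cf:name': b'Laptop', b'cf:price': b'850'})

    # Get: return the values for a given row key
    print(table.row(b'row1'))

    # Scan: return values for a range of row keys
    for key, data in table.scan(row_start=b'row0', row_stop=b'row9'):
        print(key, data)

    # Delete: mark the entry with a tombstone
    table.delete(b'row1')
    connection.close()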

HBASE

Pic 5.10. HBase web interface showing details of HBase Master and list of tables


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
NOSQL
(GRAPHDB)
GRAPHDB DEFINITION

o GraphDB is a NoSQL database designed for storing data that has a graph
structure with nodes and edges.
o GraphDB models the data in the form of nodes and relationships.
o Nodes represent entities in the data model and have a set of attributes.
o The relationships between entities are represented in the form of links between
nodes.
o A link also has a set of attributes and can be directed or undirected.
o GraphDB is suitable for applications in which the primary focus is on querying
for relationships between entities and analysing those relationships.
GRAPHDB - NEO4J

o Neo4j is one of the popular graph databases, and it provides support for the
Atomicity, Consistency, Isolation, Durability (ACID) properties.
GRAPHDB - NEO4J (CONT)

Pic 5.11. Labeled property graph example


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
GRAPHDB - NEO4J (CONT)
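
The original continuation slide showed an example screenshot. As a rough sketch, a labeled property graph like the one in Pic 5.11 could be created and queried from Python with the official neo4j driver (the bolt URI, credentials, and node/relationship names are assumptions):

    # Minimal sketch using the official neo4j Python driver; the URI and
    # credentials below are assumptions for a local Neo4j instance.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver('bolt://localhost:7687',
                                  auth=('neo4j', 'password'))

    with driver.session() as session:
        # Create two nodes and a directed relationship with an attribute
        session.run(
            "CREATE (a:Person {name: $a})-[:KNOWS {since: 2020}]->(b:Person {name: $b})",
            a='Alice', b='Bob')

        # Query for relationships between entities
        result = session.run(
            "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name AS a, b.name AS b")
        for record in result:
            print(record['a'], 'knows', record['b'])

    driver.close()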
COMPARISON OF NOSQL DATABASES

Pic 5.12. Comparison of NoSQL databases


Source: Big Data Science & Analytics: A Hands-On Approach, 2016.
Thank You...
SUMMARY
o HDFS is a distributed file system that runs on large clusters and provides high-throughput
access to data. HDFS provides scalable storage for large files which are broken into blocks.
The blocks are replicated to make the system reliable and fault-tolerant. The HDFS
Namenode stores the filesystem meta-data and is responsible for executing operations
such as opening and closing of files. The Secondary Namenode helps in the checkpointing
process by applying the updates in the edits file to the fsimage file which contains a
complete snapshot of the filesystem meta-data. Datanodes store the data blocks which are
replicated. The placement of replicas on the Datanodes is determined by a rack-aware
placement policy. We described examples of accessing HDFS using the command line tools,
a Python library for HDFS and the HDFS web interface.
o Non-relational databases or NoSQL databases are popular for applications in which the
scale of data involved is massive and the data may not be structured. Furthermore, real-
time performance is considered more important than consistency. In this chapter we
described four types of NoSQL databases.
SUMMARY
o The key-value databases store data in the form of key-value pairs, where the keys
uniquely identify the values stored.
applied to the key to determine where the value should be stored. Document store
databases store semi-structured data in the form of documents which are encoded in
different standards such as JSON, XML, BSON or YAML. The benefit of using document
databases over key-value databases is that these databases allow efficiently querying the
documents based on the attribute values in the documents. Column family databases store
data as columns where a column has a name and a value. Columns are grouped into
column families and a collection of columns make up a row which is identified by a row-key.
Column family databases support high-throughput reads and writes and have distributed
and highly available architectures. Graph databases model data in the form of nodes and
relationships. Nodes represent the entities in the data model and have a set of attributes.
The relationships between the entities are represented in the form of links between the
nodes.
REFERENCES

o Arshdeep Bahga & Vijay Madisetti. (2016). Big Data Science & Analytics: A Hands-On Approach.
1st ed. VPT. ISBN: 9781949978001. Chapters 4 and 6.
o Balusamy, Balamurugan, Abirami, Nandhini, Kadry, Seifedine, & Gandomi, Amir H. (2021). Big
Data: Concepts, Technology, and Architecture. 1st ed. Wiley. ISBN: 978-1-119-70182-8. Chapters
3 and 5.
o https://www.youtube.com/watch?v=GJYEsEEfjvk
o https://www.youtube.com/watch?v=0buKQHokLK8
