TOPIC 5
HADOOP STORAGE LAYER
LEARNING OUTCOMES
Students are able to describe big data architecture layer and processing concepts
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
o Hadoop Distributed File System (HDFS) is a distributed file system (DFS) that
runs on large clusters and provides high-throughput access to data.
o HDFS stores each file as a sequence of blocks.
o The blocks of each file are replicated on multiple machines in a cluster to
provide fault tolerance.
HDFS CHARACTERISTICS
o Data blocks are replicated on the Datanodes; by default, three replicas are
created.
o The placement of the three replicas is determined by a rack-aware placement
policy, which ensures the reliability and availability of the blocks:
• The first replica is placed on a node on the local rack (the writer's rack)
• The second replica is placed on a node on a different (remote) rack
• The third replica is placed on a different node on the same remote rack as
the second replica
o Placing two replicas on different nodes within the same rack also minimizes
network traffic between racks.
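The placement policy above can be sketched in a few lines of Python. This is a toy illustration only: node and rack names are invented, and real HDFS implements this inside its NetworkTopology machinery, not like this.

```python
# Toy sketch of HDFS's default rack-aware replica placement (replication
# factor 3). All node/rack names are made-up examples.
def place_replicas(writer_node, writer_rack, remote_rack_nodes, remote_rack):
    """Return three (node, rack) placements following the default policy."""
    # First replica: on the writer's local node.
    first = (writer_node, writer_rack)
    # Second replica: on a node in a different (remote) rack.
    second = (remote_rack_nodes[0], remote_rack)
    # Third replica: on a *different* node in the same remote rack as the
    # second replica, which keeps inter-rack traffic low.
    third = (remote_rack_nodes[1], remote_rack)
    return [first, second, third]

placements = place_replicas("dn1", "rack1", ["dn5", "dn6"], "rack2")
print(placements)
```

Note that two of the three replicas share a rack: a write only has to cross the rack boundary once, yet the block survives the loss of an entire rack.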
HDFS READ PATH
o The read process begins with the client sending a request to the Namenode to
obtain the locations of the data blocks of a file.
o The Namenode checks whether the file exists and whether the client has
sufficient permission.
o The Namenode responds with the data block locations sorted by distance to the
client, to reduce network traffic.
o During the read process, if a replica becomes unavailable, the client can read
another replica on a different Datanode.
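The read path above can be simulated in plain Python. This is not real HDFS client code (datanode names and distances are invented); it only illustrates reading from the nearest replica and falling back when one is unavailable.

```python
# Toy sketch of the HDFS read path: the Namenode hands back block locations
# sorted by distance, and the client falls back to the next replica if the
# nearest one is unavailable.
def read_block(replicas, unavailable):
    """replicas: list of (datanode, distance) pairs from the Namenode."""
    for datanode, _distance in sorted(replicas, key=lambda r: r[1]):
        if datanode not in unavailable:
            return f"read from {datanode}"
    raise IOError("all replicas unavailable")

replicas = [("dn3", 0), ("dn7", 2), ("dn9", 4)]
print(read_block(replicas, unavailable={"dn3"}))  # nearest is down: use dn7
```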
HDFS WRITE PATH
o The write process begins with the client sending a request to the Namenode to
create a new file in the filesystem namespace.
o The Namenode checks whether the client has sufficient permission and whether
the file already exists in the filesystem.
o The Namenode responds by sending an output stream object.
o The client writes the data to the output stream object, which splits the data
into packets.
o The data packets are consumed from the data queue in a separate thread, which
requests the Namenode to allocate new blocks on the Datanodes.
o The Namenode responds with the locations of the new blocks.
o The client then establishes direct connections to the Datanodes on which the
data are to be replicated, forming a replication pipeline.
HDFS WRITE PATH (CONT)
o The data packets consumed from the data queue are written to the first
Datanode in the pipeline, which writes the data to the second Datanode, and
so on.
o Once the packets are successfully written, each Datanode in the pipeline
sends an acknowledgement.
o The client keeps track of which data packets have been acknowledged by the
Datanodes.
o This writing process continues until the block size is reached. Upon reaching
it, the client requests the Namenode to allocate a new set of blocks on the
Datanodes.
o The client then streams the packets to the Datanodes.
o The process repeats until all data packets are written and acknowledged.
Finally, the client closes the output stream and sends a request to the
Namenode to close the file.
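The write path above can be condensed into a toy simulation: packets stream into a replicated block, and a new block (with a new pipeline) is requested whenever the block fills up. Block size and node names are invented; real HDFS does this over TCP streams with much larger blocks.

```python
# Toy simulation of the HDFS write path: packets fill a block replicated on
# a Datanode pipeline; when the block is full, a new block and pipeline are
# requested from the (simulated) Namenode.
BLOCK_SIZE = 4  # packets per block -- tiny, for illustration only

def write_file(packets, pipelines):
    """pipelines: iterator of Datanode pipelines handed out per block."""
    blocks, current, pipeline = [], [], next(pipelines)
    for packet in packets:
        current.append((packet, tuple(pipeline)))    # replicate on pipeline
        if len(current) == BLOCK_SIZE:               # block size reached:
            blocks.append(current)                   # close this block and
            current, pipeline = [], next(pipelines)  # request a new one
    if current:
        blocks.append(current)                       # final partial block
    return blocks

pipelines = iter([["dn1", "dn5", "dn6"], ["dn2", "dn4", "dn8"]])
blocks = write_file([f"pkt{i}" for i in range(6)], pipelines)
print(len(blocks))  # 6 packets -> one full block of 4, one block of 2
```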
HDFS: COMMAND LINE EXAMPLES
HDFS: PYTHON EXAMPLES
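A hedged sketch of HDFS access from Python using the third-party `hdfs` package (a WebHDFS client, `pip install hdfs`). The Namenode host, user name, and paths below are assumptions for a typical cluster; adjust them for yours. The import is kept inside the function so the sketch loads even where the package is not installed.

```python
# Sketch of HDFS file operations via WebHDFS with the `hdfs` package.
# Host, user, and paths are assumptions -- change them for your cluster.
def webhdfs_url(host, port=9870):
    """Build the WebHDFS endpoint URL (9870 is the usual Hadoop 3 port)."""
    return f"http://{host}:{port}"

def demo(host="namenode"):
    from hdfs import InsecureClient  # pip install hdfs

    client = InsecureClient(webhdfs_url(host), user="hadoop")
    client.makedirs("/user/hadoop/demo")             # mkdir -p
    client.write("/user/hadoop/demo/hello.txt",      # upload some bytes
                 data=b"hello hdfs")
    with client.read("/user/hadoop/demo/hello.txt") as reader:
        print(reader.read())                         # read the file back
    print(client.list("/user/hadoop/demo"))          # directory listing

print(webhdfs_url("namenode"))
```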
HDFS: WEB INTERFACE EXAMPLES
HDFS ADVANTAGES
o Cost-Effective
HDFS is an open-source storage platform and does not require high-end
hardware; it runs on clusters of commodity machines.
o Distributed Storage
HDFS splits input files into blocks, each 64 MB by default (128 MB in Hadoop 2
and later), and stores the blocks across the cluster. A file smaller than a
block occupies only its actual size on disk; blocks are not shared between
files.
o Data Replication
HDFS by default makes three copies of every data block and stores them on dif-
ferent nodes in the cluster. If any node crashes, the blocks it held are
re-replicated from the surviving copies.
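The block-splitting idea behind distributed storage is simple enough to show directly. The block size here is a few bytes purely so the output is easy to inspect; HDFS defaults are 64 MB or 128 MB.

```python
# Toy illustration of splitting a file into fixed-size blocks before
# distributing them. A tiny block size is used for readability.
BLOCK_SIZE = 10  # bytes, for illustration only

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 25)
print([len(b) for b in blocks])  # -> [10, 10, 5]: last block is smaller
```

The last block shows the point made above: a block that is not full only occupies its actual size.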
NOSQL
NOSQL DEFINITION
o NoSQL databases are classified based on the data storage model or the type of
records they can store, namely:
• Key-value Database
• Document Database
• Column Family / Store Database
• Graph Database
KEY-VALUE DATABASE
Definition
o Key-value databases are a type of NoSQL database that stores data in the form
of key-value pairs. The keys uniquely identify the values stored in the
database.
o Most key-value databases have distributed architectures comprising multiple
storage nodes. The data is partitioned across the storage nodes by key.
o Key-value databases use a hash function to determine the partition for each
key.
o Key-value databases support many types of values, such as strings, integers,
floats, binary large objects (BLOBs), etc.
o Key-value databases don't have fixed schemas or constraints, unlike relational
databases.
KEY-VALUE DATABASE
Definition (cont.)
o Unlike relational databases, which provide a specialized query language (SQL),
key-value databases provide only basic querying and searching capabilities.
o Key-value databases are suited to applications in which storing and retrieving
data quickly and efficiently matters more than imposing structure or
constraints on the data.
o For example, key-value databases can be used to store:
• Configuration data
• Transient or intermediate data
• Item-attributes data
• BLOBs
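The points above can be tied together in a toy key-value store with hash-based partitioning. Node names are invented and the whole thing lives in one process; real systems (Riak, DynamoDB, Redis Cluster and the like) use variants of consistent hashing across machines.

```python
# Toy key-value store: a hash of the key picks the storage node, values can
# be arbitrary objects, and there is no schema.
import hashlib

class TinyKVStore:
    def __init__(self, nodes):
        self.nodes = nodes
        self.partitions = {node: {} for node in nodes}

    def _node_for(self, key):
        # Hash the key and map it onto one of the nodes (deterministic).
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self.partitions[self._node_for(key)][key] = value

    def get(self, key):
        return self.partitions[self._node_for(key)].get(key)

store = TinyKVStore(["node-a", "node-b", "node-c"])
store.put("session:42", {"user": "alice", "ttl": 300})  # transient data
print(store.get("session:42"))
```

Note what is missing: there is no way to ask "which sessions belong to alice?" without scanning everything; lookups go by key only, which is exactly the trade-off described above.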
DOCUMENT DATABASE
Definition
o A document database is a NoSQL database that stores semi-structured data in
the form of documents encoded in standards such as JSON, XML, BSON, or YAML.
o Semi-structured means that the documents stored are similar to each other
(similar keys, attributes, fields) but need not be identical.
o Each document stored in a document database has a collection of named fields
and their values. Each document is identified by a unique key or ID.
o Unlike key-value databases, document databases can efficiently query documents
based on the attribute values inside the documents.
o Unlike relational databases, document databases don't provide join
functionality between documents.
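The contrast with key-value stores can be made concrete in a few lines: below, documents can be matched on any field, not just their ID. Documents and field names are invented for illustration.

```python
# Toy document store: documents are similar but not identical, and queries
# can filter on any attribute value, not only the document ID.
documents = {
    "doc1": {"name": "alice", "city": "Jakarta", "age": 30},
    "doc2": {"name": "bob", "city": "Bandung"},       # no "age" field --
    "doc3": {"name": "carol", "city": "Jakarta"},     # semi-structured data
}

def find(filter_):
    """Return IDs of documents matching every (field, value) pair."""
    return [doc_id for doc_id, doc in documents.items()
            if all(doc.get(k) == v for k, v in filter_.items())]

print(find({"city": "Jakarta"}))  # -> ['doc1', 'doc3']
```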
NOSQL
(MONGODB)
MONGODB
Definition
o MongoDB is a document-oriented non-relational database system.
o MongoDB is a powerful, flexible, and highly scalable database designed for web
applications.
o The basic unit of data stored by MongoDB is a document. A document includes a
JSON-like set of key-value pairs.
o The documents can be grouped together to form collections.
o Collections do not have a fixed schema and different documents in one
collection can have different sets of key-value pairs.
o Collections are organized into databases, and there can be multiple databases
running on a single MongoDB instance.
MONGODB
MongoDB Setup
MONGODB
MongoDB Examples
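A hedged sketch of basic MongoDB usage with the official `pymongo` driver (`pip install pymongo`). It assumes a `mongod` instance on `localhost:27017`; the database, collection, and document fields are made up. The driver import sits inside the function so the sketch loads even without pymongo installed.

```python
# Sketch of MongoDB basics with pymongo: documents in collections,
# collections in databases, no fixed schema per collection.
def build_user_doc(name, age, interests):
    """A JSON-like document; MongoDB adds a unique _id on insertion."""
    return {"name": name, "age": age, "interests": interests}

def demo():
    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("localhost", 27017)
    db = client["testdb"]     # one MongoDB instance holds many databases
    users = db["users"]       # a database holds collections of documents
    users.insert_one(build_user_doc("alice", 30, ["hiking"]))
    # Documents in the same collection need not share the same keys:
    users.insert_one({"name": "bob", "title": "engineer"})
    return users.find_one({"name": "alice"})  # query by attribute value

print(build_user_doc("alice", 30, ["hiking"]))
```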
COLUMN FAMILY DATABASE
Definition
o In a column family database, the basic unit of data storage is the column,
which has a name and a value.
o A collection of columns makes up a row, which is identified by a row key.
o Columns are grouped together into column families.
o Unlike relational databases, column family databases do not have a fixed
schema or a fixed number of columns per row.
o Column family databases store data in denormalized form, so that all the
information related to an entity that an application needs can be retrieved
by reading a single row.
NOSQL
(HBASE)
HBASE DEFINITION
o Sparse
• HBase has sparse tables as each row doesn’t need to have all columns. Only columns
which are populated in the row are stored.
o Distributed
• HBase tables are partitioned based on row keys into regions.
• Each region contains a range of row keys.
o Persistent
• HBase works on top of HDFS, and all data stored in HBase tables is persisted on HDFS.
o Multi-dimensional
• HBase stores data as key-value pairs where the keys are multi-dimensional. A key includes:
(Table, RowKey, ColumnFamily, Column, Timestamp)
• For each entry, multiple versions are stored, which are timestamped.
o Sorted Map
• HBase rows are sorted by the row key in lexicographic order.
• Columns in a column family are sorted by the column key.
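The "sorted multi-dimensional map" idea can be modeled with an ordinary dictionary whose keys are tuples. Row keys, families, and values below are invented; the sorting and timestamped-version behavior is what matters.

```python
# Toy model of HBase's sorted map: each cell's key is
# (row key, column family, column, timestamp), and rows/columns sort
# lexicographically, with newer versions of a cell listed first.
cells = {
    ("row2", "cf1", "name", 2): "bob",
    ("row1", "cf1", "name", 2): "alice-v2",   # two timestamped versions
    ("row1", "cf1", "name", 1): "alice-v1",   # of the same cell
    ("row1", "cf1", "city", 5): "Jakarta",
}

# Sort by row key, then column, then newest timestamp first (like HBase).
ordered = sorted(cells, key=lambda k: (k[0], k[1], k[2], -k[3]))
print(ordered[0])

def latest(row, family, column):
    """Return the most recent version of a cell."""
    versions = [k for k in cells if k[:3] == (row, family, column)]
    return cells[max(versions, key=lambda k: k[3])]

print(latest("row1", "cf1", "name"))
```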
HBASE ARCHITECTURE
o Each region server stores two types of files: store files (HFiles) and a
write-ahead log (HLog).
o An HFile contains a variable number of data blocks plus fixed blocks for file
information and the trailer.
o Each data block contains a magic number and multiple key-value pairs.
o The default size of a data block is 64 KB.
o The Write-Ahead Log (WAL), known in HBase as the HLog, records all writes.
o Each region server has a Memstore and Block Cache.
o The memstore stores the recent edits to the data in memory.
o Block cache caches the data blocks.
HBASE - DATA STORAGE & OPERATIONS
o When reading, the client first contacts Zookeeper to get the location of the
ROOT table.
o The client then checks the ROOT table for the META table region that contains
the row key, and obtains the serving region server's name.
o The client then contacts that region server directly to complete the read
operation.
o When writing, all write requests are first logged sequentially into the HLog.
o The write is then also applied to the Memstore.
o The Memstore stores the most recent updates to enable fast lookup.
HBASE - COMPACTION
o Over time, the Memstore fills up as new updates are stored.
o Once it is full, it is flushed to disk, creating a new store file (HFile);
over time this leaves many store files on HDFS.
o Compaction is a process that merges the small store files into a single file.
o Compaction improves read efficiency, as a large number of small files no
longer need to be looked up.
o There are two types of compaction:
• Minor (merging some store files when their number exceeds a threshold)
• Major (merging all store files into a single file)
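Since each store file is already sorted by key, compaction is essentially a k-way merge. The sketch below uses the standard library's `heapq.merge` on invented row data; real HBase compaction also drops deleted and expired cells, which is omitted here.

```python
# Toy sketch of compaction: several small sorted store files are merged
# into one sorted file, so reads no longer have to consult each small file.
import heapq

store_files = [                       # each "file" is sorted by row key
    [("row1", "a"), ("row4", "d")],
    [("row2", "b"), ("row5", "e")],
    [("row3", "c")],
]

# heapq.merge performs an efficient k-way merge of already-sorted inputs.
compacted = list(heapq.merge(*store_files, key=lambda kv: kv[0]))
print([k for k, _ in compacted])  # -> ['row1', 'row2', 'row3', 'row4', 'row5']
```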
HBASE – BLOOM FILTERS
o A Bloom filter is a space-efficient probabilistic data structure that can tell
whether an element is definitely not in a set, or possibly in it.
o HBase can store a Bloom filter (on row keys, or on row key + column) in each
store file (HFile).
o On a read, the region server consults the Bloom filters to skip store files
that cannot contain the requested row, reducing disk lookups.
o Bloom filters may return false positives but never false negatives, so no
matching data is ever missed.
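A minimal Bloom filter can be sketched in a few lines. The bit-array size and number of hash functions below are arbitrary choices for illustration; production filters size these from the expected element count and target false-positive rate.

```python
# Minimal Bloom filter sketch: membership tests with possible false
# positives but no false negatives -- the property HBase relies on to
# safely skip store files.
import hashlib

class BloomFilter:
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions from salted, deterministic hashes.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row17")
print(bf.might_contain("row17"))  # -> True (never a false negative)
```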
HBASE
Python Example
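A hedged sketch of HBase access from Python using the third-party `happybase` package (`pip install happybase`), which talks to HBase's Thrift server. The host, table, column family, and row keys are assumptions for your cluster; the import sits inside the function so the sketch loads without the package installed.

```python
# Sketch of HBase operations via happybase. Cells are addressed as
# b"columnfamily:column" byte strings; table/host names are assumptions.
def make_row(name, city):
    """Build the cell dict for one row, in happybase's expected format."""
    return {b"info:name": name.encode(), b"info:city": city.encode()}

def demo(host="hbase-thrift"):
    import happybase  # pip install happybase

    connection = happybase.Connection(host)   # Thrift server, port 9090
    table = connection.table("users")
    table.put(b"row1", make_row("alice", "Jakarta"))
    print(table.row(b"row1"))                 # fetch a single row by key
    for key, data in table.scan(row_prefix=b"row"):
        print(key, data)                      # scan a sorted row-key range

print(make_row("alice", "Jakarta"))
```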
GRAPH DATABASE
o A graph database (GraphDB) is a NoSQL database designed for storing data that
has a graph structure, with nodes and edges.
o GraphDB models the data in the form of nodes and relationships.
o Nodes represent entities in the data model and have a set of attributes.
o The relationships between entities are represented as links between nodes.
o Links also have a set of attributes and can be directed or undirected.
o GraphDB is suitable for applications in which the primary focus is querying
for relationships between entities and analysing those relationships.
GRAPHDB - NEO4J
o Neo4j is one of the most popular graph databases and provides full support for
Atomicity, Consistency, Isolation, Durability (ACID).
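A hedged sketch of Neo4j usage from Python with the official `neo4j` driver (`pip install neo4j`). The bolt URI, credentials, and the `Person`/`KNOWS` graph model are assumptions for illustration; the import is inside the function so the sketch loads without the driver installed.

```python
# Sketch of Neo4j basics: nodes with attributes, directed relationships,
# and a Cypher query over those relationships.
def knows_query():
    """Cypher: who does a given person know, via directed KNOWS edges?"""
    return ("MATCH (p:Person {name: $name})-[:KNOWS]->(friend) "
            "RETURN friend.name AS friend")

def demo():
    from neo4j import GraphDatabase  # pip install neo4j

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    with driver.session() as session:
        # Create two nodes and a directed relationship between them.
        session.run("CREATE (:Person {name: 'alice'})-[:KNOWS]->"
                    "(:Person {name: 'bob'})")
        # Query for relationships -- the primary strength of a GraphDB.
        for record in session.run(knows_query(), name="alice"):
            print(record["friend"])
    driver.close()

print(knows_query())
```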
GRAPHDB - NEO4J (CONT)
REFERENCES
o Arshdeep Bahga & Vijay Madisetti. (2016). Big Data Science & Analytics: A Hands-On Approach.
1st Ed. VPT, India. ISBN: 9781949978001. Chapters 4 and 6.
o Balamurugan Balusamy, Nandhini Abirami, Seifedine Kadry, & Amir H. Gandomi. (2021). Big
Data: Concepts, Technology, and Architecture. 1st Ed. Wiley. ISBN: 978-1-119-70182-8. Chapters 3
and 5.
o https://www.youtube.com/watch?v=GJYEsEEfjvk
o https://www.youtube.com/watch?v=0buKQHokLK8