
Hadoop and HDFS

Data Science Foundations

© Copyright IBM Corporation 2019


Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit 4 Hadoop and HDFS


Unit objectives
• Understand the basic need for a big data strategy in terms of parallel
reading of large data files and internode network speed in a cluster
• Describe the nature of the Hadoop Distributed File System (HDFS)
• Explain the function of the NameNode and DataNodes in a Hadoop
cluster
• Explain how files are stored and blocks ("splits") are replicated


The importance of Hadoop


• "We believe that more than half of the world's data will be stored in
Apache Hadoop within five years" - Hortonworks

There has been much buzz about Hadoop and big data in the marketplace over the
past year, with Wikibon predicting a $50B marketplace by 2017. Hortonworks says on
their web site that they believe that half of the world's data will be stored in Hadoop
within the next 5 years. With so much at stake, there are a lot of vendors looking for
some of the action and the big database vendors are not standing still.


Hardware improvements through the years


• CPU speeds:
  ▪ 1990: 44 MIPS at 40 MHz
  ▪ 2010: 147,600 MIPS at 3.3 GHz
• RAM memory:
  ▪ 1990: 640 KB conventional memory (256 KB extended memory recommended)
  ▪ 2010: 8-32 GB (and more)
• Disk capacity:
  ▪ 1990: 20 MB
  ▪ 2010: 1 TB
• Disk latency (speed of reads and writes): not much improvement in the last
  7-10 years, currently around 70-80 MB/sec

How long does it take to read 1 TB of data (at 80 MB/sec)?
  1 disk     - 3.4 hrs
  10 disks   - 20 min
  100 disks  - 2 min
  1000 disks - 12 sec

Before exploring Hadoop, you will review why Hadoop technology is so important.
Moore's law has held true for a long time, but no matter how many more transistors are
added to CPUs, and how powerful they become, disk latency is where the bottleneck is.
The figures above make the point: scaling up (more powerful computers with more
powerful CPUs) is not the answer to all problems, since disk latency is the main issue.
Scaling out (a cluster of computers) is a better approach.
One definition of a commodity server is "a piece of fairly standard hardware that can be
purchased at retail, to have any particular software installed on it".
A typical Hadoop cluster node on Intel hardware in 2015:
• two quad core CPUs
• 12 GB to 24 GB memory
• four to six disk drives of 2 terabyte (TB) capacity
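The read-time figures on the slide follow from simple arithmetic. The sketch below (an illustration only, not part of the course material) reproduces them, assuming 1 TB is treated as 1,000,000 MB and each disk sustains 80 MB/sec with perfectly parallel reads and no overhead:

#!/bin/bash
# Idealized read time for 1 TB at 80 MB/sec per disk, reading in parallel.
TB_MB=1000000        # 1 TB expressed in MB (decimal, as on the slide)
RATE=80              # sustained MB/sec per disk
for DISKS in 1 10 100 1000; do
  SECS=$((TB_MB / (RATE * DISKS)))
  echo "$DISKS disk(s): ~$SECS seconds"
done
# Prints roughly 12500 s (3.4 hrs), 1250 s (20 min), 125 s (2 min), and 12 s.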

Incidentally, the second slowest component in a cluster of computers is the inter-node
network connection. You will find that this leads to the importance of where to place the
data that is to be read.
This topic is well discussed on the internet. As an example, try doing a Google search
of best hardware configuration for hadoop. You should note that the hardware
required for management nodes is considerably higher. Here you are primarily
presented with the hardware requirements for worker nodes, aka DataNodes.
Reference:
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/cluster-planning/content/server-node.html


What hardware is not used for Hadoop


• Redundant Array of Independent Disks (RAID)
▪ Not suitable for the data in a large cluster
▪ Wanted: Just-a-Bunch-Of-Disks (JBOD)
• Linux Logical Volume Manager (LVM)
▪ HDFS and GPFS are already abstractions that run on top of disk filesystems
- LVM is an abstraction that can obscure the real disk
• Solid-State Disk (SSD)
▪ Low latency of SSD is not useful for streaming file data
▪ Low storage capacity for high cost (currently not commodity hardware)

RAID is often used on Master Nodes (but never DataNodes)
as part of fault tolerance mechanisms

There is the possibility of software RAID in future versions of Hadoop:
• https://wiki.apache.org/hadoop/HDFS-RAID
Discussion of why not to use RAID:
• https://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes
• https://hortonworks.com/blog/proper-care-and-feeding-of-drives-in-a-hadoop-cluster-a-conversation-with-stackiqs-dr-bruno


Parallel data processing is the answer


• There has been:
  ▪ GRID computing: spreads processing load
  ▪ distributed workload: hard to manage applications, overhead on developer
  ▪ parallel databases: Db2 DPF, Teradata, Netezza, etc. (distribute the data)
• Challenges:
  ▪ heterogeneity
  ▪ openness
  ▪ security
  ▪ scalability
  ▪ concurrency
  ▪ fault tolerance
  ▪ transparency

Distributed computing: multiple computers appear as one supercomputer,
communicate with each other by message passing, and operate together to
achieve a common goal.


Further comments:
• With GRID computing, the idea is to increase the processing power of multiple
computers and bring the data to wherever processing capacity is available.
• Distributed workload means bringing the processing to the data. This is great, but
writing applications that do this is very difficult. You need to worry about how to
exchange data between the different nodes, think about what happens if one of
these nodes goes down, decide where to put the data and to which node to transmit
it, and figure out, when all nodes finish, how to send the results to some central
place. All of this is very challenging. You could spend more time coding the data
passing than dealing with the problem itself.
• Parallel databases are in use today, but they are generally for structured
relational data.
• Hadoop is the answer to parallel data processing without the issues of GRID
computing, distributed workload, or parallel databases.


What is Hadoop?
• Apache open source software framework for reliable, scalable,
distributed computing over massive amounts of data:
▪ hides underlying system details and complexities from user
▪ developed in Java
• Consists of 4 sub projects:
▪ MapReduce
▪ Hadoop Distributed File System (HDFS)
▪ YARN
▪ Hadoop Common

• Supported by many Apache/Hadoop-related projects:


▪ HBase, ZooKeeper, Avro, etc.

• Meant for heterogeneous commodity hardware.

Hadoop is an open source project of the Apache Foundation.
It is a framework written in Java originally developed by Doug Cutting who named it
after his son's toy elephant.
Hadoop uses Google's MapReduce and Google File System (GFS) technologies as its
foundation.
It is optimized to handle massive amounts of data which could be structured,
unstructured or semi-structured, and uses commodity hardware (relatively inexpensive
computers).
This massive parallel processing is done with great performance. However, it is a batch
operation handling massive amounts of data, so the response time is not immediate. As
of Hadoop version 0.20.2, updates are not possible, but appends were made possible
starting in version 0.21.
What is the value of a system if the information it stores or retrieves is not consistent?
Hadoop replicates its data across different computers, so that if one goes down, the
data is processed on one of the replicated computers.
You may be familiar with OLTP (Online Transactional processing) workloads where
data is randomly accessed on structured data like a relational database, such as when
you access your bank account.

You may also be familiar with OLAP (Online Analytical Processing) or DSS (Decision
Support Systems) workloads, where data is sequentially accessed on structured data,
like a relational database, to generate reports that provide business intelligence.
You may not be that familiar with the concept of big data. Big data is a term used to
describe large collections of data (also known as datasets) that may be unstructured
and grow so large and so quickly that they are difficult to manage with regular database
or statistics tools.
Hadoop is not used for OLTP or OLAP, but is used for big data, and it complements
these two to manage data. Hadoop is not a replacement for an RDBMS.


Hadoop open source projects


• Hadoop is supplemented by an extensive ecosystem of open source
projects.

The Open Data Platform (www.opendataplatform.org):
• The ODP Core will initially focus on Apache Hadoop (inclusive of HDFS, YARN,
and MapReduce) and Apache Ambari. Once the ODP members and processes
are well established, the scope of the ODP Core may expand to include other
open source projects.
Apache Ambari aims to make Hadoop management simpler by developing software for
provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an
intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Reference summary from http://hadoop.apache.org:
The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect and handle failures at the application
layer, so delivering a highly-available service on top of a cluster of computers, each of
which may be prone to failures.

The project includes these modules:


• Hadoop Common: The common utilities that support the other Hadoop
modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop
clusters. It also provides a dashboard for viewing cluster health and the ability to
view MapReduce, Pig, and Hive applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data
storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and
ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel
computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a
simple and expressive programming model that supports a wide range of
applications, including ETL, machine learning, stream processing, and graph
computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop
YARN, which provides a powerful and flexible engine to execute an arbitrary
DAG of tasks to process data for both batch and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed
applications.


Advantages and disadvantages of Hadoop


• Hadoop is good for:
▪ processing massive amounts of data through parallelism
▪ handling a variety of data (structured, unstructured, semi-structured)
▪ using inexpensive commodity hardware
• Hadoop is not good for:
▪ processing transactions (random access)
▪ when work cannot be parallelized
▪ low latency data access
▪ processing lots of small files
▪ intensive calculations with small amounts of data

Hadoop cannot resolve all data-related problems; it is designed to handle big data.
Hadoop works better when handling one single huge file rather than many small
files. Hadoop complements existing RDBMS technology.
Wikipedia: Low latency allows human-unnoticeable delays between an input being
processed and the corresponding output providing real time characteristics. This can be
especially important for internet connections utilizing services such as online gaming
and VOIP.
Hadoop is not good for low latency data access. In practice you can replace latency
with delay. So low latency means negligible delay in processing. Low latency data
access means negligible delay accessing data. Hadoop is not designed for low latency.
Hadoop works best with very large files. The larger the file, the less time Hadoop
spends seeking for the next data location on disk and the more time Hadoop runs at the
limit of the bandwidth of your disks. Seeks are generally expensive operations that are
useful when you only need to analyze a small subset of your dataset. Since Hadoop is
designed to run over your entire dataset, it is best to minimize seeks by using large
files.
Hadoop is good for applications requiring high throughput of data: Clustered machines
can read data in parallel for high throughput.


Timeline for Hadoop

Review this slide for the history of Hadoop/MapReduce technology.
Google published information on GFS (the Google File System): Ghemawat, S.,
Gobioff, H., & Leung, S.-T. (2003). The Google File System. Retrieved from
http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf
The original MapReduce paper: Dean, J., & Ghemawat, S. (2004). MapReduce:
Simplified data processing on large clusters. Retrieved from
http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
Doug Cutting was developing Lucene (Text index for documents, subproject of Nutch),
and Nutch (Web index for documents). He could not get Nutch to scale. Then he saw
Google's papers on GFS and MapReduce and created Hadoop (open source version of
the info from those papers). Hadoop allowed Nutch to scale.

Timeline:
• 2003: The Nutch project (Doug Cutting) aims to handle billions of searches and
index millions of web pages.
• Oct 2003: Google releases papers with GFS (Google File System)
• Dec 2004: Google releases papers with MapReduce
• 2005: Nutch used GFS and MapReduce to perform operations
• 2006: Yahoo! created Hadoop based on GFS and MapReduce (with Doug
Cutting and team)
• 2007: Yahoo started using Hadoop on a 1000 node cluster
• Jan 2008: Apache took over Hadoop
• Jul 2008: Tested a 4000-node cluster with Hadoop successfully
• 2009: Hadoop successfully sorted a petabyte of data in less than 17 hours.
• Dec 2011: Hadoop releases version 1.0
For later releases, and the release numbering structure, refer to:
https://wiki.apache.org/hadoop/Roadmap


Hadoop: Major components


• Hadoop Distributed File System (HDFS)
▪ where Hadoop stores data
▪ a file system that spans all the nodes in a Hadoop cluster
▪ links together the file systems on many local nodes to make them into one
large file system that spans all the data nodes of the cluster
• MapReduce framework
  ▪ how Hadoop understands and assigns work to the nodes (machines)
  ▪ evolving: MR v1, MR v2, etc.
  ▪ morphed into YARN and other processing frameworks

There are two aspects of Hadoop that are important to understand:
1. MapReduce is a software framework introduced by Google to support distributed
computing on large data sets on clusters of computers.
2. The Hadoop Distributed File System (HDFS) is where Hadoop stores its data.
This file system spans all the nodes in a cluster. Effectively, HDFS links together
the data that resides on many local nodes, making the data part of one big file
system. You can use other file systems with Hadoop, but HDFS is quite common.


Brief introduction to HDFS and MapReduce


• Driving principles
▪ data is stored across the entire cluster
▪ programs are brought to the data, not the data to the program
• Data is stored across the entire cluster (the DFS)
▪ the entire cluster participates in the file system
▪ blocks of a single file are distributed across the cluster
▪ a given block is typically replicated as well for resiliency
[Diagram: a logical file divided into blocks 1-4; each block is replicated to three
different nodes across the cluster]
The driving principle of MapReduce is a simple one: spread the data out across a huge
cluster of machines and then, rather than bringing the data to your programs as you do
in traditional programming, write your program in a specific way that allows the
program to be moved to the data. Thus, the entire cluster is made available for both
reading the data and processing the data.
The Distributed File System (DFS) is at the heart of MapReduce. It is responsible for
spreading data across the cluster, by making the entire cluster look like one giant file
system. When a file is written to the cluster, blocks of the file are spread out and
replicated across the whole cluster (in the diagram, notice that every block of the file is
replicated to three different machines).
Adding more nodes to the cluster instantly adds capacity to the file system and
automatically increases the available processing power and parallelism.


Hadoop Distributed File System (HDFS) principles


• Distributed, scalable, fault tolerant, high throughput
• Data access through MapReduce
• Files split into blocks (aka splits)
• 3 replicas for each piece of data by default
• Can create, delete, and copy, but cannot update
• Designed for streaming reads, not random access
• Data locality is an important concept: processing data on or near the
physical storage to decrease transmission of data

Listed above are the important principles behind the Hadoop Distributed File System (HDFS).
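As a quick, hedged illustration of the create/copy/delete-but-not-update model (the paths and file names here are hypothetical, and -appendToFile assumes append support, available since Hadoop 0.21):

hdfs dfs -mkdir -p /user/virtuser/demo
hdfs dfs -put localfile.txt /user/virtuser/demo/                    # create
hdfs dfs -cp /user/virtuser/demo/localfile.txt /tmp/localfile.txt   # copy
hdfs dfs -appendToFile more.txt /user/virtuser/demo/localfile.txt   # append (there is no in-place update)
hdfs dfs -rm /tmp/localfile.txt                                     # delete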


HDFS architecture
• Master/Slave architecture
• Master: NameNode
  ▪ manages the file system namespace and metadata
    ― FsImage
    ― Edits Log
  ▪ regulates client access to files
• Slave: DataNode
  ▪ many per cluster
  ▪ manages storage attached to the nodes
  ▪ periodically reports status to NameNode

[Diagram: a file (File1) made up of blocks a, b, c, d tracked by the NameNode;
the blocks are spread and replicated across several DataNodes]
Important points still to be discussed:
• Secondary NameNode
• Import checkpoint
• Rebalancing
• SafeMode
• Recovery Mode
The entire file system namespace, including the mapping of blocks to files and file
system properties, is stored in a file called the FsImage. The FsImage is stored as a file
in the NameNode's local file system. It contains the metadata on disk (not exact copy of
what is in RAM, but a checkpoint copy).
The NameNode uses a transaction log called the EditLog (or edits log) to persistently
record every change that occurs to file system metadata, and synchronizes it with the
metadata in RAM after each write.
The NameNode can be a potential single point of failure (this has been resolved in later
releases of HDFS with Secondary NameNode, various forms of high availability, and in
Hadoop v2 with NameNode federation and high availability as out-of-the-box options).

• Use better quality hardware for all management nodes, and in particular do not
use inexpensive commodity hardware for the NameNode.
• Mitigate by backing up to other storage.
In case of a power failure on the NameNode, recovery is performed using the FsImage
and the EditLog.


HDFS blocks
• HDFS is designed to support very large files
• Each file is split into blocks: the Hadoop 2 default is 128 MB (earlier versions used 64 MB)
• Blocks reside on different physical DataNodes
• Behind the scenes, each HDFS block is supported by multiple
  operating system blocks
• If a file or a chunk of the file is smaller than the block size, only the
  needed space is used. For example, with a 64 MB block size, a 210 MB file is split as:

  64 MB | 64 MB | 64 MB | 18 MB

[Diagram: one HDFS block mapped onto many smaller OS blocks]

Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and
distributed throughout the cluster. In this way, the map and reduce functions can be
executed on smaller subsets of your larger data sets, and this provides the scalability
that is needed for big data processing. See https://hortonworks.com/apache/hdfs.
In earlier versions of Hadoop/HDFS, the default blocksize was often quoted as 64 MB,
but the current default setting for Hadoop/HDFS is noted in
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-
default.xml
• dfs.blocksize = 134217728
Default block size for new files, in bytes. You can use the following suffix (case
insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size
(such as 128k, 512m, 1g, etc.), or you can provide the complete size in bytes
(such as 134217728 for 128 MB).
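As a hedged example of working with the block size (the file and directory names are hypothetical), the -D generic option can override dfs.blocksize for a single upload, and hdfs fsck shows how the file was actually split:

hdfs dfs -D dfs.blocksize=67108864 -put big.log /data/big.log   # write this file with 64 MB blocks
hdfs fsck /data/big.log -files -blocks                          # report the blocks the file occupies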
It should be noted that Linux itself has both a logical block size (typically 4 KB) and a
physical or hardware block size (typically 512 bytes).

Linux filesystems:
• For ext2 or ext3, the situation is relatively simple: each file occupies a certain
number of blocks. All blocks on a given filesystem have the same size, usually
one of 1024, 2048 or 4096 bytes.
• What is the physical blocksize?
  [clsadmin@chs-gbq-108-mn003 ~]$ lsblk -o NAME,PHY-SEC
  NAME      PHY-SEC
  xvda          512
  ├─xvda1       512
  └─xvda2       512
  xvdb          512
  └─xvdb1       512
  xvdc          512
  ├─xvdc1       512
  ├─xvdc2
  ├─xvdc3
  ├─xvdc4
  └─xvdc5

HDFS replication of blocks


• Blocks of data are replicated to multiple nodes
▪ behavior is controlled by replication factor, configurable per file
▪ default is 3 replicas

• Approach:
  ▪ first replica goes on any node in the cluster
  ▪ second replica on a node in a different rack
  ▪ third replica on a different node in the second rack
  The approach cuts inter-rack network bandwidth, which improves write performance

To ensure reliability and performance, the placement of replicas is critical to HDFS.
HDFS can be distinguished from most other distributed file systems by replica
placement. This feature requires tuning and experience. Improving data reliability,
availability, and network bandwidth utilization are the purpose of a rack-aware replica
placement policy. The current implementation for the replica placement policy is a first
effort in this direction. The short-term goals of implementing this policy are to validate it
on production systems, learn more about its behavior, and build a foundation to test
and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across
many racks. Communication between two nodes in different racks has to go through
switches. In most cases, network bandwidth between machines in the same rack is
greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process
outlined in Rack Awareness (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/RackAwareness.html).
A simple but non-optimal policy is to place
replicas on unique racks. This prevents losing data when an entire rack fails and allows
use of bandwidth from multiple racks when reading data. This policy evenly distributes
replicas in the cluster which makes it easy to balance load on component failure,
however this policy increases the cost of writes because a write needs to transfer
blocks to multiple racks.

For the common case, when the replication factor is three, HDFS's placement policy is
to put one replica on one node in the local rack, another on a different node in the local
rack, and the last on a different node in a different rack. This policy cuts the inter-rack
write traffic which generally improves write performance. The chance of rack failure is
far less than that of node failure; this policy does not impact data reliability and
availability guarantees. However, it does reduce the aggregate network bandwidth used
when reading data since a block is placed in only two unique racks rather than three.
With this policy, the replicas of a file do not evenly distribute across the racks. One third
of replicas are on one node, two thirds of replicas are on one rack, and the other third
are evenly distributed across the remaining racks. This policy improves write
performance without compromising data reliability or read performance.
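The replication factor can be inspected and changed per file. A hedged sketch (the path and file names are hypothetical):

hdfs fsck /data/big.log -files -blocks -locations -racks   # show each block's replicas and their racks
hdfs dfs -setrep -w 2 /data/big.log                        # change this file to 2 replicas; -w waits for completion
hdfs dfs -D dfs.replication=2 -put other.log /data/        # or set the factor when the file is written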


Setting rack network topology (Rack Awareness)


• Defined by a script which specifies which node is in which rack, where
the rack is the network switch to which the node is connected, not the
metal framework where the nodes are physically stacked.
• The script is referenced by net.topology.script.file.name in the
  Hadoop configuration file core-site.xml. For example:
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf/rack-topology.sh</value>
  </property>
• The network topology script (net.topology.script.file.name in the
example above) receives as arguments one or more IP addresses of
nodes in the cluster. It returns on stdout a list of rack names, one for
each input.
• One simple approach is to use IP addressing of 10.x.y.z where
x = cluster number, y = rack number, z = node within rack; and an
appropriate script to decode this into y/z.

For small clusters in which all servers are connected by a single switch, there are only
two levels of locality: on-machine and off-machine. When loading data from a
DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the
local DataNode, and will pick two other machines at random from the cluster.
For larger Hadoop installations which span multiple racks, it is important to ensure that
replicas of data exist on multiple racks so that the loss of a switch does not render
portions of the data unavailable, due to all the replicas being underneath it.
HDFS can be made rack-aware using a script which allows the master node to map the
network topology of the cluster. While alternate configuration strategies can be used,
the default implementation allows you to provide an executable script which returns the
rack address of each of a list of IP addresses.
The network topology script receives as arguments one or more IP addresses of nodes
in the cluster. It returns on the standard output a list of rack names, one for each input.
The input and output order must be consistent.
To set the rack mapping script, specify the key net.topology.script.file.name in
core-site.xml (older releases used topology.script.file.name in hadoop-site.xml). This
provides a command to run to return a rack id; it must be an executable script or
program. By default, Hadoop will attempt to send a set of IP addresses to the script as
several separate command line arguments. You can control the maximum acceptable
number of arguments with the net.topology.script.number.args key.
Rack ids in Hadoop are hierarchical and look like path names. By default, every node
has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, such
as /foo/bar-rack. Path elements further to the left are higher up the tree, so a
reasonable structure for a large installation may be /top-switch-name/rack-name.
The following example script performs rack identification based on IP addresses given
a hierarchical IP addressing scheme enforced by the network administrator. This may
work directly for simple installations; more complex network configurations may require
a file- or table-based lookup process. Care should be taken in that case to keep the
table up-to-date as nodes are physically relocated, etc. This script requires that the
maximum number of arguments be set to 1.
#!/bin/bash
# Set rack id based on IP address.
# Assumes network administrator has complete control
# over IP addresses assigned to nodes and they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically. e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.)
# This is invoked with an IP address as its only argument

# get IP address from the input ($1 is the first argument; $0 would be the script name)
ipaddr=$1

# select "x.y" and convert it to "x/y"
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}
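A hedged usage sketch (the script name ip-rack.sh is hypothetical; the mapping follows the 10.x.y.z convention described in the comments):

$ bash ip-rack.sh 10.1.2.15
/1/2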

A more complex rack-aware script:


File name: rack-topology.sh

#!/bin/bash
# Adjust/Add the property "net.topology.script.file.name"
# to core-site.xml with the "absolute" path to this
# file. ENSURE the file is "executable".

# Supply appropriate rack prefix
RACK_PREFIX=default

# To test, supply a hostname as script input:
if [ $# -gt 0 ]; then

  CTL_FILE=${CTL_FILE:-"rack_topology.data"}
  HADOOP_CONF=${HADOOP_CONF:-"/etc/hadoop/conf"}

  if [ ! -f ${HADOOP_CONF}/${CTL_FILE} ]; then
    echo -n "/$RACK_PREFIX/rack "
    exit 0
  fi

  while [ $# -gt 0 ] ; do
    nodeArg=$1
    exec< ${HADOOP_CONF}/${CTL_FILE}
    result=""
    while read line ; do
      ar=( $line )
      if [ "${ar[0]}" = "$nodeArg" ] ; then
        result="${ar[1]}"
      fi
    done
    shift
    if [ -z "$result" ] ; then
      echo -n "/$RACK_PREFIX/rack "
    else
      echo -n "/$RACK_PREFIX/rack_$result "
    fi
  done

else
  echo -n "/$RACK_PREFIX/rack "
fi

Sample Topology Data File

File name: rack_topology.data

# This file should be:
# - Placed in the /etc/hadoop/conf directory
# - On the Namenode (and backups IE: HA, Failover, etc)
# - On the Job Tracker OR Resource Manager (and any Failover JT's/RM's)
# Add Hostnames to this file. Format <host ip> <rack_location>
192.168.2.10 01
192.168.2.11 02
192.168.2.12 03
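A hedged check of the script against this data file (assuming both files sit in /etc/hadoop/conf and the script is executable):

$ /etc/hadoop/conf/rack-topology.sh 192.168.2.11
/default/rack_02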


Compression of files
• File compression brings two benefits:
▪ reduces the space need to store files
▪ speeds up data transfer across the network or to/from disk
• But is the data splittable? (necessary for parallel reading)
• Use codecs, such as org.apache.hadoop.io.compress.SnappyCodec

  Compression format   Algorithm   Filename extension   Splittable?
  DEFLATE              DEFLATE     .deflate             No
  gzip                 DEFLATE     .gz                  No
  bzip2                bzip2       .bz2                 Yes
  LZO                  LZO         .lzo / .cmx          Yes, if indexed in preprocessing
  LZ4                  LZ4         .lz4                 No
  Snappy               Snappy      .snappy              No

gzip
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm,
which is a combination of LZ77 and Huffman Coding.
bzip2
bzip2 is a freely available, patent free, high-quality data compressor. It typically
compresses files to within 10% to 15% of the best available techniques (the PPM
family of statistical compressors), whilst being around twice as fast at
compression and six times faster at decompression.
LZO
The LZO compression format is composed of many smaller (~256K) blocks of
compressed data, allowing jobs to be split along block boundaries. Moreover, it
was designed with speed in mind: it decompresses about twice as fast as gzip,
meaning it is fast enough to keep up with hard drive read speeds. It does not
compress quite as well as gzip; expect files that are on the order of 50% larger
than their gzipped version. But that is still 20-50% of the size of the files without
any compression at all, which means that IO-bound jobs complete the map phase
about four times faster.

LZO = Lempel-Ziv-Oberhumer,
https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Oberhumer. A
free software tool which implements it is lzop. The original library was written in
ANSI C, and it has been made available under the GNU General Public
License. Versions of LZO are available for the Perl, Python, and Java languages.
The copyright for the code is owned by Markus F. X. J. Oberhumer.
LZ4
LZ4 is a lossless data compression algorithm that is focused on compression and
decompression speed. It belongs to the LZ77 family of byte-oriented compression
schemes. The algorithm gives a slightly worse compression ratio than algorithms
like gzip. However, compression speeds are several times faster than gzip while
decompression speeds can be significantly faster than LZO (see
http://en.wikipedia.org/wiki/LZ4_(compression_algorithm)). The reference
implementation in C by Yann Collet is licensed under a BSD license.
Snappy
Snappy is a compression/decompression library. It does not aim for maximum
compression, or compatibility with any other compression library; instead, it aims
for very high speeds and reasonable compression. For instance, compared to the
fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but
the resulting compressed files are anywhere from 20% to 100% bigger. On a
single core of a Core i7 processor in 64-bit mode, Snappy compresses at about
250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is
widely used inside Google, in everything from BigTable and MapReduce to RPC
systems.
All packages produced by the Apache Software Foundation (ASF), such as Hadoop,
are implicitly licensed under the Apache License, Version 2.0, unless otherwise
explicitly stated. The licensing of other algorithms, such as LZO, that are not licensed
under ASF may pose some problems for distributions that rely solely on the Apache
License.
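A hedged illustration of why splittability matters when loading compressed files (the file names are hypothetical): a large .bz2 file can be processed by many parallel map tasks, while an equally large .gz file is handled by a single task.

bzip2 -k access_log                      # produces access_log.bz2 (block-compressed, splittable)
hdfs dfs -put access_log.bz2 /data/logs/
gzip -k access_log                       # produces access_log.gz (a single stream, not splittable)
hdfs dfs -put access_log.gz /data/logs/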


Which compression format should I use?


• Most to least effective:
▪ use a container format (Sequence file, Avro, ORC, or Parquet)
▪ for files use a fast compressor such as LZO, LZ4, or Snappy
▪ use a compression format that supports splitting, such as bz2 (slow)
or one that can be indexed to support splitting, such as LZO
▪ split files into chunks and compress each chunk separately using a
supported compression format (does not matter if splittable) - choose a
chunk size so that compressed chunks are approximately the size of an
HDFS block
▪ store files uncompressed

Take advantage of compression, but the compression format
should depend on file size, data format, and tools used

References:
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Data_Compression
• http://comphadoop.weebly.com
• https://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
• https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_data_compression_performance.html


NameNode startup
1. NameNode reads the fsimage into memory
2. NameNode applies edits log changes
3. NameNode waits for block data from the DataNodes
   ▪ NameNode does not store the physical-location information of the blocks
   ▪ NameNode exits SafeMode when 99.9% of blocks have at least one copy
     accounted for
[Diagram: (1) the fsimage is read from the NameNode's namedir, (2) the edits log is
read and applied, (3) block information is sent to the NameNode by the DataNodes,
each of which holds blocks in its datadir]

During startup, the NameNode loads the file system state from the fsimage and the
edits log file. It then waits for the DataNodes to report their blocks so that it does not
prematurely start replicating blocks, even though enough replicas may already exist in
the cluster.
During this time NameNode stays in SafeMode. SafeMode for the NameNode is
essentially a read-only mode for the HDFS cluster, where it does not allow any
modifications to file system or blocks. Normally the NameNode leaves SafeMode
automatically after the DataNodes have reported that most file system blocks are
available.
If required, HDFS can be placed in SafeMode explicitly using the command hdfs
dfsadmin -safemode. The NameNode front page shows whether SafeMode is on or
off.
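For reference, the SafeMode subcommands look like this (a hedged sketch of typical administrative use):

hdfs dfsadmin -safemode get     # report whether SafeMode is ON or OFF
hdfs dfsadmin -safemode enter   # put the NameNode into SafeMode explicitly
hdfs dfsadmin -safemode leave   # leave SafeMode explicitly
hdfs dfsadmin -safemode wait    # block until the NameNode leaves SafeMode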


NameNode files (as stored on disk)


[clsadmin@chs-gbq-108-mn002:~$] ls -l /data/hadoop/hdfs
total 0
drwxr-xr-x. 4 hdfs hadoop 66 Oct 25 00:33 namenode
drwxr-xr-x. 3 hdfs hadoop 40 Oct 25 00:37 namesecondary
[clsadmin@chs-gbq-108-mn002:~$] ls -l /data/hadoop/hdfs/namesecondary/current
total 6780
-rw-r--r--. 1 hdfs hadoop 1613711 Oct 24 18:32 edits_0000000000000000001-0000000000000012697
-rw-r--r--. 1 hdfs hadoop   16378 Oct 25 00:38 edits_0000000000000021781-0000000000000021892
-rw-r--r--. 1 hdfs hadoop 1340598 Oct 25 06:38 edits_0000000000000021893-0000000000000030902
-rw-r--r--. 1 hdfs hadoop 1220844 Oct 25 12:38 edits_0000000000000030903-0000000000000039229
-rw-r--r--. 1 hdfs hadoop 1237843 Oct 25 18:38 edits_0000000000000039230-0000000000000047662
-rw-r--r--. 1 hdfs hadoop 1239775 Oct 26 00:38 edits_0000000000000047663-0000000000000056108
-rw-r--r--. 1 hdfs hadoop  122144 Oct 25 18:38 fsimage_0000000000000047662
-rw-r--r--. 1 hdfs hadoop      62 Oct 25 18:38 fsimage_0000000000000047662.md5
-rw-r--r--. 1 hdfs hadoop  124700 Oct 26 00:38 fsimage_0000000000000056108
-rw-r--r--. 1 hdfs hadoop      62 Oct 26 00:38 fsimage_0000000000000056108.md5
-rw-r--r--. 1 hdfs hadoop     206 Oct 26 00:38 VERSION

These are the actual storage files, kept on the NameNode's local file system, where the NameNode stores its metadata:
• fsimage
• edits
• VERSION
There is a current edits_inprogress file that is accumulating edits (adds, deletes) since
the last update of the fsimage. This current edits file is closed off and the changes
incorporated into a new version of the fsimage based on whichever of two configurable
events occurs first:
• edits file reaches a certain size (here 1 MB, default 64 MB)
• time limit between updates is reached, and there have been updates (default 1
hour)
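The checkpoint triggers are configurable. A hedged way to inspect them on a Hadoop 2 cluster (the property names assume the current hdfs-default.xml):

hdfs getconf -confKey dfs.namenode.checkpoint.period   # seconds between checkpoints (default 3600)
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # uncheckpointed transactions that force a checkpoint (default 1000000)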


Adding a file to HDFS: replication pipelining


1. File is added to NameNode memory by persisting info in edits log
2. Data is written in blocks to DataNodes
▪ DataNode starts chained copy to two other DataNodes
▪ if at least one write for each block succeeds, the write is successful

[Diagram: the client, the NameNode (namedir with edits log and fsimage), and a chain
of DataNodes, each writing blocks to its datadir. The labeled steps are:]
1. The client talks to the NameNode, which determines which DataNodes will store
   the replicas of each block.
2. The edits log is changed in memory and on disk.
3. The client API sends the data block to the first DataNode.
4. The first DataNode daisy-chain-writes to the second, and the second to the third,
   with an acknowledgement back to the previous node.
5. The first DataNode then confirms to the NameNode that replication is complete.

Data Blocks
HDFS is designed to support very large files. Applications that are compatible with
HDFS are those that deal with large data sets. These applications write their data only
once but they read it one or more times and require these reads to be satisfied at
streaming speeds. HDFS supports write-once-read-many semantics on files. A typical
block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB
"splits," and if possible, each split (or block) will reside on a different DataNode.
Staging
A client request to create a file does not reach the NameNode immediately. In fact,
initially the HDFS client caches the file data into a temporary local file. Application writes
are transparently redirected to this temporary local file. When the local file accumulates
data worth over one HDFS block size, the client contacts the NameNode. The
NameNode inserts the file name into the file system hierarchy and allocates a data
block for it. The NameNode responds to the client request with the identity of the
DataNode and the destination data block. Then the client flushes the block of data from
the local temporary file to the specified DataNode. When a file is closed, the remaining
un-flushed data in the temporary local file is transferred to the DataNode.

The client then tells the NameNode that the file is closed. At this point, the NameNode
commits the file creation operation into a persistent store. If the NameNode dies before
the file is closed, the file is lost.
The aforementioned approach was adopted after careful consideration of target
applications that run on HDFS. These applications need streaming writes to files. If a
client writes to a remote file directly without any client side buffering, the network speed
and the congestion in the network impacts throughput considerably. This approach is
not without precedent. Earlier distributed file systems, such as AFS, have used client
side caching to improve performance. A POSIX requirement has been relaxed to
achieve higher performance of data uploads.
Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as
explained previously. Suppose the HDFS file has a replication factor of three. When the
local file accumulates a full block of user data, the client retrieves a list of DataNodes
from the NameNode. This list contains the DataNodes that will host a replica of that
block. The client then flushes the data block to the first DataNode. The first DataNode
starts receiving the data in small portions (4 KB), writes each portion to its local
repository and transfers that portion to the second DataNode in the list. The second
DataNode, in turn starts receiving each portion of the data block, writes that portion to
its repository and then flushes that portion to the third DataNode. Finally, the third
DataNode writes the data to its local repository. Thus, a DataNode can be receiving
data from the previous one in the pipeline and at the same time forwarding data to the
next one in the pipeline. The data is pipelined from one DataNode to the next.
For good descriptions of the process, see the tutorials at:
• HDFS Users Guide at http://hadoop.apache.org/docs/current/hadoop-project-
dist/hadoop-hdfs/HdfsUserGuide.html
• An introduction to the Hadoop Distributed File System at
https://hortonworks.com/apache/hdfs
• How HDFS works at https://hortonworks.com/apache/hdfs/#section_2


Managing the cluster


• Adding a data node
▪ Start new DataNode (pointing to NameNode)
▪ If required, run balancer to rebalance blocks across the cluster:
hdfs balancer
• Removing a node
▪ Simply remove DataNode
▪ Better: Add node to exclude file and wait till all blocks have been moved
▪ Can be checked in server admin console server:50070

• Checking filesystem health
  ▪ Use: hdfs fsck …

Apache Hadoop clusters grow and change with use. The normal method is to use
Apache Ambari to build your initial cluster with a base set of Hadoop services targeting
known use cases. You may want to add other services for new use cases, and even
later you may need to expand the storage and processing capacity of the cluster.
Ambari can help in both scenarios, the initial configuration and the later
expansion/reconfiguration of your cluster.
When you add more hosts to the cluster, you can assign them to run as
DataNodes (and NodeManagers under YARN, as you will see later). This allows you to
expand both your HDFS storage capacity and your overall processing power.
Similarly, you can remove DataNodes if they are malfunctioning or you want to
reorganize your cluster.
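A hedged sketch of the commands behind these tasks (the decommissioning step assumes dfs.hosts.exclude is configured in hdfs-site.xml):

hdfs dfsadmin -report            # overall capacity and per-DataNode status
hdfs balancer -threshold 10      # move blocks until nodes are within 10% of average utilization
hdfs fsck / -files -blocks       # filesystem health report
# To decommission a DataNode: add its hostname to the file named by dfs.hosts.exclude, then:
hdfs dfsadmin -refreshNodes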


HDFS-2 NameNode HA (High Availability)


• HDFS-2 adds NameNode High Availability
• Standby NameNode needs filesystem transactions and block locations for fast failover
• Every filesystem modification is logged to at least 3 quorum JournalNodes by the
  active NameNode
  ▪ Standby NameNode applies changes from the JournalNodes as they occur
  ▪ Majority of JournalNodes defines reality
  ▪ Split brain is avoided by the JournalNodes (they will only allow one NameNode to write to them)
• DataNodes send block locations and heartbeats to both NameNodes
• Memory state of the Standby NameNode is very close to the Active NameNode
  ▪ Much faster failover than a cold start

[Diagram: Active and Standby NameNodes sharing JournalNode1-3, with DataNode1,
DataNode2, DataNode3 ... DataNodeX reporting to both NameNodes]


With Hadoop 2, high availability is supported out-of-the-box.
Features:
• two available NameNodes: Active and Standby
• transactions logged to JournalNodes, managed by the Quorum Journal Manager (QJM)
• standby node periodically gets updates
• DataNodes send block locations and heartbeats to both NameNodes
• when failures occur, Standby can take over with very small downtime
• no cold start
Deployment:
• need to have two dedicated NameNodes in the cluster
• QJM may coexist with other services (at least 3 in a cluster)
• no need of a Secondary NameNode
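For reference, a hedged sketch of checking and exercising an HA pair ("nn1" and "nn2" are placeholder NameNode IDs taken from dfs.ha.namenodes.<nameservice>):

hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2      # initiate a manual failover from nn1 to nn2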


Secondary NameNode
• During operation the primary NameNode cannot merge the fsimage and edits log
• This is done on the Secondary NameNode
  ▪ Every couple of minutes, the Secondary NameNode copies the new edits log from the primary NameNode
  ▪ Merges the edits log into the fsimage
  ▪ Copies the new merged fsimage back to the primary NameNode
• Not HA, but faster startup time
  ▪ The Secondary NameNode does not have a complete image; in-flight transactions would be lost
  ▪ The primary NameNode needs to merge less during startup
• Was temporarily deprecated because of NameNode HA but has some advantages
  ▪ (no need for quorum nodes, less network traffic, fewer moving parts)

[Diagram: new edits log entries are copied from the primary NameNode to the
Secondary NameNode, which merges them into the fsimage and copies the merged
fsimage back to the primary NameNode's namedir]

In the older approach, a Secondary NameNode is used.
The NameNode stores the HDFS filesystem information in a file named fsimage.
Updates to the file system (add/remove blocks) are not updating the fsimage file, but
instead are logged into a file, so the I/O is fast append only streaming as opposed to
random file writes. When restarting, the NameNode reads the fsimage and then applies
all the changes from the log file to bring the filesystem state up to date in memory. This
process takes time.
The job of the Secondary NameNode is not to be a secondary to the name node, but
only to periodically read the filesystem changes log and apply them into the fsimage file,
thus bringing it up to date. This allows the NameNode to start up faster next time.
Unfortunately, the Secondary NameNode service is not a standby secondary
NameNode, despite its name. Specifically, it does not offer HA for the NameNode. This
is well illustrated in the slide above.
Note that more recent distributions have NameNode High Availability using NFS
(shared storage) and/or NameNode High Availability using a Quorum Journal Manager
(QJM).


Possible FileSystem setup approaches


• Hadoop 2 with HA
▪ no single point of failure
▪ wide community support
• Hadoop 2 without HA (or, Hadoop 1.x in older versions)
▪ copy namedir to NFS (RAID)
▪ have virtual IP for backup NameNode
▪ still some failover time to read blocks, no instant failover but less overhead

The slide shows two approaches to high availability.


Federated NameNode (HDFS2)


• New in Hadoop2: NameNodes (NN) can be federated
▪ Historically NameNodes can become a bottleneck on huge clusters
▪ One million blocks or ~100TB of data require roughly one GB RAM in NN
• Blockpools
▪ Administrator can create separate blockpools/namespaces with different
NNs
▪ DataNodes register on all NNs
▪ DataNodes store data of all blockpools (otherwise setup separate clusters)
▪ New ClusterID identifies all NNs in a cluster.
▪ A namespace and its block pool together are called Namespace Volume
▪ You define which blockpool to use by connecting to a specific NN
▪ Each NameNode still has its own separate backup/secondary/checkpoint
node
• Benefits
▪ One NN failure will not impact other blockpools
▪ Better scalability for large numbers of file operations
References:
• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-
hdfs/Federation.html
With Federated NameNodes in HDFS2, there are two main layers:
• Namespace
• Consists of directories, files and blocks.
• It supports all the namespace related file system operations such as create,
delete, modify and list files and directories.

• Block Storage Service, which has two parts:
▪ Block Management (performed in the NameNode)
- Provides DataNode cluster membership by handling registrations and
periodic heartbeats.
- Processes block reports and maintains the location of blocks.
- Supports block-related operations such as create, delete, modify, and
get block location.
- Manages replica placement, block replication for under-replicated blocks,
and deletion of blocks that are over-replicated.
▪ Storage, which is provided by the DataNodes storing blocks on the local file
system and allowing read/write access.
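Incidentally, a quick way to see the DataNode membership and capacity information that
this block management layer tracks is the dfsadmin report. The command itself is
standard; the exact fields shown vary by Hadoop version, and some details may require
administrator privileges.

# Summarize configured/used capacity and list each registered DataNode,
# including its last heartbeat ("Last contact") and the space it contributes
hdfs dfsadmin -report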
The prior HDFS architecture allows only a single namespace for the entire cluster. In
that configuration, a single NameNode manages the namespace. HDFS Federation
addresses this limitation by adding support for multiple NameNodes/namespaces to
HDFS.
Multiple NameNodes / Namespaces
To scale the name service horizontally, federation uses multiple independent
NameNodes / namespaces. The NameNodes are federated; that is, they are
independent and do not require coordination with each other. The DataNodes are used
as common storage for blocks by all the NameNodes. Each DataNode registers with all
the NameNodes in the cluster. DataNodes send periodic heartbeats and block reports.
They also handle commands from the NameNodes.
Users may use ViewFS to create personalized namespace views. ViewFS is analogous
to client-side mount tables in some UNIX/Linux systems.
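To illustrate how a client chooses a blockpool/namespace by connecting to a specific
NameNode, the sketch below uses hypothetical NameNode hosts nn-a.example.com and
nn-b.example.com on the default RPC port 8020; substitute your cluster's values.

# List the NameNodes that the client configuration knows about
hdfs getconf -namenodes

# Address each federated namespace explicitly through its NameNode's URI
hdfs dfs -ls hdfs://nn-a.example.com:8020/user
hdfs dfs -ls hdfs://nn-b.example.com:8020/projects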


dfs: file system shell (1 of 4)


• File System Shell (dfs)
▪ Invoked as follows:

hdfs dfs <args>


▪ Example: listing the current directory in HDFS

hdfs dfs -ls .

▪ Note that the current directory is designated by dot ("."), the "here" symbol in
Linux/UNIX
▪ If you want the root of the HDFS file system, you would use slash ("/")


dfs: file system shell (1 of 4)


HDFS can be manipulated through a Java API or through a command line interface. All
commands for manipulating HDFS through Hadoop's command line interface begin
with hdfs dfs. This is the file system shell. This is followed by the command name as
an argument to hdfs dfs. These commands start with a dash. For example, the ls
command for listing a directory is a common UNIX command and is preceded with a
dash. As on UNIX systems, ls can take a path as an argument. In this example, the
path is the current directory, represented by a single dot.
dfs is one of the command options for hdfs. If you just type the command hdfs by
itself, you will see other options.
A good tutorial at this stage can be found at
https://developer.yahoo.com/hadoop/tutorial/module2.html.
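For example, a few commands that help you get oriented (the output will vary from
cluster to cluster):

hdfs dfs -ls .        # list the current HDFS directory (normally /user/<username>)
hdfs dfs -ls /        # list the root of the HDFS file system
hdfs dfs -help ls     # show the built-in help for an individual dfs command
hdfs                  # with no arguments, lists the other hdfs command options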


dfs: file system shell (2 of 4)


• DFS shell commands take URIs as arguments
▪ URI format:

scheme://authority/path
• Scheme:
▪ For the local filesystem, the scheme is file
▪ For HDFS, the scheme is hdfs
• Authority is the hostname and port of the NameNode
hdfs dfs -copyFromLocal file:///myfile.txt
hdfs://localhost:9000/user/virtuser/myfile.txt
• Scheme and authority are often optional
▪ Defaults are taken from configuration file core-site.xml


Just as for the ls command, the file system shell commands can take paths as
arguments. These paths can be expressed in the form of uniform resource identifiers or
URIs. The URI format consists of a scheme, an authority, and a path. There are multiple
schemes supported. The local file system has a scheme of "file". HDFS has a scheme
called "hdfs."
For example, if you want to copy a file called "myfile.txt" from your local filesystem to an
HDFS file system on the localhost, you can do this by issuing the command shown.
The copyFromLocal command takes a URI for the source and a URI for the destination.
"Authority" is the hostname of the NameNode. For example, if the NameNode is in
localhost and accessed on port 9000, the authority would be localhost:9000.
The scheme and the authority do not always need to be specified; instead, you may rely
on their default values. The defaults are taken from core-site.xml (the fs.defaultFS
property) in the configuration directory of your Hadoop installation.
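You can check which default file system the client will fall back to. The property name
fs.defaultFS is standard in Hadoop 2 (older releases used fs.default.name), and the
value shown in the comment is only an example:

# Print the default scheme and authority used when a URI omits them
hdfs getconf -confKey fs.defaultFS     # for example: hdfs://localhost:9000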


dfs: file system shell (3 of 4)


• Many POSIX-like commands
▪ cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail
• Some HDFS-specific commands
▪ copyFromLocal, put, copyToLocal, get, getmerge, setrep
• copyFromLocal / put
▪ Copy files from the local file system into fs

hdfs dfs -copyFromLocal localsrc dst


or
hdfs dfs -put localsrc dst


HDFS supports many POSIX-like commands. HDFS is not a fully POSIX (Portable
operating system interface for UNIX) compliant file system, but it supports many of the
commands. The HDFS commands are mostly easily-recognized UNIX commands like
cat and chmod. There are also a few commands that are specific to HDFS such as
copyFromLocal.
Note that:
• localsrc and dst are placeholders for your actual file(s)
• localsrc can be a directory or a list of files separated by space(s)
• dst can be a new file name (in HDFS) for a single-file-copy, or a directory (in
HDFS), that is the destination directory


Example:
hdfs dfs -put *.txt ./Gutenberg
…copies all the text files in the local Linux directory with the suffix of .txt to the
directory Gutenberg in the user’s home directory in HDFS
The "direction" implied by the names of these commands (copyFromLocal, put) is
relative to the user, who can be thought to be situated outside HDFS.
Also, you should note there is no cd (change directory) command available for hadoop.
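Because there is no cd command, every command takes a path, and relative paths are
resolved against your HDFS home directory. A small sketch, assuming the clsadmin user
from the lab later in this unit:

hdfs dfs -ls Gutenberg                  # relative path: resolves to /user/clsadmin/Gutenberg
hdfs dfs -ls /user/clsadmin/Gutenberg   # the equivalent absolute path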


dfs: file system shell (4 of 4)


• copyToLocal / get
▪ Copy files from fs into the local file system

hdfs dfs -copyToLocal [-ignorecrc] [-crc] <src> <localdst>

or

hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

• Creating a directory: mkdir


hdfs dfs -mkdir /newdir


The copyToLocal (aka get) command copies files out of the file system you specify
and into the local file system.
get
Usage: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
• Copy files to the local file system. Files that fail the CRC check may be copied
with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
• Example: hdfs dfs -get hdfs:/mydir/file file:///home/hdpadmin/localfile
Another important note: for files in Linux, where you would use the file:// scheme, two
slashes reference files relative to your current Linux directory (pwd). To reference files
absolutely, use three slashes (and mentally pronounce it as "slash-slash, pause, slash").
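Two of the HDFS-specific commands listed earlier, getmerge and setrep, are sketched
below; the file and directory names are illustrative and assume the Gutenberg directory
used elsewhere in this unit.

# Concatenate every file in an HDFS directory into a single local file
hdfs dfs -getmerge Gutenberg all_books.txt

# Change a file's replication factor to 2 and wait (-w) for replication to complete
hdfs dfs -setrep -w 2 Gutenberg/The_Prince.txt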


Unit summary
• Understand the basic need for a big data strategy in terms of parallel
reading of large data files and internode network speed in a cluster
• Describe the nature of the Hadoop Distributed File System (HDFS)
• Explain the function of the NameNode and DataNodes in an Hadoop
cluster
• Explain how files are stored and blocks ("splits") are replicated


Unit summary


Checkpoint
1. True or False? Hadoop systems are designed for transaction
processing.
2. List the Hadoop open source projects.
3. What is the default number of replicas in a Hadoop system?
4. True or False? One of the driving principles of Hadoop is that the data is
brought to the program.
5. True or False? At least 2 NameNodes are required for a standalone
Hadoop cluster.


Checkpoint


Checkpoint solution
1. True or False? Hadoop systems are designed for transaction
processing.
▪ Hadoop systems are not designed for transaction processing and would perform
very poorly at it. Hadoop systems are designed for batch processing.
2. List the Hadoop open source projects.
▪ To name a few, MapReduce, YARN, Ambari, Hive, HBase, etc.
3. What is the default number of replicas in a Hadoop system?
▪ 3
4. True or False? One of the driving principles of Hadoop is that the data
is brought to the program.
▪ False. The program is brought to the data, to eliminate the need to move large
amounts of data.
5. True or False? At least 2 NameNodes are required for a standalone
Hadoop cluster.
▪ Only 1 NameNode is required per cluster; 2 are required for high-availability.

Checkpoint solution


Lab 1
• File access and basic commands with HDFS


Lab 1: File access and basic commands with HDFS


Lab 1:
File access and basic commands with HDFS
Purpose:
This lab is intended to provide you with experience in using the Hadoop
Distributed File System (HDFS). The basic HDFS file system commands
learned here will be used throughout the remainder of the course.
You will also be moving some data into HDFS that will be used in later units of
this course. The files that you will need are stored in the lab files directory on
your local machine (for example, /home/labfiles on Linux or C:\LabFiles on Windows).

Property                  Location                                    Sample Value
Ambari URL                cluster.service_endpoints.ambari_console   https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud:9443
Hostname (after the @)    cluster.service_endpoints.ssh               chs-gbq-108-mn003.us-south.ae.appdomain.cloud
Password                  cluster.password                            24Z5HHf7NUuy
SSH                       cluster.service_endpoints.ssh               ssh clsadmin@chs-gbq-108-mn003.us-south.ae.appdomain.cloud
Username                  cluster.user                                clsadmin


Task 1. Install IBM Analytics Engine command line interface
and upload sample files
The IBM Analytics Engine CLI will be needed in later steps to upload files from
your local system to the cluster.
1. Download and install the Cloud Foundry CLI, from here:
https://github.com/cloudfoundry/cli/blob/master/README.md#downloads
2. Download and install the IBM Cloud CLI as described here:
https://console.bluemix.net/docs/services/AnalyticsEngine/Upload-files-to-
HDFS.html
Note: You will need to restart your computer after the installation on Windows.
3. To install the IBM Analytics Engine CLI from the IBM Cloud repository, open a
Command Prompt as Administrator and then run the following command:
C:\>bx plugin install -r Bluemix analytics-engine
Looking up 'analytics-engine' from repository 'IBM Cloud'...
Plug-in 'analytics-engine 1.0.142' found in repository 'IBM Cloud'
Attempting to download the binary file...


10.25 MiB / 10.25 MiB [==========================================================] 100.00% 35s
10747904 bytes downloaded
Installing binary...
OK
Plug-in 'analytics-engine 1.0.142' was successfully installed into
C:\Users\Administrator\.bluemix\plugins\analytics-engine. Use 'bx plugin
show analytics-engine' to show its details.
4. Set the Analytics Engine server endpoint to the Ambari URL from the table above
(press Enter to accept the default values when you are asked about the Ambari and
Knox port numbers):
C:\>bx ae endpoint <Ambari URL>
Example:
C:\>bx ae endpoint https://chs-gbq-108-mn001.us-
south.ae.appdomain.cloud:9443
Registering endpoint 'https://chs-gbq-108-mn001.us-
south.ae.appdomain.cloud'...
Ambari Port Number [Optional: Press enter for default value] (9443)>
Knox Port Number [Optional: Press enter for default value] (8443)>
OK
Endpoint 'https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud' set.
5. Create the Gutenberg directory under the user’s home in HDFS:
bx ae file-system --user Username --password Password mkdir Gutenberg
Example:
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy mkdir
Gutenberg
This creates a directory under the user’s home directory in HDFS, which is
/user/clsadmin.
6. Change directory to the LabFiles directory on your local machine. Example:
cd C:\LabFiles
7. Upload sample files to HDFS:
bx ae file-system --user Username --password Password put Frankenstein.txt Gutenberg/Frankenstein.txt
bx ae file-system --user Username --password Password put Pride_and_Prejudice.txt Gutenberg/Pride_and_Prejudice.txt
bx ae file-system --user Username --password Password put Tale_of_Two_Cities.txt Gutenberg/Tale_of_Two_Cities.txt
bx ae file-system --user Username --password Password put The_Prince.txt Gutenberg/The_Prince.txt
Example:
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put
Frankenstein.txt Gutenberg/Frankenstein.txt
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put
Pride_and_Prejudice.txt Gutenberg/Pride_and_Prejudice.txt


bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put
Tale_of_Two_Cities.txt Gutenberg/Tale_of_Two_Cities.txt
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put
The_Prince.txt Gutenberg/The_Prince.txt
Task 2. Learn some of the basic HDFS file system commands
The major reference for the HDFS File System Commands can be found on the
Apache Hadoop website. Additional commands can be found there. The URL
for the most current Hadoop version is:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-
hdfs/HDFSCommands.html
The HDFS File System commands are generally prefixed by: hdfs dfs.
Other HDFS commands, which are generally administrative, use parameters other than
dfs (for example: fsck, balancer, and so on).
1. Using PuTTY, connect to management node mn003 of your cluster.
2. To list the contents of your home directory and its subdirectories in HDFS, type:
hdfs dfs -ls -R .
Note: At the end of this command there is a period (it means the current
directory in HDFS).
Your results appear as shown below:
[clsadmin@chs-gbq-108-mn003 ~]$ hdfs dfs -ls -R .
drwxr-xr-x - clsadmin biusers 0 2018-10-27 22:30 Gutenberg
-rwxr-xr-x 3 clsadmin biusers 421504 2018-10-27 22:29
Gutenberg/Frankenstein.txt
-rwxr-xr-x 3 clsadmin biusers 697802 2018-10-27 22:29
Gutenberg/Pride_and_Prejudice.txt
-rwxr-xr-x 3 clsadmin biusers 757223 2018-10-27 22:30
Gutenberg/Tale_of_Two_Cities.txt
-rwxr-xr-x 3 clsadmin biusers 281398 2018-10-27 22:30
Gutenberg/The_Prince.txt
Note here in the listing of files the following for the last file:
-rwxr-xr-x 3 clsadmin biusers 281398 2018-10-27 22:30
Gutenberg/The_Prince.txt
Here you see the read-write-execute (rwx) permissions that you would find on a
typical Linux file.
The "3" here is the replication factor for the blocks (or "splits") of the
individual files. These files are too small (the last file is about 281 KB) to have more
than one block (the default block size is 128 MB), but each block of each file is
replicated three times.
You may see "1" instead of "3" in a single-node cluster (pseudo-distributed
mode). That too is normal for a single-node cluster as it does not really make
sense to replicate multiple copies on a single node.
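If you want to confirm the replication factor and block size that HDFS is using for one of
these files, the stat command can report them. The %r, %o, %b, and %n format
specifiers are standard in Hadoop 2 and later:

# Print replication factor, block size (bytes), file size (bytes), and file name
hdfs dfs -stat "%r %o %b %n" Gutenberg/The_Prince.txt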


Task 3. Explore one of the HDFS administrative commands.


There are several HDFS administration commands in addition to the HDFS file
system commands.
We will look at just one of them, fsck, which is run as: hdfs fsck
Many administrative commands cannot normally be run by regular users; a read-only
fsck like the one below, however, can be run as the regular cluster user (clsadmin).
fsck: Runs the HDFS filesystem checking utility.
Usage: hdfs fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]

COMMAND_OPTION   Description
<path>           Start checking from this path.
-move            Move corrupted files to /lost+found.
-delete          Delete corrupted files.
-openforwrite    Print out files opened for write.
-files           Print out files being checked.
-blocks          Print out the block report.
-locations       Print out locations for every block.
-racks           Print out the network topology for data-node locations.
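Once you are connected to the cluster in the step below, you could also point fsck at a
single file with some of these options to see its blocks and their DataNode locations; the
path shown assumes the Gutenberg files uploaded in Task 1.

# Report the files, blocks, and block locations for one file
hdfs fsck /user/clsadmin/Gutenberg/The_Prince.txt -files -blocks -locations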

1. In the PuTTY session that is connected to management node mn003 of your
cluster, type the following command:
hdfs fsck /
The results that you should see will be like the following:
[clsadmin@chs-gbq-108-mn003 ~]$ hdfs fsck /
Connecting to namenode via http://chs-gbq-108-mn002.us-
south.ae.appdomain.cloud:50070/fsck?ugi=clsadmin&path=%2F
FSCK started by clsadmin (auth:SIMPLE) from /172.16.162.135 for path /
at Sun Oct 28 05:33:20 UTC 2018
........................................................................
........................................Status: HEALTHY
Total size: 3861556034 B (Total open files size: 28636 B)
Total dirs: 253
Total files: 1240
Total symlinks: 0 (Files currently being written: 9)
Total blocks (validated): 1220 (avg. block size 3165209 B) (Total
open file blocks (not validated): 7)
Minimally replicated blocks: 1220 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)


Mis-replicated blocks: 0 (0.0 %)


Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Sun Oct 28 05:33:20 UTC 2018 in 151 milliseconds

The filesystem under path '/' is HEALTHY


2. Close all open windows.
Results:
You used basic Hadoop Distributed File System (HDFS) file system
commands, moving some data into HDFS that will be used in later units of this
course.
