
Prof. Dinkar Sitaram
Prof. K V Subramaniam
Prof. Sanchika Gupta
Prof. Prafullata K A
Prof. Mamatha Shetty

HDFS – HIGH AVAILABILITY


Recall - Exercise

▪ Design a program that
 On startup, registers its process id with ZooKeeper
 Checks to see which other processes are registered
 Monitors the other processes
 If any process fails, the process with the largest surviving id starts a new process
▪ What ZooKeeper data structures do you need?
▪ What is the pseudo-code?
Solution

▪ Similar to master election
▪ Which data structures?
 Each process creates an ephemeral sequential znode
 The process with the smallest remaining sequence number assumes the role of master
▪ On failure,
 All surviving processes are notified
 The process with the smallest remaining sequence number assumes the role of master
 The master restarts the failed process
 The restarted process may previously have had a lower sequence number, but it will not be the master: it gets a new, higher sequence number on restart (a sketch follows below)
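A minimal sketch of this scheme in Python using the kazoo ZooKeeper client (the /workers path, the ensemble address, and the restart_failed_process() helper are illustrative assumptions, not part of the exercise):

import os
from kazoo.client import KazooClient

ZK_HOSTS = "127.0.0.1:2181"   # assumption: a local ZooKeeper ensemble
PARENT = "/workers"           # assumption: parent znode for this exercise

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
zk.ensure_path(PARENT)

# Register this process: an ephemeral sequential znode that disappears
# automatically when this process (and hence its ZooKeeper session) dies.
my_path = zk.create(PARENT + "/proc-", str(os.getpid()).encode(),
                    ephemeral=True, sequence=True)
my_name = my_path.rsplit("/", 1)[1]

def restart_failed_process(failed):
    # Hypothetical helper: fork/exec replacements for the failed workers.
    print("master", my_name, "restarting:", failed)

known = set()

def check_membership(event=None):
    # Read the current members and re-arm the watch (watches fire only once).
    global known
    members = sorted(zk.get_children(PARENT, watch=check_membership))
    failed = known - set(members)
    known = set(members)
    if failed and members and members[0] == my_name:
        # A process disappeared and we hold the smallest surviving sequence
        # number, so this process acts as master and restarts it.
        restart_failed_process(failed)

check_membership()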
Tips on Assignment

▪ Install ZooKeeper
▪ Create a ZooKeeper client as a process and register it with ZooKeeper
 Try out how to create a znode – persistent
 Persistent znodes are used for things like configuration information – the port on which to listen, etc.
 Then create an ephemeral znode
 Check what happens on ZooKeeper when the ZooKeeper client process is killed
 Now try an ephemeral sequential znode
 Try two instances of the process
 See what sequence numbers each one gets (see the sketch below)
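A small kazoo sketch of the three znode types mentioned above (the paths and data values are illustrative assumptions):

import os
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumption: local ZooKeeper
zk.start()

# Persistent znode: survives client disconnects; useful for configuration
# such as the port a service should listen on.
if not zk.exists("/app/config/port"):
    zk.create("/app/config/port", b"8080", makepath=True)

# Ephemeral znode: deleted automatically when this client's session ends.
# Kill this process and inspect /app with zkCli.sh to see it disappear.
zk.create("/app/alive-%d" % os.getpid(), b"", ephemeral=True)

# Ephemeral sequential znode: ZooKeeper appends an increasing counter,
# e.g. /app/members/m-0000000003; run two instances and compare numbers.
path = zk.create("/app/members/m-", b"", ephemeral=True, sequence=True,
                 makepath=True)
print("my member znode:", path)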
Recall: Reading a file
How does this translate to HDFS?

▪ Consider NameNodes and DataNodes separately.
▪ Consider only NameNodes for now.
▪ NameNode
 Keep two NameNodes – in an Active and Standby configuration
Class Exercise (10 mins)

▪ How will you configure ZooKeeper for
 Detecting failure of the Active NameNode?
 Detecting failure of the Standby NameNode?
HDFS Failover
▪ Failover Controller
 Handles the transition from active to standby
 Failover controllers are pluggable
 ZooKeeper is used to ensure that only one namenode is active
▪ Manual trigger for maintenance
▪ Ungraceful failure
 Is it a real failure?
 Slow network – the active may be treated as failed, but it is still up and running
HDFS Failover Config

▪ HDFS uses ZooKeeper for
 Failure detection
 Each NameNode maintains a persistent session with ZooKeeper
 If the machine crashes, the session expires
 ZooKeeper then notifies the other NameNode of the crash (see the sketch below)
 Active NameNode election
 On a crash, the other node takes an exclusive lock, indicating that it is the leader

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Architecture
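A hedged kazoo sketch of the failure-detection idea: the standby places a watch on the active's ephemeral lock znode and is notified when the active's session expires. The znode path and ensemble addresses are illustrative assumptions; the real failover controller implements this with its own elector.

from kazoo.client import KazooClient

ACTIVE_LOCK = "/hadoop-ha/mycluster/ActiveStandbyElectorLock"  # illustrative path
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")           # assumed ensemble
zk.start()

def on_lock_change(event):
    # Fires when the znode is created or deleted; deletion means the
    # active NameNode's session expired, so a failover can be triggered.
    print("active lock changed:", event.type)

stat = zk.exists(ACTIVE_LOCK, watch=on_lock_change)
if stat is None:
    print("no NameNode currently holds the active lock")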
Failover Controller
▪ ZooKeeper Failover Controller (ZKFC) – monitors and manages the state of the NameNode
▪ Runs on the NameNode machine
▪ Responsible for
 Health monitoring
 Pings the local NameNode with a health-check command
 Session management
 Keeps an open session with ZooKeeper
 Also holds the lock for the active NameNode
 ZooKeeper-based election
 If the local NameNode is healthy and no other node holds the lock, it initiates the process to acquire the lock
 Success → it has won the election (see the sketch below)
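A minimal sketch of the "exclusive lock = winner of the election" step using kazoo's Lock recipe. The lock path, host names, and the serve_as_active() helper are illustrative assumptions; the actual failover controller uses its own elector rather than this recipe.

import socket
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ensemble
zk.start()

def serve_as_active():
    # Hypothetical helper: transition the local NameNode to active state.
    print(socket.gethostname(), "is now the active NameNode")

# Whichever failover controller acquires this lock has won the election;
# the other controller blocks here and stays standby until the lock frees.
lock = zk.Lock("/hdfs-ha/active-lock", socket.gethostname())
with lock:
    serve_as_active()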
Class Exercise (10 mins)

▪ The Namenode stores
 Edit logs and the fsimage in persistent storage
 In-memory block mappings
▪ How will you get the standby node to keep track of this information?
Method 1: Shared storage

▪ Regular operation
 Shared storage stores the edit logs
 DataNodes send block mappings (block reports) to both the Active and the Standby
▪ On failure
 The Standby already has the block mappings in memory
 It retrieves the edit logs from shared storage
[Figure: edit logs on shared storage; DataNodes send block reports to both NameNodes; block mappings kept in namenode memory; clients get transparent failover]
Method 2: Quorum Journal Manager
▪ Additional servers configured as JournalNodes
▪ At least 3 (an odd number)
▪ Active NameNode write
 Each edit is sent to the JournalNodes
 A majority must acknowledge for the write to succeed
▪ Standby NameNode read
 Watches for changes in the edit logs and applies them before becoming active
▪ Only one writer is allowed at a time (see the quorum-write sketch below)
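A toy Python model of the majority-acknowledgement rule. The JournalNode addresses and the send_edit() helper are illustrative assumptions; the real Quorum Journal Manager protocol also handles writer epochs and log recovery.

JOURNAL_NODES = ["jn1:8485", "jn2:8485", "jn3:8485"]   # assumed addresses

def send_edit(journal_node, edit):
    # Hypothetical helper: returns True if this JournalNode durably
    # persisted the edit, False otherwise.
    raise NotImplementedError

def write_edit(edit):
    acks = sum(1 for jn in JOURNAL_NODES if send_edit(jn, edit))
    majority = len(JOURNAL_NODES) // 2 + 1
    if acks < majority:
        raise IOError("edit not acknowledged by a majority of JournalNodes")
    # The edit is durable once a majority (2 of 3 here) has acknowledged it.
    return acks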
Class Exercise (10 mins)

▪ How do we handle the case when two namenodes are being updated by two clients in parallel?
 Will the data be consistent?
 How long will it take to get the standby to active mode?
HDFS Fencing

▪ Prevent a previously active namenode from doing damage and causing corruption
▪ Fencing mechanisms
 Killing the namenode's process
 Revoking its access to the shared storage directory
 Disabling its network port via a remote management command
 When all else fails: STONITH ("shoot the other node in the head") – forcibly power down the other node
HDFS Fencing

▪ Shared storage
 Should be written to by only one writer
 Fencing is needed to ensure that both NameNodes do not write
▪ Quorum storage
 The JournalNodes permit only one writer at a time
 The standby has to become the writer
 This ensures that two writers cannot write simultaneously
 Fencing is not a must (a fencing sketch follows below)
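A hedged sketch of what a site-specific fencing script might do: ssh to the old active host and kill its NameNode process. The host name, ssh access, and process pattern are illustrative assumptions; production clusters typically rely on Hadoop's built-in sshfence method or an equivalent shell fencer.

import subprocess

def fence_old_active(host="nn1.example.com"):
    # Kill the NameNode process on the old active host over ssh.
    # 'proc_namenode' is an assumed pattern on the JVM command line.
    result = subprocess.run(
        ["ssh", host, "pkill -9 -f proc_namenode"],
        capture_output=True, timeout=30)
    # Return code 0 means a process was found and killed; treat that as
    # successful fencing so the standby may safely take over.
    return result.returncode == 0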
Namenode memory requirements

▪ How do we estimate the amount of memory?
 Determine what is stored
 Make some assumptions
 Do a quick back-of-the-envelope calculation
Namenode memory requirements

▪ What takes up space?
 #blocks/file
 Filename length
 #directories
▪ Rule of thumb – 1000 MB per million blocks
▪ Solution:
 Limit the responsibility of each node
▪ Example calculation
 200-node cluster
 24 TB/node
 128 MB block size
 Replication factor = 3
▪ How much space is required?
 #blocks = 200 * 24 * 2^20 / (128 * 3) ≈ 13 million blocks
 ≈ 13,000 MB of memory (see the worked calculation below)
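The back-of-the-envelope calculation spelled out in Python (this follows the slide in taking 1 TB = 2^20 MB; with decimal terabytes the result is closer to 12.5 million blocks):

nodes = 200
tb_per_node = 24
block_size_mb = 128
replication = 3
mb_per_tb = 2 ** 20                 # 1 TB = 1,048,576 MB (binary units)

raw_mb = nodes * tb_per_node * mb_per_tb
blocks = raw_mb / (block_size_mb * replication)
print(f"blocks: {blocks:,.0f}")                    # ~13.1 million

# Rule of thumb: ~1000 MB of NameNode heap per million blocks.
heap_mb = blocks / 1_000_000 * 1000
print(f"NameNode heap: ~{heap_mb:,.0f} MB")        # ~13,000 MB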
Secondary Namenodes
▪ Not a hot standby for the active namenode
▪ Connects to the active NN regularly
▪ Housekeeping
 Keeps a backup of the NN metadata
▪ Prevents the edit logs from growing too large
 Improves recovery performance
▪ Runs on a separate machine
 Requires as much RAM as the namenode
HDFS Federation

▪ Each namenode manages a portion of the filesystem namespace
 Each namenode manages a namespace volume
 Metadata for the namespace
 Blocks for the files
▪ Introduced in the 0.23 release
Class Exercise (10 mins)

▪ OK, we have ensured that the namenode is up; how do we now ensure that the data is also available?
▪ How will you organize the datanodes so that data is always available?
▪ How is the failure of a datanode taken care of?
Solution

▪ Replicate the data
▪ Write pipeline
 Data is broken into packets
 Each packet is acknowledged
 When a datanode in the pipeline detects an error, such as a checksum failure, it removes itself from the pipeline
 The client then reconstructs the pipeline (a simplified sketch follows below)
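A much-simplified Python sketch of the idea: packets carry checksums, a datanode that sees a corrupt packet drops out, and the client rebuilds the pipeline from the survivors. All names here are illustrative; the real HDFS write pipeline is considerably more involved.

import random
import zlib

def make_packet(data: bytes):
    return {"data": data, "checksum": zlib.crc32(data)}

def datanode_stores_ok(datanode: str, packet) -> bool:
    # Toy stand-in for "the datanode verified the packet's checksum".
    # Here a datanode 'corrupts' a packet 1% of the time, at random.
    data = packet["data"] if random.random() > 0.01 else packet["data"] + b"!"
    return zlib.crc32(data) == packet["checksum"]

def write_block(packets, pipeline):
    # pipeline: ordered list of datanodes, e.g. ["dn1", "dn2", "dn3"]
    for packet in packets:
        surviving = [dn for dn in pipeline if datanode_stores_ok(dn, packet)]
        if surviving != pipeline:
            # A datanode saw a checksum failure and removed itself;
            # the client reconstructs the pipeline from the survivors.
            pipeline = surviving
    return pipeline

final_pipeline = write_block([make_packet(b"x" * 64 * 1024)] * 100,
                             ["dn1", "dn2", "dn3"])
print("pipeline after write:", final_pipeline)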


Recall: Failure Masking by Redundancy
▪ Strategy: hide the occurrence of failure from other processes using redundancy.
 Information Redundancy – add extra bits to allow for error detection/recovery (e.g., Hamming codes and the like). In HDFS: checksums in the DataNodes.
 Time Redundancy – perform the operation and, if need be, perform it again. Think about how transactions work (BEGIN/END/COMMIT/ABORT). In HDFS: pipeline recovery.
 Physical Redundancy – add extra (duplicate) hardware and/or software to the system. In HDFS: multiple replicas.
Review Q’s

▪ Why is HDFS suited for streaming operations?
▪ Why is the block size 128 MB?
▪ "The secondary namenode is for handling HA" – T/F?
▪ "Edit logs are stored on shared storage" – why?
Review Q - A

▪ Why is HDFS suited for streaming operations? – It is optimized for read-mostly operation
▪ Why is the block size 128 MB? – To allow for very large files
▪ "The secondary namenode is for handling HA" – T/F? – False; the secondary namenode is mainly for quick startup and housekeeping
▪ "Edit logs are stored on shared storage" – why? – For HA
Video

▪ https://www.ted.com/talks/del_harvey_the_strangeness_of_scale_at_twitter?language=en
 Please prepare a presentation
Further reading
Read – Considerations

▪ Network latency
▪ Data center layout
 Organized as racks
▪ Operations within a rack are faster
▪ Distance metric
 The network is represented as a tree
 Distance between two nodes = sum of their distances to the closest common ancestor (see the sketch below)