Cloudera
________________________________________________________________________________________________
https://www.pass4sures.com/
Page No | 2
Question 1
Which two updates occur when a client application opens a stream to begin a file write on a cluster running MapReduce v1 (MRv1)?
A. Once the write stream closes on the DataNode, the DataNode immediately initiates a block report to the NameNode.
B. The change is written to the NameNode disk.
C. The metadata in the RAM on the NameNode is flushed to disk.
D. The metadata in RAM on the NameNode is flushed to disk.
E. The metadata in RAM on the NameNode is updated.
F. The change is written to the edits file.
Answer: E, F
Question 2
For a MapReduce job on a cluster running MapReduce v1 (MRv1), what is the relationship between tasks and task attempts?
A. There are always at least as many task attempts as there are tasks.
B. There are always at most as many task attempts as there are tasks.
C. There are always exactly as many task attempts as there are tasks.
D. The developer sets the number of task attempts on job submission.
Answer: A
Question 3
A. The NameNode forces re-replication of all the blocks which were stored on the dead DataNode.
B. The next time a client submits a job that requires blocks from the dead DataNode, the JobTracker receives no heartbeats from the DataNode. The JobTracker tells the NameNode that the DataNode is dead, which triggers block re-replication on the cluster.
C. The replication factor of the files which had blocks stored on the dead DataNode is temporarily reduced, until the dead DataNode is recovered and returned to the cluster.
D. The NameNode informs the client which wrote the blocks that are no longer available; the client then re-writes the blocks to a different DataNode.
Answer: A
Explanation:
How does the NameNode handle DataNode failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a data node after a certain amount of time, the data node is marked as dead. Since blocks will be under-replicated, the system begins replicating the blocks that were stored on the dead datanode. The NameNode orchestrates the replication of data blocks from one datanode to another. The replication data transfer happens directly between datanodes, and the data never passes through the namenode.
Note: If the NameNode stops receiving heartbeats from a DataNode, it presumes it to be dead and any data it had to be gone as well. Based on the block reports it had been receiving from the dead node, the NameNode knows which copies of blocks died along with the node and can make the decision to re-replicate those blocks to other DataNodes. It will also consult the Rack Awareness data in order to maintain the "two copies in one rack, one copy in another rack" replica rule when deciding which DataNode should receive a new copy of the blocks.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How NameNode Handles data node failures
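The failure-detection flow described above can be sketched as a toy model (class names, method names, and the timeout value are illustrative, not Hadoop's actual API or defaults):

```python
HEARTBEAT_TIMEOUT = 600  # seconds; illustrative, not the real NameNode default

class NameNodeView:
    """Toy model of how a NameNode tracks DataNode liveness and block replicas."""
    def __init__(self, replication_factor=3):
        self.replication_factor = replication_factor
        self.last_heartbeat = {}   # datanode -> time of last heartbeat
        self.block_locations = {}  # block id -> set of datanodes holding a replica

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, blocks):
        for b in blocks:
            self.block_locations.setdefault(b, set()).add(datanode)

    def dead_nodes(self, now):
        """A node that has missed heartbeats for too long is marked dead."""
        return {dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT}

    def under_replicated(self, now):
        """Blocks whose live replica count fell below the target, with the deficit."""
        dead = self.dead_nodes(now)
        result = []
        for block, nodes in self.block_locations.items():
            live = nodes - dead
            if len(live) < self.replication_factor:
                result.append((block, self.replication_factor - len(live)))
        return result
```

The NameNode would then schedule re-replication for each deficit, with the actual block transfers happening directly between datanodes and rack awareness consulted when picking targets.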
Question 4
How does the NameNode know DataNodes are available on a cluster running MapReduce v1 (MRv1)?
A. DataNodes are listed in the dfs.hosts file. The NameNode uses this as the definitive list of available DataNodes.
B. DataNodes heartbeat in to the master on a regular basis.
C. The NameNode broadcasts a heartbeat on the network on a regular basis, and DataNodes respond.
D. The NameNode sends a broadcast across the network when it first starts, and DataNodes respond.
Answer: B
Explanation:
How does the NameNode handle DataNode failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a data node after a certain amount of time, the data node is marked as dead. Since blocks will be under-replicated, the system begins replicating the blocks that were stored on the dead datanode. The NameNode orchestrates the replication of data blocks from one datanode to another. The replication data transfer happens directly between datanodes, and the data never passes through the namenode.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How NameNode Handles data node failures
Question 5
A. Use distcp to copy files only between two or more clusters. You cannot use distcp to copy data between directories inside the same cluster.
B. Use distcp to copy HBase table files.
C. Use distcp to copy physical blocks from the source to the target destination in your cluster.
D. Use distcp to copy data between directories inside the same cluster.
E. Use distcp to run an internal MapReduce job to copy files.
Answer: B, D, E
Explanation:
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution.
Reference: Hadoop DistCp Guide
Question 6
A. HDFS Federation improves the resiliency of HDFS in the face of network issues by removing the NameNode as a single point of failure.
B. HDFS Federation allows the Standby NameNode to automatically resume the services of an active NameNode.
C. HDFS Federation provides cross-data-center (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
D. HDFS Federation reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
Answer: D
Explanation:
HDFS Federation: In order to scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces. The Namenodes are federated; that is, the Namenodes are independent and don't require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes.
Reference: Apache Hadoop 2.0.2-alpha, http://hadoop.apache.org/docs/current/
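The arrangement described above, where one DataNode serves several independent NameNodes, can be sketched as a toy model (names are illustrative; this is not the actual HDFS block-pool API):

```python
class FederatedDataNode:
    """Toy model of federation: one DataNode keeps a separate block pool
    for each independent NameNode it has registered with."""
    def __init__(self, namenodes):
        self.pools = {nn: set() for nn in namenodes}  # one block pool per NameNode

    def store(self, namenode, block_id):
        """Store a block on behalf of one NameNode's namespace."""
        self.pools[namenode].add(block_id)

    def block_report(self, namenode):
        """Each NameNode receives a report covering only its own block pool."""
        return sorted(self.pools[namenode])
```

The key point the model shows is the independence: a block report sent to one NameNode never mentions blocks belonging to another namespace.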
Question 7
Which best describes a Hadoop cluster's block size storage parameters once you set the HDFS default block size to 64 MB?
A. The block size of files in the cluster can be determined as the block is written.
B. The block size of files in the cluster will all be multiples of 64 MB.
C. The block size of files in the cluster will all be at least 64 MB.
D. The block size of files in the cluster will all be exactly 64 MB.
Answer: D
Explanation:
Note: What is HDFS block size? How is it different from traditional file system block size? In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS block size cannot be compared with the traditional file system block size.
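As a quick illustration of the arithmetic implied above, a minimal sketch of how many blocks a file occupies, assuming the question's 64 MB default (function name is illustrative):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default assumed by the question

def num_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a file of file_size bytes is split into."""
    return -(-file_size // block_size)  # ceiling division
```

A 150 MB file, for example, is split into 3 blocks, and each block is then replicated three times by default.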
Question 8
Which MapReduce daemon instantiates user code, and executes map and reduce tasks on a cluster running MapReduce v1 (MRv1)?
A. NameNode
B. DataNode
C. JobTracker
D. TaskTracker
E. ResourceManager
F. ApplicationMaster
G. NodeManager
Answer: D
Explanation:
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Note: How many daemon processes run on a Hadoop system? Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM. The following 3 daemons run on master nodes:
NameNode - This daemon stores and maintains the metadata for HDFS.
Secondary NameNode - Performs housekeeping functions for the NameNode.
JobTracker - Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker.
The following 2 daemons run on each slave node:
DataNode - Stores actual HDFS data blocks.
TaskTracker - Responsible for instantiating and monitoring individual Map and Reduce tasks.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster
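The slot accounting described above can be sketched as a toy model (not the real TaskTracker API; forking the child JVM is reduced to a comment):

```python
class TaskTrackerModel:
    """Toy model of MRv1 slot accounting on a TaskTracker."""
    def __init__(self, map_slots=2, reduce_slots=2):
        self.free = {"map": map_slots, "reduce": reduce_slots}
        self.running = []

    def accept(self, task_id, kind):
        """Accept a task from the JobTracker if a slot of the right kind is free."""
        if self.free[kind] == 0:
            return False  # next heartbeat will report no free slot of this kind
        self.free[kind] -= 1
        self.running.append(task_id)  # the real daemon forks a child JVM here
        return True

    def finished(self, task_id, kind):
        """Task instance exited; free the slot and (in reality) notify the JobTracker."""
        self.running.remove(task_id)
        self.free[kind] += 1
```

The design point the sketch mirrors is isolation: because each task instance runs in its own child JVM, a crashing task cannot take the TaskTracker daemon down with it.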
Question 9
Which two actions must you take if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes?
A. You must restart the NameNode daemon to apply the changes to the cluster.
B. You must restart all six DataNode daemons to apply the changes to the cluster.
C. You don't need to restart any daemon, as they will pick up changes automatically.
D. You must modify the configuration files on each of the six DataNode machines.
E. You must modify the configuration files on only one of the DataNode machines.
F. You must modify the configuration files on the NameNode only. DataNodes read their configuration from the master nodes.
Answer: A, F
Question 10
Identify the function performed by the Secondary NameNode daemon on a cluster configured to run with a single NameNode.
A. In this configuration, the Secondary NameNode performs a checkpoint operation on the files used by the NameNode.
B. In this configuration, the Secondary NameNode is a standby NameNode, ready to fail over and provide high availability.
C. In this configuration, the Secondary NameNode performs real-time backups of the NameNode.
D. In this configuration, the Secondary NameNode serves as an alternate data channel for clients to reach HDFS, should the NameNode become too busy.
Answer: A
Explanation:
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event can it replace the primary name-node in case of its failure. The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads the current name-node image and edits log files, joins them into a new image, and uploads the new image back to the (primary and only) name-node. So if the name-node fails and you can restart it on the same physical node, then there is no need to shut down the data-nodes; just the name-node needs to be restarted. If you cannot use the old node anymore, you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before failure, if available, or on the secondary name-node. The latter will be the latest checkpoint without subsequent edits logs; that is, the most recent namespace modifications may be missing there. You will also need to restart the whole cluster in this case.
Reference: Hadoop Wiki, What is the purpose of the secondary name-node?
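The checkpoint operation described above amounts to replaying the edits log onto the image. A minimal sketch, with the image modeled as a dict and edits as (op, path, value) tuples (all names and the edit-record shape are illustrative):

```python
def checkpoint(fsimage, edits):
    """Merge an fsimage with its edits log into a new image, as the
    secondary name-node does periodically.

    fsimage: dict mapping path -> metadata
    edits:   list of (op, path, value) records, replayed in order
    Returns (new_image, empty_edits_log)."""
    image = dict(fsimage)  # work on a copy; the namenode keeps serving meanwhile
    for op, path, value in edits:
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    return image, []  # the merged image goes back to the namenode; edits start fresh
```

This also shows why the secondary is not a failover node: it only produces the merged image, and any edits made after the last checkpoint exist nowhere else but on the primary.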
Question 11
You install Cloudera Manager on a cluster where each host has 1 GB of RAM. All of the services show their status as Concerning. However, all jobs submitted complete without an error. Why is Cloudera Manager showing the Concerning status for the services?
Answer: B
Explanation:
Concerning: There is an irregularity in the status of a service instance or role instance, but Cloudera Manager calculates that the instance might recover. For example, if the number of missed heartbeats exceeds a configurable threshold, the health status becomes Concerning. Or, if an instance is running on a host and the host is rebooted, the instance will be reported as In Progress for some period of time while it is restarting. Because the instance is expected to be Started, its health will be reported as Concerning until it transitions to Started.
Note:
Bad: The service instance or role instance is not performing or did not finish performing the last command as expected, and Cloudera Manager calculates that the instance will not recover. For example, if the number of missed heartbeats exceeds a second (higher) configurable threshold, the health status becomes Bad. Another example of bad health is if a role you have stopped is actually still running, or a started role has stopped unexpectedly.
Good: The service instance or role instance is performing or has finished performing the last command as expected. This does not necessarily mean the service is running; it means it is behaving as expected. For example, if you clicked Stop to stop a role instance and it stopped successfully, then that role instance has a Good health status, even though it is not running.
Reference: About Service, Role, and Host Health
Question 12
What is the recommended disk configuration for slave nodes in your Hadoop cluster with 6 x 2 TB hard drives?
A. RAID 10
B. JBOD
C. RAID 5
D. RAID 1+0
Answer: B
Explanation:
Note: Let me be clear here... there are absolutely times when using an Enterprise-class storage device makes perfect sense. But for Hadoop it is very much unnecessary, and it is these three areas that I am going to hit, as well as some others, that I hope will demonstrate that Hadoop works best with inexpensive, internal storage in JBOD mode. Some of you might say "if you lose a disk in a JBOD configuration, you're toast... you lose everything". While this might be true elsewhere, with Hadoop it isn't. Not only do you have the benefit that JBOD gives you in speed, you have the benefit that the Hadoop Distributed File System (HDFS) negates this risk. HDFS basically creates three copies of the data. This is a very robust way to guard against data loss due to a disk failure or node outage, so you can eliminate the need for performance-reducing RAID.
Reference: Hadoop and Storage Area Networks
Question 13
You configure your cluster with HDFS High Availability (HA) using Quorum-based storage. You do not implement HDFS Federation. What is the maximum number of NameNode daemons you should run on your cluster in order to avoid a "split-brain" scenario with your NameNodes?
A. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy.
B. Two active NameNodes and one Standby NameNode
C. One active NameNode and one Standby NameNode
D. Two active NameNodes and two Standby NameNodes
Answer: C
Explanation:
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
Note: It is vital for the correct operation of an HA cluster that only one of the NameNodes be active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active NameNode to safely proceed with failover.
Reference: Cloudera CDH4 High Availability Guide, Quorum-based Storage
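The single-writer guarantee described above can be sketched with an epoch-number scheme, loosely modeled on quorum-journal fencing (this is a toy illustration, not the actual QJM protocol or API):

```python
class JournalNodeModel:
    """Toy model of journal fencing: a writer must present an epoch newer
    than any the journal has promised, and stale writers are rejected."""
    def __init__(self):
        self.promised_epoch = 0
        self.log = []

    def new_epoch(self, epoch):
        """A NameNode becoming active claims a strictly newer epoch."""
        if epoch <= self.promised_epoch:
            return False  # a would-be writer with an old epoch is refused
        self.promised_epoch = epoch
        return True

    def write(self, epoch, record):
        """Edits are accepted only from the holder of the current epoch."""
        if epoch < self.promised_epoch:
            return False  # the fenced-out former active cannot keep writing
        self.log.append(record)
        return True
```

This is how a split brain is prevented even if the old active NameNode is still running: once the new active claims a higher epoch, the journal silently fences the old writer out.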
Question 14
You configure a Hadoop cluster with both MapReduce frameworks, MapReduce v1 (MRv1) and MapReduce v2 (MRv2/YARN). Which two MapReduce (computational) daemons do you need to configure to run on your master nodes?
A. JobTracker
B. ResourceManager
C. ApplicationMaster
D. JournalNode
E. NodeManager
Answer: A, B
Explanation:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
As you can see, the ApplicationMaster runs on slave nodes instead of master nodes, so it won't be the answer. Only the JobTracker and ResourceManager are MapReduce daemons running on master nodes, so those are the answers.
Question 15
You observe that the number of spilled records from map tasks far exceeds the number of map output records. Your child heap size is 1 GB and your io.sort.mb value is set to 100 MB. How would you tune your io.sort.mb value to achieve maximum memory-to-disk I/O ratio?
A. Tune the io.sort.mb value until you observe that the number of spilled records equals (or is as close as possible to) the number of map output records.
B. Decrease the io.sort.mb value below 100 MB.
C. Increase the io.sort.mb value as high as you can, as close to 1 GB as possible.
D. For a 1 GB child heap size, an io.sort.mb of 128 MB will always maximize memory-to-disk I/O.
Answer: A
Explanation:
Here are a few tradeoffs to consider:
1. The number of seeks being done when merging files. If you increase the merge factor too high, then the seek cost on disk will exceed the savings from doing a parallel merge (note that OS cache might mitigate this somewhat).
2. Increasing the sort factor decreases the amount of data in each partition. I believe the number is io.sort.mb / io.sort.factor for each partition of sorted data. I believe the general rule of thumb is to have io.sort.mb = 10 * io.sort.factor (this is based on the seek latency of the disk and the transfer speed, I believe; I'm sure this could be tuned better if it was your bottleneck). If you keep these in line with each other, then the seek overhead from merging should be minimized.
3. If you increase io.sort.mb, then you increase memory pressure on the cluster, leaving less memory available for job tasks. Memory usage for sorting is mapper tasks * io.sort.mb, so you could find yourself causing extra GCs if this is too high.
Essentially, if you find yourself swapping heavily, then there's a good chance you have set the sort factor too high. If the ratio between io.sort.mb and io.sort.factor isn't correct, then you may need to change io.sort.mb (if you have the memory) or lower the sort factor. If you find that you are spending more time in your mappers than in your reducers, then you may want to increase the number of map tasks and decrease the sort factor (assuming there is memory pressure).
Reference: How could I tell if my hadoop config parameter io.sort.factor is too small or too big?
http://stackoverflow.com/questions/8642566/how-could-i-tell-if-my-hadoop-config-parameter-iosort-factor-is-too-small-or-to
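A rough back-of-the-envelope for the spill behavior in the question (deliberately simplified: it ignores record-metadata buffers and assumes a hypothetical spill threshold of 80%, standing in for io.sort.spill.percent):

```python
import math

def estimated_spills(map_output_mb, io_sort_mb, spill_percent=0.80):
    """Rough count of spill files a map task writes: the in-memory sort
    buffer (io.sort.mb) is flushed to disk each time it fills to the
    spill threshold, so more output per buffer means more spills."""
    usable_mb = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / usable_mb))
```

With roughly 1 GB of map output and io.sort.mb = 100 MB, this estimates over a dozen spills per task; raising io.sort.mb moves the task toward the single-spill goal that answer A describes, at the cost of heap pressure discussed in point 3 above.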
Question 16
Your Hadoop cluster has 25 nodes with a total of 100 TB (4 TB per node) of raw disk space allocated to HDFS storage. Assuming Hadoop's default configuration, how much data will you be able to store?
A. Approximately 100 TB
B. Approximately 25 TB
C. Approximately 10 TB
D. Approximately 33 TB
Answer: D
Explanation:
In the default configuration there are a total of 3 copies of a data block on HDFS: 2 copies are stored on datanodes on the same rack, and the 3rd copy on a different rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the HDFS Blocks are replicated?
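The arithmetic behind answer D, as a one-line sketch (it ignores non-HDFS overhead such as temporary, log, and operating-system space, so real usable capacity is somewhat lower):

```python
def usable_hdfs_tb(raw_tb, replication=3):
    """Usable HDFS capacity given raw disk and the default replication factor:
    every block is stored `replication` times, so raw space divides by it."""
    return raw_tb / replication
```

For the cluster in the question, 100 TB of raw space divided by the default replication factor of 3 gives approximately 33 TB of storable data.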
Question 17
You set up the Hadoop cluster using NameNode Federation. One NameNode manages the /users namespace and one NameNode manages the /data namespace. What happens when a client tries to write a file to /reports/myreport.txt?
Answer: C
Explanation:
Note:
* The current HDFS architecture allows only a single namespace for the entire cluster. A single Namenode manages this namespace. HDFS Federation addresses this limitation of the current architecture by adding support for multiple Namenodes/namespaces to the HDFS file system.
* HDFS Federation enables multiple NameNodes in a cluster for horizontal scalability of the NameNode. All these NameNodes work independently and don't require any coordination. A DataNode can register with multiple NameNodes in the cluster and can store the data blocks for multiple NameNodes.
Question 18
Answer: A, C
Explanation:
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have what we call MapReduce 2.0 (MRv2) or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
C: YARN, as an aspect of Hadoop, has two major kinds of benefits:
The ability to use programming frameworks other than MapReduce.
Scalability, no matter what programming framework you use.
Question 19
The most important consideration for slave nodes in a Hadoop cluster running production jobs that require short turnaround times is:
A. The ratio between the amount of memory and the number of disk drives.
B. The ratio between the amount of memory and the total storage capacity.
C. The ratio between the number of processor cores and the amount of memory.
D. The ratio between the number of processor cores and total storage capacity.
E. The ratio between the number of processor cores and the number of disk drives.
Answer: E
Question 20
The failure of which daemon makes HDFS unavailable on a cluster running MapReduce v1 (MRv1)?
A. Node Manager
B. Application Manager
C. Resource Manager
D. Secondary NameNode
E. NameNode
F. DataNode
Answer: E
Explanation:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster. The NameNode runs in its own JVM process. In a typical production cluster it runs on a separate machine. The NameNode is a Single Point of Failure for the HDFS cluster. When the NameNode goes down, the file system goes offline.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
Question 21
Choose three reasons why you should run the HDFS balancer periodically.
Answer: A, B, E
Explanation:
The balancer is a tool that balances disk space usage on an HDFS cluster when some datanodes become full or when new empty nodes join the cluster. The tool is deployed as an application program that can be run by the cluster administrator on a live HDFS cluster while applications are adding and deleting files. The threshold parameter is a fraction in the range of (0%, 100%) with a default value of 10%. The threshold sets a target for whether the cluster is balanced. A cluster is balanced if, for each datanode, the utilization of the node (ratio of used space at the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space in the cluster to total capacity of the cluster) by no more than the threshold value. The smaller the threshold, the more balanced a cluster will become. It takes more time to run the balancer for small threshold values. Also, for a very small threshold, the cluster may not be able to reach the balanced state when applications write and delete files concurrently. The tool moves blocks from highly utilized datanodes to poorly utilized datanodes iteratively. In each iteration a datanode moves or receives no more than the lesser of 10 GB or the threshold fraction of its capacity. Each iteration runs no more than 20 minutes. At the end of each iteration, the balancer obtains updated datanode information from the namenode.
Reference:
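The balance criterion described above, as a minimal sketch (function and parameter names are illustrative):

```python
def is_balanced(node_used, node_capacity, cluster_used, cluster_capacity,
                threshold=0.10):
    """A datanode is balanced when its utilization is within `threshold`
    of the cluster-wide utilization (default threshold 10%, as above)."""
    node_util = node_used / node_capacity
    cluster_util = cluster_used / cluster_capacity
    return abs(node_util - cluster_util) <= threshold
```

For example, on a cluster that is 50% full, a node at 55% utilization is within the default 10% threshold and is left alone, while a node at 65% would have blocks moved off it by the balancer.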
Question 22
Answer: E
Explanation:
Ganglia itself collects metrics, such as CPU and memory usage; by using GangliaContext, you can inject Hadoop metrics into Ganglia.
Note:
Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage and network usage. Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you may easily see the number of bytes written by a particular HDFS datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, time spent in garbage collection and many, many others. Hadoop and HBase use the GangliaContext class to send the metrics collected by each daemon (such as the datanode, tasktracker, jobtracker, HMaster, etc.) to gmonds.
Question 23
In a cluster configured with HDFS High Availability (HA) but NOT HDFS Federation, each map task runs:
Answer: C
Explanation:
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Note: Despite this very high level of reliability, HDFS has always had a well-known single point of failure which impacts HDFS's availability: the system relies on a single NameNode to coordinate access to the file system data. In clusters which are used exclusively for ETL or batch-processing workflows, a brief HDFS outage may not have immediate business impact on an organization; however, in the past few years we have seen HDFS begin to be used for more interactive workloads or, in the case of HBase, used to directly serve customer requests in real time. In cases such as this, an HDFS outage will immediately impact the productivity of internal users, and perhaps result in downtime visible to external users. For these reasons, adding high availability (HA) to the HDFS NameNode became one of the top priorities for the HDFS community.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster
Question 24
Where does a MapReduce job store the intermediate data output from Mappers?
A. On the underlying filesystem of the local disk of the machine on which the JobTracker ran.
B. In HDFS, in the job's output directory.
C. In HDFS, in a temporary directory defined by mapred.tmp.dir.
D. On the underlying filesystem of the local disk of the machine on which the Mapper ran.
E. On the underlying filesystem of the local disk of the machine on which the Reducer ran.
Answer: D
Explanation:
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output (intermediate key-value data) stored?
Question 25
Answer: A, D
Explanation:
Hadoop can use the Kerberos protocol to ensure that when someone makes a request, they really are who they say they are. This mechanism is used throughout the cluster. In a secure Hadoop configuration, all of the Hadoop daemons use Kerberos to perform mutual authentication, which means that when two daemons talk to each other, they each make sure that the other daemon is who it says it is. Additionally, this allows the NameNode and JobTracker to ensure that any HDFS or MR requests are being executed with the appropriate authorization level.
Reference: CDH3 Documentation, CDH3 Security Guide, Introduction to Hadoop Security
Question 26
You are running a Hadoop cluster with a NameNode on host mynamenode, a secondary NameNode on host mysecondary, and DataNodes. Which best describes how you determine when the last checkpoint happened?
A. Execute hdfs dfsadmin -report on the command line and look at the Last Checkpoint information.
B. Execute hdfs dfsadmin -saveNamespace on the command line, which returns to you the last checkpoint value in the fstime file.
C. Connect to the web UI of the Secondary NameNode (http://mysecondarynamenode:50000) and look at the "Last Checkpoint" information.
D. Connect to the web UI of the NameNode (http://mynamenode:50000/) and look at the "Last Checkpoint" information.
Answer: C
Explanation:
Note: SecondaryNameNode is the worst name ever given to a module in the history of naming conventions. It is only a checkpoint server which actually gets a backup of the fsimage+edits files from the namenode. It basically serves as a checkpoint server. But it does not come up online automatically when the namenode goes down! Although the secondary namenode can be used to bring up the namenode in the worst-case scenario (manually), with some data loss.
Question 27
Identify the daemon that performs checkpoint operations of the namespace state in a cluster configured with HDFS High Availability (HA) using Quorum-based storage.
A. NodeManager
B. BackupNode
C. JournalNode
D. Standby NameNode
E. Secondary NameNode
F. CheckpointNode
G. NameNode
Answer: D
Explanation:
In an HA deployment, the Standby NameNode takes over the checkpointing role: it reads the shared edit log (stored on the JournalNodes under Quorum-based storage), applies it to its own copy of the namespace, and periodically writes a new fsimage checkpoint. For this reason a Secondary NameNode (or CheckpointNode or BackupNode) must not be run in a cluster configured for HA.
Question 28
Your existing Hadoop cluster has 30 slave nodes, each of which has 4 x 2TB hard drives. You plan to add another 10 nodes. How much disk space can your new nodes contain?
A. The new nodes must all contain 8TB of disk space, but it does not matter how the disks are configured
B. The new nodes cannot contain more than 8TB of disk space
C. The new nodes can contain any amount of disk space
D. The new nodes must all contain 4 x 2TB hard drives
Answer: C
Question 29
Cluster Summary: 45 files and directories, 12 blocks = 57 total. Heap Size is 15.31 MB / 103.38 MB (15%)
Answer: A
Explanation:
The data from the dead node is being replicated. The cluster is in safe mode.
Note:
* Safemode
During start-up the NameNode loads the filesystem state from the fsimage and edits log file. It then waits for DataNodes to report their blocks, so that it does not prematurely start replicating blocks even though enough replicas already exist in the cluster. During this time the NameNode stays in safemode. Safemode for the NameNode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to the filesystem or blocks. Normally the NameNode leaves safemode automatically at the beginning. If required, HDFS can be placed in safemode explicitly using the 'bin/hadoop dfsadmin -safemode' command. The NameNode front page shows whether safemode is on or off. A more detailed description and configuration is maintained as JavaDoc for setSafeMode().
* Data Disk Failure, Heartbeats and Re-Replication: Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates
replication whenever necessary. The necessity for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
* The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes and the data never passes through the NameNode.
Incorrect answers:
B: The data is not lost; it is being replicated.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How NameNode Handles data node failures?
Question 30
You have a cluster running with the FIFO Scheduler enabled. You submit a large job A to the cluster, which you expect to run for one hour. Then, you submit job B to the cluster, which you expect to run for only a couple of minutes. You submit both jobs with the same priority. Which two best describe how the FIFO Scheduler arbitrates the cluster resources for a job and its tasks?
A. Given jobs A and B submitted in that order, all tasks from job A are guaranteed to finish before all tasks from job B
B. The order of execution of tasks within a job may vary.
C. Tasks are scheduled in the order of their jobs' submission.
D. The FIFO Scheduler will give, on average, an equal share of the cluster resources over the job lifecycle.
E. Because there is more than a single job on the cluster, the FIFO Scheduler will enforce a limit on the percentage of resources allocated to a particular job at any given time.
F. The FIFO Scheduler will pass an exception back to the client when job B is submitted, since all slots on the cluster are in use.
Answer: B, C
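The two correct statements can be illustrated with a toy sketch (hypothetical job names, not a Hadoop API): under FIFO, tasks are dispatched strictly in job-submission order, even though completion order within a job can vary.

```python
def fifo_dispatch_order(jobs):
    """Return the task dispatch order under a FIFO scheduler.

    jobs: list of (job_name, task_count) tuples in submission order.
    Every task of an earlier job is dispatched before any task of a
    later job; completion order within a job may still vary.
    """
    order = []
    for name, tasks in jobs:
        order.extend("%s-task%d" % (name, i) for i in range(tasks))
    return order

# Job A (3 tasks) submitted before job B (2 tasks):
print(fifo_dispatch_order([("A", 3), ("B", 2)]))
```

All of A's tasks are handed out before any of B's, which is why the short job B can be starved behind the long job A.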
Question 31
On a cluster running MapReduce v1 (MRv1), a MapReduce job is given a directory of 10 plain text files as its input directory. Each file is made up of 3 HDFS blocks. How many Mappers will run?
Answer: B
Question 32
Your developers request that you enable them to use Hive on your Hadoop cluster. What do you install and/or configure?
A. Install the Hive interpreter on the client machines only, and configure a shared remote Hive Metastore.
B. Install the Hive interpreter on the client machines and all the slave nodes, and configure a shared remote Hive Metastore.
C. Install the Hive interpreter on the master node running the JobTracker, and configure a shared remote Hive Metastore.
D. Install the Hive interpreter on the client machines and all nodes on the cluster
Answer: A
Explanation:
The Hive interpreter runs on a client machine.
Question 33
How must you format the underlying filesystem of your Hadoop cluster's slave nodes running on Linux?
Answer: A
Explanation:
Reference: Hortonworks, Linux File Systems for HDFS
Question 34
Your cluster is running MapReduce v1 (MRv1), with default replication set to 3, and a cluster block size of 64MB. Identify which best describes the file read process when a client application connects to the cluster and requests a 50MB file.
A. The client queries the NameNode for the locations of the block, and reads all three copies. The first copy to complete transfer to the client is the one the client reads as part of Hadoop's execution framework.
B. The client queries the NameNode for the locations of the block, and reads from the first location in the list it receives.
C. The client queries the NameNode for the locations of the block, and reads from a random location in the list it receives to eliminate network I/O load by balancing which nodes it retrieves data from at any given time.
D. The client queries the NameNode and then retrieves the block from the nearest DataNode to the client and then passes that block back to the client.
Answer: B
Question 35
Identify four pieces of cluster information that are stored on disk on the NameNode.
E. An edit log of changes that have been made since the last snapshot compaction by the Secondary NameNode.
F. File permissions of the files in HDFS.
G. The status of the heartbeats of each DataNode.
Answer: B, C, E, F
Explanation:
B: An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
C: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
E: The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. The SecondaryNameNode periodically compacts the EditLog into a "checkpoint"; the EditLog is then cleared.
Note:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster. The NameNode runs in its own JVM process; in a typical production cluster it runs on a separate machine. The NameNode is a Single Point of Failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
Question 36
A. Half the maximum number of Reduce tasks which can run simultaneously on an individual node.
B. The maximum number of Map tasks which can run simultaneously on an individual node.
C. The same value on each slave node.
D. The maximum number of Map tasks which can run on the cluster as a whole.
E. Half the maximum number of Reduce tasks which can run on the cluster as a whole.
Answer: B
Explanation:
mapred.tasktracker.map.tasks.maximum
Range: 1/2 * (cores/node) to 2 * (cores/node)
Description: Number of map tasks to deploy on each machine.
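The quoted range can be computed directly; a minimal sketch (the 8-core figure is an example of mine, not from the source):

```python
def map_slots_range(cores_per_node):
    """Suggested range for mapred.tasktracker.map.tasks.maximum:
    from 1/2 * (cores/node) up to 2 * (cores/node)."""
    return cores_per_node // 2, cores_per_node * 2

low, high = map_slots_range(8)   # e.g. an 8-core slave node
print(low, high)                 # 4 16
```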
Question 37
Which command does Hadoop offer to discover missing or corrupt HDFS data?
B. fsck
C. du
D. dskchk
E. Hadoop does not provide any tools to discover missing or corrupt data; there is no need because three replicas are kept for each data block.
Answer: B
Explanation:
HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with various files, e.g. missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for native filesystems, this command does not correct the errors it detects. Normally the NameNode automatically corrects most of the recoverable failures. HDFS's fsck is not a Hadoop shell command; it can be run as 'bin/hadoop fsck'. fsck can be run on the whole filesystem or on a subset of files.
Reference: Hadoop DFS User Guide
Question 38
Answer: C
Explanation:
During CDH4 package installation of MRv1, the following Unix user accounts are automatically created to support security:
This user / runs these Hadoop programs:
* hdfs: HDFS: NameNode, DataNodes, Secondary NameNode, Standby NameNode (if you are using HA)
* mapred: MRv1: JobTracker and TaskTrackers
Reference: Configuring Hadoop Security in CDH4
Question 39
Your Hadoop cluster contains nodes in three racks. Which scenario results if you leave the dfs.hosts property in the NameNode's configuration file empty (blank)?
A. The NameNode will update the dfs.hosts property to include machines running the DataNode daemon on the next NameNode reboot or with a dfsadmin -refreshNodes.
B. Any machine running the DataNode daemon can immediately join the cluster.
C. Presented with a blank dfs.hosts property, the NameNode will permit DataNodes specified in mapred.hosts to join the cluster.
D. No new nodes can be added to the cluster until you specify them in the dfs.hosts file.
Answer: D
Explanation:
Reference: Apache Hadoop, Module 7: Managing a Hadoop Cluster
Question 40
Answer: D
Explanation:
Each DataNode keeps a small amount of metadata allowing it to identify the cluster it participates in. If this metadata is lost, then the DataNode cannot participate in an HDFS instance and the data blocks it stores cannot be reached. When an HDFS instance is formatted, the NameNode generates a unique namespace ID for the instance. When DataNodes first connect to the NameNode, they bind to this namespace ID and establish a unique "storage ID" that identifies that particular DataNode in the HDFS instance. This data, as well as information about what version of Hadoop was used to create the block files, is stored in a file named VERSION in the ${dfs.data.dir}/current directory.
Note:
Administrators of HDFS clusters understand that the HDFS metadata is some of the most precious bits they have. While you might have hundreds of terabytes of information stored in HDFS, the NameNode's metadata is the key that allows this information, spread across several million "blocks", to be reassembled into coherent, ordered files.
Reference: Protecting per-DataNode Metadata
Question 41
Your cluster implements HDFS High Availability (HA). Your two NameNodes are named nn01 and nn02. What occurs when you execute the command:
hdfs haadmin -failover nn01 nn02
A. nn02 becomes the standby NameNode and nn01 becomes the active NameNode
B. nn01 is fenced, and nn01 becomes the active NameNode
C. nn01 is fenced, and nn02 becomes the active NameNode
D. nn01 becomes the standby NameNode and nn02 becomes the active NameNode
Answer: C
Explanation:
failover - initiate a failover between two NameNodes. This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one of the methods succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.
Reference:
HDFS High Availability Administration, HA Administration using the haadmin command
Question 42
You've configured your cluster with HDFS Federation. One NameNode manages the /data namespace and another NameNode manages the /reports namespace. How do you configure a client machine to access both the /data and the /reports directories on the cluster?
A. Configure the client to mount the /data namespace. As long as a single namespace is mounted and the client participates in the cluster, HDFS grants access to all files in the cluster to that client.
B. Configure the client to mount both namespaces by specifying the appropriate properties in core-site.xml
C. You cannot configure a client to access both directories in the current implementation of HDFS Federation.
D. You don't need to configure any parameters on the client machine. Access is controlled by the NameNodes managing the namespace.
Answer: B
Explanation:
Note: HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling a generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of HDFS clusters to new implementations and use cases.
Reference: Hortonworks, An Introduction to HDFS Federation
Question 43
Which three processes does HDFS High Availability (HA) enable on your cluster?
Answer: A, C, D
Explanation:
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
Question 44
Identify which two daemons typically run on each slave node in a Hadoop cluster running MapReduce v1 (MRv1).
A. NodeManager
B. TaskTracker
C. DataNode
D. NameNode
E. Secondary NameNode
F. JobTracker
Answer: B, C
Explanation:
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
The following two daemons run on each slave node:
* DataNode - Stores actual HDFS data blocks.
* TaskTracker - Responsible for instantiating and monitoring individual Map and Reduce tasks.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How many Daemon processes run on a Hadoop system?
Question 45
Identify four characteristics of a 300MB file that has been written to HDFS with a block size of 128MB and all other Hadoop defaults unchanged.
Answer: C, D, E, F
Explanation:
Not A: The file will take (2 x 128 + 44) * 3 = 900 MB of raw storage.
C (not B): The third block's size is 300 - 2 * 128 = 44 MB.
D (not H): All blocks in a file except the last block are the same size.
E (not G): All blocks are replicated three times by default.
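The block arithmetic above can be checked with a short sketch (the helper name is mine, not Hadoop's):

```python
def hdfs_blocks(file_mb, block_mb=128, replication=3):
    """Split a file into HDFS blocks: every block except the last is
    block_mb in size, and the last block holds the remainder. Returns
    the block sizes and the raw storage consumed by all replicas."""
    full, rem = divmod(file_mb, block_mb)
    blocks = [block_mb] * full + ([rem] if rem else [])
    return blocks, sum(blocks) * replication

blocks, raw_mb = hdfs_blocks(300)
print(blocks, raw_mb)  # [128, 128, 44] 900
```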
Question 46
For each job, the Hadoop framework generates task log files. Where are Hadoop's task log files stored?
A. Cached on the local disk of the slave node running the task, then purged immediately upon task completion.
B. Cached on the local disk of the slave node running the task, then copied into HDFS.
C. In HDFS, in the directory of the user who generates the job.
D. On the local disk of the slave node running the task.
Answer: D
Explanation:
Reference: Apache Hadoop Log Files: Where to find them in CDH, and what info they contain
Question 47
Compare the hardware requirements of the NameNode with those of the DataNodes in a Hadoop cluster running MapReduce v1 (MRv1):
A. The NameNode requires more memory and greater disk capacity than the DataNodes.
B. The NameNode and DataNodes should have the same hardware configuration.
C. The NameNode requires more memory and no disk drives.
D. The NameNode requires more memory but less disk capacity.
E. The NameNode requires less memory and less disk capacity than the DataNodes.
Answer: D
Explanation:
Note:
* The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster. The NameNode runs in its own JVM process; in a typical production cluster it runs on a separate machine. The NameNode is a Single Point of Failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
* A DataNode stores data in the Hadoop File System (HDFS). There is only one DataNode process running on any Hadoop slave node. The DataNode runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly while replicating data.
Reference:
24 Interview Questions & Answers for Hadoop MapReduce developers
Question 48
Which scheduler would you deploy to ensure that your cluster allows short jobs to finish within a reasonable time without starving long-running jobs?
A. FIFO Scheduler
B. Fair Scheduler
C. Capacity Scheduler
D. Completely Fair Scheduler (CFS)
Answer: B
Explanation:
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users.
Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
Reference: Hadoop, Fair Scheduler Guide
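The equal-share behaviour can be sketched as a toy calculation (an illustration of the principle only, not the actual FairScheduler algorithm, which also handles pools, weights, and minimum shares):

```python
def fair_shares(total_slots, demand):
    """Split slots evenly across jobs; a job never receives more than
    it can use, and unused capacity is redistributed to the others.

    demand: dict of job name -> number of runnable tasks.
    """
    shares, remaining = {}, total_slots
    # Satisfy the smallest demands first so leftovers flow to big jobs.
    for i, (job, want) in enumerate(sorted(demand.items(), key=lambda kv: kv[1])):
        even = remaining // (len(demand) - i)
        shares[job] = min(want, even)
        remaining -= shares[job]
    return shares

# A long job and a short job share a 10-slot cluster:
print(fair_shares(10, {"A": 100, "B": 2}))  # {'B': 2, 'A': 8}
```

The short job B gets all the slots it can use, so it finishes quickly; the long job A absorbs the rest instead of monopolizing the cluster.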
Question 49
Your Hadoop cluster has 12 slave nodes, a block size set to 64MB, and a replication factor of three. Which best describes how the Hadoop framework distributes block writes into HDFS from a Reducer outputting a 150MB file?
A. The Reducer will generate twelve blocks and write them to slave nodes nearest the node on which the Reducer runs.
B. The Reducer will generate nine blocks and write them randomly to nodes throughout the cluster.
C. The slave node on which the Reducer runs gets the first copy of every block written. Other block replicas will be placed on other nodes.
D. Reducers don't write blocks into HDFS
Answer: C
Explanation:
Note:
* The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation of the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.
* In HDFS data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64MB or 128MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS block size cannot be compared with the traditional file system block size.
Question 50
You have a cluster running with the Fair Scheduler enabled. There are currently no jobs running on the cluster. You submit a job A, so that only job A is running on the cluster. A while later, you submit job B. Now job A and job B are running on the cluster at the same time. How will the Fair Scheduler handle these two jobs?
Answer: D
Explanation:
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
Reference: Hadoop, Fair Scheduler Guide
Question 51
You have a Hadoop cluster with a NameNode on host mynamenode. What are two ways to determine available HDFS space in your cluster?
Answer: B, C
Question 52
In the context of configuring a Hadoop cluster for HDFS High Availability (HA), 'fencing' refers to:
A. Isolating a failed NameNode from write access to the fsimage and edits files so that it cannot resume write operations if it recovers.
B. Isolating the cluster's master daemon to limit write access only to authorized clients.
C. Isolating both HA NameNodes to prevent a client application from killing the NameNode daemons.
D. Isolating the standby NameNode from write access to the fsimage and edits files.
Answer: A
Explanation:
A fencing method is a method by which one node can forcibly prevent another node from making continued progress. This might be implemented by killing a process on the other node, by denying the other node's access to shared storage, or by accessing a PDU to cut the other node's power. Since these methods are often vendor- or device-specific, operators may implement this interface in order to achieve fencing. Fencing is configured by the operator as an ordered list of methods to attempt. Each method will be tried in turn, and the next in the list will only be attempted if the previous one fails. See NodeFencer for more information.
Note:
failover - initiate a failover between two NameNodes.
This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one of the methods succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.
Reference: org.apache.hadoop.ha, Interface FenceMethod
Reference: HDFS High Availability Administration, HA Administration using the haadmin command
Question 53
You are planning a Hadoop cluster, and you expect to be receiving just under 1TB of data per week which will be stored on the cluster, using Hadoop's default replication. You decide that your slave nodes will be configured with 4 x 1TB disks. Calculate how many slave nodes you need to deploy at a minimum to store one year's worth of data.
C. 10 slave nodes
D. 50 slave nodes
Answer: D
Explanation:
Total raw space required: 52 (weeks) * 1 (TB per week) * 3 (default replication factor) = 156 TB
Minimum number of slave nodes required: 156 / 4 = 39, so 50 is the smallest of the offered options that provides enough capacity.
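The capacity arithmetic works out as follows (a sketch of the calculation only):

```python
import math

weeks, tb_per_week = 52, 1
replication = 3              # Hadoop default replication factor
disk_tb_per_node = 4         # 4 x 1TB disks per slave node

raw_tb = weeks * tb_per_week * replication        # 156 TB raw storage
min_nodes = math.ceil(raw_tb / disk_tb_per_node)  # 39 nodes minimum
print(raw_tb, min_nodes)  # 156 39
```

Note this ignores the disk space Hadoop itself needs for intermediate data and logs, which is one reason to provision above the bare minimum.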
Question 54
Under which scenario would it be most appropriate to consider using faster (e.g. 10 Gigabit) Ethernet as the network fabric for your Hadoop cluster?
A. When the typical workload generates a large amount of intermediate data, on the order of the input data itself.
B. When the typical workload consists of processor-intensive tasks.
C. When the typical workload consumes a large amount of input data, relative to the entire capacity of HDFS.
D. When the typical workload generates a large amount of output data, significantly larger than the amount of intermediate data.
Answer: A
Explanation:
When we encounter applications that produce large amounts of intermediate data, on the order of the same amount as is read in, we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Alternatively, for customers who have already moved to 10 Gigabit Ethernet or InfiniBand, these solutions can be used to address network-bound workloads. Be sure that your operating system and BIOS are compatible if you're considering switching to 10 Gigabit Ethernet.
Reference:
Cloudera's Support Team Shares Some Basic Hardware Recommendations
Question 55
What determines the number of Reducers that run for a given MapReduce job on a cluster running MapReduce v1 (MRv1)?
A. It is set by the Hadoop framework and is based on the number of InputSplits of the job.
B. It is set by the developer.
C. It is set by the JobTracker based on the amount of intermediate data.
D. It is set and fixed by the cluster administrator in mapred-site.xml. The number set always runs for any submitted job.
Answer: B
Explanation:
Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.reduce.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 *
numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
Reference: org.apache.hadoop.mapred, Class JobConf
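The Hadoop tutorial's rule of thumb (0.95 or 1.75 times the cluster's reduce-slot capacity) can be sketched as a small helper (the helper name is mine, and the 10-node, 4-slot figures are example values):

```python
def suggested_reduces(nodes, reduce_slots_per_node, factor=0.95):
    """Rule-of-thumb reduce count: factor * (nodes * slots per node),
    where factor is 0.95 (one wave of reduces launches immediately)
    or 1.75 (faster nodes run a second, load-balancing wave)."""
    return int(factor * nodes * reduce_slots_per_node)

print(suggested_reduces(10, 4))        # one wave: 38
print(suggested_reduces(10, 4, 1.75))  # two waves: 70
```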
Question 56
In HDFS, you view a file with rw-r--r-- set as its permissions. What does this tell you about the file?
Answer: A
Question 57
Your cluster block size is set to 128MB. A client application (client application A) is writing a 500MB file to HDFS. After client application A has written 300MB of data, another client (client application B) attempts to read the file. What is the effect of a second client requesting a file during a write?
Answer: A
Question 58
Your cluster has nodes in seven racks, and you have provided a rack topology script. What is Hadoop's block placement policy, assuming a block replication factor of three?
Answer: B
Explanation:
HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS; 2 copies are stored on DataNodes in one rack and the 3rd copy in a different rack.
Note: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the HDFS Blocks are replicated?
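The default placement policy can be sketched as follows (the topology dict and function name are my own illustration, not the real BlockPlacementPolicy API): the first replica goes on the writer's node, and the second and third go on two different nodes of a single other rack.

```python
import random

def place_replicas(writer_node, topology):
    """Sketch of HDFS default placement for replication factor 3.

    topology: dict of rack name -> list of node names.
    Returns [local replica, plus two replicas on one remote rack].
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in topology if r != local_rack])
    # Two different nodes within the chosen remote rack.
    second, third = random.sample(topology[remote_rack], 2)
    return [writer_node, second, third]

print(place_replicas("n1", {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}))
```

With this layout a single rack failure can never destroy all three replicas, while two of the three replicas still share a rack to save cross-rack bandwidth during the write pipeline.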
Question 59
Which MapReduce v2 (MRv2/YARN) daemon is a per-machine slave responsible for launching application containers
and monitoring application resource usage?
A. JobTracker
B. ResourceManager
C. ApplicationMaster
D. NodeManager
E. ApplicationMasterService
F. TaskTracker
Answer: D
Explanation:
The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of the JobTracker, resource
management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager
(RM) and a per-application ApplicationMaster (AM). The NodeManager is the per-machine slave agent that launches
the applications' containers and monitors their resource usage.
Reference: Apache Hadoop YARN – Concepts & Applications
Question 60
What happens if a Mapper on one node goes into an infinite loop while running a MapReduce job?
A. After a period of time, the JobTracker will restart the TaskTracker on the node on which the map task is running
B. The Mapper will run indefinitely; the TaskTracker must be restarted to kill it
C. The job will immediately fail.
D. After a period of time, the TaskTracker will kill the Map Task.
Answer: D
Explanation:
* The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to
have failed and the work is scheduled on a different TaskTracker.
* A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit
the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the
TaskTracker as unreliable.
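The per-task timeout behaviour can be sketched like this (a simplified illustration; the 600,000 ms figure is the default value of mapred.task.timeout):

```python
# A TaskTracker kills a task that has not reported progress, read input,
# or written output within the configured timeout window.
DEFAULT_TIMEOUT_MS = 600_000  # mapred.task.timeout default (10 minutes)

def should_kill(last_progress_ms, now_ms, timeout_ms=DEFAULT_TIMEOUT_MS):
    return (now_ms - last_progress_ms) > timeout_ms
```

A Mapper stuck in an infinite loop stops reporting progress, so once the timeout elapses the TaskTracker kills the task, as in option D above.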
Question 61
On a cluster running MapReduce v1 (MRv1), what determines where a client application's blocks are written into HDFS?
Answer: B
Question 62
You are running two Hadoop clusters (cluster1 and cluster2) which run identical versions of Hadoop. You want to copy
the data inside /home/foo/cluster1 on cluster1 to the directory /home/bar/ on cluster2. What is the correct distcp syntax to copy
one directory tree from one cluster to the other cluster?
Answer: A
Question 63
Using Cloudera Manager on a CDH4 cluster running MapReduce v1 (MRv1), you delete a TaskTracker role instance from a
host that also runs a DataNode role instance and a RegionServer role instance. Cloudera Manager makes changes to
the cluster and prompts you to accept the changes. What other configuration option will Cloudera Manager
automatically prompt you to change?
Answer: C
Question 64
You have a cluster running 32 slave nodes and 3 master nodes running MapReduce v1 (MRv1). You execute the
command:
$ hadoop fsck /
Which four cluster conditions does running this command return to you?
A. The current state of the file system returned from scanning individual blocks on each DataNode
B. Number of dead DataNodes
C. Configured capacity of your cluster
D. Under-replicated blocks
E. Blocks replicated improperly or that don't satisfy your cluster's replica placement policy (e.g., too many blocks
replicated on the same node)
F. Number of DataNodes
G. The current state of the file system according to the NameNode
H. The location for every block
Answer: D,E,F,G
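A sketch of pulling those conditions out of an fsck-style summary (the sample text below is abbreviated and hypothetical, not verbatim fsck output):

```python
# Parse "key: value" lines from a summary block into a dict.
def parse_summary(text):
    out = {}
    for line in text.splitlines():
        key, sep, val = line.partition(":")
        if sep:
            out[key.strip()] = val.strip()
    return out

sample = """\
Under-replicated blocks: 12
Mis-replicated blocks: 0
Number of data-nodes: 32
"""
report = parse_summary(sample)
```

Note that fsck gets its information from the NameNode's metadata; it does not scan blocks on the DataNodes themselves.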
Question 65
You are running a Hadoop cluster with a NameNode on the host mynamenode. What are two ways you can determine
available HDFS space in your cluster?
Answer: C
Explanation:
Reference:
http://stackoverflow.com/questions/11583210/hdfs-free-space-available-command
Question 66
Your developers request that you enable them to use Pig on your Hadoop cluster. What do you need to configure and/or
install?
A. Install the Pig interpreter on all nodes in the cluster, and the client machines
B. Install the Pig interpreter on the client machines only.
C. Install the Pig JARs on all slave nodes in the cluster, and the Pig interpreter on the client machines
D. Install the Pig interpreter on the master node which is running the JobTracker
Answer: B
Explanation:
Reference:
http://cdn.oreillystatic.com/en/assets/1/event/01/Introduction%20to%20Apache%20Hadoop%20Presentation%201.pdf (slide 51)
Question 67
What is the best disk configuration for slave nodes in a Hadoop cluster where each node has 6x2TB drives?
Answer: B
Question 68
MapReduce v2 (MRv2/YARN) splits which two major functions of the JobTracker into separate daemons?
A. Managing tasks
B. Managing file system metadata
C. Launching tasks
D. Job coordination between the ResourceManager and the NodeManager
E. Job scheduling/monitoring
F. Resource management
G. Health status checks (heartbeats)
H. MapReduce metric reporting
Answer: E,F
Explanation:
Reference:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.1/CDH4-Installation-
Guide/cdh4ig_topic_11_4.html (see "About MapReduce v2 (YARN)")
Question 69
What is the rule governing the formatting of the underlying filesystem in the Hadoop cluster?
A. They must all use the same filesystem, but this does not need to be the same filesystem as the filesystem used by
the NameNode
B. They must all be left as unformatted raw disk; Hadoop formats them automatically
C. They must all use the same filesystem as the NameNode
D. They must all be left as unformatted, raw disk; Hadoop uses raw unformatted disk for HDFS
E. They can use different filesystems
Answer: C
Question 70
Your cluster implements HDFS High Availability (HA); your two NameNodes are named hadoop01 and hadoop02. What
occurs when you execute the command:
sudo -u hdfs haadmin -failover hadoop01 hadoop02
A. hadoop02 becomes the standby NameNode and hadoop01 becomes the active NameNode
B. hadoop02 is fenced, and hadoop01 becomes the active NameNode
C. hadoop01 becomes inactive and hadoop02 becomes the active NameNode
D. hadoop01 is fenced, and hadoop02 becomes the active NameNode
Answer: C
Question 71
Your company stores user profile records in an OLTP database. You want to join these records with webserver logs which
you have already ingested into the Hadoop file system. What is the best way to obtain and ingest these user records?
Answer: D
Question 72
During the execution of a MapReduce job, where does the Mapper place the intermediate data of each map task?
A. The Mapper stores the intermediate data on the underlying filesystem of the local disk of the machine which ran
the map task
B. The Mapper transfers the intermediate data to the JobTracker, which then sends it to the Reducers
C. The Hadoop framework holds the intermediate data in the TaskTracker's memory until it is transferred to the
Reducers
D. The Mapper transfers the intermediate data immediately to the Reducers as it is generated by the map task
Answer: A
Question 73
What determines how many Mappers are required for a MapReduce job on a cluster running MapReduce v1 (MRv1)?
Answer: D
Question 74
What occurs when you run a Hadoop job specifying an output directory for job output which already exists in HDFS?
A. An error will occur after the Mappers have completed but before any Reducers begin to run, because the output
path must not exist during the shuffle and sort.
B. The job will run successfully; output from the Reducers will override the contents of the existing directory
C. The job will run successfully; output from the Reducers will be placed in a directory called job output-1
D. An error will occur immediately because the output directory must not already exist.
Answer: D
Question 75
You have a cluster running with the Fair Scheduler enabled and configured. You submit multiple jobs to the cluster.
Each job is assigned to a pool. What are the two key points to remember about how jobs are scheduled with the Fair
Scheduler?
A. Each pool gets 1/M of the total available task slots, where M is the number of nodes in the cluster
B. Pools are assigned priorities; pools with higher priorities are executed before pools with lower priorities
C. Each pool gets 1/N of the total available task slots, where N is the number of jobs running on the cluster
D. Pools get a dynamically-allocated share of the available task slots (subject to additional constraints)
E. Each pool's share of the task slots remains static within the execution of any individual job
F. Each pool's share of task slots may change throughout the course of job execution
Answer: D,F
Explanation:
Reference:
http://www.quora.com/Eric-Sammer/answers/Apache-Hadoop
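A simplified sketch of the dynamic allocation described in options D and F (equal weights only; the real scheduler also honors minimum shares, weights, and preemption):

```python
# Each pool's share is recomputed from the pools that currently have
# demand, so a pool's share changes as jobs come and go.
def fair_shares(total_slots, active_pools):
    if not active_pools:
        return {}
    share = total_slots // len(active_pools)
    return {pool: share for pool in active_pools}
```

With 100 slots and two active pools, each gets 50 slots; when a third pool submits a job, each share drops to 33, illustrating why shares are not static during a job's execution.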
Question 76
Your cluster has 10 slave nodes. The cluster block size is set to 128MB and its replication factor is set to three. How will
the Hadoop framework distribute block writes into HDFS from a Reducer outputting a 300MB file?
Answer: C
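The arithmetic behind this write can be sketched as follows (a simplified illustration of block counts only, not of pipeline placement):

```python
# A 300MB reducer output with a 128MB block size and replication factor 3.
block_mb, replication, file_mb = 128, 3, 300
full, last = divmod(file_mb, block_mb)
block_sizes_mb = [block_mb] * full + ([last] if last else [])
total_block_writes = len(block_sizes_mb) * replication
```

This yields three blocks (128MB, 128MB, and 44MB), and with replication factor three a total of nine block replicas written across the DataNodes.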
Question 77
How do you access the log messages a MapReduce application generates after you run it on your cluster?
A. You connect to the JobTracker web UI and locate the details for your job; this will include log messages
B. You remotely log in to any slave node in the cluster and browse your job's log files in /var/log/hadoop/archive directly
C. You browse the /logs directory in HDFS, where all job log files are stored
D. You browse the current working directory of the client machine from which you submitted the job
Answer: A
Question 78
What does each block of a file contain when it is written into HDFS?
A. Each block writes a separate metafile containing information on the file name of which the block is a part
B. Each block has a header and footer containing metadata
Answer: C
Question 79
Which two occur when individual blocks are written to a DataNode's local filesystem?
Answer: A,C
Question 80
Once a client application validates its identity and is granted access to a file in a cluster, what is the remainder of the
read path back to the client?
A. The NameNode gives the client the block IDs and a list of DataNodes on which those blocks are found, and the
application reads the blocks directly from the DataNodes
B. The NameNode maps the read request against the block locations in its stored metadata and reads those blocks
from the DataNodes; the client application then reads the blocks from the NameNode
C. The NameNode maps the read request against the block locations in its stored metadata; the block IDs are sorted by
their distance to the client and moved to the DataNode closest to the client according to Hadoop rack topology. The
client application then reads the blocks from the single DataNode
Answer: A
Question 81
Which two daemons must be installed on master nodes in a Hadoop cluster running MapReduce v1 (MRv1)?
A. ResourceManager
B. NameNode
C. DataNode
D. TaskTracker
E. ApplicationMaster
F. HMaster
G. ZooKeeper
Answer: B,E
Question 82
What is the smallest number of slave nodes you would need to configure in your Hadoop cluster to store 100TB of
data, using Hadoop's default replication values, on nodes with 10TB of raw disk space per node?
A. 100
B. 25
C. 10
D. 40
E. 5
Answer: D
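The capacity arithmetic can be sketched as follows (raw disk only; this ignores the space DataNodes reserve for non-HDFS use, which is why a margin above the computed minimum is usually planned):

```python
import math

# 100TB of data at the default replication factor of 3 needs 300TB of
# raw storage; at 10TB of raw disk per node that is at least 30 nodes.
data_tb, replication, per_node_tb = 100, 3, 10
raw_needed_tb = data_tb * replication
min_nodes = math.ceil(raw_needed_tb / per_node_tb)
```

So at least 30 nodes of raw capacity are required before any operating-system or non-HDFS overhead is accounted for.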
Question 83
When planning a Hadoop cluster, what general rule governs the hardware requirements of master nodes versus slave
nodes?
A. The master nodes require more memory and greater disk capacity than the slave nodes
B. The master and slave nodes should have the same hardware configuration
C. The master nodes require more memory and no disk drives
D. The master nodes require more memory but less disk capacity
E. The master nodes require less memory and fewer disk drives than the slave nodes
Answer: D
Question 84
A slave node in your cluster has 24GB of RAM and 12 physical processor cores on hyperthreading-enabled processors.
You set the value of mapred.child.java.opts to -Xmx1G, and the value of mapred.tasktracker.map.tasks.maximum to
12. What is the appropriate value to set for mapred.tasktracker.reduce.tasks.maximum?
A. 24
B. 16
C. 6
D. 2
E. 12
F. 20
Answer: C
Question 85
A client application opens a file write stream on your cluster. Which two metadata changes occur during a file write?
A. The NameNode triggers a block report to update block locations in the edits file
B. The change is written to the NameNode disk
C. The change is written to the edits file
D. The metadata in RAM on the NameNode is updated
E. The metadata in RAM on the NameNode is flushed to disk
F. The change is written to the fsimage file
G. The change is written to the secondary NameNode
Answer: C,D
Question 86
Each slave node in your cluster has four 2TB hard drives installed (4x2TB). You set the value of the
dfs.datanode.du.reserved parameter to 100GB on each slave node. How does this alter HDFS block storage?
Answer: C
Question 87
A. HDFS runs in user space, which makes all users with access to the namespace able to read, write, and modify all files
B. The owner and group cannot delete the file, but others can
C. The owner and group can modify the contents of the file; others can't
D. The owner and group can read the file; others can't
E. No one can modify the contents of the file
Answer: C
Question 88
You configure your Hadoop cluster with MapReduce v1 (MRv1) along with HDFS High Availability (HA) using Quorum-based
storage. On which nodes should you configure and run your JournalNode daemon(s) to guarantee a quorum?
Answer: A
Question 89
Which three file actions can execute as you write a file into HDFS?
Answer: B,C,E
Question 90
You set the value of dfs.block.size to 64MB in hdfs-site.xml on a client machine, but you set the same property to
128MB on your cluster's NameNode. What happens when the client writes a file to HDFS?
A. An exception will be thrown when the client attempts to write the file, because the values are different.
B. A block size of 64MB will be used
C. A block size of 128MB will be used
D. The file will be written successfully with a block size of 64MB, but clients attempting to read the file will fail because
the NameNode believes the blocks to be 128MB in size
Answer: B
Question 91
Which daemon instantiates JVMs to perform MapReduce processing in a cluster running MapReduce v1 (MRv1)?
A. NodeManager
B. ApplicationManager
C. ApplicationMaster
D. TaskTracker
E. JobTracker
F. DataNode
G. NameNode
H. ResourceManager
Answer: D
Question 92
A. A maximum of 4 Reducers, but the actual number of Reducers that run for any given job is based on the volume of
input data
B. A maximum of 4 Reducers, but the actual number of Reducers that run for any given job is based on the volume of
intermediate data
C. Four Reducers will run. Once set by the cluster administrator, this parameter cannot be overridden
D. The number of Reducers for any given job is set by the developer
Answer: C
Question 93
You configure your Hadoop development cluster with both MapReduce frameworks, MapReduce v1 (MRv1) and
MapReduce v2 (MRv2/YARN). You plan to run only one set of MapReduce daemons at a time in this development
environment (running both simultaneously results in an unusable cluster, but configuring both and moving between
them is fine). Which two MapReduce daemons do you need to configure to run on your master nodes?
A. ContainerManager
B. NodeManager
C. JournalNode
D. ApplicationMaster
E. ResourceManager
F. JobTracker
Answer: E,F
Question 94
Using Hadoop's default settings, how much data will you be able to store on your Hadoop cluster if it has 12 nodes with
4TB of raw disk space per node allocated to HDFS storage?
A. Approximately 3TB
B. Approximately 12TB
C. Approximately 16TB
D. Approximately 48TB
Answer: C
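The arithmetic: dividing the raw capacity by the default replication factor of 3 gives the approximate usable capacity (ignoring space reserved for non-HDFS use):

```python
# 12 nodes x 4TB of raw HDFS disk, at the default replication factor 3.
nodes, per_node_tb, replication = 12, 4, 3
raw_tb = nodes * per_node_tb
usable_tb = raw_tb / replication
```

This gives 48TB of raw storage but only about 16TB of user data, since every block is stored three times.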
Question 95
When requesting a file, how does HDFS retrieve the blocks associated with that file?
Answer: D
Question 96
Answer: A,B,E,F
Question 97
What happens when a map task crashes while running a MapReduce job?
Answer: C