
TestKings.CCD-410.

70 questions

Number: Cloudera CCD-410


Passing Score: 800
Time Limit: 120 min
File Version: 4.8

Cloudera CCD-410
Cloudera Certified Developer for Apache Hadoop (CCDH)

1. Passed today with 92%. All questions were from this dump. The hardest part was pretending I hadn't already seen the questions! Thanks to all involved.
2. ALL the credit goes to this excellent and wonderful vce file. Thanks.
3. Use this and you will definitely figure out the differences that I have mentioned.
4. All the questions are sorted out properly and have clear answers.
5. It contains 100% real questions. Prepare yourself to face the exam with real questions from previous exams, and walk into the testing center
with confidence.

Exam A

QUESTION 1
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map
task processes each input file regardless of how many blocks the input file occupies?

A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplitable to always return false.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobContext).
Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure input-files are not split-up and are processed as a
whole by Mappers.

Reference: org.apache.hadoop.mapreduce.lib.input, Class FileInputFormat<K,V>
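As an illustration of answer D, here is a minimal sketch (not part of the exam material) of a custom input format that prevents splitting. The class name WholeFileTextInputFormat is illustrative; it subclasses TextInputFormat from the new (org.apache.hadoop.mapreduce) API.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Returning false from isSplitable forces one split (and therefore one map
// task) per input file, regardless of how many HDFS blocks the file occupies.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

In the driver you would register it with job.setInputFormatClass(WholeFileTextInputFormat.class).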

QUESTION 2
Which process describes the lifecycle of a Mapper?

A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation: For each map instance that runs, the TaskTracker creates a new instance of your mapper.

Note:
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The mapper may perform a number of extraction and
transformation functions on the Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same or a different Key/Value type.

* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class.

This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value).
cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.

Reference: Hadoop/MapReduce/Mapper

QUESTION 3
Which best describes when the reduce method is first called in a MapReduce job?

A. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what
percentage of the intermediate data should arrive before the reduce method begins.
B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all
intermediate data has been copied and sorted.
C. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the
intermediate key-value pairs start to arrive.

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation:

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers , When is the reducers are started in a MapReduce job?

QUESTION 4
You want to count the number of occurrences for each unique word in the supplied input data. You've decided to implement this by having your mapper
tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully
implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing Reducer as your
combiner in this case, and why or why not?

A. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.
B. No, because the sum operation in the reducer is incompatible with the operation of a Combiner.
C. No, because the Reducer and Combiner are separate interfaces.
D. No, because the Combiner is incompatible with a mapper which doesn't use the same data type for both the key and value.
E. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on
individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your
reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed; Hadoop may or
may not execute a combiner, and if it does it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's
execution.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my
MapReduce Job?
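For illustration, here is a minimal sketch (assuming the new org.apache.hadoop.mapreduce API and a word-count style job; the class name IntSumReducer is illustrative) of a sum reducer that can double as the combiner, because addition is commutative and associative and its input and output types match:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the mapper. Because the (Text, IntWritable) input
// types match the output types and the operation is associative and
// commutative, the identical class can also be registered as the combiner.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

In the driver the same class would be registered twice: job.setCombinerClass(IntSumReducer.class) and job.setReducerClass(IntSumReducer.class).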

QUESTION 5
Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an
available slot to schedule a MapReduce operation.

A. TaskTracker
B. NameNode
C. DataNode
D. JobTracker
E. Secondary NameNode

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on
any Hadoop cluster. The JobTracker runs in its own JVM process; in a typical production cluster it runs on a separate machine. Each slave node is
configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs
are halted. The JobTracker in Hadoop performs the following actions (from the Hadoop wiki):

Client applications submit jobs to the JobTracker.

The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.

The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled
on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job
elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable. When the work is
completed, the JobTracker updates its status.

Client applications can poll the JobTracker for information.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a JobTracker in Hadoop? How many instances of
JobTracker run on a Hadoop Cluster?

QUESTION 6
Which project gives you a distributed, scalable data store that allows random, realtime read/write access to hundreds of terabytes of data?

A. HBase
B. Hue
C. Pig
D. Hive
E. Oozie
F. Flume
G. Sqoop

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.

Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured
Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like

capabilities on top of Hadoop and HDFS.

Features:

Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server-side Filters.
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
Extensible jruby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.

Reference: http://hbase.apache.org/ (when would I use HBase? First sentence)

QUESTION 7
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB
of this file, what would another user see when trying to access this file?

A. They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
B. They would see the current state of the file, up to the last bit written by the command.
C. They would see the current state of the file through the last completed block.
D. They would see no content until the whole file is written and closed.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: While a file is being written, other readers can see its contents only up to the last block that has been completely written; data in the block currently being written is not visible until that block is completed.

QUESTION 8
A combiner reduces:

A. The number of values across different keys in the iterator supplied to a single reduce method call.
B. The amount of intermediate data that must be transferred between the mapper and reducer.
C. The number of input files a mapper must process.

D. The number of output files a reducer must produce.

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on
individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your
reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed; Hadoop may or
may not execute a combiner, and if it does it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's
execution.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my
MapReduce Job?

QUESTION 9
To process input key-value pairs, your mapper needs to load a 512 MB data file into memory. What is the best way to accomplish this?

A. Serialize the data file, insert in it the JobConf object, and read the data into memory in the configure method of the mapper.
B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
C. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
D. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: Place the file in the DistributedCache and read it into memory in the mapper's configure method, which runs once per task before any input records are processed. There is no DataCache class in Hadoop.

QUESTION 10
For each intermediate key, each reducer task can emit:

A. As many final key-value pairs as desired. There are no restrictions on the types of those key- value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.

Correct Answer: C

Section: (none)
Explanation

Explanation/Reference:
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce

QUESTION 11
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values
representing product identifiers (Text).

Identify what determines the data types used by the Mapper for a given job.

A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
B. The data types specified in HADOOP_MAP_DATATYPES environment variable
C. The mapper-specification.xml file submitted with the job determine the mapper's input key and value types.
D. The InputFormat used by the job determines the mapper's input key and value types.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as
(LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.

Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass().
By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and
setMapOutputValueClass() methods of the JobConf class will override these.

Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD
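A hedged sketch of an old-API (JobConf) driver illustrating the point above; the class name InputFormatDemoDriver and the commented-out mapper class are assumptions, not part of the exam material:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputFormatDemoDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(InputFormatDemoDriver.class);
        conf.setJobName("inputformat-demo");

        // The InputFormat fixes the mapper's input types: TextInputFormat
        // always delivers (LongWritable, Text) pairs, i.e. byte offset + line.
        conf.setInputFormat(TextInputFormat.class);

        // Declares the job's output types; by default these are also assumed
        // to be the mapper's output types unless setMapOutputKeyClass()/
        // setMapOutputValueClass() are called.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // conf.setMapperClass(SalesMapper.class); // hypothetical mapper class

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}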

QUESTION 12
How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?

A. Keys are presented to reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

Correct Answer: A

Section: (none)
Explanation

Explanation/Reference:
Explanation: Reducer has 3 primary phases:

1. Shuffle

The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort

The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

SecondarySort

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a
grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values
are sent in the same call to reduce.

3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

QUESTION 13
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated
files in HDFS.

A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation:
Note:
* Join algorithms in MapReduce:
A) Reduce-side join
B) Map-side join
C) In-memory join (striped variant, memcached variant)

* Which join to use?
In-memory join > map-side join > reduce-side join.
* Limitations of each:
In-memory join: limited by memory.
Map-side join: requires a specific sort order and partitioning.
Reduce-side join: general purpose.
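As a sketch of the reduce-side join idea listed above (assumptions: the new mapreduce API, the join key is the first CSV column, and the source table can be derived from the input file name), here is a mapper that tags each record with its table of origin so the reducer can join rows sharing a key:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags every CSV record with its source table ("A" or "B") so that the
// reducer receives all tagged rows sharing a join key and can emit the
// joined combinations.
public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text joinKey = new Text();
    private final Text taggedRow = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",", 2);
        if (fields.length < 2) {
            return; // skip malformed rows
        }
        // Assumption: the join key is the first CSV column and the source
        // table can be told from the input file name.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String table = fileName.startsWith("tableA") ? "A" : "B";
        joinKey.set(fields[0]);
        taggedRow.set(table + "\t" + fields[1]);
        context.write(joinKey, taggedRow);
    }
}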

QUESTION 14
Which best describes when the reduce method is first called in a MapReduce job?

A. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what
percentage of the intermediate data should arrive before the reduce method begins.
B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all
intermediate data has been copied and sorted.
C. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the
intermediate key-value pairs start to arrive.

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation:

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers , When is the reducers are started in a MapReduce job?

QUESTION 15

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your
class implement?

A. Combiner <Text, IntWritable, Text, IntWritable>


B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
selected answer is right.
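For illustration, a minimal old-API (org.apache.hadoop.mapred) sketch matching answer D; there is no separate Combiner interface, so a combiner is written as a Reducer whose input and output types match (the class name SumCombiner is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// A combiner is simply a Reducer whose input and output key/value types match.
public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

It would be registered in the driver with conf.setCombinerClass(SumCombiner.class).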

QUESTION 16
You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper
contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four
control characters.

Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:

A. You will have forty-eight failed task attempts


B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts

Correct Answer: E
Section: (none)
Explanation

Explanation/Reference:
Explanation: There will be four failed task attempts for each of the five file splits.


QUESTION 17
You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the
DistributedCache and read it in your Mapper before any records are processed.

Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?

A. combine
B. map
C. init
D. configure

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation:

Reference: org.apache.hadoop.filecache , Class DistributedCache
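A hedged sketch (old mapred API; the tab-delimited file layout and class name are assumptions) of a mapper that loads a lookup file from the DistributedCache in configure(), i.e. before any map() calls:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapSideJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    // configure() runs once per task, before any records are processed,
    // which is why the associative array is populated here.
    @Override
    public void configure(JobConf job) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumed tab-delimited lookup file
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Failed to load cached lookup file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String joined = lookup.get(value.toString());   // join against the cached data
        if (joined != null) {
            output.collect(value, new Text(joined));
        }
    }
}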

QUESTION 18
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly
distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a
potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the
network?

A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable

E. InputFormat
F. Combiner

Correct Answer: F
Section: (none)
Explanation

Explanation/Reference:
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on
individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your
reducer code as a combiner if the operation performed is commutative and associative.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my
MapReduce Job?

QUESTION 19
You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce, but you also want to
give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming
language like Python. Which format should you use to store this data in HDFS?

A. SequenceFiles
B. Avro
C. JSON
D. HTML
E. XML
F. CSV

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation:
Reference: Hadoop binary files processing introduced by image duplicates finder

QUESTION 20
Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job
is TextInputFormat. Determine how many Mappers will run?

A. 64

B. 100
C. 200
D. 640

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:

Explanation: Each file would be split into two as the block size (64 MB) is less than the file size (100 MB), so 200 mappers would be running.

Note:
If you're not compressing the files then Hadoop will process your large files (say 10 GB) with a number of mappers related to the block size of the file.

Say your block size is 64 MB, then you will have ~160 mappers processing this 10 GB file (160 * 64 MB ~= 10 GB). Depending on how CPU-intensive your mapper
logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work
done by each mapper (by increasing the block size to 128, 256 or 512 MB - the actual size depends on how you intend to process the data).

Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-input-files-size (first answer, second paragraph)

QUESTION 21
When is the earliest point at which the reduce method of a given Reducer can be called?

A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying
intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the
mappers have finished.

Note: The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen
while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good
thing if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data. Another job that starts later that will actually
use the reduce slots now can't use them.

You can customize when the reducers startup by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of
1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the
reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis.

Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog
up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, doing 0.1 would probably be appropriate.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When is the reducers are started in a MapReduce job?
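For illustration, a minimal sketch of setting the slowstart threshold described above programmatically on a per-job basis (the class name is illustrative; the same value can also be set in mapred-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class SlowstartExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SlowstartExample.class);

        // Do not launch reducers (copy phase included) until 90% of the map
        // tasks have completed; 1.0f would wait for all mappers, 0.0f would
        // start the reducers immediately.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.90f);
    }
}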

QUESTION 22
Which describes how a client reads a file from HDFS?

A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data
directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data
directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes
respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the
data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is
transferred from the DataNode to the NameNode, and then from the NameNode to the client.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation:

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client communicates with HDFS?
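A minimal sketch of the read path in answer A using the FileSystem client API (class name illustrative; the file path is taken from the command line):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for the block locations; the returned
        // stream then reads block data directly from the DataNodes.
        FSDataInputStream in = fs.open(new Path(args[0]));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}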

QUESTION 23
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?

A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.

C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: The mapper output (intermediate data) is stored on the Local file system (NOT HDFS) of each individual mapper nodes. This is typically a
temporary directory location which can be setup in config by the hadoop administrator. The intermediate data is cleaned up after the Hadoop Job
completes.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output (intermediate kay-value data) stored ?

QUESTION 24
You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of
200 web servers hosting your website. How will you gather this data for your analysis?

A. Ingest the server web logs into HDFS using Flume.


B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reduces.
C. Import all users' clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
answer is to the point.

QUESTION 25
MapReduce v2 (MRv2/YARN) is designed to address which two issues?

A. Single point of failure in the NameNode.


B. Resource pressure on the JobTracker.
C. HDFS latency.

D. Ability to run frameworks other than MapReduce, such as MPI.
E. Reduce complexity of the MapReduce APIs.
F. Standardize on a single MapReduce API.

Correct Answer: BD
Section: (none)
Explanation

Explanation/Reference:
Explanation: MRv2/YARN splits the JobTracker's responsibilities between a cluster-wide ResourceManager and per-application ApplicationMasters, relieving the resource pressure on the JobTracker, and its generic container model allows frameworks other than MapReduce (such as MPI) to run on the cluster. The NameNode single point of failure is an HDFS issue addressed separately by HDFS High Availability.
Reference: Apache Hadoop YARN Concepts & Applications

QUESTION 26
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided
to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.

Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.

A. hadoop "mapred.job.name=Example" MyDriver input output


B. hadoop MyDriver mapred.job.name=Example input output
C. hadoop MyDriver -D mapred.job.name=Example input output
D. hadoop setproperty mapred.job.name=Example MyDriver input output
E. hadoop setproperty ("mapred.job.name=Example") MyDriver input output

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: Configure the property using the -D key=value notation:

-D mapred.job.name='My Job'
You can list a whole bunch of options by calling the streaming jar with just the -info argument

Reference: Python hadoop streaming : Setting a job name
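A hedged sketch of the Driver described in the question (class name MyDriver as in the options; the job-building code is omitted). ToolRunner runs GenericOptionsParser, which consumes -D mapred.job.name=Example before run() is invoked:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // -D mapred.job.name=Example (passed on the command line) has already
        // been applied to this Configuration by GenericOptionsParser.
        System.out.println("Job name: " + conf.get("mapred.job.name"));
        // ... build and submit the job here using conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Invoked as: hadoop MyDriver -D mapred.job.name=Example input output
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}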

QUESTION 27
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values
representing product identifiers (Text).

Identify what determines the data types used by the Mapper for a given job.

A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
B. The data types specified in HADOOP_MAP_DATATYPES environment variable
C. The mapper-specification.xml file submitted with the job determine the mapper's input key and value types.
D. The InputFormat used by the job determines the mapper's input key and value types.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as
(LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.

Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass().
By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and
setMapOutputValueClass() methods of the JobConf class will override these.

Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD

QUESTION 28
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?

A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation:
Reference: Apache Hadoop YARN Concepts & Applications

QUESTION 29
Which best describes how TextInputFormat processes input files and line breaks?

A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken
line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Reference: How Map and Reduce operations are actually carried out

QUESTION 30
For each input key-value pair, mappers can emit:

A. As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.

Correct Answer: E
Section: (none)
Explanation

Explanation/Reference:
Explanation: Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the
same type as the input records. A given input pair may map to zero or many output pairs.

Reference: Hadoop Map-Reduce Tutorial

QUESTION 31
You have the following key-value pairs as output from your Map task:

(the, 1)

(fox, 1)

(faster, 1)

(than, 1)

(the, 1)

(dog, 1)

How many keys will be passed to the Reducer's reduce method?

A. Six
B. Five
C. Four
D. Two
E. One
F. Three

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation: The two (the, 1) pairs are grouped under a single key, so five distinct keys (the, fox, faster, than, dog) are passed to the reduce method.

QUESTION 32
Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?

A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any
executable or script as the mapper and/or the reducer.

Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second sentence)

QUESTION 33
How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?

A. Keys are presented to reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation: Reducer has 3 primary phases:

1. Shuffle

The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort

The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

SecondarySort

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a
grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values
are sent in the same call to reduce.

3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

QUESTION 34
Assuming default settings, which best describes the order of data provided to a reducer's reduce method:

A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: Reducer has 3 primary phases:

1. Shuffle

The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort

The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

SecondarySort

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a
grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values
are sent in the same call to reduce.

3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

QUESTION 35
You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How
will you obtain these user records?

A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation:

Reference: Hadoop and Pig for Large-Scale Web Log Analysis

QUESTION 36
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce
detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously and
the results of whichever task finishes first are used. This is called:

A. Combine
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution

Correct Answer: E
Section: (none)
Explanation

Explanation/Reference:
Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow
nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of
all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer
than all the other nodes. By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the
Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in
machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks
across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce
this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop
tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed
successfully, first.

Reference: Apache Hadoop, Module 4: MapReduce

Note:

* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.

Failed tasks are tasks that error out.

* There are a few reasons Hadoop can kill tasks on its own:

a) Task does not report progress during timeout (default is 10 minutes)

b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).

c) Speculative execution causes results of task not to be needed since it has completed on other place.

Reference: Difference failed tasks vs killed tasks

QUESTION 37
For each intermediate key, each reducer task can emit:

A. As many final key-value pairs as desired. There are no restrictions on the types of those key- value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.

Correct Answer: C

Section: (none)
Explanation

Explanation/Reference:
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce

QUESTION 38
What data does a Reducer reduce method process?

A. All the data in a single input file.


B. All data produced by a single mapper.
C. All data for a given key, regardless of which mapper(s) produced it.
D. All data for a given value, regardless of which mapper(s) produced it.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines
these values together, returning a single output value.

All values with the same key are presented to a single reduce task.

Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce

QUESTION 39
All keys used for intermediate output from mappers must:

A. Implement a splittable compression algorithm.


B. Be a subclass of FileInputFormat.
C. Implement WritableComparable.
D. Override isSplitable.
E. Implement a comparator for speedy sorting.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key,

value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes
have to implement the WritableComparable interface to facilitate sorting by the framework.

Reference: MapReduce Tutorial
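For illustration, a minimal sketch of a custom key type implementing WritableComparable, as the explanation requires for intermediate keys (the class name YearKey is illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A simple custom key: serializable via write()/readFields() and sortable
// via compareTo(), which is what the shuffle/sort phase requires.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public YearKey() { }                      // required no-arg constructor

    public YearKey(int year) { this.year = year; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
    }

    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
    }

    public int compareTo(YearKey other) {
        return (year < other.year) ? -1 : (year == other.year ? 0 : 1);
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof YearKey) && ((YearKey) o).year == this.year;
    }

    @Override
    public int hashCode() { return year; }    // used by the default HashPartitioner
}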

QUESTION 40
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open
map task slot.

What determines how the JobTracker assigns each map task to a TaskTracker?

A. The amount of RAM installed on the TaskTracker node.


B. The amount of free disk space on the TaskTracker node.
C. The number and speed of CPU cores on the TaskTracker node.
D. The average system load on the TaskTracker node over the past fifteen (15) minutes.
E. The location of the InputSplit to be processed in relation to the location of the node.

Correct Answer: E
Section: (none)
Explanation

Explanation/Reference:
Explanation: The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still
alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work
can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on
the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How JobTracker schedules a task?

QUESTION 41
Identify which best defines a SequenceFile?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects


B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number key-value pairs. Each key must be the same type. Each value must be the same
type.

Correct Answer: D

Section: (none)
Explanation

Explanation/Reference:
Explanation: SequenceFile is a flat file consisting of binary key/value pairs.

There are 3 different SequenceFile formats:

Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

Reference: http://wiki.apache.org/hadoop/SequenceFile
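A minimal sketch of writing such a file with one key type and one value type, per answer D (the class name and record contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);

        // All keys are Text and all values are IntWritable: one key type and
        // one value type, recorded in the file header and encoded in binary.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        try {
            for (int i = 0; i < 100; i++) {
                writer.append(new Text("record-" + i), new IntWritable(i));
            }
        } finally {
            writer.close();
        }
    }
}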

QUESTION 42
A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which best describes the file access rules in HDFS if the file
has a single block that is stored on data nodes A, B and C?

A. The file will be marked as corrupted if data node B fails during the creation of the file.
B. Each data node locks the local file to prohibit concurrent readers and writers of the file.
C. Each data node stores a copy of the file in the local file system with the same name as the HDFS file.
D. The file can be accessed if at least one of the data nodes storing the file is available.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: HDFS keeps three copies of a block on three different datanodes to protect against true data corruption. HDFS also tries to distribute these
three replicas on more than one rack to protect against data availability issues. The fact that HDFS actively monitors any failed datanode(s) and upon
failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes is sufficient to avoid
corrupted files.
Note:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file
except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per
file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later.
Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a
rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS: 2 copies are stored on DataNodes on
the same rack and the 3rd copy on a different rack.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers , How the HDFS Blocks are replicated?

QUESTION 43
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?

A. You will not be able to compress the intermediate data.


B. You will no longer be able to take advantage of a Combiner.
C. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
D. There are no concerns with this approach. It is always advisable to use multiple reducers.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: Multiple reducers and total ordering

If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or
because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to
distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output
file. To do this you'll need total ordering.

Reference: Sorting text files with MapReduce

QUESTION 44
Given a directory of files with the following structure: line number, tab character, string:

Example:

1abialkjfjkaoasdfjksdlkjhqweroij

2kadfjhuwqounahagtnbvaswslmnbfgy

3kjfteiomndscxeqalkzhtopedkfsikj

You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?

A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat

D. BDBInputFormat

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation:
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop
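The intended class is most likely KeyValueTextInputFormat (org.apache.hadoop.mapred), which splits each line at the first tab into a Text key and a Text value. A minimal old-API driver fragment (the class name LineNumberDriver is illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class LineNumberDriver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(LineNumberDriver.class);

        // Each line is split at the first tab: the line number becomes the
        // Text key and the remaining string becomes the Text value, so every
        // line reaches the mapper as exactly one record.
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
    }
}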

QUESTION 45
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated
files in HDFS.

A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation:
Note:
* Join algorithms in MapReduce:
A) Reduce-side join
B) Map-side join
C) In-memory join (striped variant, memcached variant)

* Which join to use?
In-memory join > map-side join > reduce-side join.
* Limitations of each:
In-memory join: limited by memory.
Map-side join: requires a specific sort order and partitioning.
Reduce-side join: general purpose.

QUESTION 46
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed

as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?

A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable
before you submit your job.
B. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment
variable to its location.
C. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:

Explanation: The usage of the jar command is:

Usage: hadoop jar <jar> [mainClass] args...

If you want the commons-math3.jar to be available to all the tasks you can do either of these:
1. Copy the jar file into the $HADOOP_HOME/lib dir, or
2. Use the generic option -libjars.

QUESTION 47
You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:

output.collect(new Text("Apple"), new Text("Red"));

output.collect(new Text("Banana"), new Text("Yellow"));

output.collect(new Text("Apple"), new Text("Yellow"));

output.collect(new Text("Cherry"), new Text("Red"));

output.collect(new Text("Apple"), new Text("Green"));

How many times will the Reducer's reduce method be invoked?

A. 6
B. 3
C. 1

D. 0
E. 5

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation: reduce() gets called once for each [key, (list of values)] pair. To explain, let's say you called:
out.collect(new Text("Car"),new Text("Subaru");
out.collect(new Text("Car"),new Text("Honda");
out.collect(new Text("Car"),new Text("Ford");
out.collect(new Text("Truck"),new Text("Dodge");
out.collect(new Text("Truck"),new Text("Chevy");
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)

Reference: Mapper output.collect()?

QUESTION 48
To process input key-value pairs, your mapper needs to load a 512 MB data file into memory. What is the best way to accomplish this?

A. Serialize the data file, insert in it the JobConf object, and read the data into memory in the configure method of the mapper.
B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
C. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
D. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: Place the file in the DistributedCache and read it into memory in the mapper's configure method, which runs once per task before any input records are processed. There is no DataCache class in Hadoop.

QUESTION 49
In a MapReduce job, the reducer receives all values associated with same key. Which statement best describes the ordering of these values?

A. The values are in sorted order.


B. The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.
C. The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.
D. Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation:
Note:
* Input to the Reducer is the sorted output of the mappers.
* The framework calls the application's Reduce function once for each unique key in the sorted order.
* Example:
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

QUESTION 50
You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each
value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an
IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be
bottlenecks?

A. Processor and network I/O


B. Disk I/O and network I/O
C. Processor and RAM
D. Processor and disk I/O

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation: Emitting one key-value pair per input character produces far more intermediate data than input data. That intermediate data must be spilled and sorted on local disk and then shuffled across the network to the reducers, so disk I/O and network I/O become the bottlenecks.
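
For illustration, a mapper of the kind the question describes might look like the sketch below (the class name is assumed, and IntWritable is used for the count). Every input line fans out into one (character, 1) pair per character, which is why the intermediate data, and with it disk and network I/O, grows so quickly.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CharFrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text character = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    // One output record per input character: the intermediate data volume is
    // proportional to the number of characters, not the number of lines.
    for (int i = 0; i < line.length(); i++) {
      character.set(String.valueOf(line.charAt(i)));
      context.write(character, ONE);
    }
  }
}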

QUESTION 51
Analyze each scenario below and identify which best describes the behavior of the default partitioner.

A. The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
B. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the
key space.
C. The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is
assigned to a specific reducer.
D. The default partitioner computes the hash of the key and takes that value modulo the number of reducers. The result determines the reducer
assigned to process the key-value pair.
E. The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the
reducer assigned to process the key-value pair.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: The default partitioner computes a hash value for the key and assigns the partition based on this result.

The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key objects modulo the number of partitions total
to determine which partition to send a given (key, value) pair to.

In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to determine which partition (and thus which reducer) the record
belongs in. The number of partitions is then equal to the number of reduce tasks for the job.

Reference: Getting Started With (Customized) Partitioning
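
The behavior described in answer D can be re-sketched as a custom Partitioner (a hedged approximation of the well-known HashPartitioner logic, not code copied from any particular Hadoop release):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {

  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Hash the key, force the result non-negative, then take it modulo the number of
    // reducers; every record with the same key lands in the same partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

The bit-mask simply keeps the hash non-negative before the modulo, so records sharing a key always go to the same reducer.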

QUESTION 52
You need to move a file titled "weblogs" into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes.
Which action should you take to relieve this situation and store more files in HDFS?

A. Increase the block size on all current files in HDFS.


B. Increase the block size on your remaining files.
C. Decrease the block size on your remaining files.
D. Increase the amount of memory for the NameNode.
E. Increase the number of disks (or size) for the NameNode.
F. Decrease the block size on all current files in HDFS.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
answer is assessed.

QUESTION 53
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression
over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset.
Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.

A. There is no difference in output between the two settings.


B. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on
HDFS.
C. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns
are stored in multiple files on HDFS.
D. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are
gathered together in one file on HDFS.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: * It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the
map-outputs before writing them out to the FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce
framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

Note:
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped
inputs.

The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status messages and

update Counters, or just indicate that they are alive.

The output of the Reducer is not sorted.
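
A hedged driver sketch that makes the difference concrete; the class names, job name and the commented-out RegexMapper reference are assumptions standing in for the regular-expression mapper the question describes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "regex grep");  // hypothetical job name
    job.setJarByClass(GrepDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    // job.setMapperClass(RegexMapper.class);  // the regex mapper described in the question

    // Zero reducers: a map-only job. Each map task writes its part-m-NNNNN file straight
    // to HDFS, so matches end up scattered across (up to) 100 output files.
    job.setNumReduceTasks(0);

    // One reducer instead: all intermediate pairs are shuffled to a single reduce task,
    // which writes every match into one part-r-00000 file.
    // job.setNumReduceTasks(1);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}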

QUESTION 54
A combiner reduces:

A. The number of values across different keys in the iterator supplied to a single reduce method call.
B. The amount of intermediate data that must be transferred between the mapper and reducer.
C. The number of input files a mapper must process.
D. The number of output files a reducer must produce.

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on individual
mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer
code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may
not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's
execution.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my
MapReduce Job?
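
As a sketch, the driver below wires a combiner into a word-count-style job using the helper classes TokenCounterMapper and IntSumReducer that ship with the Hadoop MapReduce library (the class name, job name and paths are assumptions). Because summing counts is commutative and associative, the reducer class is reused unchanged as the combiner, and the combiner shrinks the data shuffled to the reducers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenCounterMapper.class);
    // The combiner runs on map output before the shuffle; because summing counts is
    // commutative and associative, the reducer class is reused here unchanged.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}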

QUESTION 55
In a MapReduce job with 500 map tasks, how many map task attempts will there be?

A. It depends on the number of reduces in the job.


B. Between 500 and 1000.
C. At most 500.
D. At least 500.
E. Exactly 500.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:

Explanation:

From Cloudera Training Course:
/ A task attempt is a particular instance of an attempt to execute a task.
/ There will be at least as many task attempts as there are tasks.
/ If a task attempt fails, another will be started by the JobTracker.
/ Speculative execution can also result in more task attempts than completed tasks.

QUESTION 56
MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.

A. Health status checks (heartbeats)


B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks

Correct Answer: BC
Section: (none)
Explanation

Explanation/Reference:
Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/
monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is
either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

Note:
The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly)
JobTracker:

/ Monitoring the status of the cluster with respect to which nodes have which resources available.
Under YARN, this will be global.
/ Managing the parallelization execution of any specific job. Under YARN, this will be done separately for each job.

Reference: Apache Hadoop YARN Concepts & Applications

QUESTION 57
What types of algorithms are difficult to express in MapReduce v1 (MRv1)?

A. Algorithms that require applying the same mathematical function to large numbers of individual binary records.
B. Relational operations on large amounts of structured and semi-structured data.
C. Algorithms that require global, shared state.
D. Large-scale graph algorithms that require one-step link traversal.
E. Text analysis algorithms on large collections of unstructured text (e.g, Web crawls).

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: See 3) below.
Limitations of Mapreduce where not to use Mapreduce

While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found
where MapReduce is not suited, and some papers that address the limitations of MapReduce.

1. Computation depends on previously computed values


If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series
where each value is summation of the previous two values. i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single
machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire map reduce process.

2. Full-text indexing or ad hoc searching


The index generated in the Map step is one dimensional, and the Reduce step must not generate a large amount of data or there will be a serious
performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better
suited for a tool such as Lucene.

3. Algorithms depend on shared global state


Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in
MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on
the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global
synchronization in MapReduce is the barrier between the map and reduce phases of processing)

Reference: Limitations of Mapreduce where not to use Mapreduce

QUESTION 58
In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next () method return?

A. It returns a reference to a different Writable object each time.


B. It returns a reference to a Writable object from an object pool.
C. It returns a reference to the same Writable object each time, but populated with different data.
D. It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
E. It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: Calling Iterator.next() will always return the SAME EXACT instance of IntWritable, with the contents of that instance replaced with the next
value.

Reference: manupulating iterator in mapreduce
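
A short, hedged sketch of the pitfall this reuse creates (the types and class name are illustrative): if you buffer values across iterations you must copy them, because the framework keeps handing you the same instance with new contents.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BufferingReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<Text> buffered = new ArrayList<Text>();
    for (Text value : values) {
      // WRONG: buffered.add(value) would store the same reused instance repeatedly,
      // so every element would end up holding the last value seen.
      buffered.add(new Text(value));  // copy the contents before keeping a reference
    }
    for (Text value : buffered) {
      context.write(key, value);
    }
  }
}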

QUESTION 59
Table metadata in Hive is:

A. Stored as metadata on the NameNode.


B. Stored along with the data in HDFS.
C. Stored in the Metastore.
D. Stored in ZooKeeper.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: By default, Hive uses an embedded Derby database to store metadata information. The metastore is the "glue" between Hive and HDFS. It
tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.

The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called DataNucleus to convert object representations into
a relational schema and vice versa. This approach was chosen, as opposed to storing the information in HDFS, because the Metastore needs to be very
low latency. The DataNucleus layer allows many different RDBMS technologies to be plugged in.

Note:
* By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.
* features of Hive include:
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.

Reference: Store Hive Metadata into RDBMS

QUESTION 60
When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?

A. When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value and when the
reduce operation is both commutative and associative.
B. When the signature of the reduce method matches the signature of the combine method.
C. Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
D. Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
E. Never. Combiners and reducers must be implemented separately because they serve different purposes.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation: You can use your reducer code as a combiner if the operation performed is commutative and associative.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my
MapReduce Job?
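
A hedged example of a reducer that satisfies condition A (the class name is an assumption): its input and output types match and summation is commutative and associative, so it can be registered with job.setCombinerClass() without changing the final result. A reducer that computed, say, an average could not be reused this way, since an average of partial averages is not the overall average.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input types (Text, IntWritable) equal output types, and addition is commutative and
// associative, so this class is safe to use as both the combiner and the reducer.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    total.set(sum);
    context.write(key, total);
  }
}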

QUESTION 61
You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to
give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like
Python. Which format should you use to store this data in HDFS?

A. SequenceFiles
B. Avro
C. JSON
D. HTML
E. XML
F. CSV

Correct Answer: B
Section: (none)
Explanation

Explanation/Reference:
Explanation: Avro stores data in a compact, language-neutral binary format with the schema embedded in the file, so the same files written from Java MapReduce can be read directly from Python and other languages.
Reference: Hadoop binary files processing introduced by image duplicates finder

QUESTION 62
You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation
in Hadoop allows you to most closely simulate a production cluster while using a single machine?

A. Run all the nodes in your production cluster as virtual machines on your development workstation.
B. Run the hadoop command with the -jt local and the -fs file:/// options.
C. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.
D. Run simldooop, the Apache open-source software for simulating Hadoop clusters.

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:
Explanation: Running the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine is pseudo-distributed mode, which exercises the same daemons and communication paths as a real cluster and therefore most closely simulates production on one machine.

QUESTION 63
Your cluster's HDFS block size is 64MB. You have a directory containing 100 plain text files, each of which is 100MB in size. The InputFormat for your job
is TextInputFormat. Determine how many Mappers will run.

A. 64
B. 100
C. 200
D. 640

Correct Answer: C
Section: (none)
Explanation

Explanation/Reference:

Explanation: Each file would be split into two as the block size (64 MB) is less than the file size (100 MB), so 200 mappers would be running.

Note:
If you're not compressing the files then hadoop will process your large files (say 10G), with a number of mappers related to the block size of the file.

Say your block size is 64M, then you will have ~160 mappers processing this 10G file (160*64 ~= 10G). Depending on how CPU intensive your mapper
logic is, this might be an

acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each
mapper (by increasing the block size to 128, 256, 512m - the actual size depends on how you intend to process the data).

Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-input-files-size (first answer, second paragraph)

QUESTION 64

What is a SequenceFile?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.


B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number key-value pairs. Each key must be the same type. Each value must be same
type.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: SequenceFile is a flat file consisting of binary key/value pairs.

There are 3 different SequenceFile formats:

Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

Reference: http://wiki.apache.org/hadoop/SequenceFile
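
A small, hedged example of writing such a file with the classic SequenceFile.createWriter() API (the output path and record contents are made up): every appended record must use the single key type and single value type declared when the writer is created.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");  // hypothetical output path

    // Every record in the file is a (Text, IntWritable) pair: one key type, one value type.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      for (int i = 0; i < 5; i++) {
        writer.append(new Text("key-" + i), new IntWritable(i));
      }
    } finally {
      writer.close();
    }
  }
}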

QUESTION 65
Which best describes what the map method accepts and emits?

A. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.
B. It accepts a single key-value pair as input and can emit only one key-value pair as output.
C. It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

extends Object

Maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same
type as the input records. A given input pair may map to zero or many output pairs.

Reference: org.apache.hadoop.mapreduce

Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

QUESTION 66
In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?

A. mXn (i.e., m multiplied by n)


B. n
C. m
D. m+n (i.e., m plus n)
E. m^n (i.e., m to the power of n)

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation: A MapReduce job with m mappers and n reducers involves up to m * n distinct copy operations, since each mapper may have intermediate
output going to every reducer.

QUESTION 67
Workflows expressed in Oozie can contain:

A. Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions including forks, decision points, and path joins.
B. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
C. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
D. Iterative repetition of MapReduce jobs until a desired answer or state is reached.

Correct Answer: A
Section: (none)
Explanation

Explanation/Reference:
Explanation: An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed
Acyclic Graph), specifying a sequence of actions to execute. This graph is specified in hPDL (an XML Process Definition Language).

hPDL is a fairly compact language, using a limited amount of flow control and action nodes. Control nodes define the flow of execution and include the
beginning and end of a workflow (start, end and fail nodes) and mechanisms to control the workflow execution path (decision, fork and join nodes).

Note: Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:
Workflow definitions
Currently running workflow instances, including instance states and variables

Reference: Introduction to Oozie

QUESTION 68
Which best describes what the map method accepts and emits?

A. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.
B. It accepts a single key-value pair as input and can emit only one key-value pair as output.
C. It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.

Correct Answer: D
Section: (none)
Explanation

Explanation/Reference:
Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

extends Object
Maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same
type as the input records. A given input pair may map to zero or many output pairs.

Reference: org.apache.hadoop.mapreduce

Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

QUESTION 69
Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that
imported data?

A. Oozie
B. Flume

C. Pig
D. Hue
E. Hive
F. Sqoop
G. fuse-dfs

Correct Answer: F
Section: (none)
Explanation

Explanation/Reference:
Explanation:
Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:

Imports individual tables or entire databases to files in HDFS
Generates Java classes to allow you to interact with your imported data
Provides the ability to import from SQL databases straight into your Hive data warehouse

Note:
Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce
map function.

Note:
* Cloudera's Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files.
The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest
Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.

Reference: http://log.medcl.net/item/2011/08/hadoop-and-mapreduce-big-data-analytics-gartner/ (Data Movement between hadoop and relational databases, second paragraph)

QUESTION 70
You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by
the FileInputFormat.setInputPaths () command when it's given a path object representing this directory?

A. Four, all files will be processed


B. Three, the pound sign is an invalid character for HDFS file names
C. Two, file names with a leading period or underscore are ignored
D. None, the directory cannot be named jobdata
E. One, no special characters can prefix the name of an input file

Correct Answer: C

Section: (none)
Explanation

Explanation/Reference:
Explanation: FileInputFormat's default path filter treats files whose names begin with an underscore ('_') or a period ('.') as hidden and skips them, much
like Unix treats files starting with '.'.

The '#' character is allowed in HDFS file names, so only second.txt and #data.txt are processed.
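
The rule can be re-sketched as a PathFilter (a hedged approximation of the default hidden-file filtering, not code copied from FileInputFormat); a custom filter like this could also be registered with FileInputFormat.setInputPathFilter() if different behavior were wanted.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// A re-sketch of the filtering rule FileInputFormat applies by default: paths whose
// final name component starts with '_' or '.' are treated as hidden and skipped.
public class HiddenFileFilter implements PathFilter {

  @Override
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}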
