
Big Data MCQ QA Part I

Data Science and Analytics (Dr. Vishwanath Karad MIT World Peace University)



History of Hadoop
1. Point out the correct statement:
a) Hadoop is an ideal environment for extracting and transforming small volumes of data

b) Hadoop stores data in HDFS and supports data compression/decompression


c) The Giraph framework is less useful than a MapReduce job to solve graph and machine
learning problems
d) None of the mentioned
View Answer
Answer: b

Explanation: Data compression can be achieved using compression algorithms like bzip2, gzip,
LZO, etc. Different algorithms can be used in different scenarios based on their capabilities.

2. What was Hadoop written in ?


a) Java (software platform)
b) Perl
c) Java (programming language)
d) Lua (programming language)
View Answer
Answer: c

Explanation: The Hadoop framework itself is mostly written in the Java programming language,
with some native code in C and command line utilities written as shell-scripts.

3. Which of the following genres does Hadoop produce ?


a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System
View Answer
Answer: a

Explanation: The Hadoop Distributed File System (HDFS) is designed to store very large data sets
reliably, and to stream those data sets at high bandwidth to user applications.
4. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and
matrix operations.

a) Machine learning
b) Pattern recognition
c) Statistical classification



d) Artificial intelligence
View Answer
Answer: a

Explanation: The Apache Mahout project’s goal is to build a scalable machine learning tool.
Big Data
5. Point out the correct statement:
a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real time data
c) In Hadoop programming framework output files are divided into lines or records
d) None of the mentioned
View Answer
Answer: b

Explanation: Hadoop batch processes data distributed over a number of computers ranging in 100s
and 1000s.

6. Hadoop is a framework that works with a variety of related tools. Common cohorts include:

a) MapReduce, Hive and HBase


b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
View Answer
Answer: a

Explanation: To use Hive with HBase you’ll typically want to launch two clusters, one to run
HBase and the other to run Hive.

7. ________ can best be described as a programming model used to develop Hadoop-based applications that
can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned

View Answer

Answer: a

Explanation: MapReduce is a programming model and an associated implementation for



processing and generating large data sets with a parallel, distributed algorithm.
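As an illustration of the model described above (not part of the original question set), a minimal word-count Mapper written against the classic org.apache.hadoop.mapred API might look like the sketch below; the class name WordCountMapper is a placeholder chosen for this example.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (word, 1) for every word found in a line of input text.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);   // intermediate key/value pair
            }
        }
    }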

8. ________ has the world’s largest Hadoop cluster.


a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
View Answer
Answer: c

Explanation: Facebook has many Hadoop clusters, the largest among them is the one that is used
for Data warehousing.

9. Facebook Tackles Big Data With ________ based on Hadoop.


a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’
View Answer
Answer: a

Explanation: Prism automatically replicates and moves data wherever it’s needed across a vast
network of computing facilities.

10. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and
analysis of large datasets.
a) Pig Latin
b) Oozie
c) Pig
d) Hive
View Answer
Answer: c
Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high- level
language for expressing data analysis programs.

11. Point out the correct statement:


a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to



querying data
b) Hive is a relational database with SQL support
c) Pig is a relational database with SQL support
d) All of the mentioned
View Answer
Answer: a

Explanation: Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file
systems.

12. ________ hides the limitations of Java behind a powerful and concise Clojure API for Cascading.
a) Scalding
b) HCatalog
c) Cascalog
d) All of the mentioned
View Answer
Answer: c

Explanation: Cascalog also adds Logic Programming concepts inspired by Datalog. Hence the
name “Cascalog” is a contraction of Cascading and Datalog.

13. Point out the wrong statement:


a) Elastic MapReduce (EMR) is Facebook’s packaged Hadoop offering
b) Amazon Web Service Elastic MapReduce (EMR) is Amazon’s packaged Hadoop offering
c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
d) All of the mentioned
View Answer
Answer: a

Explanation: Rather than building Hadoop deployments manually on EC2 (Elastic Compute
Cloud) clusters, users can spin up fully configured Hadoop installations using simple invocation
commands, either through the AWS Web Console or through command-line tools.

14. ________ is the most popular high-level Java API in Hadoop Ecosystem
a) Scalding
b) HCatalog
c) Cascalog
d) Cascading
View Answer



Answer: d

Explanation: Cascading hides many of the complexities of MapReduce programming behind more
intuitive pipes and data flow abstractions.

15. ________ is a general-purpose computing model and runtime system for distributed data analytics.
a) Mapreduce
b) Drill
c) Oozie
d) None of the mentioned
View Answer
Answer: a

Explanation: Mapreduce provides a flexible and scalable foundation for analytics, from traditional
reporting to leading-edge machine learning algorithms.

16. The Pig Latin scripting language is not only a higher-level data flow language but also has
operators similar to :
a) SQL
b) JSON
c) XML
d) All of the mentioned
View Answer
Answer: a

Explanation: Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL
and the low-level procedural style of MapReduce.
17. ________ jobs are optimized for scalability but not latency.
a) Mapreduce
b) Drill
c) Oozie
d) Hive
View Answer
Answer: d

Explanation: Hive Queries are translated to MapReduce jobs to exploit the scalability of
MapReduce.



Introduction to Mapreduce

18. A ________ node acts as the Slave and is responsible for executing a Task assigned to it
by the JobTracker.
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
View Answer
Answer: c

Explanation: TaskTracker receives the information necessary for execution of a Task from
JobTracker, Executes the Task, and Sends the Results back to JobTracker.

19. ________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
View Answer
Answer: a

Explanation: Reduce function collates the work and resolves the results.
20. Although the Hadoop framework is implemented in Java, MapReduce applications need not
be written in:
a) Java
b) C
c) C#
d) None of the mentioned
View Answer
Answer: a

Explanation: Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce
applications (non JNI™ based).

21. ________ is a utility which allows users to create and run jobs with any executables as the mapper and/or
the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned



View Answer
Answer: b

Explanation: Hadoop streaming is one of the most important utilities in the Apache Hadoop
distribution.

22. _________ maps input key/value pairs to a set of intermediate key/value pairs.
a) Mapper
b) Reducer
c) Both Mapper and Reducer
d) None of the mentioned
View Answer
Answer: a

Explanation: Maps are the individual tasks that transform input records into intermediate records.

23. The number of maps is usually driven by the total size of:
a) inputs
b) outputs
c) tasks
d) None of the mentioned
View Answer
Answer: a

Explanation: Total size of inputs means total number of blocks of the input files.

24. ________ is the default Partitioner for partitioning key space.


a) HashPar
b) Partitioner
c) HashPartitioner
d) None of the mentioned
View Answer
Answer: c

Explanation: The default partitioner in Hadoop is the HashPartitioner which has a method called
getPartition to partition.
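For reference, the default HashPartitioner hashes the key and takes the result modulo the number of reduce tasks. A custom partitioner that mimics that behaviour could be sketched as below; the class name is illustrative only.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Mimics HashPartitioner: partition = (hash of key) mod (number of reduce tasks).
    public class HashLikePartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { }

        public int getPartition(Text key, Text value, int numReduceTasks) {
            // Mask the sign bit so the result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }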



Analyzing Data with Hadoop

25. Mapper implementations are passed the JobConf for the job via the ________ method
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned
View Answer
Answer: b

Explanation: Mapper implementations override the JobConfigurable.configure method to initialize themselves.
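A sketch of a mapper base class that overrides configure(JobConf) to read a job parameter during initialization is shown below; the property name example.min.length is hypothetical and a real mapper would also implement the Mapper interface.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // configure() is called once with the job's JobConf before any map() calls.
    public class ConfiguredMapperBase extends MapReduceBase {
        private int minLength;

        @Override
        public void configure(JobConf job) {
            // "example.min.length" is a hypothetical job property used for illustration.
            minLength = job.getInt("example.min.length", 1);
        }
    }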

26. Input to the ________ is the sorted output of the mappers.

a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned
View Answer
Answer: a

Explanation: In Shuffle phase the framework fetches the relevant partition of the output of all the
mappers, via HTTP.

27. Point out the wrong statement:


a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load
balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the
same key) in sort stage
View Answer
Answer: a

Explanation: Reducer has 3 primary phases: shuffle, sort and reduce.

28. The output of the ________ is not sorted in the Mapreduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding



d) None of the mentioned
View Answer
Answer: d

Explanation: The output of the reduce task is typically written to the FileSystem. The output of the
Reducer is not sorted.

29. Which of the following phases occur simultaneously ?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned
View Answer
Answer: a

Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being
fetched they are merged.

30. Mapper and Reducer implementations can use the ________ to report progress or just
indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
View Answer
Answer: c

Explanation: Reporter is a facility for MapReduce applications to report progress, set application-
level status messages and update Counters.
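Typical Reporter usage inside a map() method might look roughly like the sketch below; the counter group, counter name, and status text are invented for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Shows typical Reporter usage: status messages, counters, and liveness pings.
    public class ReportingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            reporter.setStatus("processing offset " + key.get());  // application-level status
            reporter.incrCounter("Example", "RecordsSeen", 1);     // update a custom counter
            reporter.progress();                                   // tell the framework we are alive
            output.collect(value, key);
        }
    }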

31. ________ is a generalization of the facility provided by the MapReduce framework to collect data output
by the Mapper or the Reducer
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
View Answer
Answer: b

Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.

32. ________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for



execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
View Answer
Answer: b

Explanation: JobConf represents a MapReduce job configuration.
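A hedged sketch of a job driver that describes a job through a JobConf is shown below; WordCountMapper and WordCountReducer stand for Mapper/Reducer implementations such as the sketches shown elsewhere in these notes, and the input/output paths come from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);            // types of the final output
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(WordCountMapper.class);    // placeholder Mapper class
            conf.setReducerClass(WordCountReducer.class);  // placeholder Reducer class
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);                        // submit and wait for completion
        }
    }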


Scaling out in Hadoop

33. ________ are highly resilient and eliminate the single-point-of-failure risk with traditional
Hadoop deployments
a) EMR
b) Isilon solutions
c) AWS
d) None of the mentioned
View Answer
Answer: b

Explanation: Isilon solutions also provide enterprise data protection and security options, including
file system auditing and data-at-rest encryption, to address compliance requirements.

34. Which is the most popular NoSQL database for scalable big data store with
Hadoop ?
a) Hbase
b) MongoDB
c) Cassandra
d) None of the mentioned
View Answer
Answer: a

Explanation: HBase is the Hadoop database: a distributed, scalable Big Data store that lets you
host very large tables — billions of rows multiplied by millions of columns — on clusters built
with commodity hardware.

35. The ________ can also be used to distribute both jars and native libraries for use in
the map and/or reduce tasks.
a) DataCache



b) DistributedData

c) DistributedCache
d) All of the mentioned
View Answer
Answer: c

Explanation: The child-jvm always has its current working directory added to the java.library.path
and LD_LIBRARY_PATH.
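A hedged sketch of shipping a jar and a native library to tasks through the DistributedCache is shown below; the HDFS paths are invented for the example and the files are assumed to already exist on HDFS.

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetupExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheSetupExample.class);
            // Hypothetical HDFS paths used only for illustration.
            DistributedCache.addFileToClassPath(new Path("/libs/helper.jar"), conf);
            DistributedCache.addCacheFile(new URI("/native/libexample.so#libexample.so"), conf);
            // The cached files become available to the map/reduce tasks at run time.
        }
    }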

36. HBase provides ________-like capabilities on top of Hadoop and HDFS.


a) TopTable
b) BigTop
c) Bigtable
d) None of the mentioned
View Answer
Answer: c

Explanation: Google Bigtable leverages the distributed data storage provided by the Google File
System.

Hadoop Streaming

37. Streaming supports streaming command options as well as ________ command options.


a) generic
b) tool
c) library
d) task
View Answer
Answer: a

Explanation: Place the generic options before the streaming options, otherwise the command will
fail.

38. Which of the following Hadoop streaming command option parameter is required
?
a) output directoryname
b) mapper executable
c) input directoryname



d) all of the mentioned
View Answer
Answer: d

Explanation: The input directory, output directory, and mapper executable are all required streaming parameters.
39. To set an environment variable in a streaming command use:
a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
View Answer
Answer: c

Explanation: Environment variables are set using the -cmdenv option.

40. The ________ class allows the Map/Reduce framework to partition the map outputs based on
certain key fields, not the whole keys.
a) KeyFieldPartitioner
b) KeyFieldBasedPartitioner
c) KeyFieldBased
d) None of the mentioned
View Answer
Answer: b

Explanation: The primary key is used for partitioning, and the combination of the primary and
secondary keys is used for sorting.

41. Which of the following class provides a subset of features provided by the
Unix/GNU Sort?
a) KeyFieldBased
b) KeyFieldComparator
c) KeyFieldBasedComparator
d) All of the mentioned
View Answer
Answer: c

Explanation: Hadoop has a library class, KeyFieldBasedComparator, that is useful for many
applications.

42. Which of the following class is provided by Aggregate package ?
a) Map



b) Reducer

c) Reduce
d) None of the mentioned
View Answer
Answer: b

Explanation: Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on over a
sequence of values

43. Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that
effectively allows you to process text data like the unix ________ utility.
a) Copy
b) Cut
c) Paste
d) Move
View Answer
Answer: b

Explanation: The map function defined in the class treats each input key/value pair as a list of
fields.

Introduction to HDFS

44. A ________ serves as the master and there is only one NameNode per cluster.
a) Data Node
b) NameNode
c) Data block
d) Replication
View Answer
Answer: b

Explanation: All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.

45. Point out the correct statement:


a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks
b) Each incoming file is broken into 32 MB by default



c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of
fault tolerance
d) None of the mentioned
View Answer
Answer: a

Explanation: There can be any number of DataNodes in a Hadoop Cluster.

46. ________ NameNode is used when the Primary NameNode goes down.


a) Rack
b) Data
c) Secondary
d) None of the mentioned
View Answer
Answer: c

Explanation: Secondary namenode is used for all time availability and reliability.

47. Point out the wrong statement:


a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
View Answer
Answer: d

Explanation: It is the NameNode, not the DataNode, that is aware of the files to which the blocks belong.

48. Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
View Answer
Answer: a

Explanation: HDFS can be used for storing archive data since it is cheaper as HDFS allows storing



the data on low cost commodity hardware while ensuring a high degree of fault tolerance.

49. The need for data replication can arise in various scenarios like:
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
View Answer
Answer: d

Explanation: Data is replicated across different DataNodes to ensure a high degree of fault-
tolerance.

50. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
View Answer
Answer: a

Explanation: A DataNode stores data in the Hadoop File System (HDFS). A functional filesystem has more
than one DataNode, with data replicated across them.

51. HDFS provides a command line interface called ________ used to interact with
HDFS.
a) “HDFS Shell”
b) “FS Shell”
c) “DFS Shell”
d) None of the mentioned
View Answer
Answer: b

Explanation: The File System (FS) shell includes various shell-like commands that directly interact
with the Hadoop Distributed File System (HDFS).
52. HDFS is implemented in ________ programming language.
a) C++
b) Java
c) Scala
d) None of the mentioned



View Answer
Answer: b

Explanation: HDFS is implemented in Java and any computer which can run Java can host a
NameNode/DataNode on it.

Java Interface
53. The output of the reduce task is typically written to the FileSystem via :
a) OutputCollector
b) InputCollector
c) OutputCollect
d) All of the mentioned
View Answer
Answer: a

Explanation: In reduce phase the reduce(Object, Iterator, OutputCollector, Reporter) method is


called for each pair in the grouped inputs.
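A minimal summing reducer using that signature might look like the sketch below; the class name WordCountReducer is a placeholder.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sums the values for each key and writes the total through the OutputCollector.
    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));  // typically written to the FileSystem
        }
    }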

54. Applications can use the ________ provided to report progress or just indicate that they
are alive.
a) Collector
b) Reporter
c) Dashboard
d) None of the mentioned
View Answer
Answer: b

Explanation: In scenarios where the application takes a significant amount of time to process
individual key/value pairs, this is crucial since the framework might assume that the task has timed-
out and kill that task.
Data Flow

55. ________ is a programming model designed for processing large volumes of data in parallel by dividing
the work into a set of independent tasks.
a) Hive
b) MapReduce
c) Pig
d) Lucene
View Answer



Answer: b
Explanation: MapReduce is the heart of hadoop.

56. Point out the correct statement:


a) Data locality means movement of algorithm to the data instead of data to algorithm
b) When the processing is done on the data algorithm is moved across the Action Nodes rather
than data to the algorithm
c) Moving Computation is more expensive than Moving Data
d) None of the mentioned
View Answer
Answer: a
Explanation: Data flow framework possesses the feature of data locality.

57. The daemons associated with the MapReduce phase are ________ and task-trackers.
a) job-tracker
b) map-tracker
c) reduce-tracker
d) all of the mentioned
View Answer
Answer: a
Explanation: Map-Reduce jobs are submitted on job-tracker.

58. The default InputFormat is ________ which treats each line of the input as a new value and
the associated key is byte offset.
a) TextFormat
b) TextInputFormat
c) InputFormat
d) All of the mentioned
View Answer
Answer: b
Explanation: A RecordReader is little more than an iterator over records, and the map task uses
one to generate record key-value pairs.

59. ________ controls the partitioning of the keys of the intermediate map-outputs.


a) Collector
b) Partitioner
c) InputFormat
d) None of the mentioned



View Answer
Answer: b
Explanation: The output of the mapper is sent to the partitioner.

60. Output of the mapper is first written on the local disk for sorting and ________ process.
a) shuffling
b) secondary sorting
c) forking
d) reducing
View Answer
Answer: a
Explanation: All values corresponding to the same key will go the same reducer.

Hadoop Archives

61. The ________ guarantees that excess resources taken from a queue will be restored to
it within N minutes of its need for them.
a) capacitor
b) scheduler
c) datanode
d) none of the mentioned
View Answer
Answer: b
Explanation: Free resources can be allocated to any queue beyond its guaranteed capacity.
62. ________ is a pluggable Map/Reduce scheduler for Hadoop which provides a way to share large clusters.
a) Flow Scheduler
b) Data Scheduler
c) Capacity Scheduler
d) None of the mentioned
View Answer
Answer: c
Explanation: The Capacity Scheduler supports multiple queues, where a job is submitted to a
queue.

Data Integrity

63. The __________ machine is a single point of failure for an HDFS cluster.



a) DataNode
b) NameNode
c) ActionNode
d) All of the mentioned
View Answer
Answer: b
Explanation: If the NameNode machine fails, manual intervention is necessary. Currently,
automatic restart and failover of the NameNode software to another machine is not supported.

64. The ________ and the EditLog are central data structures of HDFS.
a) DsImage
b) FsImage
c) FsImages
d) All of the mentioned
View Answer
Answer: b
Explanation: A corruption of these files can cause the HDFS instance to be non-functional.

65. Point out the wrong statement:


a) HDFS is designed to support small files only
b) Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get
updated synchronously
c) NameNode can be configured to support maintaining multiple copies of the Fslmage and
EditLog
d) None of the mentioned
View Answer
Answer: a
Explanation: HDFS is designed to support very large files.

66. HDFS, by default, replicates each data block ________ times on different nodes and on at
least ________ racks.
a) 3,2
b) 1,2
c) 2,3
d) All of the mentioned
View Answer
Answer: a
Explanation: HDFS has a simple yet robust architecture that was explicitly designed for data



reliability in the face of faults and failures in disks, nodes and networks.

67. The HDFS file system is temporarily unavailable whenever the HDFS ________ is down.
a) DataNode
b) NameNode
c) ActionNode
d) None of the mentioned
View Answer
Answer: b
Explanation: When the HDFS NameNode is restarted it recovers its metadata.

Serialization

68. Apache ________ is a serialization framework that produces data in a compact binary


format.
a) Oozie
b) Impala
c) kafka
d) Avro
View Answer
Answer: d
Explanation: Apache Avro doesn’t require proxy objects or code generation.

69. Point out the wrong statement:


a) Java code is used to deserialize the contents of the file into objects
b) Avro allows you to use complex data structures within Hadoop MapReduce jobs
c) The m2e plug-in automatically downloads the newly added JAR files and their dependencies
d) None of the mentioned
View Answer
Answer: d
Explanation: A unit test is useful because you can make assertions to verify that the values of the
deserialized object are the same as the original values.

70. The ________ class extends and implements several Hadoop-supplied interfaces.


a) AvroReducer
b) Mapper
c) AvroMapper



d) None of the mentioned
View Answer
Answer: c
Explanation: AvroMapper is used to provide the ability to collect or map data.

71. The _______ method in the ModelCountReducer class “reduces” the values the mapper
collects into a derived value
a) count
b) add
c) reduce
d) all of the mentioned
View Answer
Answer: c
Explanation: In some cases, it can be a simple sum of the values.
Metrics in Hbase

72. ________ can change the maximum number of cells of a column family.


a) set
b) reset
c) alter
d) select
View Answer
Answer: c
Explanation: Alter is the command used to make changes to an existing table.

73. Which of the following is not a table scope operator ?


a) MEMSTORE_FLUSH
b) MEMSTORE_FLUSHSIZE
c) MAX_FILESIZE
d) All of the mentioned
View Answer
Answer: a
Explanation: Using alter, you can set and remove table scope operators such as MAX_FILESIZE,
READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.

74. You can delete a column family from a table using the ________ method of
HBAseAdmin class.
a) delColumn()



b) removeColumn()
c) deleteColumn()
d) all of the mentioned
View Answer
Answer: c
Explanation: Alter command also can be used to delete a column family.

75. Point out the wrong statement:


a) To read data from an HBase table, use the get() method of the HTable class
b) You can retrieve data from the HBase table using the get() method of the HTable class
c) While retrieving data, you can get a single row by id, or get a set of rows by a set of row ids, or
scan an entire table or a subset of rows
d) None of the mentioned
View Answer
Answer: d
Explanation: You can retrieve HBase table data using the add method variants in Get class.
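A hedged sketch of a single-row read with the classic HBase client API is shown below; the table, row key, column family, and qualifier names are invented for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseGetExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "employee");   // hypothetical table name
            Get get = new Get(Bytes.toBytes("row1"));      // row key to fetch
            get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
            table.close();
        }
    }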

76. HBase uses the ________ File System to store its data.


a) Hive
b) Impala
c) Hadoop
d) Scala
View Answer
Answer: c
Explanation: The data storage will be in the form of regions (tables). These regions will be split up
and stored in region servers.

Mapreduce Development-2

77. Point out the correct statement:


a) Mapper maps input key/value pairs to a set of intermediate key/value pairs
b) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
c) Mapper and Reducer interfaces form the core of the job
d) None of the mentioned
View Answer
Answer: d



Explanation: The transformed intermediate records do not need to be of the same type as the input
records.

MapReduce Features-1

78. Which of the following is the default Partitioner for Mapreduce ?


a) MergePartitioner
b) HashedPartitioner
c) HashPartitioner
d) None of the mentioned
View Answer
Answer: c
Explanation: The total number of partitions is the same as the number of reduce tasks for the job.

79. Which of the following partitions the key space ?


a) Partitioner
b) Compactor
c) Collector
d) All of the mentioned
View Answer
Answer: a
Explanation: Partitioner controls the partitioning of the keys of the intermediate map-outputs.

80. ________ is a generalization of the facility provided by the MapReduce framework to collect data output
by the Mapper or the Reducer
a) OutputCompactor
b) OutputCollector
c) InputCollector
d) All of the mentioned
View Answer
Answer: b
Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.

81. Point out the wrong statement:


a) It is legal to set the number of reduce-tasks to zero if no reduction is desired
b) The outputs of the map-tasks go directly to the FileSystem
c) The Mapreduce framework does not sort the map-outputs before writing them out to the
FileSystem



d) None of the mentioned
View Answer
Answer: d
Explanation: Outputs of the map-tasks go directly to the FileSystem, into the output path set by
setOutputPath(Path).
82. ________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for
execution.
a) JobConfig
b) JobConf
c) JobConfiguration
d) All of the mentioned
View Answer
Answer: b
Explanation: JobConf is typically used to specify the Mapper, combiner (if any), Partitioner,
Reducer, InputFormat, OutputFormat and OutputCommitter implementations.

83. Maximum virtual memory of the launched child-task is specified using:
a) mapv
b) mapred
c) mapvim
d) All of the mentioned
View Answer
Answer: b
Explanation: Admins can also specify the maximum virtual memory of the launched childtask, and
any sub-process it launches recursively, using mapred.

84. ________ is the percentage of memory relative to the maximum heapsize in which map outputs may be
retained during the reduce.
a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
View Answer
Answer: b
Explanation: When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.

MapReduce Features-2

85. ________ specifies the number of segments on disk to be merged at the same time.



a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
View Answer
Answer: d
Explanation: io.sort.factor limits the number of open files and compression codecs during the
merge.

86. Point out the correct statement:


a) The number of sorted map outputs fetched into memory before being merged to disk
b) The memory threshold for fetched map outputs before an in-memory merge is finished
c) The percentage of memory relative to the maximum heapsize in which map outputs may not be
retained during the reduce
d) None of the mentioned
View Answer
Answer: a
Explanation: When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.

87. Jobs can enable task JVMs to be reused by specifying the job configuration :
a) mapred.job.recycle.jvm.num.tasks
b) mapissue.job.reuse.jvm.num.tasks
c) mapred.job.reuse.jvm.num.tasks
d) all of the mentioned
View Answer
Answer: b
Explanation: Many of my tasks had performance improved over 50% using
mapissue.job.reuse.jvm.num.tasks.

88. During the execution of a streaming job, the names of the ________ parameters are
transformed.
a) vmap
b) mapvim
c) mapreduce
d) mapred
View Answer
Answer: d
Explanation: To get the values in a streaming job’s mapper/reducer use the parameter names with



the underscores.
Hadoop Configuration

89. Which of the following class provides access to configuration parameters ?


a) Config
b) Configuration
c) OutputConfig
d) None of the mentioned
View Answer
Answer: b
Explanation: Configurations are specified by resources.
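Reading and setting parameters through the Configuration class looks roughly like the sketch below; fs.defaultFS is a standard Hadoop key, while extra-site.xml and example.retry.count are hypothetical names used only for illustration.

    import org.apache.hadoop.conf.Configuration;

    public class ConfigurationExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();   // loads core-default.xml and core-site.xml
            conf.addResource("extra-site.xml");         // hypothetical additional resource on the classpath
            String fsUri = conf.get("fs.defaultFS", "file:///");
            conf.setInt("example.retry.count", 3);      // illustrative application property
            System.out.println("Default file system: " + fsUri);
        }
    }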

90. ________ gives site-specific configuration for a given hadoop installation.


a) core-default.xml
b) core-site.xml
c) coredefaultxml
d) all of the mentioned
View Answer
Answer: b
Explanation: core-default.xml is read-only defaults for hadoop.

Hadoop Cluster-2

91. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record
values with schema applied on read based on:
a) HCatalog
b) Hive
c) Hbase
d) All of the mentioned
Answer: a

Explanation: Other means of tagging the values also can be used.

Monitoring HDFS

92. For YARN, the ________ Manager UI provides host and port information.
a) Data Node
b) NameNode
c) Resource



d) Replication
Answer: c

Explanation: All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.

93. Point out the correct statement:


a) The Hadoop framework publishes the job flow status to an internally running web server
on the master nodes of the Hadoop cluster
b) Each incoming file is broken into 32 MB by default
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault
tolerance
d) None of the mentioned
Answer: a

Explanation: The web interface for the Hadoop Distributed File System (HDFS) shows
information about the NameNode itself.

94) ________ NameNode is used when the Primary NameNode goes down.


a) Rack
b) Data
c) Secondary
d) None of the mentioned

Answer: c
Explanation: Secondary namenode is used for all time availability and reliability.

95) Point out the wrong statement:


a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
Answer: d
Explanation: It is the NameNode, not the DataNode, that is aware of the files to which the blocks belong.

96) For ________, the HBase Master UI provides information about the HBase Master uptime.

a) HBase
b) Oozie
c) Kafka
d) All of the mentioned



Answer: a
Explanation: HBase Master UI provides information about the number of live, dead and
transitional servers, logs, ZooKeeper information, debug dumps, and thread stacks.

97) Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
Answer: a
Explanation: HDFS can be used for storing archive data since it is cheaper as HDFS allows storing
the data on low cost commodity hardware while ensuring a high degree of fault tolerance.

98)The need for data replication can arise in various scenarios like :
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
Answer: d

Explanation: Data is replicated across different DataNodes to ensure a high degree of fault-
tolerance.

99) During start up, the ________ loads the file system state from the fsimage and the
edits log file.
a) DataNode
b) NameNode
c) ActionNode
d) None of the mentioned
Answer: b

Explanation: HDFS is implemented in Java; any computer which can run Java can host a
NameNode/DataNode on it.
HDFS Maintenance

100)Which of the following is a common hadoop maintenance issue ?


a) Lack of tools
b) Lack of configuration management
c) Lack of web interface
d) None of the mentioned
Answer: b

Explanation: Without a centralized configuration management framework, you end up with a


number of issues that can cascade just as your usage picks up.



101) ________ Manager’s Service feature monitors dozens of service health and performance metrics about
the services and role instances running on your cluster.
a) Microsoft
b) Cloudera
c) Amazon
d) None of the mentioned
Answer: b

Explanation: Manager’s Service feature presents health and performance data in a variety of
formats.

Introduction to Pig

102) Pig operates in mainly how many modes ?


a) Two
b) Three
c) Four
d) Five
Answer: a

Explanation: You can run Pig (execute Pig Latin statements and Pig commands) using various
modes: Interactive Mode and Batch Mode.

103)Point out the correct statement:


a) You can run Pig in either mode using the “pig” command
b) You can run Pig in batch mode using the Grunt shell
c) You can run Pig in interactive mode using the FS shell
d) None of the mentioned
Answer: a

Explanation: You can run Pig in either mode using the “pig” command (the bin/pig Perl script) or
the “java” command (java -cp pig.jar ...)

104) You can run Pig in batch mode using ________.


a) Pig shell command
b) Pig scripts
c) Pig options
d) All of the mentioned
Answer: b
Explanation: Pig script contains Pig Latin statements.

105)Pig Latin statements are generally organized in one of the following ways :
a) A LOAD statement to read data from the file system
b) A series of “transformation” statements to process the data



c) A DUMP statement to view results or a STORE statement to save the results
d) All of the mentioned
Answer: d
Explanation: A DUMP or STORE statement is required to generate output.

106)Point out the wrong statement:


a) To run Pig in local mode, you need access to a single machine
b) The DISPLAY operator will display the results to your terminal screen
c) To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation
d) All of the mentioned
Answer: b
Explanation: The DUMP operator will display the results to your terminal screen.

107)Which of the following function is used to read data in PIG ?


a) WRITE
b) READ
c) LOAD
d) None of the mentioned
Answer: c
Explanation: PigStorage is the default load function.

108) You can run Pig in interactive mode using the ________ shell.


a) Grunt
b) FS
c) HDFS
d) None of the mentioned
Answer: a
Explanation: Invoke the Grunt shell using the “pig” command (as shown below) and then enter
your Pig Latin statements and Pig commands interactively at the command line.

109)Which of the following is the default mode ?


a) Mapreduce
b) Tez
c) Local
d) All of the mentioned
Answer: a
Explanation: To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS
installation.

110)Which of the following will run pig in local mode ?


a) $ pig -x local...
b) $ pig -x tez local...
c) $pig ...
d) None of the mentioned



Answer: a
Explanation: Specify local mode using the -x flag (pig -x local).

111) $ pig -x tez_local... will enable ________ mode in Pig.


a) Mapreduce
b) Tez
c) Local
d) None of the mentioned
Answer: d
Explanation: Tez Local Mode is similar to local mode, except internally Pig will invoke tez
runtime engine.

Pig Latin

112) ________ operator is used to review the schema of a relation.
a) DUMP
b) DESCRIBE


c) STORE
d) EXPLAIN
Answer: b
Explanation: DESCRIBE returns the schema of a relation.

113)Point out the correct statement:


a) During the testing phase of your implementation, you can use LOAD to display results to your
terminal screen
b) You can view outer relations as well as relations defined in a nested FOREACH statement
c) Hadoop properties are interpreted by Pig
d) None of the mentioned
Answer: b
Explanation: Viewing outer relations is possible using DESCRIBE operator.

114)Which of the following operator is used to view the map reduce execution plans ?
a) DUMP
b) DESCRIBE
c) STORE
d) EXPLAIN
Answer: d
Explanation: EXPLAIN displays execution plans

User-defined functions in Pig

115)Point out the correct statement:


a) LoadMeta has methods to convert byte arrays to specific types



b) The Pig load/store API is aligned with Hadoop’s InputFormat class only
c) LoadPush has methods to push operations from Pig runtime into loader implementations
d) All of the mentioned
Answer: c
Explanation: Currently only the pushProjection() method is called by Pig to communicate to the
loader the exact fields that are required in the Pig script.

116)Which of the following has methods to deal with metadata ?


a) LoadPushDown
b) LoadMetadata
c) LoadCaster
d) All of the mentioned
Answer: b
Explanation: Most implementation of loaders don’t need to implement this unless they interact
with some metadata system.
117) A loader implementation should implement ________ if casts (implicit or explicit)
from DataByteArray fields to other types need to be supported.
a) LoadPushDown
b) LoadMetadata
c) LoadCaster
d) All of the mentioned
Answer: c
Explanation: LoadCaster has methods to convert byte arrays to specific types.

Data processing operators in Pig

118)Which of the following is shortcut for DUMP operator ?


a) \de alias
b) \d alias
c) \q
d) None of the mentioned
Answer: b
Explanation: If alias is ignored last defined alias will be used.

119)Point out the correct statement:


a) Invoke the Grunt shell using the “enter” command
b) Pig does not support jar files
c) Both the run and exec commands are useful for debugging because you can modify a Pig script
in an editor
d) All of the mentioned
Answer: c
Explanation: Both commands promote Pig script modularity as they allow you to reuse existing



components.

120)Which of the following command is used to show values to keys used in Pig ?
a) set
b) declare
c) display
d) All of the mentioned
Answer: a
Explanation: All Pig and Hadoop properties can be set, either in the Pig script or via the Grunt
command line.
121) Use the ________ command to run a Pig script that can interact with the Grunt shell
(interactive mode).
a) fetch
b) declare

c) run
d) all of the mentioned
Answer: c
Explanation: With the run command, every store triggers execution.

122)Which of the following command can be used for debugging ?

a) exec
b) execute
c) error
d) throw
Answer: a
Explanation: With the exec command, store statements will not trigger execution; rather, the entire
script is parsed before execution starts.

123)Which of the following file contains user defined functions (UDFs) ?


a) script2-local.pig
b) pig.jar
c) tutorial.jar
d) excite.log.bz2
Answer: c
Explanation: tutorial.jar contains java classes also.

124)Which of the following is correct syntax for parameter substitution using cmd ?
a) pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun] script
b) {%declare | %default} param_name param_value
c) {%declare | %default} param_name param_value cmd



d) All of the mentioned
Answer: a
Explanation: Parameter Substitution is used to substitute values for parameters at run time

Pig in practice

125) Pig Latin is ________ and fits very naturally in the pipeline paradigm while SQL is
instead declarative.
a) functional
b) procedural

c) declarative
d) all of the mentioned

Answer: b
Explanation: In SQL users can specify that data from two tables must be joined, but not what join
implementation to use.

Introduction to Hive

126)Which of the following command sets the value of a particular configuration variable (key)?
a) set -v

b) set <key>=<value>
c) set
d) reset
Answer: b
Explanation: If you misspell the variable name, the CLI will not show an error.

127)Which of the following is a command line option ?


a) -d,-define <key=value>
b) -e,-define <key=value>
c) -f,-define <key=value>
d) None of the mentioned
Answer: a
Explanation: Variable substitution to apply to hive commands, e.g. -d A=B or -define A=B

128) When $HIVE_HOME/bin/hive is run without either the -e or -f option, it enters ________ mode.

a) Batch

b) Interactive shell

c) Multiple



d) None of the mentioned

Answer: b

Explanation: Use ; (semicolon) to terminate commands when multiple options are available.

HiveQL-2

129) Hive specific commands can be run from Beeline, when the Hive ________ driver is used.
a) ODBC
b) JDBC
c) ODBC-JDBC
d) All of the Mentioned
Answer: b
Explanation: Hive specific commands are same as Hive CLI commands.

130)Which of the following data type is supported by Hive ?


a) map
b)record
c) string
d) enum
Answer: d

Explanation: Hive has no concept of enums.

Querying data with HiveQL-2

131) ________ will overwrite any existing data in the table or partition


a) INSERT WRITE
b) INSERT OVERWRITE
c) INSERT INTO
d) None of the mentioned
Answer: b
Explanation: INSERT INTO will append to the table or partition, keeping the existing data intact.

Introduction to HBase

132)Point out the correct statement:


a) HDFS provides low latency access to single rows from billions of records (Random access)
b) HBase sits on top of the Hadoop File System and provides read and write access
c) HBase is a distributed file system suitable for storing large files
d) None of the mentioned



Answer: b
Explanation: One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase.

133) Apache HBase is a non-relational database modeled after Google’s ________


a) BigTop
b) Bigtable
c) Scanner
d) FoundationDB
Answer: b
Explanation: Bigtable acts upon Google File System, likewise Apache HBase works on top of
Hadoop and HDFS.

134)Point out the wrong statement:


a) HBase provides only sequential access of data
b) HBase provides high latency batch processing
c) HBase internally provides serialized access
d) All of the mentioned
Answer: c
Explanation: HBase internally uses Hash tables and provides random access.

135) The ________ Server assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
a) Region
b) Master
c) Zookeeper
d) All of the mentioned
Answer: b
Explanation: Master Server maintains the state of the cluster by negotiating the load balancing.

Kafka with Hadoop-2

136) ________ provides the functionality of a messaging system.


a) Oozie
b) Kafka
c) Lucene
d) BigTop
Answer: b
Explanation: Kafka is a distributed, partitioned, replicated commit log service
Amazon Elastic MapReduce

137) ________ is an RPC framework that defines a compact binary serialization format used to persist data



structures for later analysis.
a) Pig
b) Hive
c) Thrift
d) None of the mentioned
Answer: c
Explanation: Thrift is an RPC framework whose compact binary serialization format is used to
persist data structures for later analysis.

138. One supported datatype that deserves special mention is:


a) money
b) counters
c) smallint
d) tinyint
View Answer
Answer: b
Explanation: Synchronization on counters are done on the Region Server, not in the client.

139. You can delete a column family from a table using the method of
HBAseAdmin class.
a) delColumn()
b) removeColumn()
c) deleteColumn()
d) all of the mentioned
View Answer
Answer: c
Explanation: Alter command also can be used to delete a column family.

140. HBase uses the ________ File System to store its data.


a) Hive
b) Impala
c) Hadoop
d) Scala
View Answer

Answer: c
Explanation: The data storage will be in the form of regions (tables). These regions will be split up
and stored in region servers.



141. Which of the following guarantees is provided by Zookeeper ?
a) Interactivity
b) Flexibility
c) Scalability
d) Reliability
View Answer
Answer: d
Explanation: Once an update has been applied, it will persist from that time forward until a client
overwrites the update.

142. A ________ server is a machine that keeps a copy of the state of the entire system
and persists this information in local log files.
a) Master
b) Region
c) Zookeeper
d) All of the mentioned
View Answer
Answer: c
Explanation: A very large Hadoop cluster can be supported by multiple ZooKeeper servers.

143. ZooKeeper’s architecture supports high __________ through redundant services.


a) flexibility
b) scalability
c) availability
d) interactivity
View Answer
Answer: c
Explanation: The clients can thus ask another ZooKeeper master if the first fails to answer.

144. ________ has a design policy of using ZooKeeper only for transient data.
a) Hive
b) Impala
c) Hbase
d) Oozie
View Answer
Answer: c
Explanation: If the HBase’s ZooKeeper data is removed, only the transient operations are affected
- data can continue to be written and read to/from HBase.



145. Zookeeper keeps track of the cluster state such as the location of the ________ table.
a) DOMAIN
b) NODE
c) ROOT
d) All of the mentioned
View Answer
Answer: c
Explanation: Zookeeper keeps track of list of online RegionServers, unassigned Regions.

146. Which of the following interface is implemented by Sqoop for recording ?


a) SqoopWrite
b) SqoopRecord
c) SqoopRead
d) None of the mentioned
View Answer
Answer: b
Explanation: SqoopRecord is the interface implemented by the classes generated by Sqoop’s
orm.ClassWriter.

147. ________ tool can list all the available database schemas.


a) sqoop-list-tables
b) sqoop-list-databases
c) sqoop-list-schema
d) sqoop-list-columns
View Answer
Answer: b
Explanation: Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).

148. ________ is a component on top of Spark Core.
a) Spark Streaming


b) Spark SQL
c) RDDs
d) All of the mentioned
View Answer
Answer: b
Explanation: Spark SQL introduces a new data abstraction called SchemaRDD, which provides
support for structured and semi-structured data.

149. Spark SQL provides a domain-specific language to manipulate ________ in Scala,


Java, or Python.



a) Spark Streaming
b) Spark SQL
c) RDDs
d) All of the mentioned
View Answer
Answer: c
Explanation: Spark SQL provides SQL language support, with command-line interfaces and
ODBC/JDBC server.

150. ________ leverages Spark Core’s fast scheduling capability to perform streaming analytics.
a) MLlib
b) Spark Streaming
c) GraphX
d) RDDs
View Answer
Answer: b
Explanation: Spark Streaming ingests data in mini-batches and performs RDD transformations on
those mini-batches of data.

151. ________ is a distributed machine learning framework on top of Spark.
a) MLlib


b) Spark Streaming
c) GraphX
d) RDDs
View Answer
Answer: a
Explanation: MLlib implements many common machine learning and statistical algorithms to
simplify large scale machine learning pipelines.
152. ________ is a distributed graph processing framework on top of Spark.
a) MLlib
b) Spark Streaming
c) GraphX
d) All of the mentioned
View Answer
Answer: c
Explanation: GraphX started initially as a research project at UC Berkeley AMPLab and
Databricks, and was later donated to the Spark project.

153. GraphX provides an API for expressing graph computation that can model the ________
abstraction.
a) GaAdt



b) Spark Core
c) Pregel
d) None of the mentioned
View Answer
Answer: c
Explanation: GraphX is used for machine learning.

154. Which of the following storage policy is used for both storage and compute ?

a) Hot
b) Cold
c) Warm
d) A11_SSD
Answer: a
Explanation: When a block is hot, all replicas are stored in DISK.

155. Which of the following is used to list out the storage policies ?
a) hdfs storagepolicies
b) hdfs storage
c) hd storagepolicies
d) all of the mentioned
Answer: a
Explanation: Arguments are none for the hdfs storagepolicies command.

156. Which of the following statement can be used to get the storage policy of a file or a directory ?
a) hdfs dfsadmin -getStoragePolicy path
b) hdfs dfsadmin -setStoragePolicy path policyName
c) hdfs dfsadmin -listStoragePolicy path policyName
d) all of the mentioned
Answer: a
Explanation: path refers to the path referring to either a directory or a file.

157 .Which of the following method is used to get user-specified job name ?
a) getJobName()
b) getJobState()
c) getPriority()
d) all of the mentioned
Answer: a
Explanation: getPriority() is used to get scheduling info of the job.

158 .The number of maps is usually driven by the total size of:
a) inputs
b) outputs
c) tasks



d) none of the mentioned
Answer: a
Explanation: Total size of inputs means total number of blocks of the input files.

159. ________ is a utility which allows users to create and run jobs with any executable as the mapper and/or
the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned
Answer: b
Explanation: Hadoop streaming is one of the most important utilities in the Apache Hadoop
distribution.

160. ________ part of the MapReduce is responsible for processing one or more chunks of data and
producing the output results.
a) Maptask
b) Mapper
c) Task execution
d) All of the mentioned
Answer: a
Explanation: Map Task in MapReduce is performed using the Map() function.

161. ________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
Answer: a
Explanation: Reduce function collates the work and resolves the results.

162. The ________ is responsible for allocating resources to the various running


applications subject to familiar constraints of capacities, queues etc.
a) Manager
b) Master
c) Scheduler
d) None of the mentioned

Answer: c
Explanation: The Scheduler is a pure scheduler in the sense that it performs no monitoring or
tracking of status for the application.



163 .Apache Hadoop YARN stands for:
a) Yet Another Reserve Negotiator
b) Yet Another Resource Network
c) Yet Another Resource Negotiator
d) All of the mentioned
Answer: c
Explanation: YARN is a cluster management technology.

164. The ________ is a framework-specific entity that negotiates resources from the


ResourceManager
a) NodeManager
b) ResourceManager
c) ApplicationMaster
d) All of the mentioned
Answer: c
Explanation: Each ApplicationMaster has responsibility for negotiating appropriate resource
containers from the Scheduler.
165. Yarn commands are invoked by the ________ script.
a) hive
b) bin
c) hadoop
d) home
Answer: b
Explanation: Running the yarn script without any arguments prints the description for all
commands.

166 .Which of the following command runs ResourceManager admin client ?


a) proxyserver
b) run
c) admin
d) rmadmin
Answer: d
Explanation: proxyserver command starts the web proxy server.

167. ________ generates keys of type LongWritable and values of type Text.


a) TextOutputFormat
b) TextInputFormat
c) OutputInputFormat
d) None of the mentioned
Answer: b
Explanation: If K2 and K3 are the same, you don’t need to call setMapOutputKeyClass().



168. An input ________ is a chunk of the input that is processed by a single map.
a) textformat
b) split
c) datanode
d) all of the mentioned
Answer: b
Explanation: Each split is divided into records, and the map processes each record—a key-value
pair—in turn.

169. Which of the following method add a path or paths to the list of inputs ?
a) setInputPaths()
b) addInputPath()
c) setInput()
d) none of the mentioned
Answer: b
Explanation: FileInputFormat offers four static convenience methods for setting a JobConf’s input
paths.
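A hedged sketch of two of those convenience methods in use is shown below; the input paths are placeholders invented for the example.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class InputPathExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(InputPathExample.class);
            // Replace the current list of inputs with these paths.
            FileInputFormat.setInputPaths(conf, new Path("/data/2020"), new Path("/data/2021"));
            // Add one more path to the existing list.
            FileInputFormat.addInputPath(conf, new Path("/data/2022"));
        }
    }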

170. The split size is normally the size of an ________ block, which is appropriate for most
applications.
a) Generic
b) Task
c) Library
d) HDFS
Answer: d
Explanation: FileInputFormat splits only large files (Here “large” means larger than an HDFS
block).

171 .Point out the correct statement:


a) The minimum split size is usually 1 byte, although some formats have a lower bound on the
split size
b) Applications may impose a minimum split size
c) The maximum split size defaults to the maximum value that can be represented by a Java long
type
d) All of the mentioned
Answer: a
Explanation: The maximum split size has an effect only when it is less than the block size, forcing
splits to be smaller than a block.

172 .To set an environment variable in a streaming command use:



a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
Answer: c
Explanation: Environment variables are set using the -cmdenv option.

173 .Point out the wrong statement:


a) Hadoop works better with a small number of large files than a large number of small files
b) CombineFileInputFormat is designed to work well with small files
c) CombineFileInputFormat does not compromise the speed at which it can process the input in a
typical MapReduce job
d) None of the mentioned
Answer: c
Explanation: If the file is very small (“small” means significantly smaller than an HDFS block)
and there are a lot of them, then each map task will process very little input, and there will be a lot
of them (one per file), each of which imposes extra bookkeeping overhead.

174. Which of the following class is provided by Aggregate package ?
a) Map


b) Reducer
c) Reduce
d) None of the mentioned
Answer: b
Explanation: Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on over a
sequence of values.

175. ________ is the output produced by TextOutputFormat, Hadoop’s default OutputFormat.


a) KeyValueTextInputFormat
b) KeyValueTextOutputFormat
c) FileValueTextInputFormat
d) All of the mentioned
Answer: b
Explanation: To interpret such files correctly, KeyValueTextInputFormat is appropriate.

176. Point out the wrong statement:
a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in the sort stage
Answer: a
Explanation: Reducer has 3 primary phases: shuffle, sort and reduce.

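A minimal Java sketch of the third phase, reduce; by the time reduce() is called, the shuffle and sort phases have already grouped all the values for a key together (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum all values that the shuffle and sort phases grouped under one key.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
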
177. The output of the ______ is not sorted in the MapReduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding
d) None of the mentioned
Answer: d
Explanation: The output of the reduce task is typically written to the FileSystem. The output of the Reducer is not sorted.

178. Which of the following phases occur simultaneously?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned
Answer: a
Explanation: The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.

179. Input to the ______ is the sorted output of the mappers.
a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned
Answer: a
Explanation: In the shuffle phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

180. Mapper implementations are passed the JobConf for the job via the ______ method.
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned
Answer: b
Explanation: The JobConfigurable.configure method is overridden so that Mapper implementations can initialize themselves.

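A hedged sketch using the old org.apache.hadoop.mapred API: MapReduceBase implements JobConfigurable, so a mapper can override configure(JobConf) to initialize itself from job settings before map() runs (the property name is hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ThresholdMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private int minLength;

    @Override
    public void configure(JobConf job) {
        minLength = job.getInt("example.min.line.length", 1); // hypothetical property
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        if (line.getLength() >= minLength) {
            output.collect(line, new IntWritable(line.getLength()));
        }
    }
}
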
BIG DATA
Q.No / Question / Options (1)-(4)

1. An open-source build tool for Scala projects is: (1) simple build tool (2) sequential build tool (3) complex build tool (4) script build tool

2. Which of these is a tuple? (1) val exam=(1,1) (2) val exam=("one","two","three") (3) val exam=(1,"element",10.2) (4) none

3. Which of the following is a transformation? (1) take(n) (2) top() (3) countByValue() (4) mapPartitionsWithIndex()

4. Which of the following is an action? (1) distinct() (2) intersection() (3) union(datasets) (4) countByValue()

5. Which of the following is good for low-level transformations and actions? (1) RDD (2) Dataframe (3) Dataset (4) none

6. Identify the scenario in which the Big Data analytics benefit is not evident: (1) Customer Segmentation (2) Fraud Identification (3) POS transaction (4) Social Marketing Optimization

7. Which unique feature of distributed computing increases processing power? (1) By adding more high-capacity disks (2) By moving the computing task into the cloud (3) By distributing the results to several users of the network (4) By distributing the computing task among several computers

8. Point out the correct statement: (1) Hadoop is an ideal environment for extracting and transforming small volumes of data (2) Hadoop stores data in HDFS and supports data compression/decompression (3) The Giraph framework is less useful than MapReduce (4) None

9. Which is not a characteristic of Big Data? (1) Volume (2) Velocity (3) Variety (4) Variable

10. Which are the 3 major parallel processing approaches? (1) IaaS, PaaS, SaaS (2) Database, SQL, Network (3) Clusters or grids, MPP, HPC (4) Network, Cloud, multitenancy

11. Which of the following genres does Hadoop produce? (1) Distributed file system (2) JAX-RS (3) Java Message Service (4) RDBMS

12. Which of the following is not a Spark SQL query execution phase? (1) Execution (2) Analysis (3) Logical optimization (4) Physical planning

13. What does the serving layer do in a Lambda-Architecture-compliant system? (1) Indexing batch views (2) Dealing with recent data (3) Managing the master dataset (4) Compensating for the high latency of the batch layer

14.

15. Hive supports complex index types: (1) True (2) False

16. What is the correct statement to access HDFS from the Hive CLI? (1) ./hdfs -ls /user/hduser/input (2) ./hadoop dfs -ls /user/hduser/input (3) hadoop -ls /user/hduser/input (4) dfs -ls /user/hduser/input

17. The CREATE statement in Hive is related to: (1) Session control statement (2) DDL statement (3) Embedded SQL statement (4) DML statement

18. Partitioning of a table in Hive creates more: (1) Files under the database name (2) Subdirectories under the database name (3) Subdirectories under the table name (4) Files under the table name

19. To list tables with the prefix 'page' in Hive, we use the syntax: (1) SHOW TABLES page_view (2) SHOW PARTITIONS page_view (3) DESCRIBE EXTENDED page_view (4) SHOW TABLES "page.*"

20. Hive is designed mainly for: (1) OLAP (2) OLTP (3) Both 1 & 2 (4) None of the listed

21. Hive is schema-on-read and not schema-on-write: (1) True (2) False

22. The underlying data is not deleted from HDFS when a Hive external table is dropped: (1) True (2) False

23. Hive supports random reads and writes: (1) True (2) False

24. Identify the correct syntax to run a Hive query: (1) hive -e "select" (2) hive -e (3) hive -f query.hql (4) hive -f "select"

25. In order to use bucketing in a Hive session, we should set the parameter SET hive.enforce.bucketing=true: (1) True (2) False

26. The default port of the Hive Thrift server is: (1) 10 (2) 70050 (3) 10000 (4) 1245

27. Create table a(a1 int, a2 int, a3 int); what will be the result of select * from a? (1) 20 5 8, 90 5 6, 20 5 8 (2) 69 26 3 (3) 69 5 6 (4) 45 9 20

28. A Cartesian product join is needed: (1) When we do not specify the key on which we want to make the join (2) When we need to access all the data from a table and when we retain all records from the table that is on the right-hand side of the join in a given query (3) When we retain all records from the table that is on the right-hand side of the join in a given query (4) When we need to access all the data from a table

29. Denormalizing data in Hive: (1) only improves the performance (2) avoids multiple disk seeks and improves the performance (3) avoids unrelated data (4) avoids multiple disk seeks

30. Hive supports external tables: (1) True (2) False

31. We can run Pig in batch mode using: (1) Pig shell (2) Pig scripts (3) Pig options (4) All the above

32. Which of the following commands will run Pig in local mode? (1) $ pig -x local (2) $ pig -x tez_local (3) $ pig (4) None

33. You can run Pig in interactive mode using the ___________ shell: (1) Grunt (2) FS (3) HDFS (4) None

34. Which of the following is a shortcut for the DUMP operator? (1) \de alias (2) \d alias (3) \q (4) None

35. Which of the following ecosystem tools provides a dataflow language to transform data? (1) Hive (2) Pig (3) Sqoop (4) Flume

36. Which of the following is not a Pig static helper function? (1) double (2) bag (3) map (4) string

37. Point out the wrong statement: (1) To run Pig in local mode, you need access to a single machine (2) The DISPLAY operator will display the results on your terminal screen (3) To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation (4) All the listed options

38. Which operator is used to read data in Pig? (1) WRITE (2) READ (3) LOAD (4) None of the given options

39. Pig is a _____ language: (1) Dataflow (2) Declarative (3) Import-export (4) Scheduling engine

40. Pig operates in how many modes? (1) 2 (2) 3 (3) 4 (4) 5

41. Pig supports the following types of joins: (1) Inner join (2) Left outer join (3) Right outer join (4) All the listed options

42. We can run Pig in interactive mode using: (1) Grunt (2) FS (3) HDFS (4) None of the options

43. Which of the following is the default mode? (1) mapreduce (2) Tez (3) Local (4) All the listed options

44. If the database contains some tables, then it can be forced to drop without dropping the tables by using the keyword: (1) restrict (2) overwrite (3) f drop (4) cascade

45. Which of the following is the default mode? (1) tez (2) mapreduce (3) local (4) All the listed above

46. Pig Latin statements are generally organized in one of: (1) A LOAD statement to read data (2) A series of transformation statements (3) A DUMP statement to view results (4) All the listed options

47. In which ecosystem tool is the schema optional? (1) Hive (2) Pig (3) Sqoop (4) Flume

48. Which daemons of Hadoop should be running while executing a Pig script in local mode? (1) None, because the script is running in local mode (2) NameNode alone (3) All the 4 daemons of Hadoop (4) DataNode

49. Point out the correct statement: (1) LoadMeta has methods to convert byte arrays to specific types (2) The Pig load/store API is aligned with the Hadoop InputFormat class (3) LoadPush has methods to push operations from the Pig runtime into loader implementations (4) All the listed options

50. Which are all the different modes available in which Pig can run? (There are two modes: interactive mode and batch mode.) (1) I and III (2) II and III (3) I and IV (4) I and II

51. The language used in Hive is Pig Latin: (1) True (2) False

52. Which of the following functions is used to read data in Pig? (1) read (2) write (3) load (4) None of the above

53. Pig can be called from Java: (1) True (2) False

54. Input to the _____ is the sorted output of the mapper: (1) Reducer (2) Mapper (3) Shuffle (4) All the listed options

55. A big-data company was running a Hadoop cluster with all the monitoring facilities properly configured. Which of the following scenarios will go undetected? (1) HDFS is almost full (2) Map or reduce tasks that are stuck in an infinite loop (3) NameNode goes down (4) MapReduce jobs that are causing excessive memory swaps

56. A mapper can communicate: (1) True (2) False

57. The primary interface for a user to describe a MapReduce job to the Hadoop framework for execution is: (1) jobconfig (2) jobconf (3) jobconfiguration (4) All of the above

58. Mapper and Reducer implementations can use the _________ to report progress or just indicate that they are alive: (1) partitioner (2) outputcollector (3) reporter (4) All of the above

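As an illustration (old org.apache.hadoop.mapred API; the counter group and name are hypothetical), a reducer can use the Reporter both to signal that it is alive and to bump a counter:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CountingReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
            reporter.progress();                               // tell the framework we are alive
        }
        reporter.incrCounter("examples", "keys_reduced", 1);   // hypothetical counter group/name
        output.collect(key, new IntWritable(sum));
    }
}
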
59. Identify the correct MapReduce feature: (1) Automatic parallelization and distribution (2) Fault-tolerance (3) Speculative execution (4) All of the above

60. A ___ node acts as the slave and is responsible for executing a task assigned to it by the JobTracker: (1) mapreduce (2) mapper (3) tasktracker (4) jobtracker

61. Which is the correct statement? (1) In an MR job, map tasks store the intermediate data into HDFS (2) Reducer output will be written to HDFS (3) Intermediate data created by map tasks will be used to analyze the job history (4) None

62. In the MapReduce framework, map and reduce functions can be run in any order: (1) No, because the output of the reduce function is the input of the map function (2) Yes, because in functional programming the order of execution is not important (3) No, because the output of the map function is the input of the reduce function (4) Yes, because the functions use key/value pairs as input and output, so order is not important

63. Which of the following is the default partitioner for MapReduce? (1) HashPartitioner (2) HasPartitioner (3) MergePartitioner (4) None

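For reference, the default HashPartitioner effectively computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; a custom partitioner only needs to override getPartition(), as in this illustrative Java sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route keys by their first character so related keys land in the same reducer.
        char first = key.toString().isEmpty() ? '_' : key.toString().charAt(0);
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numPartitions;
    }
}
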
64. Which of the following classes is provided by the Aggregate package? (1) Map (2) Reducer (3) Reduce (4) None

65. The number of maps is usually driven by the total size of: (1) inputs (2) OutputCollector (3) tasks (4) None

66. ___________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution: (1) MapParameters (2) JobConf (3) memoryConf (4) None

67. The maximum virtual memory of the launched child-task is specified using: (1) mapv (2) mapred (3) mapvim (4) All the listed options

68. Mapper implementations are passed the JobConf for the job via the ______ method: (1) JobConfigure.configure (2) JobConfigurable.configure (3) JobConfigurable.configureable (4) None

69. Each mapper will communicate with each reducer: (1) True (2) False

70. To set an environment variable in a streaming command, use: (1) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ (2) -cmden EXAMPLE_DIR=/home/example/dictionaries/ (3) -cmdev EXAMPLE_DIR=/home/example/dictionaries/ (4) -cmenv EXAMPLE_DIR=/home/example/dictionaries/

71. ______ maps input key/value pairs to a set of intermediate key/value pairs: (1) Mapper (2) Reducer (3) Both mapper and reducer (4) None

72. What is data localization? (1) Bringing the data to be processed into a single node (2) Running the map task on the node where the data block sits (3) Bringing the replicated blocks into a single node (4) All the listed options

73. ______ is a utility which allows users to create and run jobs with any executable as the mapper and/or the reducer: (1) Hadoop Strdata (2) Hadoop streaming (3) Hadoop stream (4) None

74. A mapper outputs zero or more key/value pairs: (1) True (2) False

75. The ______ function is responsible for consolidating the results produced by each of the map() functions/tasks: (1) Reduce (2) Map (3) Reducer (4) All of the above

76. ______ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer: (1) OutputCompactor (2) OutputCollector (3) InputCollector (4) All the listed options

77. Correct statement for the Hadoop framework: (1) Hadoop attempts to run a mapper on a node where the data resides locally (2) Multiple mappers run in parallel (3) Mappers read data in the form of key/value pairs (4) All the listed options

78. Point out the correct statement: (1) The minimum split size is usually 1 byte, although some formats have a lower bound on the split size (2) Applications may impose a minimum split size (3) The maximum split size defaults to the maximum value that can be represented by a Java long type (4) All the listed options

79. A combiner increases the amount of work to be done by the reducer by reducing the traffic: (1) True (2) False

80. Point out the wrong statement: (1) Hadoop works better with a small number of large files than a large number of small files (2) CombineFileInputFormat is designed to work well with small files (3) CombineFileInputFormat does not compromise the speed at which it can process the input in a typical MR job (4) none

81. By default, a map task works on: (1) a single block (2) multiple blocks (3) the whole stream of data (4) All the listed options

82. Point out the wrong statement: (1) It is legal to set the number of reducer tasks to zero if no reduction is desired (2) The output of the map tasks goes directly to the FileSystem (3) The MR framework does not sort the map outputs before writing them out to the FileSystem (4) none

83. The split size is normally the size of a ___ block, which is appropriate for most applications: (1) Generic (2) Task (3) Library (4) HDFS

84. Point out the wrong statement: (1) Reducer has two primary phases (2) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures (3) It is legal to set the number of reducer tasks to zero if no reduction is desired (4) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in the sort stage

85. The ______ class allows the framework to partition the map outputs: (1) KeyFieldPartitioner (2) KeyFieldBasedPartitioner (3) KeyFieldBased (4) none

86. A reducer task cannot be started until all mapper tasks get completed: (1) True (2) False

87. What happens if a mapper task runs slowly relative to other mapper tasks? (1) Reducer tasks cannot start until the last mapper gets completed (2) Hadoop will trigger another instance of the same mapper task on another node (3) If another instance of the mapper task finishes, then Hadoop will kill the slowly running mapper task (4) All the above

88. In an MR job, maps store intermediate data on the local disk, and it will be deleted once the job is done: (1) True (2) False

89. Which of the following partitions the key space? (1) Partitioner (2) Compactor (3) Collector (4) All the listed options

90. The output of the ______ is not sorted in the MR framework for Hadoop: (1) mapper (2) cascader (3) scalding (4) none

91. The reduce phase can be started before all the mappers complete their tasks: (1) True (2) False

92. Correct syntax for Pig parameter substitution using cmd: (1) {-param param_name=param_value | -param_file file_name} [-debug | -dryrun] script (2) {%declare | %default} param_name param_value (3) {%declare | %default} param_name param_value cmd (4) all the listed options

93. Which of the following phases occur simultaneously? (1) shuffle and sort (2) reduce and sort (3) shuffle and map (4) all the listed options

94. A Sqoop-imported table would be saved in which of the following? (1) In an HDFS special destination present in the Sqoop configuration (2) In HDFS, inside a directory with the same name as the table (3) In the local file system of the Hadoop cluster (4) Will be saved in a …

95. Which Sqoop tool would the Hadoop admin use to list all the databases in a server that Sqoop can connect to? (1) APIs (2) JDBC connector (3) Sqoop list-databases tool (4) Sqoop show-databases

96. Point out the wrong statement: (1) The DataNode is the slave/worker node and holds the user data in the form of data blocks (2) Each incoming file is broken into 32 MB by default (3) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance (4) None of the above

97. A ________ serves as the master, and there is only one NameNode per cluster: (1) DataNode (2) NameNode (3) Data Block (4) Replication

98. The number of mappers is decided by the MapReduce framework: (1) True (2) False

99. The ______ NameNode is used when the primary NameNode goes down: (1) rack (2) data (3) secondary (4) none

100. The number of reducers can't be decided by the user: (1) True (2) False

101. A client reading data from the HDFS filesystem in Hadoop: (1) gets the data from the NameNode (2) gets the block location from the DataNode (3) gets only the block location from the NameNode (4) gets both the data and the block location from the NameNode

102. Point out the wrong statement: (1) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner (2) The MR framework operates exclusively on <key, value> pairs (3) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods (4) None

103. Each mapper will communicate with each reducer: (1) True (2) False

104. A loader implementation should implement _________ if casts (implicit or explicit) from DataByteArray fields to other types need to be supported: (1) LoadPushDown (2) LoadMetadata (3) LoadCaster (4) All the listed options

105. In which ecosystem tool is the schema optional? (1) Hive (2) Pig (3) Sqoop (4) Flume

106. Which of the following is the design and execution structure supported by an Oozie workflow? (1) Acyclic graphs (2) Direct non-acyclic graphs (3) Direct acyclic graphs (4) Non-acyclic graphs

107. If a requirement is to run many ad-hoc queries on top of HDFS data, which ecosystem tool will you suggest to the client for the use case? (1) Pig (2) Sqoop (3) Hive (4) MapReduce

108. A user writes a query in a file called guna.hql; which is correct? (1) SELECT * from Table1 (2) cat /home/myuser… (3) hive -f /home/path/qu… (4) All of the above

109. pig -x tez_local will enable __________ mode in Pig: (1) Tez (2) mapreduce (3) Local (4) None of the options

110. The DataNode sends a heartbeat every 9 seconds: (1) True (2) False

111. What is the default replication factor in HDFS? (1) 1 (2) 2 (3) 3 (4) 4

112. When using HDFS, what occurs when a file is deleted from the command line? (1) It is permanently deleted and the file attributes are recorded in a log file (2) Files in HDFS cannot be deleted (3) It becomes hidden from the user but stays in the file system (4) none

113. The NameNode is a single point of failure: (1) True (2) False

114. True about HDFS: (1) HDFS is based on the Google File System (2) HDFS is written in Java (3) Sits on top of the native file system (4) All the listed options

115. How would data received from GPS, satellites and the Web be classified? (1) Structured data (2) Unstructured data (3) Semi-structured data (4) Both structured and semi-structured

116. HDFS works by breaking large files into smaller pieces; these smaller pieces of files are called: (1) Namespace (2) DataNode (3) NameNode (4) Blocks

117. Data is captured which can be in any form, structured or unstructured. Which characteristic of Big Data is this? (1) Velocity (2) Variety (3) Volume (4) Variable

118. List all the types of Big Data: (1) Structured data and unstructured data (2) Semi-structured data (3) Structured, semi-structured and unstructured data (4) Semi-structured and unstructured data

119. What is the default database when we connect to Hive? (1) user (2) hivedb (3) hbase (4) default

120. Which of the following components provides support for automatic execution of workflows based on events and the presence of system resources? (1) Zookeeper (2) Oozie Coordinator (3) Ambari (4) All of the options

121. Hive parses data at the time of loading: (1) True (2) False

123. Hive supports complex index types: (1) True (2) False

124. What is the correct statement to access HDFS from the Hive CLI? (1) ./hdfs -ls /user/hduser/input (2) ./hadoop dfs -ls /user/hduser/input (3) hadoop -ls /user/hduser/input (4) dfs -ls /user/hduser/input

125. Which of the following statements is true about SerDes in Hive? (1) A SerDe is a library for serialization and deserialization (2) A SerDe can be customized to allow Hive to understand your own custom format (3) A SerDe is a mechanism that Hive uses to parse various formats of data stored in HDFS, to be used by Hive (4) All of the options listed

126. Hive supports LEFT SEMI JOIN: (1) True (2) False

127. The Hive built-in function size(map<K,V>) is used to: (1) Return null if the conversion does not succeed (2) Convert the results into the Map type (3) Return the number of elements in the map type (4) Only return the number of elements in the Map type and return null if the conversion does not succeed

128. When does Spark evaluate an RDD? (1) Upon an action (2) Upon a transformation (3) On both (4) None

129. Spark Core's fast scheduling capability to perform streaming analytics is leveraged by: (1) Spark Streaming (2) MLlib (3) GraphX (4) RDDs

130. Which is the default input format defined in Hadoop? (1) SequenceFileInputFormat (2) ByteInputFormat (3) KeyValueInputFormat (4) TextInputFormat

131. YARN stands for: (1) Yahoo another resource name (2) Yet another resource negotiator (3) Yet another resource need (4) Yahoo archived resource name

132. The secondary NameNode is a backup for the NameNode: (1) False (2) True

133. What is the default block size of Hadoop 2? (1) 128 MB (2) 64 MB (3) 128 GB (4) 64 GB

134. The DataNode sends a heartbeat every 9 seconds: (1) True (2) False

135. What is the default replication factor? (1) 1 (2) 2 (3) 3 (4) 4

136. HDFS stands for: (1) Hadoop direct file system (2) Hadoop distributed file system (3) Hadoop direct file synchronization (4) Hadoop distributed file synchronization

137. Which database does Hive use for storing metadata about the Hive tables? (1) MySQL (2) HBase (3) Derby (4) MongoDB

138. Hive is: (1) Schema on write (2) Schema-less (3) Schema on read (4) All of the options

139. What are the complex data types Hive supports? (1) Arrays (2) Maps (3) Structs (4) All of the options

140. Which statements are true while loading data into a table in Hive? (1) Data gets copied from the source location into the table's directory in both managed and external tables (2) Data gets moved in the case of a managed table, while a pointer to the source data location is created in the case of external tables (3) Data gets moved in the case of an external table, while a pointer to the source data location is created in the case of managed tables (4) Data gets moved from the source location into the table's directory in both managed and external tables

141. Which of the following is …? (1) Acyclic graphs (2) Direct non-acyclic graphs (3) Direct acyclic graphs (4) Non-acyclic graphs

142. Why is Spark an integrated solution for processing on all Lambda Architecture layers? (1) It contains Spark Core, which includes a high-level API and an optimized engine that supports general execution graphs (2) It contains Spark SQL for SQL and structured data processing (3) It contains Spark Streaming, which enables scalable, high-throughput, fault-tolerant stream processing of live data (4) All of the above