
Big Data MCQ QA Part I

Data Science and Analytics (Dr. Vishwanath Karad MIT World Peace University)



History of Hadoop
1. Point out the correct statement:
a) Hadoop is an ideal environment for extracting and transforming small volumes of data

b) Hadoop stores data in HDFS and supports data compression/decompression


c) The Giraph framework is less useful than a MapReduce job to solve graph and machine
learning problems
d) None of the mentioned
View Answer
Answer: b

Explanation: Data compression can be achieved using compression algorithms like bzip2, gzip,
LZO, etc. Different algorithms can be used in different scenarios based on their capabilities.

2. What was Hadoop written in ?


a) Java (software platform)
b) Perl
c) Java (programming language)
d) Lua (programming language)
View Answer
Answer: c

Explanation: The Hadoop framework itself is mostly written in the Java programming language,
with some native code in C and command line utilities written as shell-scripts.

3. Which of the following genres does Hadoop produce ?


a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System
View Answer
Answer: a

Explanation: The Hadoop Distributed File System (HDFS) is designed to store very large data sets
reliably, and to stream those data sets at high bandwidth to user applications.
4. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and
matrix operations.

a) Machine learning
b) Pattern recognition
c) Statistical classification



d) Artificial intelligence
View Answer
Answer: a

Explanation: The Apache Mahout project’s goal is to build a scalable machine learning tool.
Big Data
5. Point out the correct statement:
a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real time data
c) In Hadoop programming framework output files are divided into lines or records
d) None of the mentioned
View Answer
Answer: b

Explanation: Hadoop batch processes data distributed over a number of computers ranging in 100s
and 1000s.

6. Hadoop is a framework that works with a variety of related tools. Common cohorts include:

a) MapReduce, Hive and HBase


b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
View Answer
Answer: a

Explanation: To use Hive with HBase you’ll typically want to launch two clusters, one to run
HBase and the other to run Hive.

7. ________ can best be described as a programming model used to develop Hadoop-based applications that
can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned

View Answer

Answer: a

Explanation: MapReduce is a programming model and an associated implementation for



processing and generating large data sets with a parallel, distributed algorithm.
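As an illustration of the model described above (not part of the original question set), a minimal word-count Mapper written against the classic org.apache.hadoop.mapred API might look like the sketch below; the class name WordCountMapper is a placeholder chosen for this example.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (word, 1) for every word found in a line of input text.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);   // intermediate key/value pair
            }
        }
    }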

8. ________ has the world’s largest Hadoop cluster.


a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
View Answer
Answer: c

Explanation: Facebook has many Hadoop clusters, the largest among them is the one that is used
for Data warehousing.

9. Facebook Tackles Big Data With ________ based on Hadoop.


a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’
View Answer
Answer: a

Explanation: Prism automatically replicates and moves data wherever it’s needed across a vast
network of computing facilities.

10. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and
analysis of large datasets.
a) Pig Latin
b) Oozie
c) Pig
d) Hive
View Answer
Answer: c
Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high- level
language for expressing data analysis programs.

11. Point out the correct statement:


a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to



querying data
b) Hive is a relational database with SQL support
c) Pig is a relational database with SQL support
d) All of the mentioned
View Answer
Answer: a

Explanation: Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file
systems.

12. ________ hides the limitations of Java behind a powerful and concise Clojure API for Cascading.
a) Scalding
b) HCatalog
c) Cascalog
d) All of the mentioned
View Answer
Answer: c

Explanation: Cascalog also adds Logic Programming concepts inspired by Datalog. Hence the
name “Cascalog” is a contraction of Cascading and Datalog.

13. Point out the wrong statement:


a) Elastic MapReduce (EMR) is Facebook’s packaged Hadoop offering
b) Amazon Web Service Elastic MapReduce (EMR) is Amazon’s packaged Hadoop offering
c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
d) All of the mentioned
View Answer
Answer: a

Explanation: Rather than building Hadoop deployments manually on EC2 (Elastic Compute
Cloud) clusters, users can spin up fully configured Hadoop installations using simple invocation
commands, either through the AWS Web Console or through command-line tools.

14. ________ is the most popular high-level Java API in Hadoop Ecosystem
a) Scalding
b) HCatalog
c) Cascalog
d) Cascading
View Answer



Answer: d

Explanation: Cascading hides many of the complexities of MapReduce programming behind more
intuitive pipes and data flow abstractions.

15. ________ is a general-purpose computing model and runtime system for distributed data analytics.
a) Mapreduce
b) Drill
c) Oozie
d) None of the mentioned
View Answer
Answer: a

Explanation: Mapreduce provides a flexible and scalable foundation for analytics, from traditional
reporting to leading-edge machine learning algorithms.

16. The Pig Latin scripting language is not only a higher-level data flow language but also has
operators similar to :
a) SQL
b) JSON
c) XML
d) All of the mentioned
View Answer
Answer: a

Explanation: Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL
and the low-level procedural style of MapReduce.
17. ________ jobs are optimized for scalability but not latency.
a) Mapreduce
b) Drill
c) Oozie
d) Hive
View Answer
Answer: d

Explanation: Hive Queries are translated to MapReduce jobs to exploit the scalability of
MapReduce.



Introduction to Mapreduce

18. A ________ node acts as the Slave and is responsible for executing a Task assigned to it
by the JobTracker.
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
View Answer
Answer: c

Explanation: TaskTracker receives the information necessary for execution of a Task from
JobTracker, Executes the Task, and Sends the Results back to JobTracker.

19. ________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
View Answer
Answer: a

Explanation: Reduce function collates the work and resolves the results.
20. Although the Hadoop framework is implemented in Java, MapReduce applications need not
be written in:
a) Java
b) C
c) C#
d) None of the mentioned
View Answer
Answer: a

Explanation: Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce
applications (non JNI™ based).

21. ________ is a utility which allows users to create and run jobs with any executables as the mapper and/or
the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned



View Answer
Answer: b

Explanation: Hadoop streaming is one of the most important utilities in the Apache Hadoop
distribution.

22. _________ maps input key/value pairs to a set of intermediate key/value pairs.
a) Mapper
b) Reducer
c) Both Mapper and Reducer
d) None of the mentioned
View Answer
Answer: a

Explanation: Maps are the individual tasks that transform input records into intermediate records.

23. The number of maps is usually driven by the total size of:
a) inputs
b) outputs
c) tasks
d) None of the mentioned
View Answer
Answer: a

Explanation: Total size of inputs means total number of blocks of the input files.

24. ________ is the default Partitioner for partitioning key space.


a) HashPar
b) Partitioner
c) HashPartitioner
d) None of the mentioned
View Answer
Answer: c

Explanation: The default partitioner in Hadoop is the HashPartitioner which has a method called
getPartition to partition.
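For reference, the default HashPartitioner hashes the key and takes the result modulo the number of reduce tasks. A custom partitioner that mimics that behaviour could be sketched as below; the class name is illustrative only.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Mimics HashPartitioner: partition = (hash of key) mod (number of reduce tasks).
    public class HashLikePartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { }

        public int getPartition(Text key, Text value, int numReduceTasks) {
            // Mask the sign bit so the result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }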



Analyzing Data with Hadoop

25. Mapper implementations are passed the JobConf for the job via the ________ method
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned
View Answer
Answer: b

Explanation: Mapper implementations override the JobConfigurable.configure method to initialize themselves.
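A sketch of a mapper base class that overrides configure(JobConf) to read a job parameter during initialization is shown below; the property name example.min.length is hypothetical and a real mapper would also implement the Mapper interface.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // configure() is called once with the job's JobConf before any map() calls.
    public class ConfiguredMapperBase extends MapReduceBase {
        private int minLength;

        @Override
        public void configure(JobConf job) {
            // "example.min.length" is a hypothetical job property used for illustration.
            minLength = job.getInt("example.min.length", 1);
        }
    }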

26. Input to the ________ is the sorted output of the mappers.

a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned
View Answer
Answer: a

Explanation: In Shuffle phase the framework fetches the relevant partition of the output of all the
mappers, via HTTP.

27. Point out the wrong statement:


a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load
balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the
same key) in sort stage
View Answer
Answer: a

Explanation: Reducer has 3 primary phases: shuffle, sort and reduce.

28. The output of the ________ is not sorted in the Mapreduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding



d) None of the mentioned
View Answer
Answer: d

Explanation: The output of the reduce task is typically written to the FileSystem. The output of the
Reducer is not sorted.

29. Which of the following phases occur simultaneously ?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned
View Answer
Answer: a

Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being
fetched they are merged.

30. Mapper and Reducer implementations can use the ________ to report progress or just
indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
View Answer
Answer: c

Explanation: Reporter is a facility for MapReduce applications to report progress, set application-
level status messages and update Counters.
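Typical Reporter usage inside a map() method might look roughly like the sketch below; the counter group, counter name, and status text are invented for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Shows typical Reporter usage: status messages, counters, and liveness pings.
    public class ReportingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            reporter.setStatus("processing offset " + key.get());  // application-level status
            reporter.incrCounter("Example", "RecordsSeen", 1);     // update a custom counter
            reporter.progress();                                   // tell the framework we are alive
            output.collect(value, key);
        }
    }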

31. ________ is a generalization of the facility provided by the MapReduce framework to collect data output
by the Mapper or the Reducer
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
View Answer
Answer: b

Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.

32. ________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for



execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
View Answer
Answer: b

Explanation: JobConf represents a MapReduce job configuration.
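A hedged sketch of a job driver that describes a job through a JobConf is shown below; WordCountMapper and WordCountReducer stand for Mapper/Reducer implementations such as the sketches shown elsewhere in these notes, and the input/output paths come from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);            // types of the final output
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(WordCountMapper.class);    // placeholder Mapper class
            conf.setReducerClass(WordCountReducer.class);  // placeholder Reducer class
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);                        // submit and wait for completion
        }
    }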


Scaling out in Hadoop

33. ________ are highly resilient and eliminate the single-point-of-failure risk with traditional
Hadoop deployments
a) EMR
b) Isilon solutions
c) AWS
d) None of the mentioned
View Answer
Answer: b

Explanation: Isilon solutions also provide enterprise data protection and security options, including
file system auditing and data-at-rest encryption, to address compliance requirements.

34. Which is the most popular NoSQL database for scalable big data store with
Hadoop ?
a) Hbase
b) MongoDB
c) Cassandra
d) None of the mentioned
View Answer
Answer: a

Explanation: HBase is the Hadoop database: a distributed, scalable Big Data store that lets you
host very large tables — billions of rows multiplied by millions of columns — on clusters built
with commodity hardware.

35. The ________ can also be used to distribute both jars and native libraries for use in
the map and/or reduce tasks.
a) DataCache



b) DistributedData

c) DistributedCache
d) All of the mentioned
View Answer
Answer: c

Explanation: The child-jvm always has its current working directory added to the java.library.path
and LD_LIBRARY_PATH.
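A hedged sketch of shipping a jar and a native library to tasks through the DistributedCache is shown below; the HDFS paths are invented for the example and the files are assumed to already exist on HDFS.

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetupExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheSetupExample.class);
            // Hypothetical HDFS paths used only for illustration.
            DistributedCache.addFileToClassPath(new Path("/libs/helper.jar"), conf);
            DistributedCache.addCacheFile(new URI("/native/libexample.so#libexample.so"), conf);
            // The cached files become available to the map/reduce tasks at run time.
        }
    }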

36. HBase provides ________-like capabilities on top of Hadoop and HDFS.


a) TopTable
b) BigTop
c) Bigtable
d) None of the mentioned
View Answer
Answer: c

Explanation: Google Bigtable leverages the distributed data storage provided by the Google File
System.

Hadoop Streaming

37. Streaming supports streaming command options as well as ________ command options.


a) generic
b) tool
c) library
d) task
View Answer
Answer: a

Explanation: Place the generic options before the streaming options, otherwise the command will
fail.

38. Which of the following Hadoop streaming command option parameter is required
?
a) output directoryname
b) mapper executable
c) input directoryname



d) all of the mentioned
View Answer
Answer: d

Explanation: The input directory, output directory, and mapper executable are all required streaming parameters.
39. To set an environment variable in a streaming command use:
a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
View Answer
Answer: c

Explanation: Environment variables are set using the -cmdenv option.

40. The ________ class allows the Map/Reduce framework to partition the map outputs based on
certain key fields, not the whole keys.
a) KeyFieldPartitioner
b) KeyFieldBasedPartitioner
c) KeyFieldBased
d) None of the mentioned
View Answer
Answer: b

Explanation: The primary key is used for partitioning, and the combination of the primary and
secondary keys is used for sorting.

41. Which of the following class provides a subset of features provided by the
Unix/GNU Sort?
a) KeyFieldBased
b) KeyFieldComparator
c) KeyFieldBasedComparator
d) All of the mentioned
View Answer
Answer: c

Explanation: Hadoop has a library class, KeyFieldBasedComparator, that is useful for many
applications.

42. Which of the following class is provided by Aggregate package ?
a) Map



b) Reducer

c) Reduce
d) None of the mentioned
View Answer
Answer: b

Explanation: Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on over a
sequence of values

43. Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that
effectively allows you to process text data like the unix ________ utility.
a) Copy
b) Cut
c) Paste
d) Move
View Answer
Answer: b

Explanation: The map function defined in the class treats each input key/value pair as a list of
fields.

Introduction to HDFS

44. A ________ serves as the master and there is only one NameNode per cluster.
a) Data Node
b) NameNode
c) Data block
d) Replication
View Answer
Answer: b

Explanation: All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.

45. Point out the correct statement:


a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks
b) Each incoming file is broken into 32 MB by default



c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of
fault tolerance
d) None of the mentioned
View Answer
Answer: a

Explanation: There can be any number of DataNodes in a Hadoop Cluster.

46. ________ NameNode is used when the Primary NameNode goes down.


a) Rack
b) Data
c) Secondary
d) None of the mentioned
View Answer
Answer: c

Explanation: Secondary namenode is used for all time availability and reliability.

47. Point out the wrong statement:


a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
View Answer
Answer: d

Explanation: It is the NameNode, not the DataNode, that is aware of the files to which the blocks belong.

48. Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
View Answer
Answer: a

Explanation: HDFS can be used for storing archive data since it is cheaper as HDFS allows storing



the data on low cost commodity hardware while ensuring a high degree of fault tolerance.

49. The need for data replication can arise in various scenarios like:
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
View Answer
Answer: d

Explanation: Data is replicated across different DataNodes to ensure a high degree of fault-
tolerance.

50. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
View Answer
Answer: a

Explanation: A DataNode stores data in the Hadoop File System (HDFS). A functional filesystem has more
than one DataNode, with data replicated across them.

51. HDFS provides a command line interface called ________ used to interact with
HDFS.
a) “HDFS Shell”
b) “FS Shell”
c) “DFS Shell”
d) None of the mentioned
View Answer
Answer: b

Explanation: The File System (FS) shell includes various shell-like commands that directly interact
with the Hadoop Distributed File System (HDFS).
52. HDFS is implemented in ________ programming language.
a) C++
b) Java
c) Scala
d) None of the mentioned



View Answer
Answer: b

Explanation: HDFS is implemented in Java and any computer which can run Java can host a
NameNode/DataNode on it.

Java Interface
53. The output of the reduce task is typically written to the FileSystem via :
a) OutputCollector
b) InputCollector
c) OutputCollect
d) All of the mentioned
View Answer
Answer: a

Explanation: In reduce phase the reduce(Object, Iterator, OutputCollector, Reporter) method is


called for each pair in the grouped inputs.
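A minimal summing reducer using that signature might look like the sketch below; the class name WordCountReducer is a placeholder.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sums the values for each key and writes the total through the OutputCollector.
    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));  // typically written to the FileSystem
        }
    }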

54. Applications can use the ________ provided to report progress or just indicate that they
are alive.
a) Collector
b) Reporter
c) Dashboard
d) None of the mentioned
View Answer
Answer: b

Explanation: In scenarios where the application takes a significant amount of time to process
individual key/value pairs, this is crucial since the framework might assume that the task has timed-
out and kill that task.
Data Flow

55. ________ is a programming model designed for processing large volumes of data in parallel by dividing
the work into a set of independent tasks.
a) Hive
b) MapReduce
c) Pig
d) Lucene
View Answer



Answer: b
Explanation: MapReduce is the heart of hadoop.

56. Point out the correct statement:


a) Data locality means movement of algorithm to the data instead of data to algorithm
b) When the processing is done on the data algorithm is moved across the Action Nodes rather
than data to the algorithm
c) Moving Computation is more expensive than Moving Data
d) None of the mentioned
View Answer
Answer: a
Explanation: Data flow framework possesses the feature of data locality.

57. The daemons associated with the MapReduce phase are ________ and task-trackers.
a) job-tracker
b) map-tracker
c) reduce-tracker
d) all of the mentioned
View Answer
Answer: a
Explanation: Map-Reduce jobs are submitted on job-tracker.

58. The default InputFormat is ________ which treats each line of the input as a new value and
the associated key is byte offset.
a) TextFormat
b) TextInputFormat
c) InputFormat
d) All of the mentioned
View Answer
Answer: b
Explanation: A RecordReader is little more than an iterator over records, and the map task uses
one to generate record key-value pairs.

59. ________ controls the partitioning of the keys of the intermediate map-outputs.


a) Collector
b) Partitioner
c) InputFormat
d) None of the mentioned



View Answer
Answer: b
Explanation: The output of the mapper is sent to the partitioner.

60. Output of the mapper is first written on the local disk for sorting and ________ process.
a) shuffling
b) secondary sorting
c) forking
d) reducing
View Answer
Answer: a
Explanation: All values corresponding to the same key will go the same reducer.

Hadoop Archives

61. The ________ guarantees that excess resources taken from a queue will be restored to
it within N minutes of its need for them.
a) capacitor
b) scheduler
c) datanode
d) none of the mentioned
View Answer
Answer: b
Explanation: Free resources can be allocated to any queue beyond its guaranteed capacity.
62. ________ is a pluggable Map/Reduce scheduler for Hadoop which provides a way to share large clusters.
a) Flow Scheduler
b) Data Scheduler
c) Capacity Scheduler
d) None of the mentioned
View Answer
Answer: c
Explanation: The Capacity Scheduler supports multiple queues, where a job is submitted to a
queue.

Data Integrity

63. The __________ machine is a single point of failure for an HDFS cluster.



a) DataNode
b) NameNode
c) ActionNode
d) All of the mentioned
View Answer
Answer: b
Explanation: If the NameNode machine fails, manual intervention is necessary. Currently,
automatic restart and failover of the NameNode software to another machine is not supported.

64. The ________ and the EditLog are central data structures of HDFS.
a) DsImage
b) FsImage
c) FsImages
d) All of the mentioned
View Answer
Answer: b
Explanation: A corruption of these files can cause the HDFS instance to be non-functional.

65. Point out the wrong statement:


a) HDFS is designed to support small files only
b) Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get
updated synchronously
c) NameNode can be configured to support maintaining multiple copies of the Fslmage and
EditLog
d) None of the mentioned
View Answer
Answer: a
Explanation: HDFS is designed to support very large files.

66. HDFS, by default, replicates each data block ________ times on different nodes and on at
least ________ racks.
a) 3,2
b) 1,2
c) 2,3
d) All of the mentioned
View Answer
Answer: a
Explanation: HDFS has a simple yet robust architecture that was explicitly designed for data



reliability in the face of faults and failures in disks, nodes and networks.

67. The HDFS file system is temporarily unavailable whenever the HDFS ________ is down.
a) DataNode
b) NameNode
c) ActionNode
d) None of the mentioned
View Answer
Answer: b
Explanation: When the HDFS NameNode is restarted it recovers its metadata.

Serialization

68. Apache ________ is a serialization framework that produces data in a compact binary


format.
a) Oozie
b) Impala
c) kafka
d) Avro
View Answer
Answer: d
Explanation: Apache Avro doesn’t require proxy objects or code generation.

69. Point out the wrong statement:


a) Java code is used to deserialize the contents of the file into objects
b) Avro allows you to use complex data structures within Hadoop MapReduce jobs
c) The m2e plug-in automatically downloads the newly added JAR files and their dependencies
d) None of the mentioned
View Answer
Answer: d
Explanation: A unit test is useful because you can make assertions to verify that the values of the
deserialized object are the same as the original values.

70. The ________ class extends and implements several Hadoop-supplied interfaces.


a) AvroReducer
b) Mapper
c) AvroMapper



d) None of the mentioned
View Answer
Answer: c
Explanation: AvroMapper is used to provide the ability to collect or map data.

71. The _______ method in the ModelCountReducer class “reduces” the values the mapper
collects into a derived value
a) count
b) add
c) reduce
d) all of the mentioned
View Answer
Answer: c
Explanation: In some cases, it can be a simple sum of the values.
Metrics in Hbase

72. ________ can change the maximum number of cells of a column family.


a) set
b) reset
c) alter
d) select
View Answer
Answer: c
Explanation: Alter is the command used to make changes to an existing table.

73. Which of the following is not a table scope operator ?


a) MEMSTORE_FLUSH
b) MEMSTORE_FLUSHSIZE
c) MAX_FILESIZE
d) All of the mentioned
View Answer
Answer: a
Explanation: Using alter, you can set and remove table scope operators such as MAX_FILESIZE,
READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.

74. You can delete a column family from a table using the ________ method of
HBAseAdmin class.
a) delColumn()



b) removeColumn()
c) deleteColumn()
d) all of the mentioned
View Answer
Answer: c
Explanation: Alter command also can be used to delete a column family.

75. Point out the wrong statement:


a) To read data from an HBase table, use the get() method of the HTable class
b) You can retrieve data from the HBase table using the get() method of the HTable class
c) While retrieving data, you can get a single row by id, or get a set of rows by a set of row ids, or
scan an entire table or a subset of rows
d) None of the mentioned
View Answer
Answer: d
Explanation: You can retrieve HBase table data using the add method variants in Get class.
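A hedged sketch of a single-row read with the classic HBase client API is shown below; the table, row key, column family, and qualifier names are invented for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseGetExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "employee");   // hypothetical table name
            Get get = new Get(Bytes.toBytes("row1"));      // row key to fetch
            get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
            table.close();
        }
    }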

76. HBase uses the ________ File System to store its data.


a) Hive
b) Impala
c) Hadoop
d) Scala
View Answer
Answer: c
Explanation: The data storage will be in the form of regions (tables). These regions will be split up
and stored in region servers.

Mapreduce Development-2

77. Point out the correct statement:


a) Mapper maps input key/value pairs to a set of intermediate key/value pairs
b) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
c) Mapper and Reducer interfaces form the core of the job
d) None of the mentioned
View Answer
Answer: d



Explanation: The transformed intermediate records do not need to be of the same type as the input
records.

MapReduce Features-1

78. Which of the following is the default Partitioner for Mapreduce ?


a) MergePartitioner
b) HashedPartitioner
c) HashPartitioner
d) None of the mentioned
View Answer
Answer: c
Explanation: The total number of partitions is the same as the number of reduce tasks for the job.

79. Which of the following partitions the key space ?


a) Partitioner
b) Compactor
c) Collector
d) All of the mentioned
View Answer
Answer: a
Explanation: Partitioner controls the partitioning of the keys of the intermediate map-outputs.

80. ________ is a generalization of the facility provided by the MapReduce framework to collect data output
by the Mapper or the Reducer
a) OutputCompactor
b) OutputCollector
c) InputCollector
d) All of the mentioned
View Answer
Answer: b
Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.

81. Point out the wrong statement:


a) It is legal to set the number of reduce-tasks to zero if no reduction is desired
b) The outputs of the map-tasks go directly to the FileSystem
c) The Mapreduce framework does not sort the map-outputs before writing them out to the
FileSystem



d) None of the mentioned
View Answer
Answer: d
Explanation: Outputs of the map-tasks go directly to the FileSystem, into the output path set by
setOutputPath(Path).
82. ________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for
execution.
a) JobConfig
b) JobConf
c) JobConfiguration
d) All of the mentioned
View Answer
Answer: b
Explanation: JobConf is typically used to specify the Mapper, combiner (if any), Partitioner,
Reducer, InputFormat, OutputFormat and OutputCommitter implementations.

83. Maximum virtual memory of the launched child-task is specified using:
a) mapv
b) mapred
c) mapvim
d) All of the mentioned
View Answer
Answer: b
Explanation: Admins can also specify the maximum virtual memory of the launched childtask, and
any sub-process it launches recursively, using mapred.

84. ________ is the percentage of memory relative to the maximum heapsize in which map outputs may be
retained during the reduce.
a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
View Answer
Answer: b
Explanation: When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.

MapReduce Features-2

85. ________ specifies the number of segments on disk to be merged at the same time.



a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
View Answer
Answer: d
Explanation: io.sort.factor limits the number of open files and compression codecs during the
merge.

86. Point out the correct statement:


a) The number of sorted map outputs fetched into memory before being merged to disk
b) The memory threshold for fetched map outputs before an in-memory merge is finished
c) The percentage of memory relative to the maximum heapsize in which map outputs may not be
retained during the reduce
d) None of the mentioned
View Answer
Answer: a
Explanation: When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.

87. Jobs can enable task JVMs to be reused by specifying the job configuration :
a) mapred.job.recycle.jvm.num.tasks
b) mapissue.job.reuse.jvm.num.tasks
c) mapred.job.reuse.jvm.num.tasks
d) all of the mentioned
View Answer
Answer: b
Explanation: Many of my tasks had performance improved over 50% using
mapissue.job.reuse.jvm.num.tasks.

88. During the execution of a streaming job, the names of the ________ parameters are
transformed.
a) vmap
b) mapvim
c) mapreduce
d) mapred
View Answer
Answer: d
Explanation: To get the values in a streaming job’s mapper/reducer use the parameter names with



the underscores.
Hadoop Configuration

89. Which of the following class provides access to configuration parameters ?


a) Config
b) Configuration
c) OutputConfig
d) None of the mentioned
View Answer
Answer: b
Explanation: Configurations are specified by resources.
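Reading and setting parameters through the Configuration class looks roughly like the sketch below; fs.defaultFS is a standard Hadoop key, while extra-site.xml and example.retry.count are hypothetical names used only for illustration.

    import org.apache.hadoop.conf.Configuration;

    public class ConfigurationExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();   // loads core-default.xml and core-site.xml
            conf.addResource("extra-site.xml");         // hypothetical additional resource on the classpath
            String fsUri = conf.get("fs.defaultFS", "file:///");
            conf.setInt("example.retry.count", 3);      // illustrative application property
            System.out.println("Default file system: " + fsUri);
        }
    }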

90. ________ gives site-specific configuration for a given hadoop installation.


a) core-default.xml
b) core-site.xml
c) coredefaultxml
d) all of the mentioned
View Answer
Answer: b
Explanation: core-default.xml is read-only defaults for hadoop.

Hadoop Cluster-2

91. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record
values with schema applied on read based on:
a) HCatalog
b) Hive
c) Hbase
d) All of the mentioned
Answer: a

Explanation: Other means of tagging the values also can be used.

Monitoring HDFS

92. For YARN, the ________ Manager UI provides host and port information.
a) Data Node
b) NameNode
c) Resource



d) Replication
Answer: c

Explanation: All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.

93. Point out the correct statement:


a) The Hadoop framework publishes the job flow status to an internally running web server
on the master nodes of the Hadoop cluster
b) Each incoming file is broken into 32 MB by default
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault
tolerance
d) None of the mentioned
Answer: a

Explanation: The web interface for the Hadoop Distributed File System (HDFS) shows
information about the NameNode itself.

94) ________ NameNode is used when the Primary NameNode goes down.


a) Rack
b) Data
c) Secondary
d) None of the mentioned

Answer: c
Explanation: Secondary namenode is used for all time availability and reliability.

95) Point out the wrong statement:


a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
Answer: d
Explanation: It is the NameNode, not the DataNode, that is aware of the files to which the blocks belong.

96) For ________, the HBase Master UI provides information about the HBase Master uptime.

a) HBase
b) Oozie
c) Kafka
d) All of the mentioned



Answer: a
Explanation: HBase Master UI provides information about the number of live, dead and
transitional servers, logs, ZooKeeper information, debug dumps, and thread stacks.

97) Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
Answer: a
Explanation: HDFS can be used for storing archive data since it is cheaper as HDFS allows storing
the data on low cost commodity hardware while ensuring a high degree of fault tolerance.

98)The need for data replication can arise in various scenarios like :
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
Answer: d

Explanation: Data is replicated across different DataNodes to ensure a high degree of fault-
tolerance.

99) During start up, the ________ loads the file system state from the fsimage and the
edits log file.
a) DataNode
b) NameNode
c) ActionNode
d) None of the mentioned
Answer: b

Explanation: HDFS is implemented in Java; any computer which can run Java can host a
NameNode/DataNode on it.
HDFS Maintenance

100)Which of the following is a common hadoop maintenance issue ?


a) Lack of tools
b) Lack of configuration management
c) Lack of web interface
d) None of the mentioned
Answer: b

Explanation: Without a centralized configuration management framework, you end up with a


number of issues that can cascade just as your usage picks up.



101) ________ Manager’s Service feature monitors dozens of service health and performance metrics about
the services and role instances running on your cluster.
a) Microsoft
b) Cloudera
c) Amazon
d) None of the mentioned
Answer: b

Explanation: Manager’s Service feature presents health and performance data in a variety of
formats.

Introduction to Pig

102) Pig operates in mainly how many modes ?


a) Two
b) Three
c) Four
d) Five
Answer: a

Explanation: You can run Pig (execute Pig Latin statements and Pig commands) using various
modes: Interactive Mode and Batch Mode.

103)Point out the correct statement:


a) You can run Pig in either mode using the “pig” command
b) You can run Pig in batch mode using the Grunt shell
c) You can run Pig in interactive mode using the FS shell
d) None of the mentioned
Answer: a

Explanation: You can run Pig in either mode using the “pig” command (the bin/pig Perl script) or
the “java” command (java -cp pig.jar ...)

104) You can run Pig in batch mode using ________.


a) Pig shell command
b) Pig scripts
c) Pig options
d) All of the mentioned
Answer: b
Explanation: Pig script contains Pig Latin statements.

105)Pig Latin statements are generally organized in one of the following ways :
a) A LOAD statement to read data from the file system
b) A series of “transformation” statements to process the data



c) A DUMP statement to view results or a STORE statement to save the results
d) All of the mentioned
Answer: d
Explanation: A DUMP or STORE statement is required to generate output.

106)Point out the wrong statement:


a) To run Pig in local mode, you need access to a single machine
b) The DISPLAY operator will display the results to your terminal screen
c) To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation
d) All of the mentioned
Answer: b
Explanation: The DUMP operator will display the results to your terminal screen.

107)Which of the following function is used to read data in PIG ?


a) WRITE
b) READ
c) LOAD
d) None of the mentioned
Answer: c
Explanation: PigStorage is the default load function.

108) You can run Pig in interactive mode using the ________ shell.


a) Grunt
b) FS
c) HDFS
d) None of the mentioned
Answer: a
Explanation: Invoke the Grunt shell using the “pig” command (as shown below) and then enter
your Pig Latin statements and Pig commands interactively at the command line.

109)Which of the following is the default mode ?


a) Mapreduce
b) Tez
c) Local
d) All of the mentioned
Answer: a
Explanation: To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS
installation.

110)Which of the following will run pig in local mode ?


a) $ pig -x local...
b) $ pig -x tez local...
c) $pig ...
d) None of the mentioned



Answer: a
Explanation: Specify local mode using the -x flag (pig -x local).

111) $ pig -x tez_local... will enable ________ mode in Pig.


a) Mapreduce
b) Tez
c) Local
d) None of the mentioned
Answer: d
Explanation: Tez Local Mode is similar to local mode, except internally Pig will invoke tez
runtime engine.

Pig Latin

112) ________ operator is used to review the schema of a relation.
a) DUMP
b) DESCRIBE


c) STORE
d) EXPLAIN
Answer: b
Explanation: DESCRIBE returns the schema of a relation.

113)Point out the correct statement:


a) During the testing phase of your implementation, you can use LOAD to display results to your
terminal screen
b) You can view outer relations as well as relations defined in a nested FOREACH statement
c) Hadoop properties are interpreted by Pig
d) None of the mentioned
Answer: b
Explanation: Viewing outer relations is possible using DESCRIBE operator.

114)Which of the following operator is used to view the map reduce execution plans ?
a) DUMP
b) DESCRIBE
c) STORE
d) EXPLAIN
Answer: d
Explanation: EXPLAIN displays execution plans

User-defined functions in Pig

115)Point out the correct statement:


a) LoadMeta has methods to convert byte arrays to specific types



b) The Pig load/store API is aligned with Hadoop’s InputFormat class only
c) LoadPush has methods to push operations from Pig runtime into loader implementations
d) All of the mentioned
Answer: c
Explanation: Currently only the pushProjection() method is called by Pig to communicate to the
loader the exact fields that are required in the Pig script.

116)Which of the following has methods to deal with metadata ?


a) LoadPushDown
b) LoadMetadata
c) LoadCaster
d) All of the mentioned
Answer: b
Explanation: Most implementation of loaders don’t need to implement this unless they interact
with some metadata system.
117) A loader implementation should implement ________ if casts (implicit or explicit)
from DataByteArray fields to other types need to be supported.
a) LoadPushDown
b) LoadMetadata
c) LoadCaster
d) All of the mentioned
Answer: c
Explanation: LoadCaster has methods to convert byte arrays to specific types.

Data processing operators in Pig

118)Which of the following is shortcut for DUMP operator ?


a) \de alias
b) \d alias
c) \q
d) None of the mentioned
Answer: b
Explanation: If alias is ignored last defined alias will be used.

119)Point out the correct statement:


a) Invoke the Grunt shell using the “enter” command
b) Pig does not support jar files
c) Both the run and exec commands are useful for debugging because you can modify a Pig script
in an editor
d) All of the mentioned
Answer: c
Explanation: Both commands promote Pig script modularity as they allow you to reuse existing



components.

120)Which of the following command is used to show values to keys used in Pig ?
a) set
b) declare
c) display
d) All of the mentioned
Answer: a
Explanation: All Pig and Hadoop properties can be set, either in the Pig script or via the Grunt
command line.
121) Use the ________ command to run a Pig script that can interact with the Grunt shell
(interactive mode).
a) fetch
b) declare

c) run
d) all of the mentioned
Answer: c
Explanation: With the run command, every store triggers execution.

122)Which of the following command can be used for debugging ?

a) exec
b) execute
c) error
d) throw
Answer: a
Explanation: With the exec command, store statements will not trigger execution; rather, the entire
script is parsed before execution starts.

123)Which of the following file contains user defined functions (UDFs) ?


a) script2-local.pig
b) pig.jar
c) tutorial.jar
d) excite.log.bz2
Answer: c
Explanation: tutorial.jar contains java classes also.

124)Which of the following is correct syntax for parameter substitution using cmd ?
a) pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun] script
b) {%declare | %default} param_name param_value
c) {%declare | %default} param_name param_value cmd



d) All of the mentioned
Answer: a
Explanation: Parameter Substitution is used to substitute values for parameters at run time

Pig in practice

125) Pig Latin is ________ and fits very naturally in the pipeline paradigm while SQL is
instead declarative.
a) functional
b) procedural

c) declarative
d) all of the mentioned

Answer: b
Explanation: In SQL users can specify that data from two tables must be joined, but not what join
implementation to use.

Introduction to Hive

126)Which of the following command sets the value of a particular configuration variable (key)?
a) set -v

b) set <key>=<value>
c) set
d) reset
Answer: b
Explanation: If you misspell the variable name, the CLI will not show an error.

127)Which of the following is a command line option ?


a) -d,-define <key=value>
b) -e,-define <key=value>
c) -f,-define <key=value>
d) None of the mentioned
Answer: a
Explanation: Variable substitution to apply to hive commands, e.g. -d A=B or -define A=B

128) When $HIVE_HOME/bin/hive is run without either the -e or -f option, it enters ________ mode.

a) Batch

b) Interactive shell

c) Multiple



d) None of the mentioned

Answer: b

Explanation: Use ; (semicolon) to terminate commands when multiple options are available.

HiveQL-2

129) Hive specific commands can be run from Beeline, when the Hive ________ driver is used.
a) ODBC
b) JDBC
c) ODBC-JDBC
d) All of the Mentioned
Answer: b
Explanation: Hive specific commands are same as Hive CLI commands.

130)Which of the following data type is supported by Hive ?


a) map
b)record
c) string
d) enum
Answer: d

Explanation: Hive has no concept of enums.

Querying data with HiveQL-2

131) ________ will overwrite any existing data in the table or partition


a) INSERT WRITE
b) INSERT OVERWRITE
c) INSERT INTO
d) None of the mentioned
Answer: b
Explanation: INSERT INTO will append to the table or partition, keeping the existing data intact.

Introduction to HBase

132)Point out the correct statement:


a) HDFS provides low latency access to single rows from billions of records (Random access)
b) HBase sits on top of the Hadoop File System and provides read and write access
c) HBase is a distributed file system suitable for storing large files
d) None of the mentioned



Answer: b
Explanation: One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase.

133) Apache HBase is a non-relational database modeled after Google’s ________


a) BigTop
b) Bigtable
c) Scanner
d) FoundationDB
Answer: b
Explanation: Bigtable acts upon Google File System, likewise Apache HBase works on top of
Hadoop and HDFS.

134)Point out the wrong statement:


a) HBase provides only sequential access of data
b) HBase provides high latency batch processing
c) HBase internally provides serialized access
d) All of the mentioned
Answer: c
Explanation: HBase internally uses Hash tables and provides random access.

135) The ________ Server assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
a) Region
b) Master
c) Zookeeper
d) All of the mentioned
Answer: b
Explanation: Master Server maintains the state of the cluster by negotiating the load balancing.

Kafka with Hadoop-2

136) ________ provides the functionality of a messaging system.


a) Oozie
b) Kafka
c) Lucene
d) BigTop
Answer: b
Explanation: Kafka is a distributed, partitioned, replicated commit log service
Amazon Elastic MapReduce

137) ________ is an RPC framework that defines a compact binary serialization format used to persist data



structures for later analysis.
a) Pig
b) Hive
c) Thrift
d) None of the mentioned
Answer: c
Explanation: Thrift is an RPC framework whose compact binary serialization format is used to
persist data structures for later analysis.

138. One supported datatype that deserves special mention is:


a) money
b) counters
c) smallint
d) tinyint
View Answer
Answer: b
Explanation: Synchronization on counters are done on the Region Server, not in the client.

139. You can delete a column family from a table using the method of
HBAseAdmin class.
a) delColumn()
b) removeColumn()
c) deleteColumn()
d) all of the mentioned
View Answer
Answer: c
Explanation: Alter command also can be used to delete a column family.

140. HBase uses the ________ File System to store its data.


a) Hive
b) Impala
c) Hadoop
d) Scala
View Answer

Answer: c
Explanation: The data storage will be in the form of regions (tables). These regions will be split up
and stored in region servers.



141. Which of the following guarantees is provided by Zookeeper ?
a) Interactivity
b) Flexibility
c) Scalability
d) Reliability
View Answer
Answer: d
Explanation: Once an update has been applied, it will persist from that time forward until a client
overwrites the update.

142. A ________ server is a machine that keeps a copy of the state of the entire system
and persists this information in local log files.
a) Master
b) Region
c) Zookeeper
d) All of the mentioned
View Answer
Answer: c
Explanation: A very large Hadoop cluster can be supported by multiple ZooKeeper servers.

143. ZooKeeper’s architecture supports high __________ through redundant services.


a) flexibility
b) scalability
c) availability
d) interactivity
View Answer
Answer: c
Explanation: The clients can thus ask another ZooKeeper master if the first fails to answer.

144. ________ has a design policy of using ZooKeeper only for transient data.
a) Hive
b) Impala
c) Hbase
d) Oozie
View Answer
Answer: c
Explanation: If the HBase’s ZooKeeper data is removed, only the transient operations are affected
- data can continue to be written and read to/from HBase.



145. Zookeeper keeps track of the cluster state such as the location of the ________ table.
a) DOMAIN
b) NODE
c) ROOT
d) All of the mentioned
View Answer
Answer: c
Explanation: Zookeeper keeps track of list of online RegionServers, unassigned Regions.

146. Which of the following interface is implemented by Sqoop for recording ?


a) SqoopWrite
b) SqoopRecord
c) SqoopRead
d) None of the mentioned
View Answer
Answer: b
Explanation: SqoopRecord is the interface implemented by the classes generated by Sqoop’s
orm.ClassWriter.

147. ________ tool can list all the available database schemas.


a) sqoop-list-tables
b) sqoop-list-databases
c) sqoop-list-schema
d) sqoop-list-columns
View Answer
Answer: b
Explanation: Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).

148. ________ is a component on top of Spark Core.
a) Spark Streaming


b) Spark SQL
c) RDDs
d) All of the mentioned
View Answer
Answer: b
Explanation: Spark SQL introduces a new data abstraction called SchemaRDD, which provides
support for structured and semi-structured data.

149. Spark SQL provides a domain-specific language to manipulate ________ in Scala,


Java, or Python.



a) Spark Streaming
b) Spark SQL
c) RDDs
d) All of the mentioned
View Answer
Answer: c
Explanation: Spark SQL provides SQL language support, with command-line interfaces and
ODBC/JDBC server.

150. ________ leverages Spark Core’s fast scheduling capability to perform streaming analytics.
a) MLlib
b) Spark Streaming
c) GraphX
d) RDDs
View Answer
Answer: b
Explanation: Spark Streaming ingests data in mini-batches and performs RDD transformations on
those mini-batches of data.

151. ________ is a distributed machine learning framework on top of Spark.
a) MLlib


b) Spark Streaming
c) GraphX
d) RDDs
View Answer
Answer: a
Explanation: MLlib implements many common machine learning and statistical algorithms to
simplify large scale machine learning pipelines.
152. ________ is a distributed graph processing framework on top of Spark.
a) MLlib
b) Spark Streaming
c) GraphX
d) All of the mentioned
View Answer
Answer: c
Explanation: GraphX started initially as a research project at UC Berkeley AMPLab and
Databricks, and was later donated to the Spark project.

153. GraphX provides an API for expressing graph computation that can model the ________
abstraction.
a) GaAdt



b) Spark Core
c) Pregel
d) None of the mentioned
View Answer
Answer: c
Explanation: GraphX is used for machine learning.

154. Which of the following storage policy is used for both storage and compute ?

a) Hot
b) Cold
c) Warm
d) A11_SSD
Answer: a
Explanation: When a block is hot, all replicas are stored in DISK.

155. Which of the following is used to list out the storage policies ?
a) hdfs storagepolicies
b) hdfs storage
c) hd storagepolicies
d) all of the mentioned
Answer: a
Explanation: Arguments are none for the hdfs storagepolicies command.

156. Which of the following statement can be used to get the storage policy of a file or a directory ?
a) hdfs dfsadmin -getStoragePolicy path
b) hdfs dfsadmin -setStoragePolicy path policyName
c) hdfs dfsadmin -listStoragePolicy path policyName
d) all of the mentioned
Answer: a
Explanation: path refers to the path referring to either a directory or a file.

157 .Which of the following method is used to get user-specified job name ?
a) getJobName()
b) getJobState()
c) getPriority()
d) all of the mentioned
Answer: a
Explanation: getPriority() is used to get scheduling info of the job.

158 .The number of maps is usually driven by the total size of:
a) inputs
b) outputs
c) tasks



d) none of the mentioned
Answer: a
Explanation: Total size of inputs means total number of blocks of the input files.

159. ________ is a utility which allows users to create and run jobs with any executable as the mapper and/or
the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned
Answer: b
Explanation: Hadoop streaming is one of the most important utilities in the Apache Hadoop
distribution.

160. ________ part of the MapReduce is responsible for processing one or more chunks of data and
producing the output results.
a) Maptask
b) Mapper
c) Task execution
d) All of the mentioned
Answer: a
Explanation: Map Task in MapReduce is performed using the Map() function.

161. ________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
Answer: a
Explanation: Reduce function collates the work and resolves the results.

162. The ________ is responsible for allocating resources to the various running


applications subject to familiar constraints of capacities, queues etc.
a) Manager
b) Master
c) Scheduler
d) None of the mentioned

Answer: c
Explanation: The Scheduler is a pure scheduler in the sense that it performs no monitoring or
tracking of status for the application.



163 .Apache Hadoop YARN stands for:
a) Yet Another Reserve Negotiator
b) Yet Another Resource Network
c) Yet Another Resource Negotiator
d) All of the mentioned
Answer: c
Explanation: YARN is a cluster management technology.

164. The ________ is a framework-specific entity that negotiates resources from the


ResourceManager
a) NodeManager
b) ResourceManager
c) ApplicationMaster
d) All of the mentioned
Answer: c
Explanation: Each ApplicationMaster has responsibility for negotiating appropriate resource
containers from the Scheduler.
165. Yarn commands are invoked by the ________ script.
a) hive
b) bin
c) hadoop
d) home
Answer: b
Explanation: Running the yarn script without any arguments prints the description for all
commands.

166 .Which of the following command runs ResourceManager admin client ?


a) proxyserver
b) run
c) admin
d) rmadmin
Answer: d
Explanation: proxyserver command starts the web proxy server.

167. ________ generates keys of type LongWritable and values of type Text.


a) TextOutputFormat
b) TextInputFormat
c) OutputInputFormat
d) None of the mentioned
Answer: b
Explanation: If K2 and K3 are the same, you don’t need to call setMapOutputKeyClass().



168. An input ________ is a chunk of the input that is processed by a single map.
a) textformat
b) split
c) datanode
d) all of the mentioned
Answer: b
Explanation: Each split is divided into records, and the map processes each record—a key-value
pair—in turn.

169. Which of the following method add a path or paths to the list of inputs ?
a) setInputPaths()
b) addInputPath()
c) setInput()
d) none of the mentioned
Answer: b
Explanation: FileInputFormat offers four static convenience methods for setting a JobConf’s input
paths.
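A hedged sketch of two of those convenience methods in use is shown below; the input paths are placeholders invented for the example.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class InputPathExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(InputPathExample.class);
            // Replace the current list of inputs with these paths.
            FileInputFormat.setInputPaths(conf, new Path("/data/2020"), new Path("/data/2021"));
            // Add one more path to the existing list.
            FileInputFormat.addInputPath(conf, new Path("/data/2022"));
        }
    }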

170. The split size is normally the size of an ________ block, which is appropriate for most
applications.
a) Generic
b) Task
c) Library
d) HDFS
Answer: d
Explanation: FileInputFormat splits only large files (Here “large” means larger than an HDFS
block).

171 .Point out the correct statement:


a) The minimum split size is usually 1 byte, although some formats have a lower bound on the
split size
b) Applications may impose a minimum split size
c) The maximum split size defaults to the maximum value that can be represented by a Java long
type
d) All of the mentioned
Answer: a
Explanation: The maximum split size has an effect only when it is less than the block size, forcing
splits to be smaller than a block.

172 .To set an environment variable in a streaming command use:



a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
Answer: c
Explanation: Environment variables are set using the -cmdenv option.

173 .Point out the wrong statement:


a) Hadoop works better with a small number of large files than a large number of small files
b) CombineFileInputFormat is designed to work well with small files
c) CombineFileInputFormat does not compromise the speed at which it can process the input in a
typical MapReduce job
d) None of the mentioned
Answer: c
Explanation: If the file is very small (“small” means significantly smaller than an HDFS block)
and there are a lot of them, then each map task will process very little input, and there will be a lot
of them (one per file), each of which imposes extra bookkeeping overhead.

174. Which of the following class is provided by Aggregate package ?
a) Map


b) Reducer
c) Reduce
d) None of the mentioned
Answer: b
Explanation: Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on over a
sequence of values.

175. ________ is the output produced by TextOutputFormat, Hadoop’s default OutputFormat.


a) KeyValueTextInputFormat
b) KeyValueTextOutputFormat
c) FileValueTextInputFormat
d) All of the mentioned
Answer: b
Explanation: To interpret such files correctly, KeyValueTextInputFormat is appropriate.

176. Point out the wrong statement:
a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in the sort stage
Answer: a
Explanation: Reducer has 3 primary phases: shuffle, sort and reduce.

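A minimal Java sketch of the third phase, reduce; by the time reduce() is called, the shuffle and sort phases have already grouped all the values for a key together (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum all values that the shuffle and sort phases grouped under one key.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
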
177. The output of the ______ is not sorted in the MapReduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding
d) None of the mentioned
Answer: d
Explanation: The output of the reduce task is typically written to the FileSystem. The output of the Reducer is not sorted.

178. Which of the following phases occur simultaneously?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned
Answer: a
Explanation: The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.

179. Input to the ______ is the sorted output of the mappers.
a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned
Answer: a
Explanation: In the shuffle phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

180. Mapper implementations are passed the JobConf for the job via the ______ method.
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned
Answer: b
Explanation: The JobConfigurable.configure method is overridden so that Mapper implementations can initialize themselves.

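A hedged sketch using the old org.apache.hadoop.mapred API: MapReduceBase implements JobConfigurable, so a mapper can override configure(JobConf) to initialize itself from job settings before map() runs (the property name is hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ThresholdMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private int minLength;

    @Override
    public void configure(JobConf job) {
        minLength = job.getInt("example.min.line.length", 1); // hypothetical property
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        if (line.getLength() >= minLength) {
            output.collect(line, new IntWritable(line.getLength()));
        }
    }
}
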
BIG DATA
Q.No / Question / Options (1)-(4)

1. An open-source build tool for Scala projects is: (1) simple build tool (2) sequential build tool (3) complex build tool (4) script build tool

2. Which of these is a tuple? (1) val exam=(1,1) (2) val exam=("one","two","three") (3) val exam=(1,"element",10.2) (4) none

3. Which of the following is a transformation? (1) take(n) (2) top() (3) countByValue() (4) mapPartitionsWithIndex()

4. Which of the following is an action? (1) distinct() (2) intersection() (3) union(datasets) (4) countByValue()

5. Which of the following is good for low-level transformations and actions? (1) RDD (2) Dataframe (3) Dataset (4) none

6. Identify the scenario in which the Big Data analytics benefit is not evident: (1) Customer Segmentation (2) Fraud Identification (3) POS transaction (4) Social Marketing Optimization

7. Which unique feature of distributed computing increases processing power? (1) By adding more high-capacity disks (2) By moving the computing task into the cloud (3) By distributing the results to several users of the network (4) By distributing the computing task among several computers

8. Point out the correct statement: (1) Hadoop is an ideal environment for extracting and transforming small volumes of data (2) Hadoop stores data in HDFS and supports data compression/decompression (3) The Giraph framework is less useful than MapReduce (4) None

9. Which is not a characteristic of Big Data? (1) Volume (2) Velocity (3) Variety (4) Variable

10. Which are the 3 major parallel processing approaches? (1) IaaS, PaaS, SaaS (2) Database, SQL, Network (3) Clusters or grids, MPP, HPC (4) Network, Cloud, multitenancy

11. Which of the following genres does Hadoop produce? (1) Distributed file system (2) JAX-RS (3) Java Message Service (4) RDBMS

12. Which of the following is not a Spark SQL query execution phase? (1) Execution (2) Analysis (3) Logical optimization (4) Physical planning

13. What does the serving layer do in a Lambda-Architecture-compliant system? (1) Indexing batch views (2) Dealing with recent data (3) Managing the master dataset (4) Compensating for the high latency of the batch layer

14.

15. Hive supports complex index types: (1) True (2) False

16. What is the correct statement to access HDFS from the Hive CLI? (1) ./hdfs -ls /user/hduser/input (2) ./hadoop dfs -ls /user/hduser/input (3) hadoop -ls /user/hduser/input (4) dfs -ls /user/hduser/input

17. The CREATE statement in Hive is related to: (1) Session control statement (2) DDL statement (3) Embedded SQL statement (4) DML statement

18. Partitioning of a table in Hive creates more: (1) Files under the database name (2) Subdirectories under the database name (3) Subdirectories under the table name (4) Files under the table name

19. To list tables with the prefix 'page' in Hive, we use the syntax: (1) SHOW TABLES page_view (2) SHOW PARTITIONS page_view (3) DESCRIBE EXTENDED page_view (4) SHOW TABLES "page.*"

20. Hive is designed mainly for: (1) OLAP (2) OLTP (3) Both 1 & 2 (4) None of the listed

21. Hive is schema-on-read and not schema-on-write: (1) True (2) False

22. The underlying data is not deleted from HDFS when a Hive external table is dropped: (1) True (2) False

23. Hive supports random reads and writes: (1) True (2) False

24. Identify the correct syntax to run a Hive query: (1) hive -e "select" (2) hive -e (3) hive -f query.hql (4) hive -f "select"

25. In order to use bucketing in a Hive session, we should set the parameter SET hive.enforce.bucketing=true: (1) True (2) False

26. The default port of the Hive Thrift server is: (1) 10 (2) 70050 (3) 10000 (4) 1245

27. Create table a(a1 int, a2 int, a3 int); what will be the result of select * from a? (1) 20 5 8, 90 5 6, 20 5 8 (2) 69 26 3 (3) 69 5 6 (4) 45 9 20

28. A Cartesian product join is needed: (1) When we do not specify the key on which we want to make the join (2) When we need to access all the data from a table and when we retain all records from the table that is on the right-hand side of the join in a given query (3) When we retain all records from the table that is on the right-hand side of the join in a given query (4) When we need to access all the data from a table

29. Denormalizing data in Hive: (1) only improves the performance (2) avoids multiple disk seeks and improves the performance (3) avoids unrelated data (4) avoids multiple disk seeks

30. Hive supports external tables: (1) True (2) False

31. We can run Pig in batch mode using: (1) Pig shell (2) Pig scripts (3) Pig options (4) All the above

32. Which of the following commands will run Pig in local mode? (1) $ pig -x local (2) $ pig -x tez_local (3) $ pig (4) None

33. You can run Pig in interactive mode using the ___________ shell: (1) Grunt (2) FS (3) HDFS (4) None

34. Which of the following is a shortcut for the DUMP operator? (1) \de alias (2) \d alias (3) \q (4) None

35. Which of the following ecosystem tools provides a dataflow language to transform data? (1) Hive (2) Pig (3) Sqoop (4) Flume

36. Which of the following is not a Pig static helper function? (1) double (2) bag (3) map (4) string

37. Point out the wrong statement: (1) To run Pig in local mode, you need access to a single machine (2) The DISPLAY operator will display the results on your terminal screen (3) To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation (4) All the listed options

38. Which operator is used to read data in Pig? (1) WRITE (2) READ (3) LOAD (4) None of the given options

39. Pig is a _____ language: (1) Dataflow (2) Declarative (3) Import-export (4) Scheduling engine

40. Pig operates in how many modes? (1) 2 (2) 3 (3) 4 (4) 5

41. Pig supports the following types of joins: (1) Inner join (2) Left outer join (3) Right outer join (4) All the listed options

42. We can run Pig in interactive mode using: (1) Grunt (2) FS (3) HDFS (4) None of the options

43. Which of the following is the default mode? (1) mapreduce (2) Tez (3) Local (4) All the listed options

44. If the database contains some tables, then it can be forced to drop without dropping the tables by using the keyword: (1) restrict (2) overwrite (3) f drop (4) cascade

45. Which of the following is the default mode? (1) tez (2) mapreduce (3) local (4) All the listed above

46. Pig Latin statements are generally organized in one of: (1) A LOAD statement to read data (2) A series of transformation statements (3) A DUMP statement to view results (4) All the listed options

47. In which ecosystem tool is the schema optional? (1) Hive (2) Pig (3) Sqoop (4) Flume

48. Which daemons of Hadoop should be running while executing a Pig script in local mode? (1) None, because the script is running in local mode (2) NameNode alone (3) All the 4 daemons of Hadoop (4) DataNode

49. Point out the correct statement: (1) LoadMeta has methods to convert byte arrays to specific types (2) The Pig load/store API is aligned with the Hadoop InputFormat class (3) LoadPush has methods to push operations from the Pig runtime into loader implementations (4) All the listed options

50. Which are all the different modes available in which Pig can run? (There are two modes: interactive mode and batch mode.) (1) I and III (2) II and III (3) I and IV (4) I and II

51. The language used in Hive is Pig Latin: (1) True (2) False

52. Which of the following functions is used to read data in Pig? (1) read (2) write (3) load (4) None of the above

53. Pig can be called from Java: (1) True (2) False

54. Input to the _____ is the sorted output of the mapper: (1) Reducer (2) Mapper (3) Shuffle (4) All the listed options

55. A big-data company was running a Hadoop cluster with all the monitoring facilities properly configured. Which of the following scenarios will go undetected? (1) HDFS is almost full (2) Map or reduce tasks that are stuck in an infinite loop (3) NameNode goes down (4) MapReduce jobs that are causing excessive memory swaps

56. A mapper can communicate: (1) True (2) False

57. The primary interface for a user to describe a MapReduce job to the Hadoop framework for execution is: (1) jobconfig (2) jobconf (3) jobconfiguration (4) All of the above

58. Mapper and Reducer implementations can use the _________ to report progress or just indicate that they are alive: (1) partitioner (2) outputcollector (3) reporter (4) All of the above

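As an illustration (old org.apache.hadoop.mapred API; the counter group and name are hypothetical), a reducer can use the Reporter both to signal that it is alive and to bump a counter:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CountingReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
            reporter.progress();                               // tell the framework we are alive
        }
        reporter.incrCounter("examples", "keys_reduced", 1);   // hypothetical counter group/name
        output.collect(key, new IntWritable(sum));
    }
}
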
59. Identify the correct MapReduce feature: (1) Automatic parallelization and distribution (2) Fault-tolerance (3) Speculative execution (4) All of the above

60. A ___ node acts as the slave and is responsible for executing a task assigned to it by the JobTracker: (1) mapreduce (2) mapper (3) tasktracker (4) jobtracker

61. Which is the correct statement? (1) In an MR job, map tasks store the intermediate data into HDFS (2) Reducer output will be written to HDFS (3) Intermediate data created by map tasks will be used to analyze the job history (4) None

62. In the MapReduce framework, map and reduce functions can be run in any order: (1) No, because the output of the reduce function is the input of the map function (2) Yes, because in functional programming the order of execution is not important (3) No, because the output of the map function is the input of the reduce function (4) Yes, because the functions use key/value pairs as input and output, so order is not important

63. Which of the following is the default partitioner for MapReduce? (1) HashPartitioner (2) HasPartitioner (3) MergePartitioner (4) None

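For reference, the default HashPartitioner effectively computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; a custom partitioner only needs to override getPartition(), as in this illustrative Java sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route keys by their first character so related keys land in the same reducer.
        char first = key.toString().isEmpty() ? '_' : key.toString().charAt(0);
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numPartitions;
    }
}
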
64. Which of the following classes is provided by the Aggregate package? (1) Map (2) Reducer (3) Reduce (4) None

65. The number of maps is usually driven by the total size of: (1) inputs (2) OutputCollector (3) tasks (4) None

66. ___________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution: (1) MapParameters (2) JobConf (3) memoryConf (4) None

67. The maximum virtual memory of the launched child-task is specified using: (1) mapv (2) mapred (3) mapvim (4) All the listed options

68. Mapper implementations are passed the JobConf for the job via the ______ method: (1) JobConfigure.configure (2) JobConfigurable.configure (3) JobConfigurable.configureable (4) None

69. Each mapper will communicate with each reducer: (1) True (2) False

70. To set an environment variable in a streaming command, use: (1) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ (2) -cmden EXAMPLE_DIR=/home/example/dictionaries/ (3) -cmdev EXAMPLE_DIR=/home/example/dictionaries/ (4) -cmenv EXAMPLE_DIR=/home/example/dictionaries/

71. ______ maps input key/value pairs to a set of intermediate key/value pairs: (1) Mapper (2) Reducer (3) Both mapper and reducer (4) None

72. What is data localization? (1) Bringing the data to be processed into a single node (2) Running the map task on the node where the data block sits (3) Bringing the replicated blocks into a single node (4) All the listed options

73. ______ is a utility which allows users to create and run jobs with any executable as the mapper and/or the reducer: (1) Hadoop Strdata (2) Hadoop streaming (3) Hadoop stream (4) None

74. A mapper outputs zero or more key/value pairs: (1) True (2) False

75. The ______ function is responsible for consolidating the results produced by each of the map() functions/tasks: (1) Reduce (2) Map (3) Reducer (4) All of the above

76. ______ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer: (1) OutputCompactor (2) OutputCollector (3) InputCollector (4) All the listed options

77. Correct statement for the Hadoop framework: (1) Hadoop attempts to run a mapper on a node where the data resides locally (2) Multiple mappers run in parallel (3) Mappers read data in the form of key/value pairs (4) All the listed options

78. Point out the correct statement: (1) The minimum split size is usually 1 byte, although some formats have a lower bound on the split size (2) Applications may impose a minimum split size (3) The maximum split size defaults to the maximum value that can be represented by a Java long type (4) All the listed options

79. A combiner increases the amount of work to be done by the reducer by reducing the traffic: (1) True (2) False

80. Point out the wrong statement: (1) Hadoop works better with a small number of large files than a large number of small files (2) CombineFileInputFormat is designed to work well with small files (3) CombineFileInputFormat does not compromise the speed at which it can process the input in a typical MR job (4) none

81. By default, a map task works on: (1) a single block (2) multiple blocks (3) the whole stream of data (4) All the listed options

82. Point out the wrong statement: (1) It is legal to set the number of reducer tasks to zero if no reduction is desired (2) The output of the map tasks goes directly to the FileSystem (3) The MR framework does not sort the map outputs before writing them out to the FileSystem (4) none

83. The split size is normally the size of a ___ block, which is appropriate for most applications: (1) Generic (2) Task (3) Library (4) HDFS

84. Point out the wrong statement: (1) Reducer has two primary phases (2) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures (3) It is legal to set the number of reducer tasks to zero if no reduction is desired (4) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in the sort stage

85. The ______ class allows the framework to partition the map outputs: (1) KeyFieldPartitioner (2) KeyFieldBasedPartitioner (3) KeyFieldBased (4) none

86. A reducer task cannot be started until all mapper tasks get completed: (1) True (2) False

87. What happens if a mapper task runs slowly relative to other mapper tasks? (1) Reducer tasks cannot start until the last mapper gets completed (2) Hadoop will trigger another instance of the same mapper task on another node (3) If another instance of the mapper task finishes, then Hadoop will kill the slowly running mapper task (4) All the above

88. In an MR job, maps store intermediate data on the local disk, and it will be deleted once the job is done: (1) True (2) False

89. Which of the following partitions the key space? (1) Partitioner (2) Compactor (3) Collector (4) All the listed options

90. The output of the ______ is not sorted in the MR framework for Hadoop: (1) mapper (2) cascader (3) scalding (4) none

91. The reduce phase can be started before all the mappers complete their tasks: (1) True (2) False

92. Correct syntax for Pig parameter substitution using cmd: (1) {-param param_name=param_value | -param_file file_name} [-debug | -dryrun] script (2) {%declare | %default} param_name param_value (3) {%declare | %default} param_name param_value cmd (4) all the listed options

93. Which of the following phases occur simultaneously? (1) shuffle and sort (2) reduce and sort (3) shuffle and map (4) all the listed options

94. A Sqoop-imported table would be saved in which of the following? (1) In an HDFS special destination present in the Sqoop configuration (2) In HDFS, inside a directory with the same name as the table (3) In the local file system of the Hadoop cluster (4) Will be saved in a …

95. Which Sqoop tool would the Hadoop admin use to list all the databases in a server that Sqoop can connect to? (1) APIs (2) JDBC connector (3) Sqoop list-databases tool (4) Sqoop show-databases

96. Point out the wrong statement: (1) The DataNode is the slave/worker node and holds the user data in the form of data blocks (2) Each incoming file is broken into 32 MB by default (3) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance (4) None of the above

97. A ________ serves as the master, and there is only one NameNode per cluster: (1) DataNode (2) NameNode (3) Data Block (4) Replication

98. The number of mappers is decided by the MapReduce framework: (1) True (2) False

99. The ______ NameNode is used when the primary NameNode goes down: (1) rack (2) data (3) secondary (4) none

100. The number of reducers can't be decided by the user: (1) True (2) False

101. A client reading data from the HDFS filesystem in Hadoop: (1) gets the data from the NameNode (2) gets the block location from the DataNode (3) gets only the block location from the NameNode (4) gets both the data and the block location from the NameNode

102. Point out the wrong statement: (1) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner (2) The MR framework operates exclusively on <key, value> pairs (3) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods (4) None

103. Each mapper will communicate with each reducer: (1) True (2) False

104. A loader implementation should implement _________ if casts (implicit or explicit) from DataByteArray fields to other types need to be supported: (1) LoadPushDown (2) LoadMetadata (3) LoadCaster (4) All the listed options

105. In which ecosystem tool is the schema optional? (1) Hive (2) Pig (3) Sqoop (4) Flume

106. Which of the following is the design and execution structure supported by an Oozie workflow? (1) Acyclic graphs (2) Direct non-acyclic graphs (3) Direct acyclic graphs (4) Non-acyclic graphs

107. If a requirement is to run many ad-hoc queries on top of HDFS data, which ecosystem tool will you suggest to the client for the use case? (1) Pig (2) Sqoop (3) Hive (4) MapReduce

108. A user writes a query in a file called guna.hql; which is correct? (1) SELECT * from Table1 (2) cat /home/myuser… (3) hive -f /home/path/qu… (4) All of the above

109. pig -x tez_local will enable __________ mode in Pig: (1) Tez (2) mapreduce (3) Local (4) None of the options

110. The DataNode sends a heartbeat every 9 seconds: (1) True (2) False

111. What is the default replication factor in HDFS? (1) 1 (2) 2 (3) 3 (4) 4

112. When using HDFS, what occurs when a file is deleted from the command line? (1) It is permanently deleted and the file attributes are recorded in a log file (2) Files in HDFS cannot be deleted (3) It becomes hidden from the user but stays in the file system (4) none

113. The NameNode is a single point of failure: (1) True (2) False

114. True about HDFS: (1) HDFS is based on the Google File System (2) HDFS is written in Java (3) Sits on top of the native file system (4) All the listed options

115. How would data received from GPS, satellites and the Web be classified? (1) Structured data (2) Unstructured data (3) Semi-structured data (4) Both structured and semi-structured

116. HDFS works by breaking large files into smaller pieces; these smaller pieces of files are called: (1) Namespace (2) DataNode (3) NameNode (4) Blocks

117. Data is captured which can be in any form, structured or unstructured. Which characteristic of Big Data is this? (1) Velocity (2) Variety (3) Volume (4) Variable

118. List all the types of Big Data: (1) Structured data and unstructured data (2) Semi-structured data (3) Structured, semi-structured and unstructured data (4) Semi-structured and unstructured data

119. What is the default database when we connect to Hive? (1) user (2) hivedb (3) hbase (4) default

120. Which of the following components provides support for automatic execution of workflows based on events and the presence of system resources? (1) Zookeeper (2) Oozie Coordinator (3) Ambari (4) All of the options

121. Hive parses data at the time of loading: (1) True (2) False

123. Hive supports complex index types: (1) True (2) False

124. What is the correct statement to access HDFS from the Hive CLI? (1) ./hdfs -ls /user/hduser/input (2) ./hadoop dfs -ls /user/hduser/input (3) hadoop -ls /user/hduser/input (4) dfs -ls /user/hduser/input

125. Which of the following statements is true about SerDes in Hive? (1) A SerDe is a library for serialization and deserialization (2) A SerDe can be customized to allow Hive to understand your own custom format (3) A SerDe is a mechanism that Hive uses to parse various formats of data stored in HDFS, to be used by Hive (4) All of the options listed

126. Hive supports LEFT SEMI JOIN: (1) True (2) False

127. The Hive built-in function size(map<K,V>) is used to: (1) Return null if the conversion does not succeed (2) Convert the results into the Map type (3) Return the number of elements in the map type (4) Only return the number of elements in the Map type and return null if the conversion does not succeed

128. When does Spark evaluate an RDD? (1) Upon an action (2) Upon a transformation (3) On both (4) None

129. Spark Core's fast scheduling capability to perform streaming analytics is leveraged by: (1) Spark Streaming (2) MLlib (3) GraphX (4) RDDs

130. Which is the default input format defined in Hadoop? (1) SequenceFileInputFormat (2) ByteInputFormat (3) KeyValueInputFormat (4) TextInputFormat

131. YARN stands for: (1) Yahoo another resource name (2) Yet another resource negotiator (3) Yet another resource need (4) Yahoo archived resource name

132. The secondary NameNode is a backup for the NameNode: (1) False (2) True

133. What is the default block size of Hadoop 2? (1) 128 MB (2) 64 MB (3) 128 GB (4) 64 GB

134. The DataNode sends a heartbeat every 9 seconds: (1) True (2) False

135. What is the default replication factor? (1) 1 (2) 2 (3) 3 (4) 4

136. HDFS stands for: (1) Hadoop direct file system (2) Hadoop distributed file system (3) Hadoop direct file synchronization (4) Hadoop distributed file synchronization

137. Which database does Hive use for storing metadata about the Hive tables? (1) MySQL (2) HBase (3) Derby (4) MongoDB

138. Hive is: (1) Schema on write (2) Schema-less (3) Schema on read (4) All of the options

139. What are the complex data types Hive supports? (1) Arrays (2) Maps (3) Structs (4) All of the options

140. Which statements are true while loading data into a table in Hive? (1) Data gets copied from the source location into the table's directory in both managed and external tables (2) Data gets moved in the case of a managed table, while a pointer to the source data location is created in the case of external tables (3) Data gets moved in the case of an external table, while a pointer to the source data location is created in the case of managed tables (4) Data gets moved from the source location into the table's directory in both managed and external tables

141. Which of the following is …? (1) Acyclic graphs (2) Direct non-acyclic graphs (3) Direct acyclic graphs (4) Non-acyclic graphs

142. Why is Spark an integrated solution for processing on all Lambda Architecture layers? (1) It contains Spark Core, which includes a high-level API and an optimized engine that supports general execution graphs (2) It contains Spark SQL for SQL and structured data processing (3) It contains Spark Streaming, which enables scalable, high-throughput, fault-tolerant stream processing of live data (4) All of the above