BIG DATA (KCS-061) 2020-21

Big Data (KCS-061)

Course Outcomes (CO)
At the end of the course, the student will be able to:

CO 1 Demonstrate knowledge of Big Data Analytics concepts and their applications in business.
CO 2 Demonstrate functions and components of the MapReduce framework and HDFS.
CO 3 Discuss data management concepts in a NoSQL environment.
CO 4 Explain the process of developing MapReduce-based distributed processing applications.
CO 5 Explain the process of developing applications using HBase, Hive, Pig, etc.
DETAILED SYLLABUS

Unit I
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to the Big Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big Data, Big Data technology components, Big Data importance and applications, Big Data features – security, compliance, auditing and protection, Big Data privacy and ethics, Big Data Analytics, challenges of conventional systems, intelligent data analysis, nature of data, analytic processes and tools, analysis vs reporting, modern data analytic tools.

Unit II
Hadoop: History of Hadoop, Apache Hadoop, the Hadoop Distributed File System, components of Hadoop, data format, analyzing data with Hadoop, scaling out, Hadoop streaming, Hadoop pipes, Hadoop ecosystem.
MapReduce: MapReduce framework and basics, how MapReduce works, developing a MapReduce application, unit tests with MRUnit, test data and local tests, anatomy of a MapReduce job run, failures, job scheduling, shuffle and sort, task execution, MapReduce types, input formats, output formats, MapReduce features, real-world MapReduce.

Unit III
HDFS (Hadoop Distributed File System): Design of HDFS, HDFS concepts, benefits and challenges, file sizes, block sizes and block abstraction in HDFS, data replication, how HDFS stores, reads, and writes files, Java interfaces to HDFS, command line interface, Hadoop file system interfaces, data flow, data ingest with Flume and Sqoop, Hadoop archives, Hadoop I/O: compression, serialization, Avro and file-based data structures.
Hadoop Environment: Setting up a Hadoop cluster, cluster specification, cluster setup and installation, Hadoop configuration, security in Hadoop, administering Hadoop, HDFS monitoring & maintenance, Hadoop benchmarks, Hadoop in the cloud.

Unit IV
Hadoop Ecosystem and YARN: Hadoop ecosystem components, schedulers, fair and capacity, Hadoop 2.0 new features – NameNode high availability, HDFS federation, MRv2, YARN, running MRv1 in YARN.
NoSQL Databases: Introduction to NoSQL.
MongoDB: Introduction, data types, creating, updating and deleting documents, querying, introduction to indexing, capped collections.
Spark: Installing Spark, Spark applications, jobs, stages and tasks, Resilient Distributed Datasets, anatomy of a Spark job run, Spark on YARN.
Scala: Introduction, classes and objects, basic types and operators, built-in control structures, functions and closures, inheritance.

Unit V
Hadoop Ecosystem Frameworks: Applications on Big Data using Pig, Hive and HBase.
Pig: Introduction to Pig, execution modes of Pig, comparison of Pig with databases, Grunt, Pig Latin, user defined functions, data processing operators.
Hive: Apache Hive architecture and installation, Hive shell, Hive services, Hive metastore, comparison with traditional databases, HiveQL, tables, querying data and user defined functions, sorting and aggregating, MapReduce scripts, joins & subqueries.
HBase: HBase concepts, clients, example, HBase vs RDBMS, advanced usage, schema design, advanced indexing; ZooKeeper – how it helps in monitoring a cluster, how to build applications with ZooKeeper; IBM Big Data strategy, introduction to InfoSphere, BigInsights and BigSheets, introduction to Big SQL.

Big Data (KCS-061)

Solved MCQ

1. Unit-I
2. Unit-II
3. Unit-III
4. Unit-IV
5. Unit-V

Unit-I

1. How many V's are present in Big Data?
a. 3   b. 4   c. 5   d. 6
Answer: (c)

2. Data in a relational database is:
a. Structured   b. Unstructured   c. Semi-structured   d. Metadata
Answer: (a)

3. In how many forms is data found in big data?
a. 2   b. 3   c. 4   d. 5
Answer: (b)

4. What kind of data is in log files?
a. Structured   b. Unstructured   c. Semi-structured   d. Metadata
Answer: (c)

5. The overall percentage of the world's total data created within the past two years is:
a. 80%   b. 85%   c. 90%   d. 95%
Answer: (c)

6. What are the main components present in Big Data Analytics?
a. MapReduce   b. HDFS   c. YARN   d. All of the above
Answer: (d)

7. What are the major benefits of Big Data processing?
a. Businesses can utilize outside intelligence while taking decisions
b. Improved customer service
c. Better operational efficiency
d. All of the above
Answer: (d)

8. Hadoop is written in which programming language?
a. C   b. C++   c. Java   d. Python
Answer: (c)

9. Which of the following options are NOT related to big data problem(s)?
a. Parsing a 5 MB XML file every 2 minutes
b. Processing Twitter data
c. Processing online banking transactions
d. Both (a) and (c)
Answer: (d)

10. What does the characteristic "Velocity" in Big Data represent?
a. Speed of input data generation
b. Speed of individual machine processors
c. Speed of ONLY storing data
d. Speed of storing and processing data
Answer: (d)

11. Which of the following are example(s) of real-time big data processing?
a. Complex Event Processing (CEP) platforms
b. Stock market data analysis
c. Bank fraud transaction detection
d. Both (a) and (c)
Answer: (d)

12. Hadoop is open source.
a. ALWAYS true
b. True only for Apache Hadoop
c. True only for Apache and Cloudera Hadoop
d. ALWAYS false
Answer: (b)

13. Which of the following is not an example of social media?
a. Twitter   b. Google   c. Instagram   d. YouTube
Answer: (b)

14. By 2027, the volume of data produced digitally will reach:
a. TB   b. YB   c. ZB   d. EB
Answer: (c)

15. For drawing insights for business, what is needed?
a. Collecting the data   b. Storing the data   c. Analysing the data   d. All of the above
Answer: (d)

16. Facebook uses "Big Data" to determine the behavior of its users. True or false?
a. TRUE   b. FALSE
Answer: (a)

17. The process of describing data that is huge and complex to store and process is known as:
a. Analytics   b. Data mining   c. Big Data   d. Data warehouse
Answer: (c)

18. Data generated from online transactions is one example of the volume of big data. True or false?
a. TRUE   b. FALSE
Answer: (a)

19. Velocity is the speed at which the data is processed.
a. TRUE   b. FALSE
Answer: (b)

20. ______ data have a structure but cannot be stored in a database.
a. Structured   b. Semi-structured   c. Unstructured   d. None of these
Answer: (b)

21. ______ refers to the ability to turn your data into something useful for business.
a. Velocity   b. Variety   c. Value   d. Volume
Answer: (c)

22. Value tells the trustworthiness of data in terms of quality and accuracy.
a. TRUE   b. FALSE
Answer: (b)

23. Files are divided into ______-sized chunks.
a. Static   b. Dynamic   c. Fixed   d. Variable
Answer: (c)

24. ______ is an open-source framework for storing data and running applications on clusters of commodity hardware.
a. HDFS   b. Hadoop   c. MapReduce   d. Cloud
Answer: (b)

25. Hadoop MapReduce allows you to perform distributed parallel processing on large volumes of data quickly and efficiently. True or false?
a. TRUE   b. FALSE
Answer: (a)

26. In a relational database management system, the property of scaling is applicable.
a. TRUE   b. FALSE
Answer: (b)

27. Which of the following options is not an example of NoSQL?
a. Google   b. Netflix   c. Amazon   d. CERN
Answer: (c)

28. Scalability and better performance of NoSQL are achieved by sacrificing ACID compatibility. Is this true?
a. TRUE   b. FALSE
Answer: (a)

29. Scalability and better performance of NoSQL are attained by compromising ACID compatibility. Is this true?
a. TRUE   b. FALSE
Answer: (a)

30. ______ is a programming model for writing applications that can process big data in parallel on multiple nodes.
a. HDFS   b. MapReduce   c. Hadoop   d. Hive
Answer: (b)

31. Which of the following is a widely used and effective machine learning algorithm based on the idea of bagging?
a. Decision tree   b. Regression   c. Classification   d. Random forest
Answer: (d)

32. A data set is:
a. Tweets stored in a flat file
b. A collection of image files in a directory
c. An extract of rows from a database table stored in a CSV-formatted file
d. All of the above
Answer: (d)

33. Data analysis is the process of:
a. Examining data to find facts
b. Relationships
c. Patterns, insights and/or trends
d. All of the above
Answer: (d)

34. The general categories of analytics that are distinguished by the results they produce are:
a. Descriptive analytics   b. Diagnostic analytics   c. Predictive analytics   d. All of the above
Answer: (d)

35. BI enables an organization to gain insight into the performance of an enterprise:
a. By analyzing data generated by its business processes and information systems
b. By examining data to find facts
c. From relationships
d. All of the above
Answer: (a)

36. Data variety refers to:
a. Multiple schemas   b. Multiple formats and types of data   c. Multiple data models   d. None of the above
Answer: (b)

37. Unstructured data consists of:
a. Text files, audio files   b. Video files, text data   c. Tagged data   d. Both (a) and (b)
Answer: (d)

38. Multiple internal and external data in big data come from multiple sources such as:
a. Sensors, social network sites   b. Email, XML, multimedia   c. Both (a) and (b)   d. None of the above
Answer: (c)

39. The ingestion layer should have the capability to:
a. Validate, cleanse, transform, reduce   b. Integrate   c. Preprocess the data   d. Both (a) and (b)
Answer: (d)

40. According to analysts, for what can traditional IT systems provide a foundation when they're integrated with big data technologies like Hadoop?
a. Big data management and data mining
b. Data warehousing and business intelligence
c. Management of Hadoop clusters
d. Collecting and storing unstructured data
Answer: (a)

41. What are the main components of Big Data?
a. MapReduce   b. HDFS   c. YARN   d. All of these
Answer: (d)

42. What are the different features of Big Data Analytics?
a. Open-source   b. Scalability   c. Data recovery   d. All of the above
Answer: (d)

43. What are the four V's of Big Data?
a. Volume   b. Velocity   c. Variety   d. All of the above
Answer: (d)

44. All of the following accurately describe Hadoop, EXCEPT:
a. Open-source   b. Real-time   c. Java-based   d. Distributed computing approach
Answer: (b)

45. ______ is a general-purpose computing model and runtime system for distributed data analytics.
a. MapReduce   b. Drill   c. Oozie   d. None of the above
Answer: (a)

46. The examination of large amounts of data to see what patterns or other useful information can be found is known as:
a. Data examination   b. Information analysis   c. Big data analytics   d. Data analysis
Answer: (c)

47. Big data analysis does the following, except:
a. Collects data   b. Spreads data   c. Organizes data   d. Analyzes data
Answer: (b)

48. What makes Big Data analysis difficult to optimize?
a. Big Data is not difficult to optimize
b. Both data and cost-effective ways to mine data to make business sense out of it
c. The technology to mine data
d. All of the above
Answer: (b)

49. The new source of big data that will trigger a Big Data revolution in the years to come is:
a. Business transactions   b. Social media   c. Transactional data and sensor data   d. RDBMS
Answer: (c)

50. The unit of data that flows through a Flume agent is:
a. Log   b. Row   c. Event   d. Record
Answer: (c)

51. Listed below are the three steps that are followed to deploy a big data solution, except:
a. Data ingestion   b. Data processing   c. Data dissemination   d. Data storage
Answer: (c)

52. Which industries employ the use of so-called "Big Data" in their day-to-day operations?
a. Weather forecasting   b. Marketing   c. Healthcare   d. All of the above
Answer: (d)

53. There are almost as many bits of information in the digital universe as there are stars in the actual universe.
a. True   b. False
Answer: (a)

54. The term 'Big Data' was coined by:
a. Roger Mougalas   b. John Philips   c. Simon Woods   d. Martin Green
Answer: (a)

55. The term 'Big Data' was coined in the year:
a. 2000   b. 1970   c. 1998   d. 2005
Answer: (c)

56. Concerning the forms of big data, which one of these is the odd one out?
a. Structured   b. Unstructured   c. Processed   d. Semi-structured
Answer: (c)

57. Big Data applications benefit the media and entertainment industry by:
a. Predicting what the audience wants   b. Ad targeting   c. Scheduling optimization   d. All of the above
Answer: (d)

58. The feature of big data that refers to the quality of the stored data is:
a. Variety   b. Volume   c. Variability   d. Veracity
Answer: (d)

59. ______ is a framework for performing remote procedure calls and data serialization.
a. Drill   b. BigTop   c. Avro   d. Chukwa
Answer: (c)

60. Which of the following is a characteristic of big data?
a. Huge volume of data
b. Complexity of data types and structures
c. Speed of data creation and growth
d. All of the mentioned
Answer: (d)

61. Concurrent access to shared data may result in:
a. Data consistency   b. Data insecurity   c. Data inconsistency   d. None of the mentioned
Answer: (c)

62. Mutual exclusion implies that:
a. If a process is executing in its critical section, then no other process may be executing in its critical section
b. If a process is executing in its critical section, then other processes must be executing in their critical sections
c. If a process is executing in its critical section, then all the resources of the system must be blocked until it finishes execution
d. None of the mentioned
Answer: (a)

63. In the memory hierarchy, as the speed of operation increases, the memory size also increases.
a. True   b. False
Answer: (b)

64. To use a ______ network service, the service user first establishes a connection, uses the connection, and terminates the connection.
a. Connection-oriented   b. Connectionless   c. Service-oriented   d. Service-less
Answer: (a)

65. Which layer is responsible for process-to-process delivery?
a. Network   b. Transport   c. Application   d. Physical
Answer: (b)

66. ______ refers to the biases, noise and abnormality in data, i.e. the trustworthiness of data.
a. Value   b. Veracity   c. Velocity   d. Volume
Answer: (b)

67. ______ refers to the connectedness of big data.
a. Value   b. Veracity   c. Velocity   d. Valence
Answer: (d)
Unit-II

1. Which one of the following is false about Hadoop?
a. It is a distributed framework
b. The main algorithm used in it is MapReduce
c. It runs with commodity hardware
d. All are true
Answer: (d)

2. What license is Apache Hadoop distributed under?
a. Apache License 2.0   b. Shareware   c. Mozilla Public License   d. Commercial
Answer: (a)

3. Which of the following platforms does Apache Hadoop run on?
a. Bare metal   b. Unix-like   c. Cross-platform   d. Debian
Answer: (c)

4. Apache Hadoop achieves reliability by replicating the data across multiple hosts and hence does not require ______ storage on hosts.
a. Standard RAID levels   b. RAID   c. ZFS   d. Operating system
Answer: (b)

5. Hadoop works in:
a. master-worker fashion   b. master-slave fashion   c. worker/slave fashion   d. All of the mentioned
Answer: (b)

6. The type of data Hadoop can deal with is:
a. Structured   b. Semi-structured   c. Unstructured   d. All of the above
Answer: (d)

7. Which statement is false about Hadoop?
a. It runs with commodity hardware
b. It is a part of the Apache project sponsored by the ASF
c. It is best for live streaming of data
d. None of the above
Answer: (c)

8. As compared to an RDBMS, Apache Hadoop:
a. Has higher data integrity
b. Does ACID transactions
c. Is suitable for reading and writing many times
d. Works better on unstructured and semi-structured data
Answer: (d)

9. Hadoop can be used to create distributed clusters, based on commodity servers, that provide low-cost processing and storage for unstructured data.
a. True   b. False
Answer: (a)

10. ______ is a framework for performing remote procedure calls and data serialization.
a. Drill   b. BigTop   c. Avro   d. Chukwa
Answer: (c)

11. IBM and ______ have announced a major initiative to use Hadoop to support university courses in distributed computer programming.
a. Google Latitude   b. Android (operating system)   c. Google Variations   d. Google
Answer: (d)

12. What was Hadoop written in?
a. Java (software platform)   b. Perl   c. Java (programming language)   d. Lua (programming language)
Answer: (c)

13. Apache ______ is a serialization framework that produces data in a compact binary format.
a. Oozie   b. Impala   c. Kafka   d. Avro
Answer: (d)

14. Avro schemas describe the format of the message and are defined using:
a. JSON   b. XML   c. JS   d. All of the mentioned
Answer: (a)

15. In which languages can you code in Hadoop?
a. Java   b. Python   c. C++   d. All of the above
Answer: (d)

16. All of the following accurately describe Hadoop, EXCEPT:
a. Open source   b. Real-time   c. Java-based   d. Distributed computing approach
Answer: (b)

17. ______ has the world's largest Hadoop cluster.
a. Apple   b. Datamatics   c. Facebook   d. None of the mentioned
Answer: (c)

18. Which among the following is the default OutputFormat?
a. SequenceFileOutputFormat   b. LazyOutputFormat   c. DBOutputFormat   d. TextOutputFormat
Answer: (d)

19. Which of the following is not an input format in Hadoop?
a. ByteInputFormat   b. TextInputFormat   c. SequenceFileInputFormat   d. KeyValueInputFormat
Answer: (a)

20. Given (a) InputFormat, (b) Mapper, (c) Combiner, (d) Reducer, (e) Partitioner and (f) OutputFormat, what is the correct sequence of data flow in MapReduce?
a. abcdfe   b. abcedf   c. acdefb   d. abcdef
Answer: (b)

21. In which InputFormat is the tab character ('\t') used?
a. KeyValueTextInputFormat   b. TextInputFormat   c. FileInputFormat   d. SequenceFileInputFormat
Answer: (a)

22. Which among the following is true about SequenceFileInputFormat?
a. Key – byte offset; value – the contents of the line
b. Key – everything up to the tab character; value – the remaining part of the line after the tab character
c. Key and value – both are user-defined
d. None of the above
Answer: (c)

23. Which are the key and value in TextInputFormat?
a. Key – byte offset; value – the contents of the line
b. Key – everything up to the tab character; value – the remaining part of the line after the tab character
c. Key and value – both are user-defined
d. None of the above
Answer: (a)

24. Which of the following are built-in counters in Hadoop?
a. FileSystem counters   b. FileInputFormat counters   c. FileOutputFormat counters   d. All of the above
Answer: (d)

25. Which of the following is not an output format in Hadoop?
a. TextOutputFormat   b. ByteOutputFormat   c. SequenceFileOutputFormat   d. DBOutputFormat
Answer: (b)

26. Is it mandatory to set the input and output type/format in Hadoop MapReduce?
a. Yes   b. No
Answer: (b)

27. The parameters for Mappers are:
a. Text (input)   b. LongWritable (input)   c. Text (intermediate output)   d. All of the above
Answer: (d)

28. For a 514 MB file, how many InputSplits will be created?
a. 4   b. 5   c. 6   d. 10
Answer: (b)

29. Which among the following is used to provide multiple inputs to Hadoop?
a. MultipleInputs class   b. MultipleInputFormat   c. FileInputFormat   d. DBInputFormat
Answer: (a)

30. The Mapper implementation processes one line at a time via the ______ method.
a. map   b. reduce   c. mapper   d. reducer
Answer: (a)

31. The Hadoop MapReduce framework spawns one map task for each ______ generated by the InputFormat for the job.
a. OutputSplit   b. InputSplit   c. InputSplitStream   d. All of the mentioned
Answer: (b)

32. ______ can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of data.
a. MapReduce   b. Mahout   c. Oozie   d. All of the mentioned
Answer: (a)

33. The ______ part of MapReduce is responsible for processing one or more chunks of data and producing the output results.
a. Map task   b. Mapper   c. Task execution   d. All of the mentioned
Answer: (a)

34. The ______ function is responsible for consolidating the results produced by each of the Map() functions/tasks.
a. Map   b. Reduce   c. Reducer   d. Reduced
Answer: (b)

35. The number of maps is usually driven by the total size of the:
a. task   b. output   c. input   d. none
Answer: (c)

36. The right number of reduces seems to be:
a. 0.65   b. 0.55   c. 0.95   d. 0.68
Answer: (c)

37. Mapper and Reducer implementations can use the ______ to report progress or just indicate that they are alive.
a. Partitioner   b. OutputCollector   c. Reporter   d. All of the mentioned
Answer: (c)

38. The number of major components in Hadoop 2.0 is:
a. 2   b. 3   c. 4   d. 5
Answer: (b)

39. Which of the statements is true about Pig?
a. Pig is also a data warehouse system used for analysing the big data stored in HDFS
b. It uses a data flow language for analysing the data
c. Both (a) and (b)
d. It is a relational database management system
Answer: (c)

40. Which of the following platforms does Hadoop run on?
a. Bare metal   b. Debian   c. Cross-platform   d. Unix-like
Answer: (c)

41. The Hadoop list includes the HBase database, the Apache Mahout ______ system, and matrix operations.
a. Machine learning   b. Pattern recognition   c. Statistical classification   d. Artificial intelligence
Answer: (a)

42. Which node serves as the master? (There is only one NameNode per cluster.)
a. DataNode   b. NameNode   c. Data block   d. Replication
Answer: (b)

43. HDFS consists of:
a. master-worker   b. a master node and slave nodes   c. worker/slave   d. all of the mentioned
Answer: (b)

44. The NameNode used when the primary NameNode fails is the:
a. Rack   b. DataNode   c. Secondary NameNode   d. None of the mentioned
Answer: (c)

45. Which of the following scenarios may not be a good fit for HDFS?
a. HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b. HDFS is suitable for storing data related to applications requiring low latency data access
c. HDFS is suitable for storing data related to applications requiring low latency data access
d. None of the mentioned
Answer: (a)

46. The need for data replication occurs when:
a. The replication factor is changed   b. A DataNode goes down   c. Data blocks get corrupted   d. All of the mentioned
Answer: (d)

47. HDFS uses only one language for implementation:
a. C++   b. Java   c. Scala   d. None of the above
Answer: (d)

48. In YARN, which node is responsible for managing the resources?
a. DataNode   b. NameNode   c. ResourceManager   d. Replication
Answer: (c)

49. As the Hadoop framework is implemented in Java, MapReduce applications are required to be written in the Java language.
a. True   b. False
Answer: (b)

50. ______ maps input key/value pairs to a set of intermediate key/value pairs.
a. Mapper   b. Reducer   c. Both Mapper and Reducer   d. None of the mentioned
Answer: (a)

51. The number of maps is usually driven by the total size of the:
a. Inputs   b. Outputs   c. Tasks   d. None of the mentioned
Answer: (a)

52. Which file system is used by HBase?
a. Hive   b. Impala   c. Hadoop   d. Scala
Answer: (c)

53. The information mapping data blocks to their corresponding files is stored in the:
a. NameNode   b. DataNode   c. JobTracker   d. TaskTracker
Answer: (a)

54. In HDFS, files cannot be:
a. read   b. deleted   c. executed   d. archived
Answer: (c)

55. The DataNode and NameNode are, respectively, which of the following?
a. Slave and master nodes   b. Master and worker nodes   c. Both worker nodes   d. Both master nodes
Answer: (a)

56. Hadoop is a framework that works with a variety of related tools. Common cohorts include:
a. MapReduce, Hive and HBase
b. MapReduce, MySQL and Google Apps
c. MapReduce, Hummer and Iguana
d. MapReduce, Heron and Trumpet
Answer: (a)

57. Hadoop was named after:
a. Creator Doug Cutting's favorite circus act
b. The toy elephant of Cutting's son
c. Cutting's high school rock band
d. A sound Cutting's laptop made during Hadoop's development
Answer: (b)

58. All of the following accurately describe Hadoop, EXCEPT:
a. Open source   b. Java-based   c. Distributed computing approach   d. Real-time
Answer: (d)

59. Hive also supports custom extensions written in:
a. C   b. C#   c. C++   d. Java
Answer: (d)

60. The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to:
a. JSON   b. XML   c. SQL   d. jQuery
Answer: (c)

61. In comparison to a relational DBMS, Hadoop:
a. Has higher data integrity
b. Does ACID transactions
c. Is suitable for reading and writing many times
d. Works better on unstructured and semi-structured data
Answer: (d)

62. The files in HDFS are meant for:
a. Low latency data access
b. Multiple writers and modifications at arbitrary offsets
c. Only appending at the end of the file
d. Writing into a file only once
Answer: (d)

63. The main role of the secondary NameNode is to:
a. Copy the filesystem metadata from the primary NameNode
b. Copy the filesystem metadata from NFS stored by the primary NameNode
c. Monitor whether the primary NameNode is up and running
d. Periodically merge the namespace image with the edit log
Answer: (d)

64. The MapReduce algorithm contains three important tasks, namely:
a. Splitting, mapping, reducing
b. Scanning, mapping, reduction
c. Map, reduction, decluttering
d. Cleaning, map, reduce
Answer: (a)

65. In how many stages does a MapReduce program execute?
a. 2   b. 3   c. 4   d. 5
Answer: (d)

66. What is the function of the Mapper in MapReduce?
a. Splitting the data file   b. Job   c. Scanning the sub-blocks of files   d. Payload
Answer: (c)

67. Although the Hadoop framework is implemented in Java, MapReduce applications need to be written in ______.
a. C   b. C#   c. Java   d. None of the above
Answer: (d)

68. What is the meaning of commodity hardware in Hadoop?
a. Very cheap hardware
b. Industry-standard hardware
c. Discarded hardware
d. Low-specification industry-grade hardware
Answer: (d)

69. Which of the following are true for Hadoop?
a. It's a tool for big data analysis
b. It supports structured and unstructured data analysis
c. It aims for vertical scaling out/in scenarios
d. Both (a) and (b)
Answer: (d)

70. Which of the following are the core components of Hadoop 2.0?
a. HDFS   b. MapReduce   c. YARN   d. All of the above
Answer: (d)

71. The ______ programming language is used for real-time queries.
a. TRUE   b. FALSE
Answer: (b)

72. What is the default HDFS block size for Hadoop 2.0?
a. 32 MB   b. 128 MB   c. 128 KB   d. 64 MB
Answer: (b)

73. Which of the following phases occur simultaneously?
a. Shuffle and Sort   b. Reduce and Sort   c. Shuffle and Map   d. All of the mentioned
Answer: (a)

74. The major components of Hadoop 1.0 are:
a. HDFS and MapReduce   b. MapReduce, HDFS and YARN   c. YARN and HDFS   d. None of the above
Answer: (a)
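Several of the Unit-II questions above (the data-flow sequence, the roles of the map() and reduce() functions, and shuffle-and-sort) fit together into one picture. The sketch below is a toy, in-memory word-count simulation in plain Python — it is not the Hadoop Java API, and the function names here are chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Toy simulation of the MapReduce data flow:
# input -> map -> shuffle & sort -> reduce.

def map_phase(lines):
    # Mapper: emit one intermediate (key, value) pair per word,
    # as in the classic word-count example.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle & sort: sort the intermediate pairs and group them by key;
    # in Hadoop this happens between the map and reduce phases.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reducer: consolidate all values for each key into one result.
    return {key: sum(v for _, v in group) for key, group in grouped}

counts = reduce_phase(shuffle_sort(map_phase(["big data big", "data hadoop"])))
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 1}
```

On a real cluster each stage is distributed: one map task per InputSplit, a partitioner routing keys to reducers, and an OutputFormat writing the results — which is the ordering asked about in the data-flow question above.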
BIG DATA(KCS-061) 2020-21

Unit- III
a. Replication Factor can be configured at a
1. A ________ serves as the master and there is cluster level (Default is set to 3) and also at
only one NameNode per cluster. a file level
a. Data Node b. Block Report from each DataNode contains
b. NameNode a list of all the blocks that are stored on that
c. Data block DataNode
d. Replication c. User data is stored on the local file system
Answer: b of DataNodes
2. Point out the correct statement. d. DataNode is aware of the files to which
a. DataNode is the slave/worker node and the blocks stored on it belong to
holds the user data in the form of Data Answer: d
Blocks
b. Each incoming file is broken into 32 MB by 6. Which of the following scenario may not be a
default good fit for HDFS?
c. Data blocks are replicated across different a. HDFS is not suitable for scenarios
nodes in the cluster to ensure a low degree requiring multiple/simultaneous writes to
of fault tolerance the same file
d. None of the mentioned b. HDFS is suitable for storing data related to
Answer: a applications requiring low latency data
access
3. HDFS works in a __________ fashion. c. HDFS is suitable for storing data related to
a. master-worker applications requiring low latency data
b. master-slave access
c. worker/slave
d. None of the mentioned
d. all of the mentioned
Answer: a
Answer: a 7. The need for data replication can arise in
various scenarios like ____________
4. ________ NameNode is used when the a. Replication Factor is changed
Primary NameNode goes down. b. DataNode goes down
a. Rack c. Data Blocks get corrupted
b. Data d. All of the mentioned
c. Secondary Answer: d
d. None of the mentioned
Answer: c 8. ________ is the slave/worker node and holds
the user data in the form of Data Blocks.
5. Point out the wrong statement. a. DataNode
b. NameNode
c. Data block

19 University Academy
BIG DATA(KCS-061) 2020-21

d. Replication b. Oozie
Answer: a c. Kafka
d. All of the mentioned
9. HDFS provides a command line interface Answer: a
called __________ used to interact with HDFS.
a. “HDFS Shell” 14. During start up, the ___________ loads the file
b. “FS Shell” system state from the fsimage and the edits log
c. “DFS Shell” file.
d. None of the mentioned a. DataNode
Answer: b b. NameNode
10. HDFS is implemented in ___________ c. ActionNode
programming language. d. None of the mentioned
a. C++ Answer: b
b. Java
c. Scala 15. What is the utility of the HBase ?
d. None of the mentioned a. It is the tool for Random and Fast
Answer: b Read/Write operations in Hadoop
11. For YARN, the ___________ Manager UI b. Acts as Faster Read only query engine in
provides host and port information. Hadoop
a. Data Node c. It is MapReduce alternative in Hadoop
b. NameNode d. It is Fast MapReduce layer in Hadoop
c. Resource
Answer: a
d. Replication
Answer: c
16. What is Hive used as?
12. Point out the correct statement.
a. Hadoop query engine
a. The Hadoop framework publishes the
b. MapReduce wrapper
job flow status to an internally
c. Hadoop SQL interface
running web server on the master
d. All of the above
nodes of the Hadoop cluster
b. Each incoming file is broken into 32 MB Answer: d
by default
c. Data blocks are replicated across 17. What is the default size of the HDFS block ?
different nodes in the cluster to ensure a a. 32 MB
low degree of fault tolerance b. 64 KB
d. None of the mentioned c. 128 KB
Answer: a d. 64 MB
13. For ________ the HBase Master UI provides Answer: d
information about the HBase Master uptime.
a. HBase

20 University Academy
BIG DATA(KCS-061) 2020-21

18. In the HDFS what is the default replication factor of the Data Node?
a. 4
b. 1
c. 3
d. 2
Answer: c

19. What is the protocol name that is used to create replica in HDFS?
a. Forward protocol
b. Sliding Window Protocol
c. HDFS protocol
d. Store and Forward protocol
Answer: c

20. HDFS data blocks can be read in parallel.
a. True
b. False
Answer: a

21. Which of the following is a fact about combiners in HDFS?
a. Combiners can be used for mapper only job
b. Combiners can be used for any Map Reduce operation
c. Mappers can be used as a combiner class
d. Combiners are primarily aimed to improve Map Reduce performance
e. Combiners can't be applied for associative operations
Answer: d

22. In HDFS the Distributed Cache is used in which of the following?
a. Mapper phase only
b. Reducer phase only
c. In either phase, but not on both sides simultaneously
d. In either phase
Answer: d

23. Which of the following types of joins can be performed in a Reduce side join operation?
a. Equi Join
b. Left Outer Join
c. Right Outer Join
d. Full Outer Join
e. All of the above
Answer: e

24. A Map Reduce function can be written in:
a. Java
b. Ruby
c. Python
d. Any language which can read from input stream
Answer: d

25. In the map is there any input format?
a. Yes, but only in Hadoop 0.22+.
b. Yes, there is a special format for map files.
c. No, but sequence file input format can read map files.
d. Both 2 and 3 are correct answers
Answer: c

26. Which MapReduce phase is theoretically able to utilize features of the underlying file system in order to optimize parallel execution?
a. Split
b. Map
c. Combine
d. Reduce
Answer: a
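Question 24's answer rests on Hadoop Streaming: any executable that reads lines from standard input and writes key/value lines to standard output can serve as a mapper or reducer. The sketch below imitates that contract in plain Python, with an ordinary `sorted` call standing in for Hadoop's shuffle-and-sort step; no Hadoop is involved and the function names are illustrative:

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit one tab-separated 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: sum the counts for each word.
    Hadoop guarantees the reducer input arrives sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

# sorted() plays the role of the shuffle-and-sort phase between map and reduce.
mapped = sorted(mapper(["big data big", "data"]))
print(dict(reducer(mapped)))  # {'big': 2, 'data': 2}
```

In a real streaming job the same two functions would be separate scripts wired together with `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`.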

27. Which method of the FileSystem object is used for reading a file in HDFS?
a. open()
b. access()
c. select()
d. None of the above
Answer: a

28. Which company runs the world's largest Hadoop cluster?
a. Apple
b. Facebook
c. Datamatics
d. None of the mentioned
Answer: b

29. Facebook tackles Big Data with ________, based on Hadoop.
a. 'Project Data'
b. 'Prism'
c. 'Project Big'
d. 'Project Prism'
Answer: d

30. How many formats of SequenceFile are present in Hadoop I/O?
a. 2
b. 8
c. 9
d. 3
Answer: d

31. The slowest compression technique is ______
a. Bzip2
b. LZO
c. Gzip
d. All of the mentioned
Answer: a

32. Which of the following typically compresses files to within 10% to 15% of the best available techniques?
a. Bzip2
b. LZO
c. Gzip
d. Both B and C
Answer: c

33. Which of the following provides search technology and Java-based indexing?
a. Solr
b. Lucy
c. Lucene Core
d. None of these
Answer: c

34. Avro schemas are defined with _____
a. JAVA
b. XML
c. All of the mentioned
d. JSON
Answer: d

35. Thrift resolves possible conflicts using the _________ of the field.
a. Name
b. UID
c. Static number
d. All of the mentioned
Answer: c

36. Avro is said to be the future ______ layer of Hadoop.
a. RMC
b. RPC
c. RDC
d. All of the mentioned
Answer: b

37. Which of the following has high storage density?
a. RAM_DISK
b. ARCHIVE
c. ROM_DISK

d. All of the mentioned
Answer: b

38. HDFS provides a command line interface called __________ used to interact with HDFS.
a. "HDFS Shell"
b. "FS Shell"
c. "DFS Shell"
d. None of the mentioned
Answer: b

39. Which format from the given formats is more compression-aggressive?
a. Partition Compressed
b. Record Compressed
c. Block-Compressed
d. Uncompressed
Answer: c

40. Avro schemas describe the format of the message and are defined using ____
a. JSON
b. XML
c. JS
d. All of the mentioned
Answer: a

41. Which editor is used for editing files in HDFS?
a. Vi Editor
b. Python editor
c. DOS editor
d. DEV C++ Editor
Answer: a

42. Command to view the directories and files in a specific directory:
a. ls
b. fs -ls
c. hadoop fs -ls
d. hadoop fs
Answer: c

43. Which among the following is correct?
S1: MapReduce is a programming model for data processing
S2: Hadoop can run MapReduce programs written in various languages
S3: MapReduce programs are inherently parallel
a. S1 and S2
b. S2 and S3
c. S1 and S3
d. S1, S2 and S3
Answer: d

44. The Mapper class is a
a. generic type
b. abstract type
c. static type
d. final type
Answer: a

45. Which package provides the basic types of Hadoop?
a. org.apache.hadoop.io
b. org.apache.hadoop.util
c. org.apache.hadoop.type
d. org.apache.hadoop.lang
Answer: a

46. Which among the following does the Job control in Hadoop?
a. Mapper class
b. Reducer class
c. Task class
d. Job class
Answer: d

47. Hadoop runs the jobs by dividing them into
a. maps
b. tasks
c. individual files
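Questions 43 and 47 describe MapReduce as a programming model whose jobs are divided into parallel tasks. The toy pipeline below walks through the split, map, shuffle, and reduce phases in pure Python; it is a sketch of the model only, not the Hadoop framework, and all names in it are made up:

```python
def split_input(records, split_size):
    """Divide the job input into fixed-size pieces ('splits'); one map task per split."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]

def run_job(records, split_size, map_fn, reduce_fn):
    # Map phase: each split is processed independently (inherently parallel).
    intermediate = []
    for split in split_input(records, split_size):
        for rec in split:
            intermediate.extend(map_fn(rec))
    # Shuffle: group the intermediate (key, value) pairs by key.
    groups = {}
    for k, v in intermediate:
        groups.setdefault(k, []).append(v)
    # Reduce phase: one reduce call per distinct key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

result = run_job([3, 1, 4, 1, 5, 9], 2,
                 map_fn=lambda n: [("even" if n % 2 == 0 else "odd", n)],
                 reduce_fn=lambda k, vs: sum(vs))
print(result)  # {'odd': 19, 'even': 4}
```

Because each split is handled in isolation, the map calls could run on different machines without changing the result, which is the sense in which MapReduce programs are "inherently parallel".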

d. None of these
Answer: b

48. Which are the two nodes that control the job execution process of Hadoop?
a. Job Tracker and Task Tracker
b. Map Tracker and Reduce Tracker
c. Map Tracker and Job Tracker
d. Map Tracker and Task Tracker
Answer: a

49. Which among the following schedules tasks to be run?
a. Job Tracker
b. Task Tracker
c. Job Scheduler
d. Task Controller
Answer: a

50. What are fixed size pieces of a MapReduce job called?
a. records
b. splits
c. tasks
d. maps
Answer: b

51. Where is the output of map tasks written?
a. local disk
b. HDFS
c. File System
d. secondary storage
Answer: a

52. Which among the following is responsible for processing one or more chunks of data and producing the output results?
a. Maptask
b. jobtask
c. Mapper class
d. Reducetask
Answer: a

53. Which acts as an interface between Hadoop and the program written?
a. Hadoop Cluster
b. Hadoop Streams
c. Hadoop Sequencing
d. Hadoop Streaming
Answer: d

54. What are Hadoop Pipes?
a. Java interface to Hadoop MapReduce
b. C++ interface to Hadoop MapReduce
c. Ruby interface to Hadoop MapReduce
d. Python interface to Hadoop MapReduce
Answer: b

55. What does the Hadoop Common Package contain?
a. war files
b. msi files
c. jar files
d. exe files
Answer: c

56. Which among the following is the master node?
a. Name Node
b. Data Node
c. Job Node
d. Task Node
Answer: a

57. Which among the following is the slave node?
a. Name Node
b. Data Node
c. Job Node
d. Task Node
Answer: b

58. Which acts as a checkpoint node in HDFS?
a. Name Node
b. Data Node
c. Secondary Name Node

d. Secondary Data Node
Answer: c

59. Which among the following holds the location of data?
a. Name Node
b. Data Node
c. Job Tracker
d. Task Tracker
Answer: a

60. What is the process of applying the code received by the JobTracker on the file called?
a. Naming
b. Tracker
c. Mapper
d. Reducer
Answer: c

61. In which mode should Hadoop run in order to run a pipes job?
a. distributed mode
b. centralized mode
c. pseudo distributed mode
d. parallel mode
Answer: c

62. Which of the following are correct?
S1: Namespace volumes are independent of each other
S2: Namespace volumes are managed by the namenode
a. S1 only
b. S2 only
c. Both S1 and S2
d. Neither S1 nor S2
Answer: c

63. Which among the following architectural changes are needed to attain high availability in HDFS?
a. Clients must be configured to handle namenode failover
b. Datanodes must send block reports to both namenodes since the block mappings are stored in a namenode's memory, and not on disk
c. Namenodes must use highly-available shared storage to share the edit log
d. All of the above
Answer: d

64. Which controller in HDFS manages the transition from the active namenode to the standby?
a. failover controller
b. recovery controller
c. failsafe controller
d. fencing controller
Answer: a

65. Which among the following is not a fencing mechanism employed by the system in HDFS?
a. killing the namenode's process
b. disabling the namenode's network port via a remote management command
c. revoking the namenode's access to the shared storage directory
d. None of the above
Answer: d

66. What is the value of the property dfs.replication set in case of pseudo distributed mode?
a. 0
b. 1
c. null
d. yes
Answer: b

67. What is the minimum amount of data that a disk can read or write in HDFS?
a. block size
b. byte size
c. heap

d. None
Answer: a

68. Which HDFS command checks the file system and lists the blocks?
a. hfsck
b. fcsk
c. fblock
d. fsck
Answer: d

69. What is an administered group used to manage cache permissions and resource usage?
a. Cache pools
b. block pool
c. Namenodes
d. HDFS Cluster
Answer: a

70. Which object encapsulates a client or server's configuration?
a. File object
b. Configuration object
c. Path object
d. Stream object
Answer: b

71. Which interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file?
a. Seekable
b. PositionedReadable
c. Progressable
d. DataStream
Answer: a

72. Which method is used to list the contents of a directory?
a. listFiles
b. listContents
c. listStatus
d. listPaths
Answer: c

73. What is the operation that uses wildcard characters to match multiple files with a single expression called?
a. globbing
b. pattern matching
c. regex
d. regexfilter
Answer: a

74. What does the globStatus() method return?
a. an array of FileStatus objects
b. an array of ListStatus objects
c. an array of PathStatus objects
d. an array of FilterStatus objects
Answer: a

75. What does the glob question mark (?) match?
a. zero or more characters
b. one or more characters
c. a single character
d. metacharacter
Answer: c

76. Which method on FileSystem is used to permanently remove files or directories?
a. remove()
b. rm()
c. del()
d. delete()
Answer: d

77. Which streams the packets to the first datanode in the pipeline?
a. DataStreamer
b. FileStreamer
c. InputStreamer
d. PathStreamer
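Questions 73 to 75 concern globbing, which Hadoop's FileSystem.globStatus() performs on paths. Python's standard fnmatch module applies the same wildcard rules, so it can demonstrate that `?` matches exactly one character while `*` matches any number; the file names below are local strings, not HDFS paths:

```python
from fnmatch import fnmatch

files = ["part-0", "part-1", "part-10", "data.txt"]

# '?' matches exactly one character, '*' matches zero or more characters.
print([f for f in files if fnmatch(f, "part-?")])   # ['part-0', 'part-1']
print([f for f in files if fnmatch(f, "part-*")])   # ['part-0', 'part-1', 'part-10']
```

Note that "part-10" fails the `part-?` pattern because `?` will not absorb two characters, which is precisely question 75's point.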

Answer: a
78. Which queue is responsible for asking the
namenode to allocate new blocks by picking a
list of suitable datanodes to store the replicas?

a. ack queue
b. data queue
c. path queue
d. stream queue
Answer: b
79. Which command is used to copy
files/directories?
a. distcp
b. hcp
c. copy
d. cp
Answer: a

80. Which flag is used with distcp to delete any files or directories from the destination?
a. -remove
b. -rm
c. -del
d. -delete
Answer: d

Unit-IV

1. Which among the following is Hadoop's cluster resource management system?
a. GLOB
b. YARN
c. ARM
d. SPARK
Answer: b

2. Which of the following processing frameworks interacts with YARN directly?
a. Pig
b. Hive
c. Crunch
d. None of these
Answer: d

3. Which of the following processing frameworks run on MapReduce?
a. Pig
b. Hive
c. Crunch
d. All of the above
Answer: d

4. Which among the following are the core services of YARN?
a. resource manager and node manager
b. namenode and datanode
c. data manager and resource manager
d. data manager and application manager
Answer: a

5. Which constraints can be used to request a container on a specific node or rack, or anywhere on the cluster in YARN?
a. Container constraints
b. Space constraints
c. Locality constraints
d. Resource constraints
Answer: c

6. Which among the following can be used to model YARN applications?
a. one application per user job
b. run one application per workflow
c. long-running application that is shared by different users
d. All of the above
Answer: d

7. Which follows the one application per user job model?
a. MapReduce
b. Spark
c. Apache Slider
d. Samza
Answer: a

8. Which application runs per user session?
a. MapReduce
b. Spark
c. Apache Slider
d. None of the above
Answer: b

9. Which among the following has a long-running application master for launching other applications on the cluster?
a. MapReduce
b. Spark
c. Apache Slider
d. None of the above
Answer: c

10. Which among the following can be used for stream processing?
a. Spark
b. Samza
c. Storm
d. All of the above
Answer: d

11. Which provides a simple programming model for developing distributed applications on YARN?
a. Apache Slider
b. Apache Twill
c. Spark
d. Tez
Answer: b

12. Which among the following statements are true with respect to Apache Twill?
S1: Twill supports real-time logging
S2: Allows the usage of a Java Runnable interface
a. S1 only
b. S2 only
c. Both S1 and S2
d. Neither S1 nor S2
Answer: c

13. Which daemons control the job execution process in MapReduce 1?
a. jobtracker
b. tasktrackers
c. Both jobtracker and tasktrackers
d. Name node and data node
Answer: c

14. Which among the following coordinates all the jobs run on the system by scheduling tasks in MapReduce 1?
a. jobtracker
b. tasktrackers
c. data node
d. Name node
Answer: a

15. Which of the following keeps a record of the overall progress of each job in MapReduce 1?
a. jobtracker
b. tasktrackers
c. data node
d. Name node
Answer: a

16. Which among the following run tasks and send progress reports in MapReduce 1?
a. jobtracker
b. tasktrackers
c. data node
d. Name node
Answer: b

17. Choose the tasks of the jobtracker in MapReduce 1.
a. job scheduling
b. task progress monitoring
c. task bookkeeping
d. All of the above
Answer: d

18. Which is responsible for storing job history in MapReduce 1?
a. jobtracker
b. tasktrackers
c. data node
d. Name node
Answer: a

19. In YARN, the responsibility of the jobtracker is handled by
a. Resource manager
b. application master
c. timeline server
d. All of the above
Answer: d

20. In YARN, the responsibility of the tasktracker is handled by
a. Resource manager
b. application master
c. timeline server
d. Node manager
Answer: d

21. Which stores the application history in YARN?
a. Resource manager
b. application master
c. timeline server
d. Node manager
Answer: c

22. Which among the following are the features of YARN?
a. Scalability
b. Multitenancy
c. Availability
d. All of the above
Answer: d

23. Which among the following schedulers is available in YARN?
a. FIFO
b. Shortest Job First
c. Round Robin
d. Shortest Remaining Time
Answer: a

24. Which are/is the schedulers available in YARN?
a. FIFO
b. Capacity
c. Fair Schedulers
d. All of the above
Answer: d

25. Which among the following schedulers attempts to allocate resources so that all running applications get the same share of resources in YARN?
a. FIFO
b. Capacity
c. Fair Schedulers
d. Round Robin
Answer: c

26. Which among the following schedulers provides queue elasticity in YARN?
a. FIFO
b. Capacity
c. Fair Schedulers
d. Round Robin
Answer: b

27. Which among the following schedulers in YARN is used by default?
a. FIFO
b. Capacity
c. Fair Schedulers
d. Round Robin
Answer: b

28. In which XML file is the default configuration of schedulers to be changed?
a. yarn-site.xml
b. config.xml

c. scheduler.xml
d. yarn-scheduler.xml
Answer: a

29. Which among the following queue scheduling policies are/is supported by the Fair Scheduler in YARN?
a. FIFO
b. Dominant Resource Fairness
c. preemption
d. All of the above
Answer: d

30. Which holds the list of rules for queue placement in Fair Scheduling?
a. queuePlacementPolicy
b. rulePlacementPolicy
c. scheduleQueuePolicy
d. schedulingPolicy
Answer: a

31. Which of the settings is used to set preemption globally?
a. yarn.scheduler.fair.preemption = true
b. yarn.scheduler.preemption = true
c. yarn.scheduler.global.preemption = true
d. yarn.scheduler.enable.preemption = true
Answer: a

32. Which among the following supports delay scheduling?
a. FIFO
b. Capacity Scheduler
c. Fair Scheduler
d. Both Capacity and Fair Scheduler
Answer: d

33. What is the default period of the heartbeat request sent by the node manager?
a. one per millisecond
b. one per second
c. one per minute
d. one per nanosecond
Answer: b

34. Which error detection code is used in HDFS?
a. CRC-32
b. CRC-32C
c. SHA
d. SHA-1
Answer: b

35. CRC-32C has a storage overhead of
a. less than 1%
b. less than 5%
c. less than 10%
d. less than 2.5%
Answer: a

36. The heartbeat signals are sent from
a. Jobtracker to Tasktracker
b. Tasktracker to Jobtracker
c. Jobtracker to namenode
d. Tasktracker to namenode
Answer: b

37. Spark was initially started by ________ at UC Berkeley AMPLab in 2009.
a. Mahek Zaharia
b. Matei Zaharia
c. Doug Cutting
d. Stonebraker
Answer: (b)

38. ________ is a component on top of Spark Core.
a. Spark Streaming
b. Spark SQL

c. RDDs
d. All of the mentioned
Answer: (b)

39. Spark SQL provides a domain-specific language to manipulate ___________ in Scala, Java, or Python.
a. Spark Streaming
b. Spark SQL
c. RDDs
d. All of the mentioned
Answer: (c)

40. ______________ leverages Spark Core's fast scheduling capability to perform streaming analytics.
a. MLlib
b. Spark Streaming
c. GraphX
d. RDDs
Answer: (b)

41. ________ is a distributed machine learning framework on top of Spark.
a. MLlib
b. Spark Streaming
c. GraphX
d. RDDs
Answer: (a)

42. Users can easily run Spark on top of Amazon's __________
a. Infosphere
b. EC2
c. EMR
d. None of the mentioned
Answer: (b)

43. Which of the following can be used to launch Spark jobs inside MapReduce?
a. SIM
b. SIMR
c. SIR
d. RIS
Answer: (b)

44. Which of the following languages is not supported by Spark?
a. Java
b. Pascal
c. Scala
d. Python
Answer: (b)

45. Spark is packaged with higher level libraries, including support for _________ queries.
a. SQL
b. C
c. C++
d. None of the mentioned
Answer: (a)

46. Spark includes a collection of over ________ operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
a. 50
b. 60
c. 70
d. 80
Answer: (d)

47. Spark is engineered from the bottom-up for performance, running ___________ faster than Hadoop by exploiting in-memory computing and other optimizations.
a. 100x
b. 150x
c. 200x

d. None of the mentioned
Answer: (a)

48. Spark powers a stack of high-level tools including Spark SQL, and MLlib for _________
a. regression models
b. statistics
c. machine learning
d. reproductive research
Answer: (c)

49. For a multiclass classification problem, which algorithm is not a solution?
a. Naive Bayes
b. Random Forests
c. Logistic Regression
d. Decision Trees
Answer: (d)

50. Which of the following is a tool of the Machine Learning Library?
a. Persistence
b. Utilities like linear algebra, statistics
c. Pipelines
d. All of the above
Answer: (d)

51. Which of the following is true for Spark Core?
a. It is the kernel of Spark
b. It enables users to run SQL / HQL queries on the top of Spark
c. It is the scalable machine learning library which delivers efficiencies
d. It improves the performance of iterative algorithms drastically
Answer: (a)

52. Which of the following is true for Spark MLlib?
a. Provides an execution platform for all the Spark applications
b. It is the scalable machine learning library which delivers efficiencies
c. Enables powerful interactive and data analytics applications across live streaming data
d. All of the above
Answer: (b)

53. Which of the following is true for RDD?
a. We can operate Spark RDDs in parallel with a low-level API
b. RDDs are similar to the table in a relational database
c. It allows processing of a large amount of structured data
d. It has a built-in optimization engine
Answer: (a)

54. RDD is fault-tolerant and immutable.
a. True
b. False
Answer: (a)

55. The read operation on RDD is
a. Fine-grained
b. Coarse-grained
c. Either fine-grained or coarse-grained
d. Neither fine-grained nor coarse-grained
Answer: (c)

56. The write operation on RDD is
a. Fine-grained
b. Coarse-grained
c. Either fine-grained or coarse-grained
d. Neither fine-grained nor coarse-grained
Answer: (b)

57. Is it possible to mitigate stragglers in RDD?
a. Yes
b. No
Answer: (a)

58. Fault tolerance in RDD is achieved using
a. Immutable nature of RDD
b. DAG (Directed Acyclic Graph)
c. Lazy-evaluation
d. None of the above
Answer: (b)

59. What is an action in Spark RDD?
a. The way to send results from executors to the driver
b. Takes RDD as input and produces one or more RDD as output
c. Creates one or many new RDDs
d. All of the above
Answer: (a)

60. The shortcomings of Hadoop MapReduce were overcome by Spark RDD through
a. Lazy-evaluation
b. DAG
c. In-memory processing
d. All of the above
Answer: (d)

61. Spark is developed in which language?
a. Java
b. Scala
c. Python
d. R
Answer: (b)

62. Which of the following is not a component of the Spark Ecosystem?
a. Sqoop
b. GraphX
c. MLlib
d. BlinkDB
Answer: (a)

63. Which of the following algorithms is not present in MLlib?
a. Streaming Linear Regression
b. Streaming KMeans
c. Tanimoto distance
d. None of the above
Answer: (c)

64. Which of the following is not a feature of Spark?
a. Supports in-memory computation
b. Fault-tolerance
c. It is cost-efficient
d. Compatible with other file storage systems
Answer: (c)

65. Which of the following is the reason for Spark being speedier than MapReduce?
a. DAG execution engine and in-memory computation
b. Support for different language APIs like Scala, Java, Python and R
c. RDDs are immutable and fault-tolerant
d. None of the above
Answer: (a)

66. Which of the following is true for RDD?
a. RDD is a programming paradigm
b. RDD in Apache Spark is an immutable collection of objects
c. It is a database
d. None of the above
Answer: (b)

67. Which of the following is a tool of the Machine Learning Library?
a. Persistence
b. Utilities like linear algebra, statistics
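Questions 58 to 60 turn on three RDD properties: immutability, lazy evaluation, and the lineage (DAG) that lets lost partitions be recomputed. The toy class below is not Spark; it only mimics how transformations lazily build up a lineage that an action later replays:

```python
class TinyRDD:
    """A toy, immutable 'RDD': transformations are recorded lazily,
    and nothing runs until an action (collect) is called."""

    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # recorded transformations (a linear DAG)

    def map(self, fn):
        # Transformation: returns a *new* RDD; the parent is left unchanged.
        return TinyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):
        return TinyRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        out = list(self._data)
        for op, fn in self._lineage:
            out = list(map(fn, out)) if op == "map" else [x for x in out if fn(x)]
        return out

rdd = TinyRDD([1, 2, 3, 4])
doubled_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
print(doubled_evens.collect())  # [4, 8]
print(rdd.collect())            # [1, 2, 3, 4]  (the parent RDD is unchanged)
```

The same lineage that delays work also gives the fault-tolerance story of question 58: if a computed partition were lost, it could be rebuilt by replaying the recorded operations from the source data.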

c. Pipelines
d. All of the above
Answer: (d)

68. __________ is an online NoSQL database developed by Cloudera.
a. HCatalog
b. Hbase
c. Impala
d. Oozie
Answer: (b)

69. Which of the following is not a NoSQL database?
a. SQL Server
b. MongoDB
c. Cassandra
d. None of the mentioned
Answer: (a)

70. Which of the following is a NoSQL database type?
a. SQL
b. Document databases
c. JSON
d. All of the mentioned
Answer: (b)

71. Which of the following is a wide-column store?
a. Cassandra
b. Riak
c. MongoDB
d. Redis
Answer: (a)

72. "Sharding" a database across many server instances can be achieved with _______
a. LAN
b. SAN
c. MAN
d. All of the mentioned
Answer: (b)

73. Most NoSQL databases support automatic __________, meaning that you get high availability and disaster recovery.
a. processing
b. scalability
c. replication
d. all of the mentioned
Answer: (c)

74. Which of the following are the simplest NoSQL databases?
a. Key-value
b. Wide-column
c. Document
d. All of the mentioned
Answer: (a)

75. ________ stores are used to store information about networks, such as social connections.
a. Key-value
b. Wide-column
c. Document
d. Graph
Answer: (d)

76. NoSQL databases are used mainly for handling large volumes of _____ data.
a. unstructured
b. structured
c. semi-structured
d. all of the mentioned
Answer: (a)

77. Which of the following languages is MongoDB written in?
a. Javascript
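Question 74 calls key-value stores the simplest NoSQL databases, and question 72 notes that sharding spreads data across many server instances. The toy sketch below combines the two ideas; plain dicts stand in for server instances, and nothing here is a real database API:

```python
class ShardedKVStore:
    """Toy key-value store sharded across several 'server instances'.
    Each shard here is just a dict standing in for one server."""

    def __init__(self, num_shards=3):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A hash of the key decides which server instance owns it.
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedKVStore()
store.put("user:1", {"name": "Asha"})
print(store.get("user:1"))  # {'name': 'Asha'}
```

Because each key deterministically maps to one shard, reads and writes touch a single "server", which is what lets key-value stores scale horizontally.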

b. C
c. C++
d. All of the mentioned
Answer: (d)

78. Point out the correct statement.
a. MongoDB is classified as a NoSQL database
b. MongoDB favors XML format more than JSON
c. MongoDB is a column-oriented database store
d. All of the mentioned
Answer: (a)

79. Which of the following formats is supported by MongoDB?
a. SQL
b. XML
c. BSON
d. All of the mentioned
Answer: (c)

80. NoSQL was designed with security in mind, so developers or security teams don't need to worry about implementing a security layer. Is it true or false?
a. True
b. False
Answer: (b)

81. Which of the following is not a reason NoSQL has become a popular solution for some organizations?
a. Better scalability
b. Improved ability to keep data consistent
c. Faster access to data than relational database management systems (RDBMS)
d. More easily allows for data to be held across multiple servers
Answer: (b)

82. NoSQL prohibits structured query language (SQL). Is it True or False?
a. True
b. False
Answer: (b)

83. When is it best to use a NoSQL database?
a. When providing confidentiality, integrity, and availability is crucial
b. When the data is predictable
c. When the retrieval of large quantities of data is needed
d. When the retrieval speed of data is not critical
Answer: (c)

84. Which of the following companies developed the NoSQL database Apache Cassandra?
a. LinkedIn
b. Twitter
c. MySpace
d. Facebook
Answer: (d)

85. NoSQL databases are most often referred to as:
a. Relational
b. Distributed
c. Object-oriented
d. Network
Answer: (b)

86. SQL databases are:
a. Horizontally scalable
b. Vertically scalable
c. Either horizontally or vertically scalable

d. They don't scale
Answer: (b)

87. Which of the following is not an example of a NoSQL database?
a. CouchDB
b. MongoDB
c. HBase
d. PostgreSQL
Answer: (d)

88. SQL command types include data manipulation language (DML) and data definition language (DDL).
a. True
b. False
Answer: (a)

89. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes.
a. NoSQL
b. NewSQL
c. SQL
d. All of the mentioned
Answer: (a)

90. Point out the correct statement.
a. Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload
b. HDFS runs on a small cluster of commodity-class nodes
c. NEWSQL is frequently the collection point for big data
d. None of the mentioned
Answer: (a)

91. Which is an advantage of NewSQL?
a. Less complex applications, greater consistency
b. Convenient standard tooling
c. SQL influenced extensions
d. All of the mentioned
Answer: (d)

92. Which of the following represents a column in NoSQL?
a. Database
b. Field
c. Document
d. Collection
Answer: (b)

93. What is the aim of NoSQL?
a. NoSQL provides an alternative to SQL databases to store textual data.
b. NoSQL databases allow storing non-structured data.
c. NoSQL is not suitable for storing structured data.
d. NoSQL is a new data format to store large datasets.
Answer: (d)

94. Which of the following is not a feature of NoSQL databases?
a. Data can be easily held across multiple servers
b. Relational data
c. Scalability
d. Faster data access than SQL databases
Answer: (b)

95. Which of the following statements is correct with respect to MongoDB?
a. MongoDB is a NoSQL database

b. MongoDB uses XML over JSON for data exchange
c. MongoDB is not scalable
d. All of the above
Answer: (a)

96. Which of the following represents a column in MongoDB?
a. document
b. database
c. collection
d. field
Answer: (d)

97. The system generated _id field is?
a. A 12 byte hexadecimal value
b. A 16 byte octal value
c. A 12 byte decimal value
d. A 10 byte binary value
Answer: (a)

98. Which of the following is true about MongoDB?
a. MongoDB is cross-platform
b. MongoDB is a document oriented database
c. MongoDB provides high performance
d. All of the above
Answer: (d)

99. A collection is a group of MongoDB __?
a. Database
b. Document
c. Field
d. None of the above
Answer: (b)

100. A developer wants to develop a database for an LFC system where the data is mostly stored in a similar manner. Which database should be used?
a. Relational
b. NoSQL
c. Both A and B can be used
d. None of the above
Answer: (b)

101. Documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data. This is known as?
a. dynamic schema
b. mongod
c. mongo
d. Embedded Documents
Answer: (a)

102. Instead of a primary key, MongoDB uses?
a. Embedded Documents
b. Default key _id
c. mongod
d. mongo
Answer: (b)

Unit-V

1. A ________ serves as the master and there is only one NameNode per cluster.
a. Data Node
b. NameNode
c. Data block
d. Replication
Answer: (b)

2. Point out the correct statement.
a. DataNode is the slave/worker node and holds the user data in the form of Data Blocks
b. Each incoming file is broken into 32 MB by default
c. Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance
d. None of the mentioned
Answer: (a)

3. HDFS works in a __________ fashion.
a. master-worker
b. master-slave
c. worker/slave
d. all of the mentioned
Answer: (a)

4. ________ NameNode is used when the Primary NameNode goes down.
a. Rack
b. Data
c. Secondary
d. None of the mentioned
Answer: (c)

5. Which of the following scenarios may not be a good fit for HDFS?
a. HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b. HDFS is suitable for storing data related to applications requiring low latency data access
c. HDFS is suitable for storing data related to applications requiring low latency data access
d. None of the mentioned
Answer: (a)

6. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a. DataNode
b. NameNode
c. Data block
d. Replication
Answer: (a)

7. HDFS provides a command line interface called __________ used to interact with HDFS.
a. "HDFS Shell"
b. "FS Shell"
c. "DFS Shell"
d. None of the mentioned
Answer: (b)

8. For YARN, the ___________ Manager UI provides host and port information.
a. Data Node
b. NameNode
c. Resource
d. Replication
Answer: (c)

9. During start up, the ___________ loads the file system state from the fsimage and the edits log file.
a. DataNode
b. NameNode
c. ActionNode

d. None of the mentioned
Answer: (b)

10. In HDFS the files cannot be
a. read
b. deleted
c. executed
d. Archived
Answer: (c)

11. Which of the following commands sets the value of a particular configuration variable (key)?
a. set -v
b. set <key>=<value>
c. set
d. reset
Answer: (b)

12. Which of the following operators executes a shell command from the Hive shell?
a. |
b. !
c. ^
d. +
Answer: (b)

13. Hive specific commands can be run from Beeline, when the Hive _______ driver is used.
a. ODBC
b. JDBC
c. ODBC-JDBC
d. All of the mentioned
Answer: (b)

14. Which of the following data types is supported by Hive?
a. map
b. record
c. string
d. enum
Answer: (d)

15. Avro-backed tables can simply be created by using _________ in a DDL statement.
a. "STORED AS AVRO"
b. "STORED AS HIVE"
c. "STORED AS AVROHIVE"
d. "STORED AS SERDE"
Answer: (a)

16. Types that may be null must be defined as a ______ of that type and Null within Avro.
a. Union
b. Intersection
c. Set
d. All of the mentioned
Answer: (a)

17. _______ is interpolated into the quotes to correctly handle spaces within the schema.
a. $SCHEMA
b. $ROW
c. $SCHEMASPACES
d. $NAMESPACES
Answer: (a)

18. ________ was designed to overcome the limitations of the other Hive file formats.
a. ORC
b. OPC
c. ODC
d. None of the mentioned
Answer: (a)

19. An ORC file contains groups of row data called __________
a. postscript
b. stripes
c. script
d. none of the mentioned
Answer: (b)

20. HBase is a distributed ________ database built on top of the Hadoop file system.
a. Column-oriented
b. Row-oriented
c. Tuple-oriented
d. None of the mentioned

Answer: (a)

21. HBase is ________; it defines only column families.
a. Row Oriented
b. Schema-less
c. Fixed Schema
d. All of the mentioned
Answer: (b)

22. The _________ Server assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
a. Region
b. Master
c. Zookeeper
d. All of the mentioned
Answer: (b)

23. Which of the following commands provides information about the user?
a. status
b. version
c. whoami
d. user
Answer: (c)

24. The _________ command fetches the contents of a row or a cell.
a. select
b. get
c. put
d. none of the mentioned
Answer: (b)

25. HBaseAdmin and ____________ are the two important classes in this package that provide DDL functionalities.
a. HTableDescriptor
b. HDescriptor
c. HTable
d. HTabDescriptor
Answer: (a)

26. The minimum number of row versions to keep is configured per column family via _______
a. HBaseDecriptor
b. HTabDescriptor
c. HColumnDescriptor
d. All of the mentioned
Answer: (c)

27. HBase supports a ____________ interface via Put and Result.
a. "bytes-in/bytes-out"
b. "bytes-in"
c. "bytes-out"
d. none of the mentioned
Answer: (a)

28. One supported data type that deserves special mention is ____________
a. money
b. counters
c. smallint
d. tinyint
Answer: (b)

29. __________ does re-write data and pack rows into columns for certain time-periods.
a. OpenTS
b. OpenTSDB
c. OpenTSD
d. OpenDB
Answer: (b)

30. The __________ command disables, drops and recreates a table.
a. drop
b. truncate
c. delete
d. none of the mentioned
Answer: (b)
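Questions 16, 34 and 40 earlier all come back to the same fact: Avro schemas are written in JSON, and a nullable field is declared as a union of "null" and the value type. A minimal record schema, with illustrative field names, built with nothing but Python's json module:

```python
import json

# An Avro record schema expressed in JSON: the optional "nickname" field
# is declared as a union of "null" and "string", so it may be null.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "nickname", "type": ["null", "string"], "default": None},
    ],
}

print(json.dumps(schema, indent=2))
```

A real Avro library would parse exactly this JSON text; writing a record with `nickname` absent is legal only because of the union declaration.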

34. When a _______ is triggered, the client receives a packet saying that the znode has changed.
a. event
b. watch
c. row
d. value
Answer: (b)

35. The underlying client-server protocol changed in version _______ of ZooKeeper.
a. 2.0.0
b. 3.0.0
c. 4.0.0
d. 6.0.0
Answer: (b)

36. A number of constants used in the client ZooKeeper API were renamed in order to reduce ________ collision.
a. value
b. namespace
c. counter
d. none of the mentioned
Answer: (b)

37. ZooKeeper allows distributed processes to coordinate with each other through registers, known as ___________
a. znodes
b. hnodes
c. vnodes
d. rnodes
Answer: (a)

38. ZooKeeper essentially mirrors the _______ functionality exposed in the Linux kernel.
a. iread
b. inotify
c. iwrite
d. icount
Answer: (b)

39. ZooKeeper’s architecture supports high ____________ through redundant services.
a. flexibility
b. scalability
c. availability
d. interactivity
Answer: (c)

40. You need to have _________ installed before running ZooKeeper.
a. Java
b. C
c. C++
d. SQLGUI
Answer: (a)

41. To register a “watch” on znode data, you need to use the _______ command to access the current content or metadata.
a. stat
b. put
c. receive
d. gets
Answer: (a)

42. _______ has a design policy of using ZooKeeper only for transient data.
a. Hive
b. Impala
c. Hbase
d. Oozie
Answer: (c)

43. The ________ master will register its own address in this znode at startup, making this znode the source of truth for identifying which server is the Master.
a. active
b. passive
c. region
d. all of the mentioned
Answer: (a)
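The znode/watch mechanics behind Q34 and Q41 can be sketched with the ZooKeeper CLI (zkCli.sh). The znode path /demo is an illustrative assumption; the trailing `true` argument to stat registers a watch in classic ZooKeeper 3.x shells (newer releases use a `-w` flag instead):

```
[zk: localhost:2181(CONNECTED) 0] create /demo "v1"
[zk: localhost:2181(CONNECTED) 1] stat /demo true   # reads metadata and registers a watch (Q41)
[zk: localhost:2181(CONNECTED) 2] set /demo "v2"    # fires a NodeDataChanged event to the watcher (Q34)
```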


44. Pig operates in mainly how many modes?
a. Two
b. Three
c. Four
d. Five
Answer: (a)

45. You can run Pig in batch mode using __________
a. Pig shell command
b. Pig scripts
c. Pig options
d. All of the mentioned
Answer: (b)

46. Which of the following functions is used to read data in Pig?
a. WRITE
b. READ
c. LOAD
d. None of the mentioned
Answer: (c)

47. You can run Pig in interactive mode using the ______ shell.
a. Grunt
b. FS
c. HDFS
d. None of the mentioned
Answer: (a)

48. Which of the following is the default mode?
a. Mapreduce
b. Tez
c. Local
d. All of the mentioned
Answer: (a)

49. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets.
a. Pig Latin
b. Oozie
c. Pig
d. Hive
Answer: (c)

50. Hive also supports custom extensions written in:
a. C
b. C++
c. C#
d. Java
Answer: (d)

51. Which of the following is not true about Pig?
a. Apache Pig is an abstraction over MapReduce.
b. Pig cannot perform all the data manipulation operations in Hadoop.
c. Pig is a tool/platform which is used to analyze larger sets of data, representing them as data flows.
d. None of the above
Answer: (b)

52. Which of the following is/are a feature of Pig?
a. Rich set of operators
b. Ease of programming
c. Extensibility
d. All of the above
Answer: (d)

53. In which year was Apache Pig released?
a. 2005
b. 2006
c. 2007
d. 2008
Answer: (b)
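The run modes and the LOAD operator quizzed above can be sketched as follows; the script name, input file, and field names are illustrative assumptions, not taken from the question bank:

```
$ pig -x local script.pig   # batch mode: run a Pig script with local execution
$ pig                       # interactive mode: opens the Grunt shell (MapReduce mode by default)

grunt> A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);
grunt> DUMP A;              -- DUMP is one of Pig's diagnostic operators
```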


54. Pig operates in mainly how many modes?
a. 2
b. 3
c. 4
d. 5
Answer: (a)

55. Which of the following companies developed Pig?
a. Google
b. Yahoo
c. Microsoft
d. Apple
Answer: (b)

56. Which of the following functions is used to read data in Pig?
a. Write
b. Read
c. Perform
d. Load
Answer: (d)

57. __________ is a framework for collecting and storing script-level statistics for Pig Latin.
a. Pig Stats
b. PStatistics
c. Pig Statistics
d. All of the above
Answer: (c)

58. Which of the following is a true statement?
a. Pig is a high-level language.
b. Performing a Join operation in Apache Pig is pretty simple.
c. Apache Pig is a data flow language.
d. All of the above
Answer: (d)

59. Which of the following will compile the PigUnit?
a. $pig_trunk ant pigunit-jar
b. $pig_tr ant pigunit-jar
c. $pig_ ant pigunit-jar
d. $pigtr_ ant pigunit-jar
Answer: (a)

60. Point out the wrong statement.
a. Pig can invoke code in languages like Java only.
b. Pig enables data workers to write complex data transformations without knowing Java.
c. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.
d. Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.
Answer: (a)

61. You can run Pig in interactive mode using the ______ shell.
a. Grunt
b. FS
c. HDFS
d. None of the mentioned
Answer: (a)

62. Which of the following is the default mode?
a. Mapreduce
b. Tez
c. Local
d. All of the mentioned
Answer: (a)

63. Use the __________ command to run a Pig script that can interact with the Grunt shell (interactive mode).
a. fetch
b. declare
c. run
d. all of the mentioned
Answer: (c)


64. What are the different complex data types in Pig?
a. Maps
b. Tuples
c. Bags
d. All of these
Answer: (d)

65. What are the various diagnostic operators available in Apache Pig?
a. Dump Operator
b. Describe Operator
c. Explain Operator
d. All of these
Answer: (d)

66. If data has fewer elements than the specified schema elements in Pig, then?
a. Pig will not do anything
b. It will pad the end of the record columns with nulls
c. Pig will throw an error
d. Pig will warn you before it throws an error
Answer: (b)

67. Which of the following commands sets the value of a particular configuration variable (key)?
a. set -v
b. set <key>=<value>
c. set
d. reset
Answer: (b)

68. Point out the correct statement.
a. Hive commands are non-SQL statements, such as setting a property or adding a resource.
b. set -v prints a list of configuration variables that are overridden by the user or Hive.
c. set sets a list of variables that are overridden by the user or Hive.
d. None of the mentioned
Answer: (a)

69. Which of the following will remove the resource(s) from the distributed cache?
a. delete FILE[S] <filepath>*
b. delete JAR[S] <filepath>*
c. delete ARCHIVE[S] <filepath>*
d. all of the mentioned
Answer: (d)

70. _________ is a shell utility which can be used to run Hive queries in either interactive or batch mode.
a. $HIVE/bin/hive
b. $HIVE_HOME/hive
c. $HIVE_HOME/bin/hive
d. All of the mentioned
Answer: (c)

71. HiveServer2, introduced in Hive 0.11, has a new CLI called __________
a. BeeLine
b. SqlLine
c. HiveLine
d. CLilLine
Answer: (a)

72. Variable substitution is disabled by using ___________
a. set hive.variable.substitute=false;
b. set hive.variable.substitutevalues=false;
c. set hive.variable.substitute=true;
d. all of the mentioned
Answer: (a)
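The Hive shell commands covered in Q67–Q72 can be sketched together; the property value, resource path, and connection URL below are illustrative assumptions:

```
$ $HIVE_HOME/bin/hive                       # interactive Hive CLI (Q70)
hive> set mapred.reduce.tasks=32;           -- set <key>=<value> (Q67)
hive> set hive.variable.substitute=false;   -- disables variable substitution (Q72)
hive> delete FILE /tmp/udfs.py;             -- removes a resource from the distributed cache (Q69)

$ beeline -u jdbc:hive2://localhost:10000   # Beeline, the HiveServer2 CLI (Q71)
```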


73. _______ supports a new command shell, Beeline, that works with HiveServer2.
a. HiveServer2
b. HiveServer3
c. HiveServer4
d. None of the mentioned
Answer: (a)

74. In ______ mode HiveServer2 only accepts valid Thrift calls.
a. Remote
b. HTTP
c. Embedded
d. Interactive
Answer: (a)

75. The Hbase tables are
a. Made read only by setting the read-only option
b. Always writeable
c. Always read-only
d. Are made read only using the query to the
Answer: (a)

76. Every row in a Hbase table has
a. Same number of columns
b. Same number of column families
c. Different number of columns
d. Different number of column families
Answer: (c)

77. Hbase creates a new version of a record during
a. Creation of a record
b. Modification of a record
c. Deletion of a record
d. All the above
Answer: (d)

78. HBaseAdmin and ____________ are the two important classes in this package that provide DDL functionalities.
a. HTableDescriptor
b. HDescriptor
c. HTable
d. HTabDescriptor
Answer: (a)

79. Which of the following are operational commands in HBase?
a. Get
b. Put
c. Delete
d. All of the mentioned
Answer: (d)

80. The _________ Server assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
a. Region
b. Master
c. Zookeeper
d. All of the mentioned
Answer: (b)
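Row-version settings such as the minimum number of versions to keep (Q26, configured via HColumnDescriptor in the Java API) map to column-family attributes in the HBase shell. A sketch, with the table and family names as illustrative assumptions:

```
hbase(main):001:0> alter 't1', NAME => 'cf', VERSIONS => 3, MIN_VERSIONS => 1
hbase(main):002:0> describe 't1'   # shows the per-column-family version settings
```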
