
http://hadooptutorial.info/category/interview-questions/hadoop-interview-questions-for-experienced-and-freshers/

http://hadooptutorial.info/category/interview-questions/mapreduce-interview-questions/

http://hadooptutorial.info/category/interview-questions/hbase-interview-questions-for-experienced-freshers/

http://hadooptutorial.info/category/interview-questions/hive-interview-questions/

http://hadooptutorial.info/category/interview-questions/pig-interview-questions-for-experienced-and-freshers/

http://hadooptutorial.info/category/interview-questions/sqoop-interview-questions-and-answers

[21:44, 5/20/2019] +91 97912 25590: Hadoop scenario based interview questions

Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all
the mappers are complete. One of the reasons could be that the reducer is spending a lot of time copying
the map outputs. In this case we can try a couple of things.

1. If possible, add a combiner to reduce the amount of output from the mapper that has to be sent to the reducer

2. Enable map output compression – this will further reduce the size of the outputs to be transferred to
the reducer.
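
As a rough sketch, the compression tweak can be applied from the command line, assuming the job's driver uses ToolRunner/GenericOptionsParser so that -D options are picked up (myjob.jar, MyDriver and the directories are placeholders):

  hadoop jar myjob.jar MyDriver \
    -D mapreduce.map.output.compress=true \
    -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    input_dir output_dir

The combiner itself is set in the driver code, e.g. job.setCombinerClass(MyReducer.class), and is only safe when the reduce logic is commutative and associative (sums, counts, max, and so on).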

Scenario 2 – A particular task is using a lot of memory, which is causing slowness or failure. I will look
for ways to reduce the memory usage.
1. Make sure the joins are done in an optimal way with memory usage in mind. For example, in Pig joins the
LEFT-hand-side relations are sent to the reducer first and held in memory, while the RIGHT-most relation is
streamed to the reducer. So make sure the RIGHT-most relation is the largest of the datasets in the join.

2. We can also increase the memory allocated to the map and reduce tasks by setting
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb
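
For example, as a hedged sketch (the values and job names are illustrative; the JVM heap set via the *.java.opts properties should stay below the container size):

  hadoop jar myjob.jar MyDriver \
    -D mapreduce.map.memory.mb=2048 \
    -D mapreduce.reduce.memory.mb=4096 \
    -D mapreduce.map.java.opts=-Xmx1600m \
    -D mapreduce.reduce.java.opts=-Xmx3200m \
    input_dir output_dir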

Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in PIG and
HIVE scripts.

1. If you have smaller tables in a join, they can be sent to the distributed cache and loaded in memory on the
Map side, and the entire join can be done on the Map side, thereby avoiding the shuffle and reduce
phase altogether. This will tremendously improve performance. Look up USING 'replicated' in Pig and
MAPJOIN or hive.auto.convert.join in Hive (see the sketch after this list).

2. If the data is already sorted you can use USING 'merge', which will do a Map-only join

3. If the data is bucketed in Hive, you may use hive.optimize.bucketmapjoin or
hive.optimize.bucketmapjoin.sortedmerge depending on the characteristics of the data
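
A minimal sketch of both map-side join forms, assuming a large table/relation called big and a small one called small joined on a column id (all names are placeholders):

  -- Pig: fragment-replicate join; the small relation is listed last
  joined = JOIN big BY id, small BY id USING 'replicated';

  -- Hive: hint the map join explicitly ...
  SELECT /*+ MAPJOIN(small) */ b.* FROM big b JOIN small s ON (b.id = s.id);
  -- ... or let Hive convert joins automatically when the small table fits in memory
  SET hive.auto.convert.join=true;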

Scenario 4 – The Shuffle process is the heart of a MapReduce program and it can be tweaked for
performance improvement.

1. If you see lots of records being spilled to disk (check for Spilled Records in the counters in your
MapReduce output), you can increase the memory available to the Map side for the sort/shuffle by
increasing the value of io.sort.mb. This will reduce the amount of map output written to disk so that the
sorting of the keys can be performed in memory.

2. On the reduce side, the merge operation (merging the output from several mappers) can be done on
disk by setting mapred.inmem.merge.threshold to 0
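
As a sketch using the older property names quoted above (more recent Hadoop releases use mapreduce.task.io.sort.mb and mapreduce.reduce.merge.inmem.threshold instead; values and job names are illustrative):

  hadoop jar myjob.jar MyDriver \
    -D io.sort.mb=512 \
    -D mapred.inmem.merge.threshold=0 \
    input_dir output_dir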

[21:44, 5/20/2019] +91 97912 25590: Set 1

Hadoop Interview Questions

Q: Is Hadoop a database?
A: No. Hadoop is a distributed, write-once read-many file system plus a processing framework, not a
database. But there are other products like Hive and HBase that provide a SQL-like interface to Hadoop
for storing data in RDBMS-like database structures.

Q: What commands do you use to start Hadoop?

A: start-dfs.sh and start-yarn.sh

Q: What does Apache Pig do?

A: It is a way to write MapReduce jobs using a far simpler, SQL-like scripting language (Pig Latin) instead of
Java, which is very verbose.

Q: How do you copy a local file to HDFS?

A: hadoop fs -put filename /(hadoop directory)

Q: What is the Hadoop machine learning library called?

A: Apache Mahout.

Q: How is Spark different than Hadoop?

A: Spark stores data in memory, thus running MapReduce operations much faster than Hadoop, which
stores data on disk. Also, it has command-line interfaces in Scala, Python, and R. And it includes a
machine learning library, Spark ML, that is developed by the Spark project itself and not separately, like
Mahout.

Q: What is Map in MapReduce?

A: Map takes an input data file and transforms it into (key -> value) pairs, tuples (a, b, c, d), or another
iterable structure. Reduce then takes the values grouped under each key and iterates over them to produce
one final result.
Q: What does safemode in Hadoop mean?

A: It means the namenode is in a read-only state and the cluster is not yet ready to accept writes. This usually occurs on startup, while the namenode waits for block reports from the datanodes.

Q: How do you take Hadoop out of safemode?

A: hdfs dfsadmin -safemode leave

Q: What is the difference between a namenode and a datanode?

A: Hadoop follows a master-slave model. The namenode is the master and the datanodes are the slaves. The
namenode manages the file system namespace and keeps track of which blocks are stored on which
datanodes. The datanodes are responsible for actually writing the data blocks to disk and serving them back.

Q: What role does Yarn play in Hadoop?

A: It is a resource manager. It keeps track of the available resources (memory, CPU, storage)
across the cluster (meaning the machines where Hadoop is running). Each application (e.g. a
MapReduce job) then asks the resource manager what resources are available and is allocated them accordingly. YARN
runs two daemons to do this: the Scheduler and the ApplicationsManager.

Q: How do you add a datanode?

A: You copy the whole Hadoop $HADOOP_HOME folder to the new server. Then you set up ssh keys so that the
Hadoop user can ssh to that server without having to enter a password. Then you add the name of that
server to $HADOOP_HOME/etc/hadoop/slaves. Then you run hadoop-daemon.sh --config
$HADOOP_CONF_DIR --script hdfs start datanode on the new datanode (see the sketch below).
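
A condensed sketch of those steps, assuming the new host is named newnode1, the Hadoop user is hadoop, and $HADOOP_HOME / $HADOOP_CONF_DIR are set the same way on both machines (all of these are placeholders):

  scp -r $HADOOP_HOME hadoop@newnode1:$HADOOP_HOME     # copy the installation
  ssh-copy-id hadoop@newnode1                          # passwordless ssh for the Hadoop user
  echo newnode1 >> $HADOOP_HOME/etc/hadoop/slaves      # register the node on the master
  ssh hadoop@newnode1 "$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode"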

Q: How do you see what Hadoop services are running? Name them.
A: Run jps. You should see: DataNode (and NodeManager) on the worker nodes, and NameNode,
SecondaryNameNode, and ResourceManager on the master node, plus (optionally) the JobHistoryServer.

Q: How do you start the Hadoop Job History Server?

A: $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_HOME/etc/hadoop start historyserver

Q: What is linear regression?

A: This is a technique used to find a function that most nearly matches a set of data points. For example,
if you have one independent variable x and one dependan…

[21:44, 5/20/2019] +91 97912 25590: Sqoop interview questions

Q1. What is Sqoop ?

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to
import data from relational databases such as MySQL and Oracle to Hadoop HDFS, and to export data from the
Hadoop file system to relational databases.

Q2. What is Sqoop metastore?

Sqoop metastore is a shared metadata repository that lets remote users define and execute saved jobs
created using sqoop job. The sqoop-site.xml file should be configured to connect
to the metastore.

Q3. What are the two file formats supported by sqoop for import?

Delimited text and Sequence Files.

Q4. What is the difference between Sqoop and DistCP command in Hadoop?

Both DistCP (Distributed Copy in Hadoop) and Sqoop transfer data in parallel, but the difference is
that the DistCP command can transfer any kind of data from one Hadoop cluster to another, whereas Sqoop
transfers data between an RDBMS and components of the Hadoop ecosystem like HBase, Hive, HDFS,
etc.
Q5. Compare Sqoop and Flume

Sqoop vs Flume:

Sqoop is used for importing data from structured data sources like an RDBMS; Flume is used for moving bulk streaming data into HDFS.

Sqoop has a connector-based architecture; Flume has an agent-based architecture.

Data import in Sqoop is not event-driven; data load in Flume is event-driven.

In Sqoop, HDFS is the destination for imported data; in Flume, data flows into HDFS through one or more channels.

Q6. What do you mean by Free Form Import in Sqoop?

Sqoop can import data from a relational database using an arbitrary SQL query rather than only the table and
column name parameters.

Q7. Does Apache Sqoop have a default database?

Yes, MySQL is the default database

Q8. How can you execute a free form SQL query in Sqoop to import the rows in a sequential manner?

This can be accomplished using the -m 1 option in the Sqoop import command. It will create only one
MapReduce task, which will then import the rows serially (see the example below).
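
A hedged example (the JDBC URL, credentials, query, and target directory are placeholders; note that a free-form --query must include the WHERE $CONDITIONS token):

  sqoop import \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --query 'SELECT * FROM orders WHERE $CONDITIONS' \
    --target-dir /user/hadoop/orders \
    -m 1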

Q9. I have around 300 tables in a database. I want to import all the tables from the database except
the tables named Table298, Table123, and Table299. How can I do this without having to import the
tables one by one?

This can be accomplished using the import-all-tables command in Sqoop and by specifying the
--exclude-tables option with it as follows:

sqoop import-all-tables \
  --connect <jdbc-url> --username <user> --password <pass> \
  --exclude-tables Table298,Table123,Table299


Q10. How can I import large objects (BLOB and CLOB objects) in Apache Sqoop?

The Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To
import large objects in Sqoop, JDBC-based imports have to be used without the direct argument to the
import utility.

Q11. How will you list all the columns of a table using Apache Sqoop?

Unlike sqoop-list-tables and sqoop-list-databases, there is no direct command like sqoop-list-columns to
list all the columns. The indirect way of achieving this is to retrieve the columns of the desired table and
redirect them to a file, which can then be viewed manually and contains the column names of a parti…
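
One hedged way to do that redirect, assuming a MySQL source (connection details and the table name orders are placeholders), is to run a DESCRIBE statement through sqoop eval and send the console output to a file:

  sqoop eval \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --query "DESCRIBE orders" > orders_columns.txt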

[21:45, 5/20/2019] +91 97912 25590: Mapreduce Interview Questions

1. What is Mapreduce ?

2. What is YARN ?

3. What is data serialization ?

4. What is deserialization of data ?

5. What are the Key/Value Pairs in Mapreduce framework ?

6. What are the constraints to Key and Value classes in Mapreduce ?

7. What are the main components of Mapreduce Job ?

8. What are the main configuration parameters that a user needs to specify to run a Mapreduce Job ?

9. What are the main components of Job flow in YARN architecture ?

10. What is the role of Application Master in YARN architecture ?

11. What is identity Mapper ?

12. What is identity Reducer ?

13. What is chain Mapper ?

14. What is chain reducer ?

15. How can we specify multiple mapper and reducer classes in Chain Mapper or Chain Reducer
classes ?

16. What is a combiner ?

17. What are the constraints on combiner implementation ?


18. What are the advantages of a combiner over a reducer, or why do we need a combiner when we are using
the same reducer class as the combiner class ?

19. What are the primitive data types in Hadoop ?

20. What is NullWritable and how is it special from other Writable data types ?

21. What is Text data type in Hadoop and what are the differences from String data type in Java ?

22. What are the uses of GenericWritable class ?

23. How to create multiple value type output from Mapper with IntWritable and Text Writable ?

24. What is ObjectWritable data type in Hadoop ?

25. How do we create Writable arrays in Hadoop ?

26. What are the MapWritable data types available in Hadoop ?

27. What is speculative execution in Mapreduce ?

28. What will happen if we run a Mapreduce job with an output directory that already exists ?

29. What are the naming conventions for output files from Map phase and Reduce Phase ?

30. Where is the output from Map tasks stored ?

The most important questions start here

31. When will the reduce() method be called from reducers in a Mapreduce job flow ?

32. If reducers do not start before all the mappers are completed, then why does the progress of a
MapReduce job show something like Map(80%) Reduce(20%) ?

33. Where is the output from Reduce tasks stored ?

34. Can we set an arbitrary number of Map tasks in a mapreduce job ?

35. Can we set an arbitrary number of Reduce tasks in a mapreduce job and if yes, how ?

36. What happens if we don’t override the mapper methods and keep them as they are ?

37. What is the use of Context object ?

38. Can Reducers talk with each other ?

39. What are the primary phases of the Mapper ?

40. What are the primary phases of the Reducer ?

41. What are the side effects of not running a secondary name node?
42. How many racks do you need to create a Hadoop cluster in order to make sure that the cluster
operates reliably?

43. What is the procedure for namenode recovery?

44. Hadoop WebUI shows that half of the datanodes are in decommissioning mode. What does that
mean? Is it safe to remove those nodes from the network?

45. What does the Hadoop administrator have to do after adding new datanodes to the Hadoop clust…

[21:45, 5/20/2019] +91 97912 25590: Hive Interview Questions :

Que 1. What is Apache Hive?

Ans. Hive is what we call a data warehousing tool. It lets you run SQL-like queries to
perform analysis and provides an abstraction layer. Although Hive is not a database, it gives you a logical
abstraction over databases and tables.

Que 2. What kind of applications is supported by Apache Hive?

Ans. Hive supports all client applications written in Java, PHP, Python, C++ or Ruby by exposing its
Thrift server.

Que 3. Is Hive suitable to be used for OLTP systems? Why?

Ans. No, it is not suitable for OLTP systems since it does not offer insert and update at the row level.

Que 4. Where does the data of a Hive table get stored?

Ans. By default, Hive table data is stored in the HDFS directory /user/hive/warehouse. One can change
this by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in
hive-site.xml.

Que 5. What is a metastore in Hive?


Ans. Basically, we use the metastore to store the metadata information in Hive. It does this using an
RDBMS and an open-source ORM (Object Relational Mapping) layer called DataNucleus, which
converts the object representation into a relational schema and vice versa.

Que 6. Why does Hive not store metadata information in HDFS?

Ans. Hive stores metadata information in the metastore using an RDBMS instead of HDFS. Basically, we use an
RDBMS to achieve low latency, because HDFS read/write operations are time-consuming.

Que 7. What is the difference between local and remote metastore?

Ans. Local Metastore:

It is a metastore service that runs in the same JVM in which the Hive service is running and connects to a
database running in a separate JVM, either on the same machine or on a remote machine.

Remote Metastore:

In this configuration, the metastore service runs on its own separate JVM and not in the Hive service
JVM.

Que 8. What is the default database provided by Apache Hive for metastore?

Ans. By default, it offers an embedded Derby database instance backed by the local disk for the metastore.
This is what we call the embedded metastore configuration.

Que 9. What is the difference between the external table and managed table?

Ans. Managed table


The metadata information along with the table data is deleted from the Hive warehouse directory if one
drops a managed table.

External table

Hive just deletes the metadata information regarding the table. Further, it leaves the table data present
in HDFS untouched (see the sketch below).
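
For illustration, a minimal external-table sketch (the table name and path are hypothetical):

  CREATE EXTERNAL TABLE sales_ext (id INT, amount DOUBLE)
  LOCATION '/data/sales';
  -- DROP TABLE sales_ext removes only the metadata; the files under /data/sales stay in HDFS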


Que 10. Is it possible to change the default location of a managed table?

Ans. Yes, by using the LOCATION '<hdfs_path>' clause we can change the default location of a
managed table (see the sketch below).
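
A minimal sketch, with a hypothetical table and path:

  CREATE TABLE sales_managed (id INT, amount DOUBLE)
  LOCATION '/user/hive/custom_warehouse/sales_managed';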

Hive Interview Questions for Freshers- Q. 1,2,3,4,5,7,8,9,10

Hive Interview Questions for Experience- Q. 6

Que 11. When should we use SORT BY instead of ORDER BY?

Ans. We should use SORT BY instead of ORDER BY especially when we have to sort huge datasets. The
reason is that the SORT BY clause sorts the data using multiple reducers (so the output is ordered only within
each reducer), whereas ORDER BY sorts all of the data together using a single reducer. Hence, using ORDER BY
on a large input will take a lot of time to execute.
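
For illustration, with a hypothetical sales table:

  SELECT * FROM sales ORDER BY amount DESC;   -- total order, single reducer
  SELECT * FROM sales SORT BY amount DESC;    -- ordered within each reducer's output, multiple reducers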

Que 12. What is a partition in Hive?

Ans. Basically, Hive organizes tables into partitions for the purpose of grouping similar data together on
the basis of a column or partition key. Moreover, each table can have one or more partition keys to identify
a particular partition. In other words, a Hive partition is a sub-directory in the table directory.
Que 13. Why do we perform partitioning in Hive?

Ans. Partitioning provides granularity in a Hive table. Hence, it reduces the query latency by scanning only
relevant partitioned data instead of the whole dataset.
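
A minimal sketch of a partitioned table (all names are hypothetical):

  CREATE TABLE logs (ip STRING, url STRING)
  PARTITIONED BY (dt STRING);
  -- each dt value becomes a sub-directory, e.g. .../logs/dt=2019-05-20/
  SELECT count(*) FROM logs WHERE dt = '2019-05-20';   -- scans only that partition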

Que 14. What is dynamic partitioning and when is it used?

Ans. In dynamic partitioning, the values for the partition columns are known only at runtime. In other words,
they become known during loading of the data into a Hive table.

Usage:

While we load data from an existing non-partitioned table, in order to improve the sampling and thus
decrease the query latency.

Also, while we do not know all the values of the partitions beforehand, since finding these partition
values manually from a huge dataset is a tedious task.
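
A hedged sketch of a dynamic-partition load into the hypothetical logs table above, from a non-partitioned staging table raw_logs:

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  INSERT OVERWRITE TABLE logs PARTITION (dt)
  SELECT ip, url, log_date AS dt FROM raw_logs;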

Que 15. Why do we need buckets?

Ans. Basically, there are two main reasons for performing bucketing on a partition:

A map-side join requires the data belonging to a unique join key to be present in the same bucket.

It allows us to decrease the query time and also makes the sampling process more efficient.

Que 16. How does Hive distribute the rows into buckets?

Ans. Hive determines the bucket number for a row by using the formula:
hash_function(bucketing_column) modulo num_of_buckets. Basically, hash_function depends on the column
data type. For an integer data type, hash_function will simply be:

hash_function(int_type_column) = value of int_type_column
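
For example, a bucketed table and the resulting placement (the table name and numbers are illustrative):

  CREATE TABLE users (user_id INT, name STRING)
  CLUSTERED BY (user_id) INTO 32 BUCKETS;
  -- for an INT column, hash_function(user_id) = user_id,
  -- so a row with user_id = 100 goes to bucket 100 % 32 = 4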

Que 17. What is indexing and why do we need it?


Ans. A Hive index is a Hive query optimization technique. Basically, we use it to speed up access to a
column or set of columns in a Hive database, since with an index the database system does not need to read
all rows in the table to find the selected data.
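
As a hedged sketch for older Hive releases (indexes were removed in Hive 3.x; the table and index names are hypothetical):

  CREATE INDEX idx_sales_amount ON TABLE sales (amount)
  AS 'COMPACT' WITH DEFERRED REBUILD;
  ALTER INDEX idx_sales_amount ON sales REBUILD;   -- builds the index data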

Que 18. What is the use of Hcatalog?

Ans. Basically, we use HCatalog to share data structures with external systems. It offers users of other
tools on Hadoop access to the Hive metastore, so that they can read and write data to Hive's data
warehouse.

Que 19. Where is table data stored in Apache Hive by default?

Ans. hdfs://namenode_server/user/hive/warehouse

Que 20. Are multi-line comments supported in Hive?

Ans. No

Hive Interview Questions for Freshers- Q. 12,13,14,15,17,18,19,20

Hive Interview Questions for Experience- Q. 11,16

Que 21. What is ObjectInspector functionality?

Ans. We use ObjectInspector to analyze the structure of individual columns and the internal structure of
row objects. Basically, it provides acc
