http://hadooptutorial.info/category/interview-questions/hadoop-interview-questions-for-experienced-and-freshers/
http://hadooptutorial.info/category/interview-questions/mapreduce-interview-questions/
http://hadooptutorial.info/category/interview-questions/hbase-interview-questions-for-experienced-freshers/
http://hadooptutorial.info/category/interview-questions/hive-interview-questions/
http://hadooptutorial.info/category/interview-questions/pig-interview-questions-for-experienced-and-freshers/
http://hadooptutorial.info/category/interview-questions/sqoop-interview-questions-and-answers
Hadoop scenario based interview questions
Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all
the mappers are complete. One of the reasons could be that the reducer is spending a lot of time copying
the map outputs. So in this case we can try a couple of things.
1. If possible add a combiner to reduce the amount of output from the mapper to be sent to the reducer
2. Enable map output compression – this will further reduce the size of the outputs to be transferred to
the reducer.
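For example, map output compression can be turned on per job with -D flags. A minimal sketch, assuming
the job driver uses ToolRunner and that Snappy native libraries are installed (the jar, class, and path
names below are made up):

hadoop jar myjob.jar MyJobDriver \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /input /output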
Scenario 2 – A particular task is using a lot of memory, which is causing the slowness or failure. I will look
for ways to reduce the memory usage.
1. Make sure the joins are made in an optimal way with memory usage in mind. For example, in Pig joins, the
LEFT hand side tables are sent to the reducer first and held in memory and the RIGHT most table is
streamed to the reducer. So make sure the RIGHT most table is the largest of the datasets in the join.
2. We can also increase the memory available to the map and reduce tasks by setting
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb
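A minimal sketch of doing this from the command line, assuming the driver uses ToolRunner (the jar,
class, path names and memory values are made up; the matching java.opts heap settings are included as an
assumption about how the containers should be sized):

hadoop jar myjob.jar MyJobDriver \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.map.java.opts=-Xmx1638m \
  -D mapreduce.reduce.memory.mb=4096 \
  -D mapreduce.reduce.java.opts=-Xmx3276m \
  /input /output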
Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in PIG and
HIVE scripts.
1. If you have smaller tables in a join, they can be sent to the distributed cache and loaded in memory on the
Map side, and the entire join can be done on the Map side, thereby avoiding the shuffle and reduce
phase altogether. This will tremendously improve performance. Look up USING 'replicated' in Pig and
MAPJOIN or hive.auto.convert.join in Hive (see the sketch after this list).
2. If the data is already sorted you can use USING MERGE which will do a Map Only join
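A minimal sketch of the Hive side (the table and column names are made up; hive.auto.convert.join and
the MAPJOIN hint are standard Hive features):

hive -e "SET hive.auto.convert.join=true;
SELECT /*+ MAPJOIN(dim_small) */ f.*, d.name
FROM fact_big f JOIN dim_small d ON (f.dim_id = d.id);"

With hive.auto.convert.join enabled, recent Hive versions convert the join to a map join automatically, so
the hint is mainly useful on older releases.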
Scenario 4 – The Shuffle process is the heart of a MapReduce program and it can be tweaked for
performance improvement.
1. If you see lots of records being spilled to the disk (check for Spilled Records in the counters in your
MapReduce output) you can increase the memory available for the Map to perform the Shuffle by
increasing the value of io.sort.mb. This will reduce the amount of Map output written to the disk so the
sorting of the keys can be performed in memory.
2. On the reduce side the merge operation (merging the output from several mappers) can be done on
disk by setting mapred.inmem.merge.threshold to 0
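A minimal sketch combining both tunings as -D flags, assuming the driver uses ToolRunner (the jar, class
and path names are made up; io.sort.mb and mapred.inmem.merge.threshold are the older property names,
and the newer equivalents used below are mapreduce.task.io.sort.mb and
mapreduce.reduce.merge.inmem.threshold):

hadoop jar myjob.jar MyJobDriver \
  -D mapreduce.task.io.sort.mb=512 \
  -D mapreduce.reduce.merge.inmem.threshold=0 \
  /input /output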
Q: Is Hadoop a database?
A: No. Hadoop is not a database; HDFS is a write-once, read-many file system. But there are other products like Hive and HBase that provide
a SQL-like interface to Hadoop for storing data in RDBMS-like database structures.
A: It is a way to write MapReduce jobs using a far simpler, SQL-like syntax than using Java, which is very
wordy.
A: Apache Mahout.
A: Spark stores data in memory, thus running MapReduce operations much faster than Hadoop, which
stores it on disk. Also it has command line interfaces in Scala, Python, and R. And it includes a
machine learning library, Spark ML, that is developed by the Spark project and not separately, like
Mahout.
Q: What is Map in MapReduce?
A: Map takes an input data file and transforms it into (key -> value) pairs or tuples (a,b,c,d) or another iterable
structure. The framework then groups the pairs by key, and Reduce iterates over the values for each key to produce one final result.
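For a concrete example, the word-count job shipped with Hadoop maps each line of input to (word, 1)
pairs and its reducer sums the counts per word. A minimal sketch of running it (the example jar path
assumes a standard Hadoop 2.x install, and the HDFS paths are made up):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /input/books /output/wordcount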
Q: What does safemode in Hadoop mean?
A: It means the NameNode is in a read-only state: no changes to the file system are allowed while it waits
for enough datanodes to report their blocks. This usually occurs on startup.
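You can check and, if necessary, force the state with the standard HDFS admin commands:

hdfs dfsadmin -safemode get    # report whether the NameNode is currently in safemode
hdfs dfsadmin -safemode leave  # manually leave safemode (normally it exits on its own)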
A: Hadoop is a master-slave model. The namenode is the master. The slaves are the datanodes. The
namenode manages the filesystem namespace and decides which datanodes hold which blocks. Datanodes are
responsible for writing the actual data to disk.
A: It is a resource manager. What it does is keep track of available resources (memory, CPU, storage)
across the cluster (meaning the machines where Hadoop is running). Each application (e.g. a
MapReduce job) then asks the resource manager what resources are available, and it doles those out accordingly. It
runs two daemons to do this: the Scheduler and the ApplicationsManager.
A: You copy the whole Hadoop $HADOOP_HOME folder to the server. Then you set up ssh keys so that the
Hadoop user can ssh to that server without having to enter a password. Then you add the name of that
server to $HADOOP_HOME/etc/hadoop/slaves. Then you run hadoop-daemon.sh --config
$HADOOP_CONF_DIR --script hdfs start datanode on the new data node.
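Put together, a minimal sketch of the steps (the hostname datanode5 and the hadoop user are made up;
the slaves file applies to Hadoop 2.x, Hadoop 3 calls it workers):

ssh-copy-id hadoop@datanode5                         # passwordless ssh for the hadoop user
echo "datanode5" >> $HADOOP_HOME/etc/hadoop/slaves   # register the new node on the master
# then, on the new node:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode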
Q: How do you see what Hadoop services are running? Name them.
A: Run jps. You should see: DataNode on the datanodes, and NameNode, SecondaryNameNode, and
ResourceManager on the NameNode, plus (optionally) the JobHistoryServer.
A: This is a technique used to find a function that most nearly matches a set of data points. For example
if you have one independent value x and one dependan…
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to
import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and export from Hadoop
file system to relational databases.
Sqoop metastore is a shared metadata repository that allows remote users to define and execute saved jobs
created with the sqoop job command. sqoop-site.xml should be configured to connect
to the metastore.
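A minimal sketch of working with saved jobs (the job name, connection string, and paths are made up;
sqoop job --create, --list and --exec are standard Sqoop commands, and --meta-connect can point the
client at a shared metastore):

sqoop job --create daily_orders_import \
  -- import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /data/orders

sqoop job --list                        # show jobs saved in the (shared) metastore
sqoop job --exec daily_orders_import    # run the saved job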
Q3. What are the two file formats supported by sqoop for import?
Q4. What is the difference between Sqoop and DistCP command in Hadoop?
Both distCP (Distributed Copy in Hadoop) and Sqoop transfer data in parallel but the only difference is
that distCP command can transfer any kind of data from one Hadoop cluster to another whereas Sqoop
transfers data between RDBMS and other components in the Hadoop ecosystem like HBase, Hive, HDFS,
etc.
Q5. Compare Sqoop and Flume
Sqoop vs Flume:
1. Sqoop is used for importing data from structured data sources like RDBMS, whereas Flume is used for
moving bulk streaming data into HDFS.
2. Data import in Sqoop is not event driven, whereas data load in Flume is event driven.
3. In Sqoop, HDFS is the destination for imported data, whereas in Flume, data flows into HDFS through
one or more channels.
Sqoop can import data from a relational database using any SQL query rather than only using table and
column name parameters.
Q8. How can you execute a free form SQL query in Sqoop to import the rows in a sequential manner?
This can be accomplished using the -m 1 option in the Sqoop import command. It will create only one
map task, which will then import the rows serially.
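A minimal sketch (the connection details and paths are made up); note that a free-form --query must
still contain the $CONDITIONS token even when only one mapper is used:

sqoop import \
  --connect jdbc:mysql://dbhost/sales --username user -P \
  --query 'SELECT * FROM orders WHERE $CONDITIONS' \
  --target-dir /data/orders_serial \
  -m 1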
Q9. I have around 300 tables in a database. I want to import all the tables from the database except
the tables named Table298, Table123, and Table299. How can I do this without having to import the
tables one by one?
This can be accomplished using the import-all-tables command in Sqoop and by specifying the
--exclude-tables option with it as follows-
sqoop import-all-tables
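A fuller sketch of the command (the connection details are made up; --exclude-tables is a standard
import-all-tables option and takes a comma-separated list):

sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/mydb --username user -P \
  --exclude-tables Table298,Table123,Table299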
Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To
import large objects in Sqoop, JDBC based imports have to be used, without the direct argument to the
import utility.
Q11. How will you list all the columns of a table using Apache Sqoop?
1. What is Mapreduce ?
2. What is YARN ?
8. What are the main configuration parameters that a user needs to specify to run a Mapreduce job ?
15. How can we specify multiple mapper and reducer classes using the ChainMapper or ChainReducer
classes ?
20. What is NullWritable and how is it special from other Writable data types ?
21. What is Text data type in Hadoop and what are the differences from String data type in Java ?
23. How to create multiple value type output from Mapper with IntWritable and Text Writable ?
28. What will happen if we run a Mapreduce job with an output directory that already exists ?
29. What are the naming conventions for output files from Map phase and Reduce Phase ?
30. Where are the outputs from Map tasks stored ?
31. When will the reduce() method be called from reducers in a Mapreduce job flow ?
32. If reducers do not start before all the mappers are completed then why does the progress on a
MapReduce job show something like Map(80%) Reduce(20%)?
33. Where are the outputs from Reduce tasks stored ?
35. Can we set an arbitrary number of Reduce tasks in a mapreduce job and if yes, how ?
36. What happens if we don't override the mapper methods and keep them as they are ?
41. What are the side effects of not running a secondary name node?
42. How many racks do you need to create a Hadoop cluster in order to make sure that the cluster
operates reliably?
44. Hadoop WebUI shows that half of the datanodes are in decommissioning mode. What does that
mean? Is it safe to remove those nodes from the network?
45. What does the Hadoop administrator have to do after adding new datanodes to the Hadoop clust…
Ans. Hive is basically what we call a data warehousing tool. Hive provides SQL-like queries to
perform analysis and also an abstraction. Although Hive is not a database, it gives you a logical
abstraction over the databases and the tables.
Ans. Hive supports all those client applications that are written in Java, PHP, Python, C++ or Ruby by
exposing its Thrift server.
Ans. No, it is not suitable for an OLTP system since it does not offer insert and update at the row level.
Ans. By default, the Hive table is stored in the HDFS directory /user/hive/warehouse. Moreover,
by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter present in
hive-site.xml, one can change it.
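A minimal sketch of inspecting and overriding the value (the custom path is made up; a permanent change
still belongs in hive-site.xml):

hive -e "SET hive.metastore.warehouse.dir;"                            # show the current value
hive --hiveconf hive.metastore.warehouse.dir=/user/custom/warehouse    # override for one session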
Ans. Hive stores metadata information in the metastore using an RDBMS instead of HDFS. Basically, we use an
RDBMS to achieve low latency, because HDFS read/write operations are time-consuming processes.
Local Metastore:
In this configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a
database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore:
In this configuration, the metastore service runs on its own separate JVM and not in the Hive service
JVM.
Que 8. What is the default database provided by Apache Hive for metastore?
Ans. By default, it offers an embedded Derby database instance backed by the local disk for the metastore.
This is what we call the embedded metastore configuration.
Que 9. What is the difference between the external table and managed table?
External table: on dropping it, Hive just deletes the metadata information regarding the table and leaves
the table data present in HDFS untouched.
Managed table: on dropping it, Hive deletes both the metadata and the table data from the warehouse directory.
Ans. Yes, by using the clause – LOCATION ‘<hdfs_path>’ we can change the default location of a
managed table.
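A minimal sketch (the table, columns, and HDFS path are made up):

hive -e "CREATE TABLE sales (id INT, amount DOUBLE)
LOCATION '/user/hadoop/custom/sales';"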
Ans. Instead of ORDER BY we should use SORT BY, especially when we have to sort huge datasets. The
reason is that the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together
using a single reducer. Hence, using ORDER BY will take a lot of time to execute on a large number of
inputs.
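A minimal sketch of the two clauses (the table and column are made up):

hive -e "SELECT * FROM sales SORT BY amount;"    # sorts within each reducer, scales to large inputs
hive -e "SELECT * FROM sales ORDER BY amount;"   # total ordering, forced through a single reducer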
Ans. Basically, for the purpose of grouping similar types of data together on the basis of a column or
partition key, Hive organizes tables into partitions. Moreover, to identify a particular partition, each table
can have one or more partition keys. In other words, a Hive partition is a sub-directory in
the table directory.
Que 13. Why do we perform partitioning in Hive?
Ans. In a Hive table, Partitioning provides granularity. Hence, by scanning only relevant partitioned data
instead of the whole dataset it reduces the query latency.
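A minimal sketch of a partitioned table and a query that only scans one partition (all names and the
date value are made up):

hive -e "CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING);
SELECT msg FROM logs WHERE dt = '2019-05-20';"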
Ans. In dynamic partitioning, the values for the partition columns are known only at runtime. In other words, they are
known during loading of the data into a Hive table.
Usage:
While we load data from an existing non-partitioned table, in order to improve the sampling and thus
decrease the query latency.
Also, while we do not know all the values of the partitions beforehand, since finding these partition
values manually from a huge dataset is a tedious task.
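A minimal sketch of a dynamic-partition load (the table and column names are made up; the two SET
properties are standard Hive settings for enabling dynamic partitioning):

hive -e "SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE logs PARTITION (dt)
SELECT msg, dt FROM raw_logs;"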
Ans. Basically, there are two main reasons for performing bucketing on a partition:
A map side join requires the data belonging to a unique join key to be present in the same partition.
It allows us to decrease the query time. Also, it makes the sampling process more efficient.
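A minimal sketch of a bucketed table (the names and bucket count are made up):

hive -e "CREATE TABLE users_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;"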
Ans. Basically, to share data structures with external systems we use HCatalog. It offers access to the Hive
metastore to users of other tools on Hadoop, so that they can read and write data to Hive's data
warehouse.
Ans. No
Ans. To analyze the structure of individual columns and the internal structure of the row objects we use
ObjectInspector. Basically, it provides acc