
1. Describe the core components of Hadoop and their purpose.

• MapReduce (distributed data processing)

• HDFS (the distributed file system where Hadoop stores data)

• YARN (a framework for job scheduling and cluster resource management)

• Hadoop Common (the common utilities that support the other Hadoop modules)

2. Identify what is a good fit for Hadoop and what is not.

• Hadoop is good for:


a. Processing massive amounts of data through parallelism.
b. Handling a variety of data (structured, semi-structured, and unstructured).
c. Running on inexpensive commodity hardware.

• Hadoop is NOT good for:


a. Processing transactions (random access).
b. Work that cannot be parallelized.
c. Low-latency data access.
d. Processing many small files.
e. Intensive calculations with little data.

3. Draw the Hortonworks Data Platform Components Structure.

4. Describe Kafka and Sqoop.


Kafka: a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, used for building real-time
data pipelines and streaming applications.
Sqoop: a tool for easily importing data from structured data stores (relational databases and enterprise data
warehouses) into Hadoop, and for exporting data from Hadoop back to them.
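The publish-subscribe pattern that Kafka implements can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's actual API: `Broker`, `publish`, and `consume` are illustrative names, but the ideas shown (topics as append-only logs, consumers reading from an offset) mirror how Kafka works.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: each topic is an append-only log of messages."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers append to the topic's log.
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # Consumers read from an offset they track themselves,
        # much like Kafka consumer offsets.
        return self.topics[topic][offset:]

broker = Broker()
broker.publish("clicks", {"user": "u1", "page": "/home"})
broker.publish("clicks", {"user": "u2", "page": "/cart"})
print(broker.consume("clicks", offset=1))  # messages after the first
```

Because messages stay in the log after being read, many independent consumers can replay the same topic from any offset, which is what makes the pattern suitable for real-time pipelines.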

5. Describe Hive, HBase.


• Hive is a data warehouse system built on top of Hadoop.

• HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed
File System (HDFS).
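HBase's column-oriented layout can be sketched with nested dictionaries. This is a toy model of the data layout only (row key → column family → column → value), not the HBase client API; the table, family, and column names are made up for illustration.

```python
from collections import defaultdict

# Toy model of HBase's layout: row key -> column family -> column -> value.
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    table[row_key][family][column] = value

def get(row_key, family, column):
    # Missing cells simply return None; rows are sparse.
    return table[row_key][family].get(column)

put("user#1001", "info", "name", "Alice")
put("user#1001", "metrics", "logins", 42)
print(get("user#1001", "info", "name"))
```

Note that different rows can hold completely different columns: there is no fixed schema beyond the column families, which is what makes HBase non-relational.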

6. Describe Zeppelin, and Ambari Views Tools.

• Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents.

• The Ambari web interface includes a built-in set of Views that are pre-deployed for you to use with your cluster.

7. Describe the Functions of Apache Ambari.

• Provision a Hadoop cluster

• Manage a Hadoop cluster

• Monitor a Hadoop cluster

8. Describe Apache Ambari Metrics System with graph.


9. List actions of Apache Ambari for managing Hosts page in a cluster.
• Working with Hosts
• Determine Host Status
• Filtering Hosts List
• Performing Host-Level Actions
• Viewing components on a Host

10. Describe some of basic terms of Apache Ambari terminology.


• Service
• Component
• Node/Host
• Operation
• Task
• Stage
• Action
• Role

11. List the reasons why Hadoop is important.


• Managing data.

• Exponential growth of the big data market.

• Robust Hadoop infrastructure.


• Research tool.

• Hadoop is omnipresent.

• A maturing technology.

12. List the advantages and disadvantages of Hadoop.


13. List HDFS goals.

• Fault tolerance by detecting faults and applying quick and automatic recovery

• Streaming data access suited to MapReduce processing

• Simple and robust coherency model

• Processing logic close to the data rather than the data close to the processing logic

• Portability across heterogeneous commodity hardware and operating systems

14. Illustrate HDFS Architecture with graph.


• Master: NameNode (NN)
  ▪ One per cluster.
  ▪ Manages the file system namespace and metadata:
    • FsImage
    • Edits Log

• Worker: DataNode
  ▪ Many per cluster.
  ▪ Manages the storage that is attached to its node.
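The NameNode/DataNode split can be illustrated with a toy block-placement sketch. The 128 MB block size and replication factor of 3 are the classic HDFS defaults, but the round-robin placement policy below is a simplification (real HDFS is rack-aware), and all names here are illustrative.

```python
# Toy sketch of a NameNode assigning replicated blocks to DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the classic HDFS default
REPLICATION = 3                  # default replication factor

def place_blocks(file_size, datanodes):
    num_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Pick REPLICATION distinct nodes, rotating the starting node
        # so blocks spread across the cluster.
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(300 * 1024 * 1024, nodes)   # 300 MB file -> 3 blocks
print(plan)
```

The NameNode holds only this placement metadata; the blocks themselves live on the DataNodes, and losing any single node still leaves two replicas of every block.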

15. Describe main components of MapReduce.


• JobTracker
▪ Accepts MapReduce jobs that are submitted by clients.
▪ Pushes Map and Reduce tasks out to TaskTracker nodes.

• TaskTracker
▪ Runs Map and Reduce tasks.
▪ Reports task status to the JobTracker.
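The Map and Reduce tasks that TaskTrackers run can be sketched with the canonical word-count example in plain Python. This imitates the programming model only, not the Hadoop APIs; in a real job the framework, not this script, handles the grouping between the two phases.

```python
from collections import defaultdict

def mapper(line):
    # Map task: emit a (key, value) pair for each word.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce task: combine all values seen for one key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

# Map phase: the framework groups intermediate pairs by key.
intermediate = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        intermediate[word].append(count)

# Reduce phase: one reducer call per unique key.
result = dict(reducer(w, c) for w, c in intermediate.items())
print(result["the"])  # "the" appears twice across the input
```

The same mapper and reducer, unchanged, would scale from two lines to billions because each map call and each reduce call is independent.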

16. Define shuffling in MapReduce.


• Transfers intermediate mapper output to the reducers for final processing.

• The output of each mapper is locally grouped together by key.

• One node is chosen to process the data for each unique key.
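The routing step above can be sketched as a toy shuffle. The `partition` function below is a deterministic stand-in for Hadoop's default hash partitioner (Python's built-in `hash` is randomized per run, so a simple character sum is used instead); all data and names are illustrative.

```python
NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic stand-in for a hash partitioner:
    # every occurrence of a key maps to the same reducer.
    return sum(ord(ch) for ch in key) % num_reducers

mapper_outputs = [
    [("apple", 1), ("banana", 1)],   # output of mapper 0
    [("apple", 1), ("cherry", 1)],   # output of mapper 1
]

# Shuffle: route every pair to the reducer chosen by its key.
reducers = {r: [] for r in range(NUM_REDUCERS)}
for output in mapper_outputs:
    for key, value in output:
        reducers[partition(key)].append((key, value))

# Both "apple" pairs land on one reducer, regardless of which mapper emitted them.
apple_homes = {partition(k) for out in mapper_outputs for k, _ in out if k == "apple"}
print(apple_homes)
```

This is the whole point of the shuffle: it guarantees the reducer for a key sees every value for that key, so the reduce step can compute a complete result.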


17. Illustrate YARN features.
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability and availability
18. Explain the nature and purpose of Apache Spark in the Hadoop infrastructure.
• Faster results from analytics are increasingly important.
• Apache Spark is a computing platform that is fast, general-purpose, and easy to use.

19. Describe the architecture and list the components of the Apache Spark unified stack.

20. Describe the role of a Resilient Distributed Dataset (RDD)


21. List and describe the Apache Spark libraries.
22. List the characteristics of representative data file formats.
23. List the characteristics of the four types of NoSQL data stores.
24. Illustrate Apache Hive Components through graph.
25. List the characteristics of programming languages that are typically used by data scientists:
R and Python.
26. Illustrate the steps for creating an RDD.
27. List RDD operations: Transformations and Actions.
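The transformation/action distinction in question 27 can be sketched without Spark itself. `ToyRDD` below is an illustrative class, not the PySpark API: transformations (`map`, `filter`) are lazy and only build a pipeline, while an action (`collect`) forces evaluation, which mirrors how real RDDs behave.

```python
class ToyRDD:
    """Toy illustration of lazy RDD evaluation using generators."""
    def __init__(self, data):
        self._data = data          # an iterable; nothing computed yet

    # Transformations: return a new ToyRDD, deferring all work.
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    # Action: triggers the actual computation.
    def collect(self):
        return list(self._data)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = rdd.collect()   # only now does the pipeline run
print(result)            # [0, 4, 16, 36, 64]
```

Unlike a real RDD, this generator-based sketch can only be consumed once; Spark can recompute (or cache) an RDD from its lineage, which is what question 20's "resilient" refers to.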
