
1. Describe the core components of Hadoop and their purpose.

• MapReduce (distributed data processing)

• HDFS (the distributed file system where Hadoop stores data)

• YARN (a framework for job scheduling and cluster resource management)

• Hadoop Common (the common utilities that support the other Hadoop modules)

2. Identify what is a good fit for Hadoop and what is not.

• Hadoop is good for:


a. Processing massive amounts of data through parallelism.
b. Handling a variety of data (structured, semi-structured, and unstructured).
c. Running on inexpensive commodity hardware.

• Hadoop is NOT good for:


a. Processing transactions (random access).
b. Work that cannot be parallelized.
c. Low-latency data access.
d. Processing many small files.
e. Intensive calculations with little data.

3. Draw the Hortonworks Data Platform Components Structure.

4. Describe Kafka and Sqoop.


Kafka: a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, used for building real-time
data pipelines and streaming applications.
Sqoop: a tool for easily importing data from structured data stores (relational databases and enterprise data
warehouses) into Hadoop, and for exporting data from Hadoop back to them.
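The publish-subscribe pattern that Kafka implements can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's actual API: `Broker`, `publish`, and `consume` are illustrative names, but the ideas shown (topics as append-only logs, consumers reading from an offset) mirror how Kafka works.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: each topic is an append-only log of messages."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers append to the topic's log.
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # Consumers read from an offset they track themselves,
        # much like Kafka consumer offsets.
        return self.topics[topic][offset:]

broker = Broker()
broker.publish("clicks", {"user": "u1", "page": "/home"})
broker.publish("clicks", {"user": "u2", "page": "/cart"})
print(broker.consume("clicks", offset=1))  # messages after the first
```

Because messages stay in the log after being read, many independent consumers can replay the same topic from any offset, which is what makes the pattern suitable for real-time pipelines.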

5. Describe Hive, HBase.


• Hive is a data warehouse system built on top of Hadoop.

• HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed
File System (HDFS).
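HBase's column-oriented layout can be sketched with nested dictionaries. This is a toy model of the data layout only (row key → column family → column → value), not the HBase client API; the table, family, and column names are made up for illustration.

```python
from collections import defaultdict

# Toy model of HBase's layout: row key -> column family -> column -> value.
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    table[row_key][family][column] = value

def get(row_key, family, column):
    # Missing cells simply return None; rows are sparse.
    return table[row_key][family].get(column)

put("user#1001", "info", "name", "Alice")
put("user#1001", "metrics", "logins", 42)
print(get("user#1001", "info", "name"))
```

Note that different rows can hold completely different columns: there is no fixed schema beyond the column families, which is what makes HBase non-relational.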

6. Describe Zeppelin, and Ambari Views Tools.

• Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents.

• The Ambari web interface includes a built-in set of Views that are pre-deployed for you to use with your cluster.

7. Describe the Functions of Apache Ambari.

• Provision a Hadoop cluster

• Manage a Hadoop cluster

• Monitor a Hadoop cluster

8. Describe Apache Ambari Metrics System with graph.


9. List actions of Apache Ambari for managing Hosts page in a cluster.
• Working with Hosts
• Determine Host Status
• Filtering Hosts List
• Performing Host-Level Actions
• Viewing components on a Host

10. Describe some of basic terms of Apache Ambari terminology.


• Service
• Component
• Node/Host
• Operation
• Task
• Stage
• Action
• Role

11. List the reasons why Hadoop is important.


• Managing data.

• Exponential growth of the big data market.

• Robust Hadoop infrastructure.


• Research tool.

• Hadoop is omnipresent.

• A maturing technology.

12. List the advantages and disadvantages of Hadoop.


13. List HDFS goals.

• Fault tolerance by detecting faults and applying quick and automatic recovery

• Streaming data access suited to MapReduce processing

• Simple and robust coherency model

• Processing logic close to the data rather than the data close to the processing logic

• Portability across heterogeneous commodity hardware and operating systems

14. Illustrate HDFS Architecture with graph.


• Master: NameNode (NN)
  ▪ One per cluster.
  ▪ Manages the file system namespace and metadata:
    • FsImage
    • Edits Log

• Worker: DataNode
  ▪ Many per cluster.
  ▪ Manages the storage that is attached to its node.
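The NameNode/DataNode split can be illustrated with a toy block-placement sketch. The 128 MB block size and replication factor of 3 are the classic HDFS defaults, but the round-robin placement policy below is a simplification (real HDFS is rack-aware), and all names here are illustrative.

```python
# Toy sketch of a NameNode assigning replicated blocks to DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the classic HDFS default
REPLICATION = 3                  # default replication factor

def place_blocks(file_size, datanodes):
    num_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Pick REPLICATION distinct nodes, rotating the starting node
        # so blocks spread across the cluster.
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(300 * 1024 * 1024, nodes)   # 300 MB file -> 3 blocks
print(plan)
```

The NameNode holds only this placement metadata; the blocks themselves live on the DataNodes, and losing any single node still leaves two replicas of every block.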

15. Describe main components of MapReduce.


• JobTracker
▪ Accepts MapReduce jobs that are submitted by clients.
▪ Pushes Map and Reduce tasks out to TaskTracker nodes.

• TaskTracker
▪ Runs Map and Reduce tasks.
▪ Reports task status to the JobTracker.
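The Map and Reduce tasks that TaskTrackers run can be sketched with the canonical word-count example in plain Python. This imitates the programming model only, not the Hadoop APIs; in a real job the framework, not this script, handles the grouping between the two phases.

```python
from collections import defaultdict

def mapper(line):
    # Map task: emit a (key, value) pair for each word.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce task: combine all values seen for one key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

# Map phase: the framework groups intermediate pairs by key.
intermediate = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        intermediate[word].append(count)

# Reduce phase: one reducer call per unique key.
result = dict(reducer(w, c) for w, c in intermediate.items())
print(result["the"])  # "the" appears twice across the input
```

The same mapper and reducer, unchanged, would scale from two lines to billions because each map call and each reduce call is independent.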

16. Define shuffling in MapReduce.


• Transfers intermediate mapper output to the reducers for final processing.

• The output of each mapper is locally grouped together by key.

• One node is chosen to process the data for each unique key.
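The routing step above can be sketched as a toy shuffle. The `partition` function below is a deterministic stand-in for Hadoop's default hash partitioner (Python's built-in `hash` is randomized per run, so a simple character sum is used instead); all data and names are illustrative.

```python
NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic stand-in for a hash partitioner:
    # every occurrence of a key maps to the same reducer.
    return sum(ord(ch) for ch in key) % num_reducers

mapper_outputs = [
    [("apple", 1), ("banana", 1)],   # output of mapper 0
    [("apple", 1), ("cherry", 1)],   # output of mapper 1
]

# Shuffle: route every pair to the reducer chosen by its key.
reducers = {r: [] for r in range(NUM_REDUCERS)}
for output in mapper_outputs:
    for key, value in output:
        reducers[partition(key)].append((key, value))

# Both "apple" pairs land on one reducer, regardless of which mapper emitted them.
apple_homes = {partition(k) for out in mapper_outputs for k, _ in out if k == "apple"}
print(apple_homes)
```

This is the whole point of the shuffle: it guarantees the reducer for a key sees every value for that key, so the reduce step can compute a complete result.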


17. Illustrate YARN features.
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability and availability
18. Explain the nature and purpose of Apache Spark in the Hadoop infrastructure.
• Faster results from analytics are increasingly important.
• Apache Spark is a computing platform that is fast, general-purpose, and easy to use.

19. Describe the architecture and list the components of the Apache Spark unified stack.

20. Describe the role of a Resilient Distributed Dataset (RDD)


21. List and describe the Apache Spark libraries.
22. List the characteristics of representative data file formats.
23. List the characteristics of the four types of NoSQL data stores.
24. Illustrate Apache Hive Components through graph.
25. List the characteristics of programming languages that are typically used by data scientists:
R and Python.
26. Illustrate the steps for creating an RDD.
27. List RDD operations: Transformations and Actions.
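The transformation/action distinction in question 27 can be sketched without Spark itself. `ToyRDD` below is an illustrative class, not the PySpark API: transformations (`map`, `filter`) are lazy and only build a pipeline, while an action (`collect`) forces evaluation, which mirrors how real RDDs behave.

```python
class ToyRDD:
    """Toy illustration of lazy RDD evaluation using generators."""
    def __init__(self, data):
        self._data = data          # an iterable; nothing computed yet

    # Transformations: return a new ToyRDD, deferring all work.
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    # Action: triggers the actual computation.
    def collect(self):
        return list(self._data)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = rdd.collect()   # only now does the pipeline run
print(result)            # [0, 4, 16, 36, 64]
```

Unlike a real RDD, this generator-based sketch can only be consumed once; Spark can recompute (or cache) an RDD from its lineage, which is what question 20's "resilient" refers to.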
