
Big Data Analytics (BDA)

JIT #CMIT-5125

Unit-4
Overview of Big Data
Tools and Technology

Admas Abtew
Faculty of Computing and Informatics
Jimma Technology Institute, Jimma University
Admas.abtew@ju.edu.et
+251-912499102
 Outline
• What is Hadoop
• Hadoop framework-advantages
• The Hadoop Ecosystem
Chapter Four: Big Data Tools and Technologies
4.1 Introduction to Hadoop

Apache Hadoop is one of the most widely used tools in the Big Data industry. Hadoop is an
open-source framework from Apache that runs on commodity hardware. It is used to store,
process, and analyze Big Data.
Hadoop is written in Java. Apache Hadoop enables parallel processing of data because it works
on multiple machines simultaneously. It uses a clustered architecture; a cluster is a group of
systems connected via a LAN.
It consists of three parts:
 Hadoop Distributed File System (HDFS) – the storage layer of Hadoop.
 Map-Reduce – the data processing layer of Hadoop.
 YARN – the resource management layer of Hadoop.
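As a quick illustration of the storage layer, the minimal sketch below uses Python's subprocess module to call the standard hdfs dfs command-line client. It assumes a configured Hadoop client on the PATH; the file and directory names are hypothetical.

    # Sketch: copy a local file into HDFS and list it back.
    import subprocess

    def run(cmd):
        # Run an HDFS CLI command and print its output.
        print(subprocess.run(cmd, capture_output=True, text=True).stdout)

    run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"])             # create a directory
    run(["hdfs", "dfs", "-put", "local_data.txt", "/user/demo/"])  # upload a file
    run(["hdfs", "dfs", "-ls", "/user/demo"])                      # list its contents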

Hadoop…
1. Hadoop Distributed File System (HDFS)
HDFS is the most important part of the Hadoop ecosystem. It stores large sets of structured
or unstructured data across multiple nodes and keeps track of that information in metadata log
files. It is a distributed file system designed to store and manage large amounts of data across
the servers of a Hadoop cluster.
HDFS Components
HDFS consists of two core components:
 Name Node
 Data Node
NameNodes and DataNodes are the two primary components of HDFS's architecture.
The NameNode manages the file system metadata, including the directory structure and the
file-to-block mapping, and maintains a record of which blocks reside on which DataNodes.
The DataNodes hold the data blocks themselves.
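To see the NameNode's file-to-block mapping in practice, the hedged sketch below calls the standard hdfs fsck utility, which reports the blocks of a file and the DataNodes they live on. The path is hypothetical.

    # Sketch: inspect which blocks make up a file and where they are stored.
    import subprocess

    report = subprocess.run(
        ["hdfs", "fsck", "/user/demo/local_data.txt",
         "-files", "-blocks", "-locations"],
        capture_output=True, text=True,
    )
    print(report.stdout)  # lists each block ID and its DataNode locations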

Name Node
 The name node, also known as the master node, controls the workings of the data
nodes. Generally, it holds metadata.
Data Node
 The main task of the data nodes is to read, write, process, and replicate data.
 The data nodes communicate with the name node by sending periodic signals
known as heartbeats. These heartbeats report the status of the data node.
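The heartbeat mechanism can be illustrated with a toy simulation. This is plain Python, not Hadoop code; all class and method names are hypothetical. A NameNode-like object records when each DataNode last reported in and treats a node as dead once no heartbeat arrives within a timeout.

    # Toy simulation of the heartbeat protocol (not a Hadoop API).
    import time

    class ToyNameNode:
        def __init__(self, timeout_s=30.0):
            self.timeout_s = timeout_s
            self.last_heartbeat = {}          # DataNode id -> last time seen

        def receive_heartbeat(self, datanode_id):
            # Called whenever a DataNode reports its status.
            self.last_heartbeat[datanode_id] = time.monotonic()

        def live_datanodes(self):
            # A node is "live" if it has sent a heartbeat within the timeout.
            now = time.monotonic()
            return [dn for dn, t in self.last_heartbeat.items()
                    if now - t < self.timeout_s]

    nn = ToyNameNode(timeout_s=5.0)
    nn.receive_heartbeat("datanode-1")
    nn.receive_heartbeat("datanode-2")
    print(nn.live_datanodes())   # ['datanode-1', 'datanode-2']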

HDFS Cluster Master and Slave Nodes
Master and slave nodes form the HDFS cluster. The master is the name node, and the slaves
are the data nodes.
The name node is the main node. It stores metadata (information about the data) and uses
fewer resources than the data nodes, which store the actual data.
In a distributed environment, these data nodes are ordinary commodity hardware devices,
which is what makes Hadoop cost-effective.
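A quick way to see the master's view of its slaves is the standard hdfs dfsadmin -report command, which prints the live and dead DataNodes along with their capacity and usage. The sketch below simply invokes it; it assumes admin rights on a configured Hadoop client.

    # Sketch: ask the NameNode for its view of the cluster.
    import subprocess

    report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                            capture_output=True, text=True)
    print(report.stdout)  # cluster capacity plus per-DataNode status and usage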

2. Yet Another Resource Negotiator (YARN)
As the name implies, Yet Another Resource Negotiator (YARN) is a resource manager
that assists in managing resources across clusters of computers. Briefly stated, it is
responsible for the scheduling and resource allocation of the Hadoop system. It is made
up of three fundamental components:
 Resource Manager
 Node Manager
 Application Master
 Hadoop YARN functions as Hadoop's operating system: it is a resource-management
layer that sits on top of HDFS. Its job is to coordinate the cluster's resources so that no one
node gets overworked, and its work scheduling ensures that jobs are planned appropriately.
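The Resource Manager exposes a REST API for cluster state. The hedged sketch below queries the cluster metrics endpoint with the third-party requests library (pip install requests); "rm-host" is a hypothetical hostname, and 8088 is the default ResourceManager web port.

    # Sketch: query the YARN ResourceManager's REST API for cluster metrics.
    import requests

    resp = requests.get("http://rm-host:8088/ws/v1/cluster/metrics")
    metrics = resp.json()["clusterMetrics"]
    print(metrics["activeNodes"], "active NodeManagers")
    print(metrics["appsRunning"], "applications running")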

YARN Components
Client - A client is an entity that submits a job or application to the YARN cluster for
execution. The client is typically a program or script running on a user's machine, or a
system that initiates job execution.
Resource Manager - Applications in the system can only receive the resources that the
Resource Manager has authorized them to use.
Node Managers - A Node Manager on each node manages that node's resources (CPU,
memory, and bandwidth) and reports their availability to the Resource Manager.
Application Master - The Application Master acts as a go-between for the Resource
Manager and the Node Managers, negotiating resources and nodes as needed.
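Once clients have submitted jobs, the standard yarn CLI can list them. A minimal sketch, assuming a configured Hadoop client on the PATH:

    # Sketch: list YARN applications currently running on the cluster.
    import subprocess

    out = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "RUNNING"],
        capture_output=True, text=True,
    )
    print(out.stdout)  # application IDs, names, types, and progress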

MapReduce
The core component of Hadoop is its MapReduce framework. MapReduce distributes
processing among the slave nodes, which then report their results back to the master node.
MapReduce uses distributed and parallel algorithms, making it feasible to develop
applications that turn massive data sets into more manageable ones.
Rather than moving the data to the code, MapReduce ships the code to the data. In most
cases, the amount of code is negligible compared to the raw data, so we can get a great
deal of work done on many computers by transmitting only a few thousand bytes of code.
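The idea can be illustrated without a cluster. The toy word count below is plain Python, not the Hadoop API; it runs the same map, shuffle, and reduce steps that Hadoop would distribute across nodes.

    # Toy word count illustrating the MapReduce idea in plain Python.
    from collections import defaultdict

    documents = ["big data tools", "big data analytics"]

    # Map: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate each key's values.
    counts = {word: sum(vals) for word, vals in groups.items()}
    print(counts)   # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}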

MapReduce…
MapReduce makes use of two functions, Map() and Reduce(), described below.
The main components of MapReduce are:
 Input Data: First, the data to be processed is fed into the system. It can arrive in
different forms, such as text files, database records, or files on HDFS.
 Split: The sets of intermediate keys and values are partitioned by key. Partitioning
makes sure that all key-value pairs with the same key end up in the same partition.
This step makes it easy to sort and group the data for the next phase.
 Map Function: The Map() function processes data by applying filters and performing
sort operations on it. Using a key-value pair as input, Map produces a result that may be
further processed by the Reduce() function. In a distributed computing framework,
multiple nodes run Map on the data at the same time. The Map function's output is a set
of intermediate key-value pairs.

MapReduce…
 Shuffle and Sort: The partitioned intermediate data is sent from one node in the
cluster to another. The data is sorted by key so that all of the values belonging to the
same key are put together.
 Reduce Function: Reduce() takes the output of Map() as input and combines those
tuples into a smaller collection of tuples; that is, it aggregates the mapped data.
Multiple nodes run the Reduce function at the same time, and its result is usually a set
of final key-value pairs.
 Output Data: The result of the Reduce function is the result of the MapReduce
process. It can be stored in different ways, such as text files, database records, or files
on HDFS.
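On a real cluster, these phases can be exercised with Hadoop Streaming, which lets the Map and Reduce functions be ordinary scripts that read stdin and write stdout. A hedged sketch follows; the file names and HDFS paths are hypothetical, and the location of the streaming jar varies by installation.

    # mapper.py - the Map function: emit an intermediate <word, 1> pair per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py - the Reduce function: input arrives sorted by key after the
    # Shuffle and Sort phase, so equal words are adjacent and summed in one pass.
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

The job would then be submitted with something like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /user/demo/input -output /user/demo/output, after which the Output Data appears as part files in the output directory.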

Hadoop Ecosystem
 The Hadoop ecosystem is an integration of numerous components built directly on top of
the Hadoop platform. There are, however, a plethora of complicated interdependencies
across these systems to consider.

Component Descriptions
HDFS: The Hadoop Distributed File System (HDFS) is the key data storage system for Hadoop
applications. It is a distributed file system implemented with a NameNode/DataNode architecture
to offer high-performance access to data across highly scalable Hadoop clusters.
YARN: Apache Hadoop's YARN component is responsible for assigning system resources to the
various applications operating in a Hadoop cluster and for scheduling jobs on the different
cluster nodes. YARN is one of the main components of Apache Hadoop.
MapReduce: MapReduce works with two functions, Map and Reduce. The Map function accepts
a set of <key, value> pairs from disk as input, processes them, and returns another set of
intermediate <key, value> pairs as output. The Reduce function accepts inputs in the form of
<key, value> pairs and returns output in the form of <key, value> pairs.
Apache Pig: Apache Pig is a very high-level programming API that allows us to write simple
scripts. If we don't want to write Java or Python MapReduce code and are more familiar with a
scripting language that has somewhat SQL-style syntax, Pig is a good choice.

Component Descriptions
Apache Hive: Hive accepts SQL queries and makes the distributed data sitting on your file
system look like a SQL database. It uses a language known as HiveQL (Hive SQL). Although it is
not a relational database in the traditional sense, you can connect to it through a shell client or
ODBC (Open Database Connectivity) and run SQL queries against the data stored on your
Hadoop cluster.
Apache Ambari: To make Hadoop management easier, the Apache Ambari project develops
software for deploying, managing, and monitoring Apache Hadoop clusters.
Mesos: Apache Mesos is an open-source cluster manager that manages workloads in a
distributed environment through dynamic resource sharing and isolation.
Apache Spark: Apache Spark is a multi-language engine that can run data engineering, data
science, and machine learning tasks on single-node machines or on clusters of computers (see
the sketch after this table).
Apache HBase: HBase is a column-oriented data store that runs on top of the Hadoop
Distributed File System and enables random data lookups and updates for large amounts of
data. It is designed for big data applications. HBase creates a schema on top of the HDFS files,
allowing users to read and alter those files as many times as they like.
Apache Storm: This is a system for the real-time processing of streaming data. Apache Storm
extends the capabilities of Enterprise Hadoop by providing dependable real-time data processing.
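As an example of one ecosystem tool, the hedged sketch below runs a word count locally with PySpark (pip install pyspark); the input file name is hypothetical.

    # Sketch: word count with PySpark, run in local mode.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("wordcount")
             .getOrCreate())

    # Read lines as strings, then apply the classic map/reduce pipeline.
    lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())
    spark.stop()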

Component Descriptions
Apache Oozie: Apache Oozie is a Java web application used to schedule jobs for the Apache
Hadoop distributed computing system. Oozie combines multiple jobs into a logical unit of work
by processing them sequentially. With YARN serving as its architectural heartbeat, it is fully
integrated with the Hadoop stack and supports Hadoop tasks for Apache MapReduce, Apache
Pig, Apache Hive, and Apache Sqoop, among other technologies. Oozie can also schedule
system-specific jobs, such as Java programs or shell scripts.
ZooKeeper: Apache ZooKeeper is a Hadoop cluster management tool that provides operational
services. Among other things, ZooKeeper provides a distributed configuration service, a
synchronization service, and a naming registry for systems that are spread over multiple
computers. ZooKeeper stores and mediates updates to critical configuration information for
distributed applications.
Data Ingestion: This involves bringing data from different sources, databases, and files into
Hadoop. Hadoop is free and open-source, and there is a multitude of methods for ingesting data
into the system; every developer can ingest data into Hadoop using her/his favorite tool or
programming language. When selecting a tool or technology, developers place a strong emphasis
on performance, yet this makes governance extremely difficult.

4.3 Hadoop Framework - Advantages
Advantages and Descriptions
Scalability: Hadoop provides scalability both in terms of storage and processing power. It allows
organizations to store and process large volumes of data by distributing the workload across a
cluster of commodity hardware. As data volumes grow, additional nodes can be added to the
cluster, ensuring the system can handle increasing data demands.
Fault Tolerance: Hadoop's fault tolerance is one of its core strengths. It achieves fault tolerance
through data replication. HDFS, the storage component of Hadoop, automatically replicates data
across multiple nodes in the cluster. If a node fails, the data can be retrieved from other nodes,
ensuring data availability and system reliability (see the replication sketch after this table).
Cost-Effective: Hadoop is built on commodity hardware, which is more affordable than
proprietary hardware solutions. By using cost-effective hardware, organizations can significantly
reduce the infrastructure costs associated with storing and processing big data. Additionally,
Hadoop's ability to scale horizontally allows organizations to start with a small cluster and
expand it as needed, optimizing hardware utilization and cost efficiency.
Flexible Data Processing: Hadoop's MapReduce processing model enables flexible and parallel
processing of data. It is designed to handle batch processing tasks that do not require real-time
analysis. This flexibility makes Hadoop suitable for a wide range of applications, including log
processing, data mining, ETL (Extract, Transform, Load) operations, and large-scale analytics.
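The replication factor behind the fault-tolerance guarantee can be inspected and changed with standard HDFS commands. A minimal sketch; the file path is hypothetical.

    # Sketch: check and change a file's HDFS replication factor.
    import subprocess

    # 'hdfs dfs -stat %r' prints the replication factor of a file.
    subprocess.run(["hdfs", "dfs", "-stat", "%r", "/user/demo/local_data.txt"])

    # '-setrep -w 3' sets the factor to 3 and waits for re-replication.
    subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3",
                    "/user/demo/local_data.txt"])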

Advantages and Descriptions
Wide Ecosystem: Hadoop has a rich ecosystem of tools, frameworks, and libraries that extend its
capabilities. These include frameworks like Apache Spark, Apache Hive, Apache Pig, and Apache
HBase, among others. These tools provide additional functionality such as real-time processing,
SQL-like querying, data analysis, and NoSQL database capabilities, making Hadoop a
comprehensive platform for big data processing and analytics.
Data Locality: Hadoop leverages the concept of data locality, which refers to processing data
where it resides. By bringing the computation closer to the data, Hadoop minimizes network
traffic and improves overall processing performance. This feature is particularly beneficial when
dealing with large datasets distributed across multiple nodes.
Open-Source Community: Hadoop is an open-source project with a vibrant community of
developers and contributors. This community actively supports the framework by providing bug
fixes, updates, and new features. The open-source nature of Hadoop ensures continuous
improvement, innovation, and the availability of a vast array of resources, documentation, and
community support.
Integration with Existing Systems: Hadoop can integrate with existing IT infrastructures and
systems. It supports various data sources, including structured, semi-structured, and
unstructured data. This allows organizations to leverage their existing investments and combine
data from different sources for comprehensive analysis and insights.
Security: Hadoop provides robust security features to protect data and ensure compliance with
regulatory requirements. It offers authentication, authorization, and encryption mechanisms to
safeguard sensitive data stored and processed within the Hadoop ecosystem.
