
Unit-II

Big Data Architecture and Distributed Computing Using Hadoop
Google Query Architecture:

• Google invented the first Big Data architecture. Their goal was to gather all the information on the web, organize it, and search it for specific queries from millions of users.
• An additional goal was to find a way to monetize this service by serving relevant and prioritized online advertisements on behalf of clients.
• Google developed web crawling agents which would follow all the links on the web and make a copy of all the content on all the webpages they visited.
• Google invented cost-effective, resilient, and fast ways to store and process all that exponentially growing data.
• Google Web Servers (GWS) operates within a local cluster.
• The first phase of query execution involves index servers consulting an inverted index that maps each query keyword to a list of matching documents.
• Relevance scores are also computed for matching documents so
that the search result returned to the user is ordered by score.
• In the second phase, document servers fetch each document
from disk to extract the title and the keyword-in-context portion
of the document.
• In addition to these two phases, the GWS also activates the spell checker and the ad server. The spell checker verifies that the spelling of the query keywords is correct, while the ad server generates advertisements that relate to the query and may therefore interest the user.
• The GFS node cluster is a single master with multiple chunk
servers that are continuously accessed by different client
systems.
• Chunk servers store data as Linux files on local disks. Stored
data is divided into large chunks (64 MB), which are
replicated in the network a minimum of three times. The
large chunk size reduces network overhead.
• GFS is designed to accommodate Google’s large cluster
requirements without burdening applications.
• Files are stored in hierarchical directories identified by path
names. Metadata - such as namespace, access control data,
and mapping information - is controlled by the master,
which interacts with and monitors the status updates of
each chunk server through timed heartbeat messages.
Standard Big Data Architecture:

A Big Data architecture includes the following components:

• Data sources
• Data storage
• Batch processing
• Real-time message ingestion
• Stream processing
• Analytical data store
• Analysis and reporting
• Orchestration
• Data sources: All big data solutions start with one
or more data sources. Examples include:
– Application data stores, such as relational databases.
– Static files produced by applications, such as web server
log files.
– Real-time data sources, such as IoT devices.

• Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake.
• Batch processing:
– Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate,
and otherwise prepare the data for analysis.
– Usually these jobs involve reading source files, processing them, and
writing the output to new files.

• Real-time message ingestion:
– If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.
– This might be a simple data store, where incoming messages are dropped into a folder for processing.
– However, many solutions need a message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
• Stream processing:
– After capturing real-time messages, the solution must
process them by filtering, aggregating, and otherwise
preparing the data for analysis.
– The processed stream data is then written to an output sink.
– We can also use open source Apache streaming
technologies like Storm and Spark Streaming in an
HDInsight cluster.
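As a sketch of the stream processing step described above, here is a minimal Spark Streaming word count in Java. It is illustrative only: the socket source on localhost:9999 stands in for a real message broker such as Kafka, and the master setting and batch interval are assumed values.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamWordCount {
    public static void main(String[] args) throws InterruptedException {
        // Local two-thread context for demonstration; a real deployment runs on a cluster.
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Assumed source: a plain text socket standing in for the message ingestion store.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Filter/aggregate the stream: split each line into words and count them per batch.
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();          // the output sink here is simply the console
        jssc.start();
        jssc.awaitTermination();
    }
}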

• Analytical data store:
– Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.
– The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions.
• Analysis and reporting:
– The goal of most big data solutions is to provide insights into the data
through analysis and reporting.
– To empower users to analyze the data, the architecture may include a data
modeling layer, such as a multidimensional OLAP cube or tabular data
model in Azure Analysis Services.
– Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts.
– For these scenarios, many Azure services support analytical notebooks,
such as Jupyter, enabling these users to leverage their existing skills with
Python or R.

• Orchestration:
– Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an
analytical data store, or push the results straight to a report or dashboard.
– To automate these workflows, we can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
When to use this architecture?
Consider this architecture style when we need to:
• Store and process data in volumes too large for a
traditional database.
• Transform unstructured data for analysis and
reporting.
• Capture, process, and analyze unbounded streams
of data in real time, or with low latency.
• Use Azure Machine Learning or Microsoft Cognitive
Services.
Big Data Architecture examples:

• Every major organization and application has a unique, optimized infrastructure to suit its specific needs.
• Here are some architecture examples from some very prominent users and designers of Big Data applications.
IBM Watson
IBM Watson uses Spark to manage incoming data streams. It also uses Spark's machine learning library (MLlib) to analyze data and make predictions.
eBay:
• eBay is the second-largest e-commerce company in the world.
• It delivers 800 million listings from 25 million sellers to 160 million buyers.
• To manage this huge stream of activity, eBay uses a stack of Hadoop, Spark, Kafka, and other elements.
• They consider Kafka to be the best new technology for processing data streams.
Netflix

• This is one of the largest providers of online video entertainment.
• They handle more than 400 billion online events per day.
• As a cutting-edge user of Big Data technologies, they are constantly innovating their mix of technologies to deliver the best performance.
• They use Apache Kafka as the common messaging system for all incoming requests.
• They use Spark for stream processing. They host their entire infrastructure on the Amazon Web Services (AWS) cloud.
• They use AWS S3 as well as Cassandra and HBase to store data.
PayPal
• This payments-facilitation company needs to
understand and acquire customers, and process a
large number of payment transactions.
• They use Apache Storm for stream processing of
data.
• They use Druid for distributed data storage.
• They also use the Titan graph (NoSQL) database to manage links and relationships among data elements.
Introduction to Hadoop Framework
• Hadoop is an open-source software framework for storing large amounts of data and performing computation on it.
• The Hadoop framework application works in an
environment that provides
distributed storage and computation across
clusters of computers.
• Hadoop is designed to scale up from single server
to thousands of machines, each offering local
computation and storage.
Hadoop Architecture
Hadoop has two major layers namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce
• MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• The MapReduce program runs on Hadoop which is an Apache open-source
framework.
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on
commodity hardware.
• It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• It provides high throughput access to application data and is suitable for
applications having large datasets.

Hadoop framework also includes the following two modules −

• Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Example for MapReduce
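The slide above only names the example, so here is the classic WordCount job, closely following the standard Apache Hadoop tutorial example. The map phase emits a (word, 1) pair for every word in its input split, and the reduce phase sums those counts for each word. The input and output paths are assumed to be HDFS directories supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}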
HDFS Design Goals:
• Fast recovery from hardware failures
Because one HDFS instance may consist of thousands of servers,
failure of at least one server is inevitable. HDFS has been built to
detect faults and automatically recover quickly.
• Access to streaming data
HDFS is intended more for batch processing versus interactive use, so
the emphasis in the design is for high data throughput rates, which
accommodate streaming access to data sets.
• Accommodation of large data sets
HDFS accommodates applications that have data sets typically
gigabytes to terabytes in size. HDFS provides high aggregate data
bandwidth and can scale to hundreds of nodes in a single cluster.
• Portability
To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to be compatible with a variety of underlying operating systems.
Master-Slave Hadoop Architecture:

The goal of designing Hadoop is to develop an inexpensive, reliable, and scalable framework that stores and analyzes ever-growing volumes of big data.
• Apache Hadoop is a software framework designed by
Apache Software Foundation for storing and
processing large datasets of varying sizes and
formats.
• Hadoop follows the master-slave architecture for
effectively storing and processing vast amounts of
data. The master nodes assign tasks to the slave
nodes.
• The slave nodes are responsible for storing the actual
data and performing the actual
computation/processing.
• The master nodes are responsible for storing the metadata and managing the resources across the cluster.
• Slave nodes store the actual business data, whereas
the master stores the metadata.
• NameNode stores the filesystem metadata, i.e files
names, information about blocks of a file, blocks
locations, permissions, etc. It manages the Datanodes.
• DataNodes are the slave nodes that store the actual business data. They serve client read/write requests based on the NameNode's instructions.
• DataNodes store the blocks of the files, and the NameNode stores the metadata such as block locations, permissions, etc. (see the sketch after this list).
• The NodeManager is responsible for launching and
managing containers on a node. Containers execute
tasks as specified by the Master.
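To make the NameNode's role concrete, the sketch below uses the public Java FileSystem API to ask for the block locations of one file; the NameNode answers from its metadata, returning the DataNode host names that hold each block replica. The file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Cluster settings are read from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with any file already stored in HDFS.
        Path file = new Path("/user/hadoop/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this query from its metadata:
        // for each block of the file, which DataNodes hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}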
Installing HDFS - Reading and Writing Local
Files into HDFS.
Before we start using HDFS, we should install Hadoop.
– Hadoop installation on a single node
• It has one DataNode running and setting up all the
NameNode, DataNode, Resource Manager, and
NodeManager on a single machine.
– Hadoop installation on Multi-node cluster
• contains two or more DataNodes in a distributed Hadoop
environment.
• This setup is used in practice by organizations to store and analyze their petabytes and exabytes of data.
Hadoop installation on a single node:
I. Recommended Platform
• OS: Ubuntu 12.04 or later.
• Hadoop: Cloudera distribution for Apache Hadoop
– Cloudera's open source platform distribution, which
includes Apache Hadoop and is built specifically to
meet enterprise demands
II. Prerequisites
• Java (Oracle Java is recommended)
• Password-less SSH (Secure Shell) setup
(Hadoop needs password-less SSH from the master to all the slaves; this is required for remote script invocations)
III. Install Hadoop
a. Download Hadoop

b. Deploy Hadoop
- Untar Tarball
- A TAR file (.tar) is a special file that contains a collection of multiple files combined into a single file.

c. Setup Configuration
- Edit hadoop-env.sh
- Set Hadoop-specific environment variables here. The only required environment variable is JAVA_HOME.
- Edit core-site.xml
- The core-site.xml file informs the Hadoop daemons where the NameNode runs in the cluster.
- Edit hdfs-site.xml
- The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
- Edit mapred-site.xml
- The mapred-site.xml file contains the configuration settings for MapReduce, such as which framework (for example, YARN) is used to run MapReduce jobs; a programmatic sketch of these settings follows below.
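The exact contents of these files depend on the cluster, but the same keys can also be set programmatically through Hadoop's Configuration API. The sketch below is illustrative only: the NameNode host/port and the replication factor are assumed values that would normally live in core-site.xml, hdfs-site.xml, and mapred-site.xml.

import org.apache.hadoop.conf.Configuration;

public class MinimalHadoopConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();

        // core-site.xml: tells daemons and clients where the NameNode runs (assumed host/port).
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // hdfs-site.xml: HDFS daemon settings; a single-node setup cannot hold three replicas.
        conf.set("dfs.replication", "1");

        // mapred-site.xml: run MapReduce jobs on YARN rather than the local job runner.
        conf.set("mapreduce.framework.name", "yarn");

        return conf;
    }
}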
IV. Start The Cluster
a. Format the name node
– Format the active NameNode by specifying the Cluster ID.
b. Start Hadoop Services
c. Check whether services have been started

V. Stop Hadoop Services

Once your work is done, you can stop the services.
Hadoop installation on Multi-node cluster

I. Recommended Platform (Same as Single node)


II. Prerequisites (Same as Single node)
III. Configure, Setup and Install Hadoop 2.7 on Ubuntu
a. Download Hadoop
b. Untar Tar ball
c. Setup Configuration
- Edit .bashrc
- The .bashrc file is a script that is executed when a user logs in to a terminal session.
- Edit hadoop-env.sh
- Edit core-site.xml
- Edit hdfs-site.xml        (same as the single-node setup)
- Edit mapred-site.xml
- Edit yarn-site.xml
– YARN supports an extensible resource model. By default, YARN tracks CPU and memory.
IV. Start the Cluster
a. Format the name node
b. Start HDFS Services
c. Start YARN Services
d. Check whether services have been started
V. Run Map-Reduce Jobs
Ex: WordCount
VI. Stop the Cluster
a. Stop HDFS Services
b. Stop YARN Services
Reading and Writing Local Files/Data stream into HDFS.
Anatomy of File Read in HDFS:
• Let's see how data flows between the client interacting with HDFS, the name node, and the data nodes in the following steps:
Step 1: The client opens the file it wishes to read by calling open() on the File System object (which for HDFS is an instance of DistributedFileSystem).

Step 2:
• Distributed File System(DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file.
• For each block, the name node returns the addresses of the data nodes
that have a copy of that block.
• The DFS returns an FSDataInputStream to the client for it to read data
from.
• FSDataInputStream in turn wraps a DFSInputStream, which manages the
data node and name node I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client,
which calls read() repeatedly on the stream.

Step 5:
• When the end of the block is reached, DFSInputStream will
close the connection to the data node, then finds the best
data node for the next block.
• Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream.
• It will also call the name node to retrieve the data node
locations for the next batch of blocks as needed.

Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
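Here is a minimal client-side sketch of this read path using the public Java FileSystem API; the HDFS path is hypothetical, and the cluster address is assumed to come from the configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // a DistributedFileSystem when fs.defaultFS is hdfs://

        Path file = new Path("/user/hadoop/sample.txt");   // hypothetical HDFS path

        // Steps 1-2: open() asks the name node (via RPC) for block locations and
        // returns an FSDataInputStream wrapping a DFSInputStream.
        try (FSDataInputStream in = fs.open(file)) {
            // Steps 3-5: read() streams data from the closest data node holding each block.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }   // Step 6: close() is called automatically by try-with-resources.
        fs.close();
    }
}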
Anatomy of File Write in HDFS:
• HDFS follows the Write once Read many times model. In
HDFS we cannot edit the files which are already stored in
HDFS, but we can append data by reopening the files.
Step 1: The client creates the file by calling create() on the Distributed File System (DFS).

Step 2:
• DFS makes an RPC call to the name node to create a new
file in the file system’s namespace, with no blocks
associated with it.
• The name node performs various checks to make sure the
file doesn’t already exist and that the client has the right
permissions to create the file.
• If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and the client is thrown an error (an IOException). The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3:
• As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue.
• The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas.
• The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline.
• The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and
forwards it to the third (and last) data node in the pipeline.

Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the "ack queue".

Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
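For comparison with the read path, here is a minimal client-side sketch of the write path using the same public FileSystem API; the path and the written text are hypothetical.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/output.txt");     // hypothetical HDFS path

        // Steps 1-2: create() asks the name node (via RPC) to record the new file
        // and returns an FSDataOutputStream backed by a DFSOutputStream.
        try (FSDataOutputStream out = fs.create(file)) {
            // Steps 3-5: written bytes are split into packets and pushed through
            // the data node replication pipeline behind the scenes.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }   // Step 6: close() flushes remaining packets and signals the name node.
        fs.close();
    }
}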
HDFS follows the Write Once Read Many model. So, we can't edit files that are already stored in HDFS, but we can append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the data nodes in the cluster.
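To illustrate the append-by-reopening point, here is a minimal sketch using FileSystem.append(), assuming the file already exists in HDFS and the cluster permits appends (the default in recent Hadoop releases); the path is hypothetical.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/output.txt");   // hypothetical file already stored in HDFS

        // Existing blocks are never rewritten; append() reopens the file and
        // new data is added after the current end of the file.
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("one more line\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}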
