
KCS 061 BIG DATA

GLBITM-3-CSE-D
2020-2021

Subject Faculty: Dr. Upendra Dwivedi


UNIT II
HADOOP
• Hadoop (http://hadoop.apache.org/) is a top-level Apache project in
the Apache Software Foundation that’s written in Java.
• Hadoop is a computing environment built on top of a distributed
clustered file system that was designed specifically for very large-scale
data operations.
• Hadoop was inspired by Google’s work on its Google (distributed) File
System (GFS).
• Hadoop uses the MapReduce programming paradigm, in which work is broken
down into mapper and reducer tasks that manipulate data stored
across a cluster of servers for massive parallelism.
• Hadoop is designed to scan through large data sets to produce its
results through a highly scalable, distributed batch processing system.
• Hadoop is actually the name that creator Doug Cutting’s son gave to
his stuffed toy elephant. In thinking up a name for his project, Cutting
was apparently looking for something that was easy to say and stands
for nothing in particular, so the name of his son’s toy seemed to make
perfect sense.
• Hadoop is generally seen as having two parts:
• a file system (the Hadoop Distributed File System)
• and a programming paradigm (MapReduce)
• One of the key components of Hadoop is the redundancy built into
the environment.
• data redundantly stored in multiple places across the cluster
• programming model - failures are expected and resolved automatically by
running portions of the program on various servers in the cluster.
• It is well known that commodity hardware components will fail (especially
when you have very large numbers of them), but this redundancy provides
fault tolerance and a capability for the Hadoop cluster to heal itself.
• Hadoop scales out workloads across large clusters of inexpensive machines to work on
Big Data problems.
• Hadoop-related projects
• Apache Avro (for data serialization),
• Cassandra and HBase (databases),
• Chukwa (a monitoring system specifically designed with large distributed
systems in mind),
• Hive (provides ad hoc SQL-like queries for data aggregation and
summarization),
• Mahout (a machine learning library),
• Pig (a high-level Hadoop programming language that provides a data-flow
language and execution framework for parallel computation),
• ZooKeeper (provides coordination services for distributed applications),
• and more.
Components of Hadoop
• The Hadoop project comprises three pieces:
• Hadoop Distributed File System (HDFS),
• Hadoop MapReduce model,
• and Hadoop Common.
Hadoop Distributed File System
• Data in a Hadoop cluster is broken down into smaller pieces (called blocks)
and distributed throughout the cluster. Copies of these blocks are stored on
other servers in the Hadoop cluster.
• an individual file is actually stored as smaller blocks that are replicated across
multiple servers in the entire cluster.
• the map and reduce functions can be executed on smaller subsets of your
larger data sets, and this provides the scalability that is needed for Big Data
processing.
• use commonly available servers in a very large cluster, where each server has
a set of inexpensive internal disk drives.
• MapReduce tries to assign workloads to these servers where the data to be
processed is stored. (Data Locality)
Figure:- example of how data blocks are written to HDFS. Notice how (by default) each block is written three times
and at least one block is written to a different server rack for redundancy. (Understanding Big Data Analytics, Paul
C. Zikopoulos)
• Think of a file that contains the phone numbers for everyone. The people
with a last name starting with A might be stored on server 1, B on server 2,
and so on.
• In a Hadoop world, pieces of this phonebook would be stored across the
cluster, and to reconstruct the entire phonebook,
• Program would need the blocks from every server in the cluster.
• HDFS replicates these smaller pieces onto two additional servers by default.
• A data file in HDFS is divided into blocks; the default block size in Apache
Hadoop 1.x is 64 MB (128 MB in Hadoop 2.x and later).
• All of Hadoop’s data placement logic is managed by a special server called
NameNode.
• This NameNode server keeps track of all the data files in HDFS, such as where
the blocks are stored, and more.
• All of the NameNode’s information is stored in memory, which allows it to
provide quick response times to storage manipulation or read requests.
• Interaction with HDFS
• write your own Java applications to perform some of the functions.
• different HDFS commands to manage and manipulate files in the file system.
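• As an illustration of the Java route, the sketch below uses Hadoop's FileSystem API to list a directory and read a file from HDFS. This is a minimal sketch: the NameNode address and the paths /user/demo and /user/demo/input.txt are placeholder assumptions, not from these slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: the NameNode address; normally picked up from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // List the contents of a directory (placeholder path)
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }

        // Read a file line by line from HDFS
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}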
MapReduce
• MapReduce is a programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a Hadoop
cluster.
• The term MapReduce actually refers to two separate and distinct
tasks that Hadoop programs perform.
• map job
• reduce job
• The first is the map job, which takes a set of data and converts it into
another set of data, where individual elements are broken down into
tuples (key/value pairs).
• The reduce job takes the output from a map as input and combines
those data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.
• Example
• Consider five files, where each file contains two columns (a key and a value in
Hadoop terms).
• Here the key represents a city and the value represents the corresponding
temperature recorded in that city on various measurement days.
• Task: find the maximum temperature for each city across all of the data files.
o The following snippet shows a sample of the data from one of the test files:
• Toronto, 20
• Whitby, 25
• New York, 22
• Rome, 32
• Toronto, 4
• Rome, 33
• New York, 18
o The MapReduce framework breaks this down into five map tasks, where each
mapper works on one of the five files; the mapper task goes through its file and
returns the maximum temperature for each city it sees (a Java sketch of this
mapper and reducer follows the figure below).

(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)


o Assume the other four mapper tasks (working on the four files not shown here)
produced the following intermediate results:

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)


(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
o All five of these output streams would be fed into the reduce tasks, which
combine the input results and output a single value for each city, producing a
final result set as follows:

(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)


Figure :- The flow of data in a simple MapReduce job . (Understanding Big Data Analytics, Paul C.
Zikopoulos)
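• A minimal sketch of the mapper and reducer for the temperature example above is shown next; the class names and parsing details are illustrative assumptions, not from the original slides. It follows the same pattern as the word-count code on the following slides, except that the reduce step takes a maximum instead of a sum.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parses lines such as "Toronto, 20" into (city, temperature) pairs
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        if (parts.length == 2) {
            context.write(new Text(parts[0].trim()),
                    new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}

// Reducer: keeps the maximum temperature seen for each city
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}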
• In a Hadoop cluster, a MapReduce program is referred to as a job.
• A job is executed by breaking it down into smaller pieces called
tasks.
• An application submits a job to a specific node in a Hadoop cluster,
which is running a daemon called the JobTracker.
• The JobTracker communicates with the NameNode to find out where
all of the data required for this job exists across the cluster, and then
breaks the job down into map and reduce tasks for each node to work
on in the cluster.
• These tasks are scheduled on the nodes in the cluster where the data
exists.
• In a Hadoop cluster, a set of continually running daemons, referred to
as TaskTracker agents, monitor the status of each task.
• If a task fails to complete, the status of that failure is reported back to
the JobTracker, which will then reschedule that task on another node
in the cluster.
• All MapReduce programs that run natively under Hadoop are written
in Java, and it is the Java Archive file (jar) that’s distributed by the
JobTracker to the various Hadoop cluster nodes to execute the map
and reduce tasks.
Hadoop Word-Count Example
• Mapper Code
• Importing packages
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
Hadoop Word-Count Example
public class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line, " ");
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, one);
        }
    }
}
Hadoop Word-Count Example
• Reducer Code
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        Iterator<IntWritable> valuesIt = values.iterator();
        while (valuesIt.hasNext()) {
            sum = sum + valuesIt.next().get();
        }
        context.write(key, new IntWritable(sum));
    }
}
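• Driver Code
• The slides above show only the mapper and reducer; a driver class is also needed to configure and submit the job. Below is a minimal driver sketch for the MapClass and ReduceClass above; the class name WordCountDriver and the use of command-line arguments for the input and output paths are assumptions, not part of the original example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(MapClass.class);
        job.setReducerClass(ReduceClass.class);
        // ReduceClass can also serve as a combiner because summing is associative
        job.setCombinerClass(ReduceClass.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output HDFS paths passed on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

• The compiled classes are packaged into a jar and submitted to the cluster, after which the job is broken into map and reduce tasks as described earlier.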
Hadoop Common Components
• The Hadoop Common Components are a set of libraries that support
the various Hadoop subprojects.
• HDFS shell commands (some examples) –
• cat - Copies the file to standard output (stdout).
• chmod - Changes the permissions for reading and writing to a given file or
set of files.
• chown - Changes the owner of a given file or set of files.
• copyFromLocal - Copies a file from the local file system into HDFS.
• copyToLocal - Copies a file from HDFS to the local file system.
• cp - Copies HDFS files from one directory to another.
• expunge - Empties all of the files that are in the trash.
• ls - Displays a listing of files in a given directory.
• mkdir - Creates a directory in HDFS.
• mv - Moves files from one directory to another.
• rm - Deletes a file and sends it to the trash.
Application Development in Hadoop
• Several application development languages have emerged that run on
top of Hadoop.
• Pig and PigLatin
• Hive
• Jaql
• ZooKeeper
• HBase
• Pig and PigLatin
• Pig was initially developed at Yahoo! to let people using Hadoop focus more on
analyzing large data sets and spend less time having to write
mapper and reducer programs.
• Pig programming language is designed to handle any kind of data.
• Pig is made up of two components:
• the first is the language itself, which is called PigLatin.
• The second is a runtime environment where PigLatin programs are executed.
• Hive
• Facebook developed a runtime Hadoop support structure that allows
anyone who is already fluent with SQL to leverage the Hadoop
platform.
• Hive allows SQL developers to write Hive Query Language (HQL)
statements that are similar to standard SQL statements.
• HQL statements are broken down by the Hive service into MapReduce
jobs and executed across a Hadoop cluster.
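• As a hedged illustration, HQL can also be submitted from Java through Hive's JDBC driver; this sketch assumes a running HiveServer2 instance, and the host name, table, and query below are placeholders, not from these slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC driver on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder connection string for a HiveServer2 instance
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
        Statement stmt = con.createStatement();

        // An HQL statement that the Hive service turns into MapReduce jobs
        ResultSet rs = stmt.executeQuery(
                "SELECT city, MAX(temperature) FROM readings GROUP BY city");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}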
• Jaql
• Jaql, developed by IBM, is primarily a query language for JavaScript Object
Notation (JSON) that allows processing of both structured and nontraditional
data.
• Jaql allows you to select, join, group, and filter data that is stored in HDFS.
• Jaql's query language was inspired by several languages, including Lisp, SQL, XQuery, and Pig.
• ZooKeeper
• Apache ZooKeeper is an open-source server for highly reliable distributed
coordination of cloud applications.
• ZooKeeper is essentially a service for distributed systems offering a hierarchical
key-value store, which is used to provide a distributed configuration service,
synchronization service, and naming registry for large distributed systems.
• ZooKeeper is an open source Apache project that provides a centralized
infrastructure and services that enable synchronization across a cluster.
• ZooKeeper maintains common objects needed in large cluster environments.
• Examples of these objects include configuration information, hierarchical
naming space, and so on.
• Applications can leverage these services to coordinate distributed processing
across large clusters.
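• A minimal sketch of using the ZooKeeper Java client to store and read a piece of shared configuration; the connection string, znode path, and configuration value are placeholder assumptions.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; the watcher is omitted for brevity
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, null);

        // Create a znode holding a configuration value (if it does not exist yet)
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can now read the same shared value
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}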
• HBase
• HBase is a column-oriented database management system that runs on top of HDFS.
• Unlike relational database systems, HBase does not support a structured query language
like SQL.
• An HBase system comprises a set of tables. Each table contains rows and columns, much
like a traditional database.
• Each table must have an element defined as a Primary Key, and all access attempts to
HBase tables must use this Primary Key.
• An HBase column represents an attribute of an object; for example, if the table is storing
diagnostic logs from servers in your environment, where each row might be a log record,
a typical column in such a table would be the timestamp of when the log record was
written, or the server name where the record originated.
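• A hedged sketch of this diagnostic-log example using the HBase Java client API; the table name "logs", the column family "info", and the row key are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLogExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("logs"))) {

            // Write one log record; the row key acts as the primary key
            Put put = new Put(Bytes.toBytes("server01-0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("timestamp"),
                    Bytes.toBytes("2021-03-01T10:15:00"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("servername"),
                    Bytes.toBytes("server01"));
            table.put(put);

            // Read the record back by its row key
            Result result = table.get(new Get(Bytes.toBytes("server01-0001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("servername"))));
        }
    }
}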

Data Formats for Hadoop
• The Hadoop ecosystem is designed to process large volumes of data
distributed through the MapReduce programming model.
• Hadoop Distributed File System (HDFS) is a distributed file system
designed for large-scale data processing where scalability, flexibility
and performance are critical.
• Hadoop works in a master/slave architecture to store data in HDFS
and is based on the principle of storing a small number of very large files.
• In HDFS two services are executed: Namenode and Datanode.
• The Namenode manages the namespace of the file system, maintaining the
file system tree and the metadata for all files and directories. This
information is stored durably on the local disk in the form of two files:
the namespace image and the edit log. The Namenode also knows the
Datanodes on which the blocks of a file are located.
• The default size of an HDFS block is 128 MB in Hadoop 2.x and later (64 MB in earlier versions).
• HDFS blocks are large in order to minimize the cost of seeks: if a block is
large enough, the time to transfer the data from the disk can be
significantly longer than the time needed to seek to the start of the block.
• Blocks fit well with replication to provide fault tolerance and
availability. Each block is replicated on several separate machines.
• Hadoop allows information to be stored in any format, whether
structured, semi-structured, or unstructured data. In addition, it also
provides support for formats optimized for storage and processing in
HDFS.
• Hadoop does not have a default file format and the choice of a format
depends on its use.
• The choice of an appropriate file format can produce the following
benefits: optimal write time, optimal read time, file splittability,
schema evolution support, and compression support.
• Each format has advantages and disadvantages, and each stage of data
processing will need a different format to be more efficient.
• The objective is to choose a format that maximizes advantages and
minimizes inconveniences.
• Choosing an HDFS file format appropriate to the type of work that will be
done with it can ensure that resources are used efficiently.
• Most common formats of the Hadoop ecosystem:
• Text/CSV
• SequenceFile
• Avro Data Files
• Parquet
• RCFile (Record Columnar File)
• ORC (Optimized Row Columnar)

• Text/CSV -
• A text file is the most basic and a human-readable file. It can be read or
written in any programming language and is mostly delimited by comma or
tab.
• The text file format consumes more space when a numeric value needs to be
stored as a string. It is also difficult to represent binary data such as an image.
• A plain text file or CSV is the most common format both outside and within
the Hadoop ecosystem.
• The disadvantage in the use of this format is that it does not support block
compression, so the compression of a CSV file in Hadoop can have a high cost
in reading.
• The plain text format or CSV would only be recommended in case of
extractions of data from Hadoop or a massive data load from a file.
• SequenceFile –
• The SequenceFile format stores the data in binary format.
• The sequencefile format can be used to store an image in the binary format.
• They store key-value pairs in a binary container format and are more efficient
than a text file. However, sequence files are not human-readable.
• This format supports compression; however, it does not store metadata, and
the only schema evolution option is to append new fields at the end.
• This is usually used to store intermediate data in the input and output of
MapReduce processes.
• The SequenceFile format is recommended in case of storing intermediate
data in MapReduce jobs.
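• A hedged sketch of writing key-value pairs to a SequenceFile with the Hadoop Java API; the output path and the (city, temperature) records are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/temps.seq"); // placeholder HDFS path

        // Write (city, temperature) pairs in the binary SequenceFile format
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("Toronto"), new IntWritable(20));
            writer.append(new Text("Rome"), new IntWritable(33));
        }
    }
}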
• Avro Data Files -
• Avro is a row-based storage format.
• This format includes in each file the definition of the schema of its data in
JSON format, improving interoperability and allowing schema
evolution.
• Avro also supports block compression and is splittable, making it a
good choice for most cases when using Hadoop.
• Avro is a good choice when the data schema may evolve over time.
• Avro Data Files -
• The Avro file format has efficient storage due to optimized binary encoding. It
is widely supported both inside and outside the Hadoop ecosystem.
• The Avro file format is ideal for long-term storage of important data. It can
read from and write in many languages like Java, Scala and so on.
• Schema metadata can be embedded in the file to ensure that it will always be
readable. Schema evolution can accommodate changes.
• The Avro file format is considered the best choice for general-purpose storage
in Hadoop.
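• A hedged sketch of writing an Avro data file with the Avro Java API, embedding the JSON schema in the file header; the schema, record values, and output file name are illustrative assumptions.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Illustrative schema: a city name and a temperature reading
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
                + "{\"name\":\"city\",\"type\":\"string\"},"
                + "{\"name\":\"temperature\",\"type\":\"int\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("city", "Toronto");
        record.put("temperature", 20);

        // The schema is stored in the file header, so readers never need it separately
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("readings.avro"));
            writer.append(record);
        }
    }
}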


• Parquet -
• Parquet is a column-oriented binary storage format that can store
nested data structures.
• This format is very efficient in terms of disk input/output operations when only
the necessary columns are read.
• This format is highly optimized for use with Cloudera Impala.
• Parquet is a columnar format developed by Cloudera and Twitter.
• It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on.
• Parquet file format uses advanced optimizations described in Google’s Dremel paper.
These optimizations reduce the storage space and increase performance.
• This Parquet file format is considered the most efficient for adding multiple records
at a time. Some optimizations rely on identifying repeated patterns.
• RCFile (Record Columnar File) -
• RCFile is a columnar format that divides data into groups of rows, and inside
it, data is stored in columns.
• This format does not support schema evolution; to add a new column it is
necessary to rewrite the file, which slows down the process.
• ORC (Optimized Row Columnar) -
• ORC is considered an evolution of the RCFile format and has all of its benefits
along with some improvements, such as better compression, allowing
faster queries.
• This format also does not support schema evolution.
• ORC is recommended when query performance is important.
Hadoop – Streaming

Figure :- Hadoop Streaming (https://data-flair.training/blogs/hadoop-streaming/)


• Hadoop Streaming enables us to create and run MapReduce jobs with scripts
in any language, Java or non-Java, as the mapper/reducer.
• The Hadoop MapReduce framework is written in Java and, by default,
provides support for writing map/reduce programs in Java only.
• But Hadoop provides an API for writing MapReduce programs in languages
other than Java.
• Hadoop Streaming is the utility that allows us to create and run
MapReduce jobs with any script or executable as the mapper or the
reducer.
• It uses Unix streams as the interface between Hadoop and our
MapReduce program, so we can use any language that can read
standard input and write to standard output to write our
MapReduce program.
• Hadoop Streaming supports the execution of both Java and non-Java
MapReduce jobs over the Hadoop cluster.
• It supports the Python, Perl, R, PHP, and C++ programming languages.
How Streaming Works?

Figure :- Hadoop Streaming Working (https://data-flair.training/blogs/hadoop-streaming/)


• The mapper and the reducer are the scripts that read the input line-by-line
from stdin and emit the output to stdout.
• The utility creates a MapReduce job, submits the job to an appropriate
cluster, and monitors the job's progress until its completion.
• When a script is specified for mappers, then each mapper task launches
the script as a separate process when the mapper is initialized.
• The mapper task converts its inputs (key/value pairs) into lines and pushes
the lines to the standard input of the process. Meanwhile, the mapper
collects the line-oriented outputs from the standard output of the process and
converts each line into a (key, value) pair, which is collected as the result of the
mapper.
• When a script is specified for reducers, each reducer task launches the
script as a separate process when the reducer is initialized.
• As the reducer task runs, it converts its input key/value pairs into lines and
feeds the lines to the standard input of the process. Meanwhile, the reducer
gathers the line-oriented outputs from the stdout of the process and
converts each line into a key/value pair, which is then collected as
the result of the reducer.
• For both the mapper and the reducer, the prefix of a line up to the first tab
character is the key, and the rest of the line (excluding the tab
character) is the value. If there is no tab character in the line, the entire line is
considered the key and the value is null. This is customizable by
setting the -inputformat command option for the mapper and the -outputformat
option for the reducer.
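• To illustrate this stdin/stdout contract, the sketch below is a word-count mapper written as a plain Java program with no Hadoop classes; any executable with the same behavior, in any language, could be plugged in as the streaming mapper. The class name is an assumption, not from these slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming mapper: reads raw lines from stdin and writes
// tab-separated (key, value) pairs to stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Everything before the first tab is the key, the rest is the value
                    System.out.println(word + "\t" + 1);
                }
            }
        }
    }
}

• Such a program would be passed to the streaming utility with the -mapper option, with a companion reducer passed via -reducer.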
HADOOP PIPES
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
• Unlike Streaming, which uses standard input and output to communicate
with the map and reduce code, Pipes uses sockets as the channel over
which the tasktracker communicates with the process running the C++ map
or reduce function.
• The application links against the Hadoop C++ library, which is a thin
wrapper for communicating with the tasktracker child process.
• The map and reduce functions are defined by extending the Mapper and
Reducer classes defined in the HadoopPipes namespace and providing
implementations of the map() and reduce() methods in each case.
• These methods take a context object (of type MapContext or ReduceContext),
which provides the means for reading input and writing output, as well as
accessing job configuration information via the JobConf class.
• Unlike the Java interface, keys and values in the C++ interface are byte buffers,
represented as Standard Template Library (STL) strings. This makes the interface
simpler, although it does put a slightly greater burden on the application
developer, who has to convert to and from richer domain-level types. This is
evident in MaxTemperatureReducer (from the max-temperature example in
Hadoop: The Definitive Guide), where we have to convert the input value
into an integer (using a convenience method in HadoopUtils) and then convert the
maximum value back into a string before it is written out. In some cases, we can
avoid the conversion, such as in MaxTemperatureMapper, where the
airTemperature value is never converted to an integer since it is never processed
as a number in the map() method.
• The main() method is the application entry point. It calls
HadoopPipes::runTask, which connects to the Java parent process and
marshals data to and from the Mapper or Reducer.
• The runTask() method is passed a Factory so that it can create
instances of the Mapper or Reducer. Which one it creates is
controlled by the Java parent over the socket connection.
• There are overloaded template factory methods for setting a
combiner, partitioner, record reader, or record writer.

HADOOP ECOSYSTEM
