Hadoop Word-Count Example
• Mapper Code
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reusable Writable instances avoid allocating new objects per record.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line on spaces and emit (word, 1) for each token.
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line, " ");
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, one);
        }
    }
}
Hadoop Word-Count Example
• Reducer Code
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word.
        int sum = 0;
        Iterator<IntWritable> valuesIt = values.iterator();
        while (valuesIt.hasNext()) {
            sum = sum + valuesIt.next().get();
        }
        context.write(key, new IntWritable(sum));
    }
}
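The mapper and reducer above are tied together by a driver class that configures and submits the job. Below is a minimal, illustrative driver sketch; the class name WordCountDriver and the reuse of ReduceClass as a combiner are assumptions, not part of the original example.
• Driver Code (illustrative sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(MapClass.class);
        // The reducer also works as a combiner here (an assumption, not required).
        job.setCombinerClass(ReduceClass.class);
        job.setReducerClass(ReduceClass.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}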
Hadoop Common Components
• The Hadoop Common Components are a set of libraries that support
the various Hadoop subprojects.
• HDFS shell commands (some examples; Java equivalents are sketched after this list) –
• cat - Copies the file to standard output (stdout).
• chmod - Changes the permissions for reading and writing to a given file or set of files.
• chown - Changes the owner of a given file or set of files.
• copyFromLocal - Copies a file from the local file system into HDFS.
• copyToLocal - Copies a file from HDFS to the local file system.
• cp - Copies HDFS files from one directory to another.
• expunge - Empties all of the files that are in the trash.
• ls - Displays a listing of files in a given directory.
• mkdir - Creates a directory in HDFS.
• mv - Moves files from one directory to another.
• rm - Deletes a file and sends it to the trash.
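These shell commands have programmatic counterparts in Hadoop's Java FileSystem API. The sketch below mirrors a few of them (mkdir, copyFromLocal, ls, rm); the paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo"));                         // mkdir
        fs.copyFromLocalFile(new Path("words.txt"),
                             new Path("/user/demo/words.txt"));    // copyFromLocal
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) { // ls
            System.out.println(status.getPath());
        }
        // Unlike the rm shell command, delete() bypasses the trash.
        fs.delete(new Path("/user/demo/words.txt"), false);
    }
}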
Application Development in Hadoop
• Several application development languages have emerged that run on
top of Hadoop.
• Pig and PigLatin
• Hive
• Jaql
• ZooKeeper
• HBase
• Pig and PigLatin
• Pig was initially developed at Yahoo! so that people using Hadoop could focus more on analyzing large data sets and spend less time writing mapper and reducer programs.
• The Pig programming language is designed to handle any kind of data.
• Pig is made up of two components:
• The first is the language itself, which is called PigLatin.
• The second is a runtime environment where PigLatin programs are executed, as sketched below.
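As an illustration of the two components working together, PigLatin can be run from Java through Pig's PigServer class. The word-count script and file names below are illustrative assumptions.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Run Pig locally; ExecType.MAPREDUCE would target a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // A word count written in PigLatin (script and paths are illustrative).
        pig.registerQuery("lines = LOAD 'words.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "wordcount_out");
    }
}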
• Hive
• Facebook developed Hive, a runtime Hadoop support structure that allows anyone already fluent in SQL to leverage the Hadoop platform.
• Hive allows SQL developers to write Hive Query Language (HQL)
statements that are similar to standard SQL statements.
• HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster, as in the sketch below.
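A minimal sketch of querying Hive from Java through its standard JDBC driver; the server address, credentials, and the table name docs are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 address and credentials are illustrative assumptions.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
        Statement stmt = conn.createStatement();
        // An HQL statement; Hive compiles it into MapReduce jobs.
        ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS n FROM docs GROUP BY word");
        while (rs.next()) {
            System.out.println(rs.getString("word") + "\t" + rs.getLong("n"));
        }
        conn.close();
    }
}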
• Jaql
• Jaql, developed by IBM, is primarily a query language for JavaScript Object Notation (JSON) that can process both structured and nontraditional data.
• Jaql lets you select, join, group, and filter data that is stored in HDFS.
• Jaql's query language draws on Lisp, SQL, XQuery, and Pig.
• ZooKeeper
• Apache ZooKeeper is an open-source server for highly reliable distributed
coordination of cloud applications.
• ZooKeeper is essentially a service for distributed systems offering a hierarchical
key-value store, which is used to provide a distributed configuration service,
synchronization service, and naming registry for large distributed systems.
• As an open-source Apache project, ZooKeeper provides a centralized infrastructure and services that enable synchronization across a cluster.
• ZooKeeper maintains common objects needed in large cluster environments; examples include configuration information and a hierarchical naming space.
• Applications can leverage these services to coordinate distributed processing across large clusters, as sketched below.
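A minimal sketch of the ZooKeeper Java client storing and reading a shared configuration object; the connection string and znode path are illustrative assumptions.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // The ensemble address and znode path are illustrative assumptions.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Store a piece of shared configuration under a znode.
        zk.create("/app-config", "batch.size=64".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any process in the cluster can now read the same value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}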
• HBase
• HBase is a column-oriented database management system that runs on top of HDFS.
• Unlike relational database systems, HBase does not support a structured query language
like SQL.
• An HBase system comprises a set of tables. Each table contains rows and columns, much
like a traditional database.
• Each table must have an element defined as a Primary Key, and all access attempts to
HBase tables must use this Primary Key.
• An HBase column represents an attribute of an object; for example, if the table is storing diagnostic logs from servers in your environment, where each row might be a log record, a typical column in such a table would be the timestamp of when the log record was written, or the server name where the record originated (see the client sketch below).
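A minimal sketch with the HBase Java client, following the log-record example above; the table name, column family, and row key are illustrative assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLogExample {
    public static void main(String[] args) throws Exception {
        // Table name, column family, and row key are illustrative assumptions.
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("server_logs"))) {

            // The row key identifies one log record; all access goes through it.
            Put put = new Put(Bytes.toBytes("log-000001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("timestamp"),
                          Bytes.toBytes("2020-11-05T10:15:00Z"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("servername"),
                          Bytes.toBytes("web-01"));
            table.put(put);

            // Reading also requires the row key.
            Result result = table.get(new Get(Bytes.toBytes("log-000001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("servername"))));
        }
    }
}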
Data Formats for Hadoop
• The Hadoop ecosystem is designed to process large volumes of data
distributed through the MapReduce programming model.
• Hadoop Distributed File System (HDFS) is a distributed file system
designed for large-scale data processing where scalability, flexibility
and performance are critical.
• Hadoop works in a master/slave architecture to store data in HDFS and is based on the principle of storing a few very large files.
• In HDFS, two services run: the Namenode and the Datanode.
• The Namenode manages the namespace of the file system, in addition to maintaining the file system tree and the metadata for all files and directories. This information is permanently stored on the local disk in the form of two files: the namespace image and the edit log. The Namenode also knows the Datanodes where the blocks of a file are located.
• The default size of an HDFS block is 128 MB.
• HDFS blocks are large in order to minimize the cost of seeks: if a block is large enough, the time to transfer the data from disk dominates the time needed to seek to the start of the block.
• Blocks fit well with replication, which provides fault tolerance and availability: each block is replicated on a small number of separate machines (three by default). For example, a 1 GB file stored in 128 MB blocks with the default replication factor of 3 occupies 8 blocks and 24 block replicas across the cluster.
• Hadoop can store information in any format, whether structured, semi-structured, or unstructured. In addition, it provides support for formats optimized for storage and processing in HDFS.
• Hadoop does not have a default file format and the choice of a format
depends on its use.
• The choice of an appropriate file format can produce the following benefits: optimal write time, optimal read time, file splittability, an adaptive schema, and compression support.
• Each format has advantages and disadvantages, and each stage of data processing may need a different format to be more efficient.
• The objective is to choose a format that maximizes advantages and minimizes drawbacks.
• Choosing an HDFS file format appropriate to the type of work that will be done with it can ensure that resources are used efficiently.
• Most common formats of the Hadoop ecosystem:
• Text/CSV
• SequenceFile
• Avro Data Files
• Parquet
• RCFile (Record Columnar File)
• ORC (Optimized Row Columnar)
• Text/CSV -
• A text file is the most basic type of file and is human-readable. It can be read or written in any programming language and is usually delimited by a comma or a tab.
• The text file format consumes more space when a numeric value needs to be
stored as a string. It is also difficult to represent binary data such as an image.
• A plain text file or CSV is the most common format both outside and within
the Hadoop ecosystem.
• The disadvantage of this format is that it does not support block compression, so compressing a CSV file in Hadoop can impose a high read cost.
• The plain text or CSV format is recommended only for extracting data from Hadoop or for bulk-loading data from a file.
• SequenceFile –
• The SequenceFile format stores data in binary form, as key-value pairs in a binary container format, and is more efficient than a text file; it can, for example, store an image as a binary value. Sequence files are not, however, human-readable.
• This format supports compression, but it does not store metadata, and the only schema-evolution option is to add new fields at the end.
• SequenceFile is typically used, and recommended, for storing intermediate data in the input and output of MapReduce jobs (a short read/write sketch follows).
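A minimal sketch of writing and reading a SequenceFile with Hadoop's Java API; the file path and key/value types are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("counts.seq"); // illustrative path

        // Write (Text, IntWritable) pairs into a binary container file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("hadoop"), new IntWritable(7));
            writer.append(new Text("hdfs"), new IntWritable(3));
        }

        // Read the pairs back (the file itself is not human-readable).
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}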
• Avro Data Files -
• Avro is a row-based storage format.
• This format includes in each file the definition of the data's schema in JSON format, improving interoperability and allowing schema evolution.
• Avro also supports block compression and is splittable, making it a good choice for most cases when using Hadoop.
• Avro is a good choice when the data schema may evolve over time.
• The Avro file format has efficient storage due to optimized binary encoding. It
is widely supported both inside and outside the Hadoop ecosystem.
• The Avro file format is ideal for long-term storage of important data. It can be read and written from many languages, such as Java and Scala.
• Schema metadata can be embedded in the file to ensure that it will always be readable, and schema evolution can accommodate changes.
• The Avro file format is considered the best choice for general-purpose storage in Hadoop; a short write sketch follows.
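A minimal sketch of writing an Avro data file with the Java API; the schema and field names are illustrative assumptions. Note how the JSON schema ends up embedded in the file it creates.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // The schema is defined in JSON (record and field names are assumptions).
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"WordCount\",\"fields\":["
                + "{\"name\":\"word\",\"type\":\"string\"},"
                + "{\"name\":\"count\",\"type\":\"int\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("word", "hadoop");
        record.put("count", 7);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("counts.avro")); // schema stored in the file header
            writer.append(record);
        }
    }
}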
• Parquet -
• Parquet is a columnar binary storage format that can store nested data structures.
• This format is very efficient in terms of disk input/output operations when only the necessary columns are read.
• This format is highly optimized for use with Cloudera Impala.
• Parquet is a columnar format developed by Cloudera and Twitter.
• It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on.
• Parquet file format uses advanced optimizations described in Google’s Dremel paper.
These optimizations reduce the storage space and increase performance.
• The Parquet file format is considered the most efficient when adding multiple records at a time; some of its optimizations rely on identifying repeated patterns. A short write sketch follows.
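A minimal sketch of writing a Parquet file from Java through the parquet-avro bridge, one common route; the schema, output path, and the parquet-avro dependency are assumptions.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        // Schema, field names, and output path are illustrative assumptions.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"WordCount\",\"fields\":["
                + "{\"name\":\"word\",\"type\":\"string\"},"
                + "{\"name\":\"count\",\"type\":\"int\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("word", "hadoop");
        record.put("count", 7);

        // Columnar layout plus compression reduces storage and speeds up scans.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("counts.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            writer.write(record);
        }
    }
}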
• RCFile (Record Columnar File) -
• RCFile is a columnar format that divides data into groups of rows, and inside
it, data is stored in columns.
• This format does not support schema evolution; if you want to add a new column, you must rewrite the file, which slows down the process.
• ORC (Optimized Row Columnar) -
• ORC is considered an evolution of the RCFile format and has all of its benefits, together with improvements such as better compression, allowing faster queries.
• This format also does not support schema evolution.
• ORC is recommended when query performance is important.
Hadoop – Streaming