MapReduce (MR) is mainly used for parallel processing of large data sets stored in a
Hadoop cluster. It was originally a programming model designed at Google to provide
parallelism, data distribution, and fault tolerance. MR processes data in the form of
key-value pairs. A key-value (KV) pair is a mapping between two linked data items: a
key and its value.
The key (K) acts as an identifier for the value. An example of a KV pair is one where
the key is a node ID and the value is its properties, including its neighbor nodes,
predecessor node, and so on. The MR API provides features such as batch processing,
parallel processing of huge amounts of data, and high availability.
MR comes into the picture when large data sets have to be processed. Programmers
write MR applications suited to their business scenarios. They have to understand the
MR workflow, develop applications according to that flow, and deploy them across the
Hadoop cluster. Hadoop is built on Java APIs and provides MR APIs that handle
parallel computing across nodes.
The MR workflow passes through several phases, and the end result is stored in HDFS
with replication. The JobTracker takes care of all MR jobs running on the various
nodes in the Hadoop cluster; it plays a vital role in scheduling jobs and keeps track
of all map and reduce jobs. The actual map and reduce tasks are performed by
TaskTrackers.
Hadoop Mapreduce Architecture
The MapReduce architecture consists of two main processing stages: the map stage and
the reduce stage. The actual MR processing happens in the TaskTrackers. Between the
map and reduce stages, an intermediate process takes place, which shuffles and sorts
the mapper output. This intermediate data is stored in the local file system.
Mapper Phase
In the mapper phase, the input data is split into two components, a key and a value.
The key is writable and comparable during processing; the value is writable only.
When a client submits input data to the Hadoop system, the JobTracker assigns tasks
to TaskTrackers, and the input data is divided into several input splits.
Input splits are logical in nature. A record reader converts each input split into
key-value (KV) pairs; this is the actual input format the mapper consumes for further
processing inside the TaskTracker. The input format varies from one type of
application to another, so the programmer has to examine the input data and code
accordingly.
With TextInputFormat, for example, the key is the byte offset of the line and the
value is the entire line. Partitioner and combiner logic is written into the map-side
code to perform special data operations, and data localization occurs on the mapper
nodes. The combiner is also called a mini-reducer: the reducer code is placed in the
mapper as a combiner. When the mapper output is very large, transferring it requires
high network bandwidth; to reduce this bandwidth, the reduce code is run in the
mapper as a combiner for better performance. The default partitioner used in this
process is the hash partitioner.
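The map-side logic described above can be sketched in Python. This is a Hadoop
Streaming-style sketch written as plain functions for clarity, not a real Hadoop job;
the function names are illustrative:

```python
def mapper(line):
    # TextInputFormat gives the mapper a byte offset as key and the
    # line as value; for word count we only need the line's words.
    for word in line.split():
        yield (word, 1)

def combiner(word, counts):
    # Mini-reducer: pre-aggregates counts on the mapper node, so less
    # data crosses the network during the shuffle.
    yield (word, sum(counts))
```

The combiner has the same shape as a reducer, which is why the text calls it a
mini-reducer placed inside the mapper.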
The partitioner module in Hadoop plays a very important role in partitioning the data
received from the mappers or combiners. The partitioner reduces the pressure that
builds on the reducers and improves performance. A custom partitioner can be written
to partition on any relevant field or condition. There are also static and dynamic
partitions, which play an important role in Hadoop as well as Hive. The partitioner
splits the data among the reducers, and hence among the output files, at the end of
the map phase; developers design this partitioning code according to the business
requirements. The partitioner runs between the mapper and the reducer, and
well-partitioned output is very efficient for query purposes.
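The idea behind the default hash partitioner can be sketched as follows. Hadoop's real
HashPartitioner uses the key's Java hashCode; md5 is used here only as an assumed
stand-in to keep the sketch deterministic across runs:

```python
import hashlib

def hash_partition(key, num_reducers):
    # Every occurrence of the same key maps to the same reducer,
    # so all values for a key meet in one place.
    digest = int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)
    return digest % num_reducers
```

A custom partitioner simply replaces this function with one that inspects whatever
field or condition the business logic requires.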
Intermediate Process
The mapper output data undergoes shuffling and sorting in the intermediate process.
The intermediate data is stored in the local file system, without replication across
Hadoop nodes. This intermediate data is the data generated by the map-side
computation. Hadoop uses a round-robin algorithm to write the intermediate data to
local disk, and several other sorting factors determine when the data is spilled to
local disk.
Reducer Phase
The shuffled and sorted data is passed as input to the reducer. In this phase, all
incoming values for the same key are combined, and the resulting key-value pairs are
written into HDFS; the record writer writes the data from the reducer to HDFS. A
reducer is not strictly mandatory for pure searching and mapping purposes.
Reducer logic operates on the sorted mapper data and finally produces output files
such as part-r-00000. Options are provided to set the number of reducers for each job
the user wants to run: in the configuration file mapred-site.xml, we can set
properties that control the number of reducers for a particular task.
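For word count, the reduce step itself is just a sum over the values grouped under
one key. A minimal sketch, continuing the plain-function style rather than real
Hadoop code:

```python
def reducer(word, counts):
    # The shuffle/sort phase guarantees all counts for one word
    # arrive together, so the reducer only needs to sum them.
    yield (word, sum(counts))
```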
Speculative execution plays an important role during job processing. If a task is
running slower than expected, the JobTracker launches a duplicate copy of that task
on another node; whichever copy finishes first is used, which speeds up the job. By
default, jobs themselves are scheduled FIFO (First In, First Out).
Suppose a text file contains data like that shown in the input part of the figure
above, and assume it is the input for our MR task: we have to find the word counts at
the end of the MR job. The internal data flow is shown in the example diagram. The
lines are divided in the splitting phase, and the record reader turns each split into
key-value pairs for the mappers.
Here, three mappers run in parallel, and each mapper task generates output for each
input row it receives. After the mapper phase, the data is shuffled and sorted; all
the grouping is done here, and the grouped values are passed as input to the reducer
phase. The reducers finally combine each key's values and write the results to HDFS
via the record writer.
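The whole flow can be simulated in-process. This is a toy sketch of the
map / shuffle-sort / reduce pipeline, not real Hadoop code:

```python
from collections import defaultdict

def word_count(lines):
    # Map: each line yields (word, 1) pairs.
    mapped = [(word, 1) for line in lines for word in line.split()]
    # Shuffle and sort: group all values by key.
    groups = defaultdict(list)
    for key, value in sorted(mapped):
        groups[key].append(value)
    # Reduce: combine each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}
```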
Hive
Hive provides the Hive Query Language (HiveQL or HQL), which uses MapReduce to
process structured data through Hive.
Hive is a data warehouse infrastructure tool for processing structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and
analysis easy.
Hive was initially developed by Facebook; later the Apache Software Foundation took
it up and developed it further as open source under the name Apache Hive. It is used
by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each
unit:
User Interface: Hive is a data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are the
Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata
of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL and queries the schema information in
the Metastore. It is one of the replacements for the traditional approach of writing
a MapReduce program: instead of writing a MapReduce program in Java, we can write a
query for the MapReduce job and have it processed.
HDFS or HBASE: The Hadoop Distributed File System or HBASE is the data storage
technique used to store the data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following steps define how Hive interacts with the Hadoop framework:
Step 1, Execute Query: The Hive interface, such as the command line or Web UI, sends
the query to the driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2, Get Plan: The driver takes the help of the query compiler, which parses the
query to check the syntax and build the query plan.
Step 3, Get Metadata: The compiler sends a metadata request to the Metastore (any
database).
Step 4, Send Metadata: The Metastore sends the metadata to the compiler as a
response.
Step 5, Send Plan: The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of the query is complete.
Step 6, Execute Plan: The driver sends the execution plan to the execution engine.
Step 7, Execute Job: Internally, the execution job is a MapReduce job. The execution
engine sends the job to the JobTracker, which resides on the Name node, and it
assigns the job to TaskTrackers, which reside on Data nodes. Here, the query executes
as a MapReduce job.
Step 7.1, Metadata Ops: Meanwhile, during execution, the execution engine can perform
metadata operations with the Metastore.
Step 8, Fetch Result: The execution engine receives the results from the Data nodes.
Step 9, Send Results: The execution engine sends those result values to the driver.
MapR
MapReduce v1, included in all versions of the MapR Distribution, serves two purposes
in the Hadoop cluster. First, MapReduce acts as the resource manager for the nodes in
the Hadoop cluster. It employs a JobTracker to divide a job into multiple tasks,
distributing them to one or more TaskTrackers, which perform the work in parallel,
and monitoring their progress. As the resource manager, it is a key component of the
cluster, serving as the platform for many higher-level Hadoop applications, including
Pig(link) and Hive(link). Second, MapReduce serves as a data processing engine,
executing jobs that are expressed with map and reduce semantics.
Starting with the MapR 4.0 release, MapR includes MapReduce v2 in addition to v1.
MapReduce v2 was redesigned to perform only as a data processing engine, spinning
off the resource manager functionality into a new component called YARN (Yet
Another Resource Negotiator). Before this split, higher-level applications that required
access to Hadoop resources had to express their jobs using map and reduce semantics,
with each job going through the map, sort, shuffle, and reduce phases. This was
unsuitable for some types of jobs that didn't fit well into the MapReduce paradigm,
either because they required faster response times than a full MapReduce cycle would
allow, or because they required more complex processing than could be expressed in a
single MapReduce job, such as graph processing. With YARN, Hadoop
clusters become much more versatile, allowing the same cluster to be used for both
classic batch MapReduce processing as well as interactive jobs like SQL.
MapR is a complete enterprise-grade distribution for Apache Hadoop. The MapR
Converged Data Platform has been engineered to improve Hadoop’s reliability,
performance, and ease of use.
The MapR distribution provides a full Hadoop stack that includes the MapR File
System (MapR-FS), the MapR-DB NoSQL database management system, MapR
Streams, the MapR Control System (MCS) user interface, and a full family of Hadoop
ecosystem projects. You can use MapR with Apache Hadoop, HDFS, and MapReduce
APIs.
MapR supports the Hadoop 2.x architecture and YARN (Yet Another Resource
Negotiator). Hadoop 2.x and YARN make up a resource management and scheduling
framework that distributes resource management and job management duties.
Hadoop 2.x was designed to solve two main problems in the Hadoop 1.x architecture:
Here is a high-level view of the MapR Converged Data Platform, showing its main
components and supported ecosystem projects:
This system overview contains architectural details about the components that run on
the MapR Data Platform, how the components assemble into a cluster, and the
relationships between the components.
The MapR distribution provides several unique features that address common concerns
with Apache Hadoop:
Issue / Addressed by MapR Feature / Apache Hadoop
Disaster Recovery:
MapR feature: MapR provides business continuity and disaster recovery services out of
the box, with mirroring that's simple to configure and makes efficient use of your
cluster's storage, CPU, and bandwidth resources.
Apache Hadoop: No standard mirroring solution; scripts based on distcp quickly become
hard to administer and manage. No enterprise-grade consistency.
Scalable Architecture (without single points of failure):
MapR feature: The MapR Converged Data Platform provides High Availability for the
Hadoop components in the stack. MapR clusters don't use NameNodes and provide
stateful high-availability for the MapReduce JobTracker and Direct Access NFS. Works
out of the box with no special configuration required.
Apache Hadoop: NameNode HA provides failover, but no failback, while limiting scale
and creating complex configuration challenges. NameNode federation adds new processes
and parameters to provide cumbersome, error-prone federation. The High-Availability
JobTracker in stock Apache Hadoop does not preserve the state of running jobs;
failover for the JobTracker requires restarting all in-progress jobs and brings
complex configuration requirements.
Sharding
Vertical partitioning is very domain specific. You draw a logical split within your
application data, storing different parts in different databases. It is almost always
implemented at the application level: a piece of code routes reads and writes to a
designated database.
In contrast, sharding splits a homogeneous type of data into multiple databases. You can
see that such an algorithm is easily generalizable. That’s why sharding can be
implemented at either the application or database level. In many databases, sharding is
a first-class concept, and the database knows how to store and retrieve data within a
cluster. Almost all modern databases are natively sharded. Cassandra, HBase, HDFS,
and MongoDB are popular distributed databases. Notable examples of non-sharded
modern databases are Sqlite, Redis (spec in progress), Memcached, and Zookeeper.
There exist various strategies to distribute data into multiple databases. Each strategy
has pros and cons depending on various assumptions a strategy makes. It is crucial to
understand these assumptions and limitations. Operations may need to search through
many databases to find the requested data. These are called cross-partition
operations and they tend to be inefficient. Hotspots are another common problem —
having uneven distribution of data and operations. Hotspots largely counteract the
benefits of sharding.
One alternative to sharding is to get a more expensive machine. Storage capacity is
growing at the speed of Moore's law. From Amazon, you can get a server with 6.4 TB of
SSD, 244 GB of RAM, and 32 cores. Even in 2013, Stack Overflow ran on a single MS SQL
server. (Some may argue that splitting Stack Overflow and Stack Exchange is a form of
sharding.)
Driving Principles
To compare the pros and cons of each sharding strategy, I’ll use the following principles.
How the data is read — Databases are used to store and retrieve data. If we don’t need
to read data at all, we can simply write it to /dev/null. If we only need to batch process the
data once in a while, we can append to a single file and periodically scan through them.
Data retrieval requirements (or lack thereof) heavily influence the sharding strategy.
How the data is distributed — Once you have a cluster of machines acting together, it is
important to ensure that data and work is evenly distributed. Uneven load causes
storage and performance hotspots. Some databases redistribute data dynamically, while
others expect clients to evenly distribute and access data.
Once sharding is employed, redistributing data is an important problem. Once your
database is sharded, it is likely that the data is growing rapidly. Adding an additional
node becomes a regular routine. It may require changes in configuration and moving
large amounts of data between nodes. It adds both performance and operational burden.
Common Definitions
Many databases have their own terminologies. The following terminologies are used
throughout to describe different algorithms.
Shard or Partition Key is a portion of primary key which determines how data should
be distributed. A partition key allows you to retrieve and modify data efficiently by
routing operations to the correct database. Entries with the same partition key are stored
in the same node. A logical shard is a collection of data sharing the same partition key.
A database node, sometimes referred to as a physical shard, contains multiple logical
shards.
Reads are performed within a single database as long as a partition key is given. Queries
without a partition key require searching every database node. Non-partitioned queries
do not scale with respect to the size of cluster, thus they are discouraged.
Algorithmic sharding distributes data by its sharding function only. It doesn’t consider
the payload size or space utilization. To uniformly distribute data, each partition should
be similarly sized. Fine grained partitions reduce hotspots — a single database will
contain many partitions, and the sum of data between databases is statistically likely to
be similar. For this reason, algorithmic sharding is suitable for key-value databases with
homogeneous values.
Resharding data can be challenging. It requires updating the sharding function and
moving data around the cluster. Doing both at the same time while maintaining
consistency and availability is hard. Clever choice of sharding function can reduce the
amount of transferred data. Consistent Hashing is such an algorithm.
Examples of such systems include Memcached. Memcached is not sharded on its own, but
expects client libraries to distribute data within a cluster. Such logic is fairly
easy to implement at the application level.
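Client-side algorithmic sharding of this kind can be sketched as follows. The dicts
stand in for real cache servers, and md5 is an assumed deterministic hash; a real
Memcached client would talk to servers over the network:

```python
import hashlib

class ShardedClient:
    def __init__(self, nodes):
        # nodes: list of backing stores; plain dicts stand in
        # for real cache servers in this sketch.
        self.nodes = nodes

    def _node_for(self, key):
        # Algorithmic sharding: the hash of the key alone decides
        # which node owns it.
        digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        return self.nodes[digest % len(self.nodes)]

    def set(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)
```

Note that `hash(key) % len(nodes)` style placement moves most keys when a node is
added, which is the resharding problem consistent hashing addresses.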
In dynamic sharding, an external locator service determines the location of entries. It can
be implemented in multiple ways. If the cardinality of partition keys is relatively low,
the locator can be assigned per individual key. Otherwise, a single locator can address a
range of partition keys.
To read and write data, clients need to consult the locator service first. Operation by
primary key becomes fairly trivial. Other queries also become efficient depending on the
structure of locators. In the example of range-based partition keys, range queries are
efficient because the locator service reduces the number of candidate databases. Queries
without a partition key will need to search all databases.
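A range-based locator can be sketched with a sorted list of split points. The
boundaries and shard names here are illustrative, and a real locator service would
also handle splits and reassignment:

```python
import bisect

class RangeLocator:
    def __init__(self, boundaries, shards):
        # boundaries are the first keys of shards[1:], so
        # ["g", "p"] splits the key space into three shards.
        self.boundaries = boundaries
        self.shards = shards

    def locate(self, key):
        # Point lookup: exactly one shard per key.
        return self.shards[bisect.bisect_right(self.boundaries, key)]

    def locate_range(self, lo, hi):
        # Range query: only the shards whose ranges overlap [lo, hi],
        # which is why range queries stay efficient here.
        i = bisect.bisect_right(self.boundaries, lo)
        j = bisect.bisect_right(self.boundaries, hi)
        return self.shards[i:j + 1]
```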
Dynamic sharding is more resilient to nonuniform distribution of data. Locators can be
created, split, and reassigned to redistribute data. However, relocation of data and
update of locators need to be done in unison. This process has many corner cases with a
lot of interesting theoretical, operational, and implementational challenges.
The locator service becomes a single point of contention and failure. Every database
operation needs to access it, thus performance and availability are a must. However,
locators cannot be cached or replicated simply. Out of date locators will route operations
to incorrect databases. Misrouted writes are especially bad — they become
undiscoverable after the routing issue is resolved.
Since the effect of misrouted traffic is so devastating, many systems opt for a high
consistency solution. Consensus algorithms and synchronous replications are used to
store this data. Fortunately, locator data tends to be small, so computational costs
associated with such a heavyweight solution tends to be low.
Due to its robustness, dynamic sharding is used in many popular databases. HDFS uses
a Name Node to store filesystem metadata. Unfortunately, the name node is a single
point of failure in HDFS. Apache HBase splits row keys into ranges. The range server is
responsible for storing multiple regions. Region information is stored in Zookeeper to
ensure consistency and redundancy. In MongoDB, the ConfigServer stores the sharding
information, and mongos performs the query routing. ConfigServer uses synchronous
replication to ensure consistency. When a config server loses redundancy, it goes into
read-only mode for safety. Normal database operations are unaffected, but shards
cannot be created or moved.
Previous examples are geared towards key-value operations. However, many databases
have more expressive querying and manipulation capabilities. Traditional RDBMS
features such as joins, indexes and transactions reduce complexity for an application.
The concept of entity groups is very simple. Store related entities in the same partition to
provide additional capabilities within a single partition.
Specifically:
Queries within a single physical shard are efficient.
Stronger consistency semantics can be achieved within a shard.
This is a popular approach to shard a relational database. In a typical web application
data is naturally isolated per user. Partitioning by user gives scalability of sharding
while retaining most of its flexibility. It normally starts off as a simple company-specific
solution, where resharding operations are done manually by developers. Mature
solutions like Youtube’s Vitess and Tumblr’s Jetpants can automate most operational
tasks.
Queries spanning multiple partitions typically have looser consistency guarantees than a
single partition query. They also tend to be inefficient, so such queries should be done
sparingly.
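Entity-group routing can be sketched in a few lines: every table keyed by the same
user id hashes identically, so all of one user's rows land on one shard and
single-user queries never cross shards. The shard count and table names are
illustrative:

```python
NUM_SHARDS = 4

def shard_for(user_id):
    # All entities in a user's group route through the same function,
    # so they always land on the same shard.
    return user_id % NUM_SHARDS

def route(table, row):
    # Any row carrying a user_id (orders, addresses, sessions)
    # joins that user's entity group.
    return (shard_for(row["user_id"]), table)
```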
Column-oriented databases are an extension of key-value stores. They add
expressiveness of entity groups with a hierarchical primary key. A primary key is
composed of a pair (row key, column key). Entries with the same partition key are stored
together. Range queries on columns limited to a single partition are efficient. That’s why
a column key is referred to as a range key in DynamoDB.
This model has been popular since the mid-2000s. The restriction given by hierarchical keys
allows databases to implement data-agnostic sharding mechanisms and efficient storage
engines. Meanwhile, hierarchical keys are expressive enough to represent sophisticated
relationships. Column-oriented databases can model a problem such as time
series efficiently.
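A toy store along these lines, with (row key, column key) primary keys and sorted
columns so per-row range queries stay efficient; the class and method names are
illustrative, not any real database's API:

```python
from bisect import bisect_left, bisect_right

class HierarchicalKV:
    def __init__(self):
        # row_key -> sorted list of (column_key, value) pairs; all
        # entries sharing a row key live in the same partition.
        self.rows = {}

    def put(self, row_key, column_key, value):
        cols = self.rows.setdefault(row_key, [])
        keys = [c for c, _ in cols]
        cols.insert(bisect_left(keys, column_key), (column_key, value))

    def range(self, row_key, lo, hi):
        # Range query over column keys, limited to one partition,
        # e.g. a time window in a time series.
        cols = self.rows.get(row_key, [])
        keys = [c for c, _ in cols]
        return cols[bisect_left(keys, lo):bisect_right(keys, hi)]
```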
The term column database is losing popularity. Both HBase and Cassandra once marketed
themselves as column databases, but not anymore. If I had to categorize these systems
today, I would call them hierarchical key-value stores, since that is their most
distinctive characteristic.
NoSQL Databases
What is NoSQL?
Developers are working with applications that create massive volumes of new,
rapidly changing data types — structured, semi-structured, unstructured and
polymorphic data.
Applications that once served a finite audience are now delivered as services that
must be always-on, accessible from many different devices and scaled globally to
millions of users.
Relational databases were not designed to cope with the scale and agility challenges
that face modern applications, nor were they built to take advantage of the commodity
storage and processing power available today.
NoSQL Database Types
Document databases pair each key with a complex data structure known as a
document. Documents can contain many different key-value pairs, or key-array
pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social
connections. Graph stores include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the
database is stored as an attribute name (or 'key'), together with its value.
Examples of key-value stores are Riak and Berkeley DB. Some key-value stores,
such as Redis, allow each value to have a type, such as 'integer', which adds
functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over
large datasets, and store columns of data together, instead of rows.
When compared to relational databases, NoSQL databases are more scalable and provide
superior performance, and their data model addresses several issues that the relational
model is not designed to address:
Selecting the appropriate data model: document, key-value & wide column, or
graph model
The pros and cons of consistent and eventually consistent systems
Why idiomatic drivers minimize onboarding time for new developers and
simplify application development
Dynamic Schemas
Relational databases require that schemas be defined before you can add data. For
example, you might want to store data about your customers such as phone numbers,
first and last name, address, city and state – a SQL database needs to know what you
are storing in advance.
This fits poorly with agile development approaches, because each time you complete new
features, the schema of your database often needs to change. So if you decide, a few
iterations into development, that you'd like to store customers' favorite items in
addition to their addresses and phone numbers, you'll need to add that column to the
database, and then migrate the entire database to the new schema.
If the database is large, this is a very slow process that involves significant downtime. If
you are frequently changing the data your application stores – because you are iterating
rapidly – this downtime may also be frequent. There's also no way, using a relational
database, to effectively address data that's completely unstructured or unknown in
advance.
NoSQL databases are built to allow the insertion of data without a predefined schema.
That makes it easy to make significant application changes in real-time, without
worrying about service interruptions – which means development is faster, code
integration is more reliable, and less database administrator time is needed. Developers
have typically had to add application-side code to enforce data quality controls, such as
mandating the presence of specific fields, data types or permissible values. More
sophisticated NoSQL databases allow validation rules to be applied within the
database, allowing users to enforce governance across data, while maintaining the
agility benefits of a dynamic schema.
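The idea can be sketched as a collection that accepts documents with no predeclared
schema while still allowing an optional validation rule, in the spirit of the
validators mentioned above. The class and parameter names are illustrative:

```python
class DocumentCollection:
    def __init__(self, validator=None):
        # No schema is declared up front; documents are plain dicts
        # and may carry different fields from one another.
        self.docs = []
        self.validator = validator

    def insert(self, doc):
        # Governance hook: an optional predicate enforced at insert
        # time, without freezing the rest of the schema.
        if self.validator is not None and not self.validator(doc):
            raise ValueError("document failed validation")
        self.docs.append(doc)
```

Adding a new field, such as a customer's favorite items, needs no migration: new
documents simply carry the extra key.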
Auto-sharding
Because of the way they are structured, relational databases usually scale vertically – a
single server has to host the entire database to ensure acceptable performance for cross-
table joins and transactions. This gets expensive quickly, places limits on scale, and
creates a relatively small number of failure points for database infrastructure. The
solution to support rapidly growing applications is to scale horizontally, by adding
servers instead of concentrating more capacity in a single server.
'Sharding' a database across many server instances can be achieved with SQL databases,
but usually is accomplished through SANs and other complex arrangements for making
hardware act as a single server. Because the database does not provide this ability
natively, development teams take on the work of deploying multiple relational
databases across a number of machines. Data is stored in each database instance
autonomously. Application code is developed to distribute the data, distribute queries,
and aggregate the results of data across all of the database instances. Additional code
must be developed to handle resource failures, to perform joins across the different
databases, for data rebalancing, replication, and other requirements. Furthermore,
many benefits of the relational database, such as transactional integrity, are
compromised or eliminated when employing manual sharding.
NoSQL databases, on the other hand, usually support auto-sharding, meaning that they
natively and automatically spread data across an arbitrary number of servers, without
requiring the application to even be aware of the composition of the server pool. Data
and query load are automatically balanced across servers, and when a server goes
down, it can be quickly and transparently replaced with no application disruption.
Cloud computing makes this significantly easier, with providers such as Amazon Web
Services providing virtually unlimited capacity on demand, and taking care of all the
necessary infrastructure administration tasks. Developers no longer need to construct
complex, expensive platforms to support their applications, and can concentrate on
writing application code. Commodity servers can provide the same processing and
storage capabilities as a single high-end server for a fraction of the price.
Replication
Integrated Caching
A number of products provide a caching tier for SQL database systems. These systems
can improve read performance substantially, but they do not improve write
performance, and they add operational complexity to system deployments. If your
application is dominated by reads then a distributed cache could be considered, but if
your application has just a modest write volume, then a distributed cache may not
improve the overall experience of your end users, and will add complexity in managing
cache invalidation.
Examples: SQL databases include MySQL, Postgres, Microsoft SQL Server, and Oracle
Database. NoSQL databases include MongoDB, Cassandra, HBase, and Neo4j.
SQL databases: ('date hired,' etc.), much like a spreadsheet. Related data is stored
in separate tables, and then joined together when more complex queries are executed.
For example, 'offices' might be stored in one table, and 'employees' in another. When
a user wants to find the work address of an employee, the database engine joins the
'employee' and 'office' tables together to get all the information necessary.
NoSQL databases: sometimes stored as BLOBs within the 'value' columns. Document
databases do away with the table-and-row model altogether, storing all relevant data
together in a single 'document' in JSON, XML, or another format, which can nest
values hierarchically.
Scaling: SQL databases scale vertically, meaning a single server must be made
increasingly powerful in order to deal with increased demand. It is possible to
spread SQL databases over many servers, but significant additional engineering is
generally required, and core relational features such as JOINs, referential
integrity, and transactions are typically lost. NoSQL databases scale horizontally,
meaning that to add capacity, a database administrator can simply add more commodity
servers or cloud instances; the database automatically spreads data across servers as
necessary.
Supports multi-record ACID transactions: SQL databases, yes. NoSQL databases, mostly
no; MongoDB 4.0 and beyond supports multi-document ACID transactions.
Often, organizations will begin with a small-scale trial of a NoSQL database in their
organization, which makes it possible to develop an understanding of the technology in
a low-stakes way. Most NoSQL databases are also based on open technologies and are
free to use, meaning that they can be downloaded, implemented and scaled at little cost.
Because development cycles are faster, organizations can also innovate more quickly
and deliver superior customer experience at a lower cost.
As you consider alternatives to legacy infrastructures, you may have several
motivations: to scale or perform beyond the capabilities of your existing system,
identify viable alternatives to expensive proprietary software, or increase the speed and
agility of development. When selecting the right database for your business and
application, there are five important dimensions to consider.
S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers
industry-leading scalability, data availability, security, and performance. This means
customers of all sizes and industries can use it to store and protect any amount of data
for a range of use cases, such as websites, mobile applications, backup and restore,
archive, enterprise applications, IoT devices, and big data analytics. Amazon S3
provides easy-to-use management features so you can organize your data and configure
finely-tuned access controls to meet your specific business, organizational, and
compliance requirements. Amazon S3 is designed for 99.999999999% (11 9's) of
durability, and stores data for millions of applications for companies all around the
world.
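To give the 11-nines durability figure some intuition, the arithmetic can be sketched in a few lines. The per-object, per-year reading of the durability figure is an assumption made for this illustration, not a statement of Amazon's exact durability model:

```python
# Back-of-the-envelope arithmetic for 99.999999999% (11 nines) durability.
# Interpreting the figure as a per-object, per-year loss probability is an
# assumption for the sake of illustration.
ANNUAL_LOSS_PROBABILITY = 1e-11  # 1 - 0.99999999999

def expected_annual_losses(object_count: int) -> float:
    """Expected number of objects lost per year at 11 nines durability."""
    return object_count * ANNUAL_LOSS_PROBABILITY

# Storing ten million objects, the expected loss is about 0.0001 objects
# per year, i.e. roughly one object every ten thousand years.
print(expected_annual_losses(10_000_000))
```

Under this reading, even very large object counts translate into vanishingly small expected annual losses, which is why the durability figure is quoted in "nines."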
Hadoop Distributed File Systems
The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data across
highly scalable Hadoop clusters.
HDFS supports the rapid transfer of data between compute nodes. At its outset, it was
closely coupled with MapReduce, a programmatic framework for data processing.
When HDFS takes in data, it breaks the information down into separate blocks and
distributes them to different nodes in a cluster, thus enabling highly efficient parallel
processing.
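The block-splitting behaviour described above can be sketched in a few lines of Python. The 128 MB block size is the HDFS default since version 2.x, but the round-robin node assignment here is a simplification; real HDFS placement also weighs rack topology and node load:

```python
# Minimal sketch of HDFS-style block splitting: a file is divided into
# fixed-size blocks, and each block is assigned to a DataNode. The
# round-robin assignment is a simplification of the real placement policy.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size

def split_into_blocks(file_size: int, datanodes: list[str]) -> list[tuple[int, str]]:
    """Return (block_index, datanode) pairs for a file of file_size bytes."""
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    return [(i, datanodes[i % len(datanodes)]) for i in range(num_blocks)]

# A 300 MB file needs 3 blocks (two full, one partial), spread over the nodes.
placement = split_into_blocks(300 * 1024 * 1024, ["node1", "node2", "node3"])
print(placement)  # [(0, 'node1'), (1, 'node2'), (2, 'node3')]
```

Because each block lives on its own node, a MapReduce job can process all blocks of the file in parallel, which is the efficiency the paragraph above alludes to.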
Moreover, the Hadoop Distributed File System is specially designed to be highly fault-
tolerant. The file system replicates, or copies, each piece of data multiple times and
distributes the copies to individual nodes, placing at least one copy on a different server
rack than the others. As a result, the data on nodes that crash can be found elsewhere
within a cluster. This ensures that processing can continue while data is recovered.
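The rack-aware replication just described can be sketched as follows. This is a simplified model in the spirit of HDFS's default policy (first replica on the writing node, at least one replica on a different rack); the node names and rack layout are invented for illustration, and the real policy also considers load and topology:

```python
# Simplified rack-aware replica placement, loosely modelled on HDFS's
# default policy: first replica on the writer, and at least one replica
# on a different rack. Node and rack names are invented for illustration.

def place_replicas(writer: str, racks: dict[str, list[str]], factor: int = 3) -> list[str]:
    """Choose `factor` nodes for replicas, spanning at least two racks."""
    local_rack = next(r for r, nodes in racks.items() if writer in nodes)
    replicas = [writer]
    # Prefer nodes on other racks first, then fall back to the local rack.
    remote = [n for r, nodes in racks.items() if r != local_rack for n in nodes]
    local = [n for n in racks[local_rack] if n != writer]
    for node in remote + local:
        if len(replicas) == factor:
            break
        replicas.append(node)
    return replicas

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # ['n1', 'n3', 'n4']
```

With this placement, losing the writer's entire rack still leaves copies on the other rack, which is exactly the failure mode the rack-spanning rule guards against.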
This master node "data chunking" architecture takes design cues from the Google File
System (GFS), a proprietary file system described in Google technical papers, as well as
IBM's General Parallel File System (GPFS), which boosts I/O by striping blocks of
data across multiple disks and writing blocks in parallel. While HDFS is not compliant
with the Portable Operating System Interface (POSIX) model, it echoes POSIX design
style in some aspects.
APACHE SOFTWARE FOUNDATION
The Hadoop Distributed File System arose at Yahoo as a part of that company's ad
serving and search engine requirements. Like other web-oriented companies, Yahoo
found itself juggling a variety of applications that were accessed by a growing number
of users, who were creating more and more data. Facebook, eBay, LinkedIn and Twitter
are among the web companies that used HDFS to underpin big data analytics to
address these same requirements.
But the file system found use beyond that. HDFS was used by The New York Times as
part of large-scale image conversions, Media6Degrees for log processing and machine
learning, LiveBet for log storage and odds analysis, Joost for session analysis and Fox
Audience Network for log analysis and data mining. HDFS is also at the core of many
open source data warehouse alternatives, sometimes called data lakes.
Because HDFS is typically deployed as part of very large-scale implementations,
support for low-cost commodity hardware is a particularly useful feature. Such
systems, running web search and related applications, for example, can range into the
hundreds of petabytes and thousands of nodes. They must be especially resilient, as
server failures are common at such scale.
In 2006, Hadoop's originators donated their work on HDFS and MapReduce to the
Apache Software Foundation. The software was widely adopted in big data analytics
projects in a range of industries. In 2012, Hadoop, including HDFS, reached
Version 1.0.
The basic HDFS standard has been continuously updated since its inception.
With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was
added, and MapReduce and HDFS were effectively decoupled. Thereafter, diverse data
processing frameworks and file systems were supported by Hadoop. While MapReduce
was often replaced by Apache Spark, HDFS continued to be a prevalent file system for
Hadoop.
After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available
in December 2017, with HDFS enhancements supporting additional NameNodes,
erasure coding facilities and greater data compression. At the same time, advances in
HDFS tooling, such as LinkedIn's open source Dr. Elephant and Dynamometer
performance testing tools, have expanded to enable development of ever larger HDFS
implementations.
Visualizations
Systems and applications: