
Unit 5

Frameworks and Visualizations

Hadoop MapReduce Architecture and Example

MapReduce is mainly used for parallel processing of large data sets stored in a Hadoop
cluster. It was originally a programming model designed by Google to provide parallelism,
data distribution and fault tolerance. MR processes data in the form of key-value pairs.
A key-value (KV) pair is a mapping between two linked data items: a key and
its value.

The key (K) acts as an identifier for the value. An example of a key-value (KV) pair is a
pair where the key is the node Id and the value is its properties, including neighbor
nodes, predecessor node, etc. The MR API provides features such as batch
processing, parallel processing of huge volumes of data, and high availability.

MR comes into the picture when large data sets need to be processed. Programmers
write MR applications suited to their business scenarios. To do so, they have to
understand the MR workflow, develop applications according to that flow, and deploy
them across Hadoop clusters. Hadoop is built on Java APIs, and it
provides MR APIs that handle parallel computation across nodes.

The MR workflow goes through different phases, and the end result is stored in HDFS
with replication. The JobTracker takes care of all MR jobs running on the
various nodes in the Hadoop cluster; it plays a vital role in scheduling
jobs and keeps track of all map and reduce jobs. The actual map and reduce
tasks are performed by TaskTrackers.
Hadoop MapReduce Architecture

The MapReduce architecture consists of two main processing stages: the map stage and
the reduce stage. The actual MR processing happens in the TaskTracker. Between the
map and reduce stages, an intermediate process takes place, which shuffles and sorts
the mapper output data. This intermediate data is stored in the local file system.

Mapper Phase

In the mapper phase, the input data is split into two components: key and value. The
key must be writable and comparable during the processing stage, while the value only
needs to be writable. When a client submits input data to the Hadoop system, the Job
tracker assigns tasks to TaskTrackers, and the input data is divided into several
input splits.

Input splits are logical in nature. The record reader converts these input splits into
key-value (KV) pairs, which are the actual input format for the mapper and for
further processing inside the TaskTracker. The input format type varies from one
type of application to another, so the programmer has to examine the input data and
code accordingly.

If we take the text input format, for example, the key is the byte offset of the line and
the value is the entire line. Partitioner and combiner logic is added to the map-side
code only to perform special data operations. Data localization occurs only on the
mapper nodes. A minimal mapper for this case is sketched below.
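
As an illustration, here is a minimal word-count mapper written against the Hadoop Java API. It assumes the standard text input format, so the key it receives is the byte offset of the line and the value is the line itself; the class and variable names are chosen for illustration only.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key   : byte offset of the line (from TextInputFormat)
// Input value : the entire line of text
// Output      : (word, 1) pairs for every word in the line
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit the intermediate key-value pair
        }
    }
}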

A combiner is also called a mini reducer: the reducer code is placed in the mapper as a
combiner. When the mapper output is a huge amount of data, transferring it requires
high network bandwidth. To solve this bandwidth issue, we place the reducer code in
the mapper as a combiner for better performance. The default partitioner used in this
process is the hash partitioner.

The partitioner module in Hadoop plays a very important role in partitioning the data
received from the different mappers or combiners. The partitioner reduces the pressure
that builds on the reducers and improves performance. A custom partitioner can route
records based on any relevant data or condition.

Static and dynamic partitions are also supported and play a very important role in
Hadoop as well as Hive. The partitioner splits the data into a number of output folders,
one per reducer, at the end of the MapReduce phase. The developer designs this
partitioning code according to the business requirement. The partitioner runs between
the mapper and the reducer and is very efficient for query purposes. A sketch of a
custom partitioner follows.
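
Below is a minimal sketch of a custom partitioner using the Hadoop Java API. The class name and the routing rule (hashing the key, essentially what the default hash partitioner does) are illustrative assumptions, not a prescribed implementation; a job would register it with job.setPartitionerClass(WordPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (word, count) pair to a reducer based on the word's hash.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}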

Intermediate Process

The mapper output data undergoes shuffling and sorting in the intermediate process. The
intermediate data is stored in the local file system without replication on the Hadoop
nodes. This intermediate data is the data generated after computation based on certain
logic. Hadoop uses a round-robin algorithm to write the intermediate data to local disk.
Several other factors determine the conditions under which the data is written to local
disk.

Reducer Phase

The shuffled and sorted data is passed as input to the reducer. In this phase, all
incoming values for the same key are combined, and the resulting key-value pairs are
written to HDFS. The record writer writes the data from the reducer to HDFS. A reducer
is not mandatory for jobs that only search or map the data.

Reducer logic operates on the sorted mapper data and finally produces output files such
as part-r-00000, part-r-00001, etc. Options are provided to set the number of reducers
for each job the user wants to run: properties in the configuration file mapred-site.xml
control the number of reducers for a particular job, or the value can be set per job in the
driver code. A sketch follows.
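
The sketch below shows a corresponding sum reducer and one way to set the number of reducers. The property name mapreduce.job.reduces applies to Hadoop 2.x-style configuration, and the value of 2 is only an example.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives a word and all of its counts, and writes the total to HDFS.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // written by the record writer as part-r-000NN
    }
}

// The number of reducers can be set per job in the driver:
//   job.setNumReduceTasks(2);
// or cluster-wide in mapred-site.xml via the mapreduce.job.reduces property.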

Speculative execution plays an important role during job processing. If a mapper working
on some data is running slowly, the JobTracker launches a duplicate copy of the same
task on another node and uses whichever copy finishes first, which speeds up the job. By
default, jobs are scheduled FIFO (First In, First Out).

MapReduce Word Count Example

MapReduce Flow

Suppose a text file contains the data shown in the Input part of the figure above.
Assume that this is the input data for our MR job, and that we have to find the count of
each word at the end of the job. The internal data flow is shown in the example diagram.
Each line is divided in the splitting phase, and the record reader supplies a key-value
pair for every line as input to the mappers.

Here, three mappers run in parallel, and each mapper task generates output for every
input row it receives. After the mapper phase, the data is shuffled and sorted; all
grouping is done here, and the grouped values are passed as input to the reducer phase.
The reducers then combine the values for each key and write the final key-value pairs
to HDFS via the record writer. A complete word-count job is sketched below.
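
Putting the pieces together, a minimal driver for the word-count job might look like the following. The paths, class names, and the reuse of the reducer as a combiner are illustrative assumptions for this sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // mapper sketched earlier
        job.setCombinerClass(IntSumReducer.class);   // reducer reused as combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input/books
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output/wordcount

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}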

Hive

Hive is a platform used to develop SQL-type scripts to perform MapReduce operations.
It provides the Hive Query Language (HiveQL or HQL), which is compiled into
MapReduce jobs to process structured data.

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing
easy.

Hive was initially developed by Facebook; later the Apache Software Foundation took
it up and developed it further as open source under the name Apache Hive. It is
used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Hive is not

 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive
The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each
unit:

User Interface: Hive is data warehouse infrastructure software that creates interaction
between the user and HDFS. The user interfaces that Hive supports are Hive Web UI,
Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive chooses the respective database servers to store the schema or
metadata of tables, databases, columns in a table, their data types, and the HDFS
mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in
the Metastore. It is one of the replacements for the traditional approach of writing a
MapReduce program: instead of writing a MapReduce program in Java, we can write a
query for the MapReduce job and process it.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is
the Hive Execution Engine. The execution engine processes the query and generates
the same results as MapReduce. It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBASE is the data storage
technique used to store data in the file system.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.

5
The following list defines how Hive interacts with the Hadoop framework:

1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the
query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.

2. Get Plan: The driver takes the help of the query compiler, which parses the query to
check the syntax and the query plan or the requirements of the query.

3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).

4. Send Metadata: The Metastore sends the metadata as a response to the compiler.

5. Send Plan: The compiler checks the requirements and resends the plan to the driver.
Up to this point, the parsing and compiling of the query is complete.

6. Execute Plan: The driver sends the execution plan to the execution engine.

7. Execute Job: Internally, the process of executing the job is a MapReduce job. The
execution engine sends the job to the JobTracker, which is on the Name node, and it
assigns this job to a TaskTracker, which is on a Data node. Here, the query executes the
MapReduce job.

7.1. Metadata Ops: Meanwhile, during execution, the execution engine can execute
metadata operations with the Metastore.

8. Fetch Result: The execution engine receives the results from the Data nodes.

9. Send Results: The execution engine sends those resultant values to the driver.

10. Send Results: The driver sends the results to the Hive interfaces.
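
As a sketch of step 1 above, a client can submit HiveQL through the HiveServer2 JDBC driver. The host, port, table, and query used here are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (hypothetical host and port below).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = con.createStatement()) {
            // The query is compiled into a plan and typically run as a MapReduce job.
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS freq FROM docs GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
            }
        }
    }
}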

MapR

Apache MapReduce is a powerful framework for processing large, distributed sets of
structured or unstructured data on a Hadoop cluster. The key feature of MapReduce is
its ability to perform processing across an entire cluster of nodes, with each node
processing its local data. This feature makes MapReduce orders of magnitude faster
than legacy methods of processing big data, which often consisted of a single node
accessing and processing data located in remote SAN or NAS devices.

MapReduce abstracts away the complexity of distributed programming, allowing
programmers to describe the processing they'd like to perform in terms of a map
function and a reduce function. At execution time, during the map phase, multiple
nodes in the cluster, called mappers, read local raw data into key-value pairs. This is
followed by a sort and shuffle phase, where each mapper sorts its results by key and
forwards ranges of keys to other nodes in the cluster, called reducers. Finally, in the
reduce phase, each reducer analyzes the data for the keys it was passed from the mappers.

MapReduce v1, included in all versions of the MapR Distribution, serves two purposes
in the Hadoop cluster. First, MapReduce acts as the resource manager for the nodes in
the Hadoop cluster. It employs a JobTracker to divide a job into multiple tasks,
distributing them to one or more TaskTrackers, which perform the work in parallel, and
monitoring their progress. As the resource manager, it is a key component of the cluster,
serving as the platform for many higher-level Hadoop applications, including Pig
and Hive. Second, MapReduce serves as a data processing engine, executing jobs
that are expressed with map and reduce semantics.

Starting with the MapR 4.0 release, MapR includes MapReduce v2 in addition to v1.
MapReduce v2 was redesigned to perform only as a data processing engine, spinning
off the resource manager functionality into a new component called YARN (Yet
Another Resource Negotiator). Before this split, higher-level applications that required
access to Hadoop resources had to express their jobs using map and reduce semantics,
with each job going through the map, sort, shuffle, and reduce processes. This was
unsuitable for some types of jobs that didn't fit well into the MapReduce paradigm,
either because they required faster response times than a full MapReduce cycle would
allow for, or because they required more complex processing than could be expressed
in a single MapReduce job, such as graph processing. With YARN, Hadoop clusters
become much more versatile, allowing the same cluster to be used for both classic batch
MapReduce processing and interactive jobs such as SQL.

MapR is a complete enterprise-grade distribution for Apache Hadoop. The MapR
Converged Data Platform has been engineered to improve Hadoop’s reliability,
performance, and ease of use.
The MapR distribution provides a full Hadoop stack that includes the MapR File
System (MapR-FS), the MapR-DB NoSQL database management system, MapR
Streams, the MapR Control System (MCS) user interface, and a full family of Hadoop
ecosystem projects. You can use MapR with Apache Hadoop, HDFS, and MapReduce
APIs.
MapR supports the Hadoop 2.x architecture and YARN (Yet Another Resource
Negotiator). Hadoop 2.x and YARN make up a resource management and scheduling
framework that distributes resource management and job management duties.
Hadoop 2.x was designed to solve two main problems in the Hadoop 1.x architecture:

 Centralized job scheduling, resulting in scheduler bottlenecks
 Lack of separation between resource management and application programming concerns

Here is a high-level view of the MapR Converged Data Platform, showing its main
components and supported ecosystem projects:

This system overview contains architectural details about the components that run on
the MapR Data Platform, how the components assemble into a cluster, and the
relationships between the components.
The MapR distribution provides several unique features that address common concerns
with Apache Hadoop:

Data Protection
MapR: MapR Snapshots provide complete recovery capabilities. MapR Snapshots are
rapid, point-in-time consistent snapshots for both files and tables. MapR Snapshots
make efficient use of storage and CPU resources, storing only changes from the point
the snapshot is taken. You can configure schedules for MapR Snapshots with easy-to-use
but powerful scheduling tools.
Apache Hadoop: Snapshot-like capabilities are not consistent, require application
changes to make consistent, and may lead to data loss in certain situations.

Security
MapR: With wire-level security, data transmissions to, from, and within the cluster are
encrypted, and strong authorization mechanisms enable you to tailor the actions a given
user is able to perform. Authentication is robust without burdening end-users.
Permissions for users are checked on each file access.
Apache Hadoop: Permissions for users are checked on file open only.

Disaster Recovery
MapR: MapR provides business continuity and disaster recovery services out of the box
with mirroring that is simple to configure and makes efficient use of your cluster's
storage, CPU, and bandwidth resources.
Apache Hadoop: No standard mirroring solution. Scripts based on distcp quickly
become hard to administer and manage. No enterprise-grade consistency.

Enterprise Integration
MapR: With high-availability Direct Access NFS, data ingestion to your cluster can be
made as simple as mounting an NFS share to the data source. Support for Hadoop
ecosystem projects like Flume or Sqoop means minimal disruptions to your existing
workflow.

Performance
MapR: MapR uses customized units of I/O, chunking, resync, and administration. These
architectural elements allow MapR clusters to run at speeds close to the maximum
allowed by the underlying hardware. In addition, the DirectShuffle technology leverages
the performance advantages of MapR-FS to deliver strong cluster performance, and
Direct Access NFS simplifies data ingestion and access. MapR-DB tables, available with
the M7 license, are natively stored in the file system and support the Apache HBase
API. MapR-DB tables provide the fastest and easiest-to-administer NoSQL solution on
Hadoop.
Apache Hadoop: Stock Apache Hadoop's NFS cannot read or write to an open file.

Scalable Architecture (without single points of failure)
MapR: The MapR Converged Data Platform provides High Availability for the Hadoop
components in the stack. MapR clusters don't use NameNodes and provide stateful
high-availability for the MapReduce JobTracker and Direct Access NFS. This works out
of the box with no special configuration required.
Apache Hadoop: NameNode HA provides failover, but no failback, while limiting scale
and creating complex configuration challenges. NameNode federation adds new
processes and parameters to provide cumbersome, error-prone file federation. The
High-Availability JobTracker in stock Apache Hadoop does not preserve the state of
running jobs. Failover for the JobTracker requires restarting all in-progress jobs and
brings complex configuration requirements.

Sharding

Sharding is a method of splitting and storing a single logical dataset in multiple
databases. By distributing the data among multiple machines, a cluster of database
systems can store a larger dataset and handle additional requests. Sharding is necessary
when a dataset is too large to be stored in a single database. Moreover, many sharding
strategies allow additional machines to be added. Sharding allows a database cluster to
scale along with its data and traffic growth.

Sharding is also referred to as horizontal partitioning. The distinction
between horizontal and vertical comes from the traditional tabular view of a database. A
database can be split vertically, storing different tables and columns in a separate
database, or horizontally, storing rows of the same table in multiple database nodes.

An illustrated example of vertical and horizontal partitioning

# Example of vertical partitioning
fetch_user_data(user_id) -> db["USER"].fetch(user_id)
fetch_photo(photo_id)    -> db["PHOTO"].fetch(photo_id)

# Example of horizontal partitioning
fetch_user_data(user_id) -> user_db[user_id % 2].fetch(user_id)

Vertical partitioning is very domain specific. You draw a logical split within your
application data, storing them in different databases. It is almost always implemented at
the application level — a piece of code routing reads and writes to a designated
database.

In contrast, sharding splits a homogeneous type of data into multiple databases. You can
see that such an algorithm is easily generalizable. That's why sharding can be
implemented at either the application or database level. In many databases, sharding is
a first-class concept, and the database knows how to store and retrieve data within a
cluster. Almost all modern databases are natively sharded. Cassandra, HBase, HDFS,
and MongoDB are popular distributed databases. Notable examples of non-sharded
modern databases are SQLite, Redis (spec in progress), Memcached, and Zookeeper.

There exist various strategies to distribute data into multiple databases. Each strategy
has pros and cons depending on various assumptions a strategy makes. It is crucial to
understand these assumptions and limitations. Operations may need to search through
many databases to find the requested data. These are called cross-partition
operations and they tend to be inefficient. Hotspots are another common problem —
having uneven distribution of data and operations. Hotspots largely counteract the
benefits of sharding.

Sharding adds additional programming and operational complexity to your application.
You lose the convenience of accessing the application's data in a single location.
Managing multiple servers adds operational challenges. Before you begin, see whether
sharding can be avoided or deferred.

Get a more expensive machine. Storage capacity is growing at the speed of Moore's law.
From Amazon, you can get a server with 6.4 TB of SSD, 244 GB of RAM and 32 cores.
Even in 2013, Stack Overflow ran on a single MS SQL server. (Some may argue that
splitting Stack Overflow and Stack Exchange is a form of sharding.)

If your application is bound by read performance, you can add caches or
database replicas. They provide additional read capacity without heavily modifying
your application.

Vertically partition by functionality. Binary blobs tend to occupy large amounts of
space and are isolated within your application. Storing files in S3 can reduce the storage
burden. Other functionalities such as full-text search, tagging, and analytics are best
done by separate databases.

Not everything may need to be sharded. Oftentimes, only a few tables occupy a majority
of the disk space. Very little is gained by sharding small tables with hundreds of rows.
Focus on the large tables.

Driving Principles
To compare the pros and cons of each sharding strategy, I’ll use the following principles.

How the data is read — Databases are used to store and retrieve data. If we don't need
to read data at all, we can simply write it to /dev/null. If we only need to batch process the
data once in a while, we can append to a single file and periodically scan through it.
Data retrieval requirements (or lack thereof) heavily influence the sharding strategy.

How the data is distributed — Once you have a cluster of machines acting together, it is
important to ensure that data and work are evenly distributed. Uneven load causes
storage and performance hotspots. Some databases redistribute data dynamically, while
others expect clients to evenly distribute and access data.

Once sharding is employed, redistributing data is an important problem. Once your
database is sharded, it is likely that the data is growing rapidly, and adding an additional
node becomes a regular routine. This may require changes in configuration and moving
large amounts of data between nodes, adding both performance and operational burden.

Common Definitions
Many databases have their own terminologies. The following terminologies are used
throughout to describe different algorithms.

Shard or Partition Key is a portion of the primary key which determines how data should
be distributed. A partition key allows you to retrieve and modify data efficiently by
routing operations to the correct database. Entries with the same partition key are stored
in the same node. A logical shard is a collection of data sharing the same partition key.
A database node, sometimes referred to as a physical shard, contains multiple logical
shards.

Case 1 — Algorithmic Sharding


One way to categorize sharding is algorithmic versus dynamic. In algorithmic sharding,
the client can determine a given partition’s database without any help. In dynamic
sharding, a separate locator service tracks the partitions amongst the nodes.

An algorithmically sharded database, with a simple sharding function

Algorithmically sharded databases use a sharding function (partition_key) -> database_id
to locate data. A simple sharding function may be "hash(key) % NUM_DB".

Reads are performed within a single database as long as a partition key is given. Queries
without a partition key require searching every database node. Non-partitioned queries
do not scale with respect to the size of the cluster, so they are discouraged.

Algorithmic sharding distributes data by its sharding function only. It doesn’t consider
the payload size or space utilization. To uniformly distribute data, each partition should
be similarly sized. Fine grained partitions reduce hotspots — a single database will
contain many partitions, and the sum of data between databases is statistically likely to
be similar. For this reason, algorithmic sharding is suitable for key-value databases with
homogeneous values.
Resharding data can be challenging. It requires updating the sharding function and
moving data around the cluster. Doing both at the same time while maintaining
consistency and availability is hard. Clever choice of sharding function can reduce the
amount of transferred data. Consistent Hashing is such an algorithm.

Examples of such systems include Memcached. Memcached is not sharded on its own,
but expects client libraries to distribute data within a cluster. Such logic is fairly easy to
implement at the application level.
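
Below is a minimal sketch of application-level algorithmic sharding in Java. The list of database endpoints and the modulo routing function are assumptions for illustration, not a specific client library; note that changing the number of databases would reassign most keys, which is why consistent hashing is preferred when resharding is expected.

import java.util.List;

// Routes a partition key to one of a fixed set of database endpoints
// using the simple sharding function hash(key) % NUM_DB.
public class AlgorithmicRouter {

    private final List<String> databaseUrls;

    public AlgorithmicRouter(List<String> databaseUrls) {
        this.databaseUrls = databaseUrls;
    }

    public String databaseFor(String partitionKey) {
        int bucket = (partitionKey.hashCode() & Integer.MAX_VALUE) % databaseUrls.size();
        return databaseUrls.get(bucket);
    }

    public static void main(String[] args) {
        AlgorithmicRouter router = new AlgorithmicRouter(
            List.of("jdbc:mysql://db0/app", "jdbc:mysql://db1/app", "jdbc:mysql://db2/app"));
        // All reads and writes for this user always go to the same shard.
        System.out.println(router.databaseFor("user:42"));
    }
}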

Case 2 — Dynamic Sharding

A dynamic sharding scheme using range based partitioning.

In dynamic sharding, an external locator service determines the location of entries. It can
be implemented in multiple ways. If the cardinality of partition keys is relatively low,
the locator can be assigned per individual key. Otherwise, a single locator can address a
range of partition keys.

To read and write data, clients need to consult the locator service first. Operation by
primary key becomes fairly trivial. Other queries also become efficient depending on the
structure of locators. In the example of range-based partition keys, range queries are
efficient because the locator service reduces the number of candidate databases. Queries
without a partition key will need to search all databases.
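
A minimal sketch of a range-based locator, as used in dynamic sharding, is shown below. In a real system this mapping would live in a replicated locator service rather than a local map, and the node names and ranges here are purely illustrative.

import java.util.TreeMap;

// Maps the start of each key range to the node that owns it.
// A lookup finds the greatest range start that is <= the partition key.
public class RangeLocator {

    private final TreeMap<String, String> rangeToNode = new TreeMap<>();

    public void assignRange(String rangeStart, String node) {
        rangeToNode.put(rangeStart, node);
    }

    public String locate(String partitionKey) {
        return rangeToNode.floorEntry(partitionKey).getValue();
    }

    public static void main(String[] args) {
        RangeLocator locator = new RangeLocator();
        locator.assignRange("a", "node-1");   // keys starting a..f
        locator.assignRange("g", "node-2");   // keys starting g..p
        locator.assignRange("q", "node-3");   // keys starting q..z
        System.out.println(locator.locate("jsmith"));   // prints node-2
    }
}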

Dynamic sharding is more resilient to nonuniform distribution of data. Locators can be
created, split, and reassigned to redistribute data. However, relocation of data and
update of locators need to be done in unison. This process has many corner cases with a
lot of interesting theoretical, operational, and implementational challenges.

The locator service becomes a single point of contention and failure. Every database
operation needs to access it, thus performance and availability are a must. However,
locators cannot be cached or replicated simply. Out of date locators will route operations
to incorrect databases. Misrouted writes are especially bad — they become
undiscoverable after the routing issue is resolved.

Since the effect of misrouted traffic is so devastating, many systems opt for a high
consistency solution. Consensus algorithms and synchronous replications are used to
store this data. Fortunately, locator data tends to be small, so computational costs
associated with such a heavyweight solution tends to be low.

Due to its robustness, dynamic sharding is used in many popular databases. HDFS uses
a Name Node to store filesystem metadata. Unfortunately, the name node is a single
point of failure in HDFS. Apache HBase splits row keys into ranges. The range server is
responsible for storing multiple regions. Region information is stored in Zookeeper to
ensure consistency and redundancy. In MongoDB, the ConfigServer stores the sharding
information, and mongos performs the query routing. ConfigServer uses synchronous
replication to ensure consistency. When a config server loses redundancy, it goes into
read-only mode for safety. Normal database operations are unaffected, but shards
cannot be created or moved.

Case 3 — Entity Groups

Entity groups partition all related tables together

Previous examples are geared towards key-value operations. However, many databases
have more expressive querying and manipulation capabilities. Traditional RDBMS
features such as joins, indexes and transactions reduce complexity for an application.

The concept of entity groups is very simple. Store related entities in the same partition to
provide additional capabilities within a single partition.

Specifically:
Queries within a single physical shard are efficient.
Stronger consistency semantics can be achieved within a shard.
This is a popular approach to shard a relational database. In a typical web application
data is naturally isolated per user. Partitioning by user gives scalability of sharding
while retaining most of its flexibility. It normally starts off as a simple company-specific
solution, where resharding operations are done manually by developers. Mature
solutions like Youtube’s Vitess and Tumblr’s Jetpants can automate most operational
tasks.

Queries spanning multiple partitions typically have looser consistency guarantees than a
single partition query. They also tend to be inefficient, so such queries should be done
sparingly.

However, a particular cross-partition query may be required frequently and efficiently.
In this case, data needs to be stored in multiple partitions to support efficient reads. For
example, chat messages between two users may be stored twice, partitioned by both
sender and recipient. All messages sent or received by a given user are stored in a
single partition. In general, many-to-many relationships between partitions may need to
be duplicated.

Entity groups can be implemented either algorithmically or dynamically. They are
usually implemented dynamically, since the total size per group can vary greatly. The
same caveats about updating locators and moving data around apply here; instead of
individual tables, an entire entity group needs to be moved together.

Other than sharded RDBMS solutions, Google Megastore is an example of such a
system. Megastore is publicly exposed via Google App Engine's Datastore API.

Case 4 — Hierarchical keys & Column-Oriented Databases

Column-oriented databases partition their data by row keys.

Column-oriented databases are an extension of key-value stores. They add the
expressiveness of entity groups with a hierarchical primary key. A primary key is
composed of a pair (row key, column key). Entries with the same partition key are stored
together. Range queries on columns limited to a single partition are efficient. That is why
a column key is referred to as a range key in DynamoDB.

This model has been popular since the mid-2000s. The restriction imposed by hierarchical
keys allows databases to implement data-agnostic sharding mechanisms and efficient
storage engines. Meanwhile, hierarchical keys are expressive enough to represent
sophisticated relationships. Column-oriented databases can model problems such as time
series efficiently.
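
The sketch below models a hierarchical (row key, column key) layout with an in-memory sorted map to show why single-partition range queries are cheap. It is a toy illustration under assumed names, not how any particular column store is implemented.

import java.util.SortedMap;
import java.util.TreeMap;

// All columns for one row key live together and are sorted by column key,
// so a range scan within a single partition touches one contiguous slice.
public class HierarchicalStore {

    private final TreeMap<String, TreeMap<Long, String>> rows = new TreeMap<>();

    public void put(String rowKey, long columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    // e.g. all readings for one sensor between two timestamps
    public SortedMap<Long, String> rangeQuery(String rowKey, long from, long to) {
        return rows.getOrDefault(rowKey, new TreeMap<>()).subMap(from, to);
    }

    public static void main(String[] args) {
        HierarchicalStore store = new HierarchicalStore();
        store.put("sensor-7", 1000L, "21.5");
        store.put("sensor-7", 2000L, "21.9");
        store.put("sensor-8", 1500L, "19.2");
        System.out.println(store.rangeQuery("sensor-7", 0L, 3000L)); // {1000=21.5, 2000=21.9}
    }
}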

Column-oriented databases can be sharded either algorithmically or dynamically. With
numerous small partitions, they have constraints similar to key-value stores. Otherwise,
dynamic sharding is more suitable.

The term column database is losing popularity. Both HBase and Cassandra once
marketed themselves as column databases, but not anymore. If I need to categorize these
systems today, I would call them hierarchical key-value stores, since this is the most
distinctive characteristic between them.

Originally published in 2005, Google BigTable popularized column-oriented databases
among the public. Apache HBase is a BigTable-like database implemented on top of the
Hadoop ecosystem. Apache Cassandra previously described itself as a column database:
entries were stored in column families with row and column keys. CQL3, the latest
API for Cassandra, presents a flattened data model in which (partition key, column key) is
simply a composite primary key. Amazon's Dynamo popularized highly available
databases. Amazon DynamoDB is a platform-as-a-service offering of Dynamo.
DynamoDB uses (hash key, range key) as its primary key.

NoSQL Databases

What is NoSQL?

NoSQL encompasses a wide variety of different database technologies that were
developed in response to the demands presented in building modern applications:

 Developers are working with applications that create massive volumes of new,
rapidly changing data types — structured, semi-structured, unstructured and
polymorphic data.

 Long gone is the twelve-to-eighteen month waterfall development cycle. Now
small teams work in agile sprints, iterating quickly and pushing code every week
or two, some even multiple times every day.

 Applications that once served a finite audience are now delivered as services that
must be always-on, accessible from many different devices and scaled globally to
millions of users.

 Organizations are now turning to scale-out architectures using open software
technologies, commodity servers and cloud computing instead of large
monolithic servers and storage infrastructure.

Relational databases were not designed to cope with the scale and agility challenges
that face modern applications, nor were they built to take advantage of the commodity
storage and processing power available today.

Launching an application on any database typically requires careful planning to ensure
performance, high availability, security, and disaster recovery, and these obligations
continue as long as you run the application. With MongoDB Atlas, you receive all of the
features of MongoDB without any of the operational heavy lifting, allowing you to
focus instead on learning and building your apps. Features include:

 On-demand, pay-as-you-go model
 Seamless upgrades and auto-healing
 Fully elastic: scale up and down with ease
 Deep monitoring and customizable alerts
 Highly secure by default
 Continuous backups with point-in-time recovery

NoSQL Database Types

 Document databases pair each key with a complex data structure known as a
document. Documents can contain many different key-value pairs, or key-array
pairs, or even nested documents.
 Graph stores are used to store information about networks of data, such as social
connections. Graph stores include Neo4J and Giraph.
 Key-value stores are the simplest NoSQL databases. Every single item in the
database is stored as an attribute name (or 'key'), together with its value.
Examples of key-value stores are Riak and Berkeley DB. Some key-value stores,
such as Redis, allow each value to have a type, such as 'integer', which adds
functionality.
 Wide-column stores such as Cassandra and HBase are optimized for queries over
large datasets, and store columns of data together, instead of rows.

The Benefits of NoSQL

When compared to relational databases, NoSQL databases are more scalable and provide
superior performance, and their data model addresses several issues that the relational
model is not designed to address:

 Large volumes of rapidly changing structured, semi-structured, and
unstructured data
 Agile sprints, quick schema iteration, and frequent code pushes
 Object-oriented programming that is easy to use and flexible
 Geographically distributed scale-out architecture instead of expensive,
monolithic architecture

The top five considerations when evaluating NoSQL databases include:

 Selecting the appropriate data model: document, key-value and wide-column, or
graph model
 The pros and cons of consistent and eventually consistent systems
 Why idiomatic drivers minimize onboarding time for new developers and
simplify application development

Dynamic Schemas

Relational databases require that schemas be defined before you can add data. For
example, you might want to store data about your customers such as phone numbers,
first and last name, address, city and state – a SQL database needs to know what you
are storing in advance.

This fits poorly with agile development approaches, because each time you complete new
features, the schema of your database often needs to change. So if you decide, a few
iterations into development, that you'd like to store customers' favorite items in
addition to their addresses and phone numbers, you'll need to add that column to the
database, and then migrate the entire database to the new schema.

If the database is large, this is a very slow process that involves significant downtime. If
you are frequently changing the data your application stores – because you are iterating
rapidly – this downtime may also be frequent. There's also no way, using a relational
database, to effectively address data that's completely unstructured or unknown in
advance.

NoSQL databases are built to allow the insertion of data without a predefined schema.
That makes it easy to make significant application changes in real-time, without
worrying about service interruptions – which means development is faster, code
integration is more reliable, and less database administrator time is needed. Developers
have typically had to add application-side code to enforce data quality controls, such as
mandating the presence of specific fields, data types or permissible values. More
sophisticated NoSQL databases allow validation rules to be applied within the
database, allowing users to enforce governance across data, while maintaining the
agility benefits of a dynamic schema.
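
As a hedged illustration using the MongoDB Java driver, the two inserts below store documents with different fields in the same collection, with no schema migration required. The connection string, database, collection, and field names are assumptions for this sketch.

import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DynamicSchemaExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                client.getDatabase("shop").getCollection("customers");

            // First document: name, phone, and a nested address.
            customers.insertOne(new Document("name", "Ada Lovelace")
                .append("phone", "555-0100")
                .append("address", new Document("city", "London").append("country", "UK")));

            // A later iteration also records favorite items for a customer,
            // without altering any schema or taking the database offline.
            customers.insertOne(new Document("name", "Grace Hopper")
                .append("phone", "555-0199")
                .append("favoriteItems", Arrays.asList("compilers", "nanoseconds")));
        }
    }
}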

Auto-sharding

Because of the way they are structured, relational databases usually scale vertically – a
single server has to host the entire database to ensure acceptable performance for cross-
table joins and transactions. This gets expensive quickly, places limits on scale, and
creates a relatively small number of failure points for database infrastructure. The
solution to support rapidly growing applications is to scale horizontally, by adding
servers instead of concentrating more capacity in a single server.

'Sharding' a database across many server instances can be achieved with SQL databases,
but usually is accomplished through SANs and other complex arrangements for making
hardware act as a single server. Because the database does not provide this ability
natively, development teams take on the work of deploying multiple relational
databases across a number of machines. Data is stored in each database instance
autonomously. Application code is developed to distribute the data, distribute queries,
and aggregate the results of data across all of the database instances. Additional code
must be developed to handle resource failures, to perform joins across the different
databases, for data rebalancing, replication, and other requirements. Furthermore,
many benefits of the relational database, such as transactional integrity, are
compromised or eliminated when employing manual sharding.

NoSQL databases, on the other hand, usually support auto-sharding, meaning that they
natively and automatically spread data across an arbitrary number of servers, without
requiring the application to even be aware of the composition of the server pool. Data
and query load are automatically balanced across servers, and when a server goes
down, it can be quickly and transparently replaced with no application disruption.

Cloud computing makes this significantly easier, with providers such as Amazon Web
Services providing virtually unlimited capacity on demand, and taking care of all the
necessary infrastructure administration tasks. Developers no longer need to construct
complex, expensive platforms to support their applications, and can concentrate on
writing application code. Commodity servers can provide the same processing and
storage capabilities as a single high-end server for a fraction of the price.

Replication

Most NoSQL databases also support automatic database replication to maintain
availability in the event of outages or planned maintenance events. More sophisticated
NoSQL databases are fully self-healing, offering automated failover and recovery, as
well as the ability to distribute the database across multiple geographic regions to
withstand regional failures and enable data localization. Unlike relational databases,
NoSQL databases generally have no requirement for separate applications or expensive
add-ons to implement replication.

Integrated Caching
A number of products provide a caching tier for SQL database systems. These systems
can improve read performance substantially, but they do not improve write
performance, and they add operational complexity to system deployments. If your
application is dominated by reads then a distributed cache could be considered, but if
your application has just a modest write volume, then a distributed cache may not
improve the overall experience of your end users, and will add complexity in managing
cache invalidation.

Many NoSQL database technologies have excellent integrated caching capabilities,
keeping frequently used data in system memory as much as possible and removing the
need for a separate caching layer. Some NoSQL databases also offer a fully managed,
integrated in-memory database management layer for workloads demanding the
highest throughput and lowest latency.

NoSQL vs. SQL Summary

Types
SQL Databases: One type (SQL database) with minor variations.
NoSQL Databases: Many different types, including key-value stores, document
databases, wide-column stores, and graph databases.

Development History
SQL Databases: Developed in the 1970s to deal with the first wave of data storage
applications.
NoSQL Databases: Developed in the late 2000s to deal with limitations of SQL
databases, especially scalability, multi-structured data, geo-distribution and agile
development sprints.

Examples
SQL Databases: MySQL, Postgres, Microsoft SQL Server, Oracle Database.
NoSQL Databases: MongoDB, Cassandra, HBase, Neo4j.

Data Storage Model
SQL Databases: Individual records (e.g., 'employees') are stored as rows in tables, with
each column storing a specific piece of data about that record (e.g., 'manager,' 'date
hired,' etc.), much like a spreadsheet. Related data is stored in separate tables, and then
joined together when more complex queries are executed. For example, 'offices' might
be stored in one table, and 'employees' in another. When a user wants to find the work
address of an employee, the database engine joins the 'employee' and 'office' tables
together to get all the information necessary.
NoSQL Databases: Varies based on database type. For example, key-value stores
function similarly to SQL databases, but have only two columns ('key' and 'value'),
with more complex information sometimes stored as BLOBs within the 'value' columns.
Document databases do away with the table-and-row model altogether, storing all
relevant data together in a single 'document' in JSON, XML, or another format, which
can nest values hierarchically.

Schemas
SQL Databases: Structure and data types are fixed in advance. To store information
about a new data item, the entire database must be altered, during which time the
database must be taken offline.
NoSQL Databases: Typically dynamic, with some enforcing data validation rules.
Applications can add new fields on the fly, and unlike SQL table rows, dissimilar data
can be stored together as necessary. For some databases (e.g., wide-column stores), it is
somewhat more challenging to add new fields dynamically.

Scaling
SQL Databases: Vertically, meaning a single server must be made increasingly powerful
in order to deal with increased demand. It is possible to spread SQL databases over
many servers, but significant additional engineering is generally required, and core
relational features such as JOINs, referential integrity and transactions are typically lost.
NoSQL Databases: Horizontally, meaning that to add capacity, a database administrator
can simply add more commodity servers or cloud instances. The database automatically
spreads data across servers as necessary.

Development Model
SQL Databases: Mix of open technologies (e.g., Postgres, MySQL) and closed source
(e.g., Oracle Database).
NoSQL Databases: Open technologies.

Supports multi-record ACID transactions
SQL Databases: Yes.
NoSQL Databases: Mostly no. MongoDB 4.0 and beyond support multi-document
ACID transactions.

Data Manipulation
SQL Databases: Specific language using Select, Insert, and Update statements, e.g.
SELECT fields FROM table WHERE…
NoSQL Databases: Through object-oriented APIs.

Consistency
SQL Databases: Can be configured for strong consistency.
NoSQL Databases: Depends on product. Some provide strong consistency (e.g.,
MongoDB, with tunable consistency for reads) whereas others offer eventual
consistency (e.g., Cassandra).

Implementing a NoSQL Database

Often, organizations will begin with a small-scale trial of a NoSQL database in their
organization, which makes it possible to develop an understanding of the technology in
a low-stakes way. Most NoSQL databases are also based on open technologies and are
free to use, meaning that they can be downloaded, implemented and scaled at little cost.
Because development cycles are faster, organizations can also innovate more quickly
and deliver superior customer experience at a lower cost.

As you consider alternatives to legacy infrastructures, you may have several
motivations: to scale or perform beyond the capabilities of your existing system,
identify viable alternatives to expensive proprietary software, or increase the speed and
agility of development. When selecting the right database for your business and
application, there are five important dimensions to consider.

S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers
industry-leading scalability, data availability, security, and performance. This means
customers of all sizes and industries can use it to store and protect any amount of data
for a range of use cases, such as websites, mobile applications, backup and restore,
archive, enterprise applications, IoT devices, and big data analytics. Amazon S3
provides easy-to-use management features so you can organize your data and configure
finely-tuned access controls to meet your specific business, organizational, and
compliance requirements. Amazon S3 is designed for 99.999999999% (11 9's) of
durability, and stores data for millions of applications for companies all around the
world.
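
A minimal sketch of storing and retrieving an object with the AWS SDK for Java v2 follows. The bucket name and key are assumptions, and credentials and region are taken from the default provider chain.

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3Example {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Upload an object (bucket and key are hypothetical).
            s3.putObject(
                PutObjectRequest.builder()
                    .bucket("my-backup-bucket").key("logs/2020-01-01.txt").build(),
                RequestBody.fromString("application log contents"));

            // Read the object back as a UTF-8 string.
            String body = s3.getObjectAsBytes(
                GetObjectRequest.builder()
                    .bucket("my-backup-bucket").key("logs/2020-01-01.txt").build())
                .asUtf8String();
            System.out.println(body);
        }
    }
}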

Hadoop Distributed File Systems

The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data across
highly scalable Hadoop clusters.

HDFS is a key part of many Hadoop ecosystem technologies, as it provides a
reliable means for managing pools of big data and supporting related big data
analytics applications.

How HDFS works

HDFS supports the rapid transfer of data between compute nodes. At its outset, it was
closely coupled with MapReduce, a programmatic framework for data processing.

When HDFS takes in data, it breaks the information down into separate blocks and
distributes them to different nodes in a cluster, thus enabling highly efficient parallel
processing.

Moreover, the Hadoop Distributed File System is specially designed to be highly fault-
tolerant. The file system replicates, or copies, each piece of data multiple times and
distributes the copies to individual nodes, placing at least one copy on a different server
rack than the others. As a result, the data on nodes that crash can be found elsewhere
within a cluster. This ensures that processing can continue while data is recovered.
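
The sketch below writes and reads a file through the HDFS Java client. The NameNode address and paths are assumptions; block splitting, placement, and replication happen transparently as described above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/events/sample.txt");

            // Write: the client streams data; HDFS splits it into blocks and replicates them.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("first event\nsecond event\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back line by line.
            try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}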

HDFS uses a master/slave architecture. In its initial incarnation, each Hadoop
cluster consisted of a single NameNode that managed file system operations and
supporting DataNodes that managed data storage on individual compute nodes. The
HDFS elements combine to support applications with large data sets.

This master-node "data chunking" architecture takes its design cues from the Google
File System (GFS), a proprietary file system outlined in Google technical papers, as
well as from IBM's General Parallel File System (GPFS), a format that boosts I/O by
striping blocks of data over multiple disks, writing blocks in parallel. While HDFS is
not Portable Operating System Interface (POSIX) compliant, it echoes POSIX design
style in some aspects.

(Diagram: Apache Software Foundation) HDFS architecture centers on commanding
NameNodes that hold metadata and DataNodes that store information in blocks.
Working at the heart of Hadoop, HDFS can replicate data at great scale.

Why use HDFS?

The Hadoop Distributed File System arose at Yahoo as a part of that company's ad
serving and search engine requirements. Like other web-oriented companies, Yahoo
found itself juggling a variety of applications that were accessed by a growing number
of users, who were creating more and more data. Facebook, eBay, LinkedIn and Twitter
are among the web companies that used HDFS to underpin big data analytics and
address these same requirements.

But the file system found use beyond that. HDFS was used by The New York Times as
part of large-scale image conversions, Media6Degrees for log processing and machine
learning, LiveBet for log storage and odds analysis, Joost for session analysis and Fox
Audience Network for log analysis and data mining. HDFS is also at the core of many
open source data warehouse alternatives, sometimes called data lakes.

Because HDFS is typically deployed as part of very large-scale implementations,
support for low-cost commodity hardware is a particularly useful feature. Such
systems, running web search and related applications, for example, can range into the
hundreds of petabytes and thousands of nodes. They must be especially resilient, as
server failures are common at such scale.

HDFS and Hadoop history

In 2006, Hadoop's originators ceded their work on HDFS and MapReduce to the
Apache Software Foundation project. The software was widely adopted in big data
analytics projects in a range of industries. In 2012, HDFS and Hadoop became available
in Version 1.0.

The basic HDFS standard has been continuously updated since its inception.

With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was
added, and MapReduce and HDFS were effectively decoupled. Thereafter, diverse data
processing frameworks and file systems were supported by Hadoop. While MapReduce
was often replaced by Apache Spark, HDFS continued to be a prevalent file format for
Hadoop.

After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available
in December 2017, with HDFS enhancements supporting additional NameNodes,
erasure coding facilities and greater data compression. At the same time, advances in
HDFS tooling, such as LinkedIn's open source Dr. Elephant and Dynamometer
performance testing tools, have expanded to enable development of ever larger HDFS
implementations.

Visualizations

Visual data analysis techniques

29
30
31
32
33
34
35
36
37
38
Systems and applications:

39
40
41
42
43
44
45

You might also like