
UNIT – 4

NOSQL DATABASES AND BIG DATA STORAGE SYSTEMS

NoSQL – Categories of NoSQL Systems – CAP Theorem – Document-Based NoSQL Systems and MongoDB – MongoDB Data
Model – MongoDB Distributed Systems Characteristics – NoSQL Key-Value Stores – DynamoDB Overview – Voldemort Key-
Value Distributed Data Store – Wide Column NoSQL Systems – Hbase Data Model – Hbase Crud Operations – Hbase Storage and
Distributed System Concepts – NoSQL Graph Databases and Neo4j – Cypher Query Language of Neo4j – Big Data – MapReduce
– Hadoop – YARN.

What is NoSQL?
A NoSQL database is a non-relational data management system that does not require a fixed schema. It
avoids joins and is easy to scale. The major purpose of using a NoSQL database is to build distributed data
stores with very large data storage needs. NoSQL is used for Big Data and real-time web applications; for
example, companies like Twitter, Facebook and Google collect terabytes of user data every single day.
NoSQL stands for “Not Only SQL” or “Not SQL.” Though a better term would be “NoREL”,
NoSQL caught on. Carlo Strozzi introduced the NoSQL name in 1998.

A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL
database system, by contrast, encompasses a wide range of database technologies that can store structured, semi-
structured, unstructured and polymorphic data.
OLAP:
Online analytical processing (OLAP) is a system for performing multi-dimensional analysis at high speeds on large volumes of data.
Typically, this data is from a data warehouse, data mart or some other centralized data store.

Categories of NoSQL:
Here are the four main types of NoSQL databases:

 Document databases
 Key-value stores
 Column-oriented databases
 Graph databases

Document Databases
A document database stores data in JSON, BSON, or XML documents (not Word documents or Google Docs, of course).
In a document database, documents can be nested. Particular elements can be indexed for faster querying.

Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means
less translation is required to use the data in an application. SQL data must often be assembled and disassembled when
moving back and forth between applications and storage.
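
For example, a single document can hold nested data that an SQL database would normally split across several joined tables. The field names below are purely illustrative:

{
  "_id": 101,
  "name": "Asha",
  "address": { "city": "Chennai", "zip": "600001" },
  "orders": [
    { "order_id": 5001, "total": 250 },
    { "order_id": 5002, "total": 90 }
  ]
}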

Document databases are popular with developers because they have the flexibility to rework their document structures
as needed to suit their application, shaping their data structures as their application requirements change over time. This
flexibility speeds development because in effect data becomes like code and is under the control of developers. In SQL
databases, intervention by database administrators may be required to change the structure of a database.

The most widely adopted document databases are usually implemented with a scale-out architecture, providing a clear
path to scalability of both data volumes and traffic.

Use cases include ecommerce platforms, trading platforms, and mobile app development across industries.

A comparison of MongoDB and PostgreSQL offers a detailed analysis of MongoDB, the leading NoSQL document database,
against PostgreSQL, one of the most popular SQL databases.

Key-Value Stores
The simplest type of NoSQL database is a key-value store. Every data element in the database is stored as a key-value
pair consisting of an attribute name (or "key") and a value. In a sense, a key-value store is like a relational database with
only two columns: the key or attribute name (such as "state") and the value (such as "Alaska").

Use cases include shopping carts, user preferences, and user profiles.

Column-Oriented Databases
While a relational database stores data in rows and reads data row by row, a column store is organized as a set of columns.
This means that when you want to run analytics on a small number of columns, you can read those columns directly
without consuming memory with the unwanted data. Columns are often of the same type and benefit from more efficient
compression, making reads even faster. Columnar databases can quickly aggregate the value of a given column (adding
up the total sales for the year, for example). Use cases include analytics.

Unfortunately there is no free lunch, which means that while columnar databases are great for analytics, the way in which
they write data makes it very difficult for them to be strongly consistent as writes of all the columns require multiple
write events on disk. Relational databases don't suffer from this problem as row data is written contiguously to disk.

Graph Databases
A graph database focuses on the relationship between data elements. Each element is stored as a node (such as a person
in a social media graph). The connections between elements are called links or relationships. In a graph database,
connections are first-class elements of the database, stored directly. In relational databases, links are implied, using data
to express the relationships.

A graph database is optimized to capture and search the connections between data elements, overcoming the overhead
associated with JOINing multiple tables in SQL.

Very few real-world business systems can survive solely on graph queries. As a result graph databases are usually run
alongside other more traditional databases.

Use cases include fraud detection, social networks, and knowledge graphs.

The CAP Theorem in DBMS


The CAP theorem, originally introduced as the CAP principle, can be used to explain some of the competing
requirements in a distributed system with replication. It is a tool used to make system designers aware of the
trade-offs while designing networked shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with replicated
data: consistency (among replicated copies), availability (of the system for read and write operations)
and partition tolerance (in the face of the nodes in the system being partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable properties – consistency,
availability, and partition tolerance at the same time in a distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support two of the following three
properties:

 Consistency–
Consistency means that the nodes will have the same copies of a replicated data item visible to the
various transactions: a guarantee that every node in a distributed cluster returns the same, most
recent, successful write. In other words, every client has the same view of the data. There
are various consistency models; consistency in CAP refers to linearizability (atomic consistency), a
very strong form of consistency.

 Availability–
Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed. Every non-failing
node returns a response for all read and write requests in a reasonable amount of time. The key
word here is every: to be available, every node (on either side of a network partition) must be able
to respond in a reasonable amount of time.

 Partition Tolerant–
Partition tolerance means that the system can continue operating if the network connecting the
nodes has a fault that results in two or more partitions, where the nodes in each partition can only
communicate among each other. That means, the system continues to function and upholds its
consistency guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once
the partition heals.

The use of the word consistency in CAP and its use in ACID do not refer to the same concept.

In CAP, the term consistency refers to the consistency of the values in different copies of the same data item
in a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the integrity
constraints specified on the database schema.

The CAP theorem categorizes systems into three categories:

CP (Consistent and Partition Tolerant) database: A CP database delivers consistency and
partition tolerance at the expense of availability. When a partition occurs between any two nodes,
the system has to shut down the non-consistent node (i.e., make it unavailable) until the partition
is resolved.

A partition refers to a communication break between nodes within a distributed system: if a node
cannot receive any messages from another node in the system, there is a partition between the two
nodes. A partition could be caused by a network failure, a server crash, or any other reason.

AP (Available and Partition Tolerant) database: An AP database delivers availability and
partition tolerance at the expense of consistency. When a partition occurs, all nodes remain
available, but those at the wrong end of the partition might return an older version of the data than
others. When the partition is resolved, AP databases typically resync the nodes to repair all
inconsistencies in the system.

CA (Consistent and Available) database: A CA database delivers consistency and availability in the
absence of any network partition. Often, single-node DB servers are categorized as CA systems,
since they do not need to deal with partition tolerance.

In any networked shared-data or distributed system, partition tolerance is a must.

Network partitions and dropped messages are a fact of life and must be handled appropriately.
Consequently, system designers must choose between consistency and availability.

Databases are therefore commonly classified as CP, AP or CA according to the CAP theorem.
System designers must take the CAP theorem into account while designing or choosing
distributed storage, since one of consistency and availability has to be sacrificed in favour of the other
when a partition occurs.

Document based NoSQL

Document databases are considered to be non-relational (or NoSQL) databases. Instead of storing data
in fixed rows and columns, document databases use flexible documents. Document databases are the
most popular alternative to tabular, relational databases.

Understanding Document-Based MongoDB NoSQL

How an RDBMS relates to enterprise scalability, performance and flexibility

RDBMSs evolved out of strong roots in mathematics such as relational and set theory. Some of their defining aspects
are schema validation, normalized data to avoid duplication, atomicity, locking, concurrency,
high availability and one version of the truth.
While these aspects have a lot of benefits in terms of data storage and retrieval, they can impact
enterprise scalability, performance and flexibility. Let us consider a typical purchase order
example: in the RDBMS world we would have two tables, OrderHeader and LineItem, with a one-to-many relationship between them.

Consider that we need to store a huge number of purchase orders and we start partitioning. One of
the ways to partition is to keep the OrderHeader table in one DB instance and the LineItem information in
another. If you want to insert or update order information, you need to update both tables
atomically, and you need a transaction manager to ensure atomicity. If you want to scale this
further in terms of processing and data storage, you can only increase hard disk space and RAM.

The easy way to achieve scaling in an RDBMS is vertical scaling.

While horizontal scaling refers to adding additional nodes, vertical scaling describes adding more
power to your current machines. For instance, if your server requires more processing power, vertical
scaling would mean upgrading the CPUs.

Let us consider another situation: because of a change in our business, we add a new column called
LineDesc to the LineItem table, and imagine that this application is running in production. Once
we deploy this change, we need to bring the server down for some time for the change to take effect.

Achieving enterprise scalability, performance and flexibility

The fundamental requirements of modern enterprise systems are:

1. flexibility in terms of scaling the database, so that multiple instances of the database can process
the information in parallel
2. flexibility in terms of changes to the database, so that they can be absorbed without long server
downtimes
3. an application/middle tier that does not have to handle the object-relational impedance mismatch – can we get
away from it using techniques like JSON (JavaScript Object Notation)?

Let us go back to our purchase order example, relax some of the aspects of the RDBMS such as
normalization (to avoid joins over lots of rows) and atomicity, and see if we can achieve some of the above
objectives.

Below is an example of how we can store the purchase order (there are other, better ways of storing
the information):

orderheader: {
    orderdescription: "krishna's orders",
    date: "sat jul 24 2010 19:47:11 gmt-0700 (pdt)",
    lineitems: [
        { linename: "pendrive", quantity: "5" },
        { linename: "harddisk", quantity: "10" }
    ]
}

If you notice carefully, the purchase order is stored in a JSON-document-like structure. You will also
notice that we don't need multiple tables, relationships or normalization, and hence there is no need to
join. Since the schema qualifiers are within the document, there is no separate table definition.

You can store them as a collection of objects/documents. Hypothetically, if we need to store several
million purchase orders, we can chunk them into groups and store them across several instances.

If you want to retrieve purchase orders based on specific criteria, for example all the purchase orders
in which one of the line items is a "pendrive", we can ask all the individual instances to retrieve them in
parallel based on the same criteria, and one of them can consolidate the list and return the
information to the client. This is the concept of horizontal scaling.

Because there is no separate table schema and the schema definition is included in the JSON
object, we can change the document structure and store and retrieve it with just a change in the application
layer. This does not need a database restart.

Finally, since the object structure is JSON, we can present it directly to the web tier or a mobile device and
they will render it.

NoSQL is a classification of databases designed with the above aspects in mind.

MongoDB: document-based NoSQL

MongoDB is a document-based NoSQL database that uses some of the above techniques to store and
retrieve data. There are a few NoSQL databases that are ordered key-value based,
like Redis and Cassandra, which also take these approaches but are much simpler.

To give an RDBMS analogy, collections in MongoDB are similar to tables and documents are similar
to rows. Internally, MongoDB stores the information as binary-serializable JSON objects called BSON.
MongoDB supports a JavaScript-style query syntax to retrieve BSON (Binary JSON) objects.
A typical document looks as below:

post = {
    author: "hergé",
    date: new Date(),
    text: "destination moon",
    tags: ["comic", "adventure"]
}

> db.posts.save(post)

> db.posts.find()
{
    _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    author: "hergé",
    date: "sat jul 24 2010 19:47:11 gmt-0700 (pdt)",
    text: "destination moon",
    tags: ["comic", "adventure"]
}

In MongoDB, atomicity is guaranteed within a single document. If you have to achieve atomicity outside of
the document, it has to be managed at the application level. Below is an example of a many-to-many relationship:

products: {
    _id: ObjectId("10"),
    name: "destinationmoon",
    category_ids: [ObjectId("20"), ObjectId("30")]
}

categories: {
    _id: ObjectId("20"),
    name: "adventure"
}

// all products for a given category
> db.products.find({ category_ids: ObjectId("20") })

// all categories for a given product
> product = db.products.findOne({ _id: some_id })
> db.categories.find({ _id: { $in: product.category_ids } })


In a typical stack that uses MongoDB, it makes a lot of sense to use a JavaScript-based web framework;
a common choice is the Express/Node.js/MongoDB stack.

MongoDB also supports sharding, which enables parallel processing and horizontal scaling.

Typical use cases for MongoDB include event logging, real-time analytics, content management and
e-commerce. Use cases where it is not a good fit are transaction-based banking systems and non-real-time
data warehousing.

Data Model Design


Effective data models support your application needs. The key consideration for the structure of your documents is the
decision to embed or to use references.

Embedded Data Models


With MongoDB, you may embed related data in a single structure or document. These schemas are generally known as
"denormalized" models, and they take advantage of MongoDB's rich documents.

Embedded data models allow applications to store related pieces of information in the same database record. As a result,
applications may need to issue fewer queries and updates to complete common operations.

In general, use embedded data models when:

 you have "contains" relationships between entities. See Model One-to-One Relationships with Embedded Documents.
 you have one-to-many relationships between entities. In these relationships the "many" or child documents always appear
with or are viewed in the context of the "one" or parent documents. See Model One-to-Many Relationships with Embedded
Documents.

In general, embedding provides better performance for read operations, as well as the ability to request and retrieve
related data in a single database operation. Embedded data models make it possible to update related data in a single
atomic write operation.

To access data within embedded documents, use dot notation to "reach into" the embedded documents. See query for
data in arrays and query data in embedded documents for more examples on accessing data in arrays and embedded
documents.
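
As a rough illustration, the following Python sketch (using the pymongo driver; the database, collection and field names are assumptions made for this example, not part of MongoDB itself) embeds line items inside an order document and then uses dot notation to query into the embedded array:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Embed the line items directly inside the order document (denormalized model).
orders.insert_one({
    "orderdescription": "krishna's orders",
    "lineitems": [
        {"linename": "pendrive", "quantity": 5},
        {"linename": "harddisk", "quantity": 10},
    ],
})

# Dot notation "reaches into" the embedded documents in the lineitems array.
for doc in orders.find({"lineitems.linename": "pendrive"}):
    print(doc["orderdescription"])

Because the whole order lives in one document, the insert above is a single atomic write and the read returns all the related data in one database operation.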

Embedded Data Model and Document Size Limit

Documents in MongoDB must be smaller than the maximum BSON document size.

For bulk binary data, consider GridFS.

Normalized Data Models

Normalized data models describe relationships using references between documents.

In general, use normalized data models:

 when embedding would result in duplication of data but would not provide sufficient read performance
advantages to outweigh the implications of the duplication.
 to represent more complex many-to-many relationships.
 to model large hierarchical data sets.

To join collections, MongoDB provides the aggregation stages:


 $lookup
 $graphLookup

MongoDB also provides referencing to join data across collections.
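
A minimal sketch of a $lookup join with pymongo is shown below; the orders and customers collections and their field names are assumptions made for illustration:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").shop

pipeline = [
    {"$lookup": {
        "from": "customers",           # collection to join with
        "localField": "customer_id",   # field in the orders documents
        "foreignField": "_id",         # field in the customers documents
        "as": "customer",              # name of the output array field
    }}
]
for order in db.orders.aggregate(pipeline):
    print(order)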

For examples of normalized data models and of various tree models, see the MongoDB data modeling documentation.

MongoDB’s top five technical features:

1. Ad-hoc queries for optimized, real-time analytics


In SQL, an ad hoc query is a loosely typed command/query whose value depends upon some variable. Each time the
command is executed, the result may differ, depending on the value of the variable; it cannot be predetermined
and is usually built with dynamic SQL.

When designing the schema of a database, it is impossible to know in advance all the queries that will be performed by
end users. An ad hoc query is a short-lived command whose value depends on a variable. Each time an ad hoc query is
executed, the result may be different, depending on the variables in question.

Optimizing the way in which ad-hoc queries are handled can make a significant difference at scale, when thousands to
millions of variables may need to be considered. This is why MongoDB, a document-oriented, flexible schema database,
stands apart as the cloud database platform of choice for enterprise applications that require real-time analytics. With ad-
hoc query support that allows developers to update ad-hoc queries in real time, the improvement in performance can be
game-changing.

MongoDB supports field queries, range queries, and regular expression searches. Queries can return specific fields and
also account for user-defined functions. This is made possible because MongoDB indexes BSON documents and uses
the MongoDB Query Language (MQL).
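
A brief pymongo sketch of these query types follows; the collection and field names are illustrative assumptions:

from pymongo import MongoClient

products = MongoClient("mongodb://localhost:27017").shop.products

products.find({"category": "storage"})                        # field (equality) query
products.find({"price": {"$gte": 10, "$lt": 100}})            # range query
products.find({"name": {"$regex": "^pen", "$options": "i"}})  # regular-expression search
products.find({}, {"name": 1, "price": 1})                    # projection: return specific fields only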

2. Indexing appropriately for better query executions


In our experience, the number one issue that many technical support teams fail to address with their users is indexing.
Done right, indexes are intended to improve search speed and performance. A failure to properly define appropriate
indices can and usually will lead to a myriad of accessibility issues, such as problems with query execution and load
balancing.

Without the right indices, a database is forced to scan documents one by one to identify the ones that match the query
statement. But if an appropriate index exists for each query, user requests can be optimally executed by the server.
MongoDB offers a broad range of indices and features with language-specific sort orders that support complex access
patterns to datasets.
Notably, MongoDB indices can be created on demand to accommodate real-time, ever-changing query patterns and
application requirements. They can also be declared on any field within any of your documents, including those nested
within arrays.
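
For illustration, a pymongo sketch of creating such indexes is given below (the collection and field names are assumptions):

from pymongo import MongoClient, ASCENDING, DESCENDING

orders = MongoClient("mongodb://localhost:27017").shop.orders

# Single-field and compound indexes, created on demand.
orders.create_index([("customer_id", ASCENDING)])
orders.create_index([("status", ASCENDING), ("order_date", DESCENDING)])

# An index on a field nested inside an array of embedded documents.
orders.create_index([("lineitems.linename", ASCENDING)])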

3. Replication for better data availability and stability


When your data only resides in a single database, it is exposed to multiple potential points of failure, such as a server
crash, service interruptions, or even good old hardware failure. Any of these events would make accessing your data
nearly impossible.

Replication allows you to sidestep these vulnerabilities by deploying multiple servers for disaster recovery and backup.
Horizontal scaling across multiple servers that house the same data (or shards of that same data) means greatly increased
data availability and stability. Naturally, replication also helps with load balancing. When multiple users access the same
data, the load can be distributed evenly across servers.

In MongoDB, replica sets are employed for this purpose. A primary server or node accepts all write operations and
applies those same operations across secondary servers, replicating the data. If the primary server should ever experience
a critical failure, any one of the secondary servers can be elected to become the new primary node. And if the former
primary node comes back online, it does so as a secondary server for the new primary node.
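
From the application side, connecting to a replica set is mostly a matter of the connection string. The sketch below is illustrative only; the host names and the replica-set name rs0 are assumptions:

from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017",
    replicaSet="rs0",
)

# Writes always go to the primary; reads can optionally be served by secondaries.
orders = client.shop.get_collection(
    "orders", read_preference=ReadPreference.SECONDARY_PREFERRED
)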

4. Sharding
When dealing with particularly large datasets, sharding—the process of splitting larger datasets across multiple
distributed collections, or “shards”—helps the database distribute and better execute what might otherwise be
problematic and cumbersome queries. Without sharding, scaling a growing web application with millions of daily users
is nearly impossible.

Like replication via replica sets, sharding in MongoDB allows for much greater horizontal scalability. Horizontal
scaling means that each shard in every cluster houses a portion of the dataset in question, essentially functioning as a
separate database. The collection of distributed server shards forms a single, comprehensive database much better suited
to handling the needs of a popular, growing application with zero downtime.

Zero downtime deployment is a deployment method where your website or application is never down or in an
unstable state during the deployment process. To achieve this the web server doesn't start serving the changed code
until the entire deployment process is complete.

All operations in a sharding environment are handled through a lightweight process called mongos. Mongos can direct
queries to the correct shard based on the shard key. Naturally, proper sharding also contributes significantly to better
load balancing.
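
As a sketch, sharding a collection boils down to a couple of admin commands issued through mongos; the database name, collection name and shard key below are assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")  # connect to a mongos router

client.admin.command("enableSharding", "shop")
client.admin.command("shardCollection", "shop.orders", key={"customer_id": "hashed"})

Choosing a good shard key matters: mongos can route a query directly to a single shard only when the query includes the shard key; otherwise it has to broadcast the query to all shards.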

5. Load balancing
At the end of the day, optimal load balancing remains one of the holy grails of large-scale database management for
growing enterprise applications. Properly distributing millions of client requests to hundreds or thousands of servers can
lead to a noticeable (and much appreciated) difference in performance.

Fortunately, via horizontal scaling features like replication and sharding, MongoDB supports large-scale load balancing.
The platform can handle multiple concurrent read and write requests for the same data with best-in-class concurrency
control and locking protocols that ensure data consistency. There is no need to add an external load balancer: MongoDB
ensures that each and every user has a consistent view of, and a quality experience with, the data they need to access.

NoSQL database types explained: Key-value store

What is a key-value store?

This specific type of NoSQL database uses the key-value method and represents a collection of numerous key-value
pairs. The keys are unique identifiers for the values. The values can be any type of object -- a number or a string, or even
another key-value pair in which case the structure of the database grows more complex.

Unlike relational databases, key-value databases do not have a specified structure. Relational databases store data in
tables where each column has an assigned data type. Key-value databases are a collection of key-value pairs that are
stored as individual records and do not have a predefined data structure. The key can be anything, but seeing that it is
the only way of retrieving the value associated with it, naming the keys should be done strategically.

Key names can range from as simple as numbering to specific descriptions of the value that is about to follow. A key-
value database can be thought of as a dictionary or a directory. Dictionaries have words as keys and their meanings as
values.
Phonebooks have names of people as keys and their phone numbers as values. Just like key-value stores, unless you
know the name of the person whose number you need, you will not be able to find the right number.

The features of key-value database


The key-value store is one of the least complex types of NoSQL databases. This is precisely what makes this model so
attractive. It uses very simple functions to store, get and remove data.

Apart from those main functions, key-value store databases do not have querying language. The data is of no type and is
determined by the requirements of the application used to process the data.

A very useful feature is built-in redundancy improving the reliability of this database type.
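
The store/get/remove functions mentioned above map directly onto a client API. The sketch below uses the redis-py client purely as an illustration; the host, port and key names are assumptions:

import redis

r = redis.Redis(host="localhost", port=6379)

r.set("session:42:theme", "dark")     # store a value under a key
value = r.get("session:42:theme")     # retrieve it -> b"dark"
r.delete("session:42:theme")          # remove it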

Use cases of key-value databases


The choice of which database an organization should apply depends purely on its users and their needs. However, some
of the most common use cases of key-value databases are to record sessions in applications that require logins.
In this case, the data about each session -- period from login to logoff -- is recorded in a key-value store. Sessions are
marked with identifiers and all data recorded about each session -- themes, profiles, targeted offers, etc. -- is sorted under
the appropriate identifier.

With an increasing variety of data types and cheap storage options, organizations have started stepping away from
relational databases and looking into nonrelational (NoSQL) databases.

Another more specific use case yet similar to the previous one is a shopping cart where e-commerce websites can record
data pertaining to individual shopping sessions. Relational databases are better to use with payment transaction records;
however, session records prior to payment are probably better off in a key-value store. We know that more people fill
their shopping carts and subsequently change their mind about buying the selected items than those who proceed to
payment. Why fill a relational database with all this data when there is a more efficient and more reliable solution?

A key-value store will be quick to record and get data simultaneously. Also, with its built-in redundancy, it ensures that
no item from a cart gets lost. The scalability of key-value stores comes in handy in peak seasons around holidays or
during sales and special promotions because there is usually a sharp increase in sales and an even greater increase in
traffic on the website. The scalability of the key-value store will make sure that the increased load on the database does
not result in performance issues.

Advantages of key-value databases


It is worth pointing out that different database types exist to serve different purposes. This sometimes makes the choice
of the right type of database to use obvious. While key-value databases may be limited in what they can do, they are
often the right choice for the following reasons:

 Simplicity. As mentioned above, key value databases are quite simple to use. The straightforward commands
and the absence of data types make work easier for programmers. With this feature data can assume any type,
or even multiple types, when needed.

 Speed. This simplicity makes key value databases quick to respond, provided that the rest of the environment
around it is well-built and optimized.

 Scalability. This is a beloved advantage of NoSQL databases over relational databases in general, and key-
value stores in particular. Unlike relational databases, which are only scalable vertically, key-value stores are
also infinitely scalable horizontally.

 Easy to move. The absence of a query language means that the database can be easily moved between
different systems without having to change the architecture.

 Reliability. Built-in redundancy comes in handy to cover for a lost storage node where duplicated data comes
in place of what's been lost.
Disadvantages of key-value databases
Not all key-value databases are the same, but some of the general drawbacks include the following:

 Simplicity. The list of advantages and disadvantages demonstrates that everything is relative, and that what
generally comes as an advantage can also be a disadvantage. This further proves that you have to consider
your needs and options carefully before choosing a database to use. The fact that key-value stores are not
complex also means that they are not refined. There is no language nor straightforward means that would
allow you to query the database with anything else other than the key.

 No query language. Without a unified query language to use, queries from one database may not be
transportable into a different key-value database.

 Values can't be filtered. The database sees values as blobs so it cannot make much sense of what they
contain. When there is a request placed, whole values are returned -- rather than a specific piece of
information -- and when they get updated, the whole value needs to be updated.

Popular key-value databases


If you want to rely on recommendations and follow in the footsteps of your peers, you likely won't make a mistake by
choosing one of the following key-value databases:

 Amazon DynamoDB. DynamoDB is a database trusted by many large-scale users and users in general. It is
fully managed and reliable, with built-in backup and security options. It is able to endure high loads and
handle trillions of requests daily. These are just some of the many features supporting the reputation of
DynamoDB, apart from its famous name.

 Aerospike. This is a real-time platform facilitating billions of transactions. It reduces the server footprint by
80% and enables high performance of real-time applications.

 Redis. Redis is an open source key-value database. With keys containing lists, hashes, strings and sets, Redis
is known as a data structure server.

The list goes on and includes many strong competitors. Key-value databases serve a specific purpose, and they have
features that can add value to some but impose limitations on others. For this reason, you should always carefully assess
your requirements and the purpose of your data before you settle for a database. Once that is done, you can start looking
into your options and ensure that your database allows you to collect and make the most of your data without
compromising performance.
DynamoDB – Overview
DynamoDB allows users to create databases capable of storing and retrieving any amount of data, and serving any
amount of traffic. It automatically distributes data and traffic over servers to dynamically manage each customer's
requests, and also maintains fast performance.

What is DynamoDB?

DynamoDB is a hosted NoSQL database offered by Amazon Web Services (AWS). It offers:
 reliable performance even as it scales;
 a managed experience, so you won't be SSH-ing (SSH or Secure Shell is a network communication protocol that
enables two computers to communicate) into servers to upgrade the crypto libraries;
 a small, simple API allowing for simple key-value access as well as more advanced query patterns.

DynamoDB is a particularly good fit for the following use cases:

Applications with large amounts of data and strict latency requirements. As your amount of data scales, JOINs and
advanced SQL operations can slow down your queries. With DynamoDB, your queries have predictable latency up to
any size, including over 100 TBs!
Serverless applications using AWS Lambda. AWS Lambda provides auto-scaling, stateless, ephemeral compute in
response to event triggers. DynamoDB is accessible via an HTTP API and performs authentication & authorization via
IAM roles, making it a perfect fit for building Serverless applications.
Data sets with simple, known access patterns. If you're generating recommendations and serving them to users,
DynamoDB's simple key-value access patterns make it a fast, reliable choice.

DynamoDB vs. RDBMS

DynamoDB uses a NoSQL model, which means it uses a non-relational system. The following comparison highlights the
differences between an RDBMS and DynamoDB for some common tasks:

 Connect to the source: an RDBMS uses a persistent connection and SQL commands; DynamoDB uses HTTP requests and API operations.
 Create a table: in an RDBMS, tables are the fundamental structures and must be defined up front; DynamoDB only requires primary keys, with no schema on creation, and it can take various data sources.
 Get table info: in an RDBMS, all table information remains accessible; in DynamoDB, only the primary keys are revealed.
 Load table data: an RDBMS uses rows made of columns; DynamoDB uses items made of attributes.
 Read table data: an RDBMS uses SELECT statements and filtering clauses; DynamoDB uses GetItem, Query, and Scan.
 Manage indexes: an RDBMS uses standard indexes created through SQL statements, which are maintained automatically on table changes; DynamoDB uses secondary indexes to achieve the same function, and these require a specification (partition key and sort key).
 Modify table data: an RDBMS uses an UPDATE statement; DynamoDB uses an UpdateItem operation.
 Delete table data: an RDBMS uses a DELETE statement; DynamoDB uses a DeleteItem operation.
 Delete a table: an RDBMS uses a DROP TABLE statement; DynamoDB uses a DeleteTable operation.
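
The DynamoDB operations named above can be sketched with the boto3 Python SDK as follows; the table name, key schema and attribute names are assumptions made for illustration:

import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("Orders")

table.put_item(Item={"order_id": "1001", "status": "NEW", "total": 250})  # load data
item = table.get_item(Key={"order_id": "1001"}).get("Item")               # read one item
table.update_item(                                                        # modify data
    Key={"order_id": "1001"},
    UpdateExpression="SET #s = :s",
    ExpressionAttributeNames={"#s": "status"},   # "status" is a reserved word in DynamoDB
    ExpressionAttributeValues={":s": "SHIPPED"},
)
table.delete_item(Key={"order_id": "1001"})                               # delete the item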

Advantages
The two main advantages of DynamoDB are scalability and flexibility. It does not force the use of a particular data
source and structure, allowing users to work with virtually anything, but in a uniform way.

Its design also supports a wide range of uses, from lighter tasks and operations to demanding enterprise functionality. It
also allows simple use of multiple languages: Ruby, Java, Python, C#, Erlang, PHP, and Perl.

Limitations
DynamoDB does suffer from certain limitations, however, these limitations do not necessarily create huge problems or
hinder solid development.

You can review them from the following points −

 Capacity Unit Sizes − A read capacity unit is a single consistent read per second for items no larger than 4KB.
A write capacity unit is a single write per second for items no bigger than 1KB.

 Provisioned Throughput Min/Max − All tables and global secondary indices have a minimum of one read and
one write capacity unit. Maximums depend on region. In the US, 40K read and write remains the cap per table
(80K per account), and other regions have a cap of 10K per table with a 20K account cap.
A data cap (bandwidth cap) is a service provider-imposed limit on the amount of data transferred by
a user account at a specified level of throughput over a given time period, for a specified fee. The
term applies to both home Internet service and mobile data plans.

Data caps are usually imposed as a maximum allowed amount of data in a month for an agreed-upon
charge. As a rule, when the user exceeds that limit, they are charged at a higher rate for further data
use.

 Provisioned Throughput Increase and Decrease − You can increase this as often as needed, but decreases
remain limited to no more than four times daily per table.

 Table Size and Quantity Per Account − Table sizes have no limits, but accounts have a 256 table limit unless
you request a higher cap.

 Secondary Indexes Per Table − Five local and five global are permitted.

 Projected Secondary Index Attributes Per Table − DynamoDB allows 20 attributes.

 Partition Key Length and Values − Their minimum length sits at 1 byte, and maximum at 2048 bytes, however,
DynamoDB places no limit on values.

 Sort Key Length and Values − Its minimum length stands at 1 byte, and maximum at 1024 bytes, with no limit
for values unless its table uses a local secondary index.

 Table and Secondary Index Names − Names must be a minimum of 3 characters in length and a
maximum of 255. They may use the following characters: A-Z, a-z, 0-9, “_”, “-”, and “.”.

 Attribute Names − One character remains the minimum, and 64KB the maximum, with exceptions for keys and
certain attributes.

 Reserved Words − DynamoDB does not prevent the use of reserved words as names.

 Expression Length − Expression strings have a 4KB limit. Attribute expressions have a 255-byte limit.
Substitution variables of an expression have a 2MB limit.

Voldemort Key-Value Distributed Data Store


Voldemort is a distributed data store that was designed as a key-value store used by LinkedIn for highly-scalable storage. It is
named after the fictional Harry Potter villain Lord Voldemort.

Voldemort is a distributed key-value storage system

 Data is automatically replicated over multiple servers.


 Data is automatically partitioned so each server contains only a subset of the total data
 Provides tunable consistency (strict quorum or eventual consistency)
 Server failure is handled transparently
 Pluggable Storage Engines -- BDB-JE, MySQL, Read-Only
 Pluggable serialization -- Protocol Buffers, Thrift, Avro and Java Serialization
 Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system
 Each node is independent of other nodes with no central point of failure or coordination
 Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, the
disk system, and the data replication factor
 Support for pluggable data placement strategies to support things like distribution across data centers that are geographically
far apart.

It is used at LinkedIn by numerous critical services powering a large portion of the site.

Comparison to relational databases

Voldemort is not a relational database; it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it
an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as
document-orientation. It is basically just a big, distributed, persistent, fault-tolerant hash table. For applications that can use an O/R
mapper like active-record or hibernate this will provide horizontal scalability and much higher availability but at great loss of
convenience. For large applications under internet-type scalability pressure, a system may likely consist of a number of functionally
partitioned services or APIs, which may manage storage resources across multiple data centers using storage systems which may
themselves be horizontally partitioned. For applications in this space, arbitrary in-database joins are already impossible since all the
data is not available in any single database. A typical pattern is to introduce a caching layer which will require hashtable semantics
anyway. For these applications Voldemort offers a number of advantages:

 Voldemort combines in memory caching with the storage system so that a separate caching tier is not required (instead the
storage system itself is just fast)
 Unlike MySQL replication, both reads and writes scale horizontally
 Data partitioning is transparent, and allows for cluster expansion without rebalancing all data
 Data replication and placement is decided by a simple API to be able to accommodate a wide range of application specific
strategies
 The storage layer is completely mockable so development and unit testing can be done against a throw-away in-memory
storage system without needing a real cluster (or even a real storage system) for simple testing

Wide Column NoSQL Systems

A wide-column database is a NoSQL database that organizes data storage into flexible columns that can be spread
across multiple servers or database nodes, using multi-dimensional mapping to reference data by column, row, and
timestamp.

What are wide-column stores?

Wide-column stores use the typical tables, columns, and rows, but unlike relational databases (RDBs),
the column formatting and names can vary from row to row inside the same table, and each column is stored
separately on disk.
Columnar databases store each column in a separate file. One file stores only the key column, another
only the first name, another the ZIP, and so on. Each column in a row is governed by auto-indexing
(each functions almost as an index), which means that a scanned/queried column's offset corresponds to
the other columns' offsets for that row in their respective files.

Traditional row-oriented storage gives you the best performance when querying multiple columns of a
single row. Of course, relational databases are structured around columns that hold very specific
information, upholding that specificity for each entry. For instance, let’s take a Customer table. Column
values contain Customer names, addresses, and contact info. Every Customer has the same format.

Column families are different. They give you automatic vertical partitioning; storage is both column-
based and organized by less restrictive attributes. RDB tables are restricted to row-based storage and
deal with tuple storage in rows, accounting for all attributes before moving forward: tuple 1 attribute
1, tuple 1 attribute 2, and so on, then tuple 2 attribute 1, tuple 2 attribute 2, and so on, in that order.
The opposite is columnar storage, which is why we use the term column families.

Note: some columnar systems also have the option for horizontal partitions at default of, say, 6 million
rows. When it’s time to run a scan, this eliminates the need to partition during the actual query. Set up your
system to sort its horizontal partitions at default based on the most commonly used columns. This minimizes
the number of extents containing the values you are looking for.

One useful option if offered — InfiniDB is one example that does — is to automatically create horizontal
partitions based on the most recent queries. This eliminates the impact of much older queries that are no
longer crucial.

A wide-column store (or extensible record store) is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational
database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a
two-dimensional key–value store.

Wide-column stores versus columnar databases


Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the original sense of the term, since
their two-level structures do not use a columnar data layout. In genuine column stores, a columnar data layout is adopted
such that each column is stored separately on disk. Wide-column stores do often support the notion of column
families that are stored separately. However, each such column family typically contains multiple columns that are used
together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-
row fashion, such that the columns for a given row are stored together, rather than each column being stored separately.

Wide-column stores that support column families are also known as column family databases.

Notable wide-column stores


Notable wide-column stores include:

 Apache Accumulo
 Apache Cassandra
 Apache HBase
 Bigtable
 DataStax Enterprise
 DataStax Astra DB
 Hypertable
 Azure Tables
 Scylla (database)

What Is a Wide Column Database?

Wide column databases, or column family databases, refer to a category of NoSQL databases that works well for storing
the enormous amounts of data that can be collected. Their architecture uses persistent, sparse-matrix, multi-dimensional
mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above
the petabyte scale). Column family stores do not follow the relational model, and they aren’t optimized for joins.

Good wide column database use cases include:

 Sensor Logs [Internet of Things (IoT)]


 User preferences
 Geographic information
 Reporting systems
 Time series data
 Logging and other write heavy applications

Wide column databases are not the preferred choice for applications with ad-hoc query patterns, high level aggregations
and changing database requirements. This type of data store does not keep good data lineage.

Other Definitions of Wide Column Databases Include:

 “A multidimensional nested sorted map of maps, where data is stored in cells of columns and grouped into column
families.” (Akshay Pore)
 “Scalability and high availability without compromising performance.” (Apache)
 Database management systems that organize related facts into columns. (Forbes)
 “Databases [that] are similar to key-value but allow a very large number of columns. They are well suited for
analyzing huge data sets, and Cassandra is the best known.” (IBM)
 A store that groups data into columns and allows for an infinite number of them. (Temple University)
 A store with data as rows and columns, like a RDBMS, but able to handle more ambiguous and complex data
types, including unformatted text and imagery. (Michelle Knight)

Businesses Use Wide Column Databases to Handle:

 High volume of data


 Extreme write speeds with relatively less velocity reads
 Data extraction by columns using row keys

Hbase Data Model

What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and
is horizontally scalable.

HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts
of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS
randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.

HBase and HDFS

HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.

HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.

HDFS: It provides high-latency batch processing; it has no concept of random reads/writes.
HBase: It provides low-latency access to single rows from billions of records (random access).

HDFS: It provides only sequential access to data.
HBase: HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase


HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column
families, which are the key-value pairs. A table can have multiple column families and each column family can have any
number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a
timestamp. In short, in HBase:

 Table is a collection of rows.

 Row is a collection of column families.

 Column family is a collection of columns.

 Column is a collection of key value pairs.

Given below is an example schema of a table in HBase: each row is identified by a Rowid and contains four column
families, and each column family in turn contains the columns col1, col2 and col3.

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data.
Shortly, they will have column families.

Row-oriented database: It is suitable for Online Transaction Processing (OLTP); such databases are designed for a small
number of rows and columns.

Column-oriented database: It is suitable for Online Analytical Processing (OLAP); column-oriented databases are
designed for huge tables.

The following image shows column families in a column-oriented database:


HBase and RDBMS

HBase: HBase is schema-less; it doesn't have the concept of a fixed column schema and defines only column families.
RDBMS: An RDBMS is governed by its schema, which describes the whole structure of its tables.

HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin and built for small tables, and it is hard to scale.

HBase: There are no transactions in HBase.
RDBMS: An RDBMS is transactional.

HBase: It holds de-normalized data.
RDBMS: It holds normalized data.

HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.

Features of HBase

 HBase is linearly scalable.

 It has automatic failure support.

 It provides consistent reads and writes.

 It integrates with Hadoop, both as a source and a destination.

 It has an easy Java API for clients.

 It provides data replication across clusters.


Where to Use HBase

 Apache HBase is used to have random, real-time read/write access to Big Data.

 It hosts very large tables on top of clusters of commodity hardware.

 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon the Google File
System; likewise, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

 It is used whenever there is a need for write-heavy applications.

 HBase is used whenever we need to provide fast random access to available data.

 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase – Overview of Architecture and Data Model


In this article, we will briefly look at the capabilities of HBase, compare it against technologies that we are already
familiar with and look at the underlying architecture. In the upcoming parts, we will explore the core data model and
features that enable it to store and manage semi-structured data.

Introduction
HBase is a column-oriented database that’s an open-source implementation of Google’s Big Table storage architecture.
It can manage structured and semi-structured data and has some built-in features such as scalability, versioning,
compression and garbage collection.
Since it uses write-ahead logging and a distributed configuration, it can provide fault tolerance and quick recovery from
individual server failures. HBase is built on top of Hadoop/HDFS, and the data stored in HBase can be manipulated using
Hadoop's MapReduce capabilities.

Let’s now take a look at how HBase (a column-oriented database) is different from some other data structures and
concepts that we are familiar with Row-Oriented vs. Column-Oriented data stores. As shown below, in a row-oriented
data store, a row is a unit of data that is read or written together. In a column-oriented data store, the data in a column is
stored together and hence quickly retrieved.
Row-oriented data stores –

 Data is stored and retrieved one row at a time and hence could read unnecessary data if only some of the
data in a row is required.
 Easy to read and write records
 Well suited for OLTP systems
 Not efficient in performing operations applicable to the entire dataset and hence aggregation is an
expensive operation
 Typical compression mechanisms provide less effective results than those on column-oriented data stores

Column-oriented data stores –

 Data is stored and retrieved in columns and hence can read only relevant data if only some data is required
 Read and Write are typically slower operations
 Well suited for OLAP systems
 Can efficiently perform operations applicable to the entire dataset and hence enables aggregation over
many rows and columns
 Permits high compression rates due to few distinct values in columns

Relational Databases vs. HBase


When talking of data stores, we first think of Relational Databases with structured data storage and a sophisticated query
engine. However, a Relational Database incurs a big penalty to improve performance as the data size increases. HBase,
on the other hand, is designed from the ground up to provide scalability and partitioning to enable efficient data structure
serialization, storage and retrieval. Broadly, the differences between a Relational Database and HBase are:

Relational Database –

 Is Based on a Fixed Schema


 Is a Row-oriented datastore
 Is designed to store Normalized Data
 Contains thin tables
 Has no built-in support for partitioning.

HBase –

 Is Schema-less
 Is a Column-oriented datastore
 Is designed to store Denormalized Data
 Contains wide and sparsely populated tables
 Supports Automatic Partitioning

HDFS vs. HBase


HDFS is a distributed file system that is well suited for storing large files. It’s designed to support batch processing of
data but doesn’t provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide access
to single rows of data in large tables. Overall, the differences between HDFS and HBase are

HDFS –

 Is suited for high-latency batch processing operations


 Data is primarily accessed through MapReduce
 Is designed for batch processing and hence doesn’t have a concept of random reads/writes

HBase –

 Is built for Low Latency operations


 Provides access to single rows from billions of records
 Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift

HBase Architecture
The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown below. Typically, the HBase
cluster has one Master node, called HMaster and multiple Region Servers called HRegionServer. Each Region Server
contains multiple Regions – HRegions.

Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a Table
becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across
the cluster. Each Region Server hosts roughly the same number of Regions.
The HMaster in the HBase is responsible for

 Performing Administration
 Managing and Monitoring the Cluster
 Assigning Regions to the Region Servers
 Controlling the Load Balancing and Failover

On the other hand, the HRegionServers perform the following work

 Hosting and managing Regions


 Splitting the Regions automatically
 Handling the read/write requests
 Communicating with the Clients directly
Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up
of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families
(explained below). The MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Server is kept in a system table called .META. When trying to read or write data
from HBase, the clients read the required Region information from the .META table and directly communicate with the
appropriate Region Server. Each Region is identified by its start key (inclusive) and its end key (exclusive).

HBase Data Model


The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and
columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster.
The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns,
Cells and Versions.

Tables – The HBase Tables are more like logical collections of rows stored in separate partitions called Regions;
every Region is served by exactly one Region Server.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are
always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more Columns, and the Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage to which certain HBase features, such as compression, are applied. Hence it is important that proper care be taken when designing the Column Families in a table.
For example, a table might have Customer and Sales Column Families: the Customer Column Family is made up of two columns – Name and City – whereas the Sales Column Family is made up of two columns – Product and Amount. (A code sketch that creates these column families appears after the component descriptions below.)

Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that
consists of the Column Family name concatenated with the Column name using a colon – example:
columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can have a varying number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column (Column
Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of
versions of data retained in a column family is configurable and this value by default is 3.
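To make the data model concrete, here is a minimal sketch (an illustrative assumption, not part of the original example) that creates a table with the Customer and Sales Column Families using the classic, pre-1.0 HBase Java admin API; the table name and version setting are chosen for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateCustomerSalesTable {
    public static void main(String[] args) throws Exception {
        // Loads hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table name with the two column families described above
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("customer_sales"));
        HColumnDescriptor customer = new HColumnDescriptor("Customer");
        customer.setMaxVersions(3);            // keep three versions per cell (the default)
        HColumnDescriptor sales = new HColumnDescriptor("Sales");

        desc.addFamily(customer);
        desc.addFamily(sales);
        admin.createTable(desc);
        admin.close();
    }
}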

Hbase Crud Operations

HBase - Client API


This chapter describes the Java client API for HBase that is used to perform CRUD (Create, Retrieve, Update, Delete) operations on HBase tables. HBase is written in Java and has a Java Native API. Therefore it provides programmatic access to the Data Manipulation Language (DML). A combined usage sketch of these classes appears at the end of this chapter.

Class HBaseConfiguration

Adds HBase configuration files to a Configuration. This class belongs to the org.apache.hadoop.hbase package.

Methods and description


S.No. Methods and Description

1
static org.apache.hadoop.conf.Configuration create()

This method creates a Configuration with HBase resources.

Class HTable

HTable is an HBase internal class that represents an HBase table. It is an implementation of Table that is used to communicate with a single HBase table. This class belongs to the org.apache.hadoop.hbase.client package.

Constructors
S.No. Constructors and Description
1
HTable()

2
HTable(TableName tableName, ClusterConnection connection, ExecutorService pool)

Using this constructor, you can create an object to access an HBase table.

Methods and description


S.No. Methods and Description

1
void close()

Releases all the resources of the HTable.

2
void delete(Delete delete)

Deletes the specified cells/row.

3
boolean exists(Get get)

Using this method, you can test the existence of columns in the table, as specified by Get.

4
Result get(Get get)

Retrieves certain cells from a given row.

5
org.apache.hadoop.conf.Configuration getConfiguration()

Returns the Configuration object used by this instance.

6
TableName getName()

Returns the table name instance of this table.

7
HTableDescriptor getTableDescriptor()

Returns the table descriptor for this table.

8
byte[] getTableName()

Returns the name of this table.


9
void put(Put put)

Using this method, you can insert data into the table.

Class Put

This class is used to perform Put operations for a single row. It belongs to
the org.apache.hadoop.hbase.client package.

Constructors
S.No. Constructors and Description

1
Put(byte[] row)

Using this constructor, you can create a Put operation for the specified row.

2
Put(byte[] rowArray, int rowOffset, int rowLength)

Using this constructor, you can make a copy of the passed-in row key to keep local.

3
Put(byte[] rowArray, int rowOffset, int rowLength, long ts)

Using this constructor, you can make a copy of the passed-in row key to keep local.

4
Put(byte[] row, long ts)

Using this constructor, we can create a Put operation for the specified row, using a given timestamp.

Methods
S.No. Methods and Description

1
Put add(byte[] family, byte[] qualifier, byte[] value)

Adds the specified column and value to this Put operation.

2
Put add(byte[] family, byte[] qualifier, long ts, byte[] value)

Adds the specified column and value, with the specified timestamp as its version to this Put operation.
3
Put add(byte[] family, ByteBuffer qualifier, long ts, ByteBuffer value)

Adds the specified column and value, with the specified timestamp as its version to this Put operation.


Class Get

This class is used to perform Get operations on a single row. This class belongs to
the org.apache.hadoop.hbase.client package.

Constructor
S.No. Constructor and Description

1
Get(byte[] row)

Using this constructor, you can create a Get operation for the specified row.

2 Get(Get get)

Methods
S.No. Methods and Description

1
Get addColumn(byte[] family, byte[] qualifier)

Retrieves the column from the specific family with the specified qualifier.

2
Get addFamily(byte[] family)

Retrieves all columns from the specified family.

Class Delete

This class is used to perform Delete operations on a single row. To delete an entire row, instantiate a Delete object with
the row to delete. This class belongs to the org.apache.hadoop.hbase.client package.
Constructor
S.No. Constructor and Description

1
Delete(byte[] row)

Creates a Delete operation for the specified row.

2
Delete(byte[] rowArray, int rowOffset, int rowLength)

Creates a Delete operation for the specified row.

3
Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)

Creates a Delete operation for the specified row and timestamp.

4
Delete(byte[] row, long timestamp)

Creates a Delete operation for the specified row and timestamp.

Methods
S.No. Methods and Description

1
Delete addColumn(byte[] family, byte[] qualifier)

Deletes the latest version of the specified column.

2
Delete addColumns(byte[] family, byte[] qualifier, long timestamp)

Deletes all versions of the specified column with a timestamp less than or equal to the specified
timestamp.

3
Delete addFamily(byte[] family)

Deletes all versions of all columns of the specified family.

4
Delete addFamily(byte[] family, long timestamp)

Deletes all columns of the specified family with a timestamp less than or equal to the specified
timestamp.
Class Result

This class is used to get a single row result of a Get or a Scan query.

Constructors
S.No. Constructors

1
Result()

Using this constructor, you can create an empty Result with no KeyValue payload; it returns null if you call rawCells().

Methods
S.No. Methods and Description

1
byte[] getValue(byte[] family, byte[] qualifier)

This method is used to get the latest version of the specified column.

2
byte[] getRow()

This method is used to retrieve the row key that corresponds to the row from which this Result was
created.
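Putting the classes above together, the following is a hedged CRUD sketch using the classic (pre-1.0) client API described in this chapter; the table, row, family and column names are illustrative assumptions, and newer HBase releases replace HTable and Put.add with Connection/Table and Put.addColumn.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table, row and family names are assumptions for illustration
        HTable table = new HTable(conf, "customer_sales");

        // CREATE / UPDATE: put a value into Customer:Name for row "row1"
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("Customer"), Bytes.toBytes("Name"), Bytes.toBytes("Alice"));
        table.put(put);

        // READ: fetch the same cell back
        Get get = new Get(Bytes.toBytes("row1"));
        get.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
        System.out.println("Name = " + Bytes.toString(value));

        // DELETE: remove the latest version of the column, then release resources
        Delete delete = new Delete(Bytes.toBytes("row1"));
        delete.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
        table.delete(delete);

        table.close();   // releases all resources held by the HTable
    }
}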

Hbase Storage and Distributed System Concepts


Auto-Sharding

The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of
rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may also
be merged to reduce their number and required storage files. Each region is served by exactly one region server, and
each of these servers can serve many regions at any time.
Splitting and serving regions can be thought of as autosharding, as offered by other systems. The regions allow for fast
recovery when a server fails, and fine-grained load balancing since they can be moved between servers when the load of
the server currently serving the region is under pressure, or if that server becomes unavailable because of a failure or
because it is being decommissioned.

Splitting is also very fast—close to instantaneous—because the split regions simply read from the original storage files
until a compaction rewrites them into separate ones asynchronously.

Storage API
The API offers operations to create and delete tables and column families. In addition, it has functions to change the
table and column family metadata, such as compression or block sizes. Furthermore, there are the usual operations for
clients to create or delete values as well as retrieving them with a given row key.
A scan API allows you to efficiently iterate over ranges of rows and to limit which columns, or how many versions of each cell, are returned. You can match columns using filters and select versions using time ranges, specifying start and end times.
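As a minimal sketch of that scan API (assuming the classic client classes and the hypothetical customer_sales table used earlier), a bounded range scan might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "customer_sales");

        // Scan rows in the range [row100, row200), limited to the Customer family
        Scan scan = new Scan(Bytes.toBytes("row100"), Bytes.toBytes("row200"));
        scan.addFamily(Bytes.toBytes("Customer"));
        scan.setMaxVersions(2);              // return at most two versions per cell

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();                 // scanners hold server-side resources
            table.close();
        }
    }
}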

Implementation (Architecture)
The data is stored in store files, called HFiles, which are persistent and ordered immutable maps from keys to values.
Internally, the files are sequences of blocks with a block index stored at the end. The index is loaded when the HFile is
opened and kept in memory. The default block size is 64 KB but can be configured differently if required. The store files
provide an API to access specific values as well as to scan ranges of values given a start and end key.
The store files are typically saved in the Hadoop Distributed File System (HDFS), which provides a scalable, persistent,
replicated storage layer for HBase. It guarantees that data is never lost by writing the changes across a configurable
number of physical servers. When data is updated it is first written to a commit log, called a write-ahead log (WAL) in
HBase, and then stored in the in-memory memstore. Once the data in memory has exceeded a given maximum value, it
is flushed to disk as an HFile.
There are three major components to HBase: the client library, one master server, and many region servers. The region
servers can be added or removed while the system is up and running to accommodate changing workloads. The master
is responsible for assigning regions to region servers and uses Apache ZooKeeper, a reliable, highly available, persistent
and distributed coordination service, to facilitate that task.

The master server is also responsible for handling load balancing of regions across region servers, to unload busy
servers and move regions to less occupied ones. The master is not part of the actual data storage or retrieval path. It
negotiates load balancing and maintains the state of the cluster, but never provides any data services to either the region
servers or the clients, and is therefore lightly loaded in practice. In addition, it takes care of schema changes and other
metadata operations, such as creation of tables and column families.
Region servers are responsible for all read and write requests for all regions they serve, and also split regions that have
exceeded the configured region size thresholds. Clients communicate directly with them to handle all data-related
operations.

NoSQL Graph Databases and Neo4j

What is a NoSQL Graph Database?

The NoSQL (‘not only SQL’) graph database is a technology for data management designed to handle very large sets of structured, semi-structured or unstructured data. It helps organizations access, integrate and analyze data from various sources, thus helping them with their big data and social media analytics. The semantic graph database (also known as an RDF triplestore) is a type of NoSQL graph database that is capable of integrating heterogeneous data from many sources and making links between datasets. It focuses on the relationships between entities and is able to infer new knowledge out of existing information.

NoSQL Graph Database Vs. Relational Database

The traditional approach to data management, the relational database, was developed in the 1970s to help enterprises store structured information. A relational database needs its schema (the definition of how data is organized and how the relations are associated) to be defined before any new information is added.
Today, however, mobile, social and Internet of Things (IoT) data is everywhere, with unstructured real-time data piling up by the minute. Apart from handling a massive amount of data of all kinds, the NoSQL graph database does not need its schema re-defined before adding new data.

This makes the graph database much more flexible, dynamic and lower-cost in integrating new data sources than
relational databases.

Compared to the moderate data velocity from one or a few locations that is typical of relational databases, NoSQL graph databases are able to store, retrieve, integrate and analyze high-velocity data coming from many locations (for example, a social graph such as Facebook’s).

Semantically Rich NoSQL Graph Database

The semantic graph database is a type of NoSQL graph database that is capable of integrating heterogeneous data from
many sources and making links between datasets.

The semantic graph database, also referred to as an RDF triplestore, focuses on the relationships between entities and is
able to infer new knowledge out of existing information. It is a powerful tool to use in relationship-centered analytics
and knowledge discovery.

In addition, the capability to handle massive datasets and the schema-less approach support the NoSQL semantic graph
database usage in real-time big data analytics.

 In relational databases, the need to have the schemas defined before adding new information restricts data integration from
new sources because the whole schema needs to be changed anew.
 With the schema-less NoSQL semantic graph database, there is no need to change schemas every time a new data source is added, so enterprises can integrate data with less effort and cost.

The semantic graph database stands out from the other types of graph databases with its ability to additionally support
rich semantic data schema, the so-called ontologies.
The semantic NoSQL graph database gets the best of both worlds: on the one hand, data is flexible because it does not
depend on the schema. On the other hand, ontologies give the semantic graph database the freedom and ability to build
logical models any way organizations find it useful for their applications, without having to change the data.

The Benefits of the Semantic Graph Database

Apart from rich semantic models, semantic graph databases use the globally developed W3C standards for representing
data on the Web. The use of standard practices makes data integration, exchange and mapping to other datasets easier
and lowers the risk of vendor lock-in while working with a graph database.

One of those standards is the Uniform Resource Identifier (URI), a kind of unique ID for all things linked so that we can
distinguish between them or know that one thing from one dataset is the same as another in a different dataset. The use
of URIs not only reduces costs in integrating data from disparate sources, it also makes data publishing and sharing easier
with mapping to Linked (Open) Data.

Ontotext’s GraphDB is able to use inference, that is, to infer new links out of existing explicit statements in the RDF
triplestore. Inference enriches the graph database by creating new knowledge and gives organizations the ability to see
all their data highly interlinked. Thus, enterprises have more insights at hand to use in their decision-making processes.

NoSQL Graph Database Use Cases

Apart from representing proprietary enterprise data in a linked and meaningful way, the NoSQL graph database makes
content management and personalization easier, due to its cost-effective way of integrating and combining huge sets of
data.

A graph is a pictorial representation of data in the form of nodes and relationships, which are represented by edges. A graph database is a type of database used to represent data in the form of a graph. It has three components – nodes, relationships, and properties – and these components are used to model the data. The concept of a graph database is based on the theory of graphs and was introduced around the year 2000. Graph databases are commonly classed as NoSQL databases because data is stored using nodes, relationships and properties instead of traditional tables. A graph database is very useful for heavily interconnected data: relationships between data are given priority and can therefore be easily visualized. Graph databases are flexible, as new data can be added without hampering the old, and they are useful in fields such as social networking, fraud detection and AI knowledge graphs.
The components are described as follows:

 Nodes: represent the objects or instances. A node is equivalent to a row in a relational database and acts as a vertex in the graph. Nodes are grouped by applying a label to each member.
 Relationships: the edges of the graph. They have a specific direction and type and form patterns in the data; they establish the relationships between nodes.
 Properties: the information associated with the nodes.
Some examples of graph database software are Neo4j, Oracle NoSQL Database, GraphBase, etc., of which Neo4j is the most popular.

In traditional databases, the relationships between data are not established explicitly. In a graph database, the relationships between data are prioritized. Nowadays data is mostly interconnected, with one piece of data connected to another directly or indirectly. Since the concept of this database is based on graph theory, it is flexible and works very fast for associative data. Interconnected data also helps to establish further relationships. Querying is fast as well, because with the help of relationships we can quickly find the desired nodes, and join operations are not required, which reduces cost. The relationships and properties are stored as first-class entities in a graph database.

Graph databases also allow organizations to connect their data with external sources. Since organizations handle huge amounts of data, it often becomes cumbersome to store the data in the form of tables. For instance, if an organization wants to find a particular piece of data that is connected with data in another table, a join operation is first performed between the tables and then the data is searched row by row. A graph database solves this problem: it stores the relationships and properties along with the data, so if the organization needs to search for a particular piece of data, the nodes can be found with the help of relationships and properties, without joins and without traversing row by row. Thus the search for nodes does not depend on the amount of data.

Types of Graph Databases:


 Property Graphs: These graphs are used for querying and analyzing data by modelling the relationships among the data. They comprise vertices that hold information about a particular subject and edges that denote the relationships. The vertices and edges have additional attributes called properties.
 RDF Graphs: RDF stands for Resource Description Framework. RDF graphs focus more on data integration and are used to represent complex data with well-defined semantics. Each statement is represented by three elements – two vertices and an edge – reflecting the subject, predicate and object of a sentence. Every vertex and edge is identified by a URI (Uniform Resource Identifier).
When to Use Graph Database?
 Graph databases should be used for heavily interconnected data.
 It should be used when amount of data is larger and relationships are present.
 It can be used to represent the cohesive picture of the data.
How Graph and Graph Databases Work?
Graph databases provide graph models They allow users to perform traversal queries since data is connected. Graph
algorithms are also applied to find patterns, paths and other relationships this enabling more analysis of the data. The
algorithms help to explore the neighboring nodes, clustering of vertices analyze relationships and patterns. Countless
joins are not required in this kind of database.

Example of Graph Database:


 Recommendation engines in e-commerce use graph databases to provide customers with accurate recommendations and updates about new products, thus increasing sales and satisfying customers’ desires.
 Social media companies use graph databases to find the “friends of friends” or products that a user’s friends like, and send suggestions to the user accordingly.
 Graph databases play a major role in fraud detection. Users can create a graph from the transactions between entities and store other important information. Once created, running a simple query will help to identify the fraud.
Advantages of Graph Database:
 A potential advantage of a graph database is establishing relationships with external sources as well.
 No joins are required, since relationships are already specified.
 Querying depends on concrete relationships and not on the amount of data.
 It is flexible and agile.
 It is easy to manage the data in terms of a graph.
Disadvantages of Graph Database:
 For very complex relationships, searching can become slower.
 The query language is platform dependent.
 They are inappropriate for transactional data.
 They have a smaller user base.
Future of Graph Database:
A graph database is an excellent tool for storing data, but it cannot be used to completely replace the traditional database. It deals with a typical set of interconnected data. Although the graph database is still in a developmental phase, it is becoming increasingly important as businesses and organizations use big data, and graph databases help with complex analysis. Thus these databases have become a must for today’s needs and tomorrow’s success.

Neo4j is the world's leading open source Graph Database which is developed using Java technology. It is highly scalable
and schema free (NoSQL).

Used by: Walmart, eBay, NASA, Microsoft, IBM

What is a Graph Database?


A graph is a pictorial representation of a set of objects where some pairs of objects are connected by links. It is composed
of two elements - nodes (vertices) and relationships (edges).

Graph database is a database used to model the data in the form of graph. In here, the nodes of a graph depict the entities
while the relationships depict the association of these nodes.

Popular Graph Databases


Neo4j is a popular Graph Database. Other Graph Databases are Oracle NoSQL Database, OrientDB, HypherGraphDB,
GraphBase, InfiniteGraph, and AllegroGraph.

Why Graph Databases?


Nowadays, most of the data exists in the form of the relationship between different objects and more often, the
relationship between the data is more valuable than the data itself.

Relational databases store highly structured data in records that all follow the same structure, so they are good at storing structured data, but they do not store the relationships between the data.

Unlike other databases, graph databases store relationships and connections as first-class entities.

The data model for graph databases is simpler compared to other databases and, they can be used with OLTP systems.
They provide features like transactional integrity and operational availability.

RDBMS Vs Graph Database


Following is the table which compares Relational databases and Graph databases.

Sr.No RDBMS Graph Database

1 Tables Graphs

2 Rows Nodes

3 Columns and Data Properties and its values

4 Constraints Relationships

5 Joins Traversal

Advantages of Neo4j

Following are the advantages of Neo4j.


 Flexible data model − Neo4j provides a flexible, simple and yet powerful data model, which can be easily changed according to the applications and industries.

 Real-time insights − Neo4j provides results based on real-time data.

 High availability − Neo4j is highly available for large enterprise real-time applications with transactional
guarantees.

 Connected and semi-structured data − Using Neo4j, you can easily represent connected and semi-structured data.

 Easy retrieval − Using Neo4j, you can not only represent but also easily retrieve (traverse/navigate) connected
data faster when compared to other databases.

 Cypher query language − Neo4j provides a declarative query language to represent the graph visually, using
an ascii-art syntax. The commands of this language are in human readable format and very easy to learn.

 No joins − Neo4j does NOT require complex joins to retrieve connected/related data, as it is very easy to retrieve adjacent node or relationship details without joins or indexes.

Features of Neo4j

Following are the notable features of Neo4j −

 Data model (flexible schema) − Neo4j follows a data model named native property graph model. Here, the
graph contains nodes (entities) and these nodes are connected with each other (depicted by relationships). Nodes
and relationships store data in key-value pairs known as properties.

In Neo4j, there is no need to follow a fixed schema. You can add or remove properties as per requirement. It
also provides schema constraints.

 ACID properties − Neo4j supports full ACID (Atomicity, Consistency, Isolation, and Durability) rules.

 Scalability and reliability − You can scale the database by increasing the number of reads/writes and the volume of data without affecting query processing speed and data integrity. Neo4j also provides support for replication, for data safety and reliability.

 Cypher Query Language − Neo4j provides a powerful declarative query language known as Cypher. It uses
ASCII-art for depicting graphs. Cypher is easy to learn and can be used to create and retrieve relations between
data without using the complex queries like Joins.

 Built-in web application − Neo4j provides a built-in Neo4j Browser web application. Using this, you can create
and query your graph data.

 Drivers − Neo4j can work with −

o a REST (Representational State Transfer) API to work with programming languages such as Java, Spring, Scala, etc.

o JavaScript to work with UI MVC (Model-View-Controller) frameworks such as Node JS.

o two kinds of Java API (application program interface): the Cypher API and the Native Java API, to develop Java applications. In addition to these, you can also work with other databases such as MongoDB, Cassandra, etc.

 Indexing − Neo4j supports Indexes by using Apache Lucence.

Neo4j Property Graph Data Model

Neo4j Graph Database follows the Property Graph Model to store and manage its data.

Following are the key features of Property Graph Model −

 The model represents data in Nodes, Relationships and Properties

 Properties are key-value pairs

 Nodes are represented using circles and Relationships are represented using arrows

 Relationships have directions: Unidirectional and Bidirectional

 Each Relationship contains "Start Node" or "From Node" and "To Node" or "End Node"

 Both Nodes and Relationships contain properties

 Relationships connect nodes

In the Property Graph Data Model, and hence in Neo4j, relationships must be directional. If we try to create a relationship without a direction, Neo4j will throw an error message saying that "Relationships should be directional".

Neo4j Graph Database stores all of its data in Nodes and Relationships. We need neither an additional RDBMS nor any SQL database to store Neo4j data; Neo4j stores its data in terms of graphs in its native format.

Neo4j uses Native GPE (Graph Processing Engine) to work with its Native graph storage format.

The main building blocks of Graph DB Data Model are −

 Nodes

 Relationships

 Properties

Consider a simple example of a Property Graph: Nodes are represented using circles and Relationships using directed arrows. A Node's data is represented as Properties (key-value pairs), for example an Id property shown inside the Node's circle.

Neo4j Query Cypher Language


Neo4j has its own query language called Cypher. It is similar in spirit to SQL, but remember one thing: Neo4j does not work with tables, rows or columns; it deals with nodes, and it is more natural to see the data in a graph format rather than in a table format.

Example: a Neo4j Cypher statement compared to SQL

MATCH (G:Company { name:"GeeksforGeeks" })

RETURN G

This Cypher statement will return the “Company” node where the “name” property is GeeksforGeeks. Here “G” works like a variable that holds the data your Cypher query matches and then returns. This will be clearer if you know SQL; the same query written in SQL is shown below.

SELECT * FROM Company WHERE name = "GeeksforGeeks";

Neo4j is a NoSQL database, but it is also very effective on data that would traditionally be kept in a relational database; it simply does not use the SQL language.

ASCII-Art Syntax: Neo4j uses ASCII art to create patterns.

(X)-[:GeeksforGeeks]->(Y)

In Neo4j, nodes are represented by “( )”.

 The relationship is represented by ” -> “.
 The kind of relationship between the nodes is given inside ” [ ] ”, for example [:GeeksforGeeks].
The above description helps decode the given ASCII-art syntax, (X)-[:GeeksforGeeks]->(Y): X and Y are nodes, and the relationship from X to Y is of kind “GeeksforGeeks”.
Defining Data: The following points will help you grasp the concepts of the Cypher language.

 Neo4j deals with nodes, and nodes carry labels such as “Person”, “Employee” or “Employer” – anything that defines the type of the node.
 Nodes also have properties such as “name”, “employee_id” or “phone_number”, which give us information about the nodes.
 Neo4j relationships can also contain properties, but this is not mandatory.
 In Neo4j a relationship describes a situation such as “X works for GeeksforGeeks”, where X and “GeeksforGeeks” are nodes and the relation is “works for”; in Cypher it is written as
(X)-[:WORK]->(GeeksforGeeks).

Note: Here Company is the node’s label and name is a property of the node:

MATCH (G:Company { name:"GeeksforGeeks" })

RETURN G
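As a hedged sketch that is not part of the original text, the same Cypher statement can also be executed from Java using the official Neo4j Java driver (org.neo4j.driver); the connection URI and credentials below are illustrative assumptions.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class CypherFromJava {
    public static void main(String[] args) {
        // URI and credentials are assumptions; adjust for your own Neo4j instance
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
        try (Session session = driver.session()) {
            // Parameterised version of: MATCH (G:Company { name:"GeeksforGeeks" }) RETURN G
            Result result = session.run(
                    "MATCH (G:Company { name: $name }) RETURN G.name AS companyName",
                    Values.parameters("name", "GeeksforGeeks"));
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("companyName").asString());
            }
        }
        driver.close();
    }
}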

What is Big Data?


Big Data is a collection of data that is huge in volume, yet growing exponentially with time. Its size and complexity are so large that none of the traditional data management tools can store or process it efficiently.

What is an Example of Big Data?


Following are some of the Big Data examples-

The New York Stock Exchange is an example of Big Data that generates about one terabyte of new trade data per day.

Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.

Jet Engines
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Data Growth over the years

(Figure: data growth over the years; a yottabyte is 2^80 bytes.)

Characteristics Of Big Data


Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity
 Variability
 Veracity

(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a very crucial role in
determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent
upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big
Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days,
spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the
form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis
applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.

(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is generated and processed
to meet the demands, determines real potential in the data.

Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs,
networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus hampering the process
of being able to handle and manage the data effectively.
(v) Veracity – Veracity refers to the quality of data. Because data comes from so many different sources, it’s difficult to
link, match, cleanse and transform data across systems. Businesses need to connect and correlate relationships,
hierarchies and multiple data linkages. Otherwise, their data can quickly spiral out of control.

Advantages Of Big Data Processing


The ability to process Big Data brings multiple benefits, such as –

 Businesses can utilize outside intelligence while taking decisions

Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

 Improved customer service


Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In
these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer
responses.

 Early identification of risk to the product/services, if any


 Better operational efficiency

Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data
should be moved to the data warehouse. In addition, such integration of Big Data technologies and data warehouse helps
an organization to offload infrequently accessed data.

Summary

 Big Data definition: Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
 Big Data analytics examples include stock exchanges, social media sites, jet engines, etc.
 Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured
 Volume, Variety, Velocity, Variability and Veracity are a few Big Data characteristics
 Improved customer service, better operational efficiency and better decision making are a few advantages of Big Data

MapReduce
MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes.
MapReduce provides analytical capabilities for analyzing huge volumes of complex data.

Why MapReduce?

Traditional Enterprise Systems normally have a centralized server to store and process data. This traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts
and assigns them to many computers. Later, the results are collected at one place and integrated to form the result
dataset.
How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key-value pairs).

 The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into
a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their significance.

 Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed
data to the mapper in the form of key-value pairs.

 Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them
to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.

 Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable
sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the
values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.

 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value
pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily
in the Reducer task.

 Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each
one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a
wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.

 Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from
the Reducer function and writes them onto a file using a record writer.

Let us try to understand the two tasks, Map and Reduce, with the help of a small example −

MapReduce-Example

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following steps show how Twitter-scale tweet data can be managed with the help of MapReduce (a code sketch of the same map-and-reduce pattern follows this list).
The MapReduce algorithm performs the following actions −

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
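The classic token-count program expresses this same pipeline in the standard Hadoop MapReduce Java API. The sketch below is illustrative: the class names are assumptions, and the input/output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TokenCount {

    // Map: tokenize each input line and emit (token, 1) pairs
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce (and Combiner): sum the counts for each token
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "token count");
        job.setJarByClass(TokenCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local aggregation on each mapper
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}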

What is Hadoop
Hadoop is an open source framework from Apache that is used to store, process and analyze data which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes in the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. MapReduce: This is a framework which helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.

4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.

Hadoop Architecture

The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop Distributed File
System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the Job Tracker and the NameNode, whereas each slave node runs a Task Tracker and a DataNode.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, which consists of a single NameNode performing the role of master and multiple DataNodes performing the role of slaves.

Both NameNode and DataNode are capable enough to run on commodity machines. The Java language is used to develop
HDFS. So any machine that supports Java language can easily run the NameNode and DataNode software.

NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and closing files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients (a client-side read/write sketch follows this list).
o It performs block creation, deletion, and replication upon instruction from the NameNode.
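As a hedged illustration of how a client works with the NameNode and DataNodes, the following sketch uses the standard org.apache.hadoop.fs API; the file path and contents are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");   // illustrative path

        // Write: the client asks the NameNode for block locations, then streams to DataNodes
        FSDataOutputStream out = fs.create(path, true);
        out.writeUTF("Hello HDFS");
        out.close();

        // Read: block locations come from the NameNode, data comes from the DataNodes
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}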

Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker
o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.

MapReduce Layer

MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.

Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the ability to replicate data over the network, so if one node is down or some other network failure happens, Hadoop takes another copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open source web
crawler software project.

o While working on Apache Nutch, they were dealing with big data. Storing that data would have required a lot of expenditure, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a proprietary distributed file
system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the data processing on large
clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Hadoop YARN Architecture


YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a “Redesigned Resource Manager” at the time of its launch, but it has now evolved into a large-scale distributed operating system used for Big Data processing.
The YARN architecture basically separates the resource management layer from the processing layer. In Hadoop 2.0, the responsibility of the Hadoop 1.0 Job Tracker is split between the Resource Manager and the per-application Application Master.
YARN also allows different data processing engines like graph processing, interactive processing, stream
processing as well as batch processing to run and process data stored in HDFS (Hadoop Distributed File
System) thus making the system much more efficient. Through its various components, it can dynamically
allocate various resources and schedule the application processing. For large volume data processing, it is
quite necessary to manage the available resources properly so that every application can leverage them.

YARN Features: YARN gained popularity because of the following features-


 Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports existing MapReduce applications without disruption, making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows access by multiple processing engines, giving organizations the benefit of multi-tenancy.
Hadoop YARN Architecture

The main components of YARN architecture include:

 Client: It submits map-reduce jobs.


 Resource Manager: It is the master daemon of YARN and is responsible for resource assignment
and management among all the applications. Whenever it receives a processing request, it forwards
it to the corresponding node manager and allocates resources for the completion of the request
accordingly. It has two major components:
 Scheduler: It performs scheduling based on the submitted applications and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
 Application manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages applications and workflow on that particular node. Its primary job is to keep up-to-date with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management and also kills containers based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to a framework. The application
master is responsible for negotiating resources with the resource manager, tracking the status and
monitoring progress of a single application. The application master requests the container from the
node manager by sending a Container Launch Context(CLC) which includes everything an
application needs to run. Once the application is started, it sends the health report to the resource
manager from time-to-time.
 Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single
node. The containers are invoked by Container Launch Context(CLC) which is a record that
contains information such as environment variables, security tokens, dependencies etc.
Application workflow in Hadoop YARN:

1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application's status (a client-side sketch of this step follows the list)
8. Once the processing is complete, the Application Master un-registers with the Resource Manager
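As a hedged sketch of the client's side of this workflow (an illustration, not part of the original description), the org.apache.hadoop.yarn.client.api.YarnClient API can be used to talk to the Resource Manager, for example to list the applications it is currently tracking:

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml (Resource Manager address) from the classpath
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for reports on all known applications
        List<ApplicationReport> reports = yarnClient.getApplications();
        for (ApplicationReport report : reports) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}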

HADOOP Extra Content :

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across
clusters of computers using simple programming models. The Hadoop framework application works in an environment
that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from
single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers namely −

 Processing/Computation layer (MapReduce), and

 Storage layer (Hadoop Distributed File System).

MapReduce

MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient
processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is an Apache open-
source framework.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file
system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to
be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications
having large datasets.

Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules −

 Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.

 Hadoop YARN − This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many single-CPU commodity computers as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. So the first motivational factor behind using Hadoop is that it runs across clustered, low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs

 Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).

 These files are then distributed across various cluster nodes for further processing.

 HDFS, being on top of the local file system, supervises the processing.

 Blocks are replicated for handling hardware failure.

 Checking that the code was executed successfully.

 Performing the sort that takes place between the map and reduce stages.

 Sending the sorted data to a certain computer.

 Writing the debugging logs for each job.

Advantages of Hadoop
 The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.

 Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

 Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without
interruption.

 Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution
for the Nutch search engine project. Doug, who was working at Yahoo! at the time and is now Chief Architect of
Cloudera, named the project after his son's toy elephant.
