Professional Documents
Culture Documents
Assignment ID Submission
(as per the Deadline
policy Assignment Submission Assessment Group/ Date of (Date and
guidelines) Title Mode Method Individual Weightage Release time)
MS Teams 19/02/2 23/02/2021
CS346_A2 Hadoop Upload Marks Individual 10 Marks 021 5:00 PM
Problems:
Assignment
5. What are different kinds of NoSQL databases? What are advantages of each?
Solution : -
2) MapReduce in simple terms can be explained as a programming model that allows the
scalability of multiple servers in a Hadoop cluster. It can be used to write applications to
process huge amounts of data in parallel on clusters of commodity hardware. It is
inspired by the map and reduce functions commonly used in functional programming.
A MapReduce program usually executes in three stages — map stage, shuffle stage and
reduce stage.
Map stage – In this stage, the input data is split into fixed sized pieces known as
input splits. Each input split is passed through a mapping function to produce
output values.
Shuffle stage – The output values from the map stage is consolidated in the next
stage, which is the shuffle stage.
Reduce stage – The output from the shuffle stage is are combined to return a
single value and is stored in the HDFS.
a) HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data across
various nodes and thereby maintaining the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
Name Node is the prime node which contains metadata (data about data) requiring
comparatively fewer resources than the data nodes that stores the actual data.
These data nodes are commodity hardware in the distributed environment.
Undoubtedly, making Hadoop cost effective.
HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
b) PIG:
Pig was basically developed by Yahoo which works on a pig Latin language,
which is Query based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data
sets.
Pig does the work of executing commands and in the background, all the activities
of MapReduce are taken care of. After the processing, pig stores the result in
HDFS.
Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
Assignment
c) HIVE:
With the help of SQL methodology and interface, HIVE performs reading and
writing of large data sets. However, its query language is called as HQL (Hive
Query Language).
It is highly scalable as it allows real-time processing and batch processing both.
Also, all the SQL datatypes are supported by Hive thus, making the query
processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage
permissions and connection whereas HIVE Command line helps in the processing
of queries.
d)Mahout:
Mahout, allows Machine Learnability to a system or application. Machine
Learning, as the name suggests helps the system to develop itself based on some
patterns, user/environmental interaction or om the basis of algorithms.
It provides various libraries or functionalities such as collaborative filtering,
clustering, and classification which are nothing but concepts of Machine learning.
It allows invoking algorithms as per our need with the help of its own libraries.
e) Apache HBase:
It’s a NoSQL database which supports all kinds of data and thus capable of
handling anything of Hadoop Database. It provides capabilities of Google’s
BigTable, thus able to work on Big Data sets effectively.
At times where we need to search or retrieve the occurrences of something small
in a huge database, the request must be processed within a short quick span of
time. At such times, HBase comes handy as it gives us a tolerant way of storing
limited data.
4) a NoSQL Database is a non-relational Data Management System, that does not require
a fixed schema. It avoids joins, and is easy to scale. The major purpose of using a NoSQL
database is for distributed data stores with humongous data storage needs. NoSQL is used
for Big data and real-time web apps. For example, companies like Twitter, Facebook and
Assignment
Google collect terabytes of user data every single day. NoSQL database stands for "Not
Only SQL" or "Not SQL."
Non-relational
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Schema-free
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain
Distributed
Multiple NoSQL databases can be executed in a distributed fashion
Offers auto-scaling and fail-over capabilities
Often ACID concept can be sacrificed for scalability and throughput
Doesn’t require a pre-defined schema: NoSQL does not require any adherence to pre-
defined schema. It is pretty flexible. For example, if we look at MongoDB, the
documents (equivalent of records in RDBMS) in a collection (equivalent of table in
RDBMS) can have different sets of key-value pairs.
Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of
scale, high availability, fault tolerance, etc. while also lowering operational costs.
Assignment
Relaxes the data consistency requirement : NoSQL databases have adherence to CAP
theorem (Consistency, Availability, and Partition Tolerance). Most of the NoSQL
databases compromise on consistency in favor of availability and partition tolerance.
However, they do go for eventual consistency.
Data can be replicated to multiple nodes and can be partitioned: There are two terms
that we will discuss here:
Sharding: Sharding is when different pieces of data are distributed across
multipleservers. NoSQL databases support auto-sharding meaning they can natively and
automatically spread data across an arbitrary number of servers, without requiring the
application to even be aware of the composition of the server pool. Servers can be added
or removed from the data layer with out application downtime. This would mean that
data and query load are automatically balanced across servers.
Replication:Replication is when multiple copies of data are stored across the cluster and
even across datacenters. This promises high availability and fault tolerance.
5) NoSQL Databases are mainly categorized into four types: Key-value pair, Column-
oriented, Graph-based and Document-oriented. Every category has its unique attributes
and limitations.
Column-based
Column-oriented databases work on columns and are based on BigTable paper by
Google. Every column is treated separately. Values of single column databases are stored
contiguously.
Assignment
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN
etc. as the data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, Library card catalogs,
HBase, Cassandra, HBase, Hypertable are NoSQL query examples of column based
database.
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value
part is stored as a document. The document is stored in JSON or XML formats. The value
is understood by the DB and can be queried.
The document type is mostly used for CMS systems, blogging platforms, real-time
analytics & e-commerce applications. It should not use for complex transactions which
require multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, Lotus Notes, MongoDB, are popular
Document originated DBMS systems.
Graph-Based
A graph type database stores entities as well the relations amongst those entities. The
entity is stored as a node with the relationship as edges. An edge gives a relationship
between nodes. Every node and edge has a unique identifier.
Compared to a relational database where tables are loosely connected, a Graph database
is a multi-relational in nature. Traversing relationship is fast as they are already captured
into the DB, and there is no need to calculate them.
Graph base database mostly used for social networks, logistics, spatial data.
Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based databases.