You are on page 1of 8

Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

Assignment ID Submission
(as per the Deadline
policy Assignment Submission Assessment Group/ Date of (Date and
guidelines) Title Mode Method Individual Weightage Release time)
MS Teams 19/02/2 23/02/2021
CS346_A2 Hadoop Upload Marks Individual 10 Marks 021 5:00 PM

Instructions (Sample provided below, please change as necessary):

 Assignment must be submitted by the Due Dateand Time as mentioned above.


 Assignment submitted after Due Dateand Time and before the next 48 hours will
be marked late and will attract a penalty of Xmarks (out of the overall Ymarks,
and it will be evaluated out of Y-Xmarks only). Assignment will not be
considered for evaluation subsequently (after 48 hours past due date and time),
and a score of zero will be awarded.
 Plagiarism is not allowed by the University for any Academic Document to be
submitted by the students for any assessment. In order to avoid plagiarism,
ensure you always follow good academic practice. This include self- plagiarism
i.e. submitting a piece of your own work which has provisionally been presented
for examination.
 Submittedassignment must have your Full Name and SAP ID in the space
provided above this page in the Header.

Submitting this Assignment

 You will submit (upload) this assignment in MS Teams.


 Email/paper submissions will not be accepted (except for UG students who are
not yet registered in MS Teams).
 Questions must be answered in the given order.
 Submit a pdf version of this document.
 Name this document as A1_CSD207_Even2020_John_Doe.pdf in case your name
is John Doe, and you are submitting Assignment 1 of the course whose code is
CSD207, and it is offered in the Even Semester of the Year 2020.

Problems:
Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

1. Define Hadoop. Draw the architecture of Hadoop Ecosystem.

2. What is Map Reduce? What are its benefits?

3. Explain working of any fivecomponents of Hadoop Ecosystem.

4. What is NoSQL database? Write its characteristics & advantages.

5. What are different kinds of NoSQL databases? What are advantages of each?

Solution : -

1) Hadoop is an open source distributed processing framework that manages data


processing and storage for big data applications in scalable clusters of computer servers.
It's at the center of an ecosystem of big data technologies that are primarily used to
support advanced analytics initiatives, including predictive analytics, data mining
and machine learning. Hadoop systems can handle various forms of structured and
unstructured data, giving users more flexibility for collecting, processing, analyzing
and managing data than relational databases and data warehouses provide.

High level Hadoop Architecture


Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

Various Components of hadoop ecosystem.

2) MapReduce in simple terms can be explained as a programming model that allows the
scalability of multiple servers in a Hadoop cluster. It can be used to write applications to
process huge amounts of data in parallel on clusters of commodity hardware. It is
inspired by the map and reduce functions commonly used in functional programming.
A MapReduce program usually executes in three stages — map stage, shuffle stage and
reduce stage.
 Map stage – In this stage, the input data is split into fixed sized pieces known as
input splits. Each input split is passed through a mapping function to produce
output values.
 Shuffle stage – The output values from the map stage is consolidated in the next
stage, which is the shuffle stage.
 Reduce stage – The output from the shuffle stage is are combined to return a
single value and is stored in the HDFS.

Some of the various advantages of Hadoop MapReduce are:


 Scalability – The biggest advantage of MapReduce is its level of scalability,
which is very high and can scale across thousands of nodes.
Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

 Parallel nature – One of the other major strengths of MapReduce is that it is


parallel in nature. It is best to work with both structured and unstructured data at
the same time.
 Memory requirements – MapReduce does not require large memory as compared
to other Hadoop ecosystems. It can work with minimal amount of memory and
still produce results quickly.
 Cost reduction – As MapReduce is highly scalable, it reduces the cost of storage
and processing in order to meet the growing data requirements.

3) Following are 5 important components of hadoop.

a) HDFS:
 HDFS is the primary or major component of Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data across
various nodes and thereby maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
 Name Node is the prime node which contains metadata (data about data) requiring
comparatively fewer resources than the data nodes that stores the actual data.
These data nodes are commodity hardware in the distributed environment.
Undoubtedly, making Hadoop cost effective.
 HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.

b) PIG:
 Pig was basically developed by Yahoo which works on a pig Latin language,
which is Query based language similar to SQL.
 It is a platform for structuring the data flow, processing and analyzing huge data
sets.
 Pig does the work of executing commands and in the background, all the activities
of MapReduce are taken care of. After the processing, pig stores the result in
HDFS.
 Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

c) HIVE:
 With the help of SQL methodology and interface, HIVE performs reading and
writing of large data sets. However, its query language is called as HQL (Hive
Query Language).
 It is highly scalable as it allows real-time processing and batch processing both.
Also, all the SQL datatypes are supported by Hive thus, making the query
processing easier.
 Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.
 JDBC, along with ODBC drivers work on establishing the data storage
permissions and connection whereas HIVE Command line helps in the processing
of queries.

d)Mahout:
 Mahout, allows Machine Learnability to a system or application. Machine
Learning, as the name suggests helps the system to develop itself based on some
patterns, user/environmental interaction or om the basis of algorithms.
 It provides various libraries or functionalities such as collaborative filtering,
clustering, and classification which are nothing but concepts of Machine learning.
It allows invoking algorithms as per our need with the help of its own libraries.

e) Apache HBase:
 It’s a NoSQL database which supports all kinds of data and thus capable of
handling anything of Hadoop Database. It provides capabilities of Google’s
BigTable, thus able to work on Big Data sets effectively.
 At times where we need to search or retrieve the occurrences of something small
in a huge database, the request must be processed within a short quick span of
time. At such times, HBase comes handy as it gives us a tolerant way of storing
limited data.

4) a NoSQL Database is a non-relational Data Management System, that does not require
a fixed schema. It avoids joins, and is easy to scale. The major purpose of using a NoSQL
database is for distributed data stores with humongous data storage needs. NoSQL is used
for Big data and real-time web apps. For example, companies like Twitter, Facebook and
Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

Google collect terabytes of user data every single day. NoSQL database stands for "Not
Only SQL" or "Not SQL."

Few characteristics of nosql are –

Non-relational
 NoSQL databases never follow the relational model
 Never provide tables with flat fixed-column records
 Work with self-contained aggregates or BLOBs

Schema-free
 NoSQL databases are either schema-free or have relaxed schemas
 Do not require any sort of definition of the schema of the data
 Offers heterogeneous structures of data in the same domain

Distributed
 Multiple NoSQL databases can be executed in a distributed fashion
 Offers auto-scaling and fail-over capabilities
 Often ACID concept can be sacrificed for scalability and throughput

Some advantages of nosql are –

Can easily scale up and down :


No SQL data base supports scaling rapidly and elastically and even allows to scale to the
cloud.
Clusterscale : It allows distribution of database across 100+ nodes often in multiple
datacenters
Performance scale : It sustains over 100,000+ database reads and writes per second.
Datascale : It supports housing of 1 billion+ documents in the database.

Doesn’t require a pre-defined schema: NoSQL does not require any adherence to pre-
defined schema. It is pretty flexible. For example, if we look at MongoDB, the
documents (equivalent of records in RDBMS) in a collection (equivalent of table in
RDBMS) can have different sets of key-value pairs.

Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of
scale, high availability, fault tolerance, etc. while also lowering operational costs.
Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

Relaxes the data consistency requirement : NoSQL databases have adherence to CAP
theorem (Consistency, Availability, and Partition Tolerance). Most of the NoSQL
databases compromise on consistency in favor of availability and partition tolerance.
However, they do go for eventual consistency.

Data can be replicated to multiple nodes and can be partitioned: There are two terms
that we will discuss here:
Sharding: Sharding is when different pieces of data are distributed across
multipleservers. NoSQL databases support auto-sharding meaning they can natively and
automatically spread data across an arbitrary number of servers, without requiring the
application to even be aware of the composition of the server pool. Servers can be added
or removed from the data layer with out application downtime. This would mean that
data and query load are automatically balanced across servers.

Replication:Replication is when multiple copies of data are stored across the cluster and
even across datacenters. This promises high availability and fault tolerance.

5) NoSQL Databases are mainly categorized into four types: Key-value pair, Column-
oriented, Graph-based and Document-oriented. Every category has its unique attributes
and limitations.

Advantages of these types are as follows :-

Key Value Pair Based


Data is stored in key/value pairs. It is designed in such a way to handle lots of data and
heavy load.
Key-value pair storage databases store data as a hash table where each key is unique, and
the value can be a JSON, BLOB(Binary Large Objects), string, etc.
It is one of the most basic NoSQL database example. This kind of NoSQL database is
used as a collection, dictionaries, associative arrays, etc. Key value stores help the
developer to store schema-less data. They work best for shopping cart contents.
Redis, Dynamo, Riak are some NoSQL examples of key-value store DataBases. They are
all based on Amazon's Dynamo paper.

Column-based
Column-oriented databases work on columns and are based on BigTable paper by
Google. Every column is treated separately. Values of single column databases are stored
contiguously.
Assignment

Name of Student: Aditya Mishra SAP ID: 1000011196

They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN
etc. as the data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, Library card catalogs,
HBase, Cassandra, HBase, Hypertable are NoSQL query examples of column based
database.

Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value
part is stored as a document. The document is stored in JSON or XML formats. The value
is understood by the DB and can be queried.
The document type is mostly used for CMS systems, blogging platforms, real-time
analytics & e-commerce applications. It should not use for complex transactions which
require multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, Lotus Notes, MongoDB, are popular
Document originated DBMS systems.

Graph-Based
A graph type database stores entities as well the relations amongst those entities. The
entity is stored as a node with the relationship as edges. An edge gives a relationship
between nodes. Every node and edge has a unique identifier.
Compared to a relational database where tables are loosely connected, a Graph database
is a multi-relational in nature. Traversing relationship is fast as they are already captured
into the DB, and there is no need to calculate them.
Graph base database mostly used for social networks, logistics, spatial data.
Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based databases.

You might also like