
Name: Aryan Shah

Roll No. – 201310132112

BDA Assignment - I

Q.1 Explain different types of NoSQL data architecture patterns.

Answer: Traditional RDBMSs use SQL syntax to store and retrieve data from relational databases. NoSQL databases, by contrast, use data models with a different structure than the traditional row-and-column table model used with relational database management systems (RDBMSs). A NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data. The four main NoSQL data architecture patterns are:

1. Key-Value Pair Oriented
2. Document Oriented
3. Column Oriented
4. Graph Oriented

(i) Key-Value Pair Oriented


• Key-value stores are the simplest type of NoSQL database. Data is stored in key/value pairs.
• The attribute name is stored in the ‘key’, while the value corresponding to that key is held in the ‘value’.
• In key-value store databases, the key can only be a string, whereas the value can be a string, JSON, XML, a Blob, etc. Because of this simple model, key-value stores are capable of handling massive data volumes and loads.
• Typical use cases of key-value stores are storing user preferences, user profiles, shopping carts, etc.
• DynamoDB, Riak and Redis are a few famous examples of key-value store NoSQL databases; a short sketch with Redis follows this list.
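
For illustration, a minimal sketch using the redis-py client. It assumes a Redis server is reachable at localhost:6379; the key name and profile contents are invented for the example.

```python
# Minimal key-value sketch using the redis-py client.
# Assumes a Redis server is reachable at localhost:6379.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a user profile: the key is a string, the value here is a JSON string.
r.set("user:1001:profile", json.dumps({"name": "Aryan", "theme": "dark"}))

# Retrieval is a direct lookup on the key -- there is no query language.
profile = json.loads(r.get("user:1001:profile"))
print(profile["theme"])  # -> "dark"
```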

(ii) Document Oriented


• Document databases use key-value pairs to store and retrieve data from documents.
• A document is stored in a form such as XML or JSON.
• Data is stored as a value, and its associated key is the unique identifier for that value.
• The difference from a plain key-value store is that, in a document database, the value contains structured or semi-structured data.
• This structured/semi-structured value is referred to as a document and can be in XML, JSON or BSON format.

• Examples of document databases are MongoDB, OrientDB, Apache CouchDB, IBM Cloudant, CrateDB, BaseX, and many more. A short sketch with MongoDB follows this list.
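
As a concrete illustration, a minimal sketch using the pymongo client. It assumes a MongoDB server at localhost:27017; the database, collection and document contents are invented for the example.

```python
# Minimal document-store sketch using pymongo.
# Assumes a MongoDB server is reachable at localhost:27017.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# The value is a structured document; MongoDB stores it internally as BSON.
orders.insert_one({
    "order_id": "A-17",
    "customer": {"name": "Aryan", "city": "Surat"},
    "items": [{"sku": "pen", "qty": 3}],
})

# Unlike a plain key-value store, fields inside the value can be queried.
doc = orders.find_one({"customer.city": "Surat"})
print(doc["order_id"])  # -> "A-17"
```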

(iii) Column Oriented


• Column-oriented databases work on columns and are based on the BigTable paper by Google.
• Every column is treated separately, and the values of a single column are stored contiguously.
• They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the data is readily available in a column.
• Column-based NoSQL databases are widely used for data warehouses, business intelligence, CRM, and library card catalogs. A toy sketch of the columnar idea follows this list.
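
To show why aggregations are cheap, here is a toy sketch in plain Python (no actual column store is used; the table contents are invented).

```python
# Toy sketch of columnar layout: each column is stored contiguously,
# instead of each record being stored together as in a row store.
rows = [
    {"id": 1, "city": "Surat", "amount": 120},
    {"id": 2, "city": "Pune",  "amount": 250},
    {"id": 3, "city": "Surat", "amount": 80},
]

# Column-oriented layout: one contiguous list per column.
columns = {
    "id":     [r["id"] for r in rows],
    "city":   [r["city"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# SUM/AVG only need to scan the single 'amount' column,
# rather than reading every full row.
print(sum(columns["amount"]))                           # SUM -> 450
print(sum(columns["amount"]) / len(columns["amount"]))  # AVG -> 150.0
```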

(iv) Graph Oriented


• Graph databases store data together with the relationships between data elements.
• Each element/data item is stored in a node, and that node is linked to other data/elements.
• A typical example of a graph database use case is Facebook, which holds the relationship between each user and their further connections.
• Graph databases help search the connections between data elements and link one part to various parts directly or indirectly.
• Graph databases can be used in social media, fraud detection, and knowledge graphs. Examples of graph databases are Neo4J, Infinite Graph, OrientDB, FlockDB, etc. A toy sketch of the node/edge model follows this list.
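
To illustrate the node/edge model, here is a toy sketch in plain Python, not a real graph database; the names and friendships are invented.

```python
# Toy sketch of a graph: nodes are people, edges are friendships.
friends = {  # adjacency list: node -> directly connected nodes
    "Aryan":  ["Bela", "Chirag"],
    "Bela":   ["Aryan", "Dev"],
    "Chirag": ["Aryan"],
    "Dev":    ["Bela"],
}

def connections(start, depth):
    """Collect nodes reachable from `start` within `depth` hops."""
    seen, frontier = {start}, {start}
    for _ in range(depth):
        frontier = {n for f in frontier for n in friends[f]} - seen
        seen |= frontier
    return seen - {start}

# Direct friends plus friends-of-friends, the kind of query a social
# network runs constantly and that graph databases make cheap.
print(connections("Aryan", 2))  # -> {'Bela', 'Chirag', 'Dev'}
```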

Q.2 Explain working of various phases of MapReduce with appropriate example and diagram.

Answer:

Input Phase: We have a Record Reader that translates each record in an input file and sends the parsed
data to the mapper in the form of key-value pairs.

Map: It is a user-defined function, which takes a series of key-value pairs and processes each one of
them to generate zero or more key-value pairs.

Intermediate Keys: The key-value pairs generated by the mapper are known as intermediate keys.

Combiner: A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not a part of the main MapReduce algorithm and is optional; a small sketch follows.
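
A toy combiner sketch in plain Python (the mapper output shown is invented): it locally pre-aggregates one mapper's pairs before they are shuffled across the network.

```python
# Combiner: sum counts per key within one mapper's output only.
from collections import Counter

mapper_output = [("deer", 1), ("bear", 1), ("deer", 1)]  # from a single mapper

combined = Counter()
for key, value in mapper_output:
    combined[key] += value

print(list(combined.items()))  # [('deer', 2), ('bear', 1)] -- fewer pairs to shuffle
```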

Shuffle and Sort: The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-
value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are
sorted by key into a larger data list. The data list groups the equivalent keys together so that their values
can be iterated easily in the Reducer task.

Reducer: The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each group. Here, the data can be aggregated, filtered, and combined in a number of ways, and it supports a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.

Output Phase: In the output phase, we have an output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file using a record writer.
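
Putting the phases together, here is a word-count sketch in plain Python (not an actual Hadoop job) that mimics the input, map, shuffle-and-sort, reduce, and output phases; the input records are invented.

```python
# Word-count sketch mimicking the MapReduce phases described above.
from collections import defaultdict

def mapper(record):
    """Map: one input record -> zero or more intermediate (key, value) pairs."""
    for word in record.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """Reduce: all values for one key -> a final (key, value) pair."""
    return (key, sum(values))

records = ["Deer Bear River", "Car Car River", "Deer Car Bear"]  # input phase

# Map phase: intermediate keys from every record.
intermediate = [pair for rec in records for pair in mapper(rec)]

# Shuffle and sort: group the values of equivalent keys together, sorted by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase, then the output phase emits the final pairs.
for key in sorted(groups):
    print(reducer(key, groups[key]))  # ('bear', 2), ('car', 3), ('deer', 2), ('river', 2)
```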

Q.3 Draw and explain HDFS architecture.

Answer: The Hadoop Distributed File System (HDFS) was developed using distributed file system design and runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware. HDFS holds a very large amount of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

1. Name Node

• It is the master daemon that maintains and manages the DataNodes (slave nodes).
• It records the metadata of all the blocks stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc.
• It records each and every change that takes place to the file system metadata; for example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
• It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
• It keeps a record of all the blocks in HDFS and of the DataNodes on which they are stored.
• It has high availability and federation features, which are discussed in detail as part of the HDFS architecture.
• The NameNode is software that can be run on commodity hardware; the machine hosting it contains the GNU/Linux operating system and the NameNode software, and acts as the master server.
• It also executes file system operations such as renaming, closing, and opening files and directories.

2. Data Node

• It is the slave daemon/process which runs on each slave machine.
• The actual data is stored on the DataNodes.
• It is responsible for serving read and write requests from the clients.
• It is also responsible for creating blocks, deleting blocks, and replicating them based on the decisions taken by the NameNode.
• It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.

3. Secondary Node

• The Secondary NameNode works concurrently with the primary NameNode as a helper daemon/process.
• It constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with the FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
• The new FsImage is copied back to the NameNode, which uses it the next time the NameNode is started.
• Hence, the Secondary NameNode performs regular checkpoints in HDFS; therefore, it is also called the CheckpointNode. A toy sketch of the checkpoint idea follows this list.
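
A toy sketch in plain Python (not actual HDFS code) of the checkpoint idea: the EditLog entries are replayed onto the last FsImage snapshot to produce a new, up-to-date FsImage. The paths and operations shown are invented.

```python
# Checkpoint sketch: apply EditLog entries to the FsImage snapshot.
fsimage = {"/data/a.txt": {"size": 10}}           # last on-disk snapshot
editlog = [("create", "/data/b.txt", {"size": 5}),
           ("delete", "/data/a.txt", None)]       # changes since the snapshot

for op, path, meta in editlog:
    if op == "create":
        fsimage[path] = meta
    elif op == "delete":
        fsimage.pop(path, None)

print(fsimage)  # new FsImage: {'/data/b.txt': {'size': 5}}
```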

4. Blocks

• A block is the minimum amount of data that HDFS can read or write.
• HDFS blocks are 128 MB by default, and this is configurable.
• Files in HDFS are broken into block-sized chunks, which are stored as independent units. A toy sketch of this splitting follows this list.
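
A toy sketch in plain Python (not HDFS code) of how a file is divided into block-sized chunks; the file size is invented.

```python
# Split a file of a given size into HDFS-style blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default (configurable)

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) for each chunk of the file."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block,
# each stored (and replicated) as an independent unit.
print(split_into_blocks(300 * 1024 * 1024))
```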

Q.4 Explain role of Zookeeper.

Answer:

• There was a huge problem with managing coordination and synchronization among the resources or components of Hadoop, which resulted in inconsistency.
• Before Zookeeper, it was very difficult and time-consuming to coordinate between different services in the Hadoop ecosystem.
• The services earlier had many problems with interactions, such as sharing a common configuration while synchronizing data.
• Even when the services are configured, changes in the configurations of the services make the system complex and difficult to handle.
• Grouping and naming of services were also time-consuming.
• Zookeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance. A minimal client sketch follows this list.
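
A minimal sketch of the shared-configuration role, using the kazoo ZooKeeper client for Python. It assumes a ZooKeeper ensemble at localhost:2181; the znode path and configuration value are invented.

```python
# Shared-configuration sketch using the kazoo ZooKeeper client.
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

# Publish a piece of common configuration at a well-known znode path.
zk.ensure_path("/app/config")
zk.set("/app/config", b"replication=3")

# Any service in the cluster can watch the same znode and react to changes;
# this is how ZooKeeper keeps components coordinated and synchronized.
@zk.DataWatch("/app/config")
def on_config_change(data, stat):
    print("config is now:", data.decode())

zk.stop()
```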

Q.5 Difference between Master Slave and Peer to Peer Architecture.

Answer:
Centralized Control
Master-Slave: In a master-slave architecture, there is a clear distinction between the master (or primary) and the slaves (or secondary nodes). The master node has centralized control and authority over the slave nodes.
Peer-to-Peer: In a peer-to-peer architecture, there is no central authority or master node. All nodes (peers) in the network are considered equal, and each can communicate directly with any other peer.

Role Hierarchy
Master-Slave: The master node is responsible for making critical decisions and managing the overall system, while the slave nodes follow the instructions given by the master. This hierarchy simplifies system management.
Peer-to-Peer: All peers in a P2P network have the same level of authority and decision-making power. There is no central point of control, which can make the system more robust.

Scalability
Master-Slave: It can be less scalable than peer-to-peer architectures because the master node can become a bottleneck as the system grows. Scalability depends on the master's capacity to handle requests.
Peer-to-Peer: P2P architectures are highly scalable because adding new peers to the network does not affect the overall system's performance or capacity.

Fault Tolerance
Master-Slave: Master-slave architectures can provide good fault tolerance, as slave nodes can take over if the master node fails. However, there may be some downtime during the transition.
Peer-to-Peer: P2P networks can be more resilient in the face of failures because there is no single point of failure. If one peer goes offline, the network can still function using the remaining peers.

Use Cases
Master-Slave: Master-slave architectures are common in database replication, content delivery networks, and backup systems, where data consistency and centralized control are essential.
Peer-to-Peer: P2P architectures are commonly used in file sharing, content distribution (e.g., BitTorrent), and decentralized systems like blockchain networks. They are also well-suited for scenarios where there is no need for centralized control.
