Aryan BDA Assignment
BDA Assignment - I
Answer: A traditional RDBMS uses SQL syntax to store and retrieve data, following the row-and-column table model of relational database management systems (RDBMSs). NoSQL databases, by contrast, use data models with a different structure from this relational model. The term encompasses a wide range of database technologies that can store structured, semi-structured, unstructured, and polymorphic data.
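As a small illustration of the schema flexibility described above, a document store can hold records whose fields differ from one another. The "users" collection and every record in it below are invented for the example; a plain list of dicts stands in for the database:

```python
# In a document store, records in the same collection need not share a schema.
# This list of dicts is a stand-in for a hypothetical "users" collection.
users = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},            # structured
    {"id": 2, "name": "Ravi", "tags": ["admin", "beta"]},              # semi-structured
    {"id": 3, "name": "Mei", "profile": {"bio": "...", "links": []}},  # nested / polymorphic
]

# Queries must tolerate missing fields, unlike a fixed row-and-column table.
admins = [u["name"] for u in users if "admin" in u.get("tags", [])]
print(admins)  # ['Ravi']
```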
Answer:
Map: It is a user-defined function, which takes a series of key-value pairs and processes each one of
them to generate zero or more key-value pairs.
Intermediate Keys: The key-value pairs generated by the mapper are known as intermediate keys.
Combiner: A combiner is a type of local Reducer that groups similar data from the map phase into
identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined
code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce
algorithm; it is optional.
Shuffle and Sort: The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-
value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are
sorted by key into a larger data list. The data list groups the equivalent keys together so that their values
can be iterated easily in the Reducer task.
Reducer: The Reducer takes the grouped key-value pairs as input and runs a reducer function on each group. Here the data can be aggregated, filtered, and combined in a number of ways, which may require a wide range of processing. Once execution is over, it passes zero or more key-value pairs to the final step.
Output Phase: In the output phase, we have an output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file using a record writer.
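The phases above can be sketched in miniature with a word-count example. Everything here (the function names, the sample lines) is illustrative and not Hadoop's actual API; it only mirrors the flow of map, combine, shuffle-and-sort, and reduce:

```python
from collections import defaultdict
from itertools import groupby

# Map: emit a (word, 1) pair for every word in a line.
def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

# Combiner: local aggregation within the scope of one mapper.
def combiner(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts.items()

# Shuffle and sort: sort intermediate pairs by key and group equal keys.
def shuffle_and_sort(all_pairs):
    for key, group in groupby(sorted(all_pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

# Reduce: aggregate the value list for each key (here, a sum).
def reducer(key, values):
    return key, sum(values)

lines = ["big data big ideas", "data pipelines"]
intermediate = [p for line in lines for p in combiner(mapper(line))]
result = dict(reducer(k, vs) for k, vs in shuffle_and_sort(intermediate))
print(result)  # {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

Note how the combiner shrinks the first line's four pairs to three before the shuffle, which is exactly the network saving a real combiner provides.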
Answer: The Hadoop Distributed File System (HDFS) was developed using distributed file system design and runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware. HDFS holds very large amounts of data and provides easy access. To store such huge data, files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
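The splitting and redundant storage described above can be sketched as follows. This is a toy illustration, not the real HDFS API; the block size, replication factor, and DataNode names are all made up for the example (real HDFS defaults to 128 MB blocks and replication 3, and its placement also considers rack topology):

```python
# Toy illustration: split a file into fixed-size blocks and place each
# block on several DataNodes for redundancy.
BLOCK_SIZE = 4    # bytes, for illustration only
REPLICATION = 3

datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Simple round-robin placement across DataNodes.
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs!")
layout = place_blocks(blocks, datanodes)
print(blocks)  # [b'hell', b'o hd', b'fs!']
print(layout)  # block index -> the three DataNodes holding a replica
```

With replication 3, any single DataNode can fail and every block still has two surviving copies, which is the "redundant fashion" the answer refers to.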
1. NameNode
• It is the master daemon that maintains and manages the DataNodes (slave nodes)
• It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
• It records each and every change that takes place to the file system metadata
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
• It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are alive.
• It keeps a record of all the blocks in the HDFS and DataNode in which they are stored
• It has high availability and federation features, which are discussed in detail under HDFS architecture.
• The NameNode runs on commodity hardware with the GNU/Linux operating system and the NameNode software.
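The NameNode's bookkeeping of block locations, heartbeats, and block reports can be sketched in a few lines. The class and method names below are invented for the sketch and simplified from Hadoop's internals; the heartbeat timeout is likewise an arbitrary example value (Hadoop's is configurable):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; example value only

# Minimal NameNode-like registry: tracks which DataNodes hold which
# blocks, and which DataNodes have sent a recent heartbeat.
class NameNodeRegistry:
    def __init__(self):
        self.block_locations = {}  # block_id -> set of DataNode ids
        self.last_heartbeat = {}   # datanode_id -> timestamp

    def heartbeat(self, datanode_id, now=None):
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def block_report(self, datanode_id, block_ids):
        # A block report tells the NameNode which blocks a DataNode stores.
        for b in block_ids:
            self.block_locations.setdefault(b, set()).add(datanode_id)

    def live_datanodes(self, now=None):
        now = now if now is not None else time.time()
        return {d for d, t in self.last_heartbeat.items()
                if now - t < HEARTBEAT_TIMEOUT}

nn = NameNodeRegistry()
nn.heartbeat("dn1", now=100.0)
nn.heartbeat("dn2", now=95.0)   # stale: no heartbeat within the timeout
nn.block_report("dn1", ["blk_1", "blk_2"])
print(nn.live_datanodes(now=105.0))  # {'dn1'}
```

A DataNode whose heartbeat lapses is treated as dead, and the block-location map tells the NameNode which blocks must then be re-replicated.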
2. Data Node
3. Secondary NameNode
• The Secondary NameNode works concurrently with the primary NameNode as a helper
daemon/process.
• It constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with the FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
• The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.
• Hence, Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called
CheckpointNode.
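The checkpoint step can be sketched with simplified data structures: the FsImage as a snapshot of the namespace and the EditLog as the list of changes made since that snapshot. The paths, operations, and `checkpoint` function below are invented for the illustration:

```python
# FsImage: snapshot of the namespace. EditLog: changes since that snapshot.
fsimage = {"/data/a.txt": {"size": 10}}
editlog = [
    ("create", "/data/b.txt", {"size": 20}),
    ("delete", "/data/a.txt", None),
]

def checkpoint(fsimage, editlog):
    # Apply every logged edit to a copy of the image -> new FsImage.
    image = dict(fsimage)
    for op, path, meta in editlog:
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image

new_fsimage = checkpoint(fsimage, editlog)
print(new_fsimage)  # {'/data/b.txt': {'size': 20}}
```

After such a checkpoint, the EditLog can be truncated and the merged FsImage shipped back to the NameNode, which keeps NameNode restarts fast.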
4. Blocks
Answer:
• Managing coordination and synchronization among the resources and components of Hadoop was a major issue, and it resulted in inconsistency.
• Before ZooKeeper, it was very difficult and time-consuming to coordinate between different services in the Hadoop ecosystem.
• Services had many problems interacting, for example sharing a common configuration while synchronizing data.
• Even once services were configured, changes to their configurations were complex and difficult to handle.
• Grouping and naming services was also time-consuming.
• ZooKeeper overcame all these problems by providing synchronization, inter-component communication, grouping, and maintenance.
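The shared-configuration idea behind ZooKeeper can be sketched with a tiny in-memory znode tree. This is not the real ZooKeeper client API; the class, the watch mechanism, and the `/config/db_url` path are all invented to show how many services can read one coordinated source of configuration and be notified when it changes:

```python
# Toy znode tree: a hierarchical key-value store that several services
# read for shared configuration, with watches for change notification.
class ZNodeTree:
    def __init__(self):
        self.nodes = {}     # path -> data
        self.watchers = {}  # path -> list of callbacks

    def set(self, path, data):
        self.nodes[path] = data
        # Fire one-shot watches so services see the change promptly.
        for cb in self.watchers.pop(path, []):
            cb(path, data)

    def get(self, path):
        return self.nodes.get(path)

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

tree = ZNodeTree()
seen = []
tree.watch("/config/db_url", lambda p, d: seen.append(d))
tree.set("/config/db_url", "jdbc:mysql://db:3306/app")
print(tree.get("/config/db_url"))  # jdbc:mysql://db:3306/app
print(seen)                        # ['jdbc:mysql://db:3306/app']
```

Because every service reads the same tree instead of its own copy of the configuration, the inconsistency problems listed above disappear.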
Answer:
Centralized Control
  Master-Slave: In a master-slave architecture there is a clear distinction between the master (or primary) node and the slave (or secondary) nodes. The master node has centralized control and authority over the slave nodes.
  Peer-to-Peer: In a peer-to-peer architecture there is no central authority or master node. All nodes (peers) in the network are considered equal, and each can communicate directly with any other peer.

Role Hierarchy
  Master-Slave: The master node is responsible for making critical decisions and managing the overall system, while the slave nodes follow the instructions given by the master. This hierarchy simplifies system management.
  Peer-to-Peer: All peers in a P2P network have the same level of authority and decision-making power. There is no central point of control, which can make the system more robust.

Scalability
  Master-Slave: It can be less scalable than peer-to-peer architectures because the master node can become a bottleneck as the system grows. Scalability depends on the master's capacity to handle requests.
  Peer-to-Peer: P2P architectures are highly scalable because adding new peers to the network does not degrade the overall system's performance or capacity.

Fault Tolerance
  Master-Slave: Master-slave architectures can provide fault tolerance, as a slave node can take over if the master node fails; however, there may be some downtime during the transition.
  Peer-to-Peer: P2P networks can be more resilient in the face of failures because there is no single point of failure. If one peer goes offline, the network can still function using the remaining peers.

Use Cases
  Master-Slave: Common in database replication, content delivery networks, and backup systems, where data consistency and centralized control are essential.
  Peer-to-Peer: Commonly used in file sharing, content distribution (e.g., BitTorrent), and decentralized systems like blockchain networks. P2P is also well suited to scenarios where there is no need for centralized control.
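The "centralized control" contrast can be sketched by routing the same write through each model. The function names and the dict-per-node stores below are invented purely for illustration:

```python
# Master-slave: every write goes through the single master, which then
# replicates it to the slaves (one authority, one bottleneck).
def master_slave_write(master, slaves, key, value):
    master[key] = value
    for s in slaves:
        s[key] = value

# Peer-to-peer: any peer accepts the write and propagates it to the
# others (every node equal, no central point of failure).
def p2p_write(peers, origin, key, value):
    peers[origin][key] = value
    for name, store in peers.items():
        if name != origin:
            store[key] = value

master, slaves = {}, [{}, {}]
master_slave_write(master, slaves, "x", 1)

peers = {"p1": {}, "p2": {}, "p3": {}}
p2p_write(peers, "p2", "x", 1)   # the write may originate at any peer

print(master, slaves)  # {'x': 1} [{'x': 1}, {'x': 1}]
print(peers)           # all three peers hold {'x': 1}
```

In the first model only the master may accept the write; in the second, `p2p_write` works no matter which peer originates it, which is the equality of authority the table describes.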