
Shaheed Zulfikar Ali Bhutto Institute of Science & Technology

COMPUTER SCIENCE DEPARTMENT

Total Marks: 4

Obtained Marks:

Parallel and Distributed Computing

Assignment # 3b

Submitted To: Dr. Danish Mahmood


Student Name: Sajjad Ahmed

Reg Number: 1912163

PDC BS(CS)-7 SZABIST-ISB



Instructions: Copied or shown assignments will be marked zero. Late submissions are not
entertained.
Q1. Why do we use HDFS for applications having large data sets rather
than for applications having plenty of small files?
Ans: The Hadoop Distributed File System (HDFS) is designed to store and process
large data sets efficiently. It is well suited to applications with a large volume of
data because it stores and manages data across many servers and disks, providing
scalable and reliable storage.
The main reason HDFS is not well suited for small files is metadata overhead: the NameNode
keeps an entry for every file and every block in memory, so millions of small files consume
NameNode heap regardless of how little data they actually hold. With large files, this fixed
per-file cost is amortized over far more data, so the metadata overhead becomes a much smaller
proportion of the total storage and processing cost. Additionally, HDFS is optimized for
high-throughput batch access to large files, not for the low-latency, random access patterns
that many small files typically imply.
In summary, HDFS is ideal for applications with large data sets, as it can efficiently store
and process data at scale. For applications with a large number of small files, other storage
systems (such as an object store or a cloud-based storage service), or Hadoop-specific
workarounds such as packing small files into Hadoop archives (HAR) or sequence files, may be
more appropriate.
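The metadata argument above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file or block entry and the default 128 MB block size; the exact figures vary by Hadoop version, but the ratio is what matters.

```python
# Rough illustration of NameNode metadata overhead.
# The ~150 bytes/object figure is a widely quoted rule of thumb,
# not an exact constant.
BYTES_PER_OBJECT = 150          # approx. heap cost per file or block entry
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB)

def namenode_heap_bytes(num_files, avg_file_size):
    """Estimate NameNode heap used by file + block metadata."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)  # one file entry + its blocks
    return objects * BYTES_PER_OBJECT

# Storing ~1 TB as 10 million 100 KB files vs. 8,000 files of 128 MB each:
small = namenode_heap_bytes(10_000_000, 100 * 1024)
large = namenode_heap_bytes(8_000, 128 * 1024 * 1024)
print(f"small files: {small / 1024**2:.0f} MB of NameNode heap")
print(f"large files: {large / 1024**2:.2f} MB of NameNode heap")
```

The small-file layout needs thousands of times more NameNode memory for the same volume of data, which is why HDFS deployments prefer fewer, larger files.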
Q2: How do you define rack awareness in Hadoop?
Ans: In Hadoop, "rack awareness" is the cluster's knowledge of which rack each node belongs
to. Hadoop uses this knowledge both to place block replicas across racks for fault tolerance
and to prefer nodes in the same rack for data processing, minimizing data transfer over the
network.
This is because data transfer between nodes within the same rack typically has lower latency
compared to transferring data between nodes in different racks. By taking this into account,
Hadoop can optimize data processing and improve the performance of jobs.
In practice, rack awareness is implemented by describing the network topology, including the
mapping of nodes to racks, when configuring the Hadoop cluster (typically via a topology
script or mapping file). This allows Hadoop to make informed decisions about where to place
data blocks and execute tasks.
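To illustrate, Hadoop can be pointed at an administrator-supplied topology script via the `net.topology.script.file.name` property; the script receives host names or IP addresses as arguments and prints a rack path for each. The sketch below is a hypothetical such script; the IP-to-rack mapping is made up for illustration.

```python
#!/usr/bin/env python3
# Hypothetical rack-topology script for net.topology.script.file.name.
# Hadoop invokes it with one or more host/IP arguments and expects a
# rack path per argument on stdout; unmapped hosts fall back to a
# default rack. The mapping below is illustrative only.
import sys

RACK_MAP = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

def rack_for(host):
    """Return the rack path for a host, or the default rack if unknown."""
    return RACK_MAP.get(host, "/default-rack")

if __name__ == "__main__":
    print(" ".join(rack_for(h) for h in sys.argv[1:]))
```

With such a mapping in place, HDFS can, for example, keep one replica on the writer's rack and place the others on a different rack, balancing locality against fault tolerance.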

Q3: Which processing engine will you prefer to use for the following scenarios?

- Real-time processing
- Stream processing
- Batch processing

Moreover, differentiate between the above-mentioned three types of processing.


Ans: For real-time processing, where the goal is to process data as it is produced and get
immediate results, I would recommend using a stream processing engine. Examples of stream
processing engines include Apache Flink, Apache Spark Streaming, and Apache Storm.
For stream processing, where the goal is to process data as it is produced, but with a little bit
of delay (on the order of seconds or minutes), I would also recommend using a stream
processing engine.
For batch processing, where the goal is to process large amounts of data in a non-real-time,
offline manner (for example, to generate reports or perform data analysis), I would
recommend using a batch processing engine. Examples of batch processing engines include
Apache Hadoop and Apache Spark.
Here are some key differences between these types of processing:

Real-time processing:
- Data is processed as it is produced, with minimal delay
- Results are available immediately

Stream processing:
- Data is processed as it is produced, with a small delay (on the order of seconds or minutes)
- Results are available in near real-time

Batch processing:
- Data is processed in large batches, in a non-real-time, offline manner
- Results may not be available until the processing is completed, which can take a
significant amount of time depending on the size of the data and the complexity of the
processing
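The core difference between batch and stream processing can be shown with a toy example: a batch job waits for the complete data set and computes once, while a stream job maintains running state and emits a result after every event. The function names below are illustrative; real engines such as Spark or Flink add state management, fault tolerance, and distribution on top of this idea.

```python
# Toy contrast between batch and stream processing of the same event feed.
events = [3, 7, 2, 8, 5]  # e.g. page views observed per second

def batch_total(all_events):
    """Batch: wait for the complete data set, then compute one result."""
    return sum(all_events)

def stream_totals(event_source):
    """Stream: update a running result as each event arrives."""
    running = 0
    for e in event_source:
        running += e
        yield running  # a result is available after every event

print(batch_total(events))          # one answer, only after all data is in
print(list(stream_totals(events)))  # incremental answers as data arrives
```

Both approaches reach the same final total; the difference is when intermediate results become available, which is exactly the latency trade-off described above.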

