Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division


First Semester 2023-2024

Mid-Semester Test
(EC-2 Regular)

Course No.       : CCZG522
Course Title     : BIG DATA SYSTEMS
Nature of Exam   : Open Book
Weightage        : 30%
Duration         :
Date of Exam     :
No. of Pages     : 12
No. of Questions : 4
Note to Students:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made, if any, should be stated clearly at the beginning of your answer.

Set-(A)
1. a) Label each of the following as structured or unstructured data. [1]
i) Images uploaded by a user on a social media platform.
ii) Passenger details recorded by a railway reservation website.

b) Consider a 2-level memory hierarchy built using a cache and main memory. It is observed
that 15 out of 50 times the data/instructions required by the program are not found in the cache.
i) Find the cache hit ratio. [1]
ii) If the cache access time is 2 ms and the main memory access time is 10 ms, find the average
memory access time. [2]

Note: Show detailed calculations
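
The calculation in part (b) can be sketched as follows. This is only an illustration under stated assumptions: the question does not say whether the hierarchy uses hierarchical access (a miss pays the cache lookup plus the main memory access) or simultaneous access (a miss pays only the main memory access), so both variants are shown. The function names are hypothetical.

```python
def hit_ratio(misses, accesses):
    """Fraction of accesses that are served from the cache."""
    return (accesses - misses) / accesses


def amat_hierarchical(h, t_cache, t_main):
    """Assumption: on a miss, the cache is probed first, then main memory."""
    return h * t_cache + (1 - h) * (t_cache + t_main)


def amat_simultaneous(h, t_cache, t_main):
    """Assumption: cache and main memory are accessed in parallel."""
    return h * t_cache + (1 - h) * t_main


# Set-(A) figures: 15 misses in 50 accesses, 2 ms cache, 10 ms main memory.
h = hit_ratio(15, 50)
print(f"hit ratio          = {h:.2f}")
print(f"AMAT hierarchical  = {amat_hierarchical(h, 2, 10):.2f} ms")
print(f"AMAT simultaneous  = {amat_simultaneous(h, 2, 10):.2f} ms")
```

Whichever model is used, the assumption should be stated at the start of the answer, as required by the note to students.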

c) You have a 700 MB file stored on HDFS as part of a Hadoop 1.x distribution. A data analytics
program uses this file and runs in parallel across the cluster nodes. The default block size and
replication factor are used in the configuration. How many blocks in total, including replicas, will
be stored in the cluster? What are the unique HDFS block sizes you will find for this specific file? [2]

Note: Show all the necessary calculations with explanation
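
The block arithmetic in part (c) can be sketched as below. The sketch assumes the stock Hadoop defaults, which the question leaves implicit: a 64 MB block size in Hadoop 1.x (128 MB in Hadoop 2.x) and a replication factor of 3. The function name `hdfs_blocks` is hypothetical and is not part of any Hadoop API.

```python
def hdfs_blocks(file_mb, block_mb, replication):
    """Return (total blocks including replicas, distinct block sizes in MB)."""
    full_blocks, last_block_mb = divmod(file_mb, block_mb)
    # HDFS does not pad the final block: it only occupies the leftover bytes.
    logical_blocks = full_blocks + (1 if last_block_mb else 0)
    sizes = []
    if full_blocks:
        sizes.append(block_mb)
    if last_block_mb:
        sizes.append(last_block_mb)
    return logical_blocks * replication, sizes


# Set-(A): 700 MB file, Hadoop 1.x defaults assumed (64 MB blocks, 3 replicas).
total, sizes = hdfs_blocks(700, block_mb=64, replication=3)
print(total, sizes)
```

The same helper covers the other sets by changing the arguments, for example `hdfs_blocks(942, 128, 3)` for a Hadoop 2.x cluster.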


Set-(B)
1. a) Label each of the following as structured or unstructured data. [1]
i) Resume (Word document) uploaded by a candidate to various job portals.
ii) Details given by a customer to create an account on an online retail store.

b) Consider a 2-level memory hierarchy built using a cache and main memory. It is observed
that 20 out of 80 times the data/instructions required by the program are not found in the cache.
i) Find the cache hit ratio. [1]
ii) If the cache access time is 3 ms and the main memory access time is 12 ms, find the average
memory access time. [2]

Note: Show detailed calculations

c) You have a 942 MB file stored on HDFS as part of a Hadoop 2.x distribution. A data analytics
program uses this file and runs in parallel across the cluster nodes. The default block size and
replication factor are used in the configuration. How many blocks in total, including replicas, will
be stored in the cluster? What are the unique HDFS block sizes you will find for this specific file? [2]

Note: Show all the necessary calculations with explanation


Set-(C)
1. a) Label each of the following as structured or unstructured data. [1]
i) A blog written by a user.
ii) A video uploaded by a user on a social media platform.

b) Consider a 2-level memory hierarchy built using a cache and main memory. It is observed
that 10 out of 50 times the data/instructions required by the program are not found in the cache.
i) Find the cache hit ratio. [1]
ii) If the cache access time is 3 ms and the main memory access time is 15 ms, find the average
memory access time. [2]

Note: Show detailed calculations

c) You have an 812 MB file stored on HDFS as part of a Hadoop 1.x distribution. A data analytics
program uses this file and runs in parallel across the cluster nodes. The default block size and a
replication factor of 4 are used in the configuration. How many blocks in total, including replicas,
will be stored in the cluster? What are the unique HDFS block sizes you will find for this specific
file? [2]

Note: Show all the necessary calculations with explanation.


Set-(A)
2. a) In a movie reservation system, what trade-offs between consistency and availability can you
think of in the following cases (applying the CAP theorem)? Justify your answer.
i) When most of the seats are available [2]
ii) When the cinema hall is close to being filled [2]

b) Ravi wants to run a simulation for his research, and his supervisor has advised him to run it for a
fixed problem size. Ravi has succeeded in parallelizing 88% of the code, with the remaining 12%
being sequential.
i) If the time taken to run the problem on a single processor is 10 seconds, what will be the time
taken to execute the same problem on 11 processors? [3]

ii) What is the maximum speedup Ravi can achieve when executing the same problem if there
is no constraint on the number of processors available? [2]

Note: Show detailed calculations. Ignore other overheads such as communication etc.
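
Because the problem size is fixed, part (b) is an application of Amdahl's law: speedup = 1 / (s + p/N), where p is the parallel fraction, s = 1 - p is the sequential fraction, and N is the number of processors. A minimal sketch, with hypothetical function names and overheads ignored as the note instructs:

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup for a fixed problem size under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def max_speedup(parallel_fraction):
    """Limit of the speedup as the number of processors grows without bound."""
    return 1.0 / (1.0 - parallel_fraction)


# Set-(A) figures: 88% parallel, 10 s on one processor, 11 processors.
t_single = 10.0
speedup = amdahl_speedup(0.88, 11)
print(f"speedup on 11 processors = {speedup:.2f}")
print(f"time on 11 processors    = {t_single / speedup:.2f} s")
print(f"maximum speedup          = {max_speedup(0.88):.2f}")
```

The other sets follow by substituting their parallel fractions (0.66 or 0.99) for 0.88.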
Set-(B)
2. a) Consider the following use cases and suggest the design choice as per the design principles of CAP
theorem, i.e. is it of type CA, CP or AP? Justify your design choice in each case.

i) A large-scale event reservation system that has fewer than 40% of its seats booked. The system
should continue to accept bookings during network disruptions. [2]

ii) ABC.com is an online e-retailer. The organization gathers product reviews from its customers
and displays them on its website. Customers can provide ratings with feedback comments. The
organization wants to ensure that a review written by a customer is never lost, although it may
not be immediately visible to others. [2]

b) Ravi wants to run a simulation for his research, and his supervisor has advised him to run it for a
fixed problem size. Ravi has succeeded in parallelizing 66% of the code, with the remaining 34%
being sequential.
i) If the time taken to run the problem on a single processor is 10 seconds, what will be the time
taken to execute the same problem on 11 processors? [3]

ii) What is the maximum speedup Ravi can achieve when executing the same problem if there
is no constraint on the number of processors available? [2]

Note: Show detailed calculations. Ignore other overheads such as communication etc.
Set-(C)
2. a) Consider the following use cases and suggest the design choice as per the design principles of CAP
theorem, i.e. is it of type CA, CP or AP? Justify your design choice in each case.

i) A large-scale event reservation system with more than 95% of its seats booked. The system
should continue to work during network disruptions. [2]

ii) A large scale banking application facilitating credit and debit in customer accounts. This
application is expected to handle millions of transactions daily and maintain the financial
integrity of customer accounts. The application must ensure the security and accuracy of
financial data while providing a responsive and reliable user experience. [2]

b) Ravi wants to run a simulation for his research, and his supervisor has advised him to run it for a
fixed problem size. Ravi has succeeded in parallelizing 99% of the code, with the remaining 1%
being sequential.
i) If the time taken to run the problem on a single processor is 10 seconds, what will be the time
taken to execute the same problem on 11 processors? [3]

ii) What is the maximum speedup Ravi can achieve when executing the same problem if there
is no constraint on the number of processors available? [2]

Note: Show detailed calculations. Ignore other overheads such as communication etc.
Set-(A)
3. Consider the following Student file.

Student_id  Name   DOB        Gender  Course            Marks
101         Ram    1-Jan-90   M       Big Data Systems  85
102         Sham   7-Feb-90   M       Big Data Systems  90
103         Kiran  15-Mar-90  F       Big Data Systems  80
104         Seema  7-Mar-90   F       Big Data Systems  95
105         Mohit  1-Dec-89   M       Big Data Systems  75
101         Ram    1-Jan-90   M       Cloud Computing   80
102         Sham   7-Feb-90   M       Cloud Computing   85
103         Kiran  15-Mar-90  F       Cloud Computing   90
104         Seema  7-Mar-90   F       Cloud Computing   80
105         Mohit  1-Dec-89   M       Cloud Computing   80

a) How can you use the MapReduce programming model to find the average marks of all the students
in each course? Explain. [1]
The reference output is given below

Course            Avg Marks
Big Data Systems  85
Cloud Computing   83

b) Write pseudo code for map and reduce functions to find out average marks for each student. [4]
c) Write the output after Map Phase and the input data that will be given to the reducer. [2]
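
For orientation, the per-course average in part (a) can be simulated in memory as shown below. This is a sketch of the map, shuffle, and reduce stages in plain Python, not Hadoop API code; the names `map_fn`, `shuffle`, and `reduce_fn` are hypothetical, and the DOB and Gender fields are omitted from the records because the computation does not use them.

```python
from collections import defaultdict

# (student_id, name, course, marks) taken from the Student file above.
RECORDS = [
    (101, "Ram", "Big Data Systems", 85),
    (102, "Sham", "Big Data Systems", 90),
    (103, "Kiran", "Big Data Systems", 80),
    (104, "Seema", "Big Data Systems", 95),
    (105, "Mohit", "Big Data Systems", 75),
    (101, "Ram", "Cloud Computing", 80),
    (102, "Sham", "Cloud Computing", 85),
    (103, "Kiran", "Cloud Computing", 90),
    (104, "Seema", "Cloud Computing", 80),
    (105, "Mohit", "Cloud Computing", 80),
]


def map_fn(record):
    """Emit one (course, marks) pair per input record."""
    _student_id, _name, course, marks = record
    yield course, marks


def shuffle(pairs):
    """Group mapper output by key, as the framework does before the reduce phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_fn(course, marks_list):
    """Average all marks received for one course."""
    return course, sum(marks_list) / len(marks_list)


mapped = [pair for record in RECORDS for pair in map_fn(record)]  # map output
grouped = shuffle(mapped)                                         # reducer input
averages = dict(reduce_fn(k, v) for k, v in grouped.items())
print(averages)
```

Here `mapped` corresponds to the output of the map phase and `grouped` to the input handed to the reducer. Keying the mapper on `student_id` instead of `course` would yield a per-student average.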
Set-(B)
3. Consider the following database of patient visits.

Patient_id  Name   Age  Gender  Date_of_Visit  Symptoms               Fee_paid
101         Ram    35   M       1-Jan-22       Headache, Cold, Fever  800
102         Sham   45   M       7-Feb-23       Stomach ache           1000
103         Kiran  40   F       15-Mar-23      Muscle Pain            900
104         Seema  59   F       7-Mar-23       Headache, Cold, Fever  1000
105         Mohit  62   M       1-Dec-22       Chest Pain             1300
101         Ram    35   M       15-Jan-22      Infection in Eyes      1100
102         Sham   45   M       7-Apr-23       Chest Infection        1500
103         Kiran  40   F       15-Apr-23      Fever                  1000
104         Seema  59   F       7-May-23       Food Poisoning         1400
105         Mohit  62   M       1-Jun-23       Fatty Liver            1700

a) How can you use the MapReduce programming model to find the average fee paid by each patient
across all of his/her visits? Explain. [1]
The reference output is given below

Patient_id  Avg_Fee_paid
101         950
102         1250
103         950
104         1200
105         1500

b) Write pseudocode for the map and reduce functions to find the average fee paid by each patient. [4]
c) Write the output after Map Phase and the input data that will be given to the reducer. [2]
Set-(C)
3. Consider the following database of transactions.

T_id  Product_id  Product_Description  Category     Qty_Sold  Unit_Price  Total_Price
101   P100        T-shirt              Apparel      3         400         1200
102   P200        Shirt                Apparel      2         800         1600
103   P300        Denim                Apparel      2         1200        2400
104   P400        Laptop               Electronics  1         40000       40000
105   P500        Charger              Electronics  1         1800        1800
106   P600        Cross Trainer        Fitness      1         15000       15000
107   P700        Hard Disk            Electronics  1         2500        2500
108   P800        Treadmill            Fitness      1         32000       32000
109   P900        Mobile               Electronics  2         10000       20000
110   P300        Denim                Apparel      3         1200        3600

Note: For simplicity, assume that a single product is sold in each transaction.

a) How can you use the MapReduce programming model to find the average sales per transaction for
each category? Explain. [1]

The reference output is given below

Category     Avg Sales
Apparel      2200
Electronics  16075
Fitness      23500

b) Write pseudocode for the map and reduce functions to find the average sales per transaction for
each category. [4]
c) Write the output after map phase and the input data that will be given to the reducer. [2]
Set-(A)
4. Consider the following system assembled from multiple components.

The mean time to failure of each component is given below.


Component A1: 2 days
Component A2: 2 days
Component B: 4 days
Component C1: 8 days
Component C2: 8 days

a) Find out
i) Mean time to failure of component A [1]
ii) Mean time to failure of component C [1]
iii) Mean Time to failure of system [2]

b) Assume the mean time to repair every component is 23 hours and the mean time to diagnose any
failed component is 1 hour. Find out
i) The availability of component A [1]
ii) The availability of component C [1]
iii) The availability of the system [2]
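
The assembly diagram is not reproduced in this copy, so the sketch below does not assume how A1, A2, B, C1, and C2 are wired. It only collects generic building blocks that such a calculation typically needs, under these assumptions: components fail independently with constant failure rates, and the downtime per failure is the diagnosis time plus the repair time. All function names are hypothetical.

```python
from math import prod


def mttf_series(mttfs):
    """Series arrangement: failure rates (1 / MTTF) add up."""
    return 1.0 / sum(1.0 / m for m in mttfs)


def availability(mttf_hours, repair_hours, diagnose_hours=0.0):
    """Steady-state availability = uptime / (uptime + downtime)."""
    downtime = repair_hours + diagnose_hours
    return mttf_hours / (mttf_hours + downtime)


def availability_series(avails):
    """A series arrangement is up only when every part is up."""
    return prod(avails)


def availability_parallel(avails):
    """A parallel arrangement is down only when every part is down."""
    return 1.0 - prod(1.0 - a for a in avails)


# Single component with an MTTF of 2 days (48 h), 23 h repair, 1 h diagnosis.
a1 = availability(48, repair_hours=23, diagnose_hours=1)
print(f"availability of one 2-day component = {a1:.4f}")
print(f"two such components in series       = {availability_series([a1, a1]):.4f}")
print(f"two such components in parallel     = {availability_parallel([a1, a1]):.4f}")
```

Which helper applies to component A, component C, and the overall system depends on the arrangement shown in the original figure.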
Set-(B)
4. Consider the following system assembled from multiple components.

The mean time to failure of each component is given below.


Component A1: 4 days
Component A2: 4 days
Component B: 8 days
Component C1: 16 days
Component C2: 16 days

a) Find out
i) Mean time to failure of component A [1]
ii) Mean time to failure of component C [1]
iii) Mean Time to failure of system [2]

b) Assume the mean time to repair every component is 22 hours and the mean time to diagnose any
failed component is 2 hours. Find out
i) The availability of component A [1]
ii) The availability of component C [1]
iii) The availability of the system [2]
Set-(C)
4. Consider the following system assembled from multiple components.

The mean time to failure of each component is given below.


Component A1: 8 days
Component A2: 8 days
Component B: 16 days
Component C1: 32 days
Component C2: 32 days

a) Find out
i) Mean time to failure of component A [1]
ii) Mean time to failure of component C [1]
iii) Mean Time to failure of system [2]

b) Assume the mean time to repair every component is 20 hours and the mean time to diagnose any
failed component is 4 hours. Find out
i) The availability of component A [1]
ii) The availability of component C [1]
iii) The availability of the system [2]