
Thapar Institute of Engineering and Technology, Patiala

Department of Electronics and Communication Engineering


BE ENC (VII Semester) MST (September 2022) 11E735: Big Data Analytics
Time: 02 Hours; MM: 35
Faculty: Dr. Debayani Ghosh, Dr. Arnab Pattanayak

Note: Answer all questions

Q1. Suppose we have two tables: the first contains an employee's personal information, primary-keyed on SSN, and the second contains the employee's income, again keyed on SSN. The data is as follows:

Input Data:

Table 1: (SSN, (Personal Information))

111222: (Stephen; Sacramento, CA)
333444: (Edward; San Diego, CA)
555666: (John; San Diego, CA)

Table 2: (SSN, (year, income))

111222: (2016, $70000), (2015, $65000), (2014, $6000), ...
333444: (2016, $72000), (2015, $70000), (2014, $6000), ...
555666: (2016, $80000), (2015, $85000), (2014, $7500), ...

(a) Write a first stage Map-Reduce programming model to get the following output from the input
data: (3.5 marks)

(SSN, (City, Income in 2016))

(b) Write a second-stage Map-Reduce programming model to compute the average income in each city
in 2016. The second stage takes as input the output generated by the first stage. (3.5 marks)
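One possible answer can be sketched in plain Python, with dictionaries standing in for the framework's shuffle. This is a sketch, not the official solution; the 2016 income is assumed to be the entry tagged with year 2016 in each Table 2 record.

```python
from collections import defaultdict

table1 = {111222: ("Stephen", "Sacramento, CA"),
          333444: ("Edward", "San Diego, CA"),
          555666: ("John", "San Diego, CA")}
table2 = {111222: [(2016, 70000), (2015, 65000)],
          333444: [(2016, 72000), (2015, 70000)],
          555666: [(2016, 80000), (2015, 85000)]}

# Stage 1 map: emit records keyed on SSN, tagged with their source.
stage1 = defaultdict(list)
for ssn, (name, city) in table1.items():
    stage1[ssn].append(("city", city))
for ssn, incomes in table2.items():
    for year, income in incomes:
        if year == 2016:
            stage1[ssn].append(("income", income))

# Stage 1 reduce: join the tagged values per SSN -> (SSN, (City, Income)).
joined = {}
for ssn, vals in stage1.items():
    d = dict(vals)
    joined[ssn] = (d["city"], d["income"])

# Stage 2 map: re-key on city; Stage 2 reduce: average income per city.
by_city = defaultdict(list)
for ssn, (city, income) in joined.items():
    by_city[city].append(income)
averages = {city: sum(v) / len(v) for city, v in by_city.items()}
```

On this data the stage-2 output is Sacramento at $70000 and San Diego at the mean of $72000 and $80000.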

Q2. (a) Consider the list [2,4,5,6,7,8,9]. Use the necessary RDD transformations to print the
numbers that are divisible by 2 or divisible by 3. (3.5 marks)
(b) Consider two RDDs created from the lists ['orange', 'mango', 'apple', 'grapes', 'orange'] and
['green', 'red', 'yellow']. Perform the necessary transformations and actions to generate the following:
(3.5 marks)

['orange', 'mango', 'green', 'red', 'yellow', 'apple', 'grapes']
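A sketch of the required logic in plain Python; the comments show the corresponding PySpark RDD calls, assuming a SparkContext `sc` is available.

```python
# (a) sc.parallelize([2, 4, 5, 6, 7, 8, 9]) \
#       .filter(lambda x: x % 2 == 0 or x % 3 == 0).collect()
nums = [2, 4, 5, 6, 7, 8, 9]
divisible = [x for x in nums if x % 2 == 0 or x % 3 == 0]

# (b) rdd1.union(rdd2).distinct().collect()
# union() keeps the duplicate 'orange', so distinct() is needed;
# note that the order of distinct()'s output is not guaranteed.
fruits = ['orange', 'mango', 'apple', 'grapes', 'orange']
colours = ['green', 'red', 'yellow']
merged = list(dict.fromkeys(fruits + colours))  # union + de-duplicate
```

The expected answer in (b) should therefore be compared as a set of elements rather than by position.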

Q3. Consider the following data:


id    balance  ratio
21    2143     261
441   29       15
331   2060
581   1596     51
331   569      195
351   231      1?8
(a) Write an algorithm to find the outliers in the above dataframe. (3.5 marks)
(b) Find the mean of each column after removing the rows corresponding to the outliers. (3.5 marks)
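One standard approach is Tukey's IQR rule. A minimal sketch in plain Python follows; the `rows` data here is a hypothetical stand-in (the table in the question is not fully legible), and the choice to drop a row when any column is outlying is an assumption.

```python
from statistics import quantiles, mean

def iqr_bounds(values):
    # Tukey's rule: a value is an outlier if it falls outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def drop_outlier_rows(rows):
    # Drop a row if ANY of its columns holds an outlying value.
    cols = list(zip(*rows))
    bounds = [iqr_bounds(c) for c in cols]
    return [r for r in rows
            if all(lo <= v <= hi for v, (lo, hi) in zip(r, bounds))]

# Hypothetical two-column data with one extreme balance value.
rows = [(21, 2143), (44, 29), (33, 2060), (58, 1596), (33, 569), (35, 90000)]
kept = drop_outlier_rows(rows)
col_means = [mean(c) for c in zip(*kept)]
```

Here the row (35, 90000) exceeds the upper IQR bound of the second column and is removed before the per-column means are taken.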

Q4. Consider a Spark Structured Streaming program receiving streaming data for a word count. We want a
window-based aggregation, where window length = 05 seconds, and we want the output result every 03
seconds in complete mode. [Time is in hh:mm:ss format.]

If the input data stream is as follows —

Input Stream

Event-time    Received at program    Data


07:00:01 07:00:01 Deer Owl Dog
07:00:02 07:00:02 Dog
07:00:02 07:00:08 Dog Owl
07:00:04 07:00:05 Deer Owl Owl
07:00:05 07:00:05 Dog Owl
07:00:05 07:00:08 Owl Cheetah
07:00:07 07:00:08 Deer Cheetah

(a) What will be all the output tables of the word count at 07:00:03, 07:00:06 and 07:00:09? (7 marks)

Consider the following points for the output —

(1) streaming starts at 07:00:00.


(2) Index the count by both Grouping Key (word) and the Window.
(3) Consider the data only within the window interval; that is, the interval 07:00:03 to 07:00:08
covers data arriving after 07:00:03 and before 07:00:08.
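The aggregation above can be simulated in plain Python. This is a sketch of the semantics, not Spark itself: windows start at multiples of the 3-second slide from 07:00:00, each event is assigned by event-time to every window containing it, and a complete-mode trigger aggregates all data received by the trigger time. Times are seconds after 07:00:00.

```python
from collections import Counter

WINDOW, SLIDE = 5, 3  # seconds

# (event_time, received_at, words), offsets from 07:00:00.
stream = [
    (1, 1, "Deer Owl Dog"),
    (2, 2, "Dog"),
    (2, 8, "Dog Owl"),
    (4, 5, "Deer Owl Owl"),
    (5, 5, "Dog Owl"),
    (5, 8, "Owl Cheetah"),
    (7, 8, "Deer Cheetah"),
]

def windows_for(t):
    # All window start times (multiples of SLIDE, >= 0) whose
    # interval [start, start + WINDOW) contains event-time t.
    return [s for s in range(0, t + 1, SLIDE) if s <= t < s + WINDOW]

def result_table(trigger):
    # Complete-mode output at a trigger: count every event whose data
    # has been *received* by the trigger, keyed by (window start, word).
    counts = Counter()
    for event_t, received_t, words in stream:
        if received_t <= trigger:
            for s in windows_for(event_t):
                for w in words.split():
                    counts[(s, w)] += 1
    return counts
```

For example, `result_table(3)` sees only the first two rows (the rest arrive later), so its only populated window is the one starting at 07:00:00; by `result_table(9)` all three windows carry counts.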

Q5 (a). Consider the following three documents:

Doc ID  Document
Doc1    three birds, two birds
Doc2    red birds, blue birds
Doc3    two red birds

Using Map-Reduce, we want to build an inverted index showing how many times each word occurs in each
document. For example, the desired output for a word will be as below:

two (Doc1,1), (Doc3,1)

What are the inputs of the map function and the output (key, value) pairs for each of those inputs? (3 marks)
After shuffle, what does the reducer function do to generate the desired output? (1 mark)
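A minimal sketch of the map, shuffle and reduce steps in plain Python, assuming punctuation is dropped during tokenisation:

```python
from collections import defaultdict

docs = {
    "Doc1": "three birds two birds",  # commas stripped for simplicity
    "Doc2": "red birds blue birds",
    "Doc3": "two red birds",
}

def map_fn(doc_id, text):
    # Input: one (doc_id, text) pair.
    # Output: a (word, (doc_id, 1)) pair per word occurrence.
    for word in text.split():
        yield word, (doc_id, 1)

# Shuffle: group the mapper output by word.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, pair in map_fn(doc_id, text):
        grouped[word].append(pair)

def reduce_fn(word, pairs):
    # Sum the counts per document -> (word, [(doc_id, n), ...]).
    per_doc = defaultdict(int)
    for doc_id, n in pairs:
        per_doc[doc_id] += n
    return word, sorted(per_doc.items())

index = dict(reduce_fn(w, ps) for w, ps in grouped.items())
```

The reducer thus receives all (doc_id, 1) pairs for one word and sums them per document, reproducing entries such as the one shown for "two".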
(b) Write the output of the following program (here sc is spark.sparkContext) — (3 marks)
rdd = sc.parallelize( • ).flatMap(lambda x: [x, x*x])
rdd.cartesian(rdd).reduceByKey(lambda x, y: x + y).collect()
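The input list is elided in the question, but the pipeline can be traced in plain Python with an assumed input of [1, 2]. The ordering of collect()'s output is not guaranteed in Spark; it is sorted here for determinism.

```python
# Assumed input (the actual list is elided in the question).
data = [1, 2]

# flatMap(lambda x: [x, x*x]) flattens per-element lists.
flat = [y for x in data for y in (x, x * x)]  # [1, 1, 2, 4]

# cartesian(rdd) pairs every element with every element (16 pairs here).
pairs = [(a, b) for a in flat for b in flat]

# reduceByKey(lambda x, y: x + y) sums the second components per key.
sums = {}
for k, v in pairs:
    sums[k] = sums.get(k, 0) + v

result = sorted(sums.items())
```

With this input, each key is paired with the full flattened list (sum 8), and the key 1 occurs twice, doubling its total.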
