Q1. Suppose we have two tables: the first contains an employee's personal information, primary-keyed
on SSN, and the second contains the employee's income, again keyed on SSN. The data is
as follows:
Input Data:
(a) Write a first-stage Map-Reduce programming model to get the following output from the input
data: (3.5 marks)
(b) Write a second-stage Map-Reduce programming model to get the average income in each city
in 2016. The second stage takes as input the output generated by the first stage. (3.5 marks)
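Since the exam's input tables are not reproduced here, the two stages can be sketched locally in plain Python with hypothetical records (the SSNs, cities, and incomes below are made up for illustration): stage 1 is a reduce-side join on SSN that emits (city, income) pairs for 2016, and stage 2 averages those pairs per city.

```python
from collections import defaultdict

# Hypothetical input records (the exam's actual tables are not shown).
personal = [("111", "Dhaka"), ("222", "Chittagong"), ("333", "Dhaka")]        # (SSN, city)
income = [("111", 2016, 50000), ("222", 2016, 40000), ("333", 2016, 70000)]   # (SSN, year, income)

# ---- Stage 1: reduce-side join on SSN, emitting (city, income) for 2016 ----
def stage1_map():
    for ssn, city in personal:
        yield ssn, ("P", city)            # tag personal records
    for ssn, year, amount in income:
        if year == 2016:                  # keep only 2016 income records
            yield ssn, ("I", amount)      # tag income records

def shuffle(pairs):
    """Group (key, value) pairs by key, as the framework's shuffle phase would."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def stage1_reduce(groups):
    for _ssn, values in groups.items():
        city = next(v for tag, v in values if tag == "P")
        for tag, v in values:
            if tag == "I":
                yield city, v             # stage-1 output: (city, income)

stage1_out = list(stage1_reduce(shuffle(stage1_map())))

# ---- Stage 2: average income per city over the stage-1 output ----
def stage2_reduce(groups):
    return {city: sum(vals) / len(vals) for city, vals in groups.items()}

averages = stage2_reduce(shuffle(stage1_out))
print(averages)  # {'Dhaka': 60000.0, 'Chittagong': 40000.0}
```

The `shuffle` helper stands in for the framework's group-by-key step between map and reduce; in real Hadoop or Spark code it happens automatically.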
Q2. (a) Consider the following list [2,4,5,6,7,8,9]. Use necessary RDD transformations to print the
numbers that are either divisible by 2 or divisible by 3. (3.5 marks)
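One way to check the expected answer for (a): in PySpark this is a single `filter` transformation followed by `collect`, which the plain-Python stand-in below mirrors on the given list.

```python
nums = [2, 4, 5, 6, 7, 8, 9]

# In PySpark this would be:
#   sc.parallelize(nums).filter(lambda x: x % 2 == 0 or x % 3 == 0).collect()
# A local stand-in for the filter transformation:
result = [x for x in nums if x % 2 == 0 or x % 3 == 0]
print(result)  # [2, 4, 6, 8, 9]
```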
(b) Consider two RDDs created from the lists ['orange', 'mango', 'apple', 'grapes', 'orange'] and
['green', 'red', 'yellow']. Perform the necessary transformations and actions to generate the following:
(3.5 marks)
Q4. Consider a Spark Structured Streaming program receiving streaming data for a word count. We want a
window-based aggregation, where the window length is 5 seconds and we want the output result every 3
seconds in complete mode. [time in hh:mm:ss format]
Input Stream
(a) What will be all the output tables of the word count at 07:00:03, 07:00:06 and 07:00:09? (7 marks)
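Since the input stream itself is not reproduced here, the windowing logic can be sketched with hypothetical events (the timestamps and words below are made up). With a 5-second window and 3-second slide, windows start at multiples of 3 seconds, so each event falls into every window [start, start + 5) that contains it; in complete mode, each trigger emits the entire counts table accumulated so far.

```python
from collections import Counter

WINDOW, SLIDE = 5, 3  # seconds, per the question

# Hypothetical events: (seconds after 07:00:00, word).
events = [(1, "cat"), (2, "dog"), (4, "cat"), (7, "dog")]

def windows_for(t):
    """All [start, start + WINDOW) windows (starts at multiples of SLIDE) containing time t."""
    s = (t // SLIDE) * SLIDE          # latest window start at or before t
    while s > t - WINDOW and s >= 0:
        yield (s, s + WINDOW)
        s -= SLIDE

counts = Counter()
for t, word in events:
    for w in windows_for(t):
        counts[(w, word)] += 1

# In complete mode, the table at each trigger holds every (window, word) count so far.
for (start, end), word in sorted(counts):
    print(f"[{start:02d}s, {end:02d}s) {word}: {counts[((start, end), word)]}")
```

The real Structured Streaming equivalent would be `words.groupBy(window(col("timestamp"), "5 seconds", "3 seconds"), col("word")).count()` with `outputMode("complete")`; the sketch only reproduces the bucketing arithmetic.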
Doc ID   Document
Doc1     three birds, two birds
Doc2     red birds, blue birds
Doc3     two red birds
Using Map-Reduce, we want to build an inverted index that shows how many times a word occurs in each
document. For example, the desired output for each word will be as below.
What are the inputs of the map function and the output (key, value) pairs for each of those inputs? (3 marks)
After shuffle, what does the reducer function do to generate the desired output? (1 mark)
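A way to check the expected answer, using the three documents given above: the map function takes one (doc ID, text) record and emits a ((word, doc ID), 1) pair per word occurrence; after the shuffle groups pairs by key, the reducer sums the 1s to get each word's count per document.

```python
from collections import defaultdict

docs = {
    "Doc1": "three birds, two birds",
    "Doc2": "red birds, blue birds",
    "Doc3": "two red birds",
}

# Map: input is one (doc_id, text) record; output is a ((word, doc_id), 1) pair per word.
def mapper(doc_id, text):
    for word in text.replace(",", " ").split():
        yield (word, doc_id), 1

# Shuffle groups by (word, doc_id); the reducer sums the 1s for each key.
counts = defaultdict(int)
for doc_id, text in docs.items():
    for key, one in mapper(doc_id, text):
        counts[key] += one

# Regroup by word to present the inverted index.
index = defaultdict(dict)
for (word, doc_id), c in counts.items():
    index[word][doc_id] = c

print(dict(index))  # e.g. 'birds' -> {'Doc1': 2, 'Doc2': 2, 'Doc3': 1}
```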
(b) Write the output of the following program (here sc is spark.sparkContext): (3 marks)
rdd = sc.parallelize(•).flatMap(lambda x: [x, x*x])
rdd.cartesian(rdd).reduceByKey(lambda x, y: x + y).collect()
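The list passed to parallelize is illegible in the question, so the exact output cannot be reproduced; the sketch below traces the same pipeline in plain Python with an assumed list [1, 2] to show the mechanics: flatMap doubles each element into (x, x*x), cartesian forms all (a, b) pairs, and reduceByKey sums the b-values sharing the same key a.

```python
from itertools import product
from collections import defaultdict

# Assumed input list (the actual list in the question is illegible).
base = [1, 2]

# flatMap(lambda x: [x, x*x])
flat = [y for x in base for y in (x, x * x)]  # [1, 1, 2, 4]

# cartesian(rdd) yields all (a, b) pairs; reduceByKey sums values per key a.
sums = defaultdict(int)
for a, b in product(flat, flat):
    sums[a] += b

print(sorted(sums.items()))  # [(1, 16), (2, 8), (4, 8)]
```

Note that the duplicate key 1 (from 1 and 1*1) means its pairs are merged by reduceByKey, which is the point the question is testing.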