
BIG DATA ANALYTICS

ASSIGNMENT 2

SPRING 2021

Due Date: 30th April 2021 (submit the code file online on Google Classroom)
Instructions:
• Name each file in the form rollnumber-QuestionNumber.
• Do not copy the work of your peers. Any detected case of cheating will be referred to the DC.

Question 1: (10 marks)

We have received a large file of user comments and wish to compute some basic statistics on it. Write PySpark code to perform
the following tasks:
a) Determine the number of long comments given by each user, where a long comment is one whose length is greater than
20 alphabetic characters.
b) Count the number of user names starting with each letter of the English alphabet.
c) Write a custom partitioner to partition the data on the basis of the first letter of the user name.
d) Sort the data on the basis of the length of the comment given by each user.
e) Find the user who has given the maximum number of comments.
Input:
UserName, Comment
Aliya153, Your website is superb
Sara2, You need to work on your website design
Ali45, Good !!!
Ali45, I will definitely visit again
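
The helpers that tasks (a) and (c) rely on can be sketched as below. This is only a minimal sketch under stated assumptions, not a model answer: the function names, the 27-bucket layout (26 letters plus one bucket for everything else), the reading of "alphabets" as alphabetic characters, and the `path` argument are all illustrative choices. The Spark part is wrapped in a function so the helpers can be tried without a cluster.

```python
NUM_PARTITIONS = 27  # assumption: 26 letters plus one bucket for other first characters


def parse_line(line):
    """Split 'UserName, Comment' on the first comma only, so commas inside a comment survive."""
    user, _, comment = line.partition(",")
    return user.strip(), comment.strip()


def is_long(comment, min_letters=20):
    """True if the comment contains more than min_letters alphabetic characters."""
    return sum(ch.isalpha() for ch in comment) > min_letters


def first_letter_partition(user):
    """Map a user name to a partition index based on its first letter (case-insensitive)."""
    first = user[:1].lower()
    if len(first) == 1 and "a" <= first <= "z":
        return ord(first) - ord("a")
    return NUM_PARTITIONS - 1  # digits, symbols, empty names


def run_with_spark(path):
    """Hypothetical driver: needs a PySpark installation; path is a placeholder."""
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lines = sc.textFile(path)
    header = lines.first()  # assumes the first line is 'UserName, Comment'
    pairs = lines.filter(lambda l: l != header and l.strip()).map(parse_line)
    # Task (a): number of long comments per user.
    long_counts = pairs.filter(lambda kv: is_long(kv[1])).countByKey()
    # Task (c): custom partitioning by first letter of the user name.
    partitioned = pairs.partitionBy(NUM_PARTITIONS, first_letter_partition)
    return long_counts, partitioned
```

`partitionBy` accepts any function that maps a key to an integer, so `first_letter_partition` plays the role of the custom partitioner here.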

Question 2: (10 marks)

We want to remove stop words from the users' comments in the above dataset.

Stop words are words that carry little important information, for example "to", "was", and "do". These words are usually
filtered out of search queries.
Write a PySpark program that reads a text file containing stop words (you can get one such file from the internet) and uses it to
remove the stop words from the users' comments.

Hint: broadcast the stop-word file to remove the stop words efficiently.
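
One way the broadcast hint might be applied is sketched below. It assumes a stop-word file with one word per line and a simple letters-and-apostrophes tokenizer; the function names and both path arguments are hypothetical. The stop-word set is collected on the driver once and shipped to the executors as a broadcast variable, so each task looks words up locally instead of joining against a second RDD.

```python
import re


def load_stop_words(lines):
    """Build a lower-cased set of stop words from an iterable of lines (one word per line assumed)."""
    return frozenset(w.strip().lower() for w in lines if w.strip())


def remove_stop_words(comment, stop_words):
    """Tokenize a comment and drop every token found in stop_words (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z']+", comment)
    return [t for t in tokens if t.lower() not in stop_words]


def run_with_spark(comments_path, stop_words_path):
    """Hypothetical driver: needs a PySpark installation; both paths are placeholders."""
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # The stop-word list is small, so collecting it on the driver is reasonable.
    stop_bc = sc.broadcast(load_stop_words(sc.textFile(stop_words_path).collect()))

    lines = sc.textFile(comments_path)
    header = lines.first()  # assumes the first line is 'UserName, Comment'
    pairs = (
        lines.filter(lambda l: l != header and "," in l)
        .map(lambda l: tuple(p.strip() for p in l.split(",", 1)))
    )
    # Each executor reads the broadcast set locally via stop_bc.value.
    return pairs.mapValues(lambda c: remove_stop_words(c, stop_bc.value))
```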

Question 3: (5 marks)

After removing the stop words from the comments, generate the co-occurring words that appear together more than 5 times in the
comments.
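
The question does not define co-occurrence precisely. The sketch below assumes one common reading: two distinct words co-occur if they appear in the same comment, each unordered pair is counted once per comment, and matching is case-insensitive. The function names are illustrative, and `token_rdd` is assumed to be an RDD of token lists such as the output of the stop-word removal step.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(tokens):
    """Return each unordered pair of distinct words in one comment, once, in sorted order."""
    unique = sorted({t.lower() for t in tokens})
    return list(combinations(unique, 2))


def frequent_pairs_local(token_lists, threshold=5):
    """Pure-Python reference: pairs seen in more than `threshold` comments."""
    counts = Counter(p for tokens in token_lists for p in cooccurring_pairs(tokens))
    return {pair: n for pair, n in counts.items() if n > threshold}


def frequent_pairs_spark(token_rdd, threshold=5):
    """Same logic on an RDD of token lists (token_rdd is assumed to exist already)."""
    return (
        token_rdd.flatMap(cooccurring_pairs)
        .map(lambda pair: (pair, 1))
        .reduceByKey(lambda a, b: a + b)
        .filter(lambda kv: kv[1] > threshold)
    )
```

Sorting the words inside each pair makes ("good", "website") and ("website", "good") the same key, which `reduceByKey` needs in order to sum them together.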
