
Thapar Institute of Engineering and Technology

Auxiliary Exam: Big Data Analytics (UEC735)


ECED-4ENC
Date: 03.03.2023 MM: 100 Duration: 3 Hours
Instructors: Dr. Arnab Pattanayak

Attempt all questions.

Q1. Suppose you are building a Naive Bayes classifier to decide whether a text message is about
elections or not. We have 1500 example text messages, of which 1000 are election messages and the
rest are not about elections. Among the 1000 election messages, 800 contain the word
"CONSTITUENCY", 750 contain the word "CONTEST" and 600 contain the word
"CANDIDATE". On the other hand, among the 500 non-election messages, 50 contain the word
"CONSTITUENCY", 150 contain the word "CONTEST" and 100 contain the word
"CANDIDATE". What is the probability that a text message is an election message if it
contains all three words? (10)
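The arithmetic behind Q1 can be sketched in a few lines. This is a minimal illustration, assuming plain relative-frequency estimates (no Laplace smoothing) and the usual naive independence of the three words given the class. The variable and function names are illustrative and do not come from the paper.

```python
from fractions import Fraction

# Counts taken from the question statement.
n_election, n_other = 1000, 500
election_word_counts = {"CONSTITUENCY": 800, "CONTEST": 750, "CANDIDATE": 600}
other_word_counts = {"CONSTITUENCY": 50, "CONTEST": 150, "CANDIDATE": 100}


def joint_score(class_size, total, word_counts):
    """Prior times the product of per-word likelihoods (naive independence)."""
    score = Fraction(class_size, total)
    for count in word_counts.values():
        score *= Fraction(count, class_size)
    return score


total = n_election + n_other
score_election = joint_score(n_election, total, election_word_counts)
score_other = joint_score(n_other, total, other_word_counts)

# Normalise over both classes to get the posterior probability.
posterior_election = score_election / (score_election + score_other)
print(posterior_election, float(posterior_election))
```

Exact fractions are used so that the normalisation step does not introduce rounding error.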

Q2. Consider the following matrix of ratings given by 12 users to 6 movies.

U1 U2 U3 U4 U5 U6 U7 U8 U9 U10 U11 U12


M1 1 3 4 4 5
M2 4 5 4 1 2 3
M3 1 5 2 1 3 4 3 4
M4 1 5 5 5 1
M5 4 3 5 1 1 4
M6 1 3 3 2 4

Using item-item collaborative filtering, predict the rating of M1 by U5. For the similarity measure, use
centred cosine similarity, and use the 2 nearest neighbours of M1. (20)
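A rough sketch of the technique named in Q2 follows. The ratings in it are a HYPOTHETICAL toy matrix, not the one in the question. It assumes that each item's ratings are centred on that item's own mean, that missing ratings count as zero after centring, and that the prediction is the similarity-weighted average over the k most similar items the user has rated. All names are illustrative.

```python
import math

# HYPOTHETICAL toy ratings (item -> {user: rating}); not the exam's matrix.
ratings = {
    "A": {"u1": 5, "u2": 1, "u3": 3},
    "B": {"u1": 4, "u2": 2, "u4": 5},
    "C": {"u1": 1, "u2": 5, "u4": 2},
    "D": {"u2": 1, "u3": 3, "u4": 4},
}


def centred(row):
    """Subtract the item's mean rating from each of its known ratings."""
    mean = sum(row.values()) / len(row)
    return {user: r - mean for user, r in row.items()}


def centred_cosine(row_a, row_b):
    """Cosine of mean-centred rows; missing ratings count as 0 after centring."""
    a, b = centred(row_a), centred(row_b)
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(ratings, item, user, k=2):
    """Similarity-weighted average over the k most similar items rated by user."""
    sims = [
        (centred_cosine(ratings[item], row), row[user])
        for other, row in ratings.items()
        if other != item and user in row
    ]
    top = sorted(sims, reverse=True)[:k]
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


prediction = predict(ratings, "A", "u4")
print(round(prediction, 3))
```

The sketch does not guard against the chosen similarities summing to zero, so it should be read as an outline of the steps and not as a robust implementation.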
Q3. Consider the following two-dimensional dataset:
(180, 80), (172, 73), (178, 69), (189, 82), (164, 70), (186, 71), (180, 69), (170, 76), (166, 71),
(180, 72)
Apply two iterations of the K-means clustering algorithm to the above data points to group them into two
clusters. Show the two cluster centroids and the data points belonging to each cluster. Choose the initial
cluster centroids as (185, 70) and (170, 80), and calculate the distance between two points using the
Euclidean distance formula. (20)
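The procedure in Q3 can be sketched as follows. The sketch assumes that one "iteration" means one assignment step followed by one centroid update, which is one common reading and may not be the one the examiner intends. The function name `kmeans_step` is illustrative.

```python
import math

points = [(180, 80), (172, 73), (178, 69), (189, 82), (164, 70),
          (186, 71), (180, 69), (170, 76), (166, 71), (180, 72)]
centroids = [(185.0, 70.0), (170.0, 80.0)]  # initial centroids from the question


def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then recompute the means."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    # Assumes no cluster ends up empty, which holds for this dataset.
    new_centroids = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]
    return clusters, new_centroids


for iteration in (1, 2):
    clusters, centroids = kmeans_step(points, centroids)
    print(iteration, centroids, clusters)
```

Each printed line shows the centroids after the update together with the cluster membership that produced them.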
Q4. (a) Consider the following two 5-dimensional data points:
Coordinate1 Coordinate2 Coordinate3 Coordinate4 Coordinate5
25 -8 55 -8 -18
45 -25 32 4.5 -26
Calculate the Euclidean, Manhattan and Chebyshev distances between the two data points. (3+3+3)
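As a quick check on the three definitions in Q4(a), the following sketch computes them from the per-coordinate absolute differences. The names `p`, `q` and `diffs` are illustrative.

```python
import math

p = (25, -8, 55, -8, -18)
q = (45, -25, 32, 4.5, -26)

# Absolute difference along each coordinate.
diffs = [abs(a - b) for a, b in zip(p, q)]

euclidean = math.sqrt(sum(d * d for d in diffs))  # square root of summed squares
manhattan = sum(diffs)                            # sum of absolute differences
chebyshev = max(diffs)                            # largest absolute difference

print(euclidean, manhattan, chebyshev)
```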
(b) Consider the three sets of numbers A = {-7, 1, 2, 6, 4}, B = {1, 3, 8}, C = {-7, -5, -4, 6}. Calculate
the Jaccard similarity between each pair of sets. (3+3+3)
(c) Calculate Jaccard distance between B and C. (2)
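Parts (b) and (c) can be checked with Python's built-in sets. This is a small sketch, assuming B is meant to be a set like A and C, and using the standard definition of Jaccard distance as one minus the similarity. The helper name `jaccard` is illustrative.

```python
from fractions import Fraction

A = {-7, 1, 2, 6, 4}
B = {1, 3, 8}
C = {-7, -5, -4, 6}


def jaccard(x, y):
    """Size of the intersection over size of the union, as an exact fraction."""
    return Fraction(len(x & y), len(x | y))


sim_ab, sim_ac, sim_bc = jaccard(A, B), jaccard(A, C), jaccard(B, C)
dist_bc = 1 - sim_bc  # Jaccard distance is 1 minus the similarity
print(sim_ab, sim_ac, sim_bc, dist_bc)
```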
Q5. Consider the following three documents:

Doc ID Document
Doc1 three pens, four papers
Doc2 blue papers, red pens
Doc3 three blue pens

Using Map-Reduce, we want to build an inverted index showing how many times each word occurs
in each document. For example, the desired output for the word "three" is shown below.

three (Doc1,1), (Doc3,1)

What are the inputs of the map function and the output (key, value) pairs for each of those inputs? (8)

After shuffle, what does the reducer function do to generate the desired output? (2)
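One possible shape for the map, shuffle and reduce stages is simulated below in plain Python, with no Hadoop or Spark involved. It assumes the map input is a (doc_id, text) pair and that the map output is (word, (doc_id, 1)). Other key designs, such as keying on (word, doc_id), are equally valid. The function names are illustrative.

```python
from collections import Counter, defaultdict

documents = {
    "Doc1": "three pens, four papers",
    "Doc2": "blue papers, red pens",
    "Doc3": "three blue pens",
}


def map_fn(doc_id, text):
    """Map: for one (doc_id, text) input, emit (word, (doc_id, 1)) per word."""
    for word in text.replace(",", " ").split():
        yield word, (doc_id, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(word, values):
    """Reduce: sum the counts per document for one word."""
    counts = Counter()
    for doc_id, n in values:
        counts[doc_id] += n
    return word, sorted(counts.items())


mapped = [pair for doc_id, text in documents.items() for pair in map_fn(doc_id, text)]
index = dict(reduce_fn(word, values) for word, values in shuffle(mapped).items())
print(index["three"])  # [('Doc1', 1), ('Doc3', 1)]
```

Summing inside the reducer means a word repeated within one document would still yield a single (doc_id, count) entry for that document.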

Q6. (a) Consider the following matrix, which shows the nutrient values of some foods:

[Matrix of nutrient values per food; the entries are not legible in the source.]

Find the nutrition value of a breakfast consisting of 2 bananas, 2 potatoes, 0.5 glass of milk and 0.25
pancake, using the Map-Reduce algorithm. Specify the input and output <key, value> pairs for the Map and
Reduce stages, and the Map and Reduce operations. (15)

(b) Write the output of the following program (here sc is spark.sparkContext): (5)

rdd = sc.parallelize([3, 5]).flatMap(lambda x: [x,2*x])


rdd.cartesian(rdd).reduceByKey(lambda x,y:x*y).collect()
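The three RDD operations above can be traced without a Spark installation. The sketch below is a plain-Python emulation, not PySpark itself: it mirrors flatMap with a nested comprehension, cartesian with `itertools.product`, and reduceByKey with a per-key fold. In real Spark the order of the tuples returned by collect() depends on partitioning, so the emulation compares contents only.

```python
from functools import reduce
from itertools import product

# flatMap: each x is expanded to [x, 2*x] and the lists are concatenated.
elements = [y for x in [3, 5] for y in (x, 2 * x)]  # [3, 6, 5, 10]

# cartesian: every ordered pair (a, b); reduceByKey treats a as key, b as value.
pairs = list(product(elements, elements))

# reduceByKey: fold the values of each key with the multiplication lambda.
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)
result = {key: reduce(lambda x, y: x * y, values) for key, values in grouped.items()}

print(sorted(result.items()))
```

Because multiplication is commutative and associative, the per-key result does not depend on the order in which Spark would combine the values.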
