
MIDTERM PROJECT

Course: Mining Massive Datasets


Members

Student ID   Full name            Email                           Assigned tasks   Completion percentage
519H0310     Trần Lê Thành Lộc    519H0310@student.tdtu.edu.vn    Task 1, 2, 5     100%
519H0306     Trần Trung Kiên      519h0306@student.tdtu.edu.vn    Task 3, 4        100%


Task 1: Item Counting
• Solve:
• Use only RDD functions to obtain the results (no DataFrame): load the file with textFile().

• The result must not contain the header line: use first() to get the header and filter() to remove it.

• Use map() to split each line of the text file into its fields: Member_number, Date, itemDescription, year, month, day, day_of_week.
• Use map() to emit the keys (Member_number, Day), (Member_number, Month), and (Member_number, Year), each with the value 1, so that every line of data counts as 1 item bought.

• Count the number of items bought by each customer for each day, month, and year: use reduceByKey() to add up the value from each line, giving the Quantity column.

• The values on each line are separated by “,”, with no “,” at the end of the line: use join() with “,” between the elements of the 3 columns.
• Find the maximum number of items in a basket sold on each day: use groupByKey() to merge the items belonging to the same date.

• Use saveAsTextFile() to save the four results: number of items for each day, number of items for each month, number of items for each year, and maximum number of items in a basket for each day.
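The Task 1 steps above can be sketched without Spark. The following is a minimal plain-Python illustration, not the project's code: each helper mirrors one RDD step named in the slides, and the sample rows, the helper names (drop_header, count_by, max_basket_per_day, to_csv), and the reading of "maximum basket per day" as the largest per-customer count on each date are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample rows in the column layout described in the slides.
LINES = [
    "Member_number,Date,itemDescription,year,month,day,day_of_week",
    "1000,15/03/2015,whole milk,2015,3,15,6",
    "1000,15/03/2015,yogurt,2015,3,15,6",
    "1000,24/06/2014,pastry,2014,6,24,1",
    "2000,15/03/2015,soda,2015,3,15,6",
]


def drop_header(lines):
    """Mirror first() + filter(): drop every line equal to the header."""
    header = lines[0]
    return [line for line in lines if line != header]


def count_by(rows, key_fn):
    """Mirror map(row -> (key, 1)) followed by reduceByKey(add)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += 1
    return dict(totals)


def max_basket_per_day(per_day):
    """Assumed reading: largest (customer, date) item count on each date."""
    best = defaultdict(int)
    for (_member, date), quantity in per_day.items():
        best[date] = max(best[date], quantity)
    return dict(best)


def to_csv(counts):
    """Join key fields and the count with ',' so no line ends in a comma."""
    return sorted(",".join([*key, str(n)]) for key, n in counts.items())


rows = [line.split(",") for line in drop_header(LINES)]
per_day = count_by(rows, lambda r: (r[0], r[1]))    # (Member_number, Date)
per_month = count_by(rows, lambda r: (r[0], r[4]))  # (Member_number, month)
per_year = count_by(rows, lambda r: (r[0], r[3]))   # (Member_number, year)
daily_max = max_basket_per_day(per_day)

for line in to_csv(per_day):
    print(line)
```

In the Spark version, each helper would correspond to an RDD transformation, and the formatted lines would be written with saveAsTextFile() instead of printed.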
Task 1: Item Counting
• Output:

Each output folder contains 3 files: _SUCCESS, part-00000, and part-00001.


_SUCCESS: a marker indicating that the text file was saved successfully.
part-00000, part-00001: these files store all of the data.
Task 2: Baskets
• Solve:
• Use only RDD functions to obtain the results (no DataFrame): load the file with textFile().

• Use map() to split each line of the text file into its fields: Member_number, Date, itemDescription, year, month, day, day_of_week.

• Use map() to emit the pair ((Member_number, Date), itemDescription), that is, the item bought on each line of data.
• Find the list of items bought by each customer on each day: use groupByKey() to merge the items for each customer and day.

• Each item in a basket appears no more than once: use set() to keep only unique items.

• Columns are separated by “;” and items are separated by “,”: use join() with “;” between columns and join() with “,” between itemDescription values.

• Use saveAsTextFile() to save the basket results.
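The basket construction described above can be illustrated in plain Python. This is a sketch under stated assumptions rather than the project's Spark code: the sample rows and the helper names build_baskets and format_basket are hypothetical, and the items are sorted only to make the output deterministic (a set has no fixed order).

```python
from collections import defaultdict

# Hypothetical (Member_number, Date, itemDescription) rows after splitting.
ROWS = [
    ("1000", "15/03/2015", "whole milk"),
    ("1000", "15/03/2015", "yogurt"),
    ("1000", "15/03/2015", "whole milk"),  # repeated item in the same basket
    ("2000", "01/01/2014", "soda"),
]


def build_baskets(rows):
    """Mirror map(((member, date), item)) + groupByKey() + set()."""
    grouped = defaultdict(set)
    for member, date, item in rows:
        grouped[(member, date)].add(item)  # the set keeps each item once
    return dict(grouped)


def format_basket(key, items):
    """Columns joined with ';', items joined with ','."""
    member, date = key
    return ";".join([member, date, ",".join(sorted(items))])


baskets = build_baskets(ROWS)
lines = sorted(format_basket(key, items) for key, items in baskets.items())
for line in lines:
    print(line)
```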
Task 2: Baskets
• Outputs:
Task 2: Baskets
• Solve:
Program 2A:
• Starting from the RDD of the previous part, use filter() to find the given member_number and date, and return the corresponding basket.

Program 2B:
• Check that the input n is a positive integer.
• Starting from the RDD of the previous part, use groupByKey() to merge the itemDescription values that share the same member_number and date.
• Use filter() to find the member_number and sortBy() to sort by Date in descending order, then return a list of n baskets.
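As a rough illustration of Programs 2A and 2B, the lookups can be written in plain Python over an in-memory dictionary of baskets. The names find_basket and latest_baskets, the sample data, and the dd/mm/yyyy date format are assumptions for this sketch; the original programs operate on an RDD with filter() and sortBy().

```python
from datetime import datetime

# Hypothetical baskets keyed by (member_number, date).
BASKETS = {
    ("1000", "15/03/2015"): {"whole milk", "yogurt"},
    ("1000", "24/06/2014"): {"pastry"},
    ("1000", "27/05/2015"): {"soda"},
    ("2000", "01/01/2014"): {"soda"},
}


def find_basket(baskets, member, date):
    """2A: mirror filter() on (member_number, date); None if no match."""
    matches = [items for key, items in baskets.items() if key == (member, date)]
    return sorted(matches[0]) if matches else None


def latest_baskets(baskets, member, n):
    """2B: validate n, keep one member's baskets, sort by date descending."""
    if isinstance(n, bool) or not isinstance(n, int) or n <= 0:
        raise ValueError("n must be a positive integer")
    mine = [(date, sorted(items))
            for (m, date), items in baskets.items() if m == member]
    # Assumed date format dd/mm/yyyy; parse so the sort is chronological.
    mine.sort(key=lambda pair: datetime.strptime(pair[0], "%d/%m/%Y"),
              reverse=True)
    return mine[:n]


print(find_basket(BASKETS, "1000", "15/03/2015"))
print(latest_baskets(BASKETS, "1000", 2))
```

Parsing the date before sorting matters here: sorting the raw dd/mm/yyyy strings would order by day of month rather than chronologically.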
Task 2: Baskets
• Outputs:
Program 2A: Program 2B:
Task 3: Frequent Itemsets
• Solve: display df and frequency
• Solve: 3a
• Solve: 3b
• Output:
• Output: 3a 3b
Task 4: Baskets-to-Vectors
• Solve:
• Result:
Task 5: MinHashLSH
• Solve:
• Using basket2vector() from Task 4, convert the Item column into vectors.
• Create a MinHashLSH model with inputCol = Item, outputCol = Hashes, numHashTables = 8.
• Train the model on dfMembers with fit().
• Apply the transform() function to dfMembers.
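To give an intuition for what a min-hash signature with 8 hash tables looks like, here is a generic plain-Python sketch of the min-hash idea. It is not Spark's internal implementation: the modulus PRIME, the seeded random coefficients, the item index VOCAB, and the function names are all assumptions chosen for illustration.

```python
import random

PRIME = 2_147_483_647  # modulus for the hash family (arbitrary choice here)

# Hypothetical mapping from item name to vector index.
VOCAB = {"whole milk": 0, "yogurt": 1, "soda": 2, "pastry": 3}


def make_hash_params(num_tables, seed=42):
    """Draw one (a, b) coefficient pair per hash table, reproducibly."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_tables)]


def minhash_signature(item_ids, params):
    """For each (a, b): the minimum of (a * x + b) mod PRIME over the items."""
    if not item_ids:
        raise ValueError("min-hash needs at least one item")
    return [min((a * x + b) % PRIME for x in item_ids) for a, b in params]


def to_ids(basket):
    """Turn item names into the indices of the non-zero vector entries."""
    return {VOCAB[item] for item in basket}


params = make_hash_params(num_tables=8)
sig_a = minhash_signature(to_ids({"whole milk", "yogurt"}), params)
sig_b = minhash_signature(to_ids({"yogurt", "whole milk"}), params)
sig_c = minhash_signature(to_ids({"pastry"}), params)
print(sig_a == sig_b, len(sig_a))
```

Identical item sets always produce identical signatures, and sets with higher Jaccard similarity tend to agree in more signature positions, which is what makes the hashes usable for approximate similarity search.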
• Use approxSimilarityJoin() to find pairs of customers with similar shopping habits, where the JaccardDistance is less than 0.5 (distance < 0.5).
• Use filter() to keep only the rows with a positive JaccardDistance (> 0).
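The distance criterion above can be made concrete with an exact Jaccard computation in plain Python. Note the swap: approxSimilarityJoin() is approximate and hash-based, whereas this sketch compares every pair exactly. The sample item sets and the names jaccard_distance and similar_pairs are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical item sets per member_number.
MEMBERS = {
    "1000": {"milk", "yogurt", "soda"},
    "2000": {"milk", "yogurt"},
    "3000": {"pastry"},
    "4000": {"milk", "yogurt", "soda"},
}


def similar_pairs(members, threshold=0.5):
    """Keep pairs with 0 < distance < threshold, as in the two slide steps."""
    pairs = []
    for id_a, id_b in combinations(sorted(members), 2):
        distance = jaccard_distance(members[id_a], members[id_b])
        # distance > 0 also drops distinct customers with identical items.
        if 0 < distance < threshold:
            pairs.append((id_a, id_b, distance))
    return pairs


for id_a, id_b, distance in similar_pairs(MEMBERS):
    print(id_a, id_b, round(distance, 3))
```

When a dataset is joined with itself, every customer is also paired with itself at distance 0, which is one reason to filter on a positive distance.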
• Use approxNearestNeighbors() to find the 6 customers (6 because one of the returned rows is a duplicate) whose shopping habits are most similar to those of the customer whose member_number is entered by the user.
• Use filter() to remove the duplicate row for that member_number.
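The "request one extra, then drop the duplicate" pattern can be sketched with an exact nearest-neighbour ranking in plain Python. This replaces the approximate approxNearestNeighbors() search with a full scan, and the sample data, the name nearest_customers, and the tie-break by member_number are assumptions made for the example.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical item sets per member_number.
MEMBERS = {
    "1000": {"milk", "yogurt", "soda"},
    "2000": {"milk", "yogurt"},
    "3000": {"pastry"},
    "4000": {"milk", "soda", "pastry"},
}


def nearest_customers(members, query_id, num_requested=6):
    """Rank everyone by distance to the query, then drop the query's own row."""
    query_items = members[query_id]
    ranked = sorted(
        (jaccard_distance(query_items, items), member_id)
        for member_id, items in members.items()
    )
    top = ranked[:num_requested]  # includes the query itself at distance 0
    return [(member_id, distance) for distance, member_id in top
            if member_id != query_id]


# With 3 rows requested, 2 neighbours remain after removing the query row.
print(nearest_customers(MEMBERS, "1000", num_requested=3))
```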
Assignment Sheet
          Trung Kiên   Thành Lộc
Task 1                 HT
Task 2                 HT
Task 3    HT
Task 4    HT
Task 5                 HT
