Midterm Report

MIDTERM PROJECT
Course: Mining Massive Datasets

Members
Student ID | Full name | Email | Assigned tasks | Completion percentage
519H0310 | Trần Lê Thành Lộc | 519H0310@student.tdtu.edu.vn | Task 1, 2, 5 | 100%
519H0306 | Trần Trung Kiên | 519h0306@student.tdtu.edu.vn | Task 3, 4 | 100%

Task 1: Item Counting
• Solve:
• Use only RDD functions to find the results (no DataFrame): load the file with textFile().
• The result must not contain the header line: get the header with first() and remove it with filter().
• Use map() to split each line into its elements: Member_number, Date, itemDescription, year, month, day, day_of_week.
• Solve:
• Use map() to build the keys (Member_number, Day), (Member_number, Month), and (Member_number, Year), each paired with the value 1, so that every data line counts as one item bought.
• Count the number of items bought by each customer for each day, month, and year: reduceByKey() adds up the per-line values to produce the Quantity column.
• The values on each line are separated by a “,” with no “,” at the end of the line: join() the elements of the 3 columns with “,”.
• Solve:
• Find the maximum number of items in a basket sold on each day: groupByKey() to merge the items under one date.
• Count the number of items bought by each customer for each day, month, and year: reduceByKey() adds up the per-line values to produce the Quantity column.
• The values on each line are separated by a “,” with no “,” at the end of the line: join() the elements with “,”.
• Use saveAsTextFile() to save the four results: number of items for each day, number of items for each month, number of items for each year, and maximum number of items in a basket for each day.
• Output:
Each output folder contains 3 files: _SUCCESS, part-00000, part-00001.
_SUCCESS: a marker file indicating that the text file was saved successfully.
part-00000, part-00001: the files that hold the output data, one per partition.
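The Task 1 steps can be sketched in plain Python, with each step mirroring the RDD call named above. This is a minimal sketch rather than the submitted program: the sample rows, the `count_by` helper, and the column order (taken from the split described above) are assumptions, a `Counter` stands in for reduceByKey(), and a basket is assumed to mean all items one customer bought on one date.

```python
from collections import Counter, defaultdict

# Hypothetical sample in the assumed column order:
# Member_number,Date,itemDescription,year,month,day,day_of_week
lines = [
    "Member_number,Date,itemDescription,year,month,day,day_of_week",
    "1000,15/03/2015,sausage,2015,3,15,6",
    "1000,15/03/2015,whole milk,2015,3,15,6",
    "1000,24/06/2014,pastry,2014,6,24,1",
    "2000,15/03/2015,soda,2015,3,15,6",
]

header = lines[0]                                        # first()
rows = [l.split(",") for l in lines if l != header]      # filter() + map()

def count_by(rows, key_index):
    """Count one item per line for each (Member_number, key) pair."""
    counts = Counter()                                   # stands in for reduceByKey()
    for r in rows:
        counts[(r[0], r[key_index])] += 1                # map() to ((member, key), 1)
    # join() the three columns with "," and no trailing ","
    return sorted(",".join([m, k, str(n)]) for (m, k), n in counts.items())

per_day = count_by(rows, 1)
per_month = count_by(rows, 4)
per_year = count_by(rows, 3)

# Maximum basket size per day: group items by (member, date),
# then keep the largest basket seen for each date.
baskets = defaultdict(list)                              # stands in for groupByKey()
for r in rows:
    baskets[(r[0], r[1])].append(r[2])
max_per_day = {}
for (_, date), items in baskets.items():
    max_per_day[date] = max(max_per_day.get(date, 0), len(items))
```

In Spark itself, each of the four results would then be written with saveAsTextFile(); the sketch stops at the in-memory values.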
Task 2: Baskets
• Solve:
• Use only RDD functions to find the results (no DataFrame): load the file with textFile().
• Use map() to split each line into its elements: Member_number, Date, itemDescription, year, month, day, day_of_week.
• Use map() to emit ((Member_number, Date), itemDescription) for the item bought on each data line.
• Solve:
• Find the list of items bought by each customer for each day: groupByKey() to merge the items for each customer and day.
• Each item appears in a basket at most once: set() keeps only the unique items.
• Columns are separated by “;” and items are separated by “,”: join() the columns with “;” and join() the itemDescription values with “,”.
• Use saveAsTextFile() to save the resulting baskets.
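The basket-building steps can be illustrated with a plain-Python sketch. It is not the submitted RDD code: the sample rows and the `format_basket` helper are hypothetical, a dictionary of sets stands in for groupByKey() followed by set(), and the items are sorted only to make the output deterministic.

```python
from collections import defaultdict

# Hypothetical (Member_number, Date, itemDescription) rows after the split.
rows = [
    ("1000", "15/03/2015", "sausage"),
    ("1000", "15/03/2015", "whole milk"),
    ("1000", "15/03/2015", "sausage"),   # repeated item, kept only once
    ("2000", "15/03/2015", "soda"),
]

# Group items by (member, date); a set keeps each item at most once.
grouped = defaultdict(set)
for member, date, item in rows:
    grouped[(member, date)].add(item)

def format_basket(key, items):
    """Join columns with ';' and items with ','."""
    member, date = key
    return ";".join([member, date, ",".join(sorted(items))])

basket_lines = [format_basket(k, v) for k, v in sorted(grouped.items())]
```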
• Solve:
Program 2A:
• Starting from the RDD built in the previous part, use filter() to match the given member_number and date, and return the corresponding basket.
Program 2B:
• Check that the input n is a positive integer.
• Starting from the RDD built in the previous part, use groupByKey() to merge the itemDescription values that share the same member_number and date.
• Use filter() to match the member_number and sortBy() to sort by Date in descending order, then return a list of n baskets.
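A minimal plain-Python sketch of the two lookups follows. The function names `find_basket` and `latest_baskets`, the sample data, and the DD/MM/YYYY date format are assumptions; "a list of n baskets" is read here as the n most recent baskets of that customer.

```python
from datetime import datetime

# Hypothetical baskets keyed by (member_number, date); dates assumed DD/MM/YYYY.
baskets = {
    ("1000", "15/03/2015"): ["sausage", "whole milk"],
    ("1000", "24/06/2014"): ["pastry"],
    ("1000", "27/11/2015"): ["soda", "pickled vegetables"],
    ("2000", "15/03/2015"): ["soda"],
}

def find_basket(member_number, date):
    """Program 2A: return the basket for one customer on one date, or None."""
    return baskets.get((member_number, date))

def latest_baskets(member_number, n):
    """Program 2B: return the customer's n most recent (date, basket) pairs."""
    if not isinstance(n, int) or isinstance(n, bool) or n <= 0:
        raise ValueError("n must be a positive integer")
    # Keep only this customer's baskets (the filter() step).
    own = [(d, items) for (m, d), items in baskets.items() if m == member_number]
    # Sort by parsed date, newest first (the sortBy() step).
    own.sort(key=lambda p: datetime.strptime(p[0], "%d/%m/%Y"), reverse=True)
    return own[:n]
```

Parsing the date before sorting matters under the assumed format: sorting the raw DD/MM/YYYY strings would order baskets by day of month first rather than chronologically.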
• Outputs: Program 2A, Program 2B
Task 3: Frequent Itemsets
• Solve: display df and frequency
• Solve: 3a
• Solve: 3b
• Output: 3a, 3b
Task 4: Baskets-to-Vectors
• Solve:
• Result:
Task 5: MinHashLSH
• Solve:
• Use basket2vector() from Task 4 to convert each Item list into a vector.
• Create a MinHashLSH model with inputCol = Item, outputCol = Hashes, and numHashTables = 8.
• Train the model on dfMembers with fit().
• Apply transform() to dfMembers.
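To illustrate what the Hashes column represents, here is a small plain-Python sketch of min-hashing. It does not call Spark's MinHashLSH and makes no claim about its internals: the hash family (a * x + b) mod p, the prime, the seed, and the sample item indices are all assumptions chosen for illustration. It only shows the general idea that a set of item indices is reduced to one minimum hash value per hash function.

```python
import random

PRIME = 2_147_483_647      # a large prime (2^31 - 1), assumed modulus
NUM_HASH_TABLES = 8        # mirrors numHashTables = 8 above

# One random (a, b) coefficient pair per hash function; fixed seed for repeatability.
rng = random.Random(42)
coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
          for _ in range(NUM_HASH_TABLES)]

def minhash(item_ids):
    """Return one min-hash value per hash function for a non-empty set of item indices."""
    return [min((a * x + b) % PRIME for x in item_ids) for a, b in coeffs]

def jaccard_distance(s, t):
    """1 - |s ∩ t| / |s ∪ t| for two non-empty sets."""
    return 1.0 - len(s & t) / len(s | t)

# Hypothetical customers, each represented by the indices of items they bought.
basket_a = {0, 3, 7, 9}
basket_b = {0, 3, 7, 12}
sig_a, sig_b = minhash(basket_a), minhash(basket_b)

# Fraction of hash functions on which the two signatures agree.
agreement = sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_HASH_TABLES
```

For each hash function, two sets share the same min-hash value with probability equal to their Jaccard similarity, so with more hash functions `agreement` tends toward `1 - jaccard_distance(basket_a, basket_b)`. With only 8 functions the estimate is coarse.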
• Solve:
• Use approxSimilarityJoin() to find pairs of customers with similar shopping habits, keeping pairs whose JaccardDistance is below 0.5 (distance < 0.5).
• Use filter() to keep only rows with a positive JaccardDistance (> 0), which excludes pairs with identical vectors, such as a customer matched with itself.
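The effect of the two conditions can be shown with an exact, brute-force stand-in for approxSimilarityJoin(). This sketch compares every pair directly instead of using LSH, and the sample customers are hypothetical. Because it enumerates distinct pairs only, the `> 0` test here drops customers with identical baskets.

```python
from itertools import combinations

# Hypothetical customers and the sets of items they bought.
members = {
    "1000": {"sausage", "whole milk", "pastry", "soda"},
    "2000": {"sausage", "whole milk", "pastry"},
    "3000": {"sausage", "whole milk", "pastry", "soda"},  # identical to 1000
    "4000": {"beer"},
}

def jaccard_distance(s, t):
    """1 - |s ∩ t| / |s ∪ t| for two non-empty sets."""
    return 1.0 - len(s & t) / len(s | t)

THRESHOLD = 0.5

pairs = []
for a, b in combinations(sorted(members), 2):
    d = jaccard_distance(members[a], members[b])
    # Keep pairs closer than the threshold, but skip distance 0 (identical baskets).
    if 0 < d < THRESHOLD:
        pairs.append((a, b, d))
```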
• Solve:
• Use approxNearestNeighbors() to find the 6 customers whose shopping habits are most similar to those of the customer whose member_number the user enters. We request 6 because the result includes a duplicate of that customer.
• Use filter() to remove the row that duplicates the input member_number.
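The "ask for 6, then drop the duplicate" pattern can be sketched with an exact nearest-neighbor search in plain Python. It stands in for approxNearestNeighbors() without using LSH; the `nearest_customers` helper and the sample data are hypothetical.

```python
def jaccard_distance(s, t):
    """1 - |s ∩ t| / |s ∪ t| for two non-empty sets."""
    return 1.0 - len(s & t) / len(s | t)

# Hypothetical customers and the sets of items they bought.
members = {
    "1000": {"sausage", "whole milk", "pastry", "soda"},
    "2000": {"sausage", "whole milk", "pastry"},
    "3000": {"sausage", "whole milk"},
    "4000": {"sausage"},
    "5000": {"sausage", "whole milk", "pastry", "soda", "yogurt"},
    "6000": {"beer"},
    "7000": {"sausage", "beer"},
}

def nearest_customers(members, member_number, k=6):
    """Return the customers closest to member_number, excluding member_number itself."""
    key = members[member_number]
    # Rank everyone by distance to the query; ties are broken by member number.
    ranked = sorted(members, key=lambda m: (jaccard_distance(members[m], key), m))
    top = ranked[:k]  # in this sample the query itself is first, at distance 0
    return [m for m in top if m != member_number]  # the filter() step

similar = nearest_customers(members, "1000")
```

With k = 6 and the queried customer removed, five other customers remain, ordered from most to least similar.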
Assignment Sheet
Task | Trung Kiên | Thành Lộc
Task 1 | | HT
Task 2 | | HT
Task 3 | HT |
Task 4 | HT |
Task 5 | | HT
