1. Problem statement:
Write and execute a MapReduce program to identify the top 100 trending songs from
Saavn's stream data, on a daily basis, for the week of December 25-31, 2017. A stream is a
record of a user playing a song. Each stream is represented as a tuple with the following
attributes:
For example, if a song was streamed on the date immediately preceding the target date, the
difference in dates is 1 and the weight is the full 1.0. If it was streamed two days before the
target date, the weight is 0.8. In this scheme, the further back from the target date a song was
streamed, the less that stream contributes to the song's total play count.
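The decay rule above can be sketched as follows. This is an illustrative fragment, not the project's code; it assumes a decay factor of 0.8 (matching the "decaypoint8" configurations named later) and exponential decay per extra day of distance from the target date.

```java
public class DecayWeight {
    static final double DECAY = 0.8; // assumed decay factor, per the "decaypoint8" configs

    // daysBack = difference in days between the stream date and the target date.
    // daysBack == 1 -> 1.0, daysBack == 2 -> 0.8, daysBack == 3 -> 0.64, ...
    static double weight(int daysBack) {
        return Math.pow(DECAY, daysBack - 1);
    }
}
```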
The weights are then summed to find the total weight for each song. The songs are sorted
by this sum, and the 100 songs with the highest sums are chosen as the top 100 trending
songs.
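Selecting the highest-weighted songs can be sketched with a bounded min-heap, which avoids sorting the full song list. The class and method names here are illustrative, not taken from the project.

```java
import java.util.*;

public class TopSongs {
    // Given songId -> total decayed weight, return the top-k song ids,
    // highest total weight first.
    static List<String> topK(Map<String, Double> weights, int k) {
        // Min-heap of size k: the smallest of the current top-k sits at the root.
        PriorityQueue<Map.Entry<String, Double>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // evict the lowest weight
        }
        List<String> top = new ArrayList<>();
        while (!heap.isEmpty()) top.add(heap.poll().getKey());
        Collections.reverse(top); // heap drains lowest-first, so reverse
        return top;
    }
}
```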
3. MapReduce design:
The entire task is achieved in a single MapReduce phase consisting of a mapper, a combiner,
a partitioner and a reducer.
The value is the ValuesPair custom object containing the weight and the timestamp.
Mapper value: ValuesPair
Note that the key needs the timestamp, as the timestamp is the basis on which the spike-filtering
algorithm works. Since records for a song with a given timestamp can be spread across different
mappers, this filtering cannot be done in the combiners; we must wait until the reduce phase to
process these timestamps and perform the filtering.
The S3Connector and AWS CLI are installed as described in the project resources module.
where window is the desired window size in the sliding-window algorithm and spikefactor is
the spike factor described in the algorithm section.
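The algorithm section itself is not reproduced here, so the following is only a plausible sketch of one common form of sliding-window spike filtering: a count is flagged as a spike when it exceeds spikefactor times the mean of the preceding window counts. The class, method, and parameter names are hypothetical.

```java
public class SpikeFilter {
    // Hypothetical sliding-window spike check: flags position i when counts[i]
    // exceeds spikeFactor times the mean of the preceding `window` counts.
    static boolean isSpike(int[] counts, int i, int window, double spikeFactor) {
        if (i < window) return false; // not enough history to compare against
        double sum = 0;
        for (int j = i - window; j < i; j++) sum += counts[j];
        double mean = sum / window;
        return counts[i] > spikeFactor * mean;
    }
}
```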
The MapReduce output files are then downloaded to the PC using WinSCP with the same
credentials as mentioned before.
The output files each correspond to a date, starting at 25-12-2017 (part-r-00000) and ending,
in sequence, at 31-12-2017 (part-r-00006). The files are renamed accordingly to indicate the
date and placed in an output folder with a name of the following form:
output-window1-decaypoint8-spikefactor1000-dates
Similarly, the gold-standard set of files for December 2017 provided in S3 is transferred to
EC2 using copyToLocal and then downloaded to the PC using WinSCP.
In the project, the gold-standard file is placed in the resources/goldstandard folder, while the
various sets of MapReduce output files are placed (one folder per configuration tested) in the
resources/output folder.
Finally, for each date, the intersection set is computed using retainAll().
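The retainAll() step can be sketched as below. The helper name is illustrative; the key point is copying the output set first, since retainAll() mutates the set it is called on.

```java
import java.util.*;

public class Overlap {
    // Intersection of one day's MapReduce output song ids with the
    // gold-standard song ids, via java.util.Set#retainAll.
    static Set<String> intersection(Set<String> output, Set<String> gold) {
        Set<String> common = new HashSet<>(output); // copy so inputs are untouched
        common.retainAll(gold);                     // keep only ids also in gold
        return common;
    }
}
```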
The AnalyseOutputs class is updated to contain a list of all the directories in the output folder.
It is then run from Eclipse using Run As -> Java Application.
The output of the program is, for each directory, the number of songIds that overlap with the gold
standard for each date of interest (25-31 Dec 2017), along with the list of overlapping songs. This
is written to the file results.txt in the resources/analysisresult/ folder of the DataAnalysis project.
6. Conclusions:
It is seen that window size 2 performs better than window size 5, and the overlap size is greater
than 60 for all dates. Changing the spike factor from 1000 to 10 does not appear to have a
significant effect.
The best result was for configuration
output-window2-decaypoint8-spikefactor1000/