
CMIS-550 Assignment 1 Due February 21, 2023

Instructions
This assignment covers all material prior to the midterm. In particular, it will test your knowledge
of MapReduce. You will be tasked with performing real-world analysis on three data sets:
• Sales data (purchases.txt)
• Web server logs (access_log)
• Discussion forum data (forum_data.zip)
These datasets can be found on myCourses.

Grading
With regard to grading, even though not explicitly mentioned in the questions, it is to your advantage
to show all your work and justify your responses.

Question                                  Points   Score
Question 1: Sales Data Analysis           6
Question 2: Log Analysis                  3
Question 3: Discussion Forum Analysis     6
Total:                                    15

Submissions
The assignment is due February 21, 2023 and is to be submitted through myCourses. E-mail
and late submissions will not be accepted.

Assignments may be submitted in your project group. Please be aware that McGill’s policy of
academic integrity applies to group work.

Note that the assignment closely follows the material of Udacity’s Introduction to Hadoop and
MapReduce. You may be interested in viewing their course materials for reference.

When submitting your work, please submit for each question:


1. Your mapper and reducer Python scripts.
2. The answer to the question.
3. An optional notes.txt file for any information you would like to relay to the grader.
The assignment should be submitted as one zip file with the following contents:
Assignment01/
-- Q01/
---- q01A-answer.txt # Response to question
---- q01A-mapper.py # Mapper in Python
---- q01A-reducer.py # Reducer in Python
---- q01A-notes.txt # Any notes you’d like to tell me
---- q01B-answer.txt
---- q01B-mapper.py
---- q01B-reducer.py
---- q01B-notes.txt
---- [...]
-- Q02/
---- [...]
-- Q03/
---- [...]
To execute your MapReduce programs, you should run them through the test stream to mimic
MapReduce functionality. Once Python is installed, this can be accomplished using the following on
Windows, Mac or Linux:
cat input.txt | python3 mapper1.py | sort | python3 reducer1.py > answer1.txt

Some questions may require chaining multiple map/reduce stages.
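As a rough illustration of what the shell pipeline above does, and of how two stages can be chained, here is a minimal in-memory sketch. The helper run_stage, the mapper and reducer functions, and the sample lines are all hypothetical and are not part of the assignment; real submissions should read from standard input and write to standard output as shown in the command above.

```python
from itertools import groupby


def run_stage(lines, mapper, reducer):
    """Mimic `mapper | sort | reducer` for one stage, entirely in memory."""
    # Map: each input line yields zero or more (key, value) pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort: like `sort` in the shell pipeline, bring equal keys together.
    # (The shell `sort` compares text; here Python compares the key objects.)
    pairs.sort(key=lambda kv: kv[0])
    # Reduce: the reducer sees each key once, with all of that key's values.
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.extend(reducer(key, [value for _, value in group]))
    return out


def count_mapper(line):
    # Hypothetical input "key<TAB>anything": emit (key, 1).
    yield line.split("\t")[0], 1


def sum_reducer(key, values):
    yield f"{key}\t{sum(values)}"


def invert_mapper(line):
    # Second stage: turn "key<TAB>count" into (count, key).
    key, count = line.split("\t")
    yield int(count), key


def collect_reducer(count, keys):
    yield f"{count}\t{','.join(sorted(keys))}"


lines = ["a\tx", "b\ty", "a\tz", "c\tw"]          # made-up sample input
stage1 = run_stage(lines, count_mapper, sum_reducer)
stage2 = run_stage(stage1, invert_mapper, collect_reducer)
print(stage1)
print(stage2)
```

On the command line, the same chaining idea amounts to feeding the output file of one pipeline into a second mapper and reducer pair.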


1. The file purchases.txt contains sales information for one company with many stores. Each
line has the following tab-separated fields:
[Date]\t[Time]\t[StoreLocation]\t[ProductType]\t[Amount]\t[PaymentMethod]
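A mapper usually begins by splitting each line into these six fields and skipping malformed lines. The following is only a sketch under assumptions: the names Purchase and parse_purchase are hypothetical, the sample line is made up, and Amount is assumed to be a plain decimal number.

```python
from typing import NamedTuple, Optional


class Purchase(NamedTuple):
    date: str
    time: str
    store: str
    product_type: str
    amount: float
    payment_method: str


def parse_purchase(line: str) -> Optional[Purchase]:
    """Split one tab-separated line into its six fields, or return None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None  # skip lines that do not have exactly six fields
    date, time, store, product_type, amount, payment = fields
    try:
        return Purchase(date, time, store, product_type, float(amount), payment)
    except ValueError:
        return None  # Amount was not numeric (assumption: it should be)


# Made-up sample line in the documented format.
sample = "2012-01-01\t09:00\tSome City\tBooks\t12.50\tCash\n"
purchase = parse_purchase(sample)
print(purchase)
```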

Please use MapReduce to answer the following questions:


(a) (1 point) How many transactions occur at each store location?
(b) (1 point) What is the total amount sold of each ProductType in dollars?
(c) (1 point) How many transactions were placed for each PaymentMethod?
(d) (1 point) What is the total amount sold broken down by store location and payment
method?
(e) (1 point) What is the maximum amount paid at each separate store location?
(f) (1 point) What is the total amount sold over all the stores?
2. The data set we are using is the log file from a Web server. The log file is called access_log.
If you take a look at the file, you’ll see that each line represents a hit to the Web server. It
includes the IP address which accessed the site, the date and time of the access, and the name
of the page which was visited.
The logfile is in Common Log Format:
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /a.htm HTTP/1.1" 200 10469

%h %l %u %t "%r" %>s %b

Where:
• %h is the IP address of the client
• %l is the identity of the client, or "-" if it’s unavailable
• %u is the username of the client, or "-" if it’s unavailable
• %t is the time that the server finished processing the request. The format is
[day/month/year:hour:minute:sec zone]
• %r is the request line from the client (in double quotes). It contains the method,
path, query-string, and protocol of the request.
• %>s is the status code that the server sends back to the client. You will see mostly
status codes 200 (OK - The request has succeeded), 304 (Not Modified) and 404 (Not
Found). More information on status codes is available at W3C.org
• %b is the size of the object returned to the client, in bytes. It will be "-" in case of status
code 304.
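The field layout above can be matched with a regular expression. This is a minimal sketch rather than a complete parser: the names CLF_PATTERN and parse_log_line are hypothetical, and it assumes every line follows the seven-field format exactly, returning None otherwise.

```python
import re

# One named group per Common Log Format field: %h %l %u %t "%r" %>s %b
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>.*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # The request line is "METHOD PATH PROTOCOL"; keep the path when present.
    parts = record["request"].split()
    record["path"] = parts[1] if len(parts) >= 2 else None
    # %b is "-" when no body was returned (for example with status 304).
    record["size"] = None if record["size"] == "-" else int(record["size"])
    return record


sample = '10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /a.htm HTTP/1.1" 200 10469'
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

Matching the whole line, instead of splitting on spaces, avoids miscounting fields when the timestamp or the request line contains spaces.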
Please use MapReduce to do the following:
(a) (1 point) Display the number of hits for each different file on the web site.
(b) (1 point) Determine the number of hits to the site made by each different IP address.
(c) (1 point) Find the most popular file on the web site. That is, the file whose path occurs
most often in access_log.
3. In this question you will work with data from a discussion forum that runs on a platform similar to
StackOverflow. The basic structure is made up of nodes. All nodes have a body and author_id.
Top level nodes are called questions, and will also have a title and tags. Questions can have
answers. Both questions and answers can have comments. The dataset is called forum_data.zip
and contains two files.
The first is forum_nodes.tsv, and it contains all forum questions and answers in one table.
It was exported from the RDBMS using tab as the separator and enclosing all fields in
double quotes. You can find the field names in the first line of the file. The ones that are
most relevant to the task are:
• id: id of the node
• title: title of the node. If "node_type" is "answer" or "comment", this field will be
empty
• tagnames: space separated list of tags
• author_id: id of the author
• body: content of the post
• node_type: type of the node, either "question", "answer" or "comment"
• parent_id: node under which the post is located, will be empty for "questions"
• abs_parent_id: top node where the post is located
• added_at: date added
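Because every field is enclosed in double quotes, and a quoted field such as body may itself contain tabs or line breaks, splitting each line on tabs can break records apart. Python's csv module handles this quoting. The sketch below uses a made-up sample with only a few of the listed columns; the real file has more fields, and the helper name read_nodes is hypothetical.

```python
import csv
import io

# Made-up sample: a header row followed by two records. The body of the
# first record contains an embedded newline inside the quotes.
sample = (
    '"id"\t"title"\t"tagnames"\t"author_id"\t"body"\t"node_type"\n'
    '"1"\t"How do I sort?"\t"python sorting"\t"100"\t"First line\nsecond line"\t"question"\n'
    '"2"\t""\t""\t"101"\t"Use sorted()."\t"answer"\n'
)


def read_nodes(stream):
    """Yield one dict per record, keyed by the field names in the first line."""
    reader = csv.DictReader(stream, delimiter="\t", quotechar='"')
    for row in reader:
        yield row


rows = list(read_nodes(io.StringIO(sample)))
for row in rows:
    print(row["id"], row["node_type"], row["tagnames"].split())
```

In a mapper, the same reader can be given sys.stdin in place of the io.StringIO object used here.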
The second table is forum_users.tsv. It contains fields for
• user_ptr_id : the id of the user.
• reputation : the reputation earned when other users upvote their posts
• the number of "gold", "silver" and "bronze" badges earned. The actual database has more
fields in this table, such as user name, nickname, bio (if set), etc., but we have removed this
information here.
(a) (1 point) For each student, find the hour during which the student has posted the most
posts. Output from reducers should be
author_id \t hours

To find the hour posted, please use the added_at (date added) field and NOT the last_activity_at
field.
(b) (2 points) We are interested to see if there is a correlation between the length of a post
and the length of answers.
Write a MapReduce program that processes the forum_nodes.tsv data and outputs the length
of the post and the average answer (just answer, not comment) length for each post.
(c) (2 points) We are interested in seeing the top tags used in posts. Write a MapReduce
program that outputs the Top 10 tags, ordered by the number of questions they appear
in. Please note that you should only look at tags appearing in questions themselves (i.e.
nodes with node_type "question"), not in answers or comments.
(d) (1 point) Count the number of posts by reputation.
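For a "Top 10" output such as the one in part (c), one possible approach is to select the N largest counts once the per-key totals are known. The sketch below is only an illustration: the function name top_n, the tie-breaking rule (alphabetical by tag) and the sample counts are assumptions, not requirements of the assignment.

```python
import heapq


def top_n(counts, n=10):
    """Return the n (key, count) pairs with the highest counts.

    Ties are broken alphabetically by key (an assumption made here so
    that the output order is deterministic).
    """
    # Taking the smallest of (-count, key) gives the largest counts first,
    # then ascending key order among equal counts.
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up tag counts for illustration only.
counts = {"python": 5, "hadoop": 3, "java": 5, "sql": 1}
top3 = top_n(counts, 3)
print(top3)
```

This only works when all counts for a key reach the same place, so with several reducers a final single-reducer stage (or a chained second job) is typically needed.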
