

Question


You have two datasets: Trips.txt, which records trip information, and Taxis.txt, which records taxi information. Both files are stored on HDFS. Complete the following MapReduce programming task in Python. Note that using any other language, such as Java, will directly lead to a 0 mark on the assignment. You are also not allowed to use any Python MapReduce library such as mrjob.
A sample of Taxis.txt:
Taxi#, company, model, year
470,0,80,2018
332,11,88,2013
254,10,62,2018
460,4,90,2022
113,6,23,2015
275,16,13,2015
318,14,46,2014
A sample of Trips.txt:

Trip#, Taxi#, fare, distance, pickup_x, pickup_y, dropoff_x, dropoff_y
0,354,232.64,127.23,46.069,85.566,10.355,4.83
1,173,283.7,150.74,5.02,31.765,88.386,27.265
2,8,83.84,43.17,63.269,33.156,92.953,60.647
3,340,259.2,136.3,14.729,13.356,14.304,90.273
4,32,270.07,152.65,27.965,13.37,77.925,62.82
5,64,378.31,202.95,1.145,94.519,98.296,35.469
6,480,235.98,121.23,66.982,66.912,5.02,31.765
7,410,293.16,162.29,2.841,95.636,91.029,16.232
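As a sanity check on the samples above, the expected per-taxi trip count and average distance can be computed directly in plain Python. A minimal sketch using three of the sample rows (in this sample every taxi appears once, so each average equals that taxi's single trip distance):

```python
# Sample rows from Trips.txt: Trip#, Taxi#, fare, distance, ...
sample = [
    "0,354,232.64,127.23,46.069,85.566,10.355,4.83",
    "1,173,283.7,150.74,5.02,31.765,88.386,27.265",
    "2,8,83.84,43.17,63.269,33.156,92.953,60.647",
]

totals = {}  # taxi_id -> [total_distance, trip_count]
for row in sample:
    fields = row.split(",")
    taxi_id, distance = fields[1], float(fields[3])
    entry = totals.setdefault(taxi_id, [0.0, 0])
    entry[0] += distance
    entry[1] += 1

for taxi_id, (dist, count) in totals.items():
    print(taxi_id, count, dist / count)
```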

For each taxi, count the number of trips and the average distance per trip by developing a MapReduce program in Python. The program must implement in-mapper combining, with state preserved across input lines. The code must work with 3 reducers. You need to submit a shell script named task1-run.sh; running the shell script performs the task, with the shell script and code files in the same folder (no subfolders).
Expert Answer

Step 1/2
To solve this problem with MapReduce in Python (via Hadoop Streaming), follow the steps below. The mapper and reducer code is given first, followed by the shell script that runs the task.

Mapper Code (mapper.py)

The mapper reads lines from Trips.txt and uses in-mapper combining: it accumulates a running total distance and trip count per Taxi# in a dictionary (state preserved across input lines) and emits one record per taxi only after all input has been consumed.
#!/usr/bin/env python3

import sys

# In-mapper combining state: taxi_id -> [total_distance, trip_count],
# preserved across all input lines seen by this mapper
taxi_data = {}

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    # Split the line into fields
    fields = line.split(',')

    # Skip the header line
    if fields[0] == "Trip#":
        continue

    taxi_id = fields[1]
    distance = float(fields[3])

    # In-mapper combining
    if taxi_id in taxi_data:
        taxi_data[taxi_id][0] += distance   # update total distance
        taxi_data[taxi_id][1] += 1          # update trip count
    else:
        taxi_data[taxi_id] = [distance, 1]  # initialize distance and count

# Emit one record per taxi: Taxi# as key, "total_distance,count" as value
for taxi_id, (total_distance, count) in taxi_data.items():
    print(f"{taxi_id}\t{total_distance},{count}")
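The combining step can be exercised outside Hadoop. A small sketch that replicates the mapper's logic on a list of lines (the second trip for taxi 354 is hypothetical, added only so the aggregation is visible):

```python
def map_with_combining(lines):
    """Replicates the mapper's in-mapper combining on an iterable of lines."""
    taxi_data = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(',')
        if fields[0] == "Trip#":  # skip the header line
            continue
        taxi_id, distance = fields[1], float(fields[3])
        if taxi_id in taxi_data:
            taxi_data[taxi_id][0] += distance
            taxi_data[taxi_id][1] += 1
        else:
            taxi_data[taxi_id] = [distance, 1]
    return taxi_data

lines = [
    "Trip#, Taxi#, fare, distance, pickup_x, pickup_y, dropoff_x, dropoff_y",
    "0,354,232.64,127.23,46.069,85.566,10.355,4.83",
    "8,354,100.00,72.77,1,2,3,4",  # hypothetical second trip for taxi 354
    "1,173,283.7,150.74,5.02,31.765,88.386,27.265",
]
# Taxi 354 is emitted once, with its two distances summed and a count of 2
print(map_with_combining(lines))
```

Note that the header line and blank lines are dropped, and each taxi appears exactly once in the result, which is the whole point of in-mapper combining: fewer records cross the shuffle.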

Reducer Code (reducer.py)

The reducer reads the sorted output of the mappers, sums the partial distances and counts for each taxi, and prints the trip count and the average distance per trip.
#!/usr/bin/env python3

import sys

# State for the taxi currently being aggregated; streaming delivers each
# reducer's input sorted by key, so all records for a taxi are contiguous
current_taxi = None
total_distance = 0.0
total_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    taxi_id, value = line.split('\t')
    distance_str, count_str = value.split(',')
    distance = float(distance_str)
    count = int(count_str)  # count is an integer, not a float

    if current_taxi == taxi_id:
        total_distance += distance
        total_count += count
    else:
        if current_taxi is not None:
            avg_distance = total_distance / total_count
            print(f"{current_taxi}\t{total_count}\t{avg_distance}")

        current_taxi = taxi_id
        total_distance = distance
        total_count = count

# Output the last taxi
if current_taxi is not None:
    avg_distance = total_distance / total_count
    print(f"{current_taxi}\t{total_count}\t{avg_distance}")
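The reason this works with 3 reducers is that Hadoop Streaming hash-partitions on the key, so every record for a given Taxi# reaches the same reducer, and each reducer's input is sorted by key, which the current_taxi logic relies on. A rough local simulation of that shuffle (the partition function below is a stand-in approximating Hadoop's HashPartitioner, not the real implementation, and the mapper records are illustrative):

```python
import zlib

def partition(key, num_reducers=3):
    # Stand-in for Hadoop's HashPartitioner (an approximation)
    return zlib.crc32(key.encode()) % num_reducers

# Combined records as emitted by two hypothetical mapper tasks
mapper_output = [
    ("354", (127.23, 1)),
    ("173", (150.74, 1)),
    ("354", (72.77, 1)),   # same taxi seen by a second mapper
    ("8",   (43.17, 1)),
]

# Shuffle: route each record to its reducer, then sort by key per reducer
reducers = {r: [] for r in range(3)}
for key, value in mapper_output:
    reducers[partition(key)].append((key, value))

results = {}
for records in reducers.values():
    records.sort()  # streaming sorts each reducer's input by key
    for key, (dist, count) in records:
        total = results.setdefault(key, [0.0, 0])
        total[0] += dist
        total[1] += count

for taxi_id, (dist, count) in sorted(results.items()):
    print(taxi_id, count, dist / count)
```

Both records for taxi 354 land in the same partition, so its totals are complete even though they came from different mappers.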

Step 2/2

Shell Script (task1-run.sh)

This shell script assumes that Trips.txt and Taxis.txt are stored in HDFS and that the
mapper and reducer Python files are in the same directory.
#!/bin/bash

# Remove the output folder if it already exists
hadoop fs -rm -r -f output

# Run the streaming MapReduce job with 3 reducers
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input /path/to/Trips.txt -output output \
    -numReduceTasks 3

# Fetch the output from HDFS
hadoop fs -get output/* .

Make sure to give execute permissions to your Python and shell script files:

chmod +x mapper.py reducer.py task1-run.sh

Explanation:

Now, you can run the shell script task1-run.sh to execute the MapReduce job. Make sure to
replace /path/to/Trips.txt with the actual HDFS path to your Trips.txt file.

