

Question


You have two datasets: Trips.txt, which records trip information, and Taxis.txt, which records taxi information. Both files are stored on HDFS. Complete the following MapReduce programming task in Python. Note that using any other language, such as Java, will directly lead to a 0 mark on the assignment. You are also not allowed to use any Python MapReduce library such as mrjob.
A sample of Taxis.txt:
Taxi#, company, model, year
470,0,80,2018
332,11,88,2013
254,10,62,2018
460,4,90,2022
113,6,23,2015
275,16,13,2015
318,14,46,2014
A sample of Trips.txt:

Trip#, Taxi#, fare, distance, pickup_x, pickup_y, dropoff_x, dropoff_y
0,354,232.64,127.23,46.069,85.566,10.355,4.83
1,173,283.7,150.74,5.02,31.765,88.386,27.265
2,8,83.84,43.17,63.269,33.156,92.953,60.647
3,340,259.2,136.3,14.729,13.356,14.304,90.273
4,32,270.07,152.65,27.965,13.37,77.925,62.82
5,64,378.31,202.95,1.145,94.519,98.296,35.469
6,480,235.98,121.23,66.982,66.912,5.02,31.765
7,410,293.16,162.29,2.841,95.636,91.029,16.232
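As a sanity check on the samples above, the expected per-taxi trip count and average distance can be computed directly in plain Python. A minimal sketch using three of the sample rows (in this sample every taxi appears once, so each average equals that taxi's single trip distance):

```python
# Sample rows from Trips.txt: Trip#, Taxi#, fare, distance, ...
sample = [
    "0,354,232.64,127.23,46.069,85.566,10.355,4.83",
    "1,173,283.7,150.74,5.02,31.765,88.386,27.265",
    "2,8,83.84,43.17,63.269,33.156,92.953,60.647",
]

totals = {}  # taxi_id -> [total_distance, trip_count]
for row in sample:
    fields = row.split(",")
    taxi_id, distance = fields[1], float(fields[3])
    entry = totals.setdefault(taxi_id, [0.0, 0])
    entry[0] += distance
    entry[1] += 1

for taxi_id, (dist, count) in totals.items():
    print(taxi_id, count, dist / count)
```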

For each taxi, count the number of trips and the average distance per trip by developing a MapReduce program in Python. The program must implement in-mapper combining, with state preserved across input lines. The code must work with 3 reducers. You need to submit a shell script named task1-run.sh; running the shell script performs the task, with the shell script and code files in the same folder (no subfolders).
Expert Answer

Step 1/2
To solve this problem with MapReduce in Python (via Hadoop Streaming), follow the steps below. The mapper and reducer code is given first, followed by the shell script that runs the task.

Mapper Code (mapper.py)

The mapper reads lines from Trips.txt and uses in-mapper combining: it accumulates a running total distance and trip count per Taxi# in a dictionary (state preserved across input lines) and emits one record per taxi only after all input has been consumed.
#!/usr/bin/env python3

import sys

# In-mapper combining state: taxi_id -> [total_distance, trip_count],
# preserved across all input lines seen by this mapper
taxi_data = {}

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    # Split the line into fields
    fields = line.split(',')

    # Skip the header line
    if fields[0] == "Trip#":
        continue

    taxi_id = fields[1]
    distance = float(fields[3])

    # In-mapper combining
    if taxi_id in taxi_data:
        taxi_data[taxi_id][0] += distance   # update total distance
        taxi_data[taxi_id][1] += 1          # update trip count
    else:
        taxi_data[taxi_id] = [distance, 1]  # initialize distance and count

# Emit one record per taxi: Taxi# as key, "total_distance,count" as value
for taxi_id, (total_distance, count) in taxi_data.items():
    print(f"{taxi_id}\t{total_distance},{count}")
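The combining step can be exercised outside Hadoop. A small sketch that replicates the mapper's logic on a list of lines (the second trip for taxi 354 is hypothetical, added only so the aggregation is visible):

```python
def map_with_combining(lines):
    """Replicates the mapper's in-mapper combining on an iterable of lines."""
    taxi_data = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(',')
        if fields[0] == "Trip#":  # skip the header line
            continue
        taxi_id, distance = fields[1], float(fields[3])
        if taxi_id in taxi_data:
            taxi_data[taxi_id][0] += distance
            taxi_data[taxi_id][1] += 1
        else:
            taxi_data[taxi_id] = [distance, 1]
    return taxi_data

lines = [
    "Trip#, Taxi#, fare, distance, pickup_x, pickup_y, dropoff_x, dropoff_y",
    "0,354,232.64,127.23,46.069,85.566,10.355,4.83",
    "8,354,100.00,72.77,1,2,3,4",  # hypothetical second trip for taxi 354
    "1,173,283.7,150.74,5.02,31.765,88.386,27.265",
]
# Taxi 354 is emitted once, with its two distances summed and a count of 2
print(map_with_combining(lines))
```

Note that the header line and blank lines are dropped, and each taxi appears exactly once in the result, which is the whole point of in-mapper combining: fewer records cross the shuffle.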

Reducer Code (reducer.py)

The reducer reads the sorted output of the mappers, sums the partial distances and counts for each taxi, and prints the trip count and the average distance per trip.
#!/usr/bin/env python3

import sys

# State for the taxi currently being aggregated; streaming delivers each
# reducer's input sorted by key, so all records for a taxi are contiguous
current_taxi = None
total_distance = 0.0
total_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    taxi_id, value = line.split('\t')
    distance_str, count_str = value.split(',')
    distance = float(distance_str)
    count = int(count_str)  # count is an integer, not a float

    if current_taxi == taxi_id:
        total_distance += distance
        total_count += count
    else:
        if current_taxi is not None:
            avg_distance = total_distance / total_count
            print(f"{current_taxi}\t{total_count}\t{avg_distance}")

        current_taxi = taxi_id
        total_distance = distance
        total_count = count

# Output the last taxi
if current_taxi is not None:
    avg_distance = total_distance / total_count
    print(f"{current_taxi}\t{total_count}\t{avg_distance}")
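The reason this works with 3 reducers is that Hadoop Streaming hash-partitions on the key, so every record for a given Taxi# reaches the same reducer, and each reducer's input is sorted by key, which the current_taxi logic relies on. A rough local simulation of that shuffle (the partition function below is a stand-in approximating Hadoop's HashPartitioner, not the real implementation, and the mapper records are illustrative):

```python
import zlib

def partition(key, num_reducers=3):
    # Stand-in for Hadoop's HashPartitioner (an approximation)
    return zlib.crc32(key.encode()) % num_reducers

# Combined records as emitted by two hypothetical mapper tasks
mapper_output = [
    ("354", (127.23, 1)),
    ("173", (150.74, 1)),
    ("354", (72.77, 1)),   # same taxi seen by a second mapper
    ("8",   (43.17, 1)),
]

# Shuffle: route each record to its reducer, then sort by key per reducer
reducers = {r: [] for r in range(3)}
for key, value in mapper_output:
    reducers[partition(key)].append((key, value))

results = {}
for records in reducers.values():
    records.sort()  # streaming sorts each reducer's input by key
    for key, (dist, count) in records:
        total = results.setdefault(key, [0.0, 0])
        total[0] += dist
        total[1] += count

for taxi_id, (dist, count) in sorted(results.items()):
    print(taxi_id, count, dist / count)
```

Both records for taxi 354 land in the same partition, so its totals are complete even though they came from different mappers.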

Step 2/2

Shell Script (task1-run.sh)

This shell script assumes that Trips.txt and Taxis.txt are stored in HDFS and that the
mapper and reducer Python files are in the same directory.
#!/bin/bash

# Remove the output folder if it already exists
hadoop fs -rm -r -f output

# Run the streaming MapReduce job with 3 reducers
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input /path/to/Trips.txt -output output \
    -numReduceTasks 3

# Fetch the output from HDFS
hadoop fs -get output/* .

Make sure to give execute permissions to your Python and shell script files:

chmod +x mapper.py reducer.py task1-run.sh

Explanation:

Now, you can run the shell script task1-run.sh to execute the MapReduce job. Make sure to
replace /path/to/Trips.txt with the actual HDFS path to your Trips.txt file.

