
Dan Perlman

CS 283
L3

Introduction
============================================================================
This assignment has two parts. In Part 1, several subprograms parse
DelayedFlights.csv to compute delay statistics.
Running 'NODES=# make run' runs the entire program (it takes about 2-3
minutes), where NODES is the desired number of nodes: 2, 4, 8, or 16 are recommended.

In subprogram 1, Part A, I calculate the 10 worst airports, i.e., those with the
longest total delays due to security and weather.
It is run sequentially first, then with a map-reduce algorithm (partA.py, then
partA-MR.py).
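
As a rough illustration, the sequential version boils down to the sketch below.
This is not the exact partA.py; the column names 'Origin', 'SecurityDelay', and
'WeatherDelay' are assumptions about the DelayedFlights.csv layout.

import csv
from collections import defaultdict

delays = defaultdict(float)
with open('DelayedFlights.csv', newline='') as f:
    for row in csv.DictReader(f):
        try:
            sec = float(row['SecurityDelay'])
            wea = float(row['WeatherDelay'])
        except (KeyError, ValueError):
            continue  # skip rows with missing or non-numeric delay fields (e.g., 'NA')
        delays[row['Origin']] += sec + wea

# 10 airports with the largest total security + weather delay
worst = sorted(delays.items(), key=lambda kv: kv[1], reverse=True)[:10]
for airport, total in worst:
    print(airport, total)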

In subprogram 2, Part B, I calculate the 10 worst airports, i.e., those with the
longest total carrier delays.
It is run sequentially first, then with a map-reduce algorithm (partB.py, then
partB-MR.py).

In subprogram 3, Part C, I calculate the average total late aircraft delay for
each airport.
It is run sequentially first, then with a map-reduce algorithm (partC.py, then
partC-MR.py).

In subprogram 4, Part D, I calculate the average total late aircraft delay for
each carrier. If no late aircraft delay data was provided for a flight, I do not
include it in the average (otherwise many 0s would skew the result).
It is run sequentially first, then with a map-reduce algorithm (partD.py, then
partD-MR.py).
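
As a rough sketch (not the actual partD-MR.py, which may be structured
differently), the map-reduce versions split the rows across NODES worker
processes, have each mapper emit partial sums, and merge them in a reducer.
The column names 'UniqueCarrier' and 'LateAircraftDelay' are assumptions about
the CSV layout; Parts A-C follow the same pattern with different keys and delay
columns.

import csv
import os
from collections import defaultdict
from multiprocessing import Pool

NODES = int(os.environ.get('NODES', 4))  # mirrors the NODES=# make variable

def mapper(rows):
    # Map: emit partial (carrier -> [delay sum, flight count]) for one chunk,
    # skipping flights with no late aircraft delay data
    partial = defaultdict(lambda: [0.0, 0])
    for row in rows:
        val = row.get('LateAircraftDelay')
        if val not in (None, '', 'NA'):
            partial[row['UniqueCarrier']][0] += float(val)
            partial[row['UniqueCarrier']][1] += 1
    return dict(partial)

def reducer(partials):
    # Reduce: merge the partial sums and compute the average per carrier
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for carrier, (s, c) in partial.items():
            totals[carrier][0] += s
            totals[carrier][1] += c
    return {carrier: s / c for carrier, (s, c) in totals.items() if c}

if __name__ == '__main__':
    with open('DelayedFlights.csv', newline='') as f:
        rows = list(csv.DictReader(f))
    chunk = max(1, (len(rows) + NODES - 1) // NODES)  # split the rows into NODES chunks
    chunks = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with Pool(NODES) as pool:                         # one worker process per "node"
        averages = reducer(pool.map(mapper, chunks))
    for carrier, avg in sorted(averages.items()):
        print(carrier, avg)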
============================================================================
The Makefile

I split this problem into several subprograms, each with a Makefile target that
runs its Python file. 'NODES=# make run' (NODES = 2, 4, 8, or 16) runs everything.
I am using Python 3 to run the programs, which produced the following results:
=============================================================================
Part 1 Map Reduce Time Results
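
Note: the speedup percentages below are the ratio of the two runtimes expressed
as a percentage. A quick check of Part A's figures (a throwaway snippet, not part
of the submitted code):

# speedup(%) = old runtime / new runtime * 100
print(round(44.50 / 24.12 * 100))  # 184 -> "184% speedup from N=2 to N=4"
print(round(31.68 / 9.07 * 100))   # 349 -> "349% speedup from sequential to N=16"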

A)
Sequential = 31.68s
N=2 Nodes  = 44.50s
N=4 Nodes  = 24.12s   (184% speedup from N=2 to N=4)
N=8 Nodes  = 15.17s   (163% speedup from N=4 to N=8)
N=16 Nodes = 9.07s    (159% speedup from N=8 to N=16)
349% speedup from sequential to N=16!

B)
Sequential = 29.64s
N=2 Nodes  = 26.76s
N=4 Nodes  = 15.98s   (167% speedup from N=2 to N=4)
N=8 Nodes  = 10.77s   (148% speedup from N=4 to N=8)
N=16 Nodes = 8.01s    (134% speedup from N=8 to N=16)
370% speedup from sequential to N=16!

C)
Sequential = 29.27s
N=2 Nodes  = 25.88s
N=4 Nodes  = 15.05s   (172% speedup from N=2 to N=4)
N=8 Nodes  = 10.13s   (149% speedup from N=4 to N=8)
N=16 Nodes = 7.55s    (134% speedup from N=8 to N=16)
387% speedup from sequential to N=16!

D)
Sequential = 29.24s
N=2 Nodes  = 22.61s
N=4 Nodes  = 14.58s   (155% speedup from N=2 to N=4)
N=8 Nodes  = 9.62s    (151% speedup from N=4 to N=8)
N=16 Nodes = 7.60s    (126% speedup from N=8 to N=16)
385% speedup from sequential to N=16!
=====================================================================
Part 2

This program calculates Pi using PySpark with a variable number of nodes. I ran
it on Windows, so I had to launch it from a command prompt set up for PySpark
and use findspark at the top of the program to point to my Spark installation
(code reproduced below).
Code:

import findspark
# Point findspark at the local Spark installation before importing pyspark
findspark.init('C:\\Users\\Daniel.Perlman\\Downloads\\spark-2.2.0-bin-hadoop2.7\\spark-2.2.0-bin-hadoop2.7')
import pyspark
import random
import time

start_time = time.time()
sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000000
num_nodes = 4  # Nodes - 2, 4, 8, or 16 (passed to parallelize as the partition count)

def inside(p):
    # Sample a random point in the unit square; True if it lands inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Monte Carlo estimate of Pi: 4 times the fraction of samples inside the quarter circle
count = sc.parallelize(range(0, num_samples), num_nodes).filter(inside).count()
pi = 4.0 * count / num_samples
print(pi)
sc.stop()
print("--- Completed in", (time.time() - start_time), "seconds ---")

Results (4 trials, times in seconds, with the average in the last column)

Nodes   Trial 1        Trial 2        Trial 3        Trial 4        Average
2       5.53100013733  5.41100001335  5.33500003815  5.33599996567  5.403250038625
4       5.6289999485   5.96700000763  6.20799994469  5.58500003815  5.8472499847425
8       6.20799994469  6.43600010872  6.09100008011  6.28600001335  6.2552500367175
16      7.20700001717  7.39199995995  7.16400003433  7.12999987602  7.2232499718675

See Part2Graphs for results in graph form

Sadly, my PySpark results did not show the expected linear speedup. I wonder
whether this is because I ran these on a powerful computer under Windows rather
than on tux. The increase in runtime with more nodes is probably due to the
overhead of context switching from process to process; I could see this in the
command prompt as it ran, since each process took a while to complete.

Update - Professor Mongan said to use Python 3. I am using Python 3 for PySpark,
yet I am still getting these disappointing results. My guess is that PySpark is
somewhat clunky on Windows.
