
Dan Perlman

CS 283
L3

Introduction
============================================================================
This assignment has two parts. In Part 1, several subprograms parse
DelayedFlights.csv to compute delay statistics.
Running 'NODES=# make run' runs the entire program (it takes about 2-3
minutes), where NODES is the desired number of nodes: 2, 4, 8, or 16 are recommended.

In subprogram 1, Part A, I calculate the 10 worst airports, i.e., those with the
longest total delays due to security and weather.
It is run sequentially first, then with a map-reduce algorithm (partA.py, then
partA-MR.py).
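
As a rough illustration, the sequential version boils down to the sketch below.
This is not the exact partA.py; the column names 'Origin', 'SecurityDelay', and
'WeatherDelay' are assumptions about the DelayedFlights.csv layout.

import csv
from collections import defaultdict

delays = defaultdict(float)
with open('DelayedFlights.csv', newline='') as f:
    for row in csv.DictReader(f):
        try:
            sec = float(row['SecurityDelay'])
            wea = float(row['WeatherDelay'])
        except (KeyError, ValueError):
            continue  # skip rows with missing or non-numeric delay fields (e.g., 'NA')
        delays[row['Origin']] += sec + wea

# 10 airports with the largest total security + weather delay
worst = sorted(delays.items(), key=lambda kv: kv[1], reverse=True)[:10]
for airport, total in worst:
    print(airport, total)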

In subprogram 2, Part B, I calculate the 10 worst airports, i.e., those with the
longest total carrier delays.
It is run sequentially first, then with a map-reduce algorithm (partB.py, then
partB-MR.py).

In subprogram 3, Part C, I calculate the average total late aircraft delay for
each airport.
It is run sequentially first, then with a map-reduce algorithm (partC.py, then
partC-MR.py).

In subprogram 4, Part D, I calculate the average total late aircraft delay for
each carrier. If no late aircraft delay data was provided for a flight, I do not
include it in the average (otherwise many 0s would skew the result).
It is run sequentially first, then with a map-reduce algorithm (partD.py, then
partD-MR.py).
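
As a rough sketch (not the actual partD-MR.py, which may be structured
differently), the map-reduce versions split the rows across NODES worker
processes, have each mapper emit partial sums, and merge them in a reducer.
The column names 'UniqueCarrier' and 'LateAircraftDelay' are assumptions about
the CSV layout; Parts A-C follow the same pattern with different keys and delay
columns.

import csv
import os
from collections import defaultdict
from multiprocessing import Pool

NODES = int(os.environ.get('NODES', 4))  # mirrors the NODES=# make variable

def mapper(rows):
    # Map: emit partial (carrier -> [delay sum, flight count]) for one chunk,
    # skipping flights with no late aircraft delay data
    partial = defaultdict(lambda: [0.0, 0])
    for row in rows:
        val = row.get('LateAircraftDelay')
        if val not in (None, '', 'NA'):
            partial[row['UniqueCarrier']][0] += float(val)
            partial[row['UniqueCarrier']][1] += 1
    return dict(partial)

def reducer(partials):
    # Reduce: merge the partial sums and compute the average per carrier
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for carrier, (s, c) in partial.items():
            totals[carrier][0] += s
            totals[carrier][1] += c
    return {carrier: s / c for carrier, (s, c) in totals.items() if c}

if __name__ == '__main__':
    with open('DelayedFlights.csv', newline='') as f:
        rows = list(csv.DictReader(f))
    chunk = max(1, (len(rows) + NODES - 1) // NODES)  # split the rows into NODES chunks
    chunks = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with Pool(NODES) as pool:                         # one worker process per "node"
        averages = reducer(pool.map(mapper, chunks))
    for carrier, avg in sorted(averages.items()):
        print(carrier, avg)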
============================================================================
The Makefile

I split this problem into several subprograms, each with a Makefile target that
runs its Python file. 'NODES=# make run' (NODES = 2, 4, 8, or 16) runs everything.
I am using Python 3 to run the programs, which produced the following results:
=============================================================================
Part 1 Map Reduce Time Results
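
Note: the speedup percentages below are the ratio of the two runtimes expressed
as a percentage. A quick check of Part A's figures (a throwaway snippet, not part
of the submitted code):

# speedup(%) = old runtime / new runtime * 100
print(round(44.50 / 24.12 * 100))  # 184 -> "184% speedup from N=2 to N=4"
print(round(31.68 / 9.07 * 100))   # 349 -> "349% speedup from sequential to N=16"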

A)
Sequential = 31.68s
N=2 Nodes  = 44.50s
N=4 Nodes  = 24.12s   (184% speedup from N=2 to N=4)
N=8 Nodes  = 15.17s   (163% speedup from N=4 to N=8)
N=16 Nodes = 9.07s    (159% speedup from N=8 to N=16)
349% speedup from sequential to N=16!

B)
Sequential = 29.64s
N=2 Nodes  = 26.76s
N=4 Nodes  = 15.98s   (167% speedup from N=2 to N=4)
N=8 Nodes  = 10.77s   (148% speedup from N=4 to N=8)
N=16 Nodes = 8.01s    (134% speedup from N=8 to N=16)
370% speedup from sequential to N=16!

C)
Sequential = 29.27s
N=2 Nodes  = 25.88s
N=4 Nodes  = 15.05s   (172% speedup from N=2 to N=4)
N=8 Nodes  = 10.13s   (149% speedup from N=4 to N=8)
N=16 Nodes = 7.55s    (134% speedup from N=8 to N=16)
387% speedup from sequential to N=16!

D)
Sequential = 29.24s
N=2 Nodes  = 22.61s
N=4 Nodes  = 14.58s   (155% speedup from N=2 to N=4)
N=8 Nodes  = 9.62s    (151% speedup from N=4 to N=8)
N=16 Nodes = 7.60s    (126% speedup from N=8 to N=16)
385% speedup from sequential to N=16!
=====================================================================
Part 2

This program calculates Pi using PySpark with a variable number of nodes. I ran
it on Windows, so I had to launch it from a command prompt set up for PySpark
and use findspark at the top of the program to point to my Spark installation
(code reproduced below).
Code:

import findspark
# Point findspark at the local Spark installation before importing pyspark
findspark.init('C:\\Users\\Daniel.Perlman\\Downloads\\spark-2.2.0-bin-hadoop2.7\\spark-2.2.0-bin-hadoop2.7')
import pyspark
import random
import time

start_time = time.time()
sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000000
num_nodes = 4  # Nodes - 2, 4, 8, or 16 (passed to parallelize as the partition count)

def inside(p):
    # Sample a random point in the unit square; True if it lands inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Monte Carlo estimate of Pi: 4 times the fraction of samples inside the quarter circle
count = sc.parallelize(range(0, num_samples), num_nodes).filter(inside).count()
pi = 4.0 * count / num_samples
print(pi)
sc.stop()
print("--- Completed in", (time.time() - start_time), "seconds ---")

Results (4 trials, times in seconds, with the average in the last column)

Nodes   Trial 1        Trial 2        Trial 3        Trial 4        Average
2       5.53100013733  5.41100001335  5.33500003815  5.33599996567  5.403250038625
4       5.6289999485   5.96700000763  6.20799994469  5.58500003815  5.8472499847425
8       6.20799994469  6.43600010872  6.09100008011  6.28600001335  6.2552500367175
16      7.20700001717  7.39199995995  7.16400003433  7.12999987602  7.2232499718675

See Part2Graphs for results in graph form

Sadly, my PySpark results did not show the expected linear speedup. I wonder
whether this is because I ran these on a powerful computer under Windows rather
than on tux. The increase in runtime with more nodes is probably due to the
overhead of context switching from process to process; I could see this in the
command prompt as it ran, since each process took a while to complete.

Update - Professor Mongan said to use Python 3. I am using Python 3 for PySpark,
yet I am still getting these disappointing results. My guess is that PySpark is
somewhat clunky on Windows.
