Software Engineering
Hadoop and Spark
David A. Wheeler
SWE 622
George Mason University
Outline
Apache Hadoop
Hadoop Distributed File System (HDFS)
MapReduce
Apache Spark
Source: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0004.html
Source: https://spark.apache.org/docs/latest/quick-start.html
Source: https://spark.apache.org/docs/latest/programming-guide.html
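The MapReduce model listed in the outline can be illustrated without any Hadoop or Spark installation; the following is a minimal word-count sketch in plain Python (the three functions mirror MapReduce's map, shuffle, and reduce phases; all names here are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key (word)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts maps each word to its frequency, e.g. "to" -> 2
```

In a real MapReduce job the map and reduce phases run in parallel across the cluster, and the shuffle moves data between machines; the dataflow, however, is exactly this.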
from math import exp
import numpy

# Load the points from text, parse each line, and cache the RDD in memory
# so it is not re-read from disk on every iteration
points = spark.textFile(...).map(parsePoint).cache()
w = numpy.random.ranf(size=D)  # current separating plane
for i in range(ITERATIONS):
    # Compute the logistic-loss gradient contribution of each point, then sum
    gradient = points.map(
        lambda p: (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
print("Final separating plane: %s" % w)
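The snippet above calls parsePoint, which the Spark programming guide leaves undefined. A minimal sketch, assuming each input line holds a numeric label followed by the D feature values, space-separated (the Point record type and the input format are assumptions, chosen to match the p.x / p.y accesses above):

```python
from collections import namedtuple
import numpy

# Hypothetical record type matching the p.x / p.y field accesses above
Point = namedtuple("Point", ["x", "y"])

def parsePoint(line):
    """Parse one text line into a labeled point.

    Assumed format: "<label> <f1> <f2> ... <fD>", e.g. "1 0.5 -0.2".
    """
    values = [float(v) for v in line.split()]
    return Point(x=numpy.array(values[1:]), y=values[0])
```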
[Nova2017] https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
https://tech.channable.com/posts/2019-10-04-why-we-decided-to-go-for-the-big-rewrite.html
SWE 622 – Distributed Software Systems © Wheeler Hadoop – Spark – 32
Closing caveat: choose the right tool for the job (3 of 3)
Spark really is awesome for certain kinds of problems
However, beware of the siren call
Do not choose a technology because it's the "newest/latest thing" (fad-based engineering) or because "this will scale far beyond my needs" (overengineering)
Know what tools exist, & choose the best one for the circumstance
In some cases that could be Spark