
Large Scale Distributed Data Science using Apache Spark

James G. Shanahan
NativeX and University of California, Berkeley
875 Howard Street
San Francisco, CA, USA
James.Shanahan@NativeX.com

Liang Dai
NativeX
875 Howard Street
San Francisco, CA, USA
Liang.Dai@NativeX.com

ABSTRACT
Apache Spark is an open-source cluster computing framework for big data processing. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program in, thanks to its rich APIs in Python, Java, and Scala (and shortly R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing.

This tutorial will provide an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices.

Categories and Subject Descriptors
D.3.3 [Concurrent Programming]: Distributed programming. G.3 [Probability and Statistics]: Statistical computing.

General Terms
Algorithms, Measurement, Performance, Design, Reliability, Experimentation, Security, Languages.

Keywords
Distributed Systems, Hadoop, HDFS, MapReduce, Spark, Large Scale Machine Learning, Data Science.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author(s). Copyright is held by the owner/author(s).
KDD '15, August 10-13, 2015, Sydney, NSW, Australia.
ACM 978-1-4503-3664-2/15/08.
DOI: http://dx.doi.org/10.1145/2783258.2789993

1. INTRODUCTION
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Concretely, imagine an average-sized laptop (4-8 GB of memory, dual-core, with one terabyte (TB) of disk space) that gets overwhelmed by a machine learning task involving a 3-4 GB dataset; imagine the same laptop taking three hours just to read one TB of data. Clearly, this type of single-node sequential computing is inadequate for processing the terabytes and petabytes of data that are commonplace in modern society, where both machines (the Internet of Things (IoT)) and humans generate petabytes of data every day (Figure 1).

Until recently, "big data" was very much the purview of database management and summary statistics systems such as Hadoop (HDFS and MapReduce), and it was largely underleveraged by machine learning. Though useful, these systems suffered from limited utility: they were challenging to code against, and they lacked purpose-built tools and libraries. This tutorial builds on and goes beyond this collect-and-analyze phase of big data by focusing on how machine learning algorithms can be rewritten, and in some cases extended, to scale to petabytes of data, both structured and unstructured, and to generate sophisticated models that can be used for real-time predictions. Predictive modeling at this scale can lead to large boosts in performance (typically on the order of 10-20%) over small-scale models running on standalone computers, which require one to significantly down-sample, and necessarily simplify, big data.

Figure 1. Core concepts of data science at scale

Concretely, this tutorial focuses on how the MapReduce design pattern from parallel computing can be extended and more faithfully leveraged to tackle the somewhat "embarrassingly parallel" task of machine learning (many machine learning algorithms fit this mold).

In this tutorial, this is accomplished via the Apache Spark project and its many related subprojects. As noted in the abstract, Spark has overtaken Hadoop MapReduce as the next-generation big data processing engine: it preserves MapReduce's linear scalability and fault tolerance while being far faster, far easier to program (with rich APIs in Python, Java, Scala, and R, and the distributed data frame abstraction), and far more versatile, supporting interactive queries, streaming, machine learning, and graph processing.
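To make the "embarrassingly parallel" claim concrete, here is a minimal sketch (plain Python, no cluster required; the function names are illustrative, not Spark APIs) of batch gradient descent for least-squares regression written in map-reduce form: a map step computes each data point's gradient contribution independently, and an associative reduce step sums them — exactly the shape that a framework like Spark can distribute across workers as `rdd.map(...).reduce(...)`.

```python
from functools import reduce

def point_gradient(w, point):
    """Map step: gradient of the squared error for one (x, y) pair w.r.t. w."""
    x, y = point
    return 2 * (w * x - y) * x

def sum_gradients(g1, g2):
    """Reduce step: combine partial gradients (associative, so it parallelizes)."""
    return g1 + g2

def gradient_descent(data, w=0.0, lr=0.01, iterations=100):
    """Each iteration is one map over the data followed by one reduce."""
    n = len(data)
    for _ in range(iterations):
        total = reduce(sum_gradients, (point_gradient(w, p) for p in data))
        w -= lr * total / n
    return w

# Fit y = 3x from noise-free samples; every map call is independent of the
# others, which is what makes the inner loop trivially distributable.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = gradient_descent(data)  # w converges to ~3.0
```

The key design point is that the reduce function is associative and commutative, so partial sums can be computed on each worker and merged in any order — the same contract Spark imposes on its `reduce` operator.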

2. TUTORIAL OUTLINE
This tutorial provides an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices. It is divided into two parts:
• the first part will cover fundamental Spark concepts, including Spark Core, data frames, the Spark Shell, Spark Streaming, Spark SQL, and vertical libraries such as MLlib and GraphX;
• the second part will focus on hands-on algorithmic design and development with Spark: developing algorithms from scratch, such as decision tree learning; graph processing algorithms such as PageRank and shortest path; and gradient descent algorithms such as support vector machines and matrix factorization.

Industrial applications and deployments of Spark will also be presented. Example code will be made available in Python (PySpark) notebooks. We also review some of the limitations of Spark in its current form. See Table 2 for a list of the subject matter that will be covered in this tutorial.

Table 2: Tutorial subject matter outline

3. BIBLIOGRAPHY
[1] Karau, H., Konwinski, A., Wendell, P., and Zaharia, M. 2015. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media.
[2] Owen, S., Ryza, S., Laserson, U., and Wills, J. 2015. Advanced Analytics with Apache Spark. O'Reilly Media.
[3] Zaharia, M. 2011. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications. Invited talk, NIPS Big Learning Workshop: Algorithms, Systems, and Tools for Learning at Scale (Granada, Spain, December 12-17, 2011).
[4] Apache Foundation. 2014. Cluster Mode Overview - Spark 1.2.0 Documentation - Cluster Manager Types. Apache Org.
[5] Databricks. 2015. Databricks Spark Reference Applications. https://www.gitbook.com/book/databricks/databricks-spark-reference-applications/details
[6] Shanahan, J. G., and Kurra, G. 2010. Web Advertising: Business Models, Technologies and Issues. In Information Retrieval, edited by Melucci, M., and Baeza-Yates, R.
[7] Panda, B., Herbach, J. S., Basu, S., and Bayardo, R. J. 2009. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proc. VLDB Endow. 2, 2 (August 2009), 1426-1437. DOI: http://dx.doi.org/10.14778/1687553.1687569
[8] Takács, G. et al. 2008. Matrix factorization and neighbor based algorithms for the Netflix prize problem. In Proceedings of the 2008 ACM Conference on Recommender Systems (Lausanne, Switzerland, October 23-25), 267-274.
[9] Ott, P. 2008. Incremental Matrix Factorization for Collaborative Filtering. Science, Technology and Design, Anhalt University of Applied Sciences.
