Introduction To Apache Spark On Databricks

11/23/2017 Introduction to Apache Spark on Databricks - Databricks
Introduction to Apache Spark on Databricks
A Gentle Introduction to Apache

Spark on Databricks
Welcome to Databricks!
This notebook is intended to be the first step in your process to learn more about
how to best use Apache Spark on Databricks together. We'll be walking through
the core concepts, the fundamental abstractions, and the tools at your disposal.
This notebook will teach the fundamental concepts and best practices directly
from those that have written Apache Spark and know it best.
First, it's worth defining Databricks. Databricks is a managed platform for running
Apache Spark - that means that you do not have to learn complex cluster
management concepts nor perform tedious maintenance tasks to take advantage
of Spark. Databricks also provides a host of features to help its users be more
productive with Spark. It's a point and click platform for those that prefer a user
interface like data scientists or data analysts. However, this UI is accompanied by
a sophisticated API for those that want to automate aspects of their data
workloads with automated jobs. To meet the needs of enterprises, Databricks also
includes features such as role-based access control and other intelligent
optimizations that not only improve usability for users but also reduce costs and
complexity for administrators.
The Gentle Introduction Series
This notebook is a part of a series of notebooks aimed to get you up to speed with
the basics of Apache Spark quickly. This notebook is best suited for those that
have very little or no experience with Spark. The series also serves as a strong
review for those that have some experience with Spark but aren't as familiar with
some of the more sophisticated tools like UDF creation and machine learning
pipelines. The other notebooks in this series are:
A Gentle Introduction to Apache Spark on Databricks
(https://docs.databricks.com/_static/notebooks/gentle-introduction-to-
apache-spark.html)
Apache Spark on Databricks for Data Scientists
(https://docs.databricks.com/_static/notebooks/databricks-for-data-
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 1/19
scientists.html)
Apache Spark on Databricks for Data Engineers
(https://docs.databricks.com/_static/notebooks/databricks-for-data-
engineers.html)
Databricks Terminology
Databricks has key concepts that are worth understanding. You'll notice that many
of these line up with the links and icons that you'll see on the left side. These
together define the fundamental tools that Databricks provides to you as an end
user. They are available both in the web application UI as well as the REST API.
Workspaces
Workspaces allow you to organize all the work that you are doing on
Databricks. Like a folder structure in your computer, it allows you to save
notebooks and libraries and share them with other users. Workspaces are
not connected to data and should not be used to store data. They're simply
for you to store the notebooks and libraries that you use to operate on
and manipulate your data with.
Notebooks
Notebooks are a set of any number of cells that allow you to execute
commands. Cells hold code in any of the following languages: Scala ,
Python , R , SQL , or Markdown . Notebooks have a default language, but
each cell can have a language override to another language. This is done
by including %[language name] at the top of the cell. For instance
%python . We'll see this feature shortly.
Notebooks need to be connected to a cluster in order to be able to
execute commands however they are not permanently tied to a cluster. This
allows notebooks to be shared via the web or downloaded onto your local
machine.
Here is a demonstration video of Notebooks
(http://www.youtube.com/embed/MXI0F8zfKGI).
Dashboards
Dashboards can be created from notebooks as a way of displaying the
output of cells without the code that generates them.
Notebooks can also be scheduled as jobs in one click either to run a data
pipeline, update a machine learning model, or update a dashboard.
Libraries
Libraries are packages or modules that provide additional functionality that

you need to solve your business problems. These may be custom written
Scala or Java jars; python eggs or custom written packages. You can write
and upload these manually or you may install them directly via package
management utilities like pypi or maven.
Tables
Tables are structured data that you and your team will use for analysis.
Tables can exist in several places. Tables can be stored in cloud storage,
they can be stored on the cluster that you're currently using, or they can be
cached in memory. For more about tables see the documentation
(https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html
Clusters
Clusters are groups of computers that you treat as a single computer. In
Databricks, this means that you can effectively treat 20 computers as you
might treat one computer. Clusters allow you to execute code from
notebooks or libraries on set of data. That data may be raw data located
on cloud storage or structured data that you uploaded as a table to the
cluster you are working on.
It is important to note that clusters have access controls to control who has
access to each cluster.
Here is a demonstration video of Clusters
(http://www.youtube.com/embed/2-imke2vDs8).
Jobs
Jobs are the tool by which you can schedule execution to occur either on
an already existing cluster or a cluster of its own. These can be notebooks
as well as jars or python scripts. They can be created either manually or via
the REST API.
Here is a demonstration video of Jobs
(http://www.youtube.com/embed/srI9yNOAbU0).
Apps
Apps are third party integrations with the Databricks platform. These
include applications like Tableau.
Databricks and Apache Spark Help

Resources
Databricks comes with a variety of tools to help you learn how to use Databricks
and Apache Spark effectively. Databricks holds the greatest collection of Apache
Spark documentation available anywhere on the web. There are two fundamental
sets of resources that we make available: resources to help you learn how to use
Apache Spark and Databricks and resources that you can refer to if you already
know the basics.
To access these resources at any time, click the question mark button at the top
right-hand corner. This search menu will search all of the below sections of the
documentation.
The Databricks Guide

The Databricks Guide is the definitive reference for you and your team once
you've become accustomed to using and leveraging Apache Spark. It
allows for quick reference of common Databricks and Spark APIs with
snippets of sample code.
The Guide also includes a series of tutorials (including this one!) that
provide a more guided introduction to a given topic.
The Spark APIs
Databricks makes it easy to search the Apache Spark APIs directly. Simply
use the search that is available at the top right and it will automatically
display API results as well.
The Apache Spark Documentation
The Apache Spark open source documentation is also made available for
quick and simple search if you need to dive deeper into some of the
internals of Apache Spark.
Databricks Forums
The Databricks Forums (https://forums.databricks.com/) are a community
resource for those that have specific use case questions or questions that
they cannot see answered in the guide or the documentation.
Databricks also provides professional and enterprise level technical support

(http://go.databricks.com/contact-databricks) for companies and enterprises
looking to take their Apache Spark deployments to the next level.
Databricks and Apache Spark Abstractions

Now that we've defined the terminology and more learning resources - let's go
through a basic introduction of Apache Spark and Databricks. While you're likely
familiar with the concept of Spark, let's take a moment to ensure that we all share
the same definitions and give you the opportunity to learn a bit about Spark's
history.
The Apache Spark project's History

Spark was originally written by the founders of Databricks during their time at UC
Berkeley. The Spark project started in 2009, was open sourced in 2010, and in
2013 its code was donated to Apache, becoming Apache Spark. The employees
of Databricks have written over 75% of the code in Apache Spark and have
contributed more than 10 times more code than any other organization. Apache
Spark is a sophisticated distributed computation framework for executing code in
parallel across many different machines. While the abstractions and interfaces are
simple, managing clusters of computers and ensuring production-level stability is
not. Databricks makes big data simple by providing Apache Spark as a hosted
solution.
The Contexts/Environments
Let's now tour the core abstractions in Apache Spark to ensure that you'll be
comfortable with all the pieces that you're going to need to understand in order to
understand how to use Databricks and Spark effectively.
Historically, Apache Spark has had two core contexts that are available to the user.
The sparkContext made available as sc and the SQLContext made available
as sqlContext , these contexts make a variety of functions and information
available to the user. The sqlContext makes a lot of DataFrame functionality
available while the sparkContext focuses more on the Apache Spark engine
itself.
However in Apache Spark 2.X, there is just one context - the SparkSession .
The Data Interfaces

There are several key interfaces that you should understand when you go to use
Spark.
The Dataset
The Dataset is Apache Spark's newest distributed collection and can be
considered a combination of DataFrames and RDDs. It provides the typed
interface that is available in RDDs while providing a lot of conveniences of
DataFrames. It will be the core abstraction going forward.
The DataFrame
The DataFrame is collection of distributed Row types. These provide a
flexible interface and are similar in concept to the DataFrames you may be
familiar with in python (pandas) as well as in the R language.
The RDD (Resilient Distributed Dataset)
Apache Spark's first abstraction was the RDD or Resilient Distributed
Dataset. Essentially it is an interface to a sequence of data objects that
consist of one or more types that are located across a variety of machines
in a cluster. RDD's can be created in a variety of ways and are the "lowest
level" API available to the user. While this is the original data structure
made available, new users should focus on Datasets as those will be
supersets of the current RDD functionality.
Getting Started with Some Code!

Whew, that's a lot to cover thus far! But we've made it to the demonstration so we
can see the power of Apache Spark and Databricks together. To do this you can
do one of several things. First, and probably simplest, is that you can copy this
notebook into your own environment via the Import Notebook button that is
available at the top right or top left of this page. If you'd rather type all of the
commands yourself, you can create a new notebook and type the commands as
we proceed.
Creating a Cluster
Click the Clusters button that you'll notice on the left side of the page. On the
Clusters page, click on in the upper left corner.
Then, on the Create Cluster dialog, enter the configuration for the new cluster.
Finally,
Select a unique name for the cluster.
Select the Runtime Version.
Enter the number of workers to bring up - at least 1 is required to run Spark
commands.
To read more about some of the other options that are available to users please
see the Databricks Guide on Clusters
(https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#02%2
first let's explore the previously mentioned SparkSession . We can access it via
the spark variable. As explained, the Spark Session is the core location for where
Apache Spark related information is stored. For Spark 1.X the variables are
sqlContext and sc .
Cells can be executed by hitting shift+enter while the cell is selected.
spark
Out[1]: <pyspark.sql.session.SparkSession at 0x7f8816057b90>
We can use the Spark Context to access information but we can also use it to
parallelize a collection as well. Here we'll parallelize a small python range that will
provide a return type of DataFrame .
firstDataFrame = sqlContext.range(1000000)
# The code for 2.X is

# spark.range(1000000)
print firstDataFrame
DataFrame[id: bigint]
Now one might think that this would actually print out the values of the
DataFrame that we just parallelized, however that's not quite how Apache Spark
works. Spark allows two distinct kinds of operations by the user. There are
transformations and there are actions.
Transformations
Transformations are operations that will not be completed at the time you write
and execute the code in a cell - they will only get executed once you have called a
action. An example of a transformation might be to convert an integer into a float
or to filter a set of values.
Actions
Actions are commands that are computed by Spark right at the time of their
execution. They consist of running all of the previous transformations in order to
get back an actual result. An action is composed of one or more jobs which
consists of tasks that will be executed by the workers in parallel where possible
Here are some simple examples of transformations and actions. Remember, these
are not all the transformations and actions - this is just a short sample of them.
We'll get to why Apache Spark is designed this way shortly!
# An example of a transformation
# select the ID column values and multiply them by 2
secondDataFrame = firstDataFrame.selectExpr("(id * 2) as value")
# an example of an action
# take the first 5 values that we have in our firstDataFrame
print firstDataFrame.take(5)
# take the first 5 values that we have in our secondDataFrame
print secondDataFrame.take(5)
[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4)]

[Row(value=0), Row(value=2), Row(value=4), Row(value=6), Row(value=8)]
Now we've seen that Spark consists of actions and transformations. Let's talk
about why that's the case. The reason for this is that it gives a simple way to
optimize the entire pipeline of computations as opposed to the individual pieces.
This makes it exceptionally fast for certain types of computation because it can
perform all relevant computations at once. Technically speaking, Spark
pipelines this computation which we can see in the image below. This means
that certain computations can all be performed at once (like a map and a filter)
rather than having to do one operation for all pieces of data then the following
operation.
Apache Spark can also keep results in memory as opposed to other frameworks
that immediately write to disk after each task.
Apache Spark Architecture

Before proceeding with our example, let's see an overview of the Apache Spark
architecture. As mentioned before, Apache Spark allows you to treat many
machines as one machine and this is done via a master-worker type architecture
where there is a driver or master node in the cluster, accompanied by worker
nodes. The master sends work to the workers and either instructs them to pull to
data from memory or from disk (or from another data source).
The diagram below shows an example Apache Spark cluster, basically there exists
a Driver node that communicates with executor nodes. Each of these executor
nodes have slots which are logically like execution cores.
The Driver sends Tasks to the empty slots on the Executors when work has to be
done:
Note: In the case of the Community Edition there is no Worker, and the Master, not
shown in the figure, executes the entire code.
You can view the details of your Apache Spark application in the Apache Spark
web UI. The web UI is accessible in Databricks by going to "Clusters" and then
clicking on the "View Spark UI" link for your cluster, it is also available by clicking
at the top left of this notebook where you would select the cluster to attach this
notebook to. In this option will be a link to the Apache Spark Web UI.
At a high level, every Apache Spark application consists of a driver program that
launches various parallel operations on executor Java Virtual Machines (JVMs)
running either in a cluster or locally on the same machine. In Databricks, the
notebook interface is the driver program. This driver program contains the main
loop for the program and creates distributed datasets on the cluster, then applies
operations (transformations & actions) to those datasets. Driver programs access
Apache Spark through a SparkSession object regardless of deployment location.
A Worked Example of Transformations and

Actions
To illustrate all of these architectural and most relevantly transformations and
actions - let's go through a more thorough example, this time using DataFrames
and a csv file.
The DataFrame and SparkSQL work almost exactly as we have described above,
we're going to build up a plan for how we're going to access the data and then
finally execute that plan with an action. We'll see this process in the diagram
below. We go through a process of analyzing the query, building up a plan,
comparing them and then finally executing it.
While we won't go too deep into the details for how this process works, you can
read a lot more about this process on the Databricks blog
(https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-
optimizer.html). For those that want a more information about how Apache Spark
goes through this process, I would definitely recommend that post!
Going forward, we're going to access a set of public datasets that Databricks
makes available. Databricks datasets are a small curated group that we've pulled
together from across the web. We make these available using the Databricks
filesystem. Let's load the popular diamonds dataset in as a spark DataFrame .
Now let's go through the dataset that we'll be working with.
%fs ls /databricks-datasets/Rdatasets/data-001/datasets.csv
path
dbfs:/databricks-datasets/Rdatasets/data-001/datasets.csv
dataPath = "/databricks-datasets/Rdatasets/data-
001/csv/ggplot2/diamonds.csv"
diamonds = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.option("inferSchema", "true")\
.load(dataPath)
# inferSchema means we will automatically figure out column types

# at a cost of reading the data more than once
Now that we've loaded in the data, we're going to perform computations on it.
This provide us a convenient tour of some of the basic functionality and some of
the nice features that makes running Spark on Databricks the simplest! In order to
be able to perform our computations, we need to understand more about the data.
We can do this with the display function.
display(diamonds)
_c0 carat cut color clarity

1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0 22 Fair E VS2
Showing the first 1000 rows.
what makes display exceptional is the fact that we can very easily create some
more sophisticated graphs by clicking the graphing icon that you can see below.
Here's a plot that allows us to compare price, color, and cut.
display(diamonds)
6,500
6,000
5,500
5,000
4,500
4,000
3,500
price
3,000
2,500
2,000
1,500
1,000
500
0
Fair Good Ideal Premium
Now that we've explored the data, let's return to understanding transformations
and actions. I'm going to create several transformations and then an action. After
that we will inspect exactly what's happening under the hood.
These transformations are simple, first we group by two variables, cut and color
and then compute the average price. Then we're going to inner join that to the
original dataset on the column color . Then we'll select the average price as well
as the carat from that new dataset.
df1 = diamonds.groupBy("cut", "color").avg("price") # a simple grouping
df2 = df1\
.join(diamonds, on='color', how='inner')\
.select("àvg(price)`", "carat")
# a simple join and selecting some columns
These transformations are now complete in a sense but nothing has happened. As
you'll see above we don't get any results back!
The reason for that is these computations are lazy in order to build up the entire
flow of data from start to finish required by the user. This is a intelligent
optimization for two key reasons. Any calculation can be recomputed from the
very source data allowing Apache Spark to handle any failures that occur along
the way, successfully handle stragglers. Secondly, Apache Spark can optimize
computation so that data and computation can be pipelined as we mentioned
above. Therefore, with each transformation Apache Spark creates a plan for how it
will perform this work.
To get a sense for what this plan consists of, we can use the explain method.
Remember that none of our computations have been executed yet, so all this
explain method does is tells us the lineage for how to compute this exact dataset.
df2.explain()
== Physical Plan ==
*Project [avg(price)#276,carat#282]
+- *BroadcastHashJoin [color#109], [color#284], Inner, BuildRight, None
:- *TungstenAggregate(key=[cut#108,color#109], functions=[(avg(cast(price
#113 as bigint)),mode=Final,isDistinct=false)], output=[color#109,avg(price)
#276])
: +- Exchange hashpartitioning(cut#108, color#109, 200), None
: +- *TungstenAggregate(key=[cut#108,color#109], functions=[(avg(cast
(price#113 as bigint)),mode=Partial,isDistinct=false)], output=[cut#108,colo
r#109,sum#314,count#315L])
: +- *Project [cut#108,color#109,price#113]
: +- *Filter isnotnull(color#109)
: +- *Scan csv [cut#108,color#109,price#113] Format: CSV, In
putPaths: dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.
csv, PushedFilters: [IsNotNull(color)], ReadSchema: struct<cut:string,color:
string,price:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, tr
ue]))
+- *Project [carat#282,color#284]
+- *Filter isnotnull(color#284)
+- *Scan csv [carat#282,color#284] Format: CSV, InputPaths: dbf
s:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv, PushedFi
lters: [IsNotNull(color)], ReadSchema: struct<carat:double,color:string>
Now explaining the above results is outside of this introductory tutorial, but please
feel free to read through it. What you should deduce from this is that Spark has
generated a plan for how it hopes to execute the given query. Let's now run an
action in order to execute the above plan.
df2.count()
Out[11]: 269700
This will execute the plan that Apache Spark built up previously. Click the little
arrow next to where it says (2) Spark Jobs after that cell finishes executing and
then click the View link. This brings up the Apache Spark Web UI right inside of
your notebook. This can also be accessed from the cluster attach button at the
top of this notebook. In the Spark UI, you should see something that includes a
diagram something like this.
or
These are significant visualizations. The top one is using Apache Spark 1.6 while
the lower one is using Apache Spark 2.0, we'll be focusing on the 2.0 version.
These are Directed Acyclic Graphs (DAG)s of all the computations that have to be
performed in order to get to that result. It's easy to see that the second DAG
visualization is much cleaner than the one before but both visualizations show us
all the steps that Spark has to get our data into the final form.
Again, this DAG is generated because transformations are lazy - while generating
this series of steps Spark will optimize lots of things along the way and will even
generate code to do so. This is one of the core reasons that users should be
focusing on using DataFrames and Datasets instead of the legacy RDD API. With
DataFrames and Datasets, Apache Spark will work under the hood to optimize the
entire query plan and pipeline entire steps together. You'll see instances of
WholeStageCodeGen as well as tungsten in the plans and these are apart of the
improvements in SparkSQL which you can read more about on the Databricks
blog. (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-
closer-to-bare-metal.html)
In this diagram you can see that we start with a CSV all the way on the left side,
perform some changes, merge it with another CSV file (that we created from the
original DataFrame), then join those together and finally perform some
aggregations until we get our final result!
Caching
One of the significant parts of Apache Spark is its ability to store things in memory
during computation. This is a neat trick that you can use as a way to speed up
access to commonly queried tables or pieces of data. This is also great for
iterative algorithms that work over and over again on the same data. While many
see this as a panacea for all speed issues, think of it much more like a tool that
you can use. Other important concepts like data partitioning, clustering and
bucketing can end up having a much greater effect on the execution of your job
than caching however remember - these are all tools in your tool kit!
To cache a DataFrame or RDD, simply use the cache method.
df2.cache()
Out[12]: DataFrame[avg(price): double, carat: double]
Caching, like a transformation, is performed lazily. That means that it won't store
the data in memory until you call an action on that dataset.
Here's a simple example. We've created our df2 DataFrame which is essentially a
logical plan that tells us how to compute that exact DataFrame. We've told Apache
Spark to cache that data after we compute it for the first time. So let's call a full
scan of the data with a count twice. The first time, this will create the DataFrame,
cache it in memory, then return the result. The second time, rather than
recomputing that whole DataFrame, it will just hit the version that it has in memory.
Let's take a look at how we can discover this.
df2.count()
Out[13]: 269700
However after we've now counted the data. We'll see that the explain ends up
being quite different.
df2.count()
Out[14]: 269700
In the above example, we can see that this cuts down on the time needed to
generate this data immensely - often by at least an order of magnitude. With much
larger and more complex data analysis, the gains that we get from caching can be
even greater!
Conclusion
In this notebook we've covered a ton of material! But you're now well on your way
to understanding Spark and Databricks! Now that you've completed this
notebook, you should hopefully be more familiar with the core concepts of Spark
on Databricks. Be sure to subscribe to our blog to get the latest updates about
Apache Spark 2.0 and the next notebooks in this series!

Introduction To Apache Spark On Databricks - Databricks

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Introduction To Apache Spark On Databricks - Databricks

Uploaded by

Copyright:

Available Formats

11/23/2017 Introduction to Apache Spark on Databricks - Databricks

A Gentle Introduction to Apache

The Gentle Introduction Series

Libraries are packages or modules that provide additional functionality that

Databricks and Apache Spark Help

The Databricks Guide

Databricks also provides professional and enterprise level technical support

Databricks and Apache Spark Abstractions

The Apache Spark project's History

The Data Interfaces

Getting Started with Some Code!

Cells can be executed by hitting shift+enter while the cell is selected.

Out[1]: <pyspark.sql.session.SparkSession at 0x7f8816057b90>

# The code for 2.X is

[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4)]

Apache Spark Architecture

A Worked Example of Transformations and

# inferSchema means we will automatically figure out column types

_c0 carat cut color clarity

df1 = diamonds.groupBy("cut", "color").avg("price") # a simple grouping

To cache a DataFrame or RDD, simply use the cache method.

Out[12]: DataFrame[avg(price): double, carat: double]

Let's take a look at how we can discover this.

You might also like