Professional Documents
Culture Documents
This notebook is intended to be the first step in your process to learn more about
how to best use Apache Spark on Databricks together. We'll be walking through
the core concepts, the fundamental abstractions, and the tools at your disposal.
This notebook will teach the fundamental concepts and best practices directly
from those that have written Apache Spark and know it best.
First, it's worth defining Databricks. Databricks is a managed platform for running
Apache Spark - that means that you do not have to learn complex cluster
management concepts nor perform tedious maintenance tasks to take advantage
of Spark. Databricks also provides a host of features to help its users be more
productive with Spark. It's a point and click platform for those that prefer a user
interface like data scientists or data analysts. However, this UI is accompanied by
a sophisticated API for those that want to automate aspects of their data
workloads with automated jobs. To meet the needs of enterprises, Databricks also
includes features such as role-based access control and other intelligent
optimizations that not only improve usability for users but also reduce costs and
complexity for administrators.
This notebook is a part of a series of notebooks aimed to get you up to speed with
the basics of Apache Spark quickly. This notebook is best suited for those that
have very little or no experience with Spark. The series also serves as a strong
review for those that have some experience with Spark but aren't as familiar with
some of the more sophisticated tools like UDF creation and machine learning
pipelines. The other notebooks in this series are:
A Gentle Introduction to Apache Spark on Databricks
(https://docs.databricks.com/_static/notebooks/gentle-introduction-to-
apache-spark.html)
Apache Spark on Databricks for Data Scientists
(https://docs.databricks.com/_static/notebooks/databricks-for-data-
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 1/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
scientists.html)
Apache Spark on Databricks for Data Engineers
(https://docs.databricks.com/_static/notebooks/databricks-for-data-
engineers.html)
Databricks Terminology
Databricks has key concepts that are worth understanding. You'll notice that many
of these line up with the links and icons that you'll see on the left side. These
together define the fundamental tools that Databricks provides to you as an end
user. They are available both in the web application UI as well as the REST API.
Workspaces
Workspaces allow you to organize all the work that you are doing on
Databricks. Like a folder structure in your computer, it allows you to save
notebooks and libraries and share them with other users. Workspaces are
not connected to data and should not be used to store data. They're simply
for you to store the notebooks and libraries that you use to operate on
and manipulate your data with.
Notebooks
Notebooks are a set of any number of cells that allow you to execute
commands. Cells hold code in any of the following languages: Scala ,
Python , R , SQL , or Markdown . Notebooks have a default language, but
each cell can have a language override to another language. This is done
by including %[language name] at the top of the cell. For instance
%python . We'll see this feature shortly.
Notebooks need to be connected to a cluster in order to be able to
execute commands however they are not permanently tied to a cluster. This
allows notebooks to be shared via the web or downloaded onto your local
machine.
Here is a demonstration video of Notebooks
(http://www.youtube.com/embed/MXI0F8zfKGI).
Dashboards
Dashboards can be created from notebooks as a way of displaying the
output of cells without the code that generates them.
Notebooks can also be scheduled as jobs in one click either to run a data
pipeline, update a machine learning model, or update a dashboard.
Libraries
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 2/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 3/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
Databricks comes with a variety of tools to help you learn how to use Databricks
and Apache Spark effectively. Databricks holds the greatest collection of Apache
Spark documentation available anywhere on the web. There are two fundamental
sets of resources that we make available: resources to help you learn how to use
Apache Spark and Databricks and resources that you can refer to if you already
know the basics.
To access these resources at any time, click the question mark button at the top
right-hand corner. This search menu will search all of the below sections of the
documentation.
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 4/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
The Contexts/Environments
Let's now tour the core abstractions in Apache Spark to ensure that you'll be
comfortable with all the pieces that you're going to need to understand in order to
understand how to use Databricks and Spark effectively.
Historically, Apache Spark has had two core contexts that are available to the user.
The sparkContext made available as sc and the SQLContext made available
as sqlContext , these contexts make a variety of functions and information
available to the user. The sqlContext makes a lot of DataFrame functionality
available while the sparkContext focuses more on the Apache Spark engine
itself.
However in Apache Spark 2.X, there is just one context - the SparkSession .
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 5/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
Creating a Cluster
Click the Clusters button that you'll notice on the left side of the page. On the
Clusters page, click on in the upper left corner.
Then, on the Create Cluster dialog, enter the configuration for the new cluster.
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 6/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
Finally,
Select a unique name for the cluster.
Select the Runtime Version.
Enter the number of workers to bring up - at least 1 is required to run Spark
commands.
To read more about some of the other options that are available to users please
see the Databricks Guide on Clusters
(https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#02%2
first let's explore the previously mentioned SparkSession . We can access it via
the spark variable. As explained, the Spark Session is the core location for where
Apache Spark related information is stored. For Spark 1.X the variables are
sqlContext and sc .
spark
We can use the Spark Context to access information but we can also use it to
parallelize a collection as well. Here we'll parallelize a small python range that will
provide a return type of DataFrame .
firstDataFrame = sqlContext.range(1000000)
DataFrame[id: bigint]
Now one might think that this would actually print out the values of the
DataFrame that we just parallelized, however that's not quite how Apache Spark
works. Spark allows two distinct kinds of operations by the user. There are
transformations and there are actions.
Transformations
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 7/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
Transformations are operations that will not be completed at the time you write
and execute the code in a cell - they will only get executed once you have called a
action. An example of a transformation might be to convert an integer into a float
or to filter a set of values.
Actions
Actions are commands that are computed by Spark right at the time of their
execution. They consist of running all of the previous transformations in order to
get back an actual result. An action is composed of one or more jobs which
consists of tasks that will be executed by the workers in parallel where possible
Here are some simple examples of transformations and actions. Remember, these
are not all the transformations and actions - this is just a short sample of them.
We'll get to why Apache Spark is designed this way shortly!
# An example of a transformation
# select the ID column values and multiply them by 2
secondDataFrame = firstDataFrame.selectExpr("(id * 2) as value")
# an example of an action
# take the first 5 values that we have in our firstDataFrame
print firstDataFrame.take(5)
# take the first 5 values that we have in our secondDataFrame
print secondDataFrame.take(5)
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 8/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
Now we've seen that Spark consists of actions and transformations. Let's talk
about why that's the case. The reason for this is that it gives a simple way to
optimize the entire pipeline of computations as opposed to the individual pieces.
This makes it exceptionally fast for certain types of computation because it can
perform all relevant computations at once. Technically speaking, Spark
pipelines this computation which we can see in the image below. This means
that certain computations can all be performed at once (like a map and a filter)
rather than having to do one operation for all pieces of data then the following
operation.
Apache Spark can also keep results in memory as opposed to other frameworks
that immediately write to disk after each task.
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 9/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
The diagram below shows an example Apache Spark cluster, basically there exists
a Driver node that communicates with executor nodes. Each of these executor
nodes have slots which are logically like execution cores.
The Driver sends Tasks to the empty slots on the Executors when work has to be
done:
Note: In the case of the Community Edition there is no Worker, and the Master, not
shown in the figure, executes the entire code.
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 10/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
You can view the details of your Apache Spark application in the Apache Spark
web UI. The web UI is accessible in Databricks by going to "Clusters" and then
clicking on the "View Spark UI" link for your cluster, it is also available by clicking
at the top left of this notebook where you would select the cluster to attach this
notebook to. In this option will be a link to the Apache Spark Web UI.
At a high level, every Apache Spark application consists of a driver program that
launches various parallel operations on executor Java Virtual Machines (JVMs)
running either in a cluster or locally on the same machine. In Databricks, the
notebook interface is the driver program. This driver program contains the main
loop for the program and creates distributed datasets on the cluster, then applies
operations (transformations & actions) to those datasets. Driver programs access
Apache Spark through a SparkSession object regardless of deployment location.
The DataFrame and SparkSQL work almost exactly as we have described above,
we're going to build up a plan for how we're going to access the data and then
finally execute that plan with an action. We'll see this process in the diagram
below. We go through a process of analyzing the query, building up a plan,
comparing them and then finally executing it.
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 11/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
While we won't go too deep into the details for how this process works, you can
read a lot more about this process on the Databricks blog
(https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-
optimizer.html). For those that want a more information about how Apache Spark
goes through this process, I would definitely recommend that post!
Going forward, we're going to access a set of public datasets that Databricks
makes available. Databricks datasets are a small curated group that we've pulled
together from across the web. We make these available using the Databricks
filesystem. Let's load the popular diamonds dataset in as a spark DataFrame .
Now let's go through the dataset that we'll be working with.
%fs ls /databricks-datasets/Rdatasets/data-001/datasets.csv
path
dbfs:/databricks-datasets/Rdatasets/data-001/datasets.csv
dataPath = "/databricks-datasets/Rdatasets/data-
001/csv/ggplot2/diamonds.csv"
diamonds = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.option("inferSchema", "true")\
.load(dataPath)
Now that we've loaded in the data, we're going to perform computations on it.
This provide us a convenient tour of some of the basic functionality and some of
the nice features that makes running Spark on Databricks the simplest! In order to
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 12/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
be able to perform our computations, we need to understand more about the data.
We can do this with the display function.
display(diamonds)
what makes display exceptional is the fact that we can very easily create some
more sophisticated graphs by clicking the graphing icon that you can see below.
Here's a plot that allows us to compare price, color, and cut.
display(diamonds)
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 13/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
6,500
6,000
5,500
5,000
4,500
4,000
3,500
price
3,000
2,500
2,000
1,500
1,000
500
0
Fair Good Ideal Premium
Now that we've explored the data, let's return to understanding transformations
and actions. I'm going to create several transformations and then an action. After
that we will inspect exactly what's happening under the hood.
These transformations are simple, first we group by two variables, cut and color
and then compute the average price. Then we're going to inner join that to the
original dataset on the column color . Then we'll select the average price as well
as the carat from that new dataset.
df2 = df1\
.join(diamonds, on='color', how='inner')\
.select("`avg(price)`", "carat")
# a simple join and selecting some columns
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 14/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
These transformations are now complete in a sense but nothing has happened. As
you'll see above we don't get any results back!
The reason for that is these computations are lazy in order to build up the entire
flow of data from start to finish required by the user. This is a intelligent
optimization for two key reasons. Any calculation can be recomputed from the
very source data allowing Apache Spark to handle any failures that occur along
the way, successfully handle stragglers. Secondly, Apache Spark can optimize
computation so that data and computation can be pipelined as we mentioned
above. Therefore, with each transformation Apache Spark creates a plan for how it
will perform this work.
To get a sense for what this plan consists of, we can use the explain method.
Remember that none of our computations have been executed yet, so all this
explain method does is tells us the lineage for how to compute this exact dataset.
df2.explain()
== Physical Plan ==
*Project [avg(price)#276,carat#282]
+- *BroadcastHashJoin [color#109], [color#284], Inner, BuildRight, None
:- *TungstenAggregate(key=[cut#108,color#109], functions=[(avg(cast(price
#113 as bigint)),mode=Final,isDistinct=false)], output=[color#109,avg(price)
#276])
: +- Exchange hashpartitioning(cut#108, color#109, 200), None
: +- *TungstenAggregate(key=[cut#108,color#109], functions=[(avg(cast
(price#113 as bigint)),mode=Partial,isDistinct=false)], output=[cut#108,colo
r#109,sum#314,count#315L])
: +- *Project [cut#108,color#109,price#113]
: +- *Filter isnotnull(color#109)
: +- *Scan csv [cut#108,color#109,price#113] Format: CSV, In
putPaths: dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.
csv, PushedFilters: [IsNotNull(color)], ReadSchema: struct<cut:string,color:
string,price:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, tr
ue]))
+- *Project [carat#282,color#284]
+- *Filter isnotnull(color#284)
+- *Scan csv [carat#282,color#284] Format: CSV, InputPaths: dbf
s:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv, PushedFi
lters: [IsNotNull(color)], ReadSchema: struct<carat:double,color:string>
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 15/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
Now explaining the above results is outside of this introductory tutorial, but please
feel free to read through it. What you should deduce from this is that Spark has
generated a plan for how it hopes to execute the given query. Let's now run an
action in order to execute the above plan.
df2.count()
Out[11]: 269700
This will execute the plan that Apache Spark built up previously. Click the little
arrow next to where it says (2) Spark Jobs after that cell finishes executing and
then click the View link. This brings up the Apache Spark Web UI right inside of
your notebook. This can also be accessed from the cluster attach button at the
top of this notebook. In the Spark UI, you should see something that includes a
diagram something like this.
or
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 16/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
These are significant visualizations. The top one is using Apache Spark 1.6 while
the lower one is using Apache Spark 2.0, we'll be focusing on the 2.0 version.
These are Directed Acyclic Graphs (DAG)s of all the computations that have to be
performed in order to get to that result. It's easy to see that the second DAG
visualization is much cleaner than the one before but both visualizations show us
all the steps that Spark has to get our data into the final form.
Again, this DAG is generated because transformations are lazy - while generating
this series of steps Spark will optimize lots of things along the way and will even
generate code to do so. This is one of the core reasons that users should be
focusing on using DataFrames and Datasets instead of the legacy RDD API. With
DataFrames and Datasets, Apache Spark will work under the hood to optimize the
entire query plan and pipeline entire steps together. You'll see instances of
WholeStageCodeGen as well as tungsten in the plans and these are apart of the
improvements in SparkSQL which you can read more about on the Databricks
blog. (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-
closer-to-bare-metal.html)
In this diagram you can see that we start with a CSV all the way on the left side,
perform some changes, merge it with another CSV file (that we created from the
original DataFrame), then join those together and finally perform some
aggregations until we get our final result!
Caching
One of the significant parts of Apache Spark is its ability to store things in memory
during computation. This is a neat trick that you can use as a way to speed up
access to commonly queried tables or pieces of data. This is also great for
iterative algorithms that work over and over again on the same data. While many
see this as a panacea for all speed issues, think of it much more like a tool that
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 17/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
you can use. Other important concepts like data partitioning, clustering and
bucketing can end up having a much greater effect on the execution of your job
than caching however remember - these are all tools in your tool kit!
df2.cache()
Caching, like a transformation, is performed lazily. That means that it won't store
the data in memory until you call an action on that dataset.
Here's a simple example. We've created our df2 DataFrame which is essentially a
logical plan that tells us how to compute that exact DataFrame. We've told Apache
Spark to cache that data after we compute it for the first time. So let's call a full
scan of the data with a count twice. The first time, this will create the DataFrame,
cache it in memory, then return the result. The second time, rather than
recomputing that whole DataFrame, it will just hit the version that it has in memory.
df2.count()
Out[13]: 269700
However after we've now counted the data. We'll see that the explain ends up
being quite different.
df2.count()
Out[14]: 269700
In the above example, we can see that this cuts down on the time needed to
generate this data immensely - often by at least an order of magnitude. With much
larger and more complex data analysis, the gains that we get from caching can be
even greater!
Conclusion
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 18/19
11/23/2017 Introduction to Apache Spark on Databricks - Databricks
In this notebook we've covered a ton of material! But you're now well on your way
to understanding Spark and Databricks! Now that you've completed this
notebook, you should hopefully be more familiar with the core concepts of Spark
on Databricks. Be sure to subscribe to our blog to get the latest updates about
Apache Spark 2.0 and the next notebooks in this series!
file:///Users/gaurav/Downloads/Introduction%20to%20Apache%20Spark%20on%20Databricks.html 19/19