
MODERN DATA ARCHITECTURES

FOR BIG DATA II


APACHE SPARK
PRODUCTION
SCENARIOS
Agenda

● Developing a Spark Application

● Databricks

● References
▹ Databricks Usages
▹ Customer Stories
▹ Optional Videos

1.
SPARK
APPLICATIONS
1.1
DEVELOPING
A SPARK
APPLICATION
Developing a Spark Application

So far we’ve played around with Spark by running Jupyter notebooks, which is perfectly valid for interactive analytics.

Spark applications are more than that, and oftentimes are needed for critical production jobs (ETL, streaming, advanced analytics, …)

Developing a Spark Application

We can create a PySpark application by porting all the code we’ve written in the cells of a notebook into a regular Python application:

if __name__ == '__main__':
    from pyspark.sql import SparkSession

    # builder returns a SparkSession directly; no separate
    # SparkContext is needed since Spark 2.x
    spark = SparkSession.builder \
        .master("local") \
        .appName("Bikes") \
        .getOrCreate()

Developing a Spark Application

Example of a PySpark application in a regular Python file (.py)

Developing a Spark Application

The key thing to remember is that Spark applications are meant to run in a distributed way, that is, on a cluster of computers.

Spark provides a command-line interface to launch and execute Spark applications on clusters.

This command is spark-submit

Developing a Spark Application

Once the Spark code is written, it’s time to submit it for execution by using spark-submit:

osbdet@osbdet:~$ export PYSPARK_PYTHON=/usr/bin/python3


osbdet@osbdet:~$ $SPARK_HOME/bin/spark-submit --master local \
--packages "graphframes:graphframes:0.8.0-spark3.0-s_2.12" \
Bikes.py

Developing a Spark Application

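Beyond --master and --packages, spark-submit accepts standard deployment and resource flags for cluster runs. A hedged sketch (the flags themselves are standard spark-submit options, but the file names, executor counts and memory sizes below are illustrative, and running it requires a configured Spark/YARN installation):

```
# Submit the same application to a YARN cluster instead of local mode
# (values are examples only).
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --py-files helpers.py \
  Bikes.py
```

With --deploy-mode cluster the driver itself runs inside the cluster, which is what you usually want for production jobs.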
2.
DATABRICKS
Databricks

Databricks is an American enterprise software company founded by the creators of Apache Spark.

Databricks develops a web-based platform for working with Spark that provides automated cluster management and multi-language notebooks.

It’s a commercial product available on all the major cloud vendors (AWS, Azure and GCP)

Databricks

Its adoption has been growing in recent years, and it is now the preferred way of using Apache Spark in the cloud.

Databricks

Since it’s a cloud technology, we would need to register with a cloud vendor and provide our credit card.

To avoid unwanted charges in case you forget to stop a cloud service, and because the goal of this class is NOT learning how to set up and configure a Databricks environment in the cloud but how to use it, we will provide you with some videos.

Databricks

https://www.youtube.com/watch?v=5MC-RVfqnuY
Databricks

https://www.youtube.com/watch?v=js3MFxkDcL8
Databricks

https://www.youtube.com/watch?v=67DeQOWIA7c
Databricks

https://www.youtube.com/watch?v=xtHcZVroK8Y
3.
REFERENCES
3.1
DATABRICKS
USAGE
Databricks - company behind Spark

Streaming use cases

Machine Learning and Big Data

3.2
CUSTOMER
STORIES
Recommendation Engine for Rue Gilt Groupe

Shell

Clearsense

3.3
OPTIONAL
VIDEOS
Databricks

If you are still very interested in provisioning your own Databricks environment, you can follow these steps
