
Dec 01 2020 - What is Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Azure Databricks is a data analytics platform (PaaS), specifically optimised for the Microsoft Azure cloud platform. Databricks is an enterprise-grade platform service, unified around a data lake architecture for large analytical operations.

Azure Databricks: End-to-end web-based analytics platform


Azure Databricks combines:
 large-scale data processing for batch loads and streaming data
 simplifies and accelerates collaborative work among data scientists, data engineers and machine learning engineers
 offers complete analytics and machine learning algorithms and languages
 features a complete ML DevOps model life-cycle, from experimentation to production
 is built on Apache Spark and embraces Delta Lake and MLflow
Azure Databricks is optimised for Microsoft Azure and offers an interactive workspace for collaboration between data engineers, data scientists, and machine learning engineers, with multi-language capabilities to create notebooks in Python, R, Scala, Spark, SQL and others.
It also gives you the ability to run SQL queries on the data lake, create multiple visualisation types to explore query results from different perspectives, and build and share dashboards.
Azure Databricks is designed to build and handle big data pipelines, for data ingestion (raw or structured) into Azure through several different Azure services, such as:
 Azure Data Factory in batches,
 or streamed near real-time using Apache Kafka,
 Event Hub, or
 IoT Hub.
It also supports connectivity to several persistent storage services for creating a data lake, such as:
 Azure Blob Storage
 Azure Data Lake Storage
 SQL-type databases
 Queue / File-tables
Your analytics workflow will use Spark technology to read data from multiple different sources and create state-of-the-art analytics in Azure Databricks.
The Azure Databricks welcome page gives you an easy, fast and collaborative interface.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 02 2020 - How to get started with Azure Databricks
Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
In the previous blogpost we looked into what Azure Databricks is and its main features, who the targeted user is, and what the capabilities of the platform are.
Where and how to get started?
1. Get an Azure subscription
If you don't have a subscription yet, get one today at the Azure web site. It is totally free and you can get either 12 months of free subscription (for popular free services - the complete list is available on the link) or a credit of $200 USD to fully explore Azure for 30 days. For using Azure Databricks, the latter is the one you need.
2. Create the Azure Databricks service
Once logged in to your Azure portal, you will be directed to the Azure dashboard, which is your UI into the Azure Cloud service. In the search box type "Azure Databricks" or select it if it is recommended to you.
After selection, you will get a dialog window where you select and insert:
 your Azure subscription
 the name of the Azure Databricks Workspace
 Resource group
 Pricing Tier and
 Region

Once you have done the most important part, you are left with the selection of detailed network settings and Tags, and you are finished.
After this, you can check your Databricks services:

And select the newly created Azure Databricks Service to get the overview page:
And just Launch the Workspace!
As always, the code and notebooks are available at the Github repository.
Stay Healthy! See you tomorrow.

Dec 03 2020 - Getting to know the workspace and Azure Databricks platform

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
We have learned what Azure Databricks is and looked at how to get started with the platform. Now that we have this covered, let's get familiar with the workspace and the platform.
On Azure portal go to Azure Databricks services and launch workspace.
You will be redirected and signed in to the Azure Databricks platform. You will also see that the IAM integration with Azure Active Directory Single Sign-On is done smoothly. This is especially welcome for enterprises and businesses, since the whole IAM policy can be federated using AD.

On the main console page of Azure Databricks you will find the following sections:
1. Main vertical navigation bar, which is available all the time and gives users simple transitions from one task (or page) to another.
2. Common tasks, to get started immediately with one desired task.
3. Importing & Exploring data, for drag-and-dropping your external data into the DBFS system.
4. Starting a new notebook or getting additional information from the Databricks documentation and Release notes.
5. Settings, for any user settings, administration of the console and management.
When you are using Azure Databricks, the vertical navigation bar (1) and Settings (5) will always be available for you to access.
Navigation bar
Thanks to the intuitive and self-explanatory icons and names, there is no need to
explain what each icon represents.

 Home - this will always get you back to the console page, no matter where you are.
 Workspaces - this page is where all the collaboration happens, where users have their data, notebooks and all their work at their disposal. From a data engineer, data scientist or machine learning engineer point of view, Workspaces is by far the most important section.
 Recents - where you will find all recently used documents, data and services in Azure Databricks.
 Data - the access point to all the data - databases and tables that reside on DBFS and as files; in order to see the data, a cluster must be up and running, due to the nature of Spark data distribution.
 Clusters - the VMs in the background that run Azure Databricks. Without a cluster up and running, Azure Databricks will not work. Here you can set up a new cluster, shut down a cluster, manage a cluster, attach a cluster to a notebook or to a job, create a job cluster and set up pools. These are the "horses" behind the code; the compute power is decoupled from the notebooks in order to give it scalability.
 Jobs - an overview of scheduled (crontab-style) jobs that are executing and are available to the user. This is the control centre for job overview, job history, troubleshooting and administration of jobs.
 Models - a page that gives you an overview and tracking of your machine learning models, operations over a model, artefacts, metadata and parameters for a particular model or a run of a model.
 Search - a fast, easy and user-friendly way to search your workspace.
Settings
Here you will have an overview of your service, user management and account:
 User settings - where you can set up personal access tokens for the Databricks API, manage Git integration and notebook settings
 Admin console - where an administrator sets IAM policies, security and group access, and enables/disables additional services such as Databricks genomics, Container services, workspace behaviour, etc.
 Manage account - will redirect you to the start page on the Azure dashboard for managing the Azure account that you are using to access Azure Databricks.
 Log Out - will log you out of Azure Databricks.
This will get you around the platform. Tomorrow we will start exploring the clusters!
Complete set of code and Notebooks will be available at the Github repository.
Stay Healthy! See you tomorrow.

Dec 04 2020 - Creating your first Azure Databricks cluster
Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
On day 4, we have come so far that we are ready to explore how to create an Azure Databricks cluster. We have already learned that a cluster is an Azure VM, created in the background to give compute power, storage and scalability to the Azure Databricks platform.
On the vertical navigation bar select Clusters in order to get to the Clusters subpage.

This page will give you the list of existing clusters:
 Name of the cluster
 Status (Running, Terminated, Deleted, etc.)
 Nodes
 Runtime (Spark version installed on the VM)
 Driver type (type of VM used for running this cluster)
 Worker (type of VM, e.g.: 4 Cores, 0.90 DBU, etc.)
 Creator
 Actions (by hovering over, you will receive additional information)
By clicking on an existing cluster, you will see the following information, which you can configure (not all of it, as some fields are greyed out, as seen on the screenshot), attach the cluster to notebooks, install additional packages and get access to the Spark UI, Driver Logs and Metrics for easier troubleshooting.
When selecting and creating a new Azure Databricks cluster, however, you will get almost all attributes available for defining, in order to create a cluster tailored to your needs.
You will need to provide the following information for creating a new cluster:
You will need to provide the following information for creating a new cluster:
1. Cluster Name - be creative, but still stick to a naming convention and give a name that will also include the Worker Type, Databricks Runtime, Cluster Mode, Pool, Azure Resource Group, Project name (or the task you are working on) and environment type (DEV, TEST, UAT, PROD). The more information you encode, the better.
2. Cluster Mode - Azure Databricks supports three types of clusters: Standard, High Concurrency and Single Node. Standard is the default selection and is primarily used for single-user environments, and supports any workload using languages such as Python, R, Scala, Spark or SQL. High Concurrency mode is a managed cloud resource designed to handle workloads for many users. Its main benefit is that it provides an Apache Spark-native environment for sharing maximum resource utilisation and providing minimum query latencies. It supports languages such as Python, R, Spark and SQL, but does not support Scala, because Scala does not support running user code in separate processes. This cluster mode also supports TAC - table access control - for a finer-grained level of access security, granting more detailed permissions on SQL tables. Single Node has no workers and runs Spark jobs on the driver node. What does this mean in plain English: work will not be distributed among workers, resulting in poorer performance.
3. Pool - as of writing this post, this feature is still in Public Preview. It will create a pool of instances (so you need more predefined instances) for better response and up-times. A pool keeps a defined number of instances in ready mode (idle) to reduce the cluster start time. A cluster needs to be attached to the pool (after creation of the cluster or, if you already have a pool, it will automatically be available) in order to have its driver and worker nodes allocated from the pool.
4. Databricks Runtime Version - is the image of the Databricks version that will be created on every cluster. Images are designed for particular types of jobs (Genomics, Machine Learning, Standard workloads) and for different versions of Spark and Databricks. When selecting the right image, remember the abbreviations and versions. Each image has a version of Scala / Spark, and there are some significant differences. General images have up to 6 months of bug fixes and 12 months of Databricks support. If the image is marked LTS (Long Term Support), this period extends to 24 months of support. In addition, the ML abbreviation stands for Machine Learning, bringing additional packages for machine learning tasks to the image (these can also be added to a general image, but the out-of-the-box solution will be better). And GPU denotes software optimised for GPU tasks.

5. Worker and Driver Type gives you the option to select the VM that will suit your needs. For first-timers, keep the default Worker and Driver type selected. Later you can explore and change the DBU (Databricks Units) for higher performance. There are three types of workloads to be understood - All-Purpose Compute, Jobs Compute and Jobs Light Compute - and many instance types: General Purpose, Memory Optimized, Storage Optimized, Compute Optimized and GPU Optimized. All come with different pricing plans and sets of tiers and regions.
All workers have a minimum and maximum number of nodes available. The more you want to scale out, the more workers you should give your cluster. The DBU count will change as more workers are added.
6. Autoscaling - is the tick option that gives you the capability to scale automatically between the minimum and maximum number of nodes (workers) based on the workload.
7. Termination - is the timeout in minutes; when there is no work for the given period, the cluster will terminate. Expect different behaviour when the cluster is attached to a pool.
Explore also the advanced options, where additional Spark configuration and runtime variables can be set. This is very useful when fine-tuning the behaviour of the cluster at startup. Also add Tags (as key-value pairs) to keep additional metadata on your cluster, and you can provide an init script, stored on DBFS, that can initiate a job or load some data or models at start time.
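If you prefer to script this step, the same choices can also be expressed as a payload for the Databricks Clusters API. The following is a minimal, hedged sketch in Python - the name, runtime version, node type and tags are illustrative assumptions, not values taken from this post, and the workspace URL and personal access token (covered on Day 8) are placeholders:
import requests  # hypothetical direct call to the Clusters API endpoint api/2.0/clusters/create

# Illustrative values only - adjust names, node type and runtime to your workspace and region
cluster_spec = {
    "cluster_name": "dev-standard-ds3v2-project-py",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60,
    "custom_tags": {"project": "advent2020", "env": "DEV"},
}

# <workspace-url> and <personal-access-token> are placeholders, not real values
response = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
print(response.json())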

Once you have selected the cluster options suited for your needs, you are ready to
hit that "Create cluster" button.
Tomorrow we will cover basics on architecture of clusters, workers, DBFS storage and
how Spark handles jobs.
Complete set of code and Notebooks will be available at the Github repository.
Stay Healthy! See you tomorrow.
Dec 05 2020 - Understanding Azure Databricks cluster architecture, workers, drivers and jobs

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
Yesterday we unveiled a couple of concepts about the workers, drivers and how autoscaling works. In order to explore the services behind it, start up the cluster we created yesterday (if it was automatically terminated or you stopped it manually).
The cluster is starting up (when it is started, the green loading circle will become full):

My cluster is a Standard DS3_v2 cluster (4 cores) with Min 2 and Max 8 workers. The same applies for the driver. Once the cluster is up and running, go to the Azure Portal. Look for the resource group that you created at the beginning (Day 2) when we started the Databricks service. I have named my resource group "RG_DB_py" (naming is important! RG - ResourceGroup; DB - Service DataBricks; py - my project name). Search for the correct resource:
And Select "Resource Groups" and find your resource group. I have a lot of resource
groups, since I try to bundle the projects to a small groups that are closely related:

Find yours and select it, and you will find the Azure Databricks service that belongs to this resource group.
Databricks creates an additional (automatically generated) resource group to hold all the services (storage, VM, network, etc.). It follows a naming convention:
RG_DB_py is my resource group. What Azure does in this case is prefix and suffix your resource group name, as in: databricks_rg_DB_py_npkw4cltqrcxe. The prefix will always be "databricks_rg" and the suffix will be a 13-character random string for uniqueness - in my case: npkw4cltqrcxe. Why a separate resource group? Everything used to be under the same resource group, but decoupling and having the services in a separate group makes it easier to start/stop services, manage IAM, create pools and scale. Find your resource group and see what is inside:

In the detailed list you will find the following resources (in accordance with my Standard DS3_v2 cluster):
 Disk (9x Resources)
 Network Interface (3x resources)
 Network Security group (1x resource)
 Public IP address (3x resources)
 Storage account (1x resource)
 Virtual Machine (3x resources)
 Virtual network (1x resource)
Inspect the naming of these resources: you can see that the names are GUID-based, but the names repeat across the different resources and can easily be bundled together. Drawing the components together gives a full picture of it:
At a high level, the Azure Databricks service manages the worker nodes and driver node in the separate resource group, which is tied to the same Azure subscription (for easier scalability and management). The platform, or "appliance" or "managed service", is deployed as a set of Azure resources, and Databricks manages all other aspects. The additional VNet, security groups, IP addresses and storage accounts are ready to be used by the end user and managed through the Azure Databricks portal (UI). Storage is also replicated (geo-redundant replication) for disaster scenarios and fault tolerance. Even when the cluster is turned off, the data is persisted in storage.
A cluster node is a virtual machine that has blob storage attached to it. The virtual machine is running Linux Ubuntu (16.04 as of writing this) and has 4 vCPUs and 14 GiB of RAM. The workers are using two virtual machines, and the same virtual machine type is reserved for the driver. This is what we set up on Day 4.

Since each VM is the same (for worker and driver), the workers can be scaled based on the vCPUs. Two VMs for workers, with 4 cores each, gives a maximum of 8 workers, so each vCPU / core is considered one worker. And the driver machine (also a VM with Linux Ubuntu) is the manager machine for load distribution among the workers.
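If you want to verify this from a notebook, a quick sketch like the one below shows the parallelism the cluster exposes (it assumes the pre-defined spark and sc objects of a Databricks notebook; the exact numbers will depend on your cluster configuration):
# defaultParallelism roughly corresponds to the total number of worker cores available
print(sc.defaultParallelism)

# Partitions of a simple distributed dataset are spread across those cores
print(spark.range(0, 1000000).rdd.getNumPartitions())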
Each virtual machine is set up with a public and a private subnet, and all are mapped together in a virtual network (VNet) for secure connectivity and communication of workloads and data results. And each VM has a dedicated public IP address for communication with other services or Databricks Connect tools (I will talk about this in later posts).
Disks are also bundled in three types, each for one VM. These types are:
 Scratch Volume
 Container Root Volume
 Standard Volume
Each type has a specific function, but all are designed for optimised performance data caching, especially delta caching. This means faster data reading by creating copies of remote files in the nodes' local storage using a fast intermediate data format. The data is cached automatically, even when a file has to be fetched from a remote location, and this also performs well for repetitive and successive reads. Delta caching (as part of Spark caching) is supported for reading only Parquet files in DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. Optimised storage (Spark caching) does not support file types such as CSV, JSON, TXT, ORC, XML.
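If you want to experiment with this behaviour, a minimal sketch for toggling the disk (delta) cache from a notebook could look like this - assuming a Databricks runtime where this Spark configuration key is available, and with an illustrative Parquet path of your own:
# Enable the Databricks delta (disk) cache for the current session
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Subsequent reads of Parquet data can then be served from the nodes' local cache
df = spark.read.parquet("dbfs:/path/to/your/parquet")  # illustrative path, replace with your own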
When a request is pushed from the Databricks portal (UI), the main driver accepts the request and, by using Spark jobs, pushes the workload down to each node. Each node has shards and copies of the data, or it gets the data through DBFS from Blob Storage, and executes the job. After execution, the summaries / results of each worker node are gathered again by the driver, and the driver node returns the results in an orderly manner back to the UI.

The more worker nodes you have, the more a request can be executed in parallel. And the more workers you have available (or in ready mode), the more you can "scale" your workloads.
Tomorrow we will start working our way towards importing and storing the data, see how it is stored on blob storage, and explore the different types of storage that Azure Databricks provides.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 06 2020 - Importing and storing data to Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
Yesterday we started exploring the Azure services that are created when using Azure Databricks. One of the services that I would like to explore today is storage, and especially how to import and how to store data.
Log in to Azure Databricks and on the main (home) site select "Create Table" under the recommended common tasks. Don't start your cluster yet (if it's running, please terminate it for now).

This will prompt you with a variety of actions for importing data to DBFS or connecting Azure Databricks with other services.
Drag the data file (available on Github in the data folder) named Day6data.csv to the square for upload. For easier understanding, let's check the CSV file schema (a simple one, three columns: 1. Date (datetime format), 2. Temperature (integer format), 3. City (string format)).

But before you start uploading the data, let's check the Azure resource group. I have not yet started any Databricks cluster in my workspace. And here you can see that the VNet, Storage and Network Security group will always be available for the Azure Databricks service. Only when you start the cluster will additional services (IP addresses, disks, VMs, ...) appear.

This gives us a better idea of where and how data is persisted. Your data will always be available and stored on blob storage. Meaning, even if you decide not only to terminate the cluster but to delete the cluster as well, your data will always be safely stored. When you add a new cluster to the same workspace, the cluster will automatically retrieve the data from blob storage.
1. Import
Drag and drop the CSV file in the "Drop zone" as discussed previously. It should look like this:
You have now two options:
 create table with UI
 create table in Notebook
Select the "Create table with UI". Only now you will be asked to select the cluster:

Now select the "Create table in Notebook" and Databricks will create a first
Notebook for you using Spark language to upload the data to DBFS.
In case I want to run this notebook, I will need to have my cluster up and running. So let's start a cluster. On your left vertical navigation bar, select the Clusters icon. You will get the list of all the clusters you are using. Select the one we created on Day 4. If you want, check the resource group of your Azure Databricks to see all the running VMs, disks and VNets.
Now insert the data using the import method, by dragging and dropping the CSV file into the "Drop Zone" (repeat the process) and hit "Create Table with UI". Now you should have a cluster available. Select it and preview the table.
You can see that the table name is propagated from the filename, the file type is automatically selected and the column delimiter is automatically selected. Only "First row in header" should be selected in order to have the columns properly named and the data types set correctly.

Now we can create a table. After Databricks finishes, a report will be presented with a recap of the table location (yes, location!), the schema and an overview of sample data.
This table is now available on my cluster. What does this mean? This table is now persisted not only on your cluster, but in your Azure Databricks Workspace. This is important for understanding how and where data is stored. Go to the Data icon on the left vertical navigation bar.

This database is attached to my cluster. If I terminate my cluster, will I lose my data?
Try stopping the cluster and check the data again. And bam... the database is not available, since there is no cluster "attached" to it.
But hold your horses. The data is still available on blob storage, just not visible through DBFS. The database will be visible again when you start your cluster.
2. Storing data to DBFS
DBFS - the Databricks File System - is a distributed file system mounted into an enclosed Azure Databricks workspace and available on Azure Databricks clusters through the UI or notebooks. In this way, DBFS is a decoupled data layer (or abstraction layer) on top of scalable Azure object storage, and it offers the following benefits:
 easy communication and interaction with object storage using bash / CLI command line
 data is always persistent
 mounting storage objects is easy and accessing them is seamless
 no additional credentials are needed, since you are "locked" into the Azure workspace.
Storage is located at the root and there are some folders created at the following locations:
 dbfs:/root - is a root folder
 dbfs:/filestore - folder that holds imported data files, generated plots, tables, and
uploaded libraries
 dbfs:/databricks - folder for mlflow, init scripts, sample public datasets, etc.
 dbfs:/user/hive - data and metadata to hive (SQL) tables
You will find many other folders that will be generated through notebooks.
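You can inspect these locations directly from a notebook with dbutils; here is a small sketch (the exact folder list on your workspace may differ):
# List the top-level DBFS folders mentioned above
display(dbutils.fs.ls("dbfs:/"))

# Look inside the FileStore folder, where uploaded files land
display(dbutils.fs.ls("dbfs:/FileStore/"))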
Before we begin, let's make your life easier. Go to the admin console settings, select the Advanced tab and find "DBFS File browser". By default, this option is disabled, so let's enable it.

This will enable you to view the data through the DBFS structure and give you an upload option and a search option.
Uploading files will now be easier, and uploads will be seen immediately in FileStore. There is the same file, prefixed as Day6Data_dbfs.csv, in the Github data folder, which you can upload manually, and it will be seen in FileStore:

Tomorrow we will explore how we can use a notebook to access this file with different commands (CLI, Bash, Utils, Python, R, Spark). And since we will be using notebooks for the first time, we will do a little exploration of notebooks as well.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 07 2020 - Starting with Databricks notebooks and loading data to DBFS

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
Yesterday we started working towards data import and how to use the drop zone to import data to DBFS. We also created our first notebook, and this is where I would like to start today - with a light introduction to notebooks.
What are Notebooks?
A notebook is a powerful document that integrates an interactive computing environment for data engineers, data scientists and machine learning engineers. It supports multiple kernels (compute environments) and multiple languages. The most common notebook is the Jupyter Notebook, its name suggesting and providing an acronym for Julia, Python and R. Usually a notebook will consist of text, rich text, HTML, figures, photos, videos, and all sorts of engineering blocks of code or text. These blocks of code can be executed, since the notebooks are part of a client-server web application. Azure Databricks notebooks are a type of ipynb notebook, the same format as Jupyter notebooks. The Databricks environment provides the client and the server, and you do not have to worry about installation or setup. Once the Databricks cluster is up and running, you are good to go.
On your home screen, select "New Notebook"
and give it a Name, Language and Cluster.

Databricks notebooks support multiple languages and you can seamlessly switch the language within the notebook, without the need to switch notebooks. If the notebook is the set of instructions describing what to do, the cluster is the engine that will execute all the instructions. Select the cluster that you created on Day 4. I am inserting the following:
Name: Day7_DB_Py_Notebook
Default Language: Python
Cluster: databricks_cl1_standard
If your clusters are not started, you can still create a notebook and later attach the selected cluster to the notebook.

A notebook consists of cells that can be either formatted text or code. Notebooks are saved automatically. Under File, you will find useful functions to manage your notebooks, such as: Move, Clone, Rename, Upload, Export. Under the Edit menu, you will be able to work with cells and code blocks. Run All is a quick function to execute all cells at one time (or, if you prefer, you can run cells one by one, or run all cells below or above a selected cell). Once you start writing formatted text (Markdown, HTML, others), Databricks will automatically start building a table of contents, giving you a better overview of your content.
Let's start with Markdown, write a title and some text in the notebook and add some Python code. I have inserted:
%md # Day 7 - Advent of Azure Databricks 2020

%md
## Welcome to day 7.
In this document we will explore how to write a notebook, use different languages and
import a file.

%md Default language in this notebook is Python. So if we want to add a text cell,
instead of Python, we need to explicitly
set **%md** at the beginning of each cell. In this way, language of execution
will be re-defined.

%md Now, let's start using Python

dbutils.fs.put("/FileStore/Day7/SaveContent.txt", 'This is the sample')


And the result is a perfect notebook with a heading, subtitle and text. In the middle, the table of contents is generated automatically.

Under View, change from Standard to Side-by-side and you can see the code and the rendered notebook on the other side. This is useful for copying, changing or debugging the code.
Each text cell has %md at the beginning, converting the text to rich text - Markdown. The last cell is Python:
dbutils.fs.put("/FileStore/Day7/SaveContent.txt", 'This is the sample')
It generated a txt file in the FileStore. The file can also be seen in the left pane. DbUtils - Databricks utilities - is a set of utility tools for efficiently working with object storage. dbutils is available in R, Python and Scala.
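A few more dbutils.fs commands tend to be useful when working with files; here is a short sketch reusing the path from this post:
# Built-in help for the file system utilities
dbutils.fs.help()

# Preview the first bytes of the file we just wrote
print(dbutils.fs.head("/FileStore/Day7/SaveContent.txt"))

# Create a folder, copy the file into it, then clean up again
dbutils.fs.mkdirs("/FileStore/Day7/backup")
dbutils.fs.cp("/FileStore/Day7/SaveContent.txt", "/FileStore/Day7/backup/SaveContent.txt")
dbutils.fs.rm("/FileStore/Day7/backup", True)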
Importing a file
In a notebook we have the ability to use multiple languages. Mixing Python with Bash, Spark and R is something common. But in this case, we will use DbUtils - a powerful set of functions. Learn to like it, because it will be utterly helpful.
Let us use DbUtils and Spark, and then R, to import the file into a data.frame.
dbutils.fs.ls("dbfs:/FileStore")
df = spark.read.text("dbfs:/FileStore/Day6Data_dbfs.csv")

df.show()
And the result is:
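Since Day6Data_dbfs.csv is a delimited file, an alternative sketch is to read it as structured data rather than plain text - assuming the semicolon separator that the R example below also uses:
# Read the same file as a structured DataFrame instead of plain text
df_csv = (spark.read
          .option("header", "true")
          .option("sep", ";")
          .option("inferSchema", "true")
          .csv("dbfs:/FileStore/Day6Data_dbfs.csv"))
df_csv.show()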

And do the same for R Language:


%r
library(dplyr)

%r
Day6_df <- read.csv(file = "/dbfs/FileStore/Day6Data_dbfs.csv", sep=";")
head(Day6_df)
Day6_df <- data.frame(Day6_df)

%md Let's do a quick R analysis

%r
library(dplyr)
Day6_df %>%
  group_by(city) %>%
  summarise(mean = mean(mean_daily_temp), n = n())
And the result is the same, just using R Language.
Tomorrow we will use the Databricks CLI and DBFS API to upload files from, e.g., your client machine to the FileStore. In this way, you will be able to migrate and upload files to Azure Databricks in no time.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 08 2020 - Using Databricks CLI and DBFS CLI for file upload
Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
Yesterday we worked towards using notebooks and how to read data using notebooks.
Today we will check out the Databricks CLI and look into how you can use the CLI to upload (copy) files from your remote server to DBFS.
The Databricks CLI is a command-line interface that provides an easy-to-use interface to the Databricks platform. It belongs to the group of developer tools and should be easy to set up and straightforward to use. You can automate many of the tasks with the CLI.
1. Installing the CLI
Using Python 3.6 (or above), run the following pip command in CMD:
pip3 install databricks-cli
But before using the CLI, a personal access token needs to be created for authentication.
2. Authentication with Personal Access Token
On your Azure Databricks Workspace home screen go to settings:

And select User settings to get the list of Access Tokens.

Click on Generate New Token and in dialog window, give a token name and lifetime.

After the token is generated, make sure to copy it, because you will not be able to see it later. A token can be revoked (when needed); otherwise it has an expiry date (in my case 90 days). So make sure to remember to renew it after the lifetime period!
3. Working with CLI
Go back to CMD and run the following:
databricks --version
will give you the current version you are rocking. After that, let's configure the
connectivity.
databricks configure --token
and you will be prompted to insert two pieces of information:
 the host ( in my case: https://adb-8606925487212195.15.azuredatabricks.net/)
 the token

The host is available for you in your browser. Go to the Azure Databricks browser tab and copy-paste the URL:

And the token is the one that was generated for you in step two. The token should look like: dapib166345f2938xxxxxxxxxxxxxxc.
Once you insert both pieces of information, the connection is set!
By using bash commands, you can now work with DBFS from your local machine / server using the CLI. For example:
databricks fs ls
will list all the files in the root folder of DBFS in your Azure Databricks workspace.

4. Uploading a file using the DBFS CLI
Databricks has already shorthanded / aliased the databricks fs command to simply dbfs. Essentially, the following commands are equivalent:
databricks fs ls
dbfs ls
So using the DBFS CLI means, in other words, using the Databricks File System CLI. And with this, we can start copying a file. Copying from my local machine to Azure Databricks should look like:
dbfs cp /mymachine/test_dbfs.txt dbfs:/FileStore/file_dbfs.txt

My complete bash code (as seen on the screenshot) is:
pwd
touch test_dbfs.txt
dbfs cp test_dbfs.txt dbfs:/FileStore/file_dbfs.txt
And after refreshing the data in my Databricks workspace, you can see that the file is there. The commands pwd and touch are here merely for demonstration.
This approach can be heavily automated for daily data loads to Azure Databricks,
delta uploads, data migration or any other data engineering and data movement
task. And also note that the Databricks CLI is a powerful tool with broader usage.
Tomorrow we will check how to connect Azure Blob storage with Azure Databricks
and how to read data from Blob Storage in Notebooks.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 09 2020 - Connect to Azure Blob storage using Notebooks in Azure Databricks
Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
Yesterday we introduced the Databricks CLI and how to upload a file from "anywhere" to Databricks. Today we will look at how to use Azure Blob Storage for storing files and accessing the data using Azure Databricks notebooks.
1. Create Azure Storage account
We will need to go outside of Azure Databricks to Azure portal. And search for
Storage accounts.

Create a new storage account by clicking on "+ Add", and select the subscription, resource group, storage account name, location, account type and replication.
Continue to set up networking, data protection and advanced settings, and create the storage account. When you are finished with the storage account, we will create the storage itself. Note that General Purpose v2 storage accounts support the latest Azure Storage features and all the functionality of General Purpose v1 and Blob storage accounts. General Purpose v2 accounts bring the lowest per-gigabyte capacity prices for Azure Storage and support the following Azure Storage services:
 Blobs (all types: Block, Append, Page)
 Data Lake Gen2
 Files
 Disks
 Queues
 Tables
Once the account is ready to be used, select it and choose "Container".
A container is blob storage for unstructured data and will communicate with Azure Databricks DBFS perfectly. In the Containers section, select "+ Container" to add a new container and give the container a name.

Once the container is created, click on the container to get additional details.

Your data will be stored in this container and later used with Azure Databricks notebooks. You can also access the storage using Microsoft Azure Storage Explorer. It is much more intuitive and offers easier management, folder creation and binary file management.
You can upload a file using the Microsoft Azure Storage Explorer tool or directly in the portal. But in an organisation, you will have files and data copied here automatically using many other Azure services. Upload a file that is available for you in the Github repository (data/Day9_MLBPlayers.csv - the data file is licensed under GNU) to the blob storage container in any desired way. I have used Storage Explorer and simply dragged and dropped the file into the container.

2. Shared Access Signature (SAS)
Before we go back to Azure Databricks, we need to set the access policy for this container. Select "Access Policy".
We need to create a Shared Access Signature, which is the general Microsoft way to grant access to the storage account. Click on Access policy in the left menu and, once the new page loads, select "+ Add Policy" under Shared access policies and give it a name, access rights and a validity period:

Click OK to confirm and click Save (the save icon). Go back to the storage account and on the left select Shared Access Signature.
Under Allowed resource types, it is mandatory to select Container, but you can select all. Set the start and expiry dates - 1 month in my case. Select the button "Generate SAS and connection string" and copy-paste the needed strings; the connection string and SAS token should be enough (copy and paste them to a text editor).
Once this is done, let's continue with Azure Databricks notebooks.
3. Creating notebooks in Azure Databricks
Start up a cluster and create new notebooks (as we have discussed on Day 4 and Day
7). The notebook is available at Github.
And the code is:
%scala
val containerName = "dbpystorecontainer"
val storageAccountName = "dbpystorage"
val sas = "?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-12-09T06:15:32Z&st=2020-12-08T22:15:32Z&spr=https&sig=S%2B0nzHXioi85aW%2FpBdtUdR9vd20SRKTzhNwNlcJJDqc%3D"
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"
Then mount the storage with the mount function, passing the SAS credential defined above through the extra configuration:
%scala
dbutils.fs.mount(
  source = "wasbs://dbpystorecontainer@dbpystorage.blob.core.windows.net/Day9_MLBPlayers.csv",
  mountPoint = "/mnt/storage1",
  extraConfigs = Map(config -> sas))
When you run the following Scala command, it will generate a DataFrame called mydf1:
%scala
val mydf1 = spark.read
  .option("header","true")
  .option("inferSchema", "true")
  .csv("/mnt/storage1")
display(mydf1)
And now we can start exploring the dataset - and I am using the R language.
This was a long but important topic that we have addressed. Now you know how to access and store data.
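For completeness, a hedged Python equivalent of the same read over the mount point (assuming the mount above succeeded) could look like this:
# Read the mounted CSV into a PySpark DataFrame
mydf_py = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/mnt/storage1"))
display(mydf_py)

# List the currently defined mounts to verify that /mnt/storage1 is there
display(dbutils.fs.mounts())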
Tomorrow we will check how to start using Notebooks and will be for now focusing
more on analytics and less on infrastructure.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 10 2020 - Using Azure Databricks Notebooks with SQL for Data engineering tasks
Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
Yesterday we were working our way towards getting data from Azure Storage to Azure Databricks using the dbfs mount function and sorted out the credentials (secret, key).
Log into the Azure Databricks portal and create a new notebook (Day 7 explains how to create a notebook). In this notebook, we will use SQL to explore data engineering tasks. I have given mine the name Day10_SQL_EngineeringTasks, selected the default language Python, and attached a cluster to the notebook.

And here is the Notebook, and how it looks:


1. Exploring Databases with SHOW
SHOW is a useful clause to see what your database holds.
%sql
SHOW TABLES;
%sql
SHOW TABLES FROM default;
%sql
SHOW TABLES IN default LIKE 'day6*'
2. Creating a database and getting information with DESCRIBE
Creating a database is simple: define the location and add some information.
%sql
CREATE DATABASE IF NOT EXISTS Day10 COMMENT 'This is a sample database for day10'
LOCATION '/user';
Getting some additional information can be done with DESCRIBE clause.
%sql
DESCRIBE DATABASE EXTENDED Day10;
3. Creating tables and connecting them with a CSV
For the underlying CSV we will create a table. We will be using the CSV file from Day 6, and it should still be available at the location dbfs:/FileStore/Day6Data_dbfs.csv. This dataset has three columns (Date, Temperature and City) and should be a good starting example.
%sql
USE Day10;

DROP TABLE IF EXISTS temperature;


CREATE TABLE temperature (date STRING, mean_daily_temp STRING, city STRING)
And we can check the content of the table and the database:
%sql
USE Day10;

SELECT * FROM temperature


%sql
SHOW TABLES IN Day10;
And now connect CSV with the table (or view):
%sql
USE Day10;

DROP VIEW IF EXISTS temp_view2;


CREATE TEMPORARY VIEW temp_view2
USING CSV
OPTIONS (path "/FileStore/Day6Data_dbfs.csv", header "true", mode "FAILFAST")
And check the content:
%sql
USE Day10;
SELECT * FROM temp_view2
If you want to change the data type of a particular column, you can do it as follows:
%sql
USE Day10;

ALTER TABLE temperature CHANGE COLUMN mean_daily_temp mean_daily_temp INT


4. Creating a JOIN between two tables
Let's create two sample tables :
%sql
USE Day10;

DROP TABLE IF EXISTS temp1;


DROP TABLE IF EXISTS temp2;

CREATE TABLE temp1 (id_t1 INT, name STRING, temperature INT);


CREATE TABLE temp2 (id_t2 INT, name STRING, temperature INT);
And add some insert statements:
%sql
USE Day10;

INSERT INTO temp1 VALUES (2, 'Ljubljana', 1);


INSERT INTO temp1 VALUES (3, 'Seattle', 5);
INSERT INTO temp2 VALUES (1, 'Ljubljana', -3);
INSERT INTO temp2 VALUES (2, 'Seattle', 3);
And create an inner join
%sql
USE Day10;

SELECT
t1.Name as City1
,t2.Name AS City2
,t1.temperature*t2.Temperature AS MultipliedTemperature

FROM temp1 AS t1
JOIN temp2 AS t2
ON t1.id_t1 = t2.id_t2
WHERE
t1.name <> t2.name
LIMIT 1

If you follow the notebook, you will find some additional information, but all in all, Hive SQL is ANSI compliant and getting started should be no problem. When using a notebook, each cell must have a language defined at the beginning, unless it is the language of the kernel: %sql for SQL, %md for Markdown, %r for R, %scala for Scala. Beware, these language magics are case sensitive, so %sql will be interpreted as a SQL script, whereas %SQL will return an error.
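Because the default language of this notebook is Python, the same statements can also be issued without the %sql magic, through spark.sql(); a small sketch:
# Run the same SQL from a Python cell and work with the result as a DataFrame
df_temp = spark.sql("SELECT * FROM day10.temperature")
display(df_temp)

# DDL and metadata queries work the same way
spark.sql("SHOW TABLES IN day10").show()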
Tomorrow we will check and explore how to use R to do data engineering, but
mostly the data analysis tasks. So, stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 11 2020 - Using Azure Databricks Notebooks with R Language for data analytics
Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
We looked into the SQL language and how to get some basic data preparation done. Today we will look into R and how to get started with data analytics.
Creating a data.frame (or getting data from a SQL table)
Create a new notebook (Name: Day11_R_AnalyticsTasks, Language: R) and let's go. Now we will get data from SQL tables and DBFS files.
We will be using a database from Day10 and the table called temperature.
%sql
USE Day10;

SELECT * FROM temperature


For getting a SQL query result into an R data.frame, we will use the SparkR package.
library(SparkR)
Get the query results into an R data frame (using the SparkR library):
temp_df <- sql("SELECT * FROM temperature")
With this temp_df data.frame we can start using R or SparkR functions, for example viewing the content of the data.frame.
showDF(temp_df)
This is a SparkR data.frame. You can also create an R data.frame by using the as.data.frame function.
df <- as.data.frame(temp_df)
This creates a standard R data.frame that can be used with any other R packages.
Importing a CSV file into an R data.frame
Another way to get data into an R data.frame is to feed it data from a CSV file, and in this case the SparkR library will again come in handy. Once the data is in a data.frame, it can be used with other R libraries.
Day6 <- read.df("dbfs:/FileStore/Day6Data_dbfs.csv", source = "csv",
header="true", inferSchema = "true")
head(Day6)
Doing simple analysis and visualisations
Once the data is available in a data.frame, it can be used for analysis and visualisations.
Let's load ggplot2.
library(ggplot2)
p <- ggplot(df, aes(date, mean_daily_temp))
p <- p + geom_jitter() + facet_wrap(~city)
p
And make the graph smaller and give it a theme.
options(repr.plot.height = 500, repr.plot.res = 120)
p + geom_point(aes(color = city)) + geom_smooth() +
theme_bw()

Once again, we can use other data wrangling packages. Both dplyr and ggplot2 are preinstalled on the Databricks cluster.
library(dplyr)
When you load a library, nothing might be returned as a result. In case of warnings, Databricks will display them. The dplyr package can be used like any other package, absolutely normally, without any limitations.
df %>%
  dplyr::group_by(city) %>%
  dplyr::summarise(
    n = dplyr::n()
    ,mean_pos = mean(as.integer(df$mean_daily_temp))
  )
#%>% dplyr::filter( as.integer(df$date) > "2020/12/01")

But note(!): dplyr functions might not work, due to the collision of function names with the SparkR library. SparkR has functions with the same names (arrange, between, coalesce, collect, contains, count, cume_dist, dense_rank, desc, distinct, explain, filter, first, group_by, intersect, lag, last, lead, mutate, n, n_distinct, ntile, percent_rank, rename, row_number, sample_frac, select, sql, summarize, union). In order to solve this collision, either detach the dplyr package (detach("package:dplyr")) or qualify the function with the package name, e.g. dplyr::summarise instead of just summarise.
Creating a simple linear regression
We can also use many of the R packages for data analysis; in this case I will run a simple regression, trying to predict the daily temperature. Simply run the regression function lm():
model <- lm(mean_daily_temp ~ city + date, data = df)
model
And run the base R function summary() to get model insights.
summary(model)
confint(model)
In addition, you can directly install any missing or needed package in the notebook (the R engine and Databricks Runtime environment version should be taken into account). In this case, I am running the residualPlot() function from the additionally installed package car.
install.packages("car")
library(car)
residualPlot(model)
Azure Databricks will generate an RMarkdown notebook when using R as the kernel language. If you want to create an IPython notebook, make Python the kernel language and use %r for switching to R. Both the RMarkdown notebook and the HTML file (with included results) are included and available on Github.
Tomorrow we will check and explore how to use Python to do data engineering, but
mostly the data analysis tasks. So, stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 12 2020 - Using Azure Databricks Notebooks with Python Language for data analytics
Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
We looked into the SQL and R languages and how to get some basic data preparation done. Today we will look into Python and how to go about data analytics.
Using data frames and getting data from SQL and DBFS
Create a new notebook with Name: Day12_Py_Analytics and Language: Python, and connect the notebook to the cluster we created on Day 4. Let's go and add some data from the FileStore, using the data that we uploaded on Day 6.
csv_df = spark.read.csv("/FileStore/Day6Data_dbfs.csv", header="True")
display(csv_df)

We can also import data from a SQL table into a data frame by simply writing an SQL statement.
#from pyspark.sql.functions import explode
from pyspark.sql import *
import pandas as pd

display(sql("select * from day10.temperature"))


Besides displaying the dataset, you can store the result of a query in a variable and use it later.
#for display
display(sql("select * from day10.temperature"))
#to save to variable
df = sql("select * from day10.temperature")
Now let's get some data from the Databricks sample datasets (available to anybody). You can read data from the DBFS store and use the sample datasets as well, by using Python pandas.
import pandas as pd

dfcovid = pd.read_csv("/dbfs/databricks-datasets/COVID/covid-19-data/us-states.csv")
dfcovid.head()

Now let's scatter plot the number of cases and deaths per state, using the following Python code that can be used directly in Azure Databricks.
# Filter to 2020-12-01 (first of December)
df_12_01 = dfcovid[dfcovid["date"] == "2020-12-01"]

ax = df_12_01.plot(x="cases", y="deaths", kind="scatter",
                   figsize=(12,8), s=100,
                   title="Deaths vs Cases on 2020-12-01 - All States")

df_12_01[["cases", "deaths", "state"]].apply(lambda row: ax.text(*row), axis=1);


And now let's compare only a couple of these extreme states (New York, Texas, California and Florida) and create a subset for only these four states:
df_ny_cal_tex_flor = dfcovid[(dfcovid["state"] == "New York") | (dfcovid["state"] == "California") |
                             (dfcovid["state"] == "Florida") | (dfcovid["state"] == "Texas")]
And now to create an index for the plot of deaths over time
df_ny_cal_tex_flor = df_ny_cal_tex_flor.pivot(index='date', columns='state',
values='deaths').fillna(0)
df_ny_cal_tex_flor
and now plot the data using this dataset:
df_ny_cal_tex_flor.plot.line(title="Deaths 2020-01-25 to 2020-12-10 - CA, NY, TX, FL", figsize=(12,8))
And now for a simple regression analysis, we will split the data into train and test sets. Because of the first and second waves, we need to think about how to split the data. Let's split it at mid-November: training data up to mid-November and test data after it.
train_df = dfcovid[(dfcovid["date"] >= "2020-07-01") & (dfcovid["date"] <= "2020-11-15")]
test_df = dfcovid[dfcovid["date"] > "2020-11-16"]

X_train = train_df[["cases"]]
y_train = train_df["deaths"]

X_test = test_df[["cases"]]
y_test = test_df["deaths"]
We will use scikit-learn to do simple linear regression.
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X_train, y_train)
print(f"num_deaths = {lr.intercept_:.4f} + {lr.coef_[0]:.4f}*cases")
If we have no cases, then there should be no deaths caused by COVID-19; this gives us a baseline, so let's set the intercept to 0.
lr = LinearRegression(fit_intercept=False).fit(X_train, y_train)
print(f"num_deaths = {lr.coef_[0]:.4f}*cases")
This model implies that there is a 2.68% mortality rate in our dataset. But we know that some states have higher mortality rates, and a linear model is absolutely not ideal for this; it is just a showcase for using Python in Databricks.
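As a quick sanity check, one could also score the fitted model on the held-out test split; a short sketch using scikit-learn metrics:
from sklearn.metrics import mean_absolute_error, r2_score

# Predict deaths for the period after mid-November and compare with the actual values
y_pred = lr.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.1f} deaths")
print(f"R^2: {r2_score(y_test, y_pred):.3f}")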
Tomorrow we will check and explore how to use Python Koalas to do data
engineering, so stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 13 2020 - Using Python Databricks Koalas with Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
So far we have looked into SQL, R and Python, and this post will be about the Python Koalas package - a special implementation of the pandas DataFrame API on Apache Spark. Data engineers and data scientists love Python pandas, since it makes data preparation easier, faster and more productive. And Koalas is a direct "response" to make writing and coding on Spark easier and more familiar. Also follow the official documentation with a full description of the package.
Koalas comes pre-installed on Databricks Runtime 7.1 and above, and we can use the package directly in an Azure Databricks notebook. Let us check the Runtime version. Launch your Azure Databricks environment, go to clusters and there you should see the version:

My cluster is rocking Databricks Runtime 7.3. So create a new notebook, name it Day13_Py_Koalas and select the Language: Python. And attach the notebook to your cluster.
1. Object Creation
Before going into sample Python code, we must import the following packages, pandas and numpy, so we can create data or convert it from/to Databricks Koalas.
import databricks.koalas as ks
import pandas as pd
import numpy as np
Creating a Koalas Series by passing a list of values, letting Koalas create a default
integer index:
s = ks.Series([1, 3, 5, np.nan, 6, 8])
Creating a Koalas DataFrame by passing a dict of objects that can be converted to
series-like.
kdf = ks.DataFrame(
{'a': [1, 2, 3, 4, 5, 6],
'b': [100, 200, 300, 400, 500, 600],
'c': ["one", "two", "three", "four", "five", "six"]},
index=[10, 20, 30, 40, 50, 60])
with the result:
Now, let's create a pandas DataFrame by passing a numpy array, with a datetime
index and labeled columns:
dates = pd.date_range('20200807', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
and getting the results as pandas dataframe:

A pandas dataframe can easily be converted to a Koalas dataframe:


kdf = ks.from_pandas(pdf)
type(kdf)
With type of: Out[67]: databricks.koalas.frame.DataFrame
And we can output the dataframe to get the same result as with pandas dataframe:
kdf
Also, it is possible to create a Koalas DataFrame from Spark DataFrame. We need to
load additional pyspark package first, then create a SparkSession and create a Spark
Dataframe.
#Load package
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
Since spark is lazy we need to explicitly call the show function in order to see the
spark dataframe.
sdf.show()
Creating Koalas DataFrame from Spark DataFrame. to_koalas() is automatically
attached to Spark DataFrame and available as an API when Koalas is imported.
kdf = sdf.to_koalas()
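Going the other way is just as simple; a minimal sketch of converting the Koalas DataFrame back to pandas or Spark (using the kdf from above):
# back to pandas (collects the data to the driver, so use with care on large datasets)
pdf_again = kdf.to_pandas()

# back to a Spark DataFrame (stays distributed)
sdf_again = kdf.to_spark()
sdf_again.show()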

2. Viewing data
See the top rows of the frame. The results may not be the same as pandas though:
unlike pandas, the data in a Spark dataframe is not ordered, it has no intrinsic notion
of index. When asked for the head of a dataframe, Spark will just take the requested
number of rows from a partition.
kdf.head()

You can also display the index, columns, and the underlying numpy data.
kdf.index
kdf.columns
kdf.to_numpy()
You can also use the describe function to get a statistical summary of your data:
kdf.describe()
You can also transpose the data by using the T attribute:
kdf.T
and many other functions. Grouping is another great way to get a summary of your data. Grouping can be done by "chaining" or adding a group by clause. The internal process - when grouping is applied - happens in three steps:
 Splitting the data into groups (based on criteria)
 applying the function and
 combining the results back into a data structure.
kdf.groupby('A').sum()
#or
kdf.groupby(['A', 'B']).sum()
Both are grouping data, first time on Column A and second time on both columns A
and B:
3. Plotting data
Databricks Koalas is also compatible with matplotlib and inline plotting. We need to
load the package:
%matplotlib inline
from matplotlib import pyplot as plt
And we can continue by creating a simple pandas Series:
pser = pd.Series(np.random.randn(1000),
index=pd.date_range('1/1/2000', periods=1000))
that can be simply converted to Koalas series:
kser = ks.Series(pser)
After we have the Series in Koalas, we can compute the cumulative maximum of the values and plot it:
kser = kser.cummax()
kser.plot()
And there are many other variations of plots. You can also load the seaborn package, the bokeh package and many others.
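For example, a minimal sketch that plots a whole Koalas DataFrame (four random series, assuming the packages imported above):
pdf_plot = pd.DataFrame(np.random.randn(1000, 4),
                        index=pd.date_range('1/1/2000', periods=1000),
                        columns=['A', 'B', 'C', 'D'])
kdf_plot = ks.from_pandas(pdf_plot)
kdf_plot = kdf_plot.cummax()
kdf_plot.plot()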
This blogpost is a shorter version of a larger "Koalas in 10 minutes" notebook, which is available in the same Azure Databricks repository as all the other samples from this blogpost series. The notebook also briefly touches on data conversion to/from CSV, Parquet (*.parquet) and Spark IO (*.orc) formats; all of the files will be persisted and visible on DBFS.
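A minimal sketch of that IO round trip (the DBFS paths are illustrative):
# CSV
kdf.to_csv('/Day13_foo.csv')
ks.read_csv('/Day13_foo.csv').head(3)

# Parquet
kdf.to_parquet('/Day13_bar.parquet')
ks.read_parquet('/Day13_bar.parquet').head(3)

# Spark IO (ORC)
kdf.to_spark_io('/Day13_zoo.orc', format="orc")
ks.read_spark_io('/Day13_zoo.orc', format="orc").head(3)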
Tomorrow we will explore Databricks jobs, from configuration to execution and troubleshooting, so stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 14 2020 - From configuration to execution of Databricks jobs
Azure Databricks repository is a set of blogposts as a Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
For the last four days we have been exploring the languages available in Azure Databricks. Today we will explore Databricks jobs and what we can do with them, besides simply running a job.
1.Viewing a job
In the left vertical navigation bar, click the Job icon:

And you will get to the screen where you will be able to view all your jobs:

Jobs will have the following attributes (all are mostly self explanatory):
 Name - Name of the Databricks job
 Job ID - ID of the job and it is set automatically
 Created By - Name of the user (AD Username) who created a job
 Task - Name of the Notebook, that is attached and executed, when job is triggered.
Task can also be a JAR file or a spark-submit command.
 Cluster - Name of the cluster, that is attached to this job. When job is fired, all the
work will be done on this cluster.
 Schedule - CRON expression written in "readable" manner
 Last Run - Status of last run (E.g.: successful, failed, running)
 Action - buttons available to start or terminate the job.
2.Creating a job
If you would like to create a new job, use the "+Create job":

And you will be prompted with a new site, to fill in all the needed information:

Give the job a name by renaming the "Untitled". I named my job "Day14_Job". The next step is to create a task. A task can be:
 Selected Notebook (a notebook, as we learned to use and create on Day 7)
 Set JAR file (a program or function that can be executed using a main class and arguments)
 Configure spark-submit (spark-submit is an Apache Spark script to execute other applications on a cluster)
We will use a "Selected Notebook". Jump to the workspace and create a new notebook. I have named mine Day14_Running_job and it runs on Python. This is the code I have used:

Notebook is available in the same Github repository. And the code to this notebook is:
dbutils.widgets.dropdown("select_number", "1", [str(x) for x in range(1, 10)])

dbutils.widgets.get("select_number")

num_Select = dbutils.widgets.get('select_number')

for i in range(1, int(num_Select)):


print(i, end=',')
We have created a widget that looks like a dropdown menu at the top of the notebook; it holds a value (integer or string) that can be set once and called / used multiple times in the notebook.
You can copy and paste the code into code cells or use the notebook in the Github repository. We will use the notebook Day14_Running_job with the job Day14_job.
Navigate back to Jobs and you will see that Azure Databricks automatically saves the work progress for you. In this job (Day14_job), select the notebook you have created.

We will need to add some additional parameters:


 Task - Parameter
 Dependent libraries
 Cluster
 Schedule
The Parameters can be used with the dbutils.widgets command, the same one we used in the notebook. In this way we are giving the notebook some external parametrization.
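For example (a minimal sketch, assuming the job's Parameters field contains a key named select_number), the notebook can read the value that the job passes in:
# reads the value supplied by the job's Parameters (falls back to the widget's default when run interactively)
value = dbutils.widgets.get("select_number")
print("Job was started with select_number =", value)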
Since we are not using any additional libraries, just attach the cluster. I am using an existing cluster rather than creating a new one (as offered by default) and select the schedule timing (my cron configuration is: 0 0 * * * ?, which executes every full hour). At the end, my setup looks like:
Under the advanced features, you can also set the following:
 Alerts - set up alerts via email (or multiple email addresses) on three events: on start, on success or on failure.
 Maximum Concurrent Runs - is the upper limit of how many concurrent runs of this
job can be executing at the same time
 Timeout - specify the terminate time - the timeout (in minutes) of the running job
 Retries - number of retries if a job has failed
3.Executing a job
We leave all advanced settings at their defaults. You can run the job manually or leave it as it is and check whether the CRON schedule does the work. I have run one manually and left one to be run automatically.

You can also check each run separately and see what has been executed. Each run has a specific ID and holds typical information, such as Duration, Start time, Status, Type, which Task was executed, and others.
But the best part is that the results of the notebook are also available and persisted in the log of the job. This is especially useful if you don't want to store the results of your job in a file or table, but only need to see them after the run.
Do not forget to stop running jobs when you don't need them any more. This is important for several reasons: jobs will be executed even if the clusters are not running, so assuming that the clusters are stopped while the job is still active can generate traffic and unwanted expenses. In addition, get a feeling for how long the job runs, so you can plan the cluster up/down time accordingly.
Tomorrow we will explore the Spark UI, metrics and logs that are available on a Databricks cluster and job.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 15 2020 - Databricks Spark UI,
Event Logs, Driver logs and Metrics

Azure Databricks repository is a set of blogposts as a Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
Yesterday we looked into how Databricks jobs can be configured, how to use widgets to pass parameters, and the typical general settings.
When debugging jobs (or, for that matter, clusters), you will come across this part of the menu (it can be accessed from Jobs or from Clusters) with Event Log, Spark UI, Driver Logs and Metrics. This is a view from Clusters:

And same information can be accessed from Jobs (it is just positioned in the
overview of the job):
Both will get you to the same page.
1.Spark UI
After running a job, or executing commands in notebooks, check the Spark UI on the cluster where you executed the commands. The graphical user interface will give you an overview of the execution of particular jobs/executors and the timeline:
If you need a detailed description for a particular job ID (e.g. Job ID 13), you can see the execution time, duration, status and the job's globally unique identifier.
When clicking on the Description of this Job ID, you will get a more detailed overview. Besides the Event Timeline (which you can see in the screenshot above), you can also get the DAG visualization for a better understanding of how the Spark API works and which services it is using.

and under stages (completed, failed) you will find a detailed execution description of each step.
For each of the steps under the description you can get even more detailed information about the stage. Here is an example of the detailed stage and the aggregated metrics:

and the aggregated metrics


There are a lot of logs to go through when you want to investigate and troubleshoot a particular step.
Databricks provides three types of cluster activity logs:
 event logs - these logs capture the lifecycles of clusters: creation of cluster, start of
cluster, termination and others
 driver logs - Spark driver and worker logs are great for debugging;
 init-script logs - for debugging init scripts.
2.Event Logs
Event logs capture and hold cluster information and actions against the cluster. For each event type there is a timestamp and a message with detailed information, and you can click on each event to get additional information. This is what Event Logs offer you: a good, informative overview of what is happening with your clusters and their states.
3. Driver logs
Driver logs are divided into three sections:
 standard output
 standard error
 Log4j logs
and are the direct output (prints) and log statements from the notebooks, jobs or libraries that go through the Spark driver.
These logs will help you understand the execution of each cell in your notebook, the execution of a job, and more. The logs can easily be copied and pasted; note that the driver logs are appended periodically, so the newest content is usually at the bottom.
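As a minimal sketch (in Python) of where output ends up: print() statements land under standard output, while Python's logging module writes to standard error by default:
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("day15_demo")

print("Hello from the notebook - look for me under Standard output in the Driver Logs")
log.info("Hello from logging - look for me under Standard error in the Driver Logs")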
4. Metrics
Metrics in Azure Databricks are mostly used for performance monitoring. These metrics are exposed through the Ganglia UI and are useful for lightweight troubleshooting. Each metric represents a historical snapshot; clicking on one of them will get you a PNG report that can be zoomed in or out.
Tomorrow we will explore models and model management, and will make one in R and in Python.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 16 2020 - Databricks experiments, models and MLFlow

Azure Databricks repository is a set of blogposts as a Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers
and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering
tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data
analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
We have covered almost all the items on the left vertical bar; the last one to check is Models. But before we click on Models, I want to explain a little of the background.
1.What is MLflow?
MLflow is an open-source platform, vendor agnostic, for managing your machine
learning experiments, models and artefacts. MLflow consists of the following
components:
 Models - gives you the ability to manage, deploy and track models and compare them between environments
 Model Registry - allows you to centralize the model store and manage all stages of a model - from staging to production - using versioning
 Model Serving - for hosting MLflow models as REST API endpoints
 Tracking - allows you to track experiments for comparison of experiment parameters and results
 Projects - a wrapper for ML code, models and packages to be reusable, reproducible and repeatable by the same or another group of data scientists
Azure Databricks manages and hosts the MLflow integration (AD/SSO) with all of these features, and gives the end user experiment and run management within the workspace. MLflow on Azure Databricks offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects.
An MLflow run is a collection of parameters, metrics, tags, and artifacts associated with a machine learning model training process. It supports R, Python, Java and REST APIs. An experiment is a collection of MLflow runs. Each experiment holds information about its runs, which can be visualized, compared with each other, or even downloaded as artifacts to be used locally or elsewhere. Experiments are stored in the MLflow tracking server.
Experiments are available in your workspace and are stored as objects.
2.Create a notebook and install the mlflow package
Create a new notebook (I have named mine Day16_MLflow) and select R as the main language. Attach the cluster to the notebook. MLflow comes pre-installed on Databricks Runtime for Machine Learning clusters, so check your cluster's Runtime version. Mine is LTS but not ML.

This means that we need to install additional libraries on the cluster. Under the cluster, click on Libraries and select "Install New" to get:

Select Source: CRAN (the well-known R repository) and package name: mlflow. After a couple of minutes, you should see the package being installed:
In the notebook, use the following commands to initialize mlflow:
library(mlflow)
install_mlflow()
and the conda environment and underlying packages will be installed. (Yes, Python)

To start the tracking API for a particular run (on notebook), initiate it with:
run <- mlflow_start_run()
and add all the code, calculations and functions you want to be tracked in MLflow. This is just a short dummy example of how to pass parameters, log metrics and create artifacts for MLflow:
# Log a parameter (key-value pair)
mlflow_log_param("test_run_nof_runs_param1", 5)

# Log a metric; metrics can be updated throughout the run
mlflow_log_metric("RMSE", 2, step = 1)
mlflow_log_metric("RMSE", 4, step = 2)
mlflow_log_metric("RMSE", 6, step = 3)
mlflow_log_metric("RMSE", 8, step = 4)
mlflow_log_metric("RMSE", 1, step = 5)

# Log an artifact (output file)
write("This is R code from Azure Databricks notebook", file = "output.txt")
mlflow_log_artifact("output.txt")
When your code is completed, finish off with end run:
mlflow_end_run()
Each time you run this block of code, the run will be documented and stored in the experiment.

Now under the Experiments Run, click the "View run details":

And you will get to the experiment page. This page holds all the information on each run, with all the parameters, metrics and other relevant information about the runs, or models.

Scrolling down on this page, you will also find all the artifacts that you can store
during the runs (that might be pickled files, logs, intermediate results, binary files,
etc.)
3. Create a model
Once you have a dataset ready and the experiment running, you will want to register the model as well; the Model Registry takes care of this. In the same notebook we will add a little experiment: a wine-quality experiment. The data is available in the Github repository and you just add the file to your DBFS.
Now use R standard packages:
library(mlflow)
library(glmnet)
library(carrier)
And load the data into a data.frame (please note that the file is in my FileStore DBFS location and the path might vary based on your location).
library(SparkR)

data <- read.df("/FileStore/Day16_wine_quality.csv", source = "csv", header = "true")

display(data)
data <- as.data.frame(data)
In addition, I will detach the SparkR package so it does not cause any interference between data types:
#detaching the package due to data type conflicts
detach("package:SparkR", unload=TRUE)
And now do the typical train and test split.
# Split the data into training and test sets. (0.75, 0.25) split.
sampled <- sample(1:nrow(data), 0.75 * nrow(data))
train <- data[sampled, ]
test <- data[-sampled, ]

# The predicted column is "quality" which is a scalar from [3, 9]
train_x <- as.matrix(train[, !(names(train) == "quality")])
test_x <- as.matrix(test[, !(names(train) == "quality")])
train_y <- train[, "quality"]
test_y <- test[, "quality"]

alpha <- mlflow_param("alpha", 0.5, "numeric")
lambda <- mlflow_param("lambda", 0.5, "numeric")
And now we register the model and all the parameters:
mlflow_start_run()

model <- glmnet(train_x, train_y, alpha = alpha, lambda = lambda, family = "gaussian", standardize = FALSE)
predictor <- crate(~ glmnet::predict.glmnet(!!model, as.matrix(.x)), !!model)
predicted <- predictor(test_x)

rmse <- sqrt(mean((predicted - test_y) ^ 2))
mae <- mean(abs(predicted - test_y))
r2 <- as.numeric(cor(predicted, test_y) ^ 2)

message("Elasticnet model (alpha=", alpha, ", lambda=", lambda, "):")
message(" RMSE: ", rmse)
message(" MAE: ", mae)
message(" R2: ", r2)

mlflow_log_param("alpha", alpha)
mlflow_log_param("lambda", lambda)
mlflow_log_metric("rmse", rmse)
mlflow_log_metric("r2", r2)
mlflow_log_metric("mae", mae)

mlflow_log_model(
  predictor,
  artifact_path = "model",
  registered_model_name = "wine-quality")

mlflow_end_run()
This should also produce additional runs in the same experiment. In addition, it will create a model in the Model Registry, and this model can later be versioned and approved to be moved to the next stage or environment.

Tomorrow we will do end-to-end machine learning project using all three languages,
R, Python and SQL.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 17 2020 - End-to-End Machine learning project in Azure Databricks

Azure Databricks repository is a set of blogposts as a Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
In the past couple of days we looked into configurations and infrastructure and
today it is again time to do an analysis, let's call it end-to-end analysis using R or
Python or SQL.
1. Notebook, Cluster and Data
Create a new notebook; I am calling mine Day17_Analysis and selecting Python as the kernel language. Attach the cluster to your notebook and start the cluster (if it is not yet running). Import the data using SparkR:
%r
library(SparkR)

data_r <- read.df("/FileStore/Day16_wine_quality.csv", source = "csv", header="true")

display(data_r)
data_r <- as.data.frame(data_r)
And we can also do the same for Python:
import pandas as pd
data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=';')
We can also use Python to import the data and get insight into the dataset.
import matplotlib.pyplot as plt
import seaborn as sns
data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=',')
data_py.info()
Importing also all other packages that will be relevant in following steps:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
2.Data wrangling
So let's continue using Python. You can get a sense of the dataset by using the Python describe function:
data_py.describe()
And also work with duplicate values (remove them) and missing values (remove them
or replace them with mean value):
#remove duplicates
sum(data_py.duplicated())
data_py.drop_duplicates(inplace=True)

#remove rows with empty values
data_py.isnull().sum(axis=0)
data_py.dropna(axis=0, how='any', inplace=True)

#fill the missing values with mean
data_py.fillna(0, inplace=True)
data_py['quality'].fillna(data_py['quality'].mean(), inplace=True)
data_py.apply(lambda x: x.fillna(x.mean(), inplace=True), axis=0)
You can also find and filter out outliers by using the IQR - the interquartile range:
Q1 = data_py.quantile(0.25)
Q3 = data_py.quantile(0.75)
IQR = Q3 - Q1
data_py2 = data_py[~((data_py < (Q1 - 1.5 * IQR)) | (data_py > (Q3 + 1.5 * IQR))).any(axis=1)]
#print(data_py2.shape)
print((data_py2 < (Q1 - 1.5 * IQR)) | (data_py2 > (Q3 + 1.5 * IQR)))
3.Exploring dataset
We can check the distribution of some variables, and the best way is to show it with graphs:
fig, axs = plt.subplots(1,5,figsize=(20,4),constrained_layout=True)

data_py['fixed acidity'].plot(kind='hist', ax=axs[0])
data_py['pH'].plot(kind='hist', ax=axs[1])
data_py['quality'].plot(kind='line', ax=axs[2])
data_py['alcohol'].plot(kind='hist', ax=axs[3])
data_py['total sulfur dioxide'].plot(kind='hist', ax=axs[4])
Adding also a plot of counts per quality:
counts = data_py.groupby(['quality']).count()['pH'] # pH or anything else - just for count
counts.plot(kind='bar', title='Quantity by Quality')
plt.xlabel('Quality', fontsize=18)
plt.ylabel('Count', fontsize=18)

Adding some boxplots will also give a great understanding of the data and statistics
of particular variable. So, let's take pH and Quality
sns.boxplot(x='quality',y='pH',data=data_py,palette='GnBu_d')
plt.title("Boxplot - Quality and pH")
plt.show()
or quality with fixed acidity:
sns.boxplot(x="quality",y="fixed acidity",data=data_py,palette="coolwarm")
plt.title("Boxplot of Quality and Fixed Acidity")
plt.show()

And also add some correlation among all the variables in dataset:
plt.figure(figsize=(10,10))
sns.heatmap(data_py.corr(),annot=True,linewidth=0.5,center=0,cmap='coolwarm')
plt.show()

4.Modeling
We will split the dataset into the Y-set (our predicted variable) and the X-set (all the other variables). After that, we will split both sets into train and test subsets.
X = data_py.iloc[:,:11].values
Y = data_py.iloc[:,-1].values

#Splitting the dataset into training and test set
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25,random_state=0)
We will also do the feature scaling:
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the scaler fitted on the training data
And get the general understanding of explained variance:
# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
You will see that the three components together contribute more than 50% of the total variance.
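A quick way to check this is a minimal sketch using the explained_variance array from above:
# per-component and cumulative share of the variance explained by the 3 principal components
print(explained_variance)
print(explained_variance.cumsum())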
Based on the train and test sets, let us now fit different types of models to the dataset. Using logistic regression:
#Fitting Logistic Regression into dataset
lr_c=LogisticRegression(random_state=0)
lr_c.fit(X_train,Y_train)
lr_pred=lr_c.predict(X_test)
lr_cm=confusion_matrix(Y_test,lr_pred)
print("The accuracy of LogisticRegression is:",accuracy_score(Y_test, lr_pred))
and create a confusion matrix to see the correctly predicted values per category.
#Making confusion matrix
print(lr_cm)

I will repeat this for the following algorithms: SVM, RandomForest, KNN, Naive Bayes
and I will make a comparison at the end.
SVM
#Fitting SVM into dataset
cl = SVC(kernel="rbf")
cl.fit(X_train,Y_train)
svm_pred=cl.predict(X_test)
svm_cm = confusion_matrix(Y_test,cl.predict(X_test))
print("The accuracy of SVM is:",accuracy_score(Y_test, svm_pred))
RandomForest
#Fitting Randomforest into dataset
rdf_c=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)
rdf_c.fit(X_train,Y_train)
rdf_pred=rdf_c.predict(X_test)
rdf_cm=confusion_matrix(Y_test,rdf_pred)
print("The accuracy of RandomForestClassifier
is:",accuracy_score(rdf_pred,Y_test))
KNN
#Fitting KNN into dataset
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,Y_train)
knn_pred=knn.predict(X_test)
knn_cm=confusion_matrix(Y_test,knn_pred)
print("The accuracy of KNeighborsClassifier is:",accuracy_score(knn_pred,Y_test))
and Naive Bayes
#Fitting Naive bayes into dataset
gaussian=GaussianNB()
gaussian.fit(X_train,Y_train)
bayes_pred=gaussian.predict(X_test)
bayes_cm=confusion_matrix(Y_test,bayes_pred)
print("The accuracy of naives bayes is:",accuracy_score(bayes_pred,Y_test))
And the accuracy for all the model fitting is the following:
 LogisticRegression is: 0.4722502522704339
 SVM is: 0.48335015136226034
 KNeighborsClassifier is: 0.39455095862764883
 naives bayes is: 0.46316851664984865
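To put these side by side, here is a minimal sketch reusing the predictions and the imports (pd, accuracy_score) from above, including the RandomForest predictions:
accuracies = pd.Series({
    "LogisticRegression": accuracy_score(Y_test, lr_pred),
    "SVM": accuracy_score(Y_test, svm_pred),
    "RandomForest": accuracy_score(Y_test, rdf_pred),
    "KNN": accuracy_score(Y_test, knn_pred),
    "NaiveBayes": accuracy_score(Y_test, bayes_pred),
})
accuracies.sort_values(ascending=False).plot(kind='bar', title='Model accuracy comparison')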
It is clear that all of these models leave room for improvement.
Tomorrow we will use Azure Data Factory with Databricks.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 18 2020 - Using Azure Data Factory with Azure Databricks

Azure Databricks repository is a set of blogposts as a Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers
and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering
tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data
analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
Yesterday we did an end-to-end machine learning project. Almost, one could argue. What if we want to incorporate this notebook into a larger data flow in Azure? In this case we would need Azure Data Factory (ADF).

Azure Data Factory is the Azure service for ETL operations. It is a serverless service for data transformation, data integration and orchestration across several different Azure services. Some resemblance to SSIS (SQL Server Integration Services) can also be found.
On the Azure portal (not the Azure Databricks portal), search for "data factory" or "data factories" and you should get the recommended services. Select "Data factories" - that is what we are looking for.

Once on the page for Data factories, select "+ Add" to create a new ADF. Insert the needed information:
 Subscription
 Resource Group
 Region
 Name
 Version
I have selected the same resource group as the one for Databricks, and for the name I have given it ADF-Databricks-day18.

Select the additional information such as Git configuration (you can also do this part later), Networking for special configurations and Advanced settings, add tags and create the ADF. Once the creation and deployment are completed, jump into the service and you should see the dashboard for this particular service.
Select "Author & Monitor" to get into the Azure Data Factory. You will be redirected
to a new site:

On the left-hand side, you will find a vertical navigation bar. Look for the bottom icon, which looks like a toolbox and is for managing the ADF:
You will get to the settings page for this Data Factory.

And we will be creating a new linked service - a linked service that will enable communication between Azure Data Factory and Azure Databricks. Select "+ New" to add a new linked service:
On your right-hand side, a window will pop up with the available services that you can link to ADF. Either search for Azure Databricks, or click "Compute" (not "Storage") and you should see the Databricks logo immediately.
Click on Azure Databricks and click on "Continue". You will get a list of information
you will need to fill-in.
All the needed information is relatively straightforward. There is still the authentication to sort out; we will use an access token. On Day 9 we used a Shared Access Signature (SAS), where we needed to create Azure Databricks tokens. Open a new window (but do not close the ADF settings for creating a new linked service) in Azure Databricks and go to the settings for this particular workspace.

Click on the icon (mine is "DB_py" and a glyph) and select "User Settings".
You can see I already have a token ID and secret from Day 9 in the Databricks system. So let's generate a new token by clicking "Generate New Token".
Give the token a name ("adf-db") and a lifetime period. Once the token is generated, copy it somewhere, as you will never see it again. Of course, you can always generate a new one, but if you have multiple services bound to it, it is wise to store it somewhere secure. Generate it and copy the token (!).
Go back to Azure Data Factory and paste the token in the settings. Select or fill-in the
additional information.

Once the linked service is created, select Author in the left vertical menu in Azure Data Factory.
This will bring you to a menu where you can start putting together a pipeline. In addition, you can also register datasets and data flows in ADF; this is especially useful for large-scale ETL or orchestration tasks.

Select Pipeline and choose "New pipeline". You will be presented with a designer
tool where you can start putting together the pipelines, flow and all the
orchestrations.
Under the section of Databricks, you will find:
 Notebooks
 Jar
 Python
You can use Notebooks to build the whole pipelines and help you communicate
between notebooks. You can add an application (Jar) or you can add a Python script
for carrying a particular task.
Drag and drop the elements on to the canvas. Select Notebook (Databricks
notebook), drag the icon to canvas, and drop it.

Under the settings for this particular notebook, you have a tab "Azure Databricks" where you select the linked service connection. Choose the one we created previously; mine is named "AzureDatabricks_adf". And under settings, select the path to the notebook:

Once you finish entering all the needed information, remember to hit "Publish all", and if there is any conflict in the code, we can fix it immediately.
You can now trigger the pipeline, schedule it, or connect it to another notebook by selecting run or debug.

In this manner you can schedule and connect other services with Azure Databricks.
Tomorrow, we will look into adding a Python element or another notebook to make
more use of Azure Data factory.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 19 2020 - Using Azure Data Factory with Azure Databricks for merging CSV files

Azure Databricks repository is a set of blogposts as a Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
Yesterday we created a data factory and started using the service; we created a linked service and our first pipeline.
Today we will look at how we can start using Blob storage and Azure Databricks with Azure Data Factory.

This is one of the scenarios where you have multiple CSV files coming in to blob storage (a particular folder) and we would want to:
 merge CSV files
 merge files with some transformation in between
 transform the files first and do the merge
 copying files from one data lake zone to another zone and making transformation in
between
 or any other...
Regardless of the scenario, let's dive in.
1.Create linked service for Azure Blob
Yesterday (Day 18) we looked at how to create a linked service for Azure Databricks. Now we will need another linked service, this time for Azure Blob Storage. Navigate to linked services and create a new one.
While configuring, select your Azure subscription and choose the storage account we created on Day 9, which I called dbpystorage. You should have something like this:
On Day 9 we also copied a file into the blob storage, called Day9MLB_players.csv (the file is also available in the Github repository). Now you should have the Azure Blob Storage and Azure Databricks services linked to Azure Data Factory.
We will now need to create a dataset and a pipeline in ADF.
2.Creating a dataset
To add a new dataset, go to Datasets and select "New Dataset". A window will pop up asking for the location of the dataset. Select Azure Blob Storage, because the file is available in this service.
After selecting the storage type, you will be prompted with file type. Choose CSV -
DelimitedText type.
After this, specify the path to the file. As I am using only one file, I am specifying the name. Otherwise, if this folder were a landing zone for multiple files (with the same schema), I could use a wildcard, e.g. Day*.csv, and all files following this pattern would be read.

Once you have a dataset created, we will need a pipeline to connect the services.
3. Creating Pipeline
On the Author view in ADF, create a new Pipeline. A new canvas will appear for you
to start working on data integration.

Select element "Copy Data" and element "Databricks". Element Copy Data will need
the source and the sink data. It can copy a file from one location to another, it can
merge files to another location or change format (going from CSV to Parquet). I will
be using from CSV to merge into CSV.
Select all the properties for Source.
And for the Sink. For copy behaviour, I am selecting "merge files" to mimic the ETL
job.

Once this part is completed, we need to look into the Databricks element. An Azure Databricks notebook can hold literally anything: data transformation, data merging, analytics, or it can even serve as a transformation element and a connection to further elements. In this case, the Databricks element will only handle reading the data and creating a table.
4. Notebook
Before connecting the elements in ADF, we need to give some instructions to the notebook. Head to Azure Databricks and create a new notebook; I have named mine Day19_csv and chosen Python as the language.
Set up the connection to the file (this time using Python - before we used Scala):
%python
storage_account_name = "dbpystorage"
storage_account_access_key = "YOUR_ACCOUNT_ACCESS_KEY"

file_location = "wasbs://dbpystorecontainer@dbpystorage.blob.core.windows.net/"
file_type = "csv"

spark.conf.set("fs.azure.account.key."+storage_account_name+".blob.core.windows.net", storage_account_access_key)
After the initial connection is set, we can load the data and create a SQL table:
%python
df = spark.read.format(file_type).option("header","true").option("inferSchema",
"true").load(file_location)

df.createOrReplaceTempView("Day9data_view")
And the SQL query:
%sql
SELECT * FROM Day9data_view
--Check number of rows
SELECT COUNT(*) FROM Day9data_view
You can add many other data transformation or ETL scripts, or you can reuse the machine learning scripts to do data analysis and predictions. Normally, I would add an analysis of the merged dataset and save or expose the results to other services (via ADF), but to keep the post short, let's keep it as it is.
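For example, here is a minimal sketch (the output path is illustrative) that would persist the merged data back to DBFS as Parquet for downstream steps:
%python
# write the merged dataset to DBFS so other notebooks or ADF activities can pick it up
df.write.mode("overwrite").parquet("/FileStore/Day19_merged_parquet")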
5. Connecting the dots
Back in Azure Data Factory, set the Notebook and select the Azure Databricks linked
service and under setting, set the path to the notebook we have created in previous
step.

You can always browse through the path and select the correct path.

Once you set the path, you can connect the elements (or activities) together, debug
and publish all the elements. Once published you can schedule and harvest the
pipeline.
This pipeline can be scheduled, can be used as part of bigger ETL or it can be
extended. You can have each notebook doing part of ETL and have the notebooks
orchestrated in ADF, you can have data flows created in ADF and connect Python
code (instead of Notebooks). The possibilities are endless. Even if you want to
capture streaming data, you can use ADF and Databricks, or only Databricks with
Spark or you can use other services (Event hub, Azure functions, etc.).
Tomorrow we will look this orchestration part using two notebooks with Scala and
Python.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!

Dec 20 2020 - Orchestrating multiple notebooks with Azure Databricks

Azure Databricks repository is a set of blogposts as a Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
Yesterday we were looking into ADF and how to create pipelines and connect different Azure services.
Today we will look into connecting multiple notebooks and trying to create an orchestration, or workflow, of several notebooks.

1. Creating a folder with multiple notebooks
In the Azure Databricks workspace, create a new folder called Day20. Inside the folder, let's create a couple of notebooks:
 Day20_NB1
 Day20_NB2
 Day20_functions
 Day20_Main
 Day20_NB3_Widget
And all are running Language: Python.
2.Concept
The outline of the notebooks is the following:
The main notebook (Day20_Main) is the one the end user or a job will be running all the commands from.
The first step is to run notebook Day20_NB1; until it finishes, the next code (or step) in the main notebook will not be executed. The notebook is deliberately empty, mimicking a notebook that does tasks that are independent of any other steps or notebooks.
The second step is to load Python functions with the notebook Day20_functions. This notebook is just a collection of user-defined functions. Once the notebook is executed, all the functions are declared and available to the current user across the notebooks.
The third step is to try and run a couple of these Python functions in the main notebook (Day20_Main). Despite their simplicity, they serve the purpose and show how to use and run them.
The fourth step is to create a persistent element for storing results from the notebooks. I created a table day20_NB_run where a timestamp, an ID and comments are stored; an insert is executed at the end of every non-main notebook (serving multiple purposes - feedback information for the next step, triggering the next step, logging, etc.). The table is created in the database day10 (created on Day 10) and re-created (DROP and CREATE) every time the main notebook is executed.
The fifth step is executing notebook Day20_NB2, which stores its end result in the table day20_NB_run for logging or troubleshooting purposes.
The sixth step is a different one, because we want to have back-and-forth communication between the main notebook (Day20_Main) and a "child" notebook (Day20_NB3_Widget). This can be done using Databricks widgets. This notebook has a predefined set of possible values that are passed as a key-argument entity from the main (Day20_Main) notebook. Once the widget value has been run against the set of code, it can also be stored in the database (which is what I did in the demo) or returned (passed) to the main notebook in order to execute additional tasks based on the new values.
3.Notebooks
A look inside the notebooks.
Day20_NB1 - this one is a no-brainer.
Day20_functions - is the collection of Python functions:

with the sample code:


def add_numbers(x, y):
    sum = x + y
    return sum

import datetime

def current_time():
    e = datetime.datetime.now()
    print("Current date and time = %s" % e)
Day20_NB2 - is the notebook that outputs the results to SQL table.
And the SQL code:
%sql
INSERT INTO day10.day20_NB_run VALUES (10, "Running from Day20_Notebook2",
CAST(current_timestamp() AS TIMESTAMP))
Day20_NB3_Widget - is the notebook that receives the arguments from previous
step (either notebook, or CRON or function) and executes the steps with this input
information accordingly and stores results to SQL table.

And the code from this notebook. This step creates a widget on the notebook.
dbutils.widgets.dropdown("Wid_arg", "1", [str(x) for x in range(1, 10)])
Each time a number is selected, you can run the command to get the value out:
selected_value = dbutils.widgets.get("Wid_arg")
print("Returned value: ", selected_value)
And the last step is to insert the values (and also the result of the widget) into SQL
Table.
%sql
INSERT INTO day10.day20_NB_run VALUES (10, "Running from day20_Widget notebook",
CAST(current_timestamp() AS TIMESTAMP))
Day20_Main - is the umbrella notebook, or main notebook, where all the orchestration is carried out. This notebook also holds the logic behind the steps and their communication.

Executing a notebook from another (main) notebook is done in the Day20_Main notebook using this command (%run and the path to the notebook):
%run /Users/tomaz.kastrun@gmail.com/Day20/Day20_functions
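Once the functions notebook has been run this way, its functions can be called directly from the main notebook; a quick sanity check using the two functions defined above:
print(add_numbers(3, 4))   # returns 7
current_time()             # prints the current date and time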
The SQL table is the easier part - defining the table and inserting the rows:
%sql
DROP TABLE IF EXISTS day10.day20_NB_run;
CREATE TABLE day10.day20_NB_run (id INT, NB_Name STRING, Run_time TIMESTAMP)

%sql
INSERT INTO day10.day20_NB_run VALUES (10, "Running from day20_Main notebook",
CAST(current_timestamp() AS TIMESTAMP))
For each step in between, I am checking the values in the SQL table, which gives you the current status, and also logging all the steps in the Main notebook.

The command below executes a notebook with input parameters: it will run the Day20_NB3_Widget notebook, time out after a maximum of 60 seconds, and pass the collection of parameters as the last argument.
dbutils.notebook.run("Day20_NB3_Widget", 60, {"Wid_arg": "5"})
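If the child notebook ends with dbutils.notebook.exit (as sketched earlier), dbutils.notebook.run returns that value as a string, so the main notebook can capture it; a minimal sketch, assuming the call above:
# Capture the value returned by the child notebook (empty string if nothing was returned)
returned_value = dbutils.notebook.run("Day20_NB3_Widget", 60, {"Wid_arg": "5"})
print("Child notebook returned:", returned_value)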
We have seen that the orchestration has the capacity to run in multiple languages, use input or output parameters and can be part of a larger ETL or ELT process.
Tomorrow we will check and explore how to go about Scala, since we haven't yet discussed anything about Scala, although we have mentioned it on multiple occasions.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 21 2020 - Using Scala with Spark Core API in Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
Yesterday we explored the capabilities of orchestrating notebooks in Azure
Databricks. Also in previous days we have seen that Spark is the main glue between
the different languages. But today we will talk about Scala.

And in the following blogposts we will explore the core engine and services on top:
 Spark SQL + DataFrames
 Streaming
 MLlib - Machine learning library
 GraphX - Graph computations

Apache Spark is a powerful open-source processing engine built around speed, ease
of use, and sophisticated analytics.
Spark Core is the underlying general execution engine for the Spark platform, with all other functionality built on top of it. It is an in-memory computing engine that provides support for a variety of languages, such as Scala, R and Python, for easier data engineering and machine learning development.
Spark has three key interfaces:
 Resilient Distributed Dataset (RDD) - It is an interface to a sequence of data objects
that consist of one or more types that are located across a collection of machines (a
cluster). RDDs can be created in a variety of ways and are the “lowest level” API
available. While this is the original data structure for Apache Spark, you should focus
on the DataFrame API, which is a superset of the RDD functionality. The RDD API is
available in the Java, Python, and Scala languages.
 DataFrame - similar in concept to the DataFrame you will find with the pandas
Python library and the R language. The DataFrame API is available in the Java, Python,
R, and Scala languages.
 Dataset - a combination of RDD and DataFrame. It provides the typed interface of the RDD and the convenience of the DataFrame. The Dataset API is available only in Scala and Java.
In general, when working with performance optimisations, either DataFrames or Datasets should be enough. But when going into the more advanced components of Spark, it may be necessary to use RDDs. Also, the visualisation within the Spark UI references RDDs directly.
1.Datasets
Let us start with Databricks datasets, which are available within every workspace and are here mainly for test purposes. This is nothing new; both Python and R come with sample datasets. For example, the Iris dataset is available with the base R engine and the Seaborn Python package. The same goes for Databricks: sample datasets can be found in the /databricks-datasets folder.
Create a new notebook in your workspace and name it Day21_Scala. Language: Scala.
And run the following Scala command.
display(dbutils.fs.ls("/databricks-datasets"))

You can always store the result in a variable and later use it multiple times:
// transformation
val textFile = spark.read.textFile("/databricks-datasets/samples/docs/README.md")
and listing the content of the variable by using show() function:
textFile.show()
And some other useful functions; to count all the lines in textfile, to show the first
line and to filter the text file showing only the lines containing the search argument
(word sudo).
// Count number of lines in textFile
textFile.count()
// Show the first line of the textFile
textFile.first()
// show all the lines with word Sudo
val linesWithSudo = textFile.filter(line => line.contains("sudo"))
And also printing the first four lines containing the word "sudo". The second example finds the number of words in the longest line:
// Output the all four lines
linesWithSudo.collect().take(4).foreach(println)
// find the lines with most words
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
2. Create a dataset
Now let's create a dataset (remember the difference
between Dataset and DataFrame) and load some data from /databricks-
datasets  folder.
val df = spark.read.json("/databricks-datasets/samples/people/people.json")
3. Convert DataFrame to Dataset
We can also convert the DataFrame to a typed Dataset for easier operation and usage. We must define a case class that represents a type-specific Scala JVM object (like a schema) and repeat the read with this definition.
case class Person (name: String, age: Long)
val ds =
spark.read.json("/databricks-datasets/samples/people/people.json").as[Person]
We can also create and define another dataset, taken from the /databricks-datasets
folder that is in JSON (flattened) format:
// define a case class that represents the device data.
case class DeviceIoTData (
battery_level: Long,
c02_level: Long,
cca2: String,
cca3: String,
cn: String,
device_id: Long,
device_name: String,
humidity: Long,
ip: String,
latitude: Double,
longitude: Double,
scale: String,
temp: Long,
timestamp: Long
)

val ds =
spark.read.json("/databricks-datasets/iot/iot_devices.json").as[DeviceIoTData]

and run show() function to see the imported Dataset from JSON file:
Now let's play with the dataset using Scala Dataset API with following frequently
used functions:
 display(),
 describe(),
 sum(),
 count(),
 select(),
 avg(),
 filter(),
 map() or where(),
 groupBy(),
 join(), and
 union().
display()
You can also view the dataset using display() (similar to .show() function):
display(ds)
describe()
Describe() function is great for exploring the data and the structure of the data:
ds.describe()

or for getting descriptive statistics of the Dataset or of particular column:


display(ds.describe())
// or for column
display(ds.describe("c02_level"))
sum()
Let's sum all c02_level values:
//create a variable sum_c02_1 and sum_c02_2;
// both are correct and return same results

val sum_c02_1 = ds.select("c02_level").groupBy().sum()


val sum_c02_2 = ds.groupBy().sum("c02_level")

display(sum_c02_1)
And we can also double-check the result of this sum with SQL, just because it is fun. But first we need to create a SQL view (or it could be a table) from this dataset.
ds.createOrReplaceTempView("SQL_iot_table")
And then define the cell as a SQL statement, using %sql. Remember, all code today is written in Scala, unless otherwise stated with %{lang} at the beginning of the cell.
%sql
SELECT sum(c02_level) as Total_c02_level FROM SQL_iot_table
And for sure, we get the same result (!).
select()
Select() function will let you show only the columns you want to see.
// Both will return same results
ds.select("cca2","cca3", "c02_level").show()
// or
display(ds.select("cca2","cca3","c02_level"))

avg()
Avg() function will let you aggregate a column (let us take: c02_level) over another
column (let us take: countries in variable cca3). First we want to calculate average
value over the complete dataset:
val avg_c02 = ds.groupBy().avg("c02_level")
display(avg_c02)
And then also the average value for each country:
val avg_c02_byCountry = ds.groupBy("cca3").avg("c02_level")

display(avg_c02_byCountry)

filter()
Filter() function will shorten the dataset by filtering out the values that do not comply with the condition. Filter() can also be replaced by where(); they both have similar behaviour.
The following command will return a dataset that meets the condition where battery_level is greater than 7.
display(ds.filter(d => d.battery_level > 7))
And the following command will filter the dataset on the same condition, but only return the specified columns (in comparison with the previous command, which returned all columns):
display(ds.filter(d => d.battery_level > 7).select("battery_level", "c02_level",
"cca3"))
groupBy()
Adding aggregation to filtered data (avg() function) and grouping dataset based on
cca3 variable:
display(ds.filter(d => d.battery_level > 7).select("c02_level",
"cca3").groupBy("cca3").avg("c02_level"))
Note the explicit lambda inside the filter function. The part "d => d.battery_level > 7" defines a predicate over each element of the dataset; the same style of function can also be used with map(), as in the Hadoop map-reduce pattern.

join()
Join() function will combine two objects. So let us create two simple DataFrames and
create a join between them.
val df_1 = Seq((0, "Tom"), (1, "Jones")).toDF("id", "first")
val df_2 = Seq((0, "Tom"), (2, "Jones"), (3, "Martin")).toDF("id", "second")
Using function Seq() to create a sequence and toDF() to save it as DataFrame.
To join two DataFrames, we use
display(df_1.join(df_2, "id"))
Name of the first DataFrame - df_1 (on left-hand side) joined by second DataFrame
- df_2 (on the right-hand side) by a column "id".

Join() implies an inner join and returns all the rows where there is a complete match. If interested, you can also explore the execution plan of this join by adding explain at the end of the command:
and also create left/right join or any other semi-, anti-, cross- join.
df_1.join(df_2, Seq("id"), "LeftOuter").show
df_1.join(df_2, Seq("id"), "RightOuter").show

union()
To append two datasets (or DataFrames), union() function can be used.
val df3 = df_1.union(df_2)
display(df3)
// df3.show(true)

distinct()
Distinct() function will return only the unique rows. Since union() keeps duplicates (it behaves like UNION ALL), combining it with distinct() gives you classic UNION behaviour:
display(df3.distinct())
Tomorrow we will look at Spark SQL and DataFrames with the Spark Core API in Azure Databricks. Today's post was a little bit longer, but it is important to get a good understanding of the Spark API, get your hands around Scala and start working with Azure Databricks.
Complete set of code and Scala notebooks (including HTML) will be available at
the Github repository.
Happy Coding and Stay Healthy!

Dec 22 2020 - Using Spark SQL and DataFrames in Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers
and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering
tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data
analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
Yesterday we took a closer look into Spark Scala with notebooks in Azure Databricks and how to handle data engineering. Today we will look into Spark SQL and DataFrames, which build on the Spark Core API.

"Spark SQL is a spark module for structured data processing and data querying. It
provides programming abstraction called DataFrames and can also serve as
distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up
to 100x faster on existing deployments and data. It also provides powerful
integration with the rest of the Spark ecosystem (e.g.: integrating SQL query
processing with machine learning)." (Apache Spark Tutorial).
Start your Azure Databricks workspace and create a new notebook. I named mine Day22_SparkSQL and set the language to SQL. Now let's explore the functionalities of Spark SQL.
1.Loading Data
We will load data from /databricks-datasets using Spark SQL, R and Python
languages. The CSV dataset will be data_geo.csv in the following folder:
%scala
display(dbutils.fs.ls("/databricks-datasets/samples/population-vs-price"))
1.1. Loading using Python
%python
data =
spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv",
header="true", inferSchema="true")
And materialize the data by creating a temporary view named data_geo_py:
%python
data.createOrReplaceTempView("data_geo_py")
And run the following SQL Statement:
SELECT * FROM data_geo_py LIMIT 10
1.2. Loading using SQL
DROP TABLE IF EXISTS data_geo;

CREATE TABLE data_geo


USING com.databricks.spark.csv
OPTIONS (path "/databricks-datasets/samples/population-vs-price/data_geo.csv",
header "true", inferSchema "true")
And run the following SQL Statement:
SELECT * FROM data_geo LIMIT 10
1.3. Loading using R
%r
library(SparkR)
data_geo_r <-
read.df("/databricks-datasets/samples/population-vs-price/data_geo.csv", source =
"csv", header="true", inferSchema = "true")
registerTempTable(data_geo_r, "data_geo_r")
Cache the results:
CACHE TABLE data_geo_r
And run the following SQL Statement:
SELECT * FROM data_geo_r LIMIT 10
All three DataFrames are the same (unless additional modifications are made, like dropping rows with null values, etc.).
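Since all three loads were registered as views or tables, you can also query them from a Python cell with spark.sql; a minimal sketch, assuming the names used above:
%python
# Query the SQL table and the Python temp view from Python
df_sql = spark.sql("SELECT * FROM data_geo LIMIT 10")
df_py = spark.sql("SELECT * FROM data_geo_py LIMIT 10")
display(df_sql)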
2.Viewing DataFrame
Viewing a DataFrame is done with a simple SELECT statement, following the ANSI SQL standard. E.g.:
SELECT City
,`2014 Population estimate`
,`2015 median sales price`
,`State Code` AS State_Code
FROM data_geo
WHERE `State Code` = 'AZ';
You can also combine all three DataFrames that were imported using three different
languages (SQL, R, Python).
SELECT *, 'data_geo_SQL' AS dataset FROM data_geo
UNION
SELECT *, 'data_geo_Python' AS dataset FROM data_geo_py
UNION
SELECT *, 'data_geo_R' AS dataset FROM data_geo_r
ORDER BY `2014 rank`, dataset
LIMIT 12
3.Running SQL
3.1. Date and Time functions
SELECT
CURRENT_TIMESTAMP() AS now
,date_format(CURRENT_TIMESTAMP(), "L") AS Month_
,date_format(CURRENT_TIMESTAMP(), "LL") AS Month_LeadingZero
,date_format(CURRENT_TIMESTAMP(), "y") AS Year_
,date_format(CURRENT_TIMESTAMP(), "d") AS Day_
,date_format(CURRENT_TIMESTAMP(), "E") AS DayOFTheWeek
,date_format(CURRENT_TIMESTAMP(), "H") AS Hour
,date_format(CURRENT_TIMESTAMP(), "m") AS Minute
,date_format(CURRENT_TIMESTAMP(), "s") AS Second

3.2. Built-in functions


SELECT
COUNT(*) AS Nof_rows
,SUM(`2014 rank`) AS Sum_Rank
,AVG(`2014 rank`) AS Avg_Rank
,SUM(CASE WHEN `2014 rank` > 150 THEN 1 ELSE -1 END) AS Sum_case
,STD(`2014 rank`) as stdev
,MAX(`2014 rank`) AS Max_Val
,MIN(`2014 rank`) AS Min_Val
,KURTOSIS (`2014 rank`) as Kurt
,SKEWNESS(`2014 rank`) AS Skew
,CAST(SKEWNESS(`2014 rank`) AS INT) AS Skew_cast
FROM data_geo

3.3. SELECT INTO


You can also store results using an INSERT INTO ... SELECT statement, with the target table being predefined:
DROP TABLE IF EXISTS tmp_data_geo;
CREATE TABLE tmp_data_geo (`2014 rank` INT, State VARCHAR(64), `State Code`
VARCHAR(2));

INSERT INTO tmp_data_geo


FROM data_geo SELECT
`2014 rank`
,State
,`State Code`
WHERE `2014 rank` >= 50 AND `2014 rank` < 60 AND `State Code` = "C";

SELECT * FROM tmp_data_geo;


3.4. JOIN
SELECT
dg1.`State Code`
,dg1.`2014 rank`
,dg2.`State Code`
,dg2.`2014 rank`
FROM data_geo AS dg1
JOIN data_geo AS dg2
ON dg1.`2014 rank` = dg2.`2014 rank`+1
AND dg1.`State Code` = dg2.`State Code`
WHERE dg1.`State Code` = "CA"
3.5. Common Table Expressions
WITH cte AS (
SELECT * FROM data_geo
WHERE `2014 rank` >= 50 AND `2014 rank` < 60
)
SELECT * FROM cte;
3.6. Inline tables
SELECT * FROM VALUES
("WA", "Seattle"),
("WA", "Tacoma"),
("WA", "Spokane") AS data(StateName, CityName)
3.7. EXISTS
WITH cte AS (
SELECT * FROM data_geo
WHERE `2014 rank` >= 50 AND `2014 rank` < 60
)
SELECT *
FROM data_geo as dg
WHERE
EXISTS (SELECT * FROM cte WHERE cte.city = dg.city)
AND NOT EXISTS (SELECT * FROM cte WHERE cte.city = dg.city AND `2015 median sales
price` IS NULL )

3.8.Window functions
SELECT
City
,State
,RANK() OVER (PARTITION BY State ORDER BY `2015 median sales price`) AS rank
,`2015 median sales price` AS MedianPrice
FROM data_geo
WHERE
`2015 median sales price` IS NOT NULL;
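The same ranking can also be expressed with the DataFrame API in a Python cell; a minimal sketch using PySpark window functions, assuming the data_geo table created above:
%python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank

# Rank cities by median sales price within each state
w = Window.partitionBy("State").orderBy(col("2015 median sales price"))
ranked = (spark.table("data_geo")
          .where(col("2015 median sales price").isNotNull())
          .withColumn("price_rank", rank().over(w))
          .select("City", "State", "price_rank", col("2015 median sales price")))
display(ranked)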
4. Exploring the visuals
Results of a SQL SELECT statement that are returned as a table can also be visualised. Given the following SQL statement:
SELECT `State Code`, `2015 median sales price` FROM data_geo
in the result cell you can select the plot icon and pick Map.
Furthermore, using "Plot Options..." you can change the settings of the variables on
the graph, aggregations and data series.
With additional query:
SELECT
`State Code`
, `2015 median sales price`
FROM data_geo_SQL
ORDER BY `2015 median sales price` DESC;
There are many other visuals available and many more SQL statements to explore; feel free to go a step further and beyond this blogpost.
Tomorrow we will explore Streaming with Spark Core API in Azure Databricks.
Complete set of code and SQL notebooks (including HTML) will be available at
the Github repository.
Happy Coding and Stay Healthy!

Dec 23 2020 - Using Spark Streaming in Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
Yesterday we took a closer look into the nuts and bolts of DataFrames using Spark
SQL and the power of using SQL to query data.
For today we will take a glimpse into Streaming with Spark Core API in Azure
Databricks.

Spark Streaming is the process that can analyse not only batches of data but also streams of data in near real-time. It enables powerful interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is a fault-tolerant system: thanks to the lineage of operations, Spark always remembers where you stopped, and in case of a worker failure another worker can recreate all the data transformations from the partitioned RDD (assuming that all the RDD transformations are deterministic).
Spark Streaming has native connectors to many data sources, such as HDFS, Kafka, S3, Kinesis and even Twitter.
Start your workspace in Azure Databricks. Create a new notebook, name it Day23_streaming and use the default language: Python. If you decide to use Event Hubs or read data from HDFS or other places, the Scala language might be slightly better suited.
Import the Spark context and streaming namespaces if you will be using them; otherwise the pyspark.sql imports are enough.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
We will be using the demo data from databricks-datasets folder:
%fs ls /databricks-datasets/structured-streaming/events/

And you can check the structure of one file, by using:


%fs head /databricks-datasets/structured-streaming/events/file-0.json
You must initialise the stream with:
 inputPath (where your files will be coming from)
 the schema of the input files
 the readStream function with the schema, the input path and additional options (such as picking one file at a time)
 an aggregate function (a count of events per time window, in this particular case) on top of the streaming DataFrame.
inputPath = "/databricks-datasets/structured-streaming/events/"

# Define the schema to speed up processing


jsonSchema = StructType([ StructField("time", TimestampType(), True),
StructField("action", StringType(), True) ])

streamingInputDF = (
spark
.readStream
.schema(jsonSchema) # Set the schema of the JSON data
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as stream of one
at a time
.json(inputPath)
)

streamingCountsDF = (
streamingInputDF
.groupBy(
streamingInputDF.action,
window(streamingInputDF.time, "1 hour"))
.count()
)
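Both DataFrames defined above are streaming DataFrames; a quick sanity check (not in the original post) is the isStreaming flag:
# True for streaming DataFrames, False for static ones
print(streamingInputDF.isStreaming)
print(streamingCountsDF.isStreaming)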
You start a streaming computation by defining a sink and starting it. In this case, to query the counts interactively, set the complete set of 1-hour counts to be stored in an in-memory table.
Run the following command to examine the outcome of a query.
query = (
streamingCountsDF
.writeStream
.format("memory") # memory = store in-memory table (for testing only)
.queryName("counts") # counts = name of the in-memory table
.outputMode("complete") # complete = all the counts should be in the table
.start()
)
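Once started, the query object exposes its state, so you can monitor the stream and stop it when you are done; a minimal sketch, assuming the query variable from the cell above:
# Inspect the running streaming query
print(query.status)        # current state of the query
print(query.lastProgress)  # metrics from the most recent micro-batch
# query.stop()             # uncomment to stop the stream when finished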
And once the stream is running, you can do a variety of analyses. The key component is the .start() method: it kicks off the streaming query, so Spark keeps processing files as they arrive.

You can also further shape the data by using Spark SQL:
%sql
SELECT
action
,date_format(window.end, "MMM-dd HH:mm") as time
,count
FROM counts
ORDER BY time, action
Tomorrow we will explore Spark's own MLlib package for Machine Learning using
Azure Databricks.
Complete set of code and SQL notebooks (including HTML) will be available at
the Github repository.
Happy Coding and Stay Healthy!
Dec 24 2020 - Using Spark MLlib for
Machine Learning in Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers
and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering
tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data
analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
 Dec 23: Using Spark Streaming in Azure Databricks
Yesterday we briefly touched on Spark Streaming as a Spark component on top of Spark Core.
Another important component is the Spark machine learning package called MLlib.
MLlib is a scalable machine learning library bringing quality algorithms and processing speed (due to improvements over Hadoop's map-reduce). Besides supporting several languages (Java, R, Scala, Python), it also brings pipelines for better data movement.
MLlib package brings you several covered topics:
 Basic statistics
 Pipelines and data transformation
 Classification and regression
 Clustering
 Collaborative filtering
 Frequent pattern mining
 Dimensionality reduction
 Feature selection and transformation
 Model Selection and tuning
 Evaluation metrics
The Apache Spark machine learning library (MLlib) allows data scientists to focus on
their data problems and models instead of solving the complexities surrounding
distributed data (such as infrastructure, configurations, and so on).
Now, let's create a new notebook. I named mine Day24_MLlib. And select Python
Language.
1.Load Data
We will use the sample data that is available in /databricks-datasets folder.
%fs ls databricks-datasets/adult/adult.data
And we will use Spark SQL to import the dataset into Spark Table:
%sql DROP TABLE IF EXISTS adult;
CREATE TABLE adult (
age DOUBLE,
workclass STRING,
fnlwgt DOUBLE,
education STRING,
education_num DOUBLE,
marital_status STRING,
occupation STRING,
relationship STRING,
race STRING,
sex STRING,
capital_gain DOUBLE,
capital_loss DOUBLE,
hours_per_week DOUBLE,
native_country STRING,
income STRING)
USING CSV
OPTIONS (path "/databricks-datasets/adult/adult.data", header "true")
And get the data into DataSet from Spark SQL table:
dataset = spark.table("adult")
cols = dataset.columns
2.Data Preparation
Since we are going to try algorithms like logistic regression, we will have to convert the categorical variables in the dataset into numeric variables. We will use one-hot encoding (and not category indexing).
One-hot encoding converts categories into binary vectors with at most one nonzero value: Blue: [1, 0], etc.
In this dataset, we have ordinal variables like education (Preschool - Doctorate), and
also nominal variables like relationship (Wife, Husband, Own-child, etc). For
simplicity's sake, we will use One-Hot Encoding to convert all categorical variables
into binary vectors. It is possible here to improve prediction accuracy by converting
each categorical column with an appropriate method.
Here, we will use a combination of StringIndexer and OneHotEncoder to convert
the categorical variables. The OneHotEncoder will return a SparseVector.
Since we will have more than 1 stage of feature transformations, we use a Pipeline to
tie the stages together; similar to chaining with R dplyr.
The predicted variable will be income, a binary variable with two values:
 "<=50K"
 ">50K"
All other variables will be used as features.
We will be using Spark MLlib for Python to continue the work. Let's load the packages for data pre-processing and preparation: Pipeline for chaining the stages, and StringIndexer, OneHotEncoder and VectorAssembler for encoding.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
We will index each categorical column using the StringIndexer, and then convert the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row.
We use the StringIndexer again to encode our labels into label indices.
categoricalColumns = ["workclass", "education", "marital_status", "occupation",
"relationship", "race", "sex", "native_country"]
stages = [] # stages in our Pipeline
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

# Convert label into label indices using the StringIndexer


label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]
Use a VectorAssembler to combine all the feature columns into a single vector
column. This goes for all types: numeric and one-hot encoded variables.
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss",
"hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
3. Running Pipelines
Run the stages as a Pipeline. This puts the data through all of the feature
transformations we described in a single call.
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)
Now we can do a logistic regression classification and fit the model on the prepared data:
from pyspark.ml.classification import LogisticRegression
# Fit model to prepped data
lrModel = LogisticRegression().fit(preppedDataDF)
And run ROC
# ROC for training data
display(lrModel, preppedDataDF, "ROC")
And check the fitted values (from the model) against the prepared dataset:
display(lrModel, preppedDataDF)
Now we can check the dataset with added labels and features:
selectedcols = ["label", "features"] + cols
dataset = preppedDataDF.select(selectedcols)
display(dataset)

4. Logistic Regression
In the Pipelines API, we are now able to perform Elastic-Net Regularization with
Logistic Regression, as well as other linear methods.
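The cells below train on trainingData and score testData, which are not created earlier in this post; a minimal sketch of the usual split (the 70/30 ratio and seed are assumptions for illustration):
# Split the prepared dataset into training and test sets (hypothetical 70/30 split)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count(), testData.count())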
from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model


lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

lrModel = lr.fit(trainingData)
And make predictions on the test dataset, using the transform() method, which only uses the 'features' vector column:
predictions = lrModel.transform(testData)
We can check the dataset:
selected = predictions.select("label", "prediction", "probability", "age",
"occupation")
display(selected)

5. Evaluating the model


We want to evaluate the model before doing anything else. This will give us a sense not only of the quality but also of under- or over-performance.
We can use BinaryClassificationEvaluator to evaluate our model. We can set the
required column names in rawPredictionCol and labelCol Param and the metric
in metricName Param. The default metric for
the BinaryClassificationEvaluator is areaUnderROC. Let's load the functions:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
and start with evaluation:
# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

And the score of the evaluated predictions is: 0.898976. What we want to do next is to fine-tune the model with the ParamGridBuilder and the CrossValidator. You can use explainParams() to see the list of parameters and their definitions. Set up the ParamGrid with regularization parameters, elastic-net parameters and the maximum number of iterations.
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation


paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())
And run the cross validation. I am using 5-fold cross-validation, and you will see how Spark distributes the load among the workers using Spark jobs. Since the ParamGrid matrix is prepared in a way that can be parallelised, Spark's massive compute power gives you much better and faster compute times.
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
evaluator=evaluator, numFolds=5)

# Run cross validations


cvModel = cv.fit(trainingData)
When the CV has finished, check the model accuracy again:
# Use test set to measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

# Evaluate best model


evaluator.evaluate(predictions)
The model accuracy, after cross validations, is 0.89732, which is relatively the same
as before CV. So the model was stable and accurate from the beginning and CV only
confirmed it.
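If you want to see how each parameter combination scored during cross validation, the fitted CrossValidatorModel keeps the average metric per combination; a minimal sketch, assuming cvModel and paramGrid from above:
# Average areaUnderROC for each parameter combination in the grid
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    readable = {p.name: v for p, v in params.items()}
    print(readable, round(metric, 5))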
You can also display the dataset:
selected = predictions.select("label", "prediction", "probability", "age",
"occupation")
display(selected)

You can also change the graphs here and explore each observation in the dataset.
The advent is here :) And I wish you all Merry Christmas and a Happy New Year 2021. The series will continue for a couple more days. And tomorrow we will explore Spark's GraphX and GraphFrames on the Spark Core API.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!

Dec 25 2020 - Using Spark GraphFrames in Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
 Dec 23: Using Spark Streaming in Azure Databricks
 Dec 24: Using Spark MLlib for Machine Learning in Azure Databricks
Yesterday we looked into the MLlib package for machine learning. And oh boy, there are so many topics to cover. But moving forward: today we will look into GraphFrames in Spark for Azure Databricks.
The last part of the high-level API on the Spark engine is graph processing with GraphX (legacy) and GraphFrames. GraphFrames is a computation engine built on top of the Spark Core API that lets end-users take advantage of Spark DataFrames in Python and Scala. It gives you the possibility to transform and build structured graph data at a massive scale.

In your workspace, create a new notebook called Day25_Graph and select the language: Python. We will need an ML Databricks cluster or have to install additional Python packages; I installed the graphframes Python package using the PyPI installer:
Before we begin, a couple of words that I would like to explain:
Edge (edges) - a link or a line between two nodes or points in the network.
Vertex (vertices) - a node or a point that has a relation to another node through a link.
Motif - a way to build more complex relationships involving edges and vertices. The motif finding result is a DataFrame in which the column names are given by the motif keys.
Stateful - combining GraphFrame motif finding with filters on the result, where the filters use sequence operations to operate over DataFrame columns. It is called stateful (vis-a-vis stateless) because it remembers the previous state.
Now you can start using the notebook. Import the packages that we will need.
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *
1.Create a sample dataset
We will create a sample dataset (taken from the Databricks website), inserted as DataFrames.
Vertices:
vertices = sqlContext.createDataFrame([
("a", "Alice", 34, "F"),
("b", "Bob", 36, "M"),
("c", "Charlie", 30, "M"),
("d", "David", 29, "M"),
("e", "Esther", 32, "F"),
("f", "Fanny", 36, "F"),
("g", "Gabby", 60, "F"),
("h", "Mark", 45, "M"),
("i", "Eddie", 60, "M"),
("j", "Mandy", 21, "F")
], ["id", "name", "age", "gender"])
Edges:
edges = sqlContext.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend"),
("a", "h", "follow"),
("a", "i", "follow"),
("a", "j", "follow"),
("j", "h", "friend"),
("i", "c", "follow"),
("i", "c", "friend"),
("b", "j", "follow"),
("d", "h", "friend"),
("e", "j", "friend"),
("h", "a", "friend")
], ["src", "dst", "relationship"])
Let's create a graph using vertices and edges:
graph_sample = GraphFrame(vertices, edges)
print(graph_sample)
Or you can achieve the same with:
# This example graph also comes with the GraphFrames package.
from graphframes.examples import Graphs
same_graph = Graphs(sqlContext).friends()
print(same_graph)
2.Querying graph
We can display Edges, vertices, incoming or outgoing degrees:
display(graph_sample.vertices)
#
display(graph_sample.edges)
#
display(graph_sample.inDegrees)
#
display(graph_sample.degrees)

And you can even combine filtering with aggregation functions:
youngest = graph_sample.vertices.groupBy().min("age")
display(youngest)
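Another simple query on this sample graph (a sketch, not from the original post) is counting how many "follow" relationships the edge DataFrame contains:
# Count the "follow" edges in the graph
numFollows = graph_sample.edges.filter("relationship = 'follow'").count()
print("Number of follow edges:", numFollows)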

3.Using motif
Using motifs you can build more complex relationships involving edges and vertices.
The following cell finds the pairs of vertices with edges in both directions between
them. The result is a DataFrame, in which the column names are given by the motif
keys.
# Search for pairs of vertices with edges in both directions between them.
motifs = graph_sample.find("(a)-[e]->(h); (h)-[e2]->(a)")
display(motifs)
4.Using Filter
You can filter the relationships between nodes by adding multiple predicates.
filtered = motifs.filter("(h.age > 30 or a.age > 30) and (a.gender = 'M' and h.gender = 'F')")
display(filtered)
# I guess Mark has a crush on Alice, but she just wants to be a follower :)

5. Stateful Queries
Stateful queries are a set of filters applied over a given sequence, hence the name. You can combine GraphFrame motif finding with filters on the result, where the filters use sequence operations to operate over DataFrame columns. The following is an example:
# Find chains of 4 vertices.
chain4 = graph_sample.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

# Query on sequence, with state (cnt)


# (a) Define method for updating state given the next element of the motif.
def cumFriends(cnt, edge):
    relationship = col(edge)["relationship"]
    return when(relationship == "friend", cnt + 1).otherwise(cnt)

# (b) Use sequence operation to apply method to sequence of elements in motif.


# In this case, the elements are the 3 edges.
edges = ["ab", "bc", "cd"]
numFriends = reduce(cumFriends, edges, lit(0))

chainWith2Friends2 = chain4.withColumn("num_friends", numFriends).where(numFriends >= 2)
display(chainWith2Friends2)
6.Standard graph algorithms
GraphFrames comes with a number of standard graph algorithms built in:
 Breadth-first search (BFS)
 Connected components
 Strongly connected components
 Label Propagation Algorithm (LPA)
 PageRank (regular and personalised)
 Shortest paths
 Triangle count
6.1.BFS - Breadth-first search; applying expression through edges
This searches the graph from a starting expression to an ending expression. This will look from A: the person named Esther, to B: everyone who is 30 or younger.
paths = graph_sample.bfs("name = 'Esther'", "age < 31")
display(paths)

Same result can be achieved with refined query:


filteredPaths = graph_sample.bfs(
fromExpr = "name = 'Esther'",
toExpr = "age < 31",
edgeFilter = "relationship != 'friend'",
maxPathLength = 3)
display(filteredPaths)
6.2. Shortest Path
Computes shortest paths to the given set of "landmark" vertices, where landmarks
are specified by vertex ID.
results = graph_sample.shortestPaths(landmarks=["a", "d"])
display(results)
#or
results = graph_sample.shortestPaths(landmarks=["a", "d", "h"])
display(results)
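From the list of built-in algorithms above, PageRank is another one worth a quick try; a minimal sketch, assuming the graphframes pageRank API with a fixed number of iterations:
# Run PageRank for a fixed number of iterations and list the most "important" vertices first
pr = graph_sample.pageRank(resetProbability=0.15, maxIter=10)
display(pr.vertices.orderBy("pagerank", ascending=False))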
Tomorrow we will explore how to connect Azure Machine Learning Services Workspace and Azure Databricks.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!

Dec 26 2020 - Connecting Azure Machine Learning Services Workspace and Azure Databricks

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers
and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering
tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data
analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
 Dec 23: Using Spark Streaming in Azure Databricks
 Dec 24: Using Spark MLlib for Machine Learning in Azure Databricks
 Dec 25: Using Spark GraphFrames in Azure Databricks
Yesterday we looked into GraphFrames in Azure Databricks and the capabilities of
calculating graph data.
Today we will look into Azure Machine Learning services.
What is Azure Machine Learning? It is a cloud-based environment you can use to
train, deploy, automate, manage, and track ML models. It can be used for any kind of
machine learning, from classical ML to deep learning, supervised, and unsupervised
learning.
It supports Python and R code with the SDK and also gives you the possibility to use the Azure Machine Learning studio designer as a "drag & drop", no-code option. It also supports out-of-the-box experiment tracking for the prediction model workflow, and managing, deploying and monitoring models with Azure Machine Learning.
Log in to your Azure Portal and select the Databricks service. On the main page of the Databricks service in the Azure Portal, select "Link Azure ML workspace".
After selecting "Link Azure ML workspace", you will be prompted to add additional information, mainly because we are connecting two separate services.

And many of these should already be available from previous days (Key Vault, Storage Account, Container Registry). You only need to create a new Application Insights resource.
Click Review + create, and then Create.
Once completed, you can download the deployment script or go directly to the resource.

Among the resources, one new resource will be created for you. Go to this resource.
You will see that you are introduced to the Machine Learning workspace and can launch the Studio.
Since this is the first-time setup of this new workspace connected to Azure Databricks, you will be prompted for additional information about your Active Directory account:

Hit that "Get started" button to launch the Studio.


You will be presented with a brand new Machine Learning Workspace. After you
instantiate your workspace, MLflow Tracking is automatically set to be tracked in all
of the following places:
 The linked Azure Machine Learning workspace.
 Your original ADB workspace.
All your experiments land in the managed Azure Machine Learning tracking service.
Now go to Azure Databricks and add some additional packages. In Azure Databricks
under your cluster that we already used for MLflow, install package azureml-
mlflow using PyPI.

Linking your ADB workspace to your Azure Machine Learning workspace enables you
to track your experiment data in the Azure Machine Learning workspace.
The following code should be in your experiment notebook IN AZURE DATABRICKS
(!) to get your linked Azure Machine Learning workspace.
import mlflow
import mlflow.azureml
import azureml.mlflow
import azureml.core

from azureml.core import Workspace

#Your subscription ID that you are running both Databricks and ML Service
subscription_id = 'subscription_id'

# Azure Machine Learning resource group NOT the managed resource group
resource_group = 'resource_group_name'

#Azure Machine Learning workspace name, NOT Azure Databricks workspace


workspace_name = 'workspace_name'

# Instantiate Azure Machine Learning workspace


ws = Workspace.get(name=workspace_name,
subscription_id=subscription_id,
resource_group=resource_group)

#Set MLflow experiment.


experimentName = "/Users/{user_name}/{experiment_folder}/{experiment_name}"
mlflow.set_experiment(experimentName)
If you want your models to be tracked and monitored only in Azure Machine
Learning service, add these two lines in Databricks notebook:
uri = ws.get_mlflow_tracking_uri()
mlflow.set_tracking_uri(uri)
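After the tracking URI and experiment are set, any MLflow run started in the Databricks notebook will show up in the Azure Machine Learning workspace; a minimal sketch of logging a test run (the parameter and metric names are just placeholders):
# Log a dummy run to verify that tracking lands in the Azure ML workspace
with mlflow.start_run():
    mlflow.log_param("example_param", "test")   # placeholder parameter
    mlflow.log_metric("example_metric", 0.99)   # placeholder metric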
Tomorrow we will look at how to connect to your Azure Databricks service from your client or on-premise machine.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!

Dec 27 2020 - Connecting Azure Databricks with on premise environment

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
 Dec 23: Using Spark Streaming in Azure Databricks
 Dec 24: Using Spark MLlib for Machine Learning in Azure Databricks
 Dec 25: Using Spark GraphFrames in Azure Databricks
 Dec 26: Connecting Azure Machine Learning Services Workspace and Azure
Databricks
Yesterday we connected the Azure Machine Learning services and Azure Databricks
workspace for tracking and monitoring experiments and models in Azure Machine
Learning.
Today we will connect an on-premise development environment (R or Python) with resources in Azure Databricks. In other words, we will run code on a client / on-prem machine and push all the workload to the cloud.

1.Connecting with R-Studio and SparkR


Launch your Azure Databricks workspace from Azure portal. Once the Azure
Databricks is launched, head to Clusters.
Start the cluster you will be using to connect to R Studio. Go to "Apps" tab under
clusters:

Before you click the "Set up RStudio" button, you will need to disable the auto termination option. It is enabled by default, making a cluster terminate itself after a period of inactivity. Under configuration, select Edit and disable the termination; the cluster will restart. Then click "Set up RStudio". Beware: because auto termination is now disabled, you will have to stop the cluster yourself after finishing your work (!).
Click Set up RStudio and you will get the following credentials:

And click the "Open RStudio" and you will get redirected to web portal with RStudio
opening.
In order to get the Databricks cluster objects into R Studio, you must also run the
spark_connect:
SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")
And you will see all the DataFrames or CSV Files from previous days:

Please note that you can also connect with the RStudio desktop version (!). If so, these are the steps:
Open your RStudio Desktop and install:
install.packages("devtools")
devtools::install_github("sparklyr/sparklyr")
Install Databricks-connect in CLI (it is a 250Mb Package):
pip uninstall pyspark
pip install -U databricks-connect
Now set the connections to Azure Databricks:
databricks-connect get-jar-dir
And after that run the command in CLI:
databricks-connect configure
CLI will look like a text input:
And all the information you need to fill into the CLI can be found in the workspace URL:

Databricks host: adb-860xxxxxxxxxx95.15.azuredatabricks.net


Cluster Name: 1128-xxxx-yaws18/
Organization: 860xxxxxxxxxx95
Port: 15001
Token: /////
2.Connecting with Python
Go to your cluster in Azure Databricks and straight to its configuration.
The cluster will need to have these two items added in the Advanced Options -> Spark Config section. Set these values on the cluster that you want to connect to from the on-premise or client machine.
spark.databricks.service.server.enabled true
spark.databricks.service.port 8787
These are key-value pairs, so there must be a space between the key and its value. Once you save this, the cluster will need to restart (click "Confirm & Restart")!
You can create a virtual environment; it is up to you. I will create a new environment for Databricks connections and Python. In your CLI run the following command:
conda create --name databricksconnect python=3.7
And activate the environment:
conda activate databricksconnect
If you are using an existing Python environment, I strongly suggest you first uninstall PySpark. This is due to the fact that the databricks-connect package brings its own version of PySpark (same as with R).
Now install the databricks-connect package. You can also pin the version of databricks-connect, e.g. pip install -U databricks-connect==5.2.*, depending on your Databricks cluster version.
#pip uninstall pyspark
pip install -U databricks-connect
After the installation, you will need to run the configurations again:
databricks-connect configure
Add all the needed settings as explained in the R section, and test your connection with:
databricks-connect test
Once you have all the settings in place, you are ready to use Databricks from your on-premise / client machine.
You have many ways to use it:
 Anaconda notebooks / Jupyter notebooks
 PyCharm
 Atom
 Visual Studio Code
 etc.
In Visual Studio Code (lightweight, open-source, multi-platform) you can set up this connection. Make sure you have the Python extension installed, and run the following script:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col
spark = SparkSession.builder.getOrCreate()
# Import a CSV from DBFS through the remote Databricks cluster
data = spark.read.format("csv").option("header", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

# Display the result; display() only exists in Databricks notebooks, so use show() locally
data.show()

And once you run this, you should have the results.
Make sure you also choose the correct Python interpreter in Visual Studio Code.
3. Connecting to Azure Databricks with ODBC
You can also connect Azure Databricks SQL tables using ODBC to your on-premise
Excel or to Python or to R.
You will only see the SQL tables through this connection, but it can also be done. It requires installing an ODBC driver, which I will not go into here.
Tomorrow we will look into Infrastructure as Code and how to automate, script and
deploy Azure Databricks.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!

Dec 28 2020 - Infrastructure as Code and how to automate, script and deploy Azure Databricks with Powershell

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks  platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers
and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering
tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data
analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
 Dec 23: Using Spark Streaming in Azure Databricks
 Dec 24: Using Spark MLlib for Machine Learning in Azure Databricks
 Dec 25: Using Spark GraphFrames in Azure Databricks
 Dec 26: Connecting Azure Machine Learning Services Workspace and Azure
Databricks
 Dec 27: Connecting Azure Databricks with on premise environment
Yesterday we looked into bringing the capabilities of Databricks closer to your client
machine, making coding, data wrangling and data science a little bit more convenient.
Today we will look into deploying a Databricks workspace using PowerShell.
You will need nothing but the CLI and PowerShell, which you already have. So, let's go into
the CLI and get the Azure PowerShell module.
In the CLI type:
if ($PSVersionTable.PSEdition -eq 'Desktop' -and (Get-Module -Name AzureRM -ListAvailable)) {
    Write-Warning -Message ('Az module not installed. Having both the AzureRM and ' +
        'Az modules installed at the same time is not supported.')
} else {
    Install-Module -Name Az -AllowClobber -Scope CurrentUser
}
After that, you can connect to your Azure subscription:
Connect-AzAccount
You will be prompted for your credentials. Once you enter them, you will get back your
Account, TenantId, Environment and Subscription Name.
Once connected, we will look into the Databricks module. To list all the available modules:
Get-Module -ListAvailable

To explore what is available in Az.Databricks, let's use the following PowerShell
command:
Get-Command -Module Az.Databricks

Now we can create a new workspace. In this manner you can already build "semi"
automation, but ARM will make the next steps even easier.
New-AzDatabricksWorkspace `
-Name databricks-test `
-ResourceGroupName testgroup `
-Location eastus `
-ManagedResourceGroupName databricks-group `
-Sku standard
Or we can use ARM (Azure Resource Manager) deployment:
$templateFile = "/users/template.json"
New-AzResourceGroupDeployment `
-Name blanktemplate `
-ResourceGroupName myResourceGroup `
-TemplateFile $templateFile
Or you can go through the deployment process in the Azure Portal and select a GitHub
template to create a new Azure Databricks workspace.
Or you can go under "Build your own template", take the template.json and parameters.json
files from the IaC folder of my GitHub repository, and paste their content in there.
First, add the new resource group:
New-AzResourceGroup -Name RG_123xyz -Location "westeurope"
And at the end run the automated deployment against the JSON files, adding also the
parameters file:
$templateFile = "/users/tomazkastrun/Documents/GitHub/Azure-Databricks/iac/template.json"
$parameterFile = "/users/tomazkastrun/Documents/GitHub/Azure-Databricks/iac/parameters.json"
New-AzResourceGroupDeployment -Name DataBricksDeployment -ResourceGroupName
RG_123xyz -TemplateFile $templateFile -TemplateParameterFile $parameterFile
This will take some time:

But you can always check what is happening in the Azure Portal:


And you can see the deployment status: 1 Deploying. Once it is finished, PowerShell will
return the status:

These values will be the same as the ones in the parameters.json file. In this manner you can
automate your deployment for continuous integration (CI) and continuous
deployment (CD).
Tomorrow we will dig into Apache Spark.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!

Dec 29 2020 - Performance tuning for Apache Spark

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
 Dec 23: Using Spark Streaming in Azure Databricks
 Dec 24: Using Spark MLlib for Machine Learning in Azure Databricks
 Dec 25: Using Spark GraphFrames in Azure Databricks
 Dec 26: Connecting Azure Machine Learning Services Workspace and Azure Databricks
 Dec 27: Connecting Azure Databricks with on premise environment
 Dec 28: Infrastructure as Code and how to automate, script and deploy Azure Databricks with Powershell
Yesterday we looked into PowerShell automation for Azure Databricks and how one
can create, update or remove the workspace, resource group and VNet using
deployment templates and parameters.
Today we will address Spark performance. We have talked several
times about the different languages available in Spark.

There are direct and indirect performance improvements that you can leverage to make
your Spark jobs run faster.
1.Choice of Languages
Java versus Scala versus R versus Python (versus HiveSQL)? There is no right or
wrong answer to this choice, but there are some important differences worth
mentioning. If you are running single-node machine learning, SparkR/R is a good
option, since R has a massive machine learning ecosystem and a lot of
optimised algorithms that can handle it.
If you are running an ETL job, Spark in combination with any language (R, Python,
Scala) will yield great results. Spark's Structured APIs are consistent in terms of
speed and stability across all the languages, so there should be almost no
difference. Things get much more interesting when there are UDFs (user defined
functions) that cannot be expressed directly with the Structured APIs. In this case, neither R
nor Python is a good idea, simply because of the way the Structured API manifests
and transforms as RDDs. In general, Python is a better choice than R for writing
UDFs, but probably the best way is to write the UDF in Scala (or Java), making
these language jumps easier for the API interpreter.
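To illustrate the UDF point, here is a small PySpark sketch (not from the original post) comparing a Python UDF with the equivalent built-in function; the built-in version stays inside the optimised Structured API and avoids the Python serialisation round-trip. The DataFrame and column names are made up for the example.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("scala",), ("python",), ("r",)], ["value"])

# Python UDF: every row is shipped to a Python worker and back
to_upper_udf = udf(lambda s: s.upper(), StringType())
df.withColumn("upper_udf", to_upper_udf(col("value"))).show()

# Built-in function: executed inside the JVM by the Catalyst/Tungsten engine
df.withColumn("upper_builtin", upper(col("value"))).show()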
2.Choice of data representation
DataFrames versus Datasets versus SQL versus RDDs is another choice, yet it is a fairly
easy one. DataFrames, Datasets and SQL objects are all equal in performance and
stability (at least from Spark 2.3 onward), meaning that if you are using
DataFrames in any language, the performance will be the same. Again, when writing
custom objects or functions (UDFs), there will be some performance degradation with
both R and Python, so switching to Scala or Java might be an optimisation.
The rule of thumb is: stick to DataFrames. If you go a layer down to RDDs, Spark's
optimisation engine will write better RDD code than you will, and with certainly less
effort; by hand-writing RDD code you might also lose the additional Spark
optimisations that come with new releases.
When using RDDs, try to use Scala or Java. If this is not possible and you will be
using Python or R extensively, use RDDs as little as possible and
convert to DataFrames as quickly as possible. Again, if your Spark code, application
or data engineering task is not compute intensive, it should be fine; otherwise
remember to use Scala or Java or convert to DataFrames. Neither Python nor R
handles serialisation of RDDs optimally: a lot of data is moved to and from the
Python or R engine, causing heavy data movement and traffic, potentially making
RDDs unstable and performance poor.
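As a small sketch of the "convert to DataFrames as quickly as possible" advice (the column names are illustrative, and sc / spark are assumed to be the Databricks notebook's SparkContext and SparkSession):
# Start with an RDD (e.g. freshly parsed records)
rdd = sc.parallelize([("cat", 2), ("dog", 5), ("cat", 7)])

# Convert to a DataFrame as early as possible and let Catalyst optimise the rest
df = rdd.toDF(["animal", "count"])
df.groupBy("animal").sum("count").show()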
3. Data Storage
Storing data effectively is relevant when data will be read multiple times. If data will
be accessed many times, either by different users in the organisation or by a single
user doing data analysis, make sure to store it for efficient reads. Choosing the
storage, the data format and the data partitioning is important.
With numerous file formats available, there are some key differences. If you want to
optimise your Spark job, data should be stored in the best possible format for it. In
general, always favour structured, binary types to store your data, especially when
it is accessed frequently. Although CSV files look well formatted, they are
obnoxiously sparse, can have "edge" cases (missing line breaks or other delimiters),
are painfully slow to parse and hard to partition. The same logic applies to txt and xml
formats. Avro is row-oriented (with a JSON-defined schema) and also sparse, and I am
not even going to talk about the XML format. Spark works best with data stored in
Apache Parquet. The Parquet format stores data in binary files with column-oriented
storage and also tracks some statistics about the files, making it possible to skip
files not needed for a query.
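For illustration, a short sketch that converts a CSV source to Parquet once so that later reads benefit from the columnar format (the paths are placeholders and spark is the notebook's SparkSession):
# Read the CSV once, then persist it as Parquet for all subsequent reads
csv_df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/path/to/raw/data.csv"))          # placeholder path

csv_df.write.mode("overwrite").parquet("/path/to/curated/data_parquet")

# Later reads profit from column pruning and per-file statistics
parquet_df = spark.read.parquet("/path/to/curated/data_parquet")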
4.Table partitioning and bucketing
Table partitioning refers to storing files in separate directories based on a
partition key (e.g. date of purchase, VAT number), such as a date field in the stored
data. Partitioning helps Spark skip files that are not needed for the end
result and return only the data that is in the range of the key. There are
potential pitfalls to this technique, one for sure being the size of these subdirectories
and how to choose the right granularity.
Bucketing is a process of "pre-partitioning" data to allow better joins and
aggregation operations. It improves performance, because data can
be consistently distributed across partitions as opposed to all being in one partition.
So if joins are frequently performed on a particular column immediately after a read,
you can use bucketing to ensure that the data is well partitioned in accordance with
those values. This prevents a shuffle before the join and speeds up data access.
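A minimal PySpark sketch of both techniques (sales_df, the column names and the paths are illustrative; bucketing has to be written with saveAsTable):
# Partitioning: one sub-directory per purchase date, so date filters can skip whole folders
(sales_df.write
    .mode("overwrite")
    .partitionBy("purchase_date")
    .parquet("/path/to/sales_partitioned"))

# Bucketing: pre-distribute the data on customer_id into 32 buckets and save it as a table,
# so later joins/aggregations on customer_id can avoid an extra shuffle
(sales_df.write
    .mode("overwrite")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("sales_bucketed"))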
5.Parallelism
Splittable data formats make Spark jobs easier to run in parallel. A ZIP or TAR archive
cannot be split, which means that even if you have 10 files in a ZIP file and 10 cores,
only one core can read in that data, because Spark cannot parallelise across the ZIP file.
GZIP, BZIP2 or LZ4 files, on the other hand, are generally splittable if (and only if) they
are written by a parallel processing framework like Spark or Hadoop.
In general, Spark will work best when there are two to three tasks per CPU core in
your cluster, especially when working with large (big) data. You can also tune
the spark.default.parallelism property.
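These settings can be adjusted on the Spark session, for example (the values are only illustrative and depend on your cluster size):
# Rough guideline: aim for 2-3 tasks per CPU core in the cluster.
# spark.sql.shuffle.partitions can be changed at runtime on the session:
spark.conf.set("spark.sql.shuffle.partitions", 200)

# spark.default.parallelism (used by RDD operations) is normally set when the
# cluster is created, e.g. in the cluster's Spark config:
#   spark.default.parallelism 64
print(spark.conf.get("spark.sql.shuffle.partitions"))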
6.Number of files
With numerous small files you will for sure pay a price for listing and fetching all the
data. There is no golden rule on the number of files and the size of the files per directory,
but there are some guidelines. Many small files are going to make the scheduler
work harder to locate the data and launch all the read tasks, which increases not
only disk I/O but also network traffic. On the other end of the spectrum, having fewer
but larger files eases the workload on the scheduler, but it makes individual tasks run
longer. A rule of thumb is to size the files so that they contain a few tens of megabytes
of data. From Spark 2.2 onward there are also options to control partitioning and
file sizing.
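One common way to keep the number of output files under control is to repartition (or coalesce) before writing; a small sketch, with df and the paths being placeholders:
# Write roughly 8 output files instead of one file per input partition
(df.repartition(8)
   .write
   .mode("overwrite")
   .parquet("/path/to/output"))

# coalesce(n) avoids a full shuffle and is cheaper when you only reduce the file count
df.coalesce(1).write.mode("overwrite").csv("/path/to/single_file_output")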
7. Temporary data storage
Data that will be reused constantly is a great candidate for caching. Caching will
place a DataFrame, Dataset, SQL table or RDD into temporary storage (either
memory or disk) across the executors in your cluster. You might want to cache only
datasets that will be used several times later on, but do not rush into it, because
caching also has its own costs: serialisation, deserialisation and storage. You can
tell Spark to cache data by calling the cache command on a DataFrame or RDD.
Let's put this to the test. In Azure Databricks create a new
notebook Day29_tuning with language Python and attach the notebook to your
cluster. Load a sample CSV file:
%python
DF1 = (spark.read.format("csv")
       .option("inferSchema", "true")
       .option("header", "true")
       .load("/databricks-datasets/COVID/covid-19-data/us-states.csv"))
The bigger the files, the more evident the difference will be. Create some
aggregations:
DF2 = DF1.groupby("state").count().collect()
DF3 = DF1.groupby("date").count().collect()
DF4 = DF1.groupby("cases").count().collect()
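One simple way to track the timing is sketched below using Python's time module (in a Databricks notebook you can also just read the execution time displayed under each cell):
import time

start = time.time()
DF2 = DF1.groupby("state").count().collect()
print("groupby state took %.2f seconds" % (time.time() - start))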
After you have tracked the timing, let's cache DF1:
DF1.cache()
DF1.count()
And rerun the previous command:
DF2 = DF1.groupby("state").count().collect()
DF3 = DF1.groupby("date").count().collect()
DF4 = DF1.groupby("cases").count().collect()
And you should see the difference in the results. As mentioned before, the bigger the
dataset, the more time you gain back by caching data.
Today we have touched on a couple of performance tuning points and the approach
one should take to improve the work of Spark in Azure Databricks. These are
probably the most frequent performance tunings and are relatively easy to adjust.
Tomorrow we will look further into Apache Spark.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!

Dec 30 2020 - Monitoring and troubleshooting of Apache Spark

Azure Databricks repository is a set of blogposts as an Advent of 2020 present to
readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
 Dec 01: What is Azure Databricks
 Dec 02: How to get started with Azure Databricks
 Dec 03: Getting to know the workspace and Azure Databricks platform
 Dec 04: Creating your first Azure Databricks cluster
 Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
 Dec 06: Importing and storing data to Azure Databricks
 Dec 07: Starting with Databricks notebooks and loading data to DBFS
 Dec 08: Using Databricks CLI and DBFS CLI for file upload
 Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
 Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
 Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
 Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
 Dec 13: Using Python Databricks Koalas with Azure Databricks
 Dec 14: From configuration to execution of Databricks jobs
 Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
 Dec 16: Databricks experiments, models and MLFlow
 Dec 17: End-to-End Machine learning project in Azure Databricks
 Dec 18: Using Azure Data Factory with Azure Databricks
 Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
 Dec 20: Orchestrating multiple notebooks with Azure Databricks
 Dec 21: Using Scala with Spark Core API in Azure Databricks
 Dec 22: Using Spark SQL and DataFrames in Azure Databricks
 Dec 23: Using Spark Streaming in Azure Databricks
 Dec 24: Using Spark MLlib for Machine Learning in Azure Databricks
 Dec 25: Using Spark GraphFrames in Azure Databricks
 Dec 26: Connecting Azure Machine Learning Services Workspace and Azure
Databricks
 Dec 27: Connecting Azure Databricks with on premise environment
 Dec 28: Infrastructure as Code and how to automate, script and deploy Azure
Databricks with Powershell
 Dec 29: Performance tuning for Apache Spark
Yesterday we looked into performance tuning for improving the day-to-day usage of
Spark and Azure Databricks. Today we will explore monitoring (which we started on
Day 15) and troubleshooting of the most common mistakes or errors a user will
encounter in Azure Databricks.

1.Monitoring
Spark in Databricks is relatively well taken care of and can be monitored from the Spark UI.
Since Databricks is an encapsulated platform, Azure is in a way managing many of the
components for you, from the network to the JVM (Java Virtual Machine), the host
operating system and many of the cluster components (Mesos, YARN and any other Spark
cluster applications).
As we have seen in the Day 15 post, you can monitor queries, tasks, jobs, Spark logs
and the Spark UI in Azure Databricks. Spark logs will help you pinpoint the problem that
you are encountering. They are also good for building a history of logs to understand the
behaviour of a job or task over time and for possible future troubleshooting.
The Spark UI is a good visual way to monitor what is happening on your cluster and offers
a great set of metrics for troubleshooting.
It also gives you detailed information on Spark tasks and a great visual presentation of
the task runs, SQL runs and a detailed view of all the stages.

All of the tasks can also be visualized as a DAG:


2. Troubleshooting
Approaching Spark debugging, let me give you some causes, symptoms and signs
of problems in your Spark jobs and in the Spark engine itself. There are many issues one
can encounter; I will try to tackle a couple of those that show up as a return message
in Databricks notebooks or in the Spark UI in general.
2.1. Spark job not started
This issue can appear frequently, especially if you are a beginner, but it can also happen
when Spark is running standalone (not in Azure Databricks).
Signs and symptoms:
- Spark jobs don't start
- Spark UI does not show any nodes on the cluster (except the driver)
- Spark UI is reporting vague information
Potential solutions:
- the cluster is not started or is still starting up,
- this often happens with a poorly configured cluster (usually when running standalone Spark
applications and (almost) never with Azure Databricks), either an IP, network or VNet issue,
- it can be a memory configuration issue and should be reconfigured in the start-up
scripts.
2.2. Error during execution of notebook
While working in notebooks on a cluster that is already running, it can happen that
some part of the code or a Spark job that was previously running fine starts to
fail.
Signs and symptoms:
- a job runs successfully on all clusters but fails on one
- code blocks in a notebook normally run in sequence, but one run fails
- a HiveSQL table or an R/Python DataFrame that used to be created normally can no
longer be created
Potential solutions:
- check that your data still exists at the expected location and that it is still in the
same file format
- if you are running a SQL query, check that the query is valid and that all the column names
are correct
- go through the stack trace and try to figure out which component is failing
2.3. Cluster unresponsive
When running notebook commands or using Spark apps (widgets, etc.), you can get
a message that the cluster is unresponsive. This is a severe error and should be
handled right away.
Signs and symptoms:
- a code block is not executed and fails with loads of JVM responses
- you get an error message that the cluster is unresponsive
- a Spark job is running with no return or error message.
Potential solutions:
- restart the cluster and attach the notebook to the cluster,
- check the dataset for any inconsistencies and the data size (limitations of the uploaded file
or the distribution of the files over DBFS),
- check the compatibility of the installed libraries and the Spark version on your cluster,
- change the cluster setting from standard, GPU or ML to LTS; Long-Term Support
Spark installations tend to have a greater span of compatibility,
- if you are using a high-concurrency cluster, check who is running what; there may be
a potential "dead-lock" in some tasks that consume too many resources.
2.4. Fail to load data
Loading data is probably the most important task in Azure Databricks, and there are
many ways data can fail to show up in the notebook.
Signs and symptoms:
- data is stored in blob storage and cannot be accessed or loaded into Databricks
- data is taking too long to load, and I stop the load process
- data should be at the location, but it is not
Potential solutions:
- if you are reading the data from Azure Blob storage, check that Azure Databricks
has all the needed credentials for access,
- loading data files that are wide (1000+ columns) might cause problems for
Spark; load the schema first and create the DataFrame with Scala, then insert the
data into the frame (a similar approach in PySpark is sketched below),
- check that the persisted data (DBFS) is at the correct location and in the expected data
format; it can also happen that different sample files are used, which in this case
might be missing from the standard DBFS path,
- the DataFrame or Dataset was created in a different language than the one you are trying
to read it with; languages sit on top of the Structured API and should be
interchangeable, so check your code for inconsistencies.
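A small PySpark sketch of the "load the schema first" idea from the list above (the column names and the path are placeholders; the same approach works in Scala):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema up front instead of inferring it from a very wide file
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)                      # skips schema inference entirely
      .load("/mnt/mydata/wide_file.csv"))  # placeholder path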
2.5. Unexpected Null in Results
Signs and symptoms:
- unexpected Null values in Spark transformations
- scheduled jobs that used to work no longer work, or no longer produce the correct
results
Potential solutions:
- it can be caused by a format change in the underlying data,
- use an accumulator to count the number of rows (records/observations) and to catch or
log the rows where parsing fails (a small sketch follows below),
- check and ensure that transformations in the data return a valid SQL query plan; watch out
for implicit data-type conversions (a "15" can be a string, not a number), which can make
Spark return a strange result or no result.
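A small sketch of the accumulator idea from the list above, counting rows whose value fails to parse while the job runs (df and the column name amount are placeholders):
# Accumulator that counts records which could not be parsed as a number
bad_rows = spark.sparkContext.accumulator(0)

def parse_amount(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        bad_rows.add(1)
        return None

parsed = df.rdd.map(lambda row: parse_amount(row["amount"]))
parsed.count()   # the action triggers the computation and the accumulator updates
print("rows that failed to parse:", bad_rows.value)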
2.6. Slow Aggregations
This is a fairly common problem and also one of the hardest to tackle. It usually happens
because of an unevenly distributed workload across the cluster or because of a hardware
failure (one disk / VM is unresponsive).
Signs and symptoms:
- slow tasks caused by a .groupBy() call
- jobs are still slow after the data aggregation
Potential solutions:
- try changing the partitioning of the data to have less data per partition,
- try changing the partition key on your dataset,
- check that your SELECT statement is gaining performance from the partitions,
- if you are using RDDs, try creating a DataFrame or Dataset to get the aggregations
done faster.
Tomorrow we will finish the series by looking into sources, documentation and next
learning steps; it should be a nice way to wrap up the series.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!

RDD API examples


Word count
In this example, we use a few transformations to build a dataset of (String, Int) pairs
called counts and then save it to a file.
Python:
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Pi estimation
Spark can also be used for compute-intensive tasks. This code estimates π by
"throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1))
and see how many fall in the unit circle. The fraction should be π / 4, so we use this
to get our estimate.
Python:
import random

NUM_SAMPLES = 100000  # any sample count will do; more samples give a better estimate

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

DataFrame API examples


In Spark, a DataFrame is a distributed collection of data organized into named
columns. Users can use DataFrame API to perform various relational operations on
both external data sources and Spark’s built-in distributed collections without
providing specific procedures for processing data. Also, programs based on
DataFrame API will be automatically optimized by Spark’s built-in optimizer, Catalyst.
Text search
In this example, we search through the error messages in a log file.
Python:
from pyspark.sql import Row
from pyspark.sql.functions import col

textFile = sc.textFile("hdfs://...")

# Creates a DataFrame having a single column named "line"
df = textFile.map(lambda r: Row(r)).toDF(["line"])
errors = df.filter(col("line").like("%ERROR%"))
# Counts all the errors
errors.count()
# Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
# Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Simple data operations


In this example, we read a table stored in a database and calculate the number of
people for every age. Finally, we save the calculated result to S3 in the format of
JSON. A simple MySQL table "people" is used in the example and this table has two
columns, "name" and "age".
Python:
# Creates a DataFrame based on a table named "people"
# stored in a MySQL database.
url = \
    "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword"
df = sqlContext \
    .read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "people") \
    .load()

# Looks the schema of this DataFrame.
df.printSchema()

# Counts people by age
countsByAge = df.groupBy("age").count()
countsByAge.show()

# Saves countsByAge to S3 in the JSON format.
countsByAge.write.format("json").save("s3a://...")

// hello world
val hello = "Hello, world"
hello: String = Hello, world
Arrays
Array(1,2,3,4,5).foreach(println)
println()
Array("Java","Scala","Python","R","Spark").foreach(println)
1 2 3 4 5 Java Scala Python R Spark
Functional Programming Overview
// assign a function to variable
val inc = (x : Int) => x + 1
inc(7)
inc: Int => Int = <function1> res41: Int = 8
// passing a function as parameter
(1 to 5) map (inc)
res42: scala.collection.immutable.IndexedSeq[Int] = Vector(2, 3, 4, 5, 6)
// take even number, multiply each value by 2 and sum them up
(1 to 7) filter (_ % 2 == 0 ) map (_ * 2) reduce (_ + _)
res43: Int = 24
// Scala
val name = "Scala"
val hasUpperCase = name.exists(_.isUpper)
name: String = Scala hasUpperCase: Boolean = true
var i = 10
loopWhile(i > 0) {
println(i)
i -= 1
}
def loopWhile(cond: => Boolean)(f: => Unit) : Unit = {
if (cond) {
f
loopWhile(cond)(f)
}
}
10 9 8 7 6 5 4 3 2 1 i: Int = 0 loopWhile: (cond: => Boolean)(f: =>
Unit)Unit
Mutable and Immutable Variables
// mutable variables
var counter:Int = 10
var d = 0.0
var f = 0.3f

// immutable variables
val msg = "Hello Scala"
println(msg)
s"Greeting: $msg"

val π = scala.math.Pi
println(π)
Hello Scala 3.141592653589793 counter: Int = 10 d: Double = 0.0 f: Float = 0.3 msg: String = Hello Scala π: Double = 3.141592653589793
String Interpolation
// string interpolation
val course = "Spark With Scala"
println(s"I am taking course $course.")
// support arbitrary expressions
println(s"2 + 2 = ${2 + 2}")
val year = 2017
println(s"Next year is ${year + 1}")
I am taking course Spark With Scala. 2 + 2 = 4 Next year is 2018 course:
String = Spark With Scala year: Int = 2017
Looping Constructs
// looping constructs
var i = 0
do {
println(s"Hello, world #$i")
i = i + 1
} while (i <= 5)
println()

for (j<- 1 to 5) {
println(s"Hello, world #$j")
}
println()

// what will be printed?


for (i <- 1 to 3) {
var i = 2
println(i)
}
Hello, world #0 Hello, world #1 Hello, world #2 Hello, world #3 Hello,
world #4 Hello, world #5 Hello, world #1 Hello, world #2 Hello, world #3
Hello, world #4 Hello, world #5 2 2 2 i: Int = 6
Functions
// defining functions
def hello(name:String) : String = { "Hello " + name }

def hello1() = { "Hi there!" }


def hello2() = "Hi there!"
def hello3 = "Hi there!"

def max(a:Int, b:Int) : Int = if (a > b) a else b


max(4,6)
max(8,3)
hello: (name: String)String hello1: ()String hello2: ()String hello3:
String max: (a: Int, b: Int)Int res47: Int = 8
Function Literals
// function literals
(x: Int, y: Int) => x + y
val sum = (x: Int, y: Int) => x + y
val prod = (x: Int, y: Int) => x * y

def doIt(msg:String, x:Int, y:Int, f: (Int, Int) => Int) = {


println(msg + f(x,y))
}
doIt("sum: ", 1, 80, sum)
doIt("prod: ", 2, 33, prod)
sum: 81 prod: 66 sum: (Int, Int) => Int = <function2> prod: (Int, Int) =>
Int = <function2> doIt: (msg: String, x: Int, y: Int, f: (Int, Int) =>
Int)Unit
Tuples
// tuple
val pair1 = ("Scala", 1)
println(pair1._1)
println(pair1._2)
val pair2 = ("Scala", 1, 2017)
println(pair2._3)

Scala 1 2017 pair1: (String, Int) = (Scala,1) pair2: (String, Int, Int) =
(Scala,1,2017)
Classes
// class

// constructor with two private instance variables


class Movie(name:String, year:Int)

// With two getter methods


class Movie1(val name:String, val year:Int)
val m1 = new Movie1("Star Wars", 1977)
println(m1.name + " " + m1.year)

// With two getter and setter methods


class Movie2(var name:String, var year:Int, var rating:String)
val m2 = new Movie2("Alien", 1979, "R")
m2.name = "Alien: Director's Edition"
println(m2.name + " Released: " + m2.year + " Rated: " + m2.rating)
Star Wars 1977 Alien: Director's Edition Released: 1979 Rated: R defined
class Movie defined class Movie1 m1: Movie1 = $iwC$$iwC$$iwC$$iwC$$iwC$
$iwC$Movie1@49882d3 defined class Movie2 m2: Movie2 = $iwC$$iwC$$iwC$$iwC$
$iwC$$iwC$Movie2@775c5894 m2.name: String = Alien: Director's Edition
warning: previously defined object Movie is not a companion to class Movie.
Companions must be defined together; you may wish to use :paste mode for
this.
Case Classes
// case class
case class Movie(name:String, year:Int)
val m = Movie("Avatar", 2009)
m.toString
println(m.name + " " + m.year)
Avatar 2009 defined class Movie m: Movie = Movie(Avatar,2009)
Pattern Matching with Case Class
// pattern matching with case class
abstract class Shape
case class Rectangle(h:Int, w:Int) extends Shape
case class Circle(r:Int) extends Shape

def area(s:Shape) = s match {


case Rectangle(h,w) => h * w
case Circle(r) => r * r * 3.14
}

println(area(Rectangle(4,5)))
println(area(Circle(5)))
20.0 78.5 defined class Shape defined class Rectangle defined class Circle
area: (s: Shape)Double
Arrays
// array
val myArray = Array(1,2,3,4,5);
myArray.foreach(a => print(a + " "))
println
myArray.foreach(println)
1 2 3 4 5 1 2 3 4 5 myArray: Array[Int] = Array(1, 2, 3, 4, 5)
Lists
// list
val l = List(1,2,3,4);
l.foreach(println)
println()

println(l.head) //==> 1
println(l.tail) //==> List(2,3,4)
println(l.last) //==> 4
println(l.init) //==> List(1,2,3)
println()

val table: List[List[Int]] = List (


List(1,0,0),
List(0,1,0),
List(0,0,1)
)
1 2 3 4 1 List(2, 3, 4) 4 List(1, 2, 3) l: List[Int] = List(1, 2, 3, 4)
table: List[List[Int]] = List(List(1, 0, 0), List(0, 1, 0), List(0, 0, 1))
Working with Lists
// working with lists
val list = List(2,3,4);

// cons operator – prepend a new element to the beginning


val m = 1::list

// appending
val n = list :+ 5

// to find out whether a list is empty or not


println("empty list? " + m.isEmpty)

// take the first n elements


list.take(2) //==> List(2,3)

// drop the first n elements


list.drop(2) //==> List(4)
empty list? false list: List[Int] = List(2, 3, 4) m: List[Int] = List(1, 2,
3, 4) n: List[Int] = List(2, 3, 4, 5) res54: List[Int] = List(4)
High Order List Methods
// high order list methods
val n = List(1,2,3,4)
val s = List("LNKD", "GOOG", "AAPL")
val p = List(265.69, 511.78, 108.49)
var product = 1;
n.foreach(product *= _) //==> 24
n.filter(_ % 2 != 0) //==> List(1,3)
n.partition(_ % 2 != 0) //==> (List(1,3), List(2,4))
n.find(_ % 2 != 0) //==> Some(1)
n.find(_ < 0) //==> None
p.takeWhile(_ > 200.00) //==> List(265.69, 511.78)
p.dropWhile(_ > 200.00) //==> List(108.49)
val p2 = List(265.69, 50.11, 511.78, 108.49)
p2.span(_ > 200.00) //==> (List(265.69),List(50.11, 511.78,108.49))

n: List[Int] = List(1, 2, 3, 4) s: List[String] = List(LNKD, GOOG, AAPL) p:


List[Double] = List(265.69, 511.78, 108.49) product: Int = 24 p2:
List[Double] = List(265.69, 50.11, 511.78, 108.49) res55: (List[Double],
List[Double]) = (List(265.69),List(50.11, 511.78, 108.49))
// high order list methods
val n = List(1,2,3,4)
val s = List("LNKD", "GOOG", "AAPL")
n.map(_ + 1) //==> List(2,3,4,5)
s.flatMap(_.toList) //==> List(L,N,K,D,G,O,O,G,A,A,P,L)
n.reduce((a,b) => { a + b} ) //==> 10
n.contains(3) //==> true
n: List[Int] = List(1, 2, 3, 4) s: List[String] = List(LNKD, GOOG, AAPL)
res56: Boolean = true
Pattern Matching with Lists
// pattern matching with lists
val n = List(1,2,3,4)
val s = List("LNKD", "GOOG", "AAPL")
def sum(xs: List[Int]) : Int = xs match {
case Nil => 0
case x :: ys => x + sum(ys)
}

val dups = List(1,2,3,4,6,3,2,7,9,4)

// challenge
//def removeDups(xs : List[int]) : List[Int] = xs match {
// todo
//}
n: List[Int] = List(1, 2, 3, 4) s: List[String] = List(LNKD, GOOG, AAPL)
sum: (xs: List[Int])Int dups: List[Int] = List(1, 2, 3, 4, 6, 3, 2, 7, 9,
4)

Our research group has a very strong focus on using and improving Apache Spark
to solve real world problems. In order to do this we need to have a very solid
understanding of the capabilities of Spark. So one of the first things we have done
is to go through the entire Spark RDD API and write examples to test their
functionality. This has been a very useful exercise and we would like to share the
examples with everyone.

Authors of examples: Matthias Langer and Zhen He


Emails addresses: m.langer@latrobe.edu.au, z.he@latrobe.edu.au

These examples have only been tested for Spark version 1.4. We assume the
functionality of Spark is stable and therefore the examples should be valid for later
releases.

If you find any errors in the example we would love to hear about them so we can
fix them up. So please email us to let us know.

The RDD API By Example

RDD is short for Resilient Distributed Dataset. RDDs are the workhorse of the
Spark system. As a user, one can consider a RDD as a handle for a collection of
individual data partitions, which are the result of some computation.
However, an RDD is actually more than that. On cluster installations, separate data
partitions can be on separate nodes. Using the RDD as a handle one can access all
partitions and perform computations and transformations using the contained data.
Whenever a part of a RDD or an entire RDD is lost, the system is able to
reconstruct the data of lost partitions by using lineage information. Lineage refers
to the sequence of transformations used to produce the current RDD. As a result,
Spark is able to recover automatically from most failures.
All RDDs available in Spark derive either directly or indirectly from the class
RDD. This class comes with a large set of methods that perform operations on the
data within the associated partitions. The class RDD is abstract. Whenever one
uses a RDD, one is actually using a concretized implementation of RDD. These
implementations have to overwrite some core functions to make the RDD behave
as expected.
One reason why Spark has lately become a very popular system for processing big
data is that it does not impose restrictions regarding what data can be stored within
RDD partitions. The RDD API already contains many useful operations. But,
because the creators of Spark had to keep the core API of RDDs common enough
to handle arbitrary data-types, many convenience functions are missing.
The basic RDD API considers each data item as a single value. However, users
often want to work with key-value pairs. Therefore Spark extended the interface of
RDD to provide additional functions (PairRDDFunctions), which explicitly work
on key-value pairs. Currently, there are four extensions to the RDD API available
in spark. They are as follows:
DoubleRDDFunctions
This extension contains many useful methods for aggregating numeric values.
They become available if the data items of an RDD are implicitly convertible to
the Scala data-type double.
PairRDDFunctions
Methods defined in this interface extension become available when the data
items have a two-component tuple structure. Spark will interpret the first
tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2)
as the associated value.
OrderedRDDFunctions
Methods defined in this interface extension become available if the data items
are two-component tuples where the key is implicitly sortable.
SequenceFileRDDFunctions
This extension contains several methods that allow users to create Hadoop
sequence files from RDDs. The data items must be two-component key-
value tuples as required by the PairRDDFunctions. However, there are
additional requirements considering the convertibility of the tuple
components to Writable types.
Since Spark will make methods with extended functionality automatically
available to users when the data items fulfill the above described requirements, we
decided to list all possible available functions in strictly alphabetical order. We will
append one of the following to the function-name to indicate it belongs to an
extension that requires the data items to conform to a certain format or type.
[Double] - Double RDD Functions
[Ordered] - OrderedRDDFunctions
[Pair] - PairRDDFunctions
[SeqFile] - SequenceFileRDDFunctions

aggregate
The aggregate function allows the user to apply two different reduce functions to
the RDD. The first reduce function is applied within each partition to reduce the
data within each partition into a single result. The second reduce function is used to
combine the different reduced results of all partitions together to arrive at one final
result. The ability to have two separate reduce functions for intra partition versus
across partition reducing adds a lot of flexibility. For example the first reduce
function can be the max function and the second one can be the sum function. The
user also specifies an initial value. Here are some important facts.
 The initial value is applied at both levels of reduce. So both at the intra
partition reduction and across partition reduction.
 Both reduce functions have to be commutative and associative.
 Do not assume any execution order for either partition computations or
combining partitions.
 Why would one want to use two input data types? Let us assume we do an
archaeological site survey using a metal detector. While walking through the
site we take GPS coordinates of important findings based on the output of
the metal detector. Later, we intend to draw an image of a map that
highlights these locations using the aggregate function. In this case
the zeroValue could be an area map with no highlights. The possibly huge
set of input data is stored as GPS coordinates across many partitions. seqOp
(first reducer) could convert the GPS coordinates to map coordinates and
put a marker on the map at the respective position. combOp (second
reducer) will receive these highlights as partial maps and combine them
into a single final output map.
Listing Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Examples 1
val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3],
[partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note the final reduce includes the initial value
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16

val z = sc.parallelize(List("a","b","c","d","e","f"),2)

//lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}

z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c],
[partID:1, val: d], [partID:1, val: e], [partID:1, val: f])

z.aggregate("")(_ + _, _+_)
res115: String = abcdef

// See here how the initial value "x" is applied three times.
//  - once for each partition
//  - once when combining all the partitions in the second reduce function.
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc
// Below are some more advanced examples. Some are quite tricky to work out.

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)


res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10

The main issue with the code above is that the result of the inner min is a string of
length 1.
The zero in the output is due to the empty string being the last string in the list. We
see this result because we are not recursively reducing any further within the
partition for the final string.

Examples 2

val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res144: String = 11

In contrast to the previous example, this example has the empty string at the
beginning of the second partition. This results in a length of zero being input to the
second reduce, which then upgrades it to a length of 1. (Warning: The above example
shows bad design since the output is dependent on the order of the data inside the
partitions.)

aggregateByKey [Pair]
Works like the aggregate function except the aggregation is applied to the values
with the same key. Also unlike the aggregate function the initial value is not
applied to the second reduce.

Listing Variants
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)
(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U,
combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒
U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Example

val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// lets have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect

res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect


res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect


res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))

cartesian
Computes the cartesian product between two RDDs (i.e. Each item of the first
RDD is joined with each item of the second RDD) and returns them as a new
RDD. (Warning: Be careful when using this function! Memory consumption can
quickly become an issue!)

Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]


Example

val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8),
(2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8),
(4,9), (4,10), (5,9), (5,10))

checkpoint
Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are
stored as a binary file within the checkpoint directory which can be specified using
the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will
not occur until an action is invoked.)

Important note: the directory "my_directory_name" should exist on all slaves. As
an alternative you could use an HDFS directory URL as well.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")
val a = sc.parallelize(1 to 4)
a.checkpoint
a.count
14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15
...
14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to
memory (estimated size 115.7 KB, free 296.3 MB)
14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to
file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/
my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new
parent is RDD 12
res23: Long = 4
coalesce, repartition
Coalesces the associated data into a given number of
partitions. repartition(numPartitions) is simply an abbreviation
for coalesce(numPartitions, shuffle = true).

Listing Variants

def coalesce ( numPartitions : Int , shuffle : Boolean = false ): RDD [T]


def repartition ( numPartitions : Int ): RDD [T]

Example

val y = sc.parallelize(1 to 10, 10)


val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2

cogroup [Pair], groupWith [Pair]
A very powerful set of functions that allow grouping up to 3 key-value RDDs
together using their keys.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]


def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V],
Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(Iterable[V], Iterable[W]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K,
(Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)],
numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner:
Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def groupWith[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]):
RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)


val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c))),
(3,(ArrayBuffer(b),ArrayBuffer(c))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)

val d = a.map((_, "d"))


b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))
)

val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),
(2,(ArrayBuffer(banana),ArrayBuffer())),
(3,(ArrayBuffer(orange),ArrayBuffer())),
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),
(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray
Converts the RDD into a Scala array and returns it. If you provide a standard map-
function (i.e. f = T -> U) it will be applied before inserting the values into the result
array.

Listing Variants
def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)


c.collect
res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

collectAsMap [Pair]
Similar to collect, but works on key-value RDDs and converts them into Scala
maps to preserve their key-value structure.

Listing Variants

def collectAsMap(): Map[K, V]

Example

val a = sc.parallelize(List(1, 2, 1, 3), 1)


val b = a.zip(a)
b.collectAsMap
res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)

combineByKey[Pair]
Very efficient implementation that combines the values of a RDD consisting of
two-component tuples by applying multiple aggregators one after another.

Listing Variants
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean
= true, serializerClass: String = null): RDD[(K, C)]

Example

val a =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String],
y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit,
salmon, bee, bear, wolf)))

compute
Executes dependencies and computes the actual representation of the RDD. This
function should not be called directly by users.

Listing Variants

def compute(split: Partition, context: TaskContext): Iterator[T]

context, sparkContext
Returns the SparkContext that was used to create the RDD.

Listing Variants

def context: SparkContext
def sparkContext: SparkContext

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)


c.context
res8: org.apache.spark.SparkContext = org.apache.spark.SparkContext@58c1c2f1
count
Returns the number of items stored within a RDD.

Listing Variants

def count(): Long

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)


c.count
res2: Long = 4

countApprox
Marked as experimental feature! Experimental features are currently not covered
by this document!

Listing Variants
def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countApproxDistinct
Computes the approximate number of distinct values. For large RDDs which are
spread across many nodes, this function may execute faster than other counting
methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinct(relativeSD: Double = 0.05): Long

Example
val a = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224

b.countApproxDistinct(0.05)
res15: Long = 9750

b.countApproxDistinct(0.01)
res16: Long = 9947

b.countApproxDistinct(0.001)
res0: Long = 10000

countApproxDistinctByKey [Pair]
 
Similar to countApproxDistinct, but computes the approximate number of distinct
values for each distinct key. Hence, the RDD must consist of two-component
tuples. For large RDDs which are spread across many nodes, this function may
execute faster than other counting methods. The parameter relativeSD controls the
accuracy of the computation.

Listing Variants

def countApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]


def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K,
Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner):
RDD[(K, Long)]

Example

val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)


val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))

countByKey [Pair]
Very similar to count, but counts the values of a RDD consisting of two-
component tuples for each distinct key separately.

Listing Variants

def countByKey(): Map[K, Long]

Example

val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)


c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)

countByKeyApprox [Pair]
Marked as experimental feature! Experimental features are currently not covered
by this document!

Listing Variants

def countByKeyApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[K, BoundedDouble]]
countByValue
Returns a map that contains all unique values of the RDD and their respective
occurrence counts. (Warning: This operation will finally aggregate the
information in a single reducer.)

Listing Variants

def countByValue(): Map[T, Long]

Example

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4
-> 2, 7 -> 1)

countByValueApprox
Marked as experimental feature! Experimental features are currently not covered
by this document!

Listing Variants

def countByValueApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[T, BoundedDouble]]

dependencies
 
Returns the RDD on which this RDD depends.

Listing Variants
final def dependencies: Seq[Dependency[_]]

Example

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at
<console>:12
b.dependencies.length
Int = 0

b.map(a => a).dependencies.length


res40: Int = 1

b.cartesian(a).dependencies.length
res41: Int = 2

b.cartesian(a).dependencies
res42: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.rdd.CartesianRDD$
$anon$1@576ddaaa, org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct
 
Returns a new RDD that contains each unique value only once.

Listing Variants

def distinct(): RDD[T]


def distinct(numPartitions: Int): RDD[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)


c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2

a.distinct(3).partitions.length
res17: Int = 3
first
 
Looks for the very first data item of the RDD and returns it.

Listing Variants

def first(): T

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)


c.first
res1: String = Gnu

filter
 
Evaluates a boolean function for each data item of the RDD and puts the items for
which the function returned true into the resulting RDD.

Listing Variants

def filter(f: T => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 10, 3)


val b = a.filter(_ % 2 == 0)
b.collect
res3: Array[Int] = Array(2, 4, 6, 8, 10)

When you provide a filter function, it must be able to handle all data items
contained in the RDD. Scala provides so-called partial functions to deal with
mixed data-types. (Tip: Partial functions are very useful if you have some data
which may be bad and you do not want to handle but for the good data (matching
data) you want to apply some kind of map function. The following article is good.
It teaches you about partial functions in a very nice way and explains why case has
to be used for partial functions:  article)

Examples for mixed data without partial functions

val b = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))


a.filter(_ < 4).collect
<console>:15: error: value < is not a member of Any

This fails because some components of a are not implicitly comparable against
integers. Collect uses the isDefinedAt property of a function-object to determine
whether the test-function is compatible with each data item. Only data items that
pass this test (=filter) are then mapped using the function-object.

Examples for mixed data with partial functions

val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))


a.collect({case a: Int    => "is integer"
           case b: String => "is string" }).collect
res17: Array[String] = Array(is string, is string, is integer, is string)

val myfunc: PartialFunction[Any, Any] = {
  case a: Int    => "is integer"
  case b: String => "is string" }
myfunc.isDefinedAt("")
res21: Boolean = true

myfunc.isDefinedAt(1)
res22: Boolean = true

myfunc.isDefinedAt(1.5)
res23: Boolean = false

Be careful! The above code works because it only checks the type itself! If you use
operations on this type, you have to explicitly declare what type you want instead
of any. Otherwise the compiler does (apparently) not know what bytecode it should
produce:
val myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}
<console>:10: error: value < is not a member of Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x < 4) => "x"}


myfunc2: PartialFunction[Int,Any] = <function1>

filterByRange [Ordered]
 
Returns an RDD containing only the items in the key range specified. From our
testing, it appears this only works if your data is in key value pairs and it has
already been sorted by key.

Listing Variants

def filterByRange(lower: K, upper: K): RDD[P]

Example

val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4,
"tv"), (1, "screen"), (5, "heater")), 3)
val sortedRDD = randRDD.sortByKey()

sortedRDD.filterByRange(1, 3).collect
res66: Array[(Int, String)] = Array((1,screen), (2,cat), (3,book))

filterWith  (deprecated)
 
This is an extended version of filter. It takes two function arguments. The first
argument must conform to Int => A and is executed once per partition; it transforms
the partition index into a value of type A. The second function looks like
(T, A) => Boolean, where T is a data item of the RDD and A is the transformed
partition index. It has to return either true or false (i.e. it applies the filter).
Listing Variants

def filterWith[A: ClassTag](constructA: Int => A)(p: (T, A) => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) =>  b == 0).collect
res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect


res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect


res34: Array[Int] = Array(5, 6)

flatMap
 
Similar to map, but allows emitting more than one item in the map function.

Listing Variants

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

Example

val a = sc.parallelize(1 to 10, 5)


a.flatMap(1 to _).collect
res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5,
6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect


res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

// The program below generates a random number of copies (up to 10) of the items in the list.
val x  = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7,


7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues
 
Very similar to mapValues, but collapses the inherent structure of the values during
mapping.

Listing Variants

def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)


val b = a.map(x => (x.length, x))
b.flatMapValues("x" + _ + "x").collect
res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g),
(5,e), (5,r), (5,x), (4,x), (4,l), (4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x),
(7,p), (7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g), (5,l), (5,e), (5,x))

flatMapWith (deprecated)
 
Similar to flatMap, but allows accessing the partition index or a derivative of the
partition index from within the flatMap-function.

Listing Variants

def flatMapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]): RDD[U]
Example

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)

fold
 
Aggregates the values of each partition. The aggregation variable within each
partition is initialized with zeroValue.

Listing Variants

def fold(zeroValue: T)(op: (T, T) => T): T

Example

val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6

foldByKey [Pair]
 
Very similar to fold, but performs the folding separately for each key of the RDD.
This function is only available if the RDD consists of two-component tuples.

Listing Variants

def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]


def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K,
V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V):
RDD[(K, V)]
Example

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)


val b = a.map(x => (x.length, x))
b.foldByKey("")(_ + _).collect
res84: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)


val b = a.map(x => (x.length, x))
b.foldByKey("")(_ + _).collect
res85: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))

foreach
 
Executes a side-effecting function (one that returns no value) for each data item.

Listing Variants

def foreach(f: T => Unit)

Example

val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale",


"dolphin", "spider"), 3)
c.foreach(x => println(x + "s are yummy"))
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy
foreachPartition
 
Executes a side-effecting function (one that returns no value) for each partition.
Access to the data items contained in the partition is provided via the iterator
argument.

Listing Variants

def foreachPartition(f: Iterator[T] => Unit)

Example

val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)


b.foreachPartition(x => println(x.reduce(_ + _)))
6
15
24

foreachWith (Deprecated)
 
This is an extended version of foreach. The first function argument transforms the
partition index into some value of type A (once per partition); the second function
is then executed for each data item together with that value.

Listing Variants

def foreachWith[A: ClassTag](constructA: Int => A)(f: (T, A) => Unit)

Example

val a = sc.parallelize(1 to 9, 3)
a.foreachWith(i => i)((x,i) => if (x % 2 == 1 && i % 2 == 0) println(x) )
1
3
7
9
fullOuterJoin [Pair]
 
Performs the full outer join between two paired RDDs.

Listing Variants

def fullOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], Option[W]))]
def fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], Option[W]))]

Example

val pairRDD1 = sc.parallelize(List( ("cat",2), ("cat", 5), ("book", 4),("cat", 12)))


val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("cat", 12)))
pairRDD1.fullOuterJoin(pairRDD2).collect

res5: Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)),


(mouse,(None,Some(4))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))), (cat,
(Some(2),Some(12))), (cat,(Some(5),Some(2))), (cat,(Some(5),Some(12))), (cat,
(Some(12),Some(2))), (cat,(Some(12),Some(12))))

generator, setGenerator
 
Allows setting a string that is attached to the end of the RDD's name when printing
the dependency graph.

Listing Variants

@transient var generator


def setGenerator(_generator: String)
getCheckpointFile
 
Returns the path to the checkpoint file (as an Option), or None if the RDD has not
yet been checkpointed.

Listing Variants

def getCheckpointFile: Option[String]

Example

sc.setCheckpointDir("/home/cloudera/Documents")
val a = sc.parallelize(1 to 500, 5)
val b = a++a++a++a++a
b.getCheckpointFile
res49: Option[String] = None

b.checkpoint
b.getCheckpointFile
res54: Option[String] = None

b.collect
b.getCheckpointFile
res57: Option[String] = Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-
d56580787b20/rdd-40)

preferredLocations
 
Returns the hosts which are preferred by this RDD. The actual preference of a
specific host depends on various assumptions.

Listing Variants

final def preferredLocations(split: Partition): Seq[String]
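
Example

A minimal sketch (assuming a SparkContext named sc). For an in-memory parallelized
collection without explicit location preferences, the returned host list is typically empty:

val a = sc.parallelize(1 to 100, 2)
a.preferredLocations(a.partitions(0))
// res: Seq[String] = List()   (usually empty for a ParallelCollectionRDD)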


getStorageLevel
 
Retrieves the currently set storage level of the RDD. Note that a storage level can
only be assigned if the RDD does not already have one set. The example below shows
the error you will get when you try to reassign the storage level.

Listing Variants

def getStorageLevel

Example

val a = sc.parallelize(1 to 100000, 2)


a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
a.getStorageLevel.description
String = Disk Serialized 1x Replicated

a.cache
java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after
it was already assigned a level

glom
 
Assembles an array that contains all elements of the partition and embeds it in an
RDD. Each returned array contains the contents of one partition.

Listing Variants

def glom(): RDD[Array[T]]

Example

val a = sc.parallelize(1 to 100, 3)


a.glom.collect
res8: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))

groupBy
 

Listing Variants

def groupBy[K: ClassTag](f: T => K): RDD[(K, Iterable[T])]


def groupBy[K: ClassTag](f: T => K, numPartitions: Int): RDD[(K, Iterable[T])]
def groupBy[K: ClassTag](f: T => K, p: Partitioner): RDD[(K, Iterable[T])]

Example

val a = sc.parallelize(1 to 9, 3)
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)),
(odd,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
 a%2
}
a.groupBy(myfunc).collect
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7,
9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
 a%2
}
a.groupBy(x => myfunc(x), 3).collect
a.groupBy(myfunc(_), 1).collect
res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7,
9)))
import org.apache.spark.Partitioner
class MyPartitioner extends Partitioner {
def numPartitions: Int = 2
def getPartition(key: Any): Int =
{
    key match
    {
      case null     => 0
      case key: Int => key          % numPartitions
      case _        => key.hashCode % numPartitions
    }
 }
  override def equals(other: Any): Boolean =
 {
    other match
    {
      case h: MyPartitioner => true
      case _                => false
    }
 }
}
val a = sc.parallelize(1 to 9, 3)
val p = new MyPartitioner()
val b = a.groupBy((x:Int) => { x }, p)
val c = b.mapWith(i => i)((a, b) => (b, a))
c.collect
res42: Array[(Int, (Int, Seq[Int]))] = Array((0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))),
(0,(6,ArrayBuffer(6))), (0,(8,ArrayBuffer(8))), (1,(9,ArrayBuffer(9))), (1,
(3,ArrayBuffer(3))), (1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))), (1,
(5,ArrayBuffer(5))))

groupByKey [Pair]
 
Very similar to groupBy, but instead of supplying a function, the key-component
of each pair will automatically be presented to the partitioner.

Listing Variants
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)


val b = a.keyBy(_.length)
b.groupByKey.collect
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)),
(3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))

histogram [Double]
 
These functions take an RDD of doubles and create a histogram with either even
spacing (the number of buckets equals bucketCount) or arbitrary spacing based
on custom bucket boundaries supplied by the user via an array of double values.
The result types of the two variants differ slightly: the first function returns a
tuple consisting of two arrays, where the first array contains the computed bucket
boundary values and the second array contains the corresponding counts of values
(i.e. the histogram). The second variant just returns the histogram as an array of
longs.

Listing Variants

def histogram(bucketCount: Int): Pair[Array[Double], Array[Long]]


def histogram(buckets: Array[Double], evenBuckets: Boolean = false):
Array[Long]

Example with even spacing

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(5)
res11: (Array[Double], Array[Long]) = (Array(1.1, 2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0,
0, 1, 4))

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.histogram(6)
res18: (Array[Double], Array[Long]) = (Array(1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0,
1, 1, 3, 4))
Example with custom spacing

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(Array(0.0, 3.0, 8.0))
res14: Array[Long] = Array(5, 3)

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.histogram(Array(0.0, 5.0, 10.0))
res1: Array[Long] = Array(6, 9)

a.histogram(Array(0.0, 5.0, 10.0, 15.0))


res1: Array[Long] = Array(6, 8, 1)

id
Retrieves the ID which has been assigned to the RDD by its SparkContext.

Listing Variants

val id: Int

Example

val y = sc.parallelize(1 to 10, 10)


y.id
res16: Int = 19

intersection
Returns the elements in the two RDDs which are the same.
Listing Variants

def intersection(other: RDD[T], numPartitions: Int): RDD[T]


def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T]
= null): RDD[T]
def intersection(other: RDD[T]): RDD[T]

Example

val x = sc.parallelize(1 to 20)


val y = sc.parallelize(10 to 30)
val z = x.intersection(y)

z.collect
res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)

isCheckpointed
Indicates whether the RDD has been checkpointed. The flag is only set once the
checkpoint has actually been created.

Listing Variants

def isCheckpointed: Boolean

Example

sc.setCheckpointDir("/home/cloudera/Documents")
// c can be any RDD, e.g.:
val c = sc.parallelize(1 to 9, 3)
c.isCheckpointed
res6: Boolean = false

c.checkpoint
c.isCheckpointed
res8: Boolean = false

c.collect
c.isCheckpointed
res9: Boolean = true
iterator
Returns a compatible iterator object for a partition of this RDD. This function
should never be called directly.

Listing Variants

final def iterator(split: Partition, context: TaskContext): Iterator[T]

join [Pair]
Performs an inner join using two key-value RDDs. Please note that the keys must
be generally comparable to make this work.

Listing Variants

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]


def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)


val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect

res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,


(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,
(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
keyBy
Constructs two-component tuples (key-value pairs) by applying a function on each
data item. The result of the function becomes the key and the original data item
becomes the value of the newly created tuples.

Listing Variants

def keyBy[K](f: T => K): RDD[(K, T)]

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)


val b = a.keyBy(_.length)
b.collect
res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))

keys [Pair]
Extracts the keys from all contained tuples and returns them in a new RDD.

Listing Variants

def keys: RDD[K]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)


val b = a.map(x => (x.length, x))
b.keys.collect
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)
leftOuterJoin [Pair]
Performs a left outer join using two key-value RDDs. Please note that the keys
must be generally comparable to make this work correctly.

Listing Variants

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]


def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V,
Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V,
Option[W]))]

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)


val b = a.keyBy(_.length)
val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect

res1: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,


(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))), (6,
(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))), (3,
(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))), (3,
(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None)))

lookup
Scans the RDD for all entries whose key matches the provided key and returns their
values as a Scala sequence.

Listing Variants

def lookup(key: K): Seq[V]

Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)
res0: Seq[String] = WrappedArray(tiger, eagle)

map
Applies a transformation function on each item of the RDD and returns the result
as a new RDD.

Listing Variants

def map[U: ClassTag](f: T => U): RDD[U]

Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))

mapPartitions
This is a specialized map that is called only once for each partition. The entire
content of the respective partitions is available as a sequential stream of values via
the input argument (Iterator[T]). The custom function must return yet
another Iterator[U]. The combined result iterators are automatically converted into
a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the
following result due to the partitioning we chose.

Listing Variants
def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning:
Boolean = false): RDD[U]

Example 1

val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext)
 {
    val cur = iter.next;
    res .::= (pre, cur)
    pre = cur;
 }
  res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))

Example 2

val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9,10), 3)


def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
  var res = List[Int]()
  while (iter.hasNext) {
    val cur = iter.next;
    res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
 }
  res.iterator
}
x.mapPartitions(myfunc).collect
// some of the numbers are not output at all, because the random number
// generated for them is zero.
res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9,
9, 10)

The above program can also be written using flatMap as follows.

Example 2 using flatmap

val x  = sc.parallelize(1 to 10, 3)


x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7,


7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
mapPartitionsWithContext   (deprecated and developer API)
Similar to mapPartitions, but allows accessing information about the processing
state within the mapper.

Listing Variants

def mapPartitionsWithContext[U: ClassTag](f: (TaskContext, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example

val a = sc.parallelize(1 to 9, 3)
import org.apache.spark.TaskContext
def myfunc(tc: TaskContext, iter: Iterator[Int]) : Iterator[Int] = {
  tc.addOnCompleteCallback(() => println(
    "Partition: "     + tc.partitionId +
    ", AttemptID: "   + tc.attemptId ))
 
  iter.toList.filter(_ % 2 == 0).iterator
}
a.mapPartitionsWithContext(myfunc).collect

14/04/01 23:05:48 INFO SparkContext: Starting job: collect at <console>:20


...
14/04/01 23:05:48 INFO Executor: Running task ID 0
Partition: 0, AttemptID: 0, Interrupted: false
...
14/04/01 23:05:48 INFO Executor: Running task ID 1
14/04/01 23:05:48 INFO TaskSetManager: Finished TID 0 in 470 ms on localhost
(progress: 0/3)
...
14/04/01 23:05:48 INFO Executor: Running task ID 2
14/04/01 23:05:48 INFO TaskSetManager: Finished TID 1 in 23 ms on localhost
(progress: 1/3)
14/04/01 23:05:48 INFO DAGScheduler: Completed ResultTask(0, 1)

res0: Array[Int] = Array(2, 6, 4, 8)
mapPartitionsWithIndex
Similar to mapPartitions, but takes two parameters. The first parameter is the
index of the partition and the second is an iterator through all the items within this
partition. The output is an iterator containing the list of items after applying
whatever transformation the function encodes.

Listing Variants
def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]

Example

val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
  iter.map(x => index + "," + x)
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)

mapPartitionsWithSplit
This method has been marked as deprecated in the API. So, you should not use this
method anymore. Deprecated methods will not be covered in this document.

Listing Variants
def mapPartitionsWithSplit[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
mapValues [Pair]
Takes the values of a RDD that consists of two-component tuples, and applies the
provided function to transform each value. Then, it forms new two-component
tuples using the key and the transformed value and stores them in a new RDD.

Listing Variants

def mapValues[U](f: V => U): RDD[(K, U)]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)


val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx),
(7,xpantherx), (5,xeaglex))

mapWith  (deprecated)
This is an extended version of map. It takes two function arguments. The first
argument must conform to Int => A and is executed once per partition; it maps the
partition index to some value of type A. This is where it is nice to do some
initialization once per partition, like creating a Random number generator object.
The second function must conform to (T, A) => U, where T is a data item of the RDD
and A is the transformed partition index. Finally, the function has to return a
transformed data item of type U.

Listing Variants

def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]
Example

// generates 9 random numbers less than 1000.


val x = sc.parallelize(1 to 9, 3)
x.mapWith(a => new scala.util.Random)((x, r) => r.nextInt(1000)).collect
res0: Array[Int] = Array(940, 51, 779, 742, 757, 982, 35, 800, 15)

val a = sc.parallelize(1 to 9, 3)
val b = a.mapWith("Index:" + _)((a, b) => ("Value:" + a, b))
b.collect
res0: Array[(String, String)] = Array((Value:1,Index:0), (Value:2,Index:0),
(Value:3,Index:0), (Value:4,Index:1), (Value:5,Index:1), (Value:6,Index:1),
(Value:7,Index:2), (Value:8,Index:2), (Value:9,Index:2)

max
Returns the largest element in the RDD.

Listing Variants

def max()(implicit ord: Ordering[T]): T

Example

val y = sc.parallelize(10 to 30)


y.max
res75: Int = 30

val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (18, "cat")))


a.max
res6: (Int, String) = (18,cat)

mean [Double], meanApprox [Double]
Calls stats and extracts the mean component. The approximate version of the
function can finish somewhat faster in some scenarios. However, it trades accuracy
for speed.

Listing Variants

def mean(): Double


def meanApprox(timeout: Long, confidence: Double = 0.95):
PartialResult[BoundedDouble]

Example

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.mean
res0: Double = 5.3
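
The approximate variant can be sketched as follows (a hedged example; the printed
PartialResult depends on how many tasks finish within the timeout):

val b = sc.parallelize(1 to 10000, 10).map(_.toDouble)
b.meanApprox(timeout = 1000, confidence = 0.95)
// res: org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
// once all tasks complete, the bounded value converges to the exact mean (5000.5)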

min
Returns the smallest element in the RDD.

Listing Variants

def min()(implicit ord: Ordering[T]): T

Example

val y = sc.parallelize(10 to 30)


y.min
res75: Int = 10

val a = sc.parallelize(List((10, "dog"), (3, "tiger"), (9, "lion"), (8, "cat")))


a.min
res4: (Int, String) = (3,tiger)
name, setName
Allows an RDD to be tagged with a custom name.

Listing Variants

@transient var name: String


def setName(_name: String)

Example

val y = sc.parallelize(1 to 10, 10)


y.name
res13: String = null
y.setName("Fancy RDD Name")
y.name
res15: String = Fancy RDD Name

partitionBy [Pair]
Repartitions the key-value RDD using its keys. The partitioner implementation is
supplied as the first argument.

Listing Variants

def partitionBy(partitioner: Partitioner): RDD[(K, V)]
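
Example

A minimal sketch (assuming sc is available) using the built-in HashPartitioner:

import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List((1, "one"), (2, "two"), (3, "three"), (4, "four")), 2)
val repartitioned = pairs.partitionBy(new HashPartitioner(4))
repartitioned.partitions.length
// res: Int = 4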

partitioner
References the partitioner (if any) that will be used by key-based functions such
as groupBy, subtract and reduceByKey (from PairRDDFunctions).

Listing Variants

@transient val partitioner: Option[Partitioner]
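
Example

A short sketch: a plain parallelized pair RDD has no partitioner, while the result
of a key-based shuffle such as reduceByKey does.

val pairs = sc.parallelize(List((1, "one"), (2, "two"), (3, "three")), 2)
pairs.partitioner
// res: Option[org.apache.spark.Partitioner] = None
pairs.reduceByKey(_ + _).partitioner
// res: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@...)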

partitions
Returns an array of the partition objects associated with this RDD.

Listing Variants

final def partitions: Array[Partition]

Example

val b = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)


b.partitions
res48: Array[org.apache.spark.Partition] =
Array(org.apache.spark.rdd.ParallelCollectionPartition@18aa,
org.apache.spark.rdd.ParallelCollectionPartition@18ab)

persist, cache
These functions can be used to adjust the storage level of a RDD. When freeing up
memory, Spark will use the storage level identifier to decide which partitions
should be kept. The parameterless variants persist() and cache() are just
abbreviations for persist(StorageLevel.MEMORY_ONLY). (Warning: Once the
storage level has been changed, it cannot be changed again!)

Listing Variants

def cache(): RDD[T]


def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)


c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)

pipe
Takes the RDD data of each partition and sends it via stdin to a shell-command.
The resulting output of the command is captured and returned as a RDD of string
values.

Listing Variants

def pipe(command: String): RDD[String]


def pipe(command: String, env: Map[String, String]): RDD[String]
def pipe(command: Seq[String], env: Map[String, String] = Map(),
printPipeContext: (String => Unit) => Unit = null, printRDDElement: (T, String
=> Unit) => Unit = null): RDD[String]

Example

val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res2: Array[String] = Array(1, 4, 7)
randomSplit
Randomly splits an RDD into multiple smaller RDDs according to a weights Array,
which specifies the percentage of the total data elements assigned to each
smaller RDD. Note that the actual size of each smaller RDD is only approximately
equal to the percentage specified by the weights Array; the second example
below shows that the number of items in each smaller RDD does not exactly match
the weights. An optional random seed can be specified. This function is useful
for splitting data into a training set and a test set for machine learning.

Listing Variants

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

Example

val y = sc.parallelize(1 to 10)


val splits = y.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
training.collect
res85: Array[Int] = Array(1, 4, 5, 6, 8, 10)
test.collect
res86: Array[Int] = Array(2, 3, 7, 9)

val y = sc.parallelize(1 to 10)


val splits = y.randomSplit(Array(0.1, 0.3, 0.6))

val rdd1 = splits(0)


val rdd2 = splits(1)
val rdd3 = splits(2)

rdd1.collect
res87: Array[Int] = Array(4, 10)
rdd2.collect
res88: Array[Int] = Array(1, 3, 5, 8)
rdd3.collect
res91: Array[Int] = Array(2, 6, 7, 9)

reduce
This function provides the well-known reduce functionality in Spark. Please note
that any function f you provide should be commutative and associative in order to
generate reproducible results.

Listing Variants

def reduce(f: (T, T) => T): T

Example

val a = sc.parallelize(1 to 100, 3)


a.reduce(_ + _)
res41: Int = 5050

reduceByKey [Pair], 
reduceByKeyLocally [Pair], reduceByKeyToDriver [Pair]
This function provides the well-known reduce functionality in Spark, applied
separately to the values of each key. Please note that any function f you provide
should be commutative and associative in order to generate reproducible results.

Listing Variants

def reduceByKey(func: (V, V) => V): RDD[(K, V)]


def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
def reduceByKeyToDriver(func: (V, V) => V): Map[K, V]

Example

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)


val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)


val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
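
reduceByKeyLocally performs the same per-key reduction but returns a Scala Map on the
driver instead of an RDD (a hedged sketch reusing the first example's data):

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKeyLocally(_ + _)
// res: scala.collection.Map[Int,String] = Map(3 -> dogcatowlgnuant)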

repartition
This function changes the number of partitions to the number specified by the
numPartitions parameter.

Listing Variants

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

Example

val rdd = sc.parallelize(List(1, 2, 10, 4, 5, 2, 1, 1, 1), 3)


rdd.partitions.length
res2: Int = 3
val rdd2  = rdd.repartition(5)
rdd2.partitions.length
res6: Int = 5
repartitionAndSortWithinPartitions [Ordered]
Repartition the RDD according to the given partitioner and, within each resulting
partition, sort records by their keys.

Listing Variants

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

Example

// first we will do range partitioning which is not sorted


val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"),
(1, "screen"), (5, "heater")), 3)
val rPartitioner = new org.apache.spark.RangePartitioner(3, randRDD)
val partitioned = randRDD.partitionBy(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
partitioned.mapPartitionsWithIndex(myfunc).collect

res0: Array[String] = Array([partID:0, val: (2,cat)], [partID:0, val: (3,book)], [partID:0, val:
(1,screen)], [partID:1, val: (4,tv)], [partID:1, val: (5,heater)], [partID:2, val: (6,mouse)],
[partID:2, val: (7,cup)])

// now let's repartition, but this time have it sorted


val partitioned = randRDD.repartitionAndSortWithinPartitions(rPartitioner)
def myfunc(index: Int, iter: Iterator[(Int, String)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
partitioned.mapPartitionsWithIndex(myfunc).collect

res1: Array[String] = Array([partID:0, val: (1,screen)], [partID:0, val: (2,cat)], [partID:0, val:
(3,book)], [partID:1, val: (4,tv)], [partID:1, val: (5,heater)], [partID:2, val: (6,mouse)],
[partID:2, val: (7,cup)])
rightOuterJoin [Pair]
Performs a right outer join using two key-value RDDs. Please note that the keys
must be generally comparable to make this work correctly.

Listing Variants

def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]


def rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K,
(Option[V], W))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
(Option[V], W))]

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)


val b = a.keyBy(_.length)
val c =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.rightOuterJoin(d).collect

res2: Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,


(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)), (6,
(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,
(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,
(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,
(None,bear)))

sample
Randomly selects a fraction of the items of a RDD and returns them in a new
RDD.

Listing Variants

def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T]

Example

val a = sc.parallelize(1 to 10000, 3)


a.sample(false, 0.1, 0).count
res24: Long = 960

a.sample(true, 0.3, 0).count


res25: Long = 2888

a.sample(true, 0.3, 13).count


res26: Long = 2985

sampleByKey [Pair]
Randomly samples the key value pair RDD according to the fraction of each key
you want to appear in the final RDD.

Listing Variants

def sampleByKey(withReplacement: Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K, V)]

Example

val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7,
"tv"), (6, "screen"), (7, "heater")))
val sampleMap = List((7, 0.4), (6, 0.6)).toMap
randRDD.sampleByKey(false, sampleMap,42).collect

res6: Array[(Int, String)] = Array((7,cat), (6,mouse), (6,book), (6,screen), (7,heater))

sampleByKeyExact [Pair, experimental]
This is labelled as experimental and so we do not document it.

Listing Variants

def sampleByKeyExact(withReplacement: Boolean, fractions: Map[K, Double], seed: Long = Utils.random.nextLong): RDD[(K, V)]

saveAsHadoopFile [Pair], saveAsHadoopDataset [Pair],
saveAsNewAPIHadoopFile [Pair]
Saves the RDD in a Hadoop compatible format using any Hadoop outputFormat
class the user specifies.

Listing Variants

def saveAsHadoopDataset(conf: JobConf)


def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String)(implicit fm:
ClassTag[F])
def saveAsHadoopFile[F <: OutputFormat[K, V]](path: String, codec: Class[_ <:
CompressionCodec]) (implicit fm: ClassTag[F])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_],
outputFormatClass: Class[_ <: OutputFormat[_, _]], codec: Class[_ <:
CompressionCodec])
def saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_],
outputFormatClass: Class[_ <: OutputFormat[_, _]], conf: JobConf = new
JobConf(self.context.hadoopConfiguration), codec: Option[Class[_ <:
CompressionCodec]] = None)
def saveAsNewAPIHadoopFile[F <: NewOutputFormat[K, V]](path: String)
(implicit fm: ClassTag[F])
def saveAsNewAPIHadoopFile(path: String, keyClass: Class[_], valueClass:
Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf:
Configuration = self.context.hadoopConfiguration)
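
Example

A hedged sketch using the new Hadoop API TextOutputFormat (the Hadoop classes shown are
assumed to be on the classpath; they are not part of Spark itself):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
val v = sc.parallelize(Array(("owl", 3), ("gnu", 4), ("dog", 1)), 2)
  .map { case (k, c) => (new Text(k), new IntWritable(c)) }
v.saveAsNewAPIHadoopFile[TextOutputFormat[Text, IntWritable]]("hd_new_api_file")
// writes part-r-xxxxx files plus a _SUCCESS marker into the target directory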

saveAsObjectFile
Saves the RDD in binary format.

Listing Variants

def saveAsObjectFile(path: String)

Example

val x = sc.parallelize(1 to 100, 3)


x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect
res52: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

saveAsSequenceFile [SeqFile]
Saves the RDD as a Hadoop sequence file.

Listing Variants
def saveAsSequenceFile(path: String, codec: Option[Class[_ <:
CompressionCodec]] = None)

Example

val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)


v.saveAsSequenceFile("hd_seq_file")
14/04/19 05:45:43 INFO FileOutputCommitter: Saved output of task
'attempt_201404190545_0000_m_000001_191' to file:/home/cloudera/hd_seq_file

[cloudera@localhost ~]$ ll ~/hd_seq_file


total 8
-rwxr-xr-x 1 cloudera cloudera 117 Apr 19 05:45 part-00000
-rwxr-xr-x 1 cloudera cloudera 133 Apr 19 05:45 part-00001
-rwxr-xr-x 1 cloudera cloudera   0 Apr 19 05:45 _SUCCESS

saveAsTextFile
Saves the RDD as text files, writing each element on a separate line.

Listing Variants

def saveAsTextFile(path: String)


def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec])

Example without compression

val a = sc.parallelize(1 to 10000, 3)


a.saveAsTextFile("mydata_a")
14/04/03 21:11:36 INFO FileOutputCommitter: Saved output of task
'attempt_201404032111_0000_m_000002_71' to file:/home/cloudera/Documents/spark-
0.9.0-incubating-bin-cdh4/bin/mydata_a

[cloudera@localhost ~]$ head -n 5


~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/part-00000
1
2
3
4
5

// Produces 3 output files since we created the RDD with 3 partitions
[cloudera@localhost ~]$ ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/
-rwxr-xr-x 1 cloudera cloudera 15558 Apr  3 21:11 part-00000
-rwxr-xr-x 1 cloudera cloudera 16665 Apr  3 21:11 part-00001
-rwxr-xr-x 1 cloudera cloudera 16671 Apr  3 21:11 part-00002

Example with compression

import org.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])

[cloudera@localhost ~]$ ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_b/


total 24
-rwxr-xr-x 1 cloudera cloudera 7276 Apr  3 21:29 part-00000.gz
-rwxr-xr-x 1 cloudera cloudera 6517 Apr  3 21:29 part-00001.gz
-rwxr-xr-x 1 cloudera cloudera 6525 Apr  3 21:29 part-00002.gz

val x = sc.textFile("mydata_b")
x.count
res2: Long = 10000

Example writing into HDFS

val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/cloudera/test");

val sp = sc.textFile("hdfs://localhost:8020/user/cloudera/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/cloudera/sp_x")

stats [Double]
Simultaneously computes the mean, variance and the standard deviation of all
values in the RDD.

Listing Variants

def stats(): StatCounter


Example

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.stats
res16: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859)

sortBy
This function sorts the input RDD's data and stores it in a new RDD. The first
parameter requires you to specify a function which  maps the input data into the
key that you want to sortBy. The second parameter (optional) specifies whether
you want the data to be sorted in ascending or descending order.

Listing Variants

def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.size)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

Example

val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))


y.sortBy(c => c, true).collect
res101: Array[Int] = Array(1, 1, 2, 3, 5, 7)

y.sortBy(c => c, false).collect


res102: Array[Int] = Array(7, 5, 3, 2, 1, 1)

val z = sc.parallelize(Array(("H", 10), ("A", 26), ("Z", 1), ("L", 5)))


z.sortBy(c => c._1, true).collect
res109: Array[(String, Int)] = Array((A,26), (H,10), (L,5), (Z,1))

z.sortBy(c => c._2, true).collect


res108: Array[(String, Int)] = Array((Z,1), (L,5), (H,10), (A,26))
sortByKey [Ordered]
This function sorts the input RDD's data and stores it in a new RDD. The output
RDD is a ShuffledRDD, because its data has been repartitioned by a shuffle. The
implementation is actually quite clever: first it uses a range partitioner to
partition the data into ranges within the shuffled RDD, and then it sorts these
ranges individually with mapPartitions using standard sort mechanisms.

Listing Variants

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P]

Example

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)


val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))

val a = sc.parallelize(1 to 100, 5)


val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res56: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))

stdev [Double], sampleStdev [Double]


Calls stats and extracts either stdev-component or corrected sampleStdev-
component.

Listing Variants

def stdev(): Double


def sampleStdev(): Double

Example

val d = sc.parallelize(List(0.0, 0.0, 0.0), 3)


d.stdev
res10: Double = 0.0
d.sampleStdev
res11: Double = 0.0

val d = sc.parallelize(List(0.0, 1.0), 3)


d.stdev
d.sampleStdev
res18: Double = 0.5
res19: Double = 0.7071067811865476

val d = sc.parallelize(List(0.0, 0.0, 1.0), 3)


d.stdev
res14: Double = 0.4714045207910317
d.sampleStdev
res15: Double = 0.5773502691896257

subtract
Performs the well known standard set subtraction operation: A - B

Listing Variants

def subtract(other: RDD[T]): RDD[T]


def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], p: Partitioner): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.collect
res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)

subtractByKey [Pair]
Very similar to subtract, but instead of supplying a function, the key-component of
each pair will be automatically used as criterion for removing items from the first
RDD.

Listing Variants

def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]


def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int):
RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K,
V)]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)


val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)
b.subtractByKey(d).collect
res15: Array[(Int, String)] = Array((4,lion))

sum [Double], sumApprox [Double]


Computes the sum of all values contained in the RDD. The approximate version of
the function can finish somewhat faster in some scenarios. However, it trades
accuracy for speed.
Listing Variants

def sum(): Double


def sumApprox(timeout: Long, confidence: Double = 0.95):
PartialResult[BoundedDouble]

Example

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res17: Double = 101.39999999999999
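
The approximate variant returns a PartialResult; a hedged sketch (getFinalValue blocks
until all tasks have finished and then returns the bounded sum):

val y = sc.parallelize((1 to 10000).map(_.toDouble), 20)
y.sumApprox(timeout = 500, confidence = 0.9).getFinalValue()
// res: org.apache.spark.partial.BoundedDouble, converging to the exact sum 5.0005E7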

take
Extracts the first n items of the RDD and returns them as an array. (Note: This
sounds very easy, but it is actually quite a tricky problem for the implementors of
Spark because the items in question can be in many different partitions.)

Listing Variants

def take(num: Int): Array[T]

Example

val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)


b.take(2)
res18: Array[String] = Array(dog, cat)

val b = sc.parallelize(1 to 10000, 5000)


b.take(100)
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 100)
takeOrdered
Orders the data items of the RDD using their inherent implicit ordering function
and returns the first n items as an array.

Listing Variants

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

Example

val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)


b.takeOrdered(2)
res19: Array[String] = Array(ape, cat)

takeSample
Behaves differently from sample in the following respects:

   It will return an exact number of samples (Hint: 2nd parameter)


   It returns an Array instead of RDD.
   It internally randomizes the order of the items returned.

Listing Variants

def takeSample(withReplacement: Boolean, num: Int, seed: Int): Array[T]

Example

val x = sc.parallelize(1 to 1000, 3)


x.takeSample(true, 100, 1)
res3: Array[Int] = Array(339, 718, 810, 105, 71, 268, 333, 360, 341, 300, 68, 848, 431,
449, 773, 172, 802, 339, 431, 285, 937, 301, 167, 69, 330, 864, 40, 645, 65, 349, 613, 468,
982, 314, 160, 675, 232, 794, 577, 571, 805, 317, 136, 860, 522, 45, 628, 178, 321, 482,
657, 114, 332, 728, 901, 290, 175, 876, 227, 130, 863, 773, 559, 301, 694, 460, 839, 952,
664, 851, 260, 729, 823, 880, 792, 964, 614, 821, 683, 364, 80, 875, 813, 951, 663, 344,
546, 918, 436, 451, 397, 670, 756, 512, 391, 70, 213, 896, 123, 858)

toDebugString
Returns a string that contains debug information about the RDD and its
dependencies.

Listing Variants

def toDebugString: String

Example

val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res6: String =
MappedRDD[15] at subtract at <console>:16 (3 partitions)
  SubtractedRDD[14] at subtract at <console>:16 (3 partitions)
    MappedRDD[12] at subtract at <console>:16 (3 partitions)
      ParallelCollectionRDD[10] at parallelize at <console>:12 (3 partitions)
    MappedRDD[13] at subtract at <console>:16 (3 partitions)
      ParallelCollectionRDD[11] at parallelize at <console>:12 (3 partitions)

toJavaRDD
Embeds this RDD object within a JavaRDD object and returns it.

Listing Variants

def toJavaRDD() : JavaRDD[T]


Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)


c.toJavaRDD
res3: org.apache.spark.api.java.JavaRDD[String] = ParallelCollectionRDD[6] at
parallelize at <console>:12

toLocalIterator
Converts the RDD into a Scala iterator on the driver node.

Listing Variants

def toLocalIterator: Iterator[T]

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
val iter = z.toLocalIterator

iter.next
res51: Int = 1

iter.next
res52: Int = 2

top
Utilizes the implicit ordering of T to determine the top k values and returns
them as an array.

Listing Variants
def top(num: Int)(implicit ord: Ordering[T]): Array[T]

Example

val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)


c.top(2)
res28: Array[Int] = Array(9, 8)

toString
Assembles a human-readable textual description of the RDD.

Listing Variants

override def toString: String

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.toString
res61: String = ParallelCollectionRDD[80] at parallelize at <console>:21

val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"),
(6, "screen"), (7, "heater")))
val sortedRDD = randRDD.sortByKey()
sortedRDD.toString
res64: String = ShuffledRDD[88] at sortByKey at <console>:23

treeAggregate
Computes the same thing as aggregate, except it aggregates the elements of the
RDD in a multi-level tree pattern. Another difference is that it does not use the
initial value for the second reduce function (combOp).  By default a tree of depth 2
is used, but this can be changed via the depth parameter.
Listing Variants

def treeAggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0: ClassTag[U]): U

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// let's first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1,
val: 4], [partID:1, val: 5], [partID:1, val: 6])

z.treeAggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

// Note: unlike normal aggregate, treeAggregate does not apply the initial value for the
// second reduce (combOp)
// This example returns 11 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(4, 5, 6) = 6
// final reduce across partitions will be 5 + 6 = 11
// note the final reduce does not include the initial value
z.treeAggregate(5)(math.max(_, _), _ + _)
res42: Int = 11

treeReduce
Works like reduce except reduces the elements of the RDD in a multi-level tree
pattern.
Listing Variants

def  treeReduce(f: (T, T) ⇒ T, depth: Int = 2): T

Example

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.treeReduce(_+_)
res49: Int = 21

union, ++
Performs the standard set operation: A union B

Listing Variants

def ++(other: RDD[T]): RDD[T]


def union(other: RDD[T]): RDD[T]

Example

val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)

unpersist
Dematerializes the RDD (i.e. Erases all data items from hard-disk and memory).
However, the RDD object remains. If it is referenced in a computation, Spark will
regenerate it automatically using the stored dependency graph.
Listing Variants

def unpersist(blocking: Boolean = true): RDD[T]

Example

val y = sc.parallelize(1 to 10, 10)


val z = (y++y)
z.collect
z.unpersist(true)
14/04/19 03:04:57 INFO UnionRDD: Removing RDD 22 from persistence list
14/04/19 03:04:57 INFO BlockManager: Removing RDD 22

values
Extracts the values from all contained tuples and returns them in a new RDD.

Listing Variants

def values: RDD[V]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)


val b = a.map(x => (x.length, x))
b.values.collect
res3: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)

variance [Double], sampleVariance [Double]
Calls stats and extracts either variance-component or corrected sampleVariance-
component.
Listing Variants

def variance(): Double


def sampleVariance(): Double

Example

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.variance
res70: Double = 10.605333333333332

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res14: Double = 66.04584444444443

x.sampleVariance
res13: Double = 74.30157499999999

zip
Joins two RDDs by combining the i-th element of each partition with its counterpart
in the other RDD. The resulting RDD will consist of two-component tuples which are
interpreted as key-value pairs by the methods provided by the PairRDDFunctions
extension.

Listing Variants

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val a = sc.parallelize(1 to 100, 3)


val b = sc.parallelize(101 to 200, 3)
a.zip(b).collect
res1: Array[(Int, Int)] = Array((1,101), (2,102), (3,103), (4,104), (5,105), (6,106), (7,107),
(8,108), (9,109), (10,110), (11,111), (12,112), (13,113), (14,114), (15,115), (16,116),
(17,117), (18,118), (19,119), (20,120), (21,121), (22,122), (23,123), (24,124), (25,125),
(26,126), (27,127), (28,128), (29,129), (30,130), (31,131), (32,132), (33,133), (34,134),
(35,135), (36,136), (37,137), (38,138), (39,139), (40,140), (41,141), (42,142), (43,143),
(44,144), (45,145), (46,146), (47,147), (48,148), (49,149), (50,150), (51,151), (52,152),
(53,153), (54,154), (55,155), (56,156), (57,157), (58,158), (59,159), (60,160), (61,161),
(62,162), (63,163), (64,164), (65,165), (66,166), (67,167), (68,168), (69,169), (70,170),
(71,171), (72,172), (73,173), (74,174), (75,175), (76,176), (77,177), (78,...

val a = sc.parallelize(1 to 100, 3)


val b = sc.parallelize(101 to 200, 3)
val c = sc.parallelize(201 to 300, 3)
a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2 )).collect
res12: Array[(Int, Int, Int)] = Array((1,101,201), (2,102,202), (3,103,203), (4,104,204),
(5,105,205), (6,106,206), (7,107,207), (8,108,208), (9,109,209), (10,110,210),
(11,111,211), (12,112,212), (13,113,213), (14,114,214), (15,115,215), (16,116,216),
(17,117,217), (18,118,218), (19,119,219), (20,120,220), (21,121,221), (22,122,222),
(23,123,223), (24,124,224), (25,125,225), (26,126,226), (27,127,227), (28,128,228),
(29,129,229), (30,130,230), (31,131,231), (32,132,232), (33,133,233), (34,134,234),
(35,135,235), (36,136,236), (37,137,237), (38,138,238), (39,139,239), (40,140,240),
(41,141,241), (42,142,242), (43,143,243), (44,144,244), (45,145,245), (46,146,246),
(47,147,247), (48,148,248), (49,149,249), (50,150,250), (51,151,251), (52,152,252),
(53,153,253), (54,154,254), (55,155,255)...

zipPartitions
Similar to zip. But provides more control over the zipping process.

Listing Variants

def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B], preservesPartitioning:
Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3:
RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3:
RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C])
=> Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag](rdd2:
RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C],
Iterator[D]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag](rdd2:
RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f:
(Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]
Example

val a = sc.parallelize(0 to 9, 3)
val b = sc.parallelize(10 to 19, 3)
val c = sc.parallelize(100 to 109, 3)
def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
{
  var res = List[String]()
  while (aiter.hasNext && biter.hasNext && citer.hasNext)
 {
    val x = aiter.next + " " + biter.next + " " + citer.next
    res ::= x
 }
  res.iterator
}
a.zipPartitions(b, c)(myfunc).collect
res50: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9
19 109, 8 18 108, 7 17 107, 6 16 106)

zipWithIndex
Zips the elements of the RDD with their element indexes. The indexes start from 0.
If the RDD is spread across multiple partitions, a Spark job is started to perform
this operation.

Listing Variants

def zipWithIndex(): RDD[(T, Long)]

Example

val z = sc.parallelize(Array("A", "B", "C", "D"))


val r = z.zipWithIndex
r.collect
res110: Array[(String, Long)] = Array((A,0), (B,1), (C,2), (D,3))

val z = sc.parallelize(100 to 120, 5)


val r = z.zipWithIndex
r.collect
res11: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4), (105,5),
(106,6), (107,7), (108,8), (109,9), (110,10), (111,11), (112,12), (113,13), (114,14),
(115,15), (116,16), (117,17), (118,18), (119,19), (120,20))
zipWithUniqueId
This is different from zipWithIndex, since it just gives a unique id to each data
element, and the ids may not match the index number of the data element. This
operation does not start a Spark job even if the RDD is spread across multiple
partitions.
Compare the results of the example below with those of the 2nd example of
zipWithIndex; you should be able to see the difference.

Listing Variants

def zipWithUniqueId(): RDD[(T, Long)]

Example

val z = sc.parallelize(100 to 120, 5)


val r = z.zipWithUniqueId
r.collect

res12: Array[(Int, Long)] = Array((100,0), (101,5), (102,10), (103,15), (104,1), (105,6),
(106,11), (107,16), (108,2), (109,7), (110,12), (111,17), (112,3), (113,8), (114,13), (115,18),
(116,4), (117,9), (118,14), (119,19), (120,24))
