Once you have completed the most important settings, you are left with the selection of the detailed network options and Tags, and you are finished.
After this, you can check your Databricks services:
And select the newly created Azure Databricks Service to get the overview page:
And just Launch the Workspace!
As always, the code and notebooks are available at the Github repository.
Stay Healthy! See you tomorrow.
On the main console page of Azure Databricks you will find the following sections:
1. Main vertical navigation bar, which is available all the time and gives users simple transitions from one task (or page) to another.
2. Common tasks to get started immediately with one desired task
3. Importing & Exploring data, for dragging and dropping your external data into the DBFS file system
4. Starting a new notebook or getting additional information from the Databricks documentation and Release notes
5. Settings for any user settings, administration of the console and management.
While you are using Azure Databricks, the vertical navigation bar (1) and Settings (5) will always be available for you to access.
Navigation bar
Thanks to the intuitive and self-explanatory icons and names, each item is easy to recognize; here is a short overview of what each represents.
Home - this will always take you to the console page, no matter where you are.
Workspaces - this page is where all the collaboration happens and where users have their data, notebooks and all their work at their disposal. From the data engineer, data scientist and machine learning engineer point of view, the workspace is by far the most important section.
Recents - where you will find all recently used documents, data and services in Azure Databricks.
Data - the access point to all the data: databases and tables that reside on DBFS and as files. In order to browse the data, a cluster must be up and running, due to the nature of Spark data distribution.
Clusters - the VMs in the background that run Azure Databricks. Without a cluster up and running, Azure Databricks will not work. Here you can set up a new cluster, shut down a cluster, manage a cluster, attach a cluster to a notebook or to a job, create job clusters and set up pools. Clusters are the "horses" behind the code; they are the compute power, decoupled from the notebooks in order to provide scalability.
Jobs - an overview of scheduled (cron-style) jobs that are executing and are available to the user. This is the control center for job overview, job history, troubleshooting and administration of jobs.
Models - the page that gives you an overview and tracking of your machine learning models, operations on the models, artefacts, metadata and parameters for a particular model or a run of a model.
Search - a fast, easy and user-friendly way to search your workspace.
Settings
Here you will have an overview of your service, user management and account:
User settings - where you can set up personal access tokens for the Databricks API, manage Git integration and notebook settings
Admin console - where the administrator sets IAM policies, security and group access, and enables/disables additional services such as Databricks Genomics, Container Services, workspace behaviour, etc.
Manage account - will redirect you to the start page of the Azure dashboard for managing the Azure account that you are using to access Azure Databricks.
Log Out - will log you out of Azure Databricks.
This will get you around the platform. Tomorrow we will start exploring the clusters!
Complete set of code and Notebooks will be available at the Github repository.
Stay Healthy! See you tomorrow.
5. Worker and driver type gives you the option to select the VM that suits your needs. For first-timers, keep the default Worker and driver type selected; later you can explore the other options and change the DBU (DataBricks Units) consumption for higher performance. There are three types of workloads to understand - All-Purpose Compute, Jobs Compute and Jobs Light Compute - and many instance types: General Purpose, Memory Optimized, Storage Optimized, Compute Optimized and GPU Optimized. All come with different pricing plans, tiers and regions.
All workers have a minimum and maximum number of nodes available. The more you want to scale out, the more workers you give your cluster; the DBU consumption will change as more workers are added.
6. Autoscaling - a tick option that gives you the capability to scale automatically between the minimum and maximum number of nodes (workers) based on the workload.
7. Termination - the timeout in minutes; when there is no work after the given period, the cluster will terminate. Expect different behaviour when the cluster is attached to a pool.
Also explore the advanced options, where additional Spark configuration and runtime variables can be set. This is very useful when fine-tuning the behaviour of the cluster at startup. Also add Tags (as key-value pairs) to keep additional metadata on your cluster, and you can provide an init script, stored on DBFS, that can initiate a job or load data or models at start time.
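For illustration, here is a minimal sketch (not from the original post; the DBFS path and the script contents are assumptions) of storing such an init script on DBFS with dbutils, so it can then be referenced under Advanced Options > Init Scripts:

dbutils.fs.put(
    "/databricks/init/install-libs.sh",   # hypothetical DBFS path for the init script
    """#!/bin/bash
# runs on every node at cluster start; install any extra libraries here
pip install koalas
""",
    overwrite=True)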
Once you have selected the cluster options suited for your needs, you are ready to
hit that "Create cluster" button.
Tomorrow we will cover basics on architecture of clusters, workers, DBFS storage and
how Spark handles jobs.
Complete set of code and Notebooks will be available at the Github repository.
Stay Healthy! See you tomorrow.
Dec 05 2020 - Understanding Azure
Databricks cluster architecture,
workers, drivers and jobs
My cluster is a Standard DS3_v2 cluster (4 cores) with Min 2 and Max 8 workers. The same applies for the driver. Once the cluster is up and running, go to the Azure Portal. Look for the resource group that you created at the beginning (Day 2) when we started the Databricks Service. I have named my resource group "RG_DB_py" (naming is important! RG - ResourceGroup; DB - Service DataBricks; py - my project name). Search for the correct resource:
And Select "Resource Groups" and find your resource group. I have a lot of resource
groups, since I try to bundle the projects to a small groups that are closely related:
Find yours and select it and you will find the Azure Databricks service that belongs to
this resource group.
Databricks creates an additional (automatically generated) resource group to hold all the services (storage, VM, network, etc.). It follows a naming convention: RG_DB_py is my resource group, and what Azure does in this case is prefix and suffix your resource group name, as: databricks_rg_DB_py_npkw4cltqrcxe. The prefix will always be "databricks_rg" and the suffix will be a 13-character random string for uniqueness; in my case: npkw4cltqrcxe. Why a separate resource group? It used to be under the same resource group, but decoupling and having the services in a separate group makes it easier to start/stop services, manage IAM, create pools and scale. Find your resource group and see what is inside:
In the detailed list you will find the following resources (in accordance with my Standard DS3_v2 cluster):
Disk (9x Resources)
Network Interface (3x resources)
Network Security group (1x resource)
Public IP address (3x resources)
Storage account (1x resource)
Virtual Machine (3x resources)
Virtual network (1x resource)
Inspect the naming of these resources: you can see that the names are GUID-based, but they repeat across the different resources and can easily be bundled together. Drawing the components together gives a full picture of it:
At a high level, the Azure Databricks service manages the worker nodes and driver node in the separate resource group, which is tied to the same Azure subscription (for easier scalability and management). The platform or "appliance" or "managed service" is deployed as a set of Azure resources, and Databricks manages all other aspects. The additional VNet, security groups, IP addresses and storage accounts are ready to be used by the end user and managed through the Azure Databricks Portal (UI). Storage is also replicated (geo-redundant replication) for disaster scenarios and fault tolerance. Even when the cluster is turned off, the data is persisted in storage.
A cluster is a set of virtual machines with blob storage attached. Each virtual machine is rocking Linux Ubuntu (16.04 as of writing this) and has 4 vCPUs and 14 GiB of RAM. The workers are using two virtual machines, and the same virtual machine type is reserved for the driver. This is what we set on Day 2.
Since each VM is the same (for worker and driver), the workers can be scaled up based on the vCPUs. Two worker VMs with 4 cores each gives a maximum of 8 workers, so each vCPU / core is considered one worker. The driver machine (also a VM with Linux Ubuntu) is the manager machine that distributes the load among the workers.
Each virtual machine is set up with a public and a private subnet, and all are mapped together in a virtual network (VNet) for secure connectivity and communication of workloads and data results. Each VM also has a dedicated public IP address for communication with other services or Databricks Connect tools (I will talk about this in later posts).
Disks are also bundled into three types, each for one VM. These types are:
Scratch Volume
Container Root Volume
Standard Volume
Each type has a specific function, but all are designed for optimised data caching performance, especially for delta caching. This means faster data reading by creating copies of remote files in the nodes' local storage, using a fast intermediate data format. The data is cached automatically, even when a file has to be fetched from a remote location. This also performs well for repetitive and successive reads.
Delta caching (as part of Spark caching) is supported only for reading Parquet files in DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. Optimized storage (Spark caching) does not support file types such as CSV, JSON, TXT, ORC or XML.
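As a side note - and this is a hedged example, not from the original post - the Databricks disk (delta) cache can be toggled from a notebook with a single Spark configuration setting:

# enable the Databricks disk (delta) cache for subsequent Parquet reads
spark.conf.set("spark.databricks.io.cache.enabled", "true")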
When a request is pushed from the Databricks Portal (UI), the driver accepts the request and, using Spark jobs, pushes the workload down to each node. Each node has shards and copies of the data, or it fetches the data through DBFS from Blob Storage, and executes the job. After execution, the results of each worker node are gathered and combined by the driver, and the driver node returns the results back to the UI.
The more worker nodes you have, the more the request can be executed in parallel. And the more workers you have available (or in ready mode), the more you can scale your workloads.
Tomorrow we will start working our way up to importing and storing data, see how it is stored on blob storage, and explore the different types of storage that Azure Databricks provides.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 06 2020 - Importing and storing
data to Azure Databricks
This will prompt you with a variety of actions for importing data to DBFS or connecting Azure Databricks with other services.
Drag the data file (available on Github in the data folder) named Day6data.csv to the square for upload. For easier understanding, let's check the CSV file schema (a simple one with three columns: 1. Date (datetime format), 2. Temperature (integer format), 3. City (string format)).
But before you start uploading the data, let's check the Azure resource group. I have not yet started any Databricks cluster in my workspace, and here you can see that the VNet, Storage and Network Security group will always be available for the Azure Databricks service. Only when you start the cluster will the additional services (IP addresses, disks, VMs, ...) appear.
This gives us a better idea of where and how data is persisted. Your data will always be available and stored on blob storage. Meaning, even if you decide not only to terminate the cluster but to delete it as well, your data will always be safely stored. When you add a new cluster to the same workspace, the cluster will automatically retrieve the data from blob storage.
1. Import
Drag and drop the CSV file into the "Drop zone" as discussed previously. It should look like this:
You have now two options:
create table with UI
create table in Notebook
Select the "Create table with UI". Only now you will be asked to select the cluster:
Now select the "Create table in Notebook" and Databricks will create a first
Notebook for you using Spark language to upload the data to DBFS.
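The generated code looks roughly like the following sketch (not the exact generated code; the upload path is an assumption and depends on where the UI placed your file):

# read the uploaded CSV from DBFS into a Spark DataFrame
df = (spark.read.format("csv")
      .option("header", "true")        # "First row is header"
      .option("inferSchema", "true")   # infer the column data types
      .load("/FileStore/tables/Day6data.csv"))

# persist it as a table in the workspace metastore
df.write.saveAsTable("day6data")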
In case I want to run this notebook, I will need to have my cluster up and running. So
let's start a cluster. On your left vertical navigation bar, select Cluster Icon. You will
get the list of all the clusters you are using. Select the one we have created on Day 4.
If you want, check the resource group for your Azure Databricks to see all the
running VM, disks and VNets.
Now insert the data using the import method, by dragging and dropping the CSV file into the "Drop Zone" (repeat the process), and hit "Create Table with UI". Now you should have a cluster available. Select it and preview the table.
You can see that the table name is propagated from the filename, and the file type and column delimiter are automatically selected. Only "First row is header" needs to be ticked in order to have the columns properly named and the data types set correctly.
Now we can create the table. After Databricks finishes, a report will be presented with a recap of the table location (yes, location!), the schema and an overview of sample data.
This table is now available on my cluster. What does this mean? The table is persistent not only on your cluster, but in your Azure Databricks workspace. This is important for understanding how and where data is stored. Go to the Data icon on the left vertical navigation bar.
This enables you to view the data through the DBFS structure and gives you the upload and search options.
Uploading files is now easier, and they are visible immediately in FileStore. There is the same file, prefixed Day6Data_dbfs.csv, in the Github data folder; you can upload it manually and it will be seen in FileStore:
Tomorrow we will explore how we can use a notebook to access this file with different commands (CLI, Bash, dbutils, Python, R, Spark). And since we will be using notebooks for the first time, we will do a little exploration of notebooks as well.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Databricks notebooks support multiple languages, and you can seamlessly switch the language within a notebook without switching the notebook itself. If the notebook holds the instructions of what operations to perform, the cluster is the engine that will execute all those instructions. Select the cluster that you created on Day 4. I am inserting the following:
Name: Day7_DB_Py_Notebook
Default Language: Python
Cluster: databricks_cl1_standard
If your clusters are not started, you can still create a notebook and later attach a selected cluster to the notebook.
A notebook consists of cells that can be either formatted text or code. Notebooks are saved automatically. Under File, you will find useful functions to manage your notebooks, such as Move, Clone, Rename, Upload and Export. Under the Edit menu, you will be able to work with cells and code blocks. Run all is a quick function to execute all cells at one time (or, if you prefer, you can run cells one by one, or run all cells below or above the selected cell). Once you start writing formatted text (Markdown, HTML, others), Databricks will automatically start building a table of contents, giving you a better overview of your content.
Let's start with Markdown: write a title and some text in the notebook and add some Python code. I have inserted:
%md # Day 7 - Advent of Azure Databricks 2020
%md
## Welcome to day 7.
In this document we will explore how to write notebook, use different languages
import file.
%md Default language in this notebook is Python. So if we want to add a text cell,
instead of Python, we need to explicitly
set **%md** at the beginning of each cell. In this way, language of execution
will be re-defined.
Under View, change from Standard to Side-by-side and you can see the code and the rendered notebook on the other side. This is useful for copying, changing or debugging the code.
Each text cell has %md at the beginning, converting the text to rich text - Markdown. The last cell is Python:
dbutils.fs.put("/FileStore/Day7/SaveContent.txt", 'This is the sample')
This generates a txt file in FileStore. The file can also be seen in the left pane. DbUtils - Databricks utils - is a set of utility tools for efficiently working with object storage. dbutils is available in R, Python and Scala.
Importing file
In a notebook we have the ability to use multiple languages. Mixing Python with Bash, Spark and R is common. But in this case we will use DbUtils - a powerful set of functions. Learn to like it, because it will be utterly helpful.
Let us explore Bash and R to import the file into a data.frame.
dbutils.fs.ls("dbfs:/FileStore")
df = spark.read.text("dbfs:/FileStore/Day6Data_dbfs.csv")
df.show()
And the result is:
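Note that spark.read.text reads the CSV as raw lines into a single string column. To get typed columns for the three fields (Date, Temperature, City), the same file can also be read as a CSV - a small sketch, with the same DBFS path assumed:

df_csv = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("dbfs:/FileStore/Day6Data_dbfs.csv"))
df_csv.show()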
%r
library(dplyr)
# Day6_df is assumed to have been read from the same DBFS file earlier in the notebook
Day6_df %>%
  group_by(city) %>%
  summarise(mean = mean(mean_daily_temp), n = n())
And the result is the same, just using R Language.
Tomorrow we will use the Databricks CLI and the DBFS API to upload files from, e.g., your client machine to FileStore. In this way, you will be able to migrate and upload files to Azure Databricks in no time.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Click on Generate New Token, and in the dialog window give the token a name and a lifetime. After the token is generated, make sure to copy it, because you will not be able to see it later. The token can be revoked when needed; otherwise it has an expiry date (in my case 90 days). So make sure to remember to renew it after the lifetime period!
3. Working with CLI
Go back to CMD and run the following:
databricks --version
will give you the current version you are rocking. After that, let's configure the connectivity.
databricks configure --token
and you will be prompted to insert two pieces of information:
the host (in my case: https://adb-8606925487212195.15.azuredatabricks.net/)
the token
The host is available in your browser: go to your Azure Databricks tab and copy-paste the URL:
And the token is the one that was generated for you in step two. The token should look like: dapib166345f2938xxxxxxxxxxxxxxc.
Once you insert both pieces of information, the connection is set!
Using bash commands, you can now work with DBFS from your local machine / server via the CLI. For example:
databricks fs ls
will list all the files in the root folder of the DBFS of your Azure Databricks workspace.
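Besides the CLI, the same kind of upload can be done from Python through the DBFS REST API - a hedged sketch (not from the original post; the local file name is an assumption, and the host and token are the ones configured above):

import base64
import requests

host = "https://adb-8606925487212195.15.azuredatabricks.net"   # your workspace URL
token = "<your-personal-access-token>"

# read a small local file and base64-encode it for the DBFS put endpoint
with open("Day6Data_dbfs.csv", "rb") as f:
    contents = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {token}"},
    json={"path": "/FileStore/Day6Data_dbfs.csv", "contents": contents, "overwrite": True},
)
resp.raise_for_status()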
Create a new storage account by clicking on "+ Add", and select the subscription, resource group, storage account name, location, account type and replication.
Continue to set up networking, data protection and advanced settings, and create the storage account. When you are finished with the storage account, we will create the storage container itself. Note that General Purpose v2 storage accounts support the latest Azure Storage features and all functionality of General Purpose v1 and Blob Storage accounts. General Purpose v2 accounts bring the lowest per-gigabyte capacity prices for Azure Storage and support the following Azure Storage services:
Blobs (all types: Block, Append, Page)
Data Lake Gen2
Files
Disks
Queues
Tables
Once the account is ready to be used, select it and choose "Container".
A container is blob storage for unstructured data and will communicate with the Azure Databricks DBFS perfectly. In the Containers pane, select "+ Container" to add a new container and give the container a name.
Once the container is created, click on it to get additional details.
Your data will be stored in this container and later used with Azure Databricks notebooks. You can also access the storage using Microsoft Azure Storage Explorer. It is much more intuitive and offers easier management, folder creation and binary file management.
You can upload a file using the Microsoft Azure Storage Explorer tool or directly on the portal. But in an organisation, you will have files and data copied here automatically by many other Azure services. Upload the file that is available for you in the Github repository (data/Day9_MLBPlayers.csv - the data file is licensed under GNU) to the blob storage container in any desired way. I have used Storage Explorer and simply dragged and dropped the file into the container.
Click OK to confirm and click Save (save icon). Go back to Storage account and on
the left select Shared Access Signature.
Under Allowed resource types, it is mandatory to select Container, but you can select all. Set the start and expiry date - 1 month in my case. Select the button "Generate SAS and connection string" and copy the needed strings; the connection string and SAS token should be enough (copy and paste them to a text editor).
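Later, in a notebook, the SAS token can be used roughly like this to read the uploaded file - a hedged sketch (the account, container and file names follow this post; adjust them to yours):

storage_account = "dbpystorage"
container = "dbpystorecontainer"
sas_token = "<paste-your-SAS-token-here>"

spark.conf.set(
    f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net",
    sas_token)

df = (spark.read.format("csv")
      .option("header", "true")
      .load(f"wasbs://{container}@{storage_account}.blob.core.windows.net/Day9_MLBPlayers.csv"))
display(df)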
Once this is done, let's continue with Azure Databricks notebooks.
3. Creating notebooks in Azure Databricks
Start up a cluster and create new notebooks (as we have discussed on Day 4 and Day
7). The notebook is available at Github.
And the code is:
%sql
SELECT
t1.Name as City1
,t2.Name AS City2
,t1.temperature*t2.Temperature AS MultipliedTemperature
FROM temp1 AS t1
JOIN temp2 AS t2
ON t1.id_t1 = t2.id_t2
WHERE
t1.name <> t2.name
LIMIT 1
If you follow the notebook, you will find some additional information, but all in all, the Hive SQL is ANSI compliant, and getting started should be no problem. When using a notebook, each cell must have the language defined at the beginning, unless it is the kernel language: %sql for SQL, %md for Markdown, %r for R, %scala for Scala. Beware, these language pointers are case sensitive, so %sql will be interpreted as a SQL script, whereas %SQL will return an error.
Tomorrow we will check and explore how to use R to do data engineering, but
mostly the data analysis tasks. So, stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 11 2020 - Using Azure Databricks
Notebooks with R Language for data
analytics
The Azure Databricks repository is a set of blogposts as an Advent of 2020 present to readers, for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
Dec 01: What is Azure Databricks
Dec 02: How to get started with Azure Databricks
Dec 03: Getting to know the workspace and Azure Databricks platform
Dec 04: Creating your first Azure Databricks cluster
Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and
jobs
Dec 06: Importing and storing data to Azure Databricks
Dec 07: Starting with Databricks notebooks and loading data to DBFS
Dec 08: Using Databricks CLI and DBFS CLI for file upload
Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
We looked into SQL language and how to get some basic data preparation done.
Today we will look into R and how to get started with data analytics.
Creating a data.frame (or getting data from SQL Table)
Create a new notebook (Name: Day11_R_AnalyticsTasks, Language: R) and let's go.
Now we will get data from SQL tables and DBFS files.
We will be using a database from Day10 and the table called temperature.
%sql
USE Day10;
But note(!) that dplyr functions might not work, due to the collision of function names with the SparkR library. SparkR has functions with the same names (arrange, between, coalesce, collect, contains, count, cume_dist, dense_rank, desc, distinct, explain, filter, first, group_by, intersect, lag, last, lead, mutate, n, n_distinct, ntile, percent_rank, rename, row_number, sample_frac, select, sql, summarize, union). In order to solve this collision, either detach the dplyr package (detach("package:dplyr")), or qualify the function explicitly: dplyr::summarise instead of just summarise.
Creating a simple linear regression
We can also use many of the R packages for data analysis, and in this case I will run
simple regression, trying to predict the daily temperature. Simply run the regression
function lm().
model <- lm(mean_daily_temp ~ city + date, data = df)
model
And run base r function summary() to get model insights.
summary(model)
confint(model)
In addition, you can directly install any missing or needed package in the notebook (subject to the R engine and Databricks Runtime version). In this case, I am running the residualPlot() function from the additionally installed package car.
install.packages("car")
library(car)
residualPlot(model)
Azure Databricks will generate an RMarkdown notebook when using R as the kernel language. If you want to create an IPython notebook, set Python as the kernel language and use %r to switch to R. Both the RMarkdown notebook and an HTML file (with included results) are available on Github.
Tomorrow we will check and explore how to use Python to do data engineering, but
mostly the data analysis tasks. So, stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
We can also import data from a SQL table into a data frame by simply writing a SQL statement.
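A minimal sketch of that (the day10.temperature table from the earlier posts is assumed here):

df_spark = spark.sql("SELECT * FROM day10.temperature")
df_pandas = df_spark.toPandas()

For the rest of this post, though, we will read the COVID sample dataset that ships in the Databricks datasets folder: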
#from pyspark.sql.functions import explode
from pyspark.sql import *
import pandas as pd
dfcovid = pd.read_csv("/dbfs/databricks-datasets/COVID/covid-19-data/us-states.csv")
dfcovid.head()
and now let's scatter-plot the number of cases and deaths per state, using the following Python code, which can be run as-is in Azure Databricks.
# Filter to the first of December 2020
df_12_01 = dfcovid[dfcovid["date"] == "2020-12-01"]
# split into train and test sets (the exact split is not shown here; sklearn's train_test_split is assumed)
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df_12_01, random_state=42)

X_train = train_df[["cases"]]
y_train = train_df["deaths"]
X_test = test_df[["cases"]]
y_test = test_df["deaths"]
We will use scikit-learn to do simple linear regression.
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)
print(f"num_deaths = {lr.intercept_:.4f} + {lr.coef_[0]:.4f}*cases")
If we have no cases, then there should be no deaths caused by COVID-19; this gives us a baseline, so let's set the intercept to 0.
lr = LinearRegression(fit_intercept=False).fit(X_train, y_train)
print(f"num_deaths = {lr.coef_[0]:.4f}*cases")
This model implies that there is a 2.68% mortality rate in our dataset. But we know that some states have higher mortality rates, and a linear model is certainly not ideal for this; it is just a showcase of using Python in Databricks.
Tomorrow we will check and explore how to use Python Koalas to do data
engineering, so stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
My cluster is rocking Databricks Runtime 7.3. So create a new notebook, name it Day13_Py_Koalas, select the language Python, and attach the notebook to your cluster.
1.Object Creation
Before going into the sample Python code, we must import the following packages: pandas and numpy, so we can create Koalas objects from them or convert from/to them.
import databricks.koalas as ks
import pandas as pd
import numpy as np
Creating a Koalas Series by passing a list of values, letting Koalas create a default
integer index:
s = ks.Series([1, 3, 5, np.nan, 6, 8])
Creating a Koalas DataFrame by passing a dict of objects that can be converted to
series-like.
kdf = ks.DataFrame(
{'a': [1, 2, 3, 4, 5, 6],
'b': [100, 200, 300, 400, 500, 600],
'c': ["one", "two", "three", "four", "five", "six"]},
index=[10, 20, 30, 40, 50, 60])
with the result:
Now, let's create a pandas DataFrame by passing a numpy array, with a datetime
index and labeled columns:
dates = pd.date_range('20200807', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
and getting the results as pandas dataframe:
2. Viewing data
See the top rows of the frame. The results may not be the same as pandas though:
unlike pandas, the data in a Spark dataframe is not ordered, it has no intrinsic notion
of index. When asked for the head of a dataframe, Spark will just take the requested
number of rows from a partition.
kdf.head()
You can also display the index, columns, and the underlying numpy data.
kdf.index
kdf.columns
kdf.to_numpy()
And you can also use describe function to get a statistic summary of your data:
kdf.describe()
You can also transpose the data, by adding a T function:
kdf.T
and many other functions. Grouping is another great way to get a summary of your data. Grouping can be done by "chaining" or adding a group by clause. The internal process - when grouping is applied - happens in three steps:
Splitting the data into groups (based on criteria)
applying the function and
combining the results back into a data structure.
kdf.groupby('A').sum()
#or
kdf.groupby(['A', 'B']).sum()
Both are grouping data, first time on Column A and second time on both columns A
and B:
3. Plotting data
Databricks Koalas is also compatible with matplotlib and inline plotting. We need to
load the package:
%matplotlib inline
from matplotlib import pyplot as plt
And can continue by creating a simple pandas series:
pser = pd.Series(np.random.randn(1000),
index=pd.date_range('1/1/2000', periods=1000))
that can be simply converted to Koalas series:
kser = ks.Series(pser)
After we have a series in Koalas, we can compute the cumulative maximum of the values and plot it:
kser = kser.cummax()
kser.plot()
And many other variations of plots. You can also load the seaborn package, the bokeh package and many others.
This blogpost is a shorter version of the larger "Koalas in 10 minutes" notebook, which is available in the same Azure Databricks repository as all the samples from this blogpost series. The notebook also briefly touches on data conversion to/from CSV, Parquet (*.parquet data format) and Spark IO (*.orc data format). All of it will be persistent and visible on DBFS.
Tomorrow we will explore Databricks jobs, from configuration to execution and troubleshooting, so stay tuned.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
And you will get to the screen where you will be able to view all your jobs:
Jobs have the following attributes (all are mostly self-explanatory):
Name - name of the Databricks job
Job ID - ID of the job, set automatically
Created By - name of the user (AD username) who created the job
Task - name of the notebook that is attached and executed when the job is triggered. A task can also be a JAR file or a spark-submit command.
Cluster - name of the cluster that is attached to this job. When the job is fired, all the work will be done on this cluster.
Schedule - CRON expression written in a "readable" manner
Last Run - status of the last run (e.g.: successful, failed, running)
Action - buttons available to start or terminate the job.
2.Creating a job
If you would like to create a new job, use the "+Create job":
And you will be prompted with a new site, to fill in all the needed information:
Give the job a name by renaming the "Untitled". I named my job "Day14_Job". The next step is to create a task. A task can be:
Selected Notebook (a notebook, as we learned to use and create on Day 7)
Set JAR file (a program or function that can be executed using a main class and arguments)
Configure spark-submit (spark-submit is an Apache Spark script to execute other applications on a cluster)
We will use a "Selected Notebook". So jump to the workspace and create a new notebook. I have named mine Day14_Running_job and it is running on Python. This is the code I have used:
# a widget named "select_number" is assumed to be defined on this notebook (e.g. via dbutils.widgets.text("select_number", "1"))
dbutils.widgets.get("select_number")
num_Select = dbutils.widgets.get('select_number')
You can also check each run separately and see what was executed. Each run has a specific ID and holds typical information, such as duration, start time, status, type and which task was executed, among others.
But the best part is that the results of the notebook are also available and persistent in the log of the job. This is useful especially if you don't want to store the results of your job to a file or table, but only need to see them after the run.
Do not forget to stop running jobs when you don't need them any more. This is important for several reasons: jobs will be executed even if the clusters are not running, so assuming that everything is idle just because the clusters are stopped, while a job is still active, can generate traffic and unwanted expenses. In addition, get a feeling for how long the job runs, so you can plan the cluster up/down time accordingly.
Tomorrow we will explore the Spark UI, metrics and logs that are available on the Databricks cluster and jobs.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 15 2020 - Databricks Spark UI,
Event Logs, Driver logs and Metrics
And the same information can be accessed from Jobs (it is just positioned in the overview of the job):
Both will get you to the same page.
1.Spark UI
After running a job, or executing commands in notebooks, check the Spark UI on the cluster where you executed the commands. The graphical user interface gives you an overview of the execution of particular jobs/executors and the timeline:
If you need a detailed description for a particular job ID (e.g. Job ID 13), you can see the execution time, duration, status and the job's globally unique identifier.
When clicking on the Description of this Job ID, you will get a more detailed overview. Besides the Event Timeline (what you can see in the above printscreen), you can also get the DAG visualization, for a better understanding of how the Spark API works and which services it is using.
Under stages (completed, failed) you will find a detailed execution description of each step.
And for each of the steps, under the description you can get even more detailed information about the stage. Here is an example of a detailed stage and the aggregated metrics:
This means that we need to install additional libraries on the cluster. Under the cluster, click on Libraries, and select "Install New" to get:
To start the tracking API for a particular run (on notebook), initiate it with:
run <- mlflow_start_run()
and add all the code, calculations and functions you want to be tracked in MLflow.
This is just a short dummy example of how to pass parameters, write logs and create artifacts for MLflow:
# Log a parameter (key-value pair)
mlflow_log_param("test_run_nof_runs_param1", 5)
# Log an artifact (output.txt is assumed to have been written earlier, e.g. with writeLines())
mlflow_log_artifact("output.txt")
When your code is completed, finish off with end run:
mlflow_end_run()
Within this block of code, each time you run it, the run will be documented and stored to the experiment.
Now under the Experiments Run, click the "View run details":
Scrolling down on this page, you will also find all the artifacts that you can store
during the runs (that might be pickled files, logs, intermediate results, binary files,
etc.)
3. Create a model
Once you have a dataset ready and the experiment running, you will want to register the model as well. The Model Registry takes care of this. In the same notebook, we will add a little experiment: a wine-quality experiment. The data is available at the Github repository, and you just add the file to your DBFS.
Now use the standard R packages:
library(mlflow)
library(glmnet)
library(carrier)
And load the data into a data.frame (please note that the file is in my FileStore DBFS location and the path might vary based on your location).
library(SparkR)
# the CSV is assumed to be on DBFS, e.g.: data <- read.df("/FileStore/Day16_wine_quality.csv", source = "csv", header = "true", inferSchema = "true", sep = ";")
display(data)
data <- as.data.frame(data)
In addition, I will detach the SparkR package, so it does not cause any interference between data types:
#detaching the package due to data type conflicts
detach("package:SparkR", unload=TRUE)
And now do the typical train and test split.
# Split the data into training and test sets. (0.75, 0.25) split.
sampled <- sample(1:nrow(data), 0.75 * nrow(data))
train <- data[sampled, ]
test <- data[-sampled, ]
mlflow_log_param("alpha", alpha)
mlflow_log_param("lambda", lambda)
mlflow_log_metric("rmse", rmse)
mlflow_log_metric("r2", r2)
mlflow_log_metric("mae", mae)
mlflow_log_model(
predictor,
artifact_path = "model",
registered_model_name = "wine-quality")
mlflow_end_run()
And this should also add runs to the same experiment. In addition, it will create a model in the Model Registry, and this model you can later version and approve to be moved to the next stage or environment.
Tomorrow we will do end-to-end machine learning project using all three languages,
R, Python and SQL.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
%r
library(SparkR)
# data_r is assumed to have been read from the wine-quality CSV on DBFS earlier in the notebook
display(data_r)
data_r <- as.data.frame(data_r)
And we can also do the same for Python:
import pandas as pd
data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=';')
We can also use Python to load the data and get insight into the dataset.
import matplotlib.pyplot as plt
import seaborn as sns
data_py = pd.read_csv("/dbfs/FileStore/Day16_wine_quality.csv", sep=',')
data_py.info()
We also import all the other packages that will be relevant in the following steps:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
2.Data wrangling
So let's continue using Python. You can get a sense of the dataset by using the Python describe function:
data_py.describe()
We also handle duplicate values (remove them) and missing values (remove them or replace them with the mean value; a sketch of the latter follows the code below):
#remove duplicates
sum(data_py.duplicated())
data_py.drop_duplicates(inplace=True)
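The missing-value handling mentioned above could look roughly like this (a hedged sketch; the choice of mean imputation is an assumption, not from the original):

# count missing values per column
print(data_py.isnull().sum())
# replace missing numeric values with the column mean (or drop them with data_py.dropna(inplace=True))
data_py = data_py.fillna(data_py.mean(numeric_only=True))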
Adding some boxplots will also give a good understanding of the data and the statistics of a particular variable. So, let's take pH and quality:
sns.boxplot(x='quality',y='pH',data=data_py,palette='GnBu_d')
plt.title("Boxplot - Quality and pH")
plt.show()
or quality with fixed acidity:
sns.boxplot(x="quality",y="fixed acidity",data=data_py,palette="coolwarm")
plt.title("Boxplot of Quality and Fixed Acidity")
plt.show()
And also add some correlation among all the variables in dataset:
plt.figure(figsize=(10,10))
sns.heatmap(data_py.corr(),annot=True,linewidth=0.5,center=0,cmap='coolwarm')
plt.show()
4.Modeling
We will split the dataset into the Y-set (our target variable) and the X-set (all the other variables). After that, we will split the X-set and Y-set into train and test subsets, as sketched after the code below.
X = data_py.iloc[:,:11].values
Y = data_py.iloc[:,-1].values
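The split and scaling described above could look roughly like this (a sketch; the test size, random state and use of StandardScaler are assumptions, relying on the imports from earlier):

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# standardise the features so that distance-based models (SVM, KNN) behave well
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)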
I will repeat this for the following algorithms: SVM, RandomForest, KNN, Naive Bayes
and I will make a comparison at the end.
SVM
#Fitting SVM into dataset
cl = SVC(kernel="rbf")
cl.fit(X_train,Y_train)
svm_pred=cl.predict(X_test)
svm_cm = confusion_matrix(Y_test,cl.predict(X_test))
print("The accuracy of SVM is:",accuracy_score(Y_test, svm_pred))
RandomForest
#Fitting Randomforest into dataset
rdf_c=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)
rdf_c.fit(X_train,Y_train)
rdf_pred=rdf_c.predict(X_test)
rdf_cm=confusion_matrix(Y_test,rdf_pred)
print("The accuracy of RandomForestClassifier
is:",accuracy_score(rdf_pred,Y_test))
KNN
#Fitting KNN into dataset
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,Y_train)
knn_pred=knn.predict(X_test)
knn_cm=confusion_matrix(Y_test,knn_pred)
print("The accuracy of KNeighborsClassifier is:",accuracy_score(knn_pred,Y_test))
and Naive Bayes
#Fitting Naive bayes into dataset
gaussian=GaussianNB()
gaussian.fit(X_train,Y_train)
bayes_pred=gaussian.predict(X_test)
bayes_cm=confusion_matrix(Y_test,bayes_pred)
print("The accuracy of naives bayes is:",accuracy_score(bayes_pred,Y_test))
And the accuracy of all the fitted models is the following:
LogisticRegression is: 0.4722502522704339
SVM is: 0.48335015136226034
KNeighborsClassifier is: 0.39455095862764883
naives bayes is: 0.46316851664984865
From these results it is clear which model performs best and that there is still plenty of room for improvement.
Tomorrow we will use Azure Data Factory with Databricks.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Azure Data Factory is an Azure service for ETL operations. It is a serverless service for data transformation, data integration and orchestration across several different Azure services. There is also some resemblance to SSIS (SQL Server Integration Services).
On the Azure portal (not on the Azure Databricks portal), search for "data factory" or "data factories" and you should get the search recommendations; "Data factories" is what we are looking for.
Once on the Data factories page, select "+ Add" to create a new ADF. Insert the needed information:
Subscription
Resource Group
Region
Name
Version
I have selected the same resource group as the one for Databricks, and as the name I have given it ADF-Databricks-day18.
Select additional information such as Git configuration (you can also do this part later), Networking for special configurations, Advanced settings, add tags and create the ADF. Once the creation and deployment are complete, jump into the service and you should see the dashboard for this particular service.
Select "Author & Monitor" to get into the Azure Data Factory. You will be redirected
to a new site:
On the left-hand side, you will find a vertical navigation bar. Look for the bottom icon, which looks like a toolbox and is for managing the ADF:
You will get to the settings page for this Data Factory.
We will be creating a new linked service - a linked service that will enable communication between Azure Data Factory and Azure Databricks. Select "+ New" to add a new linked service:
On your right-hand side, a window will pop up with the available services that you can link to ADF. Either search for Azure Databricks, or click "Compute" (not "Storage") and you should see the Databricks logo immediately.
Click on Azure Databricks and click "Continue". You will get a list of information you need to fill in.
All the needed information is relatively straightforward, yet there is the authentication we still need to sort out. We will use an access token. On Day 9 we used a Shared Access Signature (SAS); here we need to create an Azure Databricks token. Open a new window (but do not close the ADF settings for creating a new linked service) in Azure Databricks and go to the settings for this particular workspace.
Click on the icon (mine is "DB_py" and a glyph) and select "User Settings".
You can see, I have a Token ID and secret from Day 9 already in the Databricks
system. So let's generate a new Token by clicking "Generate New Token".
Give the token a name, "adf-db", and a lifetime period. Once the token is generated, copy it somewhere, as you will never see it again. Of course, you can always generate a new one, but if you have multiple services bound to it, it is wise to store it somewhere secure. Generate it and copy the token (!).
Go back to Azure Data Factory and paste the token into the settings. Select or fill in the additional information.
Once the linked service is created, select Author in the left vertical menu in Azure Data Factory.
This will bring you to a menu where you can start putting together a pipeline. In addition, you can also register datasets and data flows in ADF; this is especially useful for large-scale ETL or orchestration tasks.
Select Pipeline and choose "New pipeline". You will be presented with a designer tool where you can start putting together the pipelines, flows and all the orchestrations.
Under the section of Databricks, you will find:
Notebooks
Jar
Python
You can use Notebooks to build whole pipelines and pass information between notebooks. You can add an application (Jar), or you can add a Python script to carry out a particular task.
Drag and drop the elements on to the canvas. Select Notebook (Databricks
notebook), drag the icon to canvas, and drop it.
Under the settings for this particular notebook, you have a tab "Azure Databricks" where you select the linked service connection. Choose the one we created previously; mine is named "AzureDatabricks_adf". And under Settings, select the path to the notebook:
Once you finish entering all the needed information, remember to hit "Publish all"; if there is any conflict in the code, we can fix it immediately.
You can now trigger the pipeline, schedule it, or connect it to another notebook, by selecting Run or Debug.
In this manner you can schedule and connect other services with Azure Databricks.
Tomorrow, we will look into adding a Python element or another notebook to make
more use of Azure Data factory.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
This is one of the scenarios where you have multiple CSV files coming into blob storage (a particular folder) and we want to:
merge the CSV files
merge the files with some transformation in between
transform the files first and then do the merge
copy files from one data lake zone to another zone, with a transformation in between
or anything else...
Regardless of the scenario, let's dive in.
1.Create linked service for Azure Blob
Yesterday (Day 18) we looked at how to create a linked service for Azure Databricks. We will need another linked service, this time for Azure Blob Storage. Navigate to linked services and create a new one, of type Azure Blob Storage.
While configuring it, select your Azure subscription and choose the storage account we created on Day 9 (I called it dbpystorage). You should have something like this:
On Day 9 we also copied a file into the blob storage, called Day9MLB_players.csv (the file is also available at the Github repository). Now you should have the Azure Blob Storage and Azure Databricks services linked to Azure Data Factory.
We will now need to create a dataset and a pipeline in ADF.
2.Creating a dataset
To add a new dataset, go to Datasets and select "New Dataset". A window will pop up asking for the location of the dataset. Select Azure Blob Storage, because the file is available in this service.
After selecting the storage type, you will be prompted for the file type. Choose CSV - the DelimitedText type.
After this, specify the path to the file. As I am using only one file, I am specifying the name. Otherwise, if this folder were a landing zone for multiple files (with the same schema), I could use a wildcard, e.g. Day*.csv, and all files following this pattern would be read.
Once you have a dataset created, we will need a pipeline to connect the services.
3. Creating Pipeline
On the Author view in ADF, create a new pipeline. A new canvas will appear for you to start working on the data integration.
Select the "Copy Data" element and the "Databricks" element. The Copy Data element will need the source and the sink data. It can copy a file from one location to another, merge files into another location, or change the format (e.g. going from CSV to Parquet). I will be merging CSV files into a single CSV.
Select all the properties for Source.
And for the Sink: for copy behaviour, I am selecting "merge files" to mimic the ETL job.
Once this part is completed, we need to look into the Databricks element:
An Azure Databricks notebook can hold literally anything - from data transformation to data merges and analytics - or it can even serve as a transformation element and a connection to further elements. In this case, the Databricks element will only handle reading the data and creating a table.
4. Notebook
Before connecting the elements in ADF, we need to give some instructions to the notebook. Head to Azure Databricks and create a new notebook. I have named mine Day19_csv and chosen the language Python.
Set up the connection to the file (this time using Python - before, we used Scala):
%python
storage_account_name = "dbpystorage"
storage_account_access_key = "YOUR_ACCOUNT_ACCESS_KEY"
file_location = "wasbs://dbpystorecontainer@dbpystorage.blob.core.windows.net/"
file_type = "csv"
spark.conf.set("fs.azure.account.key."+storage_account_name+".blob.core.windows.net", storage_account_access_key)
After the initial connection is set, we can load the data and create a SQL table:
%python
df = spark.read.format(file_type).option("header","true").option("inferSchema",
"true").load(file_location)
df.createOrReplaceTempView("Day9data_view")
And the SQL query:
%sql
SELECT * FROM Day9data_view
--Check number of rows
SELECT COUNT(*) FROM Day9data_view
You can add many other data transformation or ETL scripts. Or you can plug in a machine learning script to do data analysis and predictions. Normally, I would add an analysis of the merged dataset and save or expose the results to other services (via ADF), but to keep the post short, let's keep it as it is.
5. Connecting the dots
Back in Azure Data Factory, set up the Notebook element, select the Azure Databricks linked service and, under Settings, set the path to the notebook we created in the previous step.
You can always browse through the path and select the correct path.
Once you set the path, you can connect the elements (or activities) together, debug and publish all the elements. Once published, you can schedule and run the pipeline.
This pipeline can be scheduled, used as part of a bigger ETL process, or extended. You can have each notebook doing part of the ETL and have the notebooks orchestrated in ADF, or you can have data flows created in ADF and connect Python code (instead of notebooks). The possibilities are endless. Even if you want to capture streaming data, you can use ADF and Databricks, or only Databricks with Spark, or other services (Event Hubs, Azure Functions, etc.).
Tomorrow we will look at this orchestration part using two notebooks, with Scala and Python.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
import datetime

def current_time():
    e = datetime.datetime.now()
    print("Current date and time = %s" % e)
Day20_NB2 - is the notebook that outputs the results to a SQL table.
And the SQL code:
%sql
INSERT INTO day10.day20_NB_run VALUES (10, "Running from Day20_Notebook2",
CAST(current_timestamp() AS TIMESTAMP))
Day20_NB3_Widget - is the notebook that receives the arguments from the previous step (either a notebook, a CRON schedule or a function), executes its steps accordingly with this input information, and stores the results to a SQL table.
And the code from this notebook. This step creates a widget on the notebook.
dbutils.widgets.dropdown("Wid_arg", "1", [str(x) for x in range(1, 10)])
Each time a number is selected, you can run the command to get the value out:
selected_value = dbutils.widgets.get("Wid_arg")
print("Returned value: ", selected_value)
And the last step is to insert the values (and also the result of the widget) into SQL
Table.
%sql
INSERT INTO day10.day20_NB_run VALUES (10, "Running from day20_Widget notebook",
CAST(current_timestamp() AS TIMESTAMP))
Day20_Main - is the umbrella notebook, or the main notebook, where all the orchestration is carried out. This notebook also holds the logic behind the steps and their communication.
%sql
INSERT INTO day10.day20_NB_run VALUES (10, "Running from day20_Main notebook",
CAST(current_timestamp() AS TIMESTAMP))
For each step in between, I am checking the values in the SQL table, giving you the current status, and also logging all the steps in the Main notebook.
The command below executes a notebook with input parameters: it runs the Day20_NB3_Widget notebook, waits at most 60 seconds for it to complete (the second argument is the timeout), and passes the collection of parameters as the last argument.
dbutils.notebook.run("Day20_NB3_Widget", 60, {"Wid_arg": "5"})
We have seen that the orchestration has the capacity to run in multiple languages, use input or output parameters and can be part of a larger ETL or ELT process.
Tomorrow we will check and explore how to go about Scala, since we haven't yet discussed anything about Scala, but have mentioned it on multiple occasions.
Complete set of code and Notebooks will be available at the Github repository.
Happy Coding and Stay Healthy!
Dec 21 2020 - Using Scala with Spark
Core API in Azure Databricks
And in the following blogposts we will explore the core engine and services on top:
Spark SQL+ Dataframes
Streaming
MLlib - Machine learning library
GraphX - Graph computations
Apache Spark is a powerful open-source processing engine built around speed, ease
of use, and sophisticated analytics.
Spark Core is the underlying general execution engine for the Spark platform, with all other functionality built on top of it. It is an in-memory computing engine that provides a variety of language support - Scala, R, Python (and Java) - for easier data engineering and machine learning development.
Spark has three key interfaces:
Resilient Distributed Dataset (RDD) - It is an interface to a sequence of data objects
that consist of one or more types that are located across a collection of machines (a
cluster). RDDs can be created in a variety of ways and are the “lowest level” API
available. While this is the original data structure for Apache Spark, you should focus
on the DataFrame API, which is a superset of the RDD functionality. The RDD API is
available in the Java, Python, and Scala languages.
DataFrame - similar in concept to the DataFrame you will find with the pandas
Python library and the R language. The DataFrame API is available in the Java, Python,
R, and Scala languages.
Dataset - is a combination of RDD and DataFrame. It provides the typed interface of an RDD and gives you the convenience of the DataFrame. The Dataset API is available only for Scala and Java.
In general, when you are working on performance optimisations, either DataFrames or Datasets should be enough. But when going into more advanced components of Spark, it may be necessary to use RDDs. Also, the visualisation within the Spark UI directly references RDDs.
1.Datasets
Let us start with Databricks datasets, that are available within every workspace and
are here mainly for test purposes. This is nothing new; both Python and R come with
sample datasets. For example the Iris dataset that is available with Base R engine and
Seaborn Python package. Same goes with Databricks and sample dataset can be
found in /databricks-datasets folder.
Create a new notebook in your workspace and name it Day21_Scala. Language: Scala.
And run the following Scala command.
display(dbutils.fs.ls("/databricks-datasets"))
You can always store the results to a variable and use it multiple times later:
// transformation
val textFile = spark.read.textFile("/databricks-datasets/samples/docs/README.md")
and listing the content of the variable by using show() function:
textFile.show()
And some other useful functions; to count all the lines in textfile, to show the first
line and to filter the text file showing only the lines containing the search argument
(word sudo).
// Count the number of lines in textFile
textFile.count()
// Show the first line of the textFile
textFile.first()
// show all the lines with word Sudo
val linesWithSudo = textFile.filter(line => line.contains("sudo"))
And also printing the first four lines containing the word "sudo". The second example finds the largest number of words in a single line:
// Output the all four lines
linesWithSudo.collect().take(4).foreach(println)
// find the lines with most words
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
2. Create a dataset
Now let's create a dataset (remember the difference
between Dataset and DataFrame) and load some data from /databricks-
datasets folder.
val df = spark.read.json("/databricks-datasets/samples/people/people.json")
3. Convert Dataset to DataFrame
We can also convert a Dataset to a DataFrame for easier operation and usage. We must define a case class that represents a type-specific Scala JVM object (like a schema) and then repeat the read with this definition.
case class Person (name: String, age: Long)
val ds =
spark.read.json("/databricks-datasets/samples/people/people.json").as[Person]
We can also create and define another dataset, taken from the /databricks-datasets
folder that is in JSON (flattened) format:
// define a case class that represents the device data.
case class DeviceIoTData (
battery_level: Long,
c02_level: Long,
cca2: String,
cca3: String,
cn: String,
device_id: Long,
device_name: String,
humidity: Long,
ip: String,
latitude: Double,
longitude: Double,
scale: String,
temp: Long,
timestamp: Long
)
val ds =
spark.read.json("/databricks-datasets/iot/iot_devices.json").as[DeviceIoTData]
and run the show() function (ds.show()) to see the imported Dataset from the JSON file:
Now let's play with the dataset using Scala Dataset API with following frequently
used functions:
display(),
describe(),
sum(),
count(),
select(),
avg(),
filter(),
map() or where(),
groupBy(),
join(), and
union().
display()
You can also view the dataset using display() (similar to .show() function):
display(ds)
describe()
Describe() function is great for exploring the data and the structure of the data:
ds.describe()
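sum()
The sum() step itself appears to be missing from this copy; a minimal sketch that would produce the sum_c02_1 value displayed below (assuming the ds Dataset defined above) is:
// total c02_level over the whole dataset (assumed definition of sum_c02_1)
val sum_c02_1 = ds.groupBy().sum("c02_level")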
display(sum_c02_1)
And we can also double-check the result of this sum with SQL, just because it is fun. But first we need to create a SQL view (or it could be a table) from this dataset.
ds.createOrReplaceTempView("SQL_iot_table")
And then define the cell as a SQL statement, using %sql. Remember, all of today's code is written in Scala, unless otherwise stated with %{lang} at the beginning of the cell.
%sql
SELECT sum(c02_level) as Total_c02_level FROM SQL_iot_table
And for sure, we get the same result (!).
select()
Select() function will let you show only the columns you want to see.
// Both will return same results
ds.select("cca2","cca3", "c02_level").show()
// or
display(ds.select("cca2","cca3","c02_level"))
avg()
Avg() function will let you aggregate a column (let us take: c02_level) over another
column (let us take: countries in variable cca3). First we want to calculate average
value over the complete dataset:
val avg_c02 = ds.groupBy().avg("c02_level")
display(avg_c02)
And then also the average value for each country:
val avg_c02_byCountry = ds.groupBy("cca3").avg("c02_level")
display(avg_c02_byCountry)
filter()
Filter() function will shorten or filter out the values that do not comply with the condition. Filter() can also be replaced by where(); they both have similar behaviour.
The following command will return the rows that meet the condition where battery_level is greater than 7.
display(ds.filter(d => d.battery_level > 7))
And the following command will filter the dataset on the same condition, but only return the specified columns (in comparison with the previous command, which returned all columns):
display(ds.filter(d => d.battery_level > 7).select("battery_level", "c02_level",
"cca3"))
groupBy()
Adding aggregation to filtered data (avg() function) and grouping dataset based on
cca3 variable:
display(ds.filter(d => d.battery_level > 7).select("c02_level",
"cca3").groupBy("cca3").avg("c02_level"))
Note the explicit lambda inside the filter function: the part "d => d.battery_level > 7" defines a predicate over the typed rows, and the same style of function can also be used with map(), in the spirit of the map-reduce paradigm.
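The same filter can also be expressed with the where() function, which accepts a SQL-like condition string; a small sketch (assuming the same ds Dataset):
display(ds.where("battery_level > 7").select("battery_level", "c02_level", "cca3"))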
join()
Join() function will combine two objects. So let us create two simple DataFrames and
create a join between them.
val df_1 = Seq((0, "Tom"), (1, "Jones")).toDF("id", "first")
val df_2 = Seq((0, "Tom"), (2, "Jones"), (3, "Martin")).toDF("id", "second")
Using function Seq() to create a sequence and toDF() to save it as DataFrame.
To join two DataFrames, we use
display(df_1.join(df_2, "id"))
Name of the first DataFrame - df_1 (on left-hand side) joined by second DataFrame
- df_2 (on the right-hand side) by a column "id".
Join() implies an inner join and returns all the rows where there is a complete match. If interested, you can also explore the execution plan of this join by adding explain at the end of the command:
df_1.join(df_2, "id").explain
and also create left/right join or any other semi-, anti-, cross- join.
df_1.join(df_2, Seq("id"), "LeftOuter").show
df_1.join(df_2, Seq("id"), "RightOuter").show
union()
To append two datasets (or DataFrames), union() function can be used.
val df3 = df_1.union(df_2)
display(df3)
// df3.show(true)
distinct()
Distinct() function will return only the unique values, and it can be combined with union() to achieve UNION (deduplicated) behaviour, since union() itself behaves like UNION ALL:
display(df3.distinct())
Tomorrow we will cover Spark SQL and DataFrames with the Spark Core API in Azure Databricks. Today's post was a little longer, but it is important to get a good understanding of the Spark API, get your hands wrapped around Scala and start working with Azure Databricks.
Complete set of code and Scala notebooks (including HTML) will be available at
the Github repository.
Happy Coding and Stay Healthy!
"Spark SQL is a spark module for structured data processing and data querying. It
provides programming abstraction called DataFrames and can also serve as
distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up
to 100x faster on existing deployments and data. It also provides powerful
integration with the rest of the Spark ecosystem (e.g.: integrating SQL query
processing with machine learning)." (Apache Spark Tutorial).
Start your Azure Databricks workspace and create new Notebook. I named mine
as: Day22_SparkSQL and set the language: SQL. Now let's explore the functionalities
of Spark SQL.
1.Loading Data
We will load data from /databricks-datasets using Spark SQL, R and Python
languages. The CSV dataset will be data_geo.csv in the following folder:
%scala
display(dbutils.fs.ls("/databricks-datasets/samples/population-vs-price"))
1.1. Loading using Python
%python
data =
spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv",
header="true", inferSchema="true")
And materialize the data by creating a view named data_geo_py:
%python
data.createOrReplaceTempView("data_geo_py")
And run the following SQL Statement:
SELECT * FROM data_geo_py LIMIT 10
1.2. Loading using SQL
DROP TABLE IF EXISTS data_geo;
3.8.Window functions
SELECT
City
,State
,RANK() OVER (PARTITION BY State ORDER BY `2015 median sales price`) AS rank
,`2015 median sales price` AS MedianPrice
FROM data_geo
WHERE
`2015 median sales price` IS NOT NULL;
4. Exploring the visuals
Results of SQL SELECT statements that are returned as a table can also be visualised. Given the following SQL statement:
SELECT `State Code`, `2015 median sales price` FROM data_geo
in the result cell you can select the plot icon and pick Map.
Furthermore, using "Plot Options..." you can change the settings of the variables on
the graph, aggregations and data series.
With additional query:
SELECT
`State Code`
, `2015 median sales price`
FROM data_geo_SQL
ORDER BY `2015 median sales price` DESC;
There are also many other visuals available and much more SQL statements to
explore and feel free to go a step further and beyond this blogpost.
Tomorrow we will explore Streaming with Spark Core API in Azure Databricks.
Complete set of code and SQL notebooks (including HTML) will be available at
the Github repository.
Happy Coding and Stay Healthy!
Spark Streaming is the process that can analyse not only batches of data but also streams of data in near real-time. It powers interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is a fault-tolerant system: thanks to the lineage of operations, Spark always remembers where you stopped, and in case of a worker error, another worker can recreate all the data transformations from the partitioned RDD (assuming that all the RDD transformations are deterministic).
Spark Streaming has native connectors to many data sources, such as HDFS, Kafka, S3, Kinesis and even Twitter.
Start your Workspace in Azure Databricks. Create a new notebook, name it Day23_streaming and use the default language: Python. If you decide to use Event Hubs or read data from HDFS or other sources, the Scala language might be a slightly better fit.
Import the Spark context and streaming modules (otherwise just importing the pyspark.sql namespace will do):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
We will be using the demo data from databricks-datasets folder:
%fs ls /databricks-datasets/structured-streaming/events/
# Path and schema of the sample JSON event files (fields: time, action) - assumed here, as they are not shown above
inputPath = "/databricks-datasets/structured-streaming/events/"
jsonSchema = StructType([StructField("time", TimestampType(), True),
                         StructField("action", StringType(), True)])

streamingInputDF = (
  spark
    .readStream
    .schema(jsonSchema)               # Set the schema of the JSON data
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream, one file at a time
    .json(inputPath)
)
streamingCountsDF = (
streamingInputDF
.groupBy(
streamingInputDF.action,
window(streamingInputDF.time, "1 hour"))
.count()
)
You start a streaming computation by defining a sink and starting it. In this case, to query the counts interactively, set the complete set of 1-hour counts to be stored in an in-memory table.
Run the following command to examine the outcome of a query.
query = (
streamingCountsDF
.writeStream
.format("memory") # memory = store in-memory table (for testing only)
.queryName("counts") # counts = name of the in-memory table
.outputMode("complete") # complete = all the counts should be in the table
.start()
)
And once the cluster is running, you can do a variety of analysis. The key component is the .start() method, which kicks off the streaming query so that Spark keeps processing data as new files arrive.
You can also further shape the data by using Spark SQL:
%sql
SELECT
action
,date_format(window.end, "MMM-dd HH:mm") as time
,count
FROM counts
ORDER BY time, action
Tomorrow we will explore Spark's own MLlib package for Machine Learning using
Azure Databricks.
Complete set of code and SQL notebooks (including HTML) will be available at
the Github repository.
Happy Coding and Stay Healthy!
Dec 24 2020 - Using Spark MLlib for
Machine Learning in Azure Databricks
4. Logistic Regression
In the Pipelines API, we are now able to perform Elastic-Net Regularization with
Logistic Regression, as well as other linear methods.
from pyspark.ml.classification import LogisticRegression

# Assumed column names; trainingData / testData are the splits prepared earlier in the notebook
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
lrModel = lr.fit(trainingData)
And make predictions on the test dataset, using the transform() method so that only the vector of features is used as the input column:
predictions = lrModel.transform(testData)
We can check the dataset:
selected = predictions.select("label", "prediction", "probability", "age",
"occupation")
display(selected)
And the score of the evaluated predictions is: 0.898976. What we want to do next is to fine-tune the model with the ParamGridBuilder and the CrossValidator. You can use explainParams() to see the list of parameters and their definitions. Set up the ParamGrid with regularization parameters, ElasticNet parameters and the number of maximum iterations.
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
You can also change the graphs here and explore each observation in the dataset:
The advent is here :) And I wish you all a Merry Christmas and a Happy New Year 2021. The series will continue for a couple more days. And tomorrow we will explore Spark's GraphX for the Spark Core API.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!
Motif - you can build more complex relationships involving edges and vertices. The following cell finds the pairs of vertices with edges in both directions between them. The result is a DataFrame, in which the column names are given by the motif keys.
Stateful - combines GraphFrame motif finding with filters on the result, where the filters use sequence operations to operate over DataFrame columns. It is therefore called stateful (vis-a-vis stateless), because it remembers the previous state.
Now you can start using the notebook. Import the packages that we will need.
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *
1. Create a sample dataset
We will create a sample dataset (taken from the Databricks website), which will be inserted as a DataFrame.
Vertices:
vertices = sqlContext.createDataFrame([
("a", "Alice", 34, "F"),
("b", "Bob", 36, "M"),
("c", "Charlie", 30, "M"),
("d", "David", 29, "M"),
("e", "Esther", 32, "F"),
("f", "Fanny", 36, "F"),
("g", "Gabby", 60, "F"),
("h", "Mark", 45, "M"),
("i", "Eddie", 60, "M"),
("j", "Mandy", 21, "F")
], ["id", "name", "age", "gender"])
Edges:
edges = sqlContext.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend"),
("a", "h", "follow"),
("a", "i", "follow"),
("a", "j", "follow"),
("j", "h", "friend"),
("i", "c", "follow"),
("i", "c", "friend"),
("b", "j", "follow"),
("d", "h", "friend"),
("e", "j", "friend"),
("h", "a", "friend")
], ["src", "dst", "relationship"])
Let's create a graph using vertices and edges:
graph_sample = GraphFrame(vertices, edges)
print(graph_sample)
Or you can achieve same with:
# This example graph also comes with the GraphFrames package.
from graphframes.examples import Graphs
same_graph = Graphs(sqlContext).friends()
print(same_graph)
2.Querying graph
We can display Edges, vertices, incoming or outgoing degrees:
display(graph_sample.vertices)
#
display(graph_sample.edges)
#
display(graph_sample.inDegrees)
#
display(graph_sample.degrees)
And you can even combine some filtering and use aggregation functions:
youngest = graph_sample.vertices.groupBy().min("age")
display(youngest)
3.Using motif
Using motifs you can build more complex relationships involving edges and vertices.
The following cell finds the pairs of vertices with edges in both directions between
them. The result is a DataFrame, in which the column names are given by the motif
keys.
# Search for pairs of vertices with edges in both directions between them.
motifs = graph_sample.find("(a)-[e]->(h); (h)-[e2]->(a)")
display(motifs)
4.Using Filter
You can filter out the relationships between nodes by adding multiple predicates.
filtered = motifs.filter("(b.age > 30 or a.age > 30) and (a.gender = 'M' and
b.gender ='F')")
display(filtered)
# I guess Mark has a crush on Alice, but she just wants to be a follower :)
5. Stateful Queries
Stateful queries are a set of filters applied over a given sequence of motif elements, hence the name. You can combine GraphFrame motif finding with filters on the result, where the filters use sequence operations to operate over DataFrame columns. Following is an example:
# Find chains of 4 vertices.
chain4 = graph_sample.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
Dec 26 2020 - Connecting Azure Machine Learning Services Workspace and Azure Databricks
Many of these should already be available from previous days (Key Vault, Storage Account, Container Registry). You should only need to create a new Application Insights resource. Click Review + create and Create.
Once completed, you can download the deployment script or go directly to the resource. Among the resources, one new resource will be created for you. Go to this resource. You will see that you are introduced to the Machine Learning Workspace and you can launch the Studio.
Since this is the first-time setup of this new workspace that has a connection to Azure Databricks, you will be prompted for additional information about the Active Directory account:
Linking your ADB workspace to your Azure Machine Learning workspace enables you
to track your experiment data in the Azure Machine Learning workspace.
The following code should be in your experiment notebook IN AZURE DATABRICKS
(!) to get your linked Azure Machine Learning workspace.
import mlflow
import mlflow.azureml
import azureml.mlflow
import azureml.core
from azureml.core import Workspace

# Your subscription ID under which both Databricks and the ML Service run
subscription_id = 'subscription_id'
# Azure Machine Learning resource group NOT the managed resource group
resource_group = 'resource_group_name'
# Name of the Azure Machine Learning workspace (assumed placeholder)
workspace_name = 'aml_workspace_name'
# Assumed continuation of the snippet: retrieve the linked workspace and point MLflow tracking at it
ws = Workspace(subscription_id, resource_group, workspace_name)
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
Before you click the "Set up RStudio" button, you will need to disable the auto termination option. By default it is enabled, making a cluster terminate itself after a period of inactivity. Under configuration, select Edit and disable the termination. The cluster will restart. Then click "Set up RStudio". Beware: with auto termination disabled, do not forget to stop the cluster yourself after finishing work (!).
Click Set up RStudio and you will get the following credentials:
And click the "Open RStudio" and you will get redirected to web portal with RStudio
opening.
In order to get the Databricks cluster objects into R Studio, you must also run the
spark_connect:
SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")
And you will see all the DataFrames or CSV Files from previous days:
Please note that you can also connect from the RStudio desktop version (!). If so, these are the steps:
Open your RStudio Desktop and install:
install.packages("devtools")
devtools::install_github("sparklyr/sparklyr")
Install Databricks-connect in CLI (it is a 250Mb Package):
pip uninstall pyspark
pip install -U databricks-connect
Now set the connections to Azure Databricks:
databricks-connect get-jar-dir
And after that run the command in CLI:
databricks-connect configure
The CLI will look like a text input, and all the information you need to fill into the CLI can be found in the workspace URL:
#Display
display(data)
And once you run this, you should have the results.
Make sure you also choose the correct Python interpreter in Visual Studio Code.
3. Connecting to Azure Databricks with ODBC
You can also connect Azure Databricks SQL tables using ODBC to your on-premise
Excel or to Python or to R.
It will only see the SQL tables and connections, but it can be done. This will require some ODBC installation, which I will not go into here.
Tomorrow we will look into Infrastructure as Code and how to automate, script and
deploy Azure Databricks.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!
You will need nothing but the CLI and PowerShell, which you already have. So let's go into the CLI and get the Azure PowerShell module.
In CLI type:
if ($PSVersionTable.PSEdition -eq 'Desktop' -and (Get-Module -Name AzureRM -
ListAvailable)) {
Write-Warning -Message ('Az module not installed. Having both the AzureRM and
' +
'Az modules installed at the same time is not supported.')
} else {
Install-Module -Name Az -AllowClobber -Scope CurrentUser
}
After that, you can connect to your Azure subscription:
Connect-AzAccount
You will be prompted to add your credentials. And once you enter them, you will get
the results on your Account, tenantID, Environment and Subscription Name.
Once connected, we will look into Databricks module. To list all the modules:
Get-Module -ListAvailable
To explore what is available in Az.Databricks, let's check with the following PS command:
Get-Command -Module Az.Databricks
Now we can create a new workspace. In this manner you can also create "semi" automation, but ARM will make the next steps even easier.
New-AzDatabricksWorkspace `
-Name databricks-test `
-ResourceGroupName testgroup `
-Location eastus `
-ManagedResourceGroupName databricks-group `
-Sku standard
Or we can use ARM (Azure Resource Manager) deployment:
$templateFile = "/users/template.json"
New-AzResourceGroupDeployment `
-Name blanktemplate `
-ResourceGroupName myResourceGroup `
-TemplateFile $templateFile
Or you can go through Deployment process in Azure Portal:
And select a Github template to create a new Azure Databricks workspace:
Or you can go under "Build your own template" and get my Github Repository IaC
folder with template.json and Parameters.json files and paste the content in here.
First, add the new resource group:
New-AzResourceGroup -Name RG_123xyz -Location "westeurope"
And at the end generate the JSON files for your automated deployment. Adding with
parameters file:
$templateFile = "/users/tomazkastrun/Documents/GitHub/Azure-Databricks/iac/template.json"
$parameterFile = "/users/tomazkastrun/Documents/GitHub/Azure-Databricks/iac/parameters.json"
New-AzResourceGroupDeployment -Name DataBricksDeployment -ResourceGroupName
RG_123xyz -TemplateFile $templateFile -TemplateParameterFile $parameterFile
This will take some time:
These values will be the same as the ones in the parameters.json file. In this manner you can automate your deployment for continuous integration (CI) and continuous deployment (CD).
Tomorrow we will dig into Apache Spark.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!
There are indirect and direct performance improvements that you can leverage to make your Spark runs faster.
1. Choice of Languages
Java versus Scala versus R versus Python (versus HiveQL)? There is no correct or wrong answer to this choice, but there are some important differences worth mentioning. If you are running single-node machine learning, R (via SparkR) is a strong option, since R has a massive machine-learning ecosystem with many optimised algorithms for such workloads.
If you are running an ETL job, Spark in combination with any of the languages (R, Python, Scala) will yield great results. Spark's Structured APIs are consistent in terms of speed and stability across all the languages, so there should be almost no differences. Things get more interesting when you need UDFs (user-defined functions) that cannot be expressed directly in the Structured APIs. In this case, neither R nor Python is a good idea, simply because of the way the Structured API manifests and transforms as RDDs. In general, Python is a better choice than R for writing UDFs, but the best option is to write UDFs in Scala (or Java), making these language jumps easier for the API interpreter.
2.Choice of data presentation
DataFrames versus Datasets versus SQL versus RDDs is another choice, yet it is a fairly easy one. DataFrames, Datasets and SQL objects are all equal in performance and stability (at least from Spark 2.3 onwards), meaning that if you are using DataFrames in any language, performance will be the same. Again, when writing custom objects or functions (UDFs), there will be some performance degradation with both R and Python, so switching to Scala or Java might be an optimisation.
A rule of thumb is: stick to DataFrames. If you go a layer down to RDDs, Spark will make better optimisation and use of them than you will. The Spark optimisation engine will write better RDD code than you do, and with certainly less effort. And by doing it yourself, you might also lose additional Spark optimisations that come with new releases.
When using RDDs, try to use Scala or Java. If this is not possible and you will be using Python or R extensively, try to use RDDs as little as possible and convert to DataFrames as quickly as possible. Again, if your Spark code, application or data engineering task is not compute intensive, it should be fine; otherwise remember to use Scala or Java, or convert to DataFrames. Neither Python nor R handles serialisation of RDDs optimally: a lot of data is shuttled to and from the Python or R engine, causing a lot of data movement and traffic, potentially making the RDDs unstable and the performance poor.
3. Data Storage
Storing data effectively is relevant when data will be read multiple times. If data will
be accessed many times, either from different users in organisation or from a single
user, all making data analysis, make sure to store it for effective reads. Choosing your
storage, choosing the data formats and data partitioning is important.
With numerous file formats available, there are some key differences. If you want to optimise your Spark job, data should be stored in the best possible format for it. In general, always favour structured, binary formats for storing your data, especially when it is accessed frequently. Although CSV files look well formatted, they are obnoxiously sparse, can have "edge" cases (missing line breaks, or other delimiters), are painfully slow to parse and hard to partition. The same logic applies to txt and xml formats. Avro (a row-oriented format with a JSON-defined schema) is better suited to write-heavy workloads than to analytical scans, and I am not even going to talk about the XML format. Spark works best with data stored as Apache Parquet. The Parquet format stores data in binary files in column-oriented storage, and also tracks statistics about the files, making it possible to skip files not needed for a query.
4. Table partitioning and bucketing
Table partitioning refers to storing files in separate directories based on a partition key (e.g. date of purchase, VAT number), such as a date field in the data stored in these directories. Partitioning helps Spark skip files that are not needed for the end result, so it returns only the data that is in the range of the key. There are potential pitfalls to this technique, one for sure being the size of these subdirectories and how to choose the right granularity.
Bucketing is a process of "pre-partitioning" data to allow better data joins and aggregation operations. This improves performance, because data can be consistently distributed across partitions as opposed to all being in one partition. So if you repeatedly run queries where joins are performed on a particular column immediately after read, you can use bucketing to ensure that the data is well partitioned in accordance with those values. This will prevent a shuffle before the join and speed up data access.
5.Parallelism
Splittable data formats make a Spark job easier to run in parallel. A ZIP or a TAR file cannot be split, which means that even if you have 10 files in a ZIP file and 10 cores, only one core can read in that data, because Spark cannot parallelise across the ZIP file. Files compressed with GZIP, BZIP2 or LZ4 are generally splittable only if they were written by a parallel processing framework like Spark or Hadoop.
In general, Spark will work best when there are two to three tasks per CPU core in
your cluster when working especially with large (big) data. You can also tune
the spark.default.parallelism property.
6. Number of files
With numerous small files you will for sure pay a price for listing and fetching all the data. There is no golden rule on the number of files and the size of the files per directory, but there are some directions. Many small files are going to make the scheduler work harder to locate the data and launch all the read tasks. This can increase not only disk I/O but also network traffic. On the other end of the spectrum, having fewer and larger files can ease the workload of the scheduler, but it will make individual tasks run longer. Again, a rule of thumb would be to scope the size of the files in such a way that they contain a few tens of megabytes of data. From Spark 2.2 onwards there are also options to control partitioning and file sizing.
7. Temporary data storage
Data that will be reused constantly is a great candidate for caching. Caching will place a DataFrame, Dataset, SQL table or RDD into temporary storage (either memory or disk) across the executors in your cluster. You might want to cache only datasets that will be used several times later on, but it should not be done hastily, because caching also has costs: serialisation, deserialisation and storage. You can tell Spark to cache data by using the cache command on DataFrames or RDDs.
Let's put this to the test. In Azure Databricks create a new
notebook: Day29_tuning and language: Python and attach the notebook to your
cluster. Load a sample CSV file:
%python
DF1 = (spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/databricks-datasets/COVID/covid-19-data/us-states.csv"))
The bigger the files, the more evident the difference will be. Create some
aggregations:
DF2 = DF1.groupby("state").count().collect()
DF3 = DF1.groupby("date").count().collect()
DF4 = DF1.groupby("cases").count().collect()
After you have tracked the timing, now, let's cache the DF1:
DF1.cache()
DF1.count()
And rerun the previous command:
DF2 = DF1.groupby("state").count().collect()
DF3 = DF1.groupby("date").count().collect()
DF4 = DF1.groupby("cases").count().collect()
And you should see the difference in results. As mentioned before, the bigger the
dataset, the bigger would be time gained back when caching data.
Today we have touched on a couple of performance-tuning points and the approach one should take to improve the work of Spark in Azure Databricks. These are probably the most frequent performance tunings, and they are relatively easy to adjust.
Tomorrow we will look further into Apache Spark.
Complete set of code and the Notebook is available at the Github repository.
Happy Coding and Stay Healthy!
1.Monitoring
Spark in Databricks is relatively well taken care of and can be monitored from the Spark UI. Since Databricks is an encapsulated platform, Azure in a way manages many of the components for you, from the network to the JVM (Java Virtual Machine), the hosting operating system and many of the cluster components (Mesos, YARN and other Spark cluster applications).
As we have seen in the Day 15 post, you can monitor queries, tasks, jobs, Spark logs and the Spark UI in Azure Databricks. Spark logs will help you pinpoint the problem you are encountering. They are also good for building a history of logs to understand the behaviour of a job or task over time, and for possible future troubleshooting.
Spark UI is a good visual way to monitor what is happening to your cluster and offers
a great value of metrics for troubleshooting.
It also gives you detailed information on Spark Tasks and great visual presentation of
the task run, SQL run and detailed run of all the stages.
Pi estimation
Spark can also be used for compute-intensive tasks. This code estimates π by
"throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1))
and see how many fall in the unit circle. The fraction should be π / 4, so we use this
to get our estimate.
Python
Scala
Java
import random
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, 100000)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / 100000))
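For comparison, a Scala sketch of the same estimation (following the standard Spark example and assuming sc is available; 100000 is an arbitrary sample count):
val count = sc.parallelize(1 to 100000).filter { _ =>
  val x = math.random
  val y = math.random
  x * x + y * y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / 100000}")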
// hello world
val hello = "Hello, world"
hello: String = Hello, world
Arrays
Array(1,2,3,4,5).foreach(println)
println()
Array("Java","Scala","Python","R","Spark").foreach(println)
1 2 3 4 5 Java Scala Python R Spark
Functional Programming Overview
// assign a function to variable
val inc = (x : Int) => x + 1
inc(7)
inc: Int => Int = <function1> res41: Int = 8
// passing a function as parameter
(1 to 5) map (inc)
res42: scala.collection.immutable.IndexedSeq[Int] = Vector(2, 3, 4, 5, 6)
// take even number, multiply each value by 2 and sum them up
(1 to 7) filter (_ % 2 == 0 ) map (_ * 2) reduce (_ + _)
res43: Int = 24
// Scala
val name = "Scala"
val hasUpperCase = name.exists(_.isUpper)
name: String = Scala hasUpperCase: Boolean = true
def loopWhile(cond: => Boolean)(f: => Unit) : Unit = {
  if (cond) {
    f
    loopWhile(cond)(f)
  }
}

var i = 10
loopWhile(i > 0) {
  println(i)
  i -= 1
}
10 9 8 7 6 5 4 3 2 1 i: Int = 0 loopWhile: (cond: => Boolean)(f: =>
Unit)Unit
Mutable and Immutable Variables
// mutable variables
var counter:Int = 10
var d = 0.0
var f = 0.3f
// immutable variables
val msg = "Hello Scala"
println(msg)
s"Greeting: $msg"
val π = scala.math.Pi
println(π)
Hello Scala 3.141592653589793 counter: Int = 10 d: Double = 0.0 f: Float =
0.3 msg: String = Hello Scala π: Double = 3.141592653589793
String Interpolation
// string interpolation
val course = "Spark With Scala"
println(s"I am taking course $course.")
// support arbitrary expressions
println(s"2 + 2 = ${2 + 2}")
val year = 2017
println(s"Next year is ${year + 1}")
I am taking course Spark With Scala. 2 + 2 = 4 Next year is 2018 course:
String = Spark With Scala year: Int = 2017
Looping Constructs
// looping constructs
var i = 0
do {
println(s"Hello, world #$i")
i = i + 1
} while (i <= 5)
println()
for (j<- 1 to 5) {
println(s"Hello, world #$j")
}
println()
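Tuples
The code that produced the tuple output below is not included in this copy; a sketch consistent with that output would be:
// tuples (assumed reconstruction)
val pair1 = ("Scala", 1)
val pair2 = ("Scala", 1, 2017)
println(pair1._1)
println(pair1._2)
println(pair2._3)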
Scala 1 2017 pair1: (String, Int) = (Scala,1) pair2: (String, Int, Int) =
(Scala,1,2017)
Classes
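The class definitions used by the example below are not included in this copy; a minimal sketch consistent with the output (a Shape hierarchy and an area function) might be:
// assumed reconstruction of the missing definitions
abstract class Shape
case class Rectangle(width: Double, height: Double) extends Shape
case class Circle(radius: Double) extends Shape
def area(s: Shape): Double = s match {
  case Rectangle(w, h) => w * h
  case Circle(r) => 3.14 * r * r
}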
// class
println(area(Rectangle(4,5)))
println(area(Circle(5)))
20.0 78.5 defined class Shape defined class Rectangle defined class Circle
area: (s: Shape)Double
Arrays
// array
val myArray = Array(1,2,3,4,5);
myArray.foreach(a => print(a + " "))
println
myArray.foreach(println)
1 2 3 4 5 1 2 3 4 5 myArray: Array[Int] = Array(1, 2, 3, 4, 5)
Lists
// list
val l = List(1,2,3,4);
l.foreach(println)
println()
println(l.head) //==> 1
println(l.tail) //==> List(2,3,4)
println(l.last) //==> 4
println(l.init) //==> List(1,2,3)
println()
// appending
val n = l :+ 5
// challenge
//def removeDups(xs : List[int]) : List[Int] = xs match {
// todo
//}
n: List[Int] = List(1, 2, 3, 4) s: List[String] = List(LNKD, GOOG, AAPL)
sum: (xs: List[Int])Int dups: List[Int] = List(1, 2, 3, 4, 6, 3, 2, 7, 9,
4)
Our research group has a very strong focus on using and improving Apache Spark to solve real-world problems. In order to do this we need to have a very solid understanding of the capabilities of Spark. So one of the first things we have done is to go through the entire Spark RDD API and write examples to test their functionality. This has been a very useful exercise and we would like to share the examples with everyone.
These examples have only been tested for Spark version 1.4. We assume the
functionality of Spark is stable and therefore the examples should be valid for later
releases.
If you find any errors in the example we would love to hear about them so we can
fix them up. So please email us to let us know.
RDD is short for Resilient Distributed Dataset. RDDs are the workhorse of the
Spark system. As a user, one can consider a RDD as a handle for a collection of
individual data partitions, which are the result of some computation.
However, an RDD is actually more than that. On cluster installations, separate data
partitions can be on separate nodes. Using the RDD as a handle one can access all
partitions and perform computations and transformations using the contained data.
Whenever a part of a RDD or an entire RDD is lost, the system is able to
reconstruct the data of lost partitions by using lineage information. Lineage refers
to the sequence of transformations used to produce the current RDD. As a result,
Spark is able to recover automatically from most failures.
All RDDs available in Spark derive either directly or indirectly from the class
RDD. This class comes with a large set of methods that perform operations on the
data within the associated partitions. The class RDD is abstract. Whenever one
uses an RDD, one is actually using a concrete implementation of RDD. These
implementations have to overwrite some core functions to make the RDD behave
as expected.
One reason why Spark has lately become a very popular system for processing big
data is that it does not impose restrictions regarding what data can be stored within
RDD partitions. The RDD API already contains many useful operations. But,
because the creators of Spark had to keep the core API of RDDs common enough
to handle arbitrary data-types, many convenience functions are missing.
The basic RDD API considers each data item as a single value. However, users
often want to work with key-value pairs. Therefore Spark extended the interface of
RDD to provide additional functions (PairRDDFunctions), which explicitly work
on key-value pairs. Currently, there are four extensions to the RDD API available
in spark. They are as follows:
DoubleRDDFunctions
This extension contains many useful methods for aggregating numeric values.
They become available if the data items of an RDD are implicitly convertible to
the Scala data-type double.
PairRDDFunctions
Methods defined in this interface extension become available when the data items have a two-component tuple structure. Spark will interpret the first tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as the associated value.
OrderedRDDFunctions
Methods defined in this interface extension become available if the data items
are two-component tuples where the key is implicitly sortable.
SequenceFileRDDFunctions
This extension contains several methods that allow users to create Hadoop sequence files from RDDs. The data items must be two-component key-value tuples as required by the PairRDDFunctions. However, there are additional requirements considering the convertibility of the tuple components to Writable types.
Since Spark will make methods with extended functionality automatically available to users when the data items fulfil the above-described requirements, we decided to list all possible available functions in strictly alphabetical order. We will append one of the following to the function name to indicate it belongs to an extension that requires the data items to conform to a certain format or type.
[Double] - Double RDD Functions
[Ordered] - OrderedRDDFunctions
[Pair] - PairRDDFunctions
[SeqFile] - SequenceFileRDDFunctions
aggregate
The aggregate function allows the user to apply two different reduce functions to
the RDD. The first reduce function is applied within each partition to reduce the
data within each partition into a single result. The second reduce function is used to
combine the different reduced results of all partitions together to arrive at one final
result. The ability to have two separate reduce functions for intra partition versus
across partition reducing adds a lot of flexibility. For example the first reduce
function can be the max function and the second one can be the sum function. The
user also specifies an initial value. Here are some important facts.
The initial value is applied at both levels of reduce. So both at the intra
partition reduction and across partition reduction.
Both reduce functions have to be commutative and associative.
Do not assume any execution order for either partition computations or
combining partitions.
Why would one want to use two input data types? Let us assume we do an
archaeological site survey using a metal detector. While walking through the
site we take GPS coordinates of important findings based on the output of
the metal detector. Later, we intend to draw an image of a map that
highlights these locations using the aggregate function. In this case
the zeroValue could be an area map with no highlights. The possibly huge
set of input data is stored as GPS coordinates across many partitions. seqOp
(first reducer) could convert the GPS coordinates to map coordinates and
put a marker on the map at the respective position. combOp (second
reducer) will receive these highlights as partial maps and combine them
into a single final output map.
Listing Variants
Examples 1
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
// lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3],
[partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
val z = sc.parallelize(List("a","b","c","d","e","f"),2)
//lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c],
[partID:1, val: d], [partID:1, val: e], [partID:1, val: f])
z.aggregate("")(_ + _, _+_)
res115: String = abcdef
// See here how the initial value "x" is applied three times.
// - once for each partition
// - once when combining all the partitions in the second reduce function.
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc
// Below are some more advanced examples. Some are quite tricky to work out.
val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42
val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10
The main issue with the code above is that the result of the inner min is a string of
length 1.
The zero in the output is due to the empty string being the last string in the list. We
see this result because we are not recursively reducing any further within the
partition for the final string.
Examples 2
val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res144: String = 11
In contrast to the previous example, this example has the empty string at the beginning of the second partition. This results in a length of zero being input to the second reduce, which then upgrades it to a length of 1. (Warning: The above example shows bad design since the output is dependent on the order of the data inside the partitions.)
aggregateByKey [Pair]
Works like the aggregate function except the aggregation is applied to the values
with the same key. Also unlike the aggregate function the initial value is not
applied to the second reduce.
Listing Variants
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)
(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U,
combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒
U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Example
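The example itself appears to be missing here; a small illustration (a sketch, assuming a spark-shell style session with sc available) could be:
val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// max per key within each partition, then sum of the per-partition maxima across partitions
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
// Array((dog,12), (cat,17), (mouse,6)) - ordering may vary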
cartesian
Computes the cartesian product between two RDDs (i.e. Each item of the first
RDD is joined with each item of the second RDD) and returns them as a new
RDD. (Warning: Be careful when using this function! Memory consumption can
quickly become an issue!)
Listing Variants
val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8),
(2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8),
(4,9), (4,10), (5,9), (5,10))
checkpoint
Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are
stored as a binary file within the checkpoint directory which can be specified using
the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will
not occur until an action is invoked.)
Listing Variants
def checkpoint()
Example
sc.setCheckpointDir("my_directory_name")
val a = sc.parallelize(1 to 4)
a.checkpoint
a.count
14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15
...
14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to
memory (estimated size 115.7 KB, free 296.3 MB)
14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to
file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/
my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new
parent is RDD 12
res23: Long = 4
coalesce, repartition
Coalesces the associated data into a given number of
partitions. repartition(numPartitions) is simply an abbreviation
for coalesce(numPartitions, shuffle = true).
Listing Variants
Example
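A short illustration (a sketch, assuming sc is available):
val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
// res: Int = 2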
cogroup [Pair], groupWith [Pair]
A very powerful set of functions that allow grouping up to 3 key-value RDDs
together using their keys.
Listing Variants
Examples
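The examples appear to be missing in this copy; a minimal cogroup sketch (assuming sc) could be:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
// groups the values of both RDDs by key, e.g. key 1 -> ([b, b], [c, c]), key 2 -> ([b], [c]), key 3 -> ([b], [c])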
collect, toArray
Converts the RDD into a Scala array and returns it. If you provide a standard map-
function (i.e. f = T -> U) it will be applied before inserting the values into the result
array.
Listing Variants
def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]
Example
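A minimal illustration (a sketch, assuming sc):
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
// res: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)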
collectAsMap [Pair]
Similar to collect, but works on key-value RDDs and converts them into Scala
maps to preserve their key-value structure.
Listing Variants
Example
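A minimal illustration (a sketch, assuming sc):
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collectAsMap
// res: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3) - map ordering may vary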
combineByKey[Pair]
Very efficient implementation that combines the values of a RDD consisting of
two-component tuples by applying multiple aggregators one after another.
Listing Variants
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean
= true, serializerClass: String = null): RDD[(K, C)]
Example
val a =
sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String],
y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit,
salmon, bee, bear, wolf)))
compute
Executes dependencies and computes the actual representation of the RDD. This
function should not be called directly by users.
Listing Variants
context, sparkContext
Returns the SparkContext that was used to create the RDD.
Listing Variants
Example
count
Returns the number of items stored within a RDD.
Listing Variants
def count(): Long
Example
countApprox
Marked as experimental feature! Experimental features are currently not covered
by this document!
Listing Variants
def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
countApproxDistinct
Computes the approximate number of distinct values. For large RDDs which are
spread across many nodes, this function may execute faster than other counting
methods. The parameter relativeSD controls the accuracy of the computation.
Listing Variants
Example
val a = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224
b.countApproxDistinct(0.05)
res15: Long = 9750
b.countApproxDistinct(0.01)
res16: Long = 9947
b.countApproxDistinct(0.001)
res0: Long = 10000
countApproxDistinctByKey [Pair]
Similar to countApproxDistinct, but computes the approximate number of distinct
values for each distinct key. Hence, the RDD must consist of two-component
tuples. For large RDDs which are spread across many nodes, this function may
execute faster than other counting methods. The parameter relativeSD controls the
accuracy of the computation.
Listing Variants
Example
d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))
d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))
countByKey [Pair]
Very similar to count, but counts the values of a RDD consisting of two-
component tuples for each distinct key separately.
Listing Variants
Example
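A minimal illustration (a sketch, assuming sc):
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
// res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)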
countByKeyApprox [Pair]
Marked as experimental feature! Experimental features are currently not covered
by this document!
Listing Variants
countByValue
Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)
Listing Variants
def countByValue(): Map[T, Long]
Example
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4
-> 2, 7 -> 1)
countByValueApprox
Marked as experimental feature! Experimental features are currently not covered
by this document!
Listing Variants
dependencies
Returns the RDD on which this RDD depends.
Listing Variants
final def dependencies: Seq[Dependency[_]]
Example
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at
<console>:12
b.dependencies.length
Int = 0
b.cartesian(a).dependencies.length
res41: Int = 2
b.cartesian(a).dependencies
res42: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.rdd.CartesianRDD$
$anon$1@576ddaaa, org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)
distinct
Returns a new RDD that contains each unique value only once.
Listing Variants
Example
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2
a.distinct(3).partitions.length
res17: Int = 3
first
Looks for the very first data item of the RDD and returns it.
Listing Variants
def first(): T
Example
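A minimal illustration (a sketch, assuming sc):
val c = sc.parallelize(List("dog", "Cat", "Rat", "Dog"), 2)
c.first
// res: String = dog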
filter
Evaluates a boolean function for each data item of the RDD and puts the items for
which the function returned true into the resulting RDD.
Listing Variants
Example
When you provide a filter function, it must be able to handle all data items
contained in the RDD. Scala provides so-called partial functions to deal with
mixed data-types. (Tip: Partial functions are very useful if you have some data
which may be bad and you do not want to handle but for the good data (matching
data) you want to apply some kind of map function. The following article is good.
It teaches you about partial functions in a very nice way and explains why case has
to be used for partial functions: article)
val b = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)
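The mixed-type example that the next paragraph refers to does not appear in this copy; definitions consistent with the outputs shown below (an assumed reconstruction, not the original code) might be:
val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.filter(_ < 4).collect
// error: value < is not a member of Any

// a partial function that is only defined for Int items
val myfunc: PartialFunction[Any, Any] = { case x: Int => x }
a.collect(myfunc).collect
// res: Array[Any] = Array(2)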
This fails because some components of a are not implicitly comparable against
integers. Collect uses the isDefinedAt property of a function-object to determine
whether the test-function is compatible with each data item. Only data items that
pass this test (=filter) are then mapped using the function-object.
myfunc.isDefinedAt(1)
res22: Boolean = true
myfunc.isDefinedAt(1.5)
res23: Boolean = false
Be careful! The above code works because it only checks the type itself! If you use
operations on this type, you have to explicitly declare what type you want instead
of any. Otherwise the compiler does (apparently) not know what bytecode it should
produce:
val myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}
<console>:10: error: value < is not a member of Any
filterByRange [Ordered]
Returns an RDD containing only the items in the key range specified. From our
testing, it appears this only works if your data is in key value pairs and it has
already been sorted by key.
Listing Variants
Example
val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4,
"tv"), (1, "screen"), (5, "heater")), 3)
val sortedRDD = randRDD.sortByKey()
sortedRDD.filterByRange(1, 3).collect
res66: Array[(Int, String)] = Array((1,screen), (2,cat), (3,book))
filterWith (deprecated)
This is an extended version of filter. It takes two function arguments. The first
argument must conform to Int -> T and is executed once per partition. It will
transform the partition index to type T. The second function looks like (U, T) ->
Boolean. T is the transformed partition index and U are the data items from the
RDD. Finally the function has to return either true or false (i.e. Apply the filter).
Listing Variants
def filterWith[A: ClassTag](constructA: Int => A)(p: (T, A) => Boolean): RDD[T]
Example
val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) => b == 0).collect
res30: Array[Int] = Array(1, 2)
flatMap
Similar to map, but allows emitting more than one item in the map function.
Listing Variants
Example
// The program below generates a random number of copies (up to 10) of the items in the
list.
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
flatMapValues
Very similar to mapValues, but collapses the inherent structure of the values during
mapping.
Listing Variants
Example
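A minimal sketch (sample data chosen arbitrarily), keying a string RDD by word length and flattening the transformed values into their characters:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.flatMapValues("x" + _ + "x").collect
// flattens each decorated value into its characters, e.g. (3,x), (3,d), (3,o), (3,g), (3,x), (5,x), ...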
flatMapWith (deprecated)
Similar to flatMap, but allows accessing the partition index or a derivative of the
partition index from within the flatMap-function.
Listing Variants
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)
fold
Aggregates the values of each partition. The aggregation variable within each
partition is initialized with zeroValue.
Listing Variants
Example
val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6
foldByKey [Pair]
Very similar to fold, but performs the folding separately for each key of the RDD.
This function is only available if the RDD consists of two-component tuples.
Listing Variants
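Example
A minimal sketch (sample data chosen arbitrarily), folding all values that share a key:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.foldByKey("")(_ + _).collect
// concatenates all values with the same key, e.g. Array((3,dogcatowlgnuant))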
foreach
Executes a parameterless function for each data item.
Listing Variants
Example
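A minimal sketch (sample data chosen arbitrarily); note that println runs on the executors, so in local mode the output appears in the driver console:
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)
c.foreach(x => println(x + "s are yummy"))
// prints lines such as: lions are yummy, gnus are yummy, ...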
Listing Variants
Example
foreachWith (Deprecated)
Executes a parameterless function for each partition. Access to the data items
contained in the partition is provided via the iterator argument.
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3)
a.foreachWith(i => i)((x,i) => if (x % 2 == 1 && i % 2 == 0) println(x) )
1
3
7
9
fullOuterJoin [Pair]
Performs the full outer join between two paired RDDs.
Listing Variants
Example
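A minimal sketch with two small pair RDDs (sample data chosen arbitrarily):
val pairRDD1 = sc.parallelize(List(("cat", 2), ("cat", 5), ("book", 4), ("cat", 12)))
val pairRDD2 = sc.parallelize(List(("cat", 2), ("cup", 5), ("mouse", 4), ("cat", 12)))
pairRDD1.fullOuterJoin(pairRDD2).collect
// returns, e.g.: Array((book,(Some(4),None)), (mouse,(None,Some(4))), (cup,(None,Some(5))),
//   (cat,(Some(2),Some(2))), (cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))), ...)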
generator, setGenerator
Allows setting a string that is attached to the end of the RDD's name when printing
the dependency graph.
Listing Variants
getCheckpointFile
Returns the path to the checkpoint file, or None if the RDD has not been checkpointed yet.
Listing Variants
Example
sc.setCheckpointDir("/home/cloudera/Documents")
val a = sc.parallelize(1 to 500, 5)
val b = a++a++a++a++a
b.getCheckpointFile
res49: Option[String] = None
b.checkpoint
b.getCheckpointFile
res54: Option[String] = None
b.collect
b.getCheckpointFile
res57: Option[String] = Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-
d56580787b20/rdd-40)
preferredLocations
Returns the hosts which are preferred by this RDD. The actual preference of a
specific host depends on various assumptions.
Listing Variants
getStorageLevel
Retrieves the currently set storage level of the RDD. The example below shows the error you get when you try to reassign the storage level of an RDD that already has one.
Listing Variants
def getStorageLevel
Example
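The setup for this example appears to have been lost; a minimal sketch that reproduces the exception shown below, assuming an RDD a that already has a storage level assigned:
val a = sc.parallelize(1 to 100000, 2)
a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
a.getStorageLevel.description
// e.g.: Disk Serialized 1x Replicated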
a.cache
java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after
it was already assigned a level
glom
Assembles an array that contains all elements of the partition and embeds it in an
RDD. Each returned array contains the contents of one partition.
Listing Variants
Example
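A minimal sketch (sample data chosen arbitrarily):
val a = sc.parallelize(1 to 100, 3)
a.glom.collect
// returns one array per partition, e.g. Array(Array(1, ..., 33), Array(34, ..., 66), Array(67, ..., 100))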
groupBy
Groups the data items of the RDD according to the key returned by the supplied discriminator function.
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3)
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)),
(odd,ArrayBuffer(1, 3, 5, 7, 9)))
val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
a%2
}
a.groupBy(myfunc).collect
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7,
9)))
val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
a%2
}
a.groupBy(x => myfunc(x), 3).collect
a.groupBy(myfunc(_), 1).collect
res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7,
9)))
import org.apache.spark.Partitioner
class MyPartitioner extends Partitioner {
def numPartitions: Int = 2
def getPartition(key: Any): Int =
{
key match
{
case null => 0
case key: Int => key % numPartitions
case _ => key.hashCode % numPartitions
}
}
override def equals(other: Any): Boolean =
{
other match
{
case h: MyPartitioner => true
case _ => false
}
}
}
val a = sc.parallelize(1 to 9, 3)
val p = new MyPartitioner()
val b = a.groupBy((x:Int) => { x }, p)
val c = b.mapWith(i => i)((a, b) => (b, a))
c.collect
res42: Array[(Int, (Int, Seq[Int]))] = Array((0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))),
(0,(6,ArrayBuffer(6))), (0,(8,ArrayBuffer(8))), (1,(9,ArrayBuffer(9))), (1,
(3,ArrayBuffer(3))), (1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))), (1,
(5,ArrayBuffer(5))))
groupByKey [Pair]
Very similar to groupBy, but instead of supplying a function, the key-component
of each pair will automatically be presented to the partitioner.
Listing Variants
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
Example
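A minimal sketch (sample data chosen arbitrarily), keying a string RDD by word length:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
b.groupByKey.collect
// groups values by key, e.g. Array((4,...(lion)), (6,...(spider)), (3,...(dog, cat)), (5,...(tiger, eagle)))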
histogram [Double]
These functions take an RDD of doubles and create a histogram with either even
spacing (the number of buckets equals bucketCount) or arbitrary spacing based
on custom bucket boundaries supplied by the user via an array of double values.
The result types of the two variants differ slightly: the first function returns a
tuple consisting of two arrays. The first array contains the computed bucket
boundary values and the second array contains the corresponding count of
values (i.e. the histogram). The second variant of the function just returns the
histogram as an array of integers.
Listing Variants
val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(5)
res11: (Array[Double], Array[Long]) = (Array(1.1, 2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0,
0, 1, 4))
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.histogram(6)
res18: (Array[Double], Array[Long]) = (Array(1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0,
1, 1, 3, 4))
Example with custom spacing
val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(Array(0.0, 3.0, 8.0))
res14: Array[Long] = Array(5, 3)
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.histogram(Array(0.0, 5.0, 10.0))
res1: Array[Long] = Array(6, 9)
id
Retrieves the ID which has been assigned to the RDD by its SparkContext.
Listing Variants
Example
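A minimal sketch:
val y = sc.parallelize(1 to 10, 10)
y.id
// returns the integer ID assigned by the SparkContext, e.g. 19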
intersection
Returns the elements in the two RDDs which are the same.
Listing Variants
Example
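The setup for this example appears to have been lost; a minimal sketch consistent with the output below:
val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)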
z.collect
res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)
isCheckpointed
Indicates whether the RDD has been checkpointed. The flag is only raised once
the checkpoint has actually been created.
Listing Variants
Example
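The RDD c used below is not defined in this excerpt; a minimal sketch of a suitable definition:
val c = sc.parallelize(1 to 500, 5)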
sc.setCheckpointDir("/home/cloudera/Documents")
c.isCheckpointed
res6: Boolean = false
c.checkpoint
c.isCheckpointed
res8: Boolean = false
c.collect
c.isCheckpointed
res9: Boolean = true
iterator
Returns a compatible iterator object for a partition of this RDD. This function
should never be called directly.
Listing Variants
join [Pair]
Performs an inner join using two key-value RDDs. Please note that the keys must
be generally comparable to make this work.
Listing Variants
Example
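A minimal sketch (sample data chosen arbitrarily), joining two pair RDDs keyed by word length:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
// returns pairs that share a key, e.g. Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), ...)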
Listing Variants
Example
keys [Pair]
Extracts the keys from all contained tuples and returns them in a new RDD.
Listing Variants
Example
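A minimal sketch (sample data chosen arbitrarily):
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
// returns: Array(3, 5, 4, 3, 7, 5)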
Listing Variants
Example
lookup
Scans the RDD for all entries with the given key and returns their values
as a Scala sequence.
Listing Variants
Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)
res0: Seq[String] = WrappedArray(tiger, eagle)
map
Applies a transformation function on each item of the RDD and returns the result
as a new RDD.
Listing Variants
Example
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))
mapPartitions
This is a specialized map that is called only once for each partition. The entire
content of the respective partition is available as a sequential stream of values via
the input argument (Iterator[T]). The custom function must return yet
another Iterator[U]. The combined result iterators are automatically converted into
a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the
following result due to the partitioning we chose.
Listing Variants
def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning:
Boolean = false): RDD[U]
Example 1
val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
var res = List[(T, T)]()
var pre = iter.next
while (iter.hasNext)
{
val cur = iter.next;
res .::= (pre, cur)
pre = cur;
}
res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
Example 2
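The second example appears to have been lost from this excerpt; a minimal sketch that emits a random number of copies of every element per partition:
val x = sc.parallelize(1 to 10, 3)
def copyfunc(iter: Iterator[Int]): Iterator[Int] = {
  var res = List[Int]()
  while (iter.hasNext) {
    val cur = iter.next
    // append up to 9 copies of the current element
    res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
  }
  res.iterator
}
x.mapPartitions(copyfunc).collect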
mapPartitionsWithContext (developer API, deprecated)
Similar to mapPartitions, but allows accessing information about the processing state from within the mapper function.
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3)
import org.apache.spark.TaskContext
def myfunc(tc: TaskContext, iter: Iterator[Int]) : Iterator[Int] = {
tc.addOnCompleteCallback(() => println(
"Partition: " + tc.partitionId +
", AttemptID: " + tc.attemptId ))
iter.toList.filter(_ % 2 == 0).iterator
}
a.mapPartitionsWithContext(myfunc).collect
res0: Array[Int] = Array(2, 6, 4, 8)
mapPartitionsWithIndex
Similar to mapPartitions, but takes two parameters. The first parameter is the
index of the partition and the second is an iterator through all the items within this
partition. The output is an iterator containing the list of items after applying
whatever transformation the function encodes.
Listing Variants
def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
Example
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
iter.map(x => index + "," + x)
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)
mapPartitionsWithSplit
This method has been marked as deprecated in the API. So, you should not use this
method anymore. Deprecated methods will not be covered in this document.
Listing Variants
def mapPartitionsWithSplit[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
mapValues [Pair]
Takes the values of a RDD that consists of two-component tuples, and applies the
provided function to transform each value. Then, it forms new two-component
tuples using the key and the transformed value and stores them in a new RDD.
Listing Variants
Example
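A minimal sketch (sample data chosen arbitrarily):
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
// returns: Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))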
mapWith (deprecated)
This is an extended version of map. It takes two function arguments. The first
argument must conform to Int -> T and is executed once per partition. It will map
the partition index to some transformed partition index of type T. This is where it is
nice to do some kind of initialization code once per partition, such as creating a
random number generator object. The second function must conform to (U, T) ->
U, where T is the transformed partition index and U is a data item of the RDD. Finally
the function has to return a transformed data item of type U.
Listing Variants
val a = sc.parallelize(1 to 9, 3)
val b = a.mapWith("Index:" + _)((a, b) => ("Value:" + a, b))
b.collect
res0: Array[(String, String)] = Array((Value:1,Index:0), (Value:2,Index:0),
(Value:3,Index:0), (Value:4,Index:1), (Value:5,Index:1), (Value:6,Index:1),
(Value:7,Index:2), (Value:8,Index:2), (Value:9,Index:2))
max
Returns the largest element in the RDD.
Listing Variants
Example
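A minimal sketch:
val y = sc.parallelize(10 to 30)
y.max
// returns: 30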
mean [Double], meanApprox [Double]
Calls stats and extracts the mean component. The approximate version of the
function can finish somewhat faster in some scenarios. However, it trades accuracy
for speed.
Listing Variants
Example
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.mean
res0: Double = 5.3
min
Returns the smallest element in the RDD.
Listing Variants
Example
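A minimal sketch:
val y = sc.parallelize(10 to 30)
y.min
// returns: 10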
Listing Variants
Example
partitionBy [Pair]
Repartitions a key-value RDD using its keys. The partitioner implementation can
be supplied as the first argument.
Listing Variants
partitioner
Specifies a function pointer to the default partitioner that will be used
for groupBy, subtract, reduceByKey (from PairedRDDFunctions), etc. functions.
Listing Variants
partitions
Returns an array of the partition objects associated with this RDD.
Listing Variants
Example
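A minimal sketch (sample data chosen arbitrarily):
val b = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
b.partitions.length
// returns: 2  (one Partition object per partition)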
persist, cache
These functions can be used to adjust the storage level of a RDD. When freeing up
memory, Spark will use the storage level identifier to decide which partitions
should be kept. The parameterless variants persist() and cache() are just
abbreviations for persist(StorageLevel.MEMORY_ONLY). (Warning: Once the
storage level has been changed, it cannot be changed again!)
Listing Variants
Example
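A minimal sketch (sample data chosen arbitrarily); the storage level can be inspected with getStorageLevel:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel.useMemory
// returns: false
c.cache
c.getStorageLevel.useMemory
// returns: true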
pipe
Takes the RDD data of each partition and sends it via stdin to a shell-command.
The resulting output of the command is captured and returned as a RDD of string
values.
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res2: Array[String] = Array(1, 4, 7)
randomSplit
Randomly splits an RDD into multiple smaller RDDs according to a weights Array
which specifies the percentage of the total data elements that is assigned to each
smaller RDD. Note that the actual size of each smaller RDD is only approximately
equal to the percentages specified by the weights Array. The example
below shows that the number of items in each smaller RDD does not exactly match the
weights Array. An optional random seed can be specified. This function is useful
for splitting data into a training set and a testing set for machine learning.
Listing Variants
Example
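The setup for this example appears to have been lost; a minimal sketch consistent with the three outputs below (the exact split depends on the seed):
val y = sc.parallelize(1 to 10)
val splits = y.randomSplit(Array(0.1, 0.3, 0.6))
val rdd1 = splits(0)
val rdd2 = splits(1)
val rdd3 = splits(2)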
rdd1.collect
res87: Array[Int] = Array(4, 10)
rdd2.collect
res88: Array[Int] = Array(1, 3, 5, 8)
rdd3.collect
res91: Array[Int] = Array(2, 6, 7, 9)
reduce
This function provides the well-known reduce functionality in Spark. Please note
that any function f you provide should be commutative and associative in order to
generate reproducible results.
Listing Variants
Example
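A minimal sketch:
val a = sc.parallelize(1 to 100, 3)
a.reduce(_ + _)
// returns: 5050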
reduceByKey [Pair],
reduceByKeyLocally [Pair], reduceByKeyToDriver [Pair]
This function provides the well-known reduce functionality in Spark, performed
separately for each key of the RDD. Please note that any function f you provide
should be commutative and associative in order to generate reproducible results.
Listing Variants
Example
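A minimal sketch (sample data chosen arbitrarily), concatenating all values that share a key:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
// e.g. Array((3,dogcatowlgnuant))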
repartition
This function changes the number of partitions to the number specified by the
numPartitions parameter.
Listing Variants
Example
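A minimal sketch (sample data chosen arbitrarily):
val rdd = sc.parallelize(List(1, 2, 10, 4, 5, 2, 1, 1, 1), 3)
rdd.partitions.length
// returns: 3
val rdd2 = rdd.repartition(5)
rdd2.partitions.length
// returns: 5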
repartitionAndSortWithinPartitions [Ordered]
Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by their keys.
Listing Variants
Example
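The setup for this example appears to have been lost; a minimal sketch consistent with the partition labels shown below, assuming a range-partitioned pair RDD printed with a labelling helper:
val randRDD = sc.parallelize(List((2, "cat"), (6, "mouse"), (7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
val rPartitioner = new org.apache.spark.RangePartitioner(3, randRDD)
// range partitioning alone does not sort the records within a partition
val partitioned = randRDD.partitionBy(rPartitioner)
def labelfunc(index: Int, iter: Iterator[(Int, String)]): Iterator[String] =
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
partitioned.mapPartitionsWithIndex(labelfunc).collect
// repartitionAndSortWithinPartitions(rPartitioner) would additionally sort each partition by key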
res0: Array[String] = Array([partID:0, val: (2,cat)], [partID:0, val: (3,book)], [partID:0, val
(1,screen)], [partID:1, val: (4,tv)], [partID:1, val: (5,heater)], [partID:2, val: (6,mouse)],
[partID:2, val: (7,cup)])
Listing Variants
Example
sample
Randomly selects a fraction of the items of a RDD and returns them in a new
RDD.
Listing Variants
Example
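A minimal sketch:
val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1, 0).count
// returns roughly 1000, e.g. 960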
sampleByKey [Pair]
Randomly samples the key value pair RDD according to the fraction of each key
you want to appear in the final RDD.
Listing Variants
Example
val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7,
"tv"), (6, "screen"), (7, "heater")))
val sampleMap = List((7, 0.4), (6, 0.6)).toMap
randRDD.sampleByKey(false, sampleMap,42).collect
sampleByKeyExact [Pair, experimental]
This is labelled as experimental and so we do not document it.
Listing Variants
saveAsHadoopFile [Pair], saveAsHadoopDataset [Pair],
saveAsNewAPIHadoopFile [Pair]
Saves the RDD in a Hadoop compatible format using any Hadoop outputFormat
class the user specifies.
Listing Variants
saveAsObjectFile
Saves the RDD in binary format.
Listing Variants
Example
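A minimal sketch; the path is illustrative:
val x = sc.parallelize(1 to 100, 3)
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect.length
// returns: 100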
saveAsSequenceFile [SeqFile]
Saves the RDD as a Hadoop sequence file.
Listing Variants
def saveAsSequenceFile(path: String, codec: Option[Class[_ <:
CompressionCodec]] = None)
Example
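A minimal sketch (sample data and path chosen arbitrarily):
val v = sc.parallelize(Array(("owl", 3), ("gnu", 4), ("dog", 1), ("cat", 2), ("ant", 5)), 2)
v.saveAsSequenceFile("hd_seq_file")
// writes part-00000 and part-00001 sequence files under hd_seq_file/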
saveAsTextFile
Saves the RDD as text files, writing one element per line.
Listing Variants
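The first part of this example appears to have been lost; a minimal sketch consistent with the directory listing and the count of 10000 below:
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("mydata_a")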
// Produces 3 output files since we created the RDD a with 3 partitions
[cloudera@localhost ~]$ ll ~/Documents/spark-0.9.0-incubating-bin-cdh4/bin/mydata_a/
-rwxr-xr-x 1 cloudera cloudera 15558 Apr 3 21:11 part-00000
-rwxr-xr-x 1 cloudera cloudera 16665 Apr 3 21:11 part-00001
-rwxr-xr-x 1 cloudera cloudera 16671 Apr 3 21:11 part-00002
import org.apache.hadoop.io.compress.GzipCodec
a.saveAsTextFile("mydata_b", classOf[GzipCodec])
val x = sc.textFile("mydata_b")
x.count
res2: Long = 10000
val x = sc.parallelize(List(1,2,3,4,5,6,6,7,9,8,10,21), 3)
x.saveAsTextFile("hdfs://localhost:8020/user/cloudera/test");
val sp = sc.textFile("hdfs://localhost:8020/user/cloudera/sp_data")
sp.flatMap(_.split(" ")).saveAsTextFile("hdfs://localhost:8020/user/cloudera/sp_x")
stats [Double]
Simultaneously computes the mean, variance and the standard deviation of all
values in the RDD.
Listing Variants
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.stats
res16: org.apache.spark.util.StatCounter = (count: 9, mean: 11.266667, stdev: 8.126859)
sortBy
This function sorts the input RDD's data and stores it in a new RDD. The first
parameter requires you to specify a function which maps the input data into the
key that you want to sortBy. The second parameter (optional) specifies whether
you want the data to be sorted in ascending or descending order.
Listing Variants
Example
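A minimal sketch (sample data chosen arbitrarily):
val y = sc.parallelize(Array(5, 7, 1, 3, 2, 1))
y.sortBy(c => c, true).collect
// returns: Array(1, 1, 2, 3, 5, 7)
y.sortBy(c => c, false).collect
// returns: Array(7, 5, 3, 2, 1, 1)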
Listing Variants
Example
Listing Variants
Example
subtract
Performs the well known standard set subtraction operation: A - B
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.collect
res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)
subtractByKey [Pair]
Very similar to subtract, but instead of supplying a function, the key-component of
each pair will be automatically used as criterion for removing items from the first
RDD.
Listing Variants
Example
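A minimal sketch (sample data chosen arbitrarily), removing all pairs whose key also occurs in the second RDD:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)
b.subtractByKey(d).collect
// returns: Array((4,lion))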
sum [Double], sumApprox [Double]
Computes the sum of all values contained in the RDD. The approximate version of the function can finish somewhat faster in some scenarios. However, it trades accuracy for speed.
Listing Variants
Example
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.sum
res17: Double = 101.39999999999999
take
Extracts the first n items of the RDD and returns them as an array. (Note: This
sounds very easy, but it is actually quite a tricky problem for the implementors of
Spark because the items in question can be in many different partitions.)
Listing Variants
Example
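A minimal sketch (sample data chosen arbitrarily):
val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.take(2)
// returns: Array(dog, cat)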
Listing Variants
Example
takeSample
Behaves differently from sample in the following respects: it returns an exact number
of samples (specified by the second parameter), it returns an Array instead of an
RDD, and it internally randomizes the order of the returned items.
Listing Variants
Example
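A minimal sketch:
val x = sc.parallelize(1 to 1000, 3)
x.takeSample(true, 100, 1)
// returns an Array[Int] of exactly 100 randomly chosen elements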
toDebugString
Returns a string that contains debug information about the RDD and its
dependencies.
Listing Variants
Example
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res6: String =
MappedRDD[15] at subtract at <console>:16 (3 partitions)
SubtractedRDD[14] at subtract at <console>:16 (3 partitions)
MappedRDD[12] at subtract at <console>:16 (3 partitions)
ParallelCollectionRDD[10] at parallelize at <console>:12 (3 partitions)
MappedRDD[13] at subtract at <console>:16 (3 partitions)
ParallelCollectionRDD[11] at parallelize at <console>:12 (3 partitions)
toJavaRDD
Embeds this RDD object within a JavaRDD object and returns it.
Listing Variants
toLocalIterator
Converts the RDD into a Scala iterator at the master node.
Listing Variants
Example
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
val iter = z.toLocalIterator
iter.next
res51: Int = 1
iter.next
res52: Int = 2
top
Utilizes the implicit ordering of T to determine the top k values and returns
them as an array.
Listing Variants
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
Example
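A minimal sketch (sample data chosen arbitrarily):
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
// returns: Array(9, 8)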
toString
Assembles a human-readable textual description of the RDD.
Listing Variants
Example
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.toString
res61: String = ParallelCollectionRDD[80] at parallelize at <console>:21
val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"),
(6, "screen"), (7, "heater")))
val sortedRDD = randRDD.sortByKey()
sortedRDD.toString
res64: String = ShuffledRDD[88] at sortByKey at <console>:23
treeAggregate
Computes the same thing as aggregate, except it aggregates the elements of the
RDD in a multi-level tree pattern. Another difference is that it does not use the
initial value for the second reduce function (combOp). By default a tree of depth 2
is used, but this can be changed via the depth parameter.
Listing Variants
Example
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
// lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3],
[partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
z.treeAggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
// Note: unlike normal aggregate, treeAggregate does not apply the initial value in the
// second (combining) reduce
// This example returns 11 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 6 = 11
// note the final reduce does not include the initial value
z.treeAggregate(5)(math.max(_, _), _ + _)
res42: Int = 11
treeReduce
Works like reduce except reduces the elements of the RDD in a multi-level tree
pattern.
Listing Variants
Example
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.treeReduce(_+_)
res49: Int = 21
union, ++
Performs the standard set operation: A union B
Listing Variants
Example
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)
unpersist
Dematerializes the RDD (i.e. Erases all data items from hard-disk and memory).
However, the RDD object remains. If it is referenced in a computation, Spark will
regenerate it automatically using the stored dependency graph.
Listing Variants
Example
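A minimal sketch:
val y = sc.parallelize(1 to 10, 10)
y.cache
y.count            // materializes the RDD in memory
y.unpersist(true)
// the cached data is removed, but y can still be recomputed and used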
values
Extracts the values from all contained tuples and returns them in a new RDD.
Listing Variants
Example
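A minimal sketch (sample data chosen arbitrarily):
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
// returns: Array(dog, tiger, lion, cat, panther, eagle)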
variance [Double], sampleVariance [Double]
Calls stats and extracts either the variance component or the corrected sampleVariance
component.
Listing Variants
Example
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9,
5.5), 3)
a.variance
res70: Double = 10.605333333333332
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res14: Double = 66.04584444444443
x.sampleVariance
res13: Double = 74.30157499999999
zip
Joins two RDDs by pairing the i-th element of one RDD with the i-th element of the
other. The resulting RDD consists of two-component tuples which are interpreted as key-
value pairs by the methods provided by the PairRDDFunctions extension.
Listing Variants
Example
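A minimal sketch; both RDDs must have the same number of partitions and the same number of elements per partition:
val a = sc.parallelize(1 to 100, 3)
val b = sc.parallelize(101 to 200, 3)
a.zip(b).collect
// returns: Array((1,101), (2,102), (3,103), ...)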
zipPartitions
Similar to zip, but provides more control over the zipping process.
Listing Variants
val a = sc.parallelize(0 to 9, 3)
val b = sc.parallelize(10 to 19, 3)
val c = sc.parallelize(100 to 109, 3)
def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
{
var res = List[String]()
while (aiter.hasNext && biter.hasNext && citer.hasNext)
{
val x = aiter.next + " " + biter.next + " " + citer.next
res ::= x
}
res.iterator
}
a.zipPartitions(b, c)(myfunc).collect
res50: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9
19 109, 8 18 108, 7 17 107, 6 16 106)
zipWithIndex
Zips the elements of the RDD with their element indexes. The indexes start from 0. If
the RDD is spread across multiple partitions then a Spark job is started to perform
this operation.
Listing Variants
Example
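A minimal sketch (sample data chosen arbitrarily):
val z = sc.parallelize(Array("A", "B", "C", "D"))
z.zipWithIndex.collect
// returns: Array((A,0), (B,1), (C,2), (D,3))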
Listing Variants
Example