
Apache Spark on HDInsight

Contents

Lab 1 - Create a Spark cluster in HDInsight
  Overview
  Exercise 1 – Create an HDInsight Spark cluster
    Optional – Create a cluster with ARM Template
  Exercise 2: Explore the Azure Storage using Azure Storage Explorer
  Exercise 3: Connect with the cluster using SSH
  Exercise 4: Using the Python Shell
  Exercise 5: Create a Jupyter notebook
  Exercise 6: Clean up resources (whenever required)
Lab 2 - Load data and run queries with Spark SQL
  Overview
  Exercise 0 – Getting Started with Jupyter notebook
  Exercise 1 – Create a dataframe from a csv file
  Exercise 2: Run SQL queries on the dataframe
  Exercise 3: Clean up resources (whenever required)
Lab 3: Analyze Spark data using Power BI in HDInsight
  Prerequisites
  Exercise 1: Verify the data
  Exercise 2: Visualize the data
    Task 1: Create a report in Power BI Desktop
    Task 2: Publish the report to the Power BI Service (optional)
Conclusion

Copyright by Abhishek Kant, GTM Catalyst. Not for re-distribution.
Lab 1 - Create a Spark cluster in HDInsight
Overview
In this lab, you will learn how to create an Apache Spark cluster in Azure HDInsight, and how to
run Spark SQL queries against Hive tables. Apache Spark enables fast data analytics and cluster
computing using in-memory processing.

You will create the cluster through the Azure portal; as an optional alternative, you can use an Azure Resource Manager template. The cluster uses Azure Storage blobs as the cluster storage.

Exercise 1 – Create an HDInsight Spark cluster


1. Create a new HDInsight cluster by clicking “Create a resource” in the Azure
Portal. Under the Analytics section, click the “HDInsight” menu as shown below:

2. Enter a globally unique cluster name, select a subscription, and enter a strong cluster login
password:

Also, note the default Cluster login username and SSH username. We will use
these in future exercises.
3. Select “Spark” as the cluster type and Spark 2.2.0 as the version.
Note that Azure supports Spark only on Linux clusters.

4. Click on “Select” to close the blade, and move to the Storage section. Leave the defaults
as they are; this creates the underlying Azure storage account used to host the Hadoop cluster.

5. On the summary screen, we are going to change the default cluster size. Scroll
down on the “Cluster summary” blade and click the “Edit” hyperlink next to the
“Cluster size” section.

6. Click on “View all” to view all the available VM sizes. We have chosen D3v2 for this
lab exercise.

7. Reduce the number of worker nodes to 2 for this exercise so that your cluster size
blade looks like the following:

8. Click on “Select” and leave the networking settings at their defaults.


9. Azure verifies that all the settings are correct and then activates the “Create” button. Click
the Create button to create the cluster. It takes about 20 minutes to create the
HDInsight cluster.

Optional – Create a cluster with ARM Template:

1. Create an HDInsight Spark cluster using an Azure Resource Manager template. The
template can be found on GitHub.

2. Select the following link to open the template in the Azure portal in a new browser
tab:
Deploy to Azure
3. Enter the following values in the Azure quickstart template blade:

Subscription: Select the Azure subscription used for creating this cluster. The subscription
used for this quickstart is <Azure subscription name>.

Resource group: Create a resource group or select an existing one. A resource group is used
to manage Azure resources for your projects. The new resource group name used for this
quickstart is myspark20180403rg.

Location: Select a location for the resource group. The template uses this location for
creating the cluster as well as for the default cluster storage. Choose a datacentre
location nearest to you.

ClusterName: Enter a name for the HDInsight cluster that you want to create. The new
cluster name used for this quickstart is myspark20180403.

Cluster login name and password: The default login name is admin. Choose a password for
the cluster login. The login name used for this quickstart is admin.

SSH user name and password: Choose a password for the SSH user. The SSH user name used
for this quickstart is sshuser.

4. Select I agree to the terms and conditions stated above, select Pin to dashboard,
and then select Purchase. You see a new tile titled Deploying Template
deployment. It takes about 20 minutes to create the cluster. The cluster must be
created before you can proceed to the next exercise.

If you run into an issue with creating HDInsight clusters, it could be that you do not have the
right permissions to do so.

Exercise 2: Explore the Azure Storage using Azure Storage Explorer
1. Download & install the Azure Storage Explorer:

2. Connect to Azure Storage using the Microsoft account that you used for your Azure
Pass.

3. After signing in to your Azure account, view your subscription(s) and associated
storage by selecting the check box shown below, and click Apply.

4. Once signed in, your view should look like the figure, with the subscription(s) listed
as dropdowns. When you expand the dropdowns, you will see a list of all your
storage accounts, which further expand to Blobs, Files, Queues and Tables.

5. Under Blob Containers, choose the container name that you selected during the
cluster creation. The right panel shows different folders containing various
resources.

Exercise 3: Connect with the cluster using SSH
Connect to the cluster using SSH

1. In the Microsoft Azure Portal, in the Properties pane for the HDInsight Cluster
blade for your HDInsight cluster, click Secure Shell (SSH).

2. On the Secure Shell blade, under Windows users, copy the host name <your
name><date>-ssh.azurehdinsight.net to the clipboard.

3. Open a Windows command prompt.

4. In the Command Prompt window, type the following command, and then press Enter:

putty

5. In the PuTTY Configuration window, on the Session page, paste the host name
from step 2 into the Host Name box.

6. Under Connection type, select SSH, and then click Open.

7. If a security warning that the host certificate cannot be verified is displayed, click Yes
to continue.

8. When you are prompted, enter the SSH username (sshuser by default) and the password you specified when creating the cluster.

Use SSH to browse HDFS

1. In the SSH console window, type the following command to view the contents of the
root folder in the HDFS file system:

hdfs dfs -ls /

2. Type the following command to view the contents of the /example folder in the HDFS
file system. This folder contains subfolders for sample apps, data, and JAR
components:

hdfs dfs -ls /example

3. Type the following command to view the contents of the /example/data/gutenberg
folder, which contains sample text files:

hdfs dfs -ls /example/data/gutenberg

4. Type the following command on one line to view the text in the ulysses.txt file:

hdfs dfs -text /example/data/gutenberg/ulysses.txt

5. Note that the file contains large volumes of unstructured text.

Exercise 4: Using the Python Shell


1. In the SSH console window, launch the pyspark shell by typing the following
command, and then press Enter:

pyspark

2. Note the Spark log output, ending at a >>> prompt.

3. The pyspark shell gives you the Spark context (sc) by default.

4. You will now create an RDD for a text file.

5. Type the following command, and then press Enter:

txtRdd = sc.textFile('/example/data/gutenberg/ulysses.txt')

6. Now call an action method on the RDD to count the number of
records in the RDD, using the following function:

txtRdd.count()

7. You can also apply simple transformations, such as filtering for lines that contain the
word 'good', by calling a transformation API such as filter before you call
the action API count():

txtRdd = txtRdd.filter(lambda x: 'good' in x)

txtRdd.count()
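Spark itself needs a cluster, but the lazy-transformation-then-action pattern above can be rehearsed in plain Python (the sample lines here are invented, not from ulysses.txt):

```python
# Local, plain-Python rehearsal of the lazy filter-then-count pattern.
# 'lines' stands in for the RDD; the generator stays lazy, like filter().
lines = ["a good day", "nothing here", "good good news", "bad"]

filtered = (x for x in lines if 'good' in x)  # lazy, like txtRdd.filter(...)
count = sum(1 for _ in filtered)              # forces evaluation, like count()

print(count)  # → 2
```

As with Spark, nothing is computed when `filtered` is defined; the work happens only when the count consumes it.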

Exercise 5: Create a Jupyter notebook


Jupyter Notebook is an interactive notebook environment that supports various
programming languages. The notebook allows you to interact with your data, combine code
with markdown text and perform simple visualizations.

1. Open the Azure portal.
2. Select HDInsight clusters, and then select the cluster you created.

3. From the portal, select Cluster dashboards, and then select Jupyter Notebook. If
prompted, enter the cluster login credentials for the cluster.

4. Select New > PySpark to create a notebook.

A new notebook is created and opened with the name Untitled (Untitled.ipynb).

Exercise 6: Clean up resources (whenever required)


HDInsight saves your data in Azure Storage or Azure Data Lake Store, so you can safely
delete a cluster when it is not in use. You are also charged for an HDInsight cluster, even
when it is not in use. Since the charges for the cluster are many times more than the charges
for storage, it makes economic sense to delete clusters when they are not in use. If you plan
to work on the next exercise immediately, you might want to keep the cluster.

To delete the cluster, switch back to the Azure portal, and select Delete.

You can also select the resource group name to open the resource group page, and then
select Delete resource group. By deleting the resource group, you delete both the
HDInsight Spark cluster, and the default storage account.

Lab 2 - Load data and run queries with Spark SQL
Overview
In this lab, you learn how to create a dataframe from a csv file, and how to run interactive
Spark SQL queries against an Apache Spark cluster in Azure HDInsight. In Spark, a dataframe is a
distributed collection of data organized into named columns. A dataframe is conceptually
equivalent to a table in a relational database or a data frame in R/Python.

In this tutorial, you learn how to:

• Create a dataframe from a csv file


• Run queries on the dataframe

Make sure you have access to an Azure Subscription.

Exercise 0 – Getting Started with Jupyter notebook


SQL (Structured Query Language) is the most common and widely used language for
querying and defining data. Spark SQL functions as an extension to Apache Spark for
processing structured data, using the familiar SQL syntax.

1. Verify that the kernel is ready. The kernel is ready when you see a hollow circle next to the
kernel name in the notebook. A solid circle denotes that the kernel is busy.

When you start the notebook for the first time, the kernel performs some tasks in the
background. Wait for the kernel to be ready.

2. Paste the following code in an empty cell, and then press SHIFT + ENTER to run the
code. The command lists the Hive tables on the cluster:

%%sql
SHOW TABLES
3. When you use a Jupyter Notebook with your HDInsight Spark cluster, you get a
preset sqlContext that you can use to run Hive queries using Spark SQL. %%sql tells
Jupyter Notebook to use this preset sqlContext to run the Hive query. It takes
about 30 seconds to get the results. The output looks like:

Every time you run a query in Jupyter, your web browser window title shows a (Busy) status
along with the notebook title. You also see a solid circle next to the PySpark text in the top-
right corner.

Run another query to retrieve the top 10 rows from hivesampletable, a sample Hive table that comes with all HDInsight clusters by default.

%%sql
SELECT * FROM hivesampletable LIMIT 10

The screen refreshes to show the query output.
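The %%sql magic simply hands the cell to the preset SQL context. The SELECT ... LIMIT pattern itself is plain SQL and can be tried against any engine; here is a local sqlite3 sketch using an invented stand-in table (the columns and rows are made up, not the real hivesampletable schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny invented stand-in for hivesampletable
conn.execute("CREATE TABLE hivesampletable (clientid TEXT, devicemake TEXT)")
conn.executemany(
    "INSERT INTO hivesampletable VALUES (?, ?)",
    [(str(i), make) for i, make in enumerate(["Samsung", "HTC", "Nokia"])],
)

# LIMIT caps the number of rows returned, just as in the notebook query
rows = conn.execute("SELECT * FROM hivesampletable LIMIT 2").fetchall()
print(rows)  # → [('0', 'Samsung'), ('1', 'HTC')]
```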

4. From the File menu on the notebook, select Close and Halt. Shutting down the
notebook releases the cluster resources.

Exercise 1 – Create a dataframe from a csv file


Applications can create dataframes from an existing Resilient Distributed Dataset (RDD),
from a Hive table, or from data sources using the SQLContext object. The following
screenshot shows a snapshot of the HVAC.csv file used in this tutorial. The csv file comes
with all HDInsight Spark clusters. The data captures the temperature variations of some
buildings.

1. Open the Jupyter notebook that you created in the previous lab
2. Paste the following code in an empty cell of the notebook, and then press SHIFT + ENTER to
run the code. The code imports the types required for this scenario:

from pyspark.sql import *
from pyspark.sql.types import *

When running an interactive query in Jupyter, the web browser window or tab caption shows
a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark
text in the top-right corner. After the job is completed, it changes to a hollow circle.

3. Run the following code to create a dataframe from the CSV file and save it as a table (hvac):

# Create a dataframe from the sample CSV data
csvFile = spark.read.csv('wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv',
                         header=True, inferSchema=True)
csvFile.write.saveAsTable("hvac")

NOTE By using the PySpark kernel to create a notebook, the SQL contexts are
automatically created for you when you run the first code cell. You do not need to
explicitly create any contexts.

Exercise 2: Run SQL queries on the dataframe


Once the table is created, you can run an interactive query on the data.

1. Run the following code in an empty cell of the notebook:

%%sql
SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = "6/1/13"

Because the PySpark kernel is used in the notebook, you can now directly run an
interactive SQL query on the hvac table you just created.
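To see what the query computes: for each row matching the date filter, it emits buildingID, targettemp minus actualtemp, and the date. A plain-Python sketch over invented rows:

```python
# Invented rows: (buildingID, targettemp, actualtemp, date)
rows = [
    (4, 66, 58, "6/1/13"),
    (3, 69, 68, "6/1/13"),
    (7, 70, 70, "6/2/13"),
]

# Mirrors: SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date
#          FROM hvac WHERE date = "6/1/13"
result = [
    (b, target - actual, d)
    for (b, target, actual, d) in rows
    if d == "6/1/13"
]
print(result)  # → [(4, 8, '6/1/13'), (3, 1, '6/1/13')]
```

The row for 6/2/13 is dropped by the WHERE clause, and temp_diff is computed per surviving row.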

The following tabular output is displayed:

2. You can also view the results in other visualizations. To see an area graph of
the same output, select Area and set the other values as shown.

3. From the File menu on the notebook, select Save and Checkpoint.

4. If you're starting the next lab now, leave the notebook open. If not, shut down the
notebook to release the cluster resources: from the File menu on the notebook,
select Close and Halt.

Exercise 3: Clean up resources (whenever required)

HDInsight saves your data in Azure Storage or Azure Data Lake Store, so you can safely
delete a cluster when it is not in use. You are also charged for an HDInsight cluster, even
when it is not in use. Since the charges for the cluster are many times more than the charges
for storage, it makes economic sense to delete clusters when they are not in use. If you plan
to work on the next exercise immediately, you might want to keep the cluster.

To delete the cluster, switch back to the Azure portal, and select Delete.

You can also select the resource group name to open the resource group page, and then
select Delete resource group. By deleting the resource group, you delete both the
HDInsight Spark cluster, and the default storage account.

Lab 3: Analyze Spark data using Power BI in
HDInsight

Learn how to use Microsoft Power BI to visualize data in an Apache Spark cluster in Azure
HDInsight.

In this lab, you will learn how to:

• Visualize Spark data using Power BI

Make sure you have access to an Azure subscription where you have created the resources
as instructed in Lab 1 and Lab 2 and have created the Data Science Virtual Machine as per
Exercise 2 of Lab 0.

Prerequisites
Power BI: Power BI Desktop and Power BI trial subscription (optional).

Exercise 1: Verify the data


The Jupyter notebook that you created in the previous lab includes code to create an hvac
table. This table is based on the CSV file available on all HDInsight Spark clusters at
/HdiSamples/HdiSamples/SensorSampleData/hvac/hvac.csv. Use the following
procedure to verify the data.

1. From the Jupyter notebook, paste the following code, and then press SHIFT + ENTER.
The code verifies the existence of the tables.

%%sql

SHOW TABLES

The output looks like:

If you closed the notebook before starting this lab, any temporary tables were cleaned up, so
they are not included in the output. Only Hive tables that are stored in the metastore
(indicated by False under the isTemporary column) can be accessed from the BI tools. In this
lab, you connect to the hvac table that you created.

2. Paste the following code in an empty cell, and then press SHIFT + ENTER. The code
verifies the data in the table.

%%sql
SELECT * FROM hvac LIMIT 10
The output looks like:

3. From the File menu on the notebook, click Close and Halt. Shutting down the notebook
releases the resources.

Exercise 2: Visualize the data


In this section, you use Power BI to create visualizations, reports, and dashboards from the
Spark cluster data.

Task 1: Create a report in Power BI Desktop

The first steps in working with Spark are to connect to the cluster in Power BI Desktop, load
data from the cluster, and create a basic visualization based on that data.

NOTE The connector demonstrated in this lab is currently in preview. Provide any
feedback you have through the Power BI Community site or Power BI Ideas.

1. Connect to Data Science Virtual Machine created in Exercise 2 of Lab 0.

2. Open Power BI Desktop.

3. From the Home tab, click Get Data, then More.

4. Enter Spark in the search box, select Azure HDInsight Spark (Beta), and then click
Connect.

5. Enter your cluster URL (in the form mysparkcluster.azurehdinsight.net), select
DirectQuery, and then click OK.

You can use either data connectivity mode with Spark. If you use DirectQuery,
changes are reflected in reports without refreshing the entire dataset. If you import
data, you must refresh the data set to see changes. For more information on how and
when to use DirectQuery, see Using DirectQuery in Power BI.

6. Enter the HDInsight login account information, then click Connect. The default
account name is admin. If you changed the credentials while creating the HDInsight
cluster, enter the changed credentials here.
7. Select the hvac table, wait to see a preview of the data, and then click Load.

Power BI Desktop has the information it needs to connect to the Spark cluster and load data
from the hvac table. The table and its columns are displayed in the Fields pane. See the
following screenshot:

8. Visualize the variance between target temperature and actual temperature for each
building:
I. In the VISUALIZATIONS pane, select Area Chart.
II. Drag the BuildingID field to Axis, and drag the ActualTemp and
TargetTemp fields to Value.

The diagram looks like:

By default, the visualization shows the sum of ActualTemp and TargetTemp. If you click
the down arrow next to ActualTemp and TargetTemp in the Visualizations pane,
you can see that Sum is selected.

III. Click the down arrows next to ActualTemp and TargetTemp in the
Visualizations pane, and select Average to get an average of the actual and target
temperatures for each building.

Your data visualization should be similar to the one in the screenshot. Move your cursor
over the visualization to get tooltips with relevant data.
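Switching from Sum to Average in step III simply changes the aggregate Power BI applies per axis bucket (here, per BuildingID). A plain-Python sketch with invented readings:

```python
from collections import defaultdict

# Invented (BuildingID, ActualTemp) readings
readings = [(1, 58), (1, 60), (2, 70), (2, 74)]

# Group the readings per building, like the chart's Axis bucketing
by_building = defaultdict(list)
for building, temp in readings:
    by_building[building].append(temp)

sums = {b: sum(t) for b, t in by_building.items()}           # the default Sum
avgs = {b: sum(t) / len(t) for b, t in by_building.items()}  # after selecting Average

print(sums)  # → {1: 118, 2: 144}
print(avgs)  # → {1: 59.0, 2: 72.0}
```

Sums of raw temperatures are rarely meaningful, which is why the lab switches the aggregate to Average.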

9. Click File then Save, and enter the name BuildingTemperature.pbix for the file.

Task 2: Publish the report to the Power BI Service (optional)

The Power BI service allows you to share reports and dashboards across your organization. In
this section, you first publish the dataset and the report. Then, you pin the report to a
dashboard. Dashboards are typically used to focus on a subset of data in a report; you have
only one visualization in your report, but it's still useful to go through the steps.

1. Open Power BI Desktop.

2. From the Home tab, click Publish.

3. Select a workspace to publish your dataset and report to, then click Select. In the
following image, the default My Workspace is selected.

4. After publishing succeeds, click Open 'BuildingTemperature.pbix' in Power
BI.

5. In the Power BI service, click Enter credentials.

6. Click Edit credentials.

7. Enter the HDInsight login account information, and then click Sign in. The default
account name is admin.

8. In the left pane, go to Workspaces > My Workspace > REPORTS, then click
BuildingTemperature.

You should also see BuildingTemperature listed under DATASETS in the left pane.

The visual you created in Power BI Desktop is now available in the Power BI service.

9. Hover your cursor over the visualization, and then click the pin icon on the upper
right corner.

10. Select "New dashboard", enter the name Building temperature, then click Pin.

11. In the report, click Go to dashboard.

Your visual is pinned to the dashboard - you can add other visuals to the report and pin
them to the same dashboard.

Conclusion
In this tutorial, you learned how to:

• Visualize Spark data using Power BI.
