Professional Documents
Culture Documents
e
Azure DataBricks for Data
Engineering
Eugene Polonichko
Senior Software Developer at Eleks,
Data Platform MVP
https://
www.linkedin.com/in/
eugenepolonichko
/
About me
Eugene Polonichko has over 7 years of experience
with SQL Server. He mainly focused on BI projects
(SSAS, SSIS, PowerBI, Cognos, Informatica
PowerCenter, Pentaho, Tableau). Eugene is a
passionate speaker and SQL community volunteer
presenting regularly at PASS SQL Saturday events
and local user groups around Ukraine and Europe.
Eugene is PASS Chapter Leader and he has a status
MVP Data Platform
https://www.linkedin.com/in/eugenepolonichko/
https://twitter.com/EvgenPolonichko
Agenda
1. What is Azure Databricks?
• Azure Databricks
• Apache Spark
• Componets of
• Apache Spark
• Architecture of
2. Azure Databricks
• Azure integration
• Azure Databricks
• Cluster
Workspa
•
ce Notebooks
•
Visualizations
•
Jobs and Alerts
•
Databricks File
3.
System Business
•
Intelligence Tools
•
For data engineer
What is Azure Databricks?
Azure Databricks
Azure Databricks is an Apache Spark-
based analytics platform optimized for
the Microsoft Azure cloud services
platform. Designed with the founders of
Apache Spark, Databricks is integrated
with Azure to provide one-click setup,
streamlined workflows, and an interactive
workspace that enables collaboration
between data scientists, data engineers,
and business analysts.
Apache Spark-based analytics platform
Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities.
Spark in Azure Databricks includes the following components
Apache Spark-based analytics platform
• Spark SQL and DataFrames: Spark SQL is the Spark module for working with
structured data
• Streaming: Real-time data processing and analysis for analytical and
interactive applications. Integrates with HDFS, Flume, and Kafka.
• MLib: Machine Learning library consisting of common learning algorithms
and utilities, including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization
primitives.
• GraphX: Graphs and graph computation for a broad scope of use cases
from cognitive analytics to data exploration.
• Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
Architecture of Azure Databricks
Total Azure integration
• Diversity of VM types
• Security and Privacy
• Flexibility in network topology
• Azure Storage and Azure Data Lake integration
• Azure Power BI
• Azure Active Directory
• Azure SQL Data Warehouse, Azure SQL DB, and
Azure CosmosDB:
Azure Databricks
Clusters
Azure Databricks clusters provide a unified platform for various use cases such as running production ETL
pipelines, streaming analytics, ad-hoc analytics, and machine learning.
Job
Interactive
Workspace
The Workspace is the special root folder for all of
your organization’s Azure Databricks assets.
The Workspace stores:
• notebooks
• libraries
• dashboards
• folders
Notebooks
A notebook is a web-based interface to a document that
contains runnable code, visualizations, and narrative
text.
• Create a notebook
• Delete a notebook
• Control access to a notebook
• Notebook external formats
• Notebooks and clusters
• Schedule a notebook
• Distributing notebooks
Visualizations
Databricks supports a display(<dataframe-name>)