
Data Observability Fundamentals

Concepts and methods


Who is with you today?

Founder / CPO @ Kensu
Product Manager @ Kensu


POLL: Who are you?
a. Data analyst
b. Data engineer
c. Data/ML ops
d. Data scientist
e. Software engineer
f. Team/Project Leader
g. Other, please share in the group chat
POLL: What is your initial driver for following a DO training?

a. Data Quality/Data Governance


b. Data Monitoring/DataOps
c. Curiosity/Learning
d. Upskill/Personal motivation
POLL: How would you rate your level of knowledge of data observability?

a. 0 - Never heard of it
b. 1-4 - I see broadly what it is
c. 5-6 - I am considering using it in my projects
d. 7-8 - I already tried to implement a data observability approach
e. 9-10 - I already successfully implemented data observability
Data Observability Fundamentals
101
Beyond Data Monitoring - Automation & Observability

What:
● How is data produced?
● How is data consumed?
● What does the data look like, and
○ Does it match my assumptions (volume, structure, freshness, …)?
○ Does it match others’ expectations (*)?

Why (examples):
● Address requests for information faster → Create trust
● Reduce the effort to detect and resolve issues → Recover trust
● Prevent issues, at least the known ones → Maintain trust
How: “By Design” - Turn your Data Platform into a Data Observable System
Data Observable Application

[Diagram: a Data Application running in its Execution Environment, reading several inputs and writing an output]

Generate Data Observations Contextually

- Application Job (Spark job name)
- Code location and version (Git)
- Environment (PROD) and time (`now`)
Data Observable Application

[Diagram: a Data Application running in its Execution Environment, reading several inputs and writing an output]

Generate Data Observations Synchronously

- Data source Metadata: Location, Schema
- Compute metrics: Size, Null, Min, Max, Cardinality, …
- And more: Custom measures (Skew, Correlation, …)
Data Observable Application

[Diagram: a Data Application running in its Execution Environment, reading several inputs and writing an output]

Generate Continuously Validated Data Observations

- Size > 10K
- No Nulls for Address
- No skewed categories
Data Observable Principles

[Diagram: a Framework/Application plus a DO Agent generates Data Observations, turning it into a Data Observable Framework/Application]

Generate Data Observations Contextually
● Application Job (Spark job name)
● Code location and version (Git)
● Environment (PROD) and time (`now`)

Generate Data Observations Synchronously
● Data source Metadata: Location, Schema
● Compute metrics: Size, Null, Min, Max, Cardinality, …
● And more: Custom measures (Skew, Correlation, …)

Generate Continuously Validated Data Observations
● Size > 10K
● No Null Addresses
● No skewed categories
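
To make these three principles concrete, here is a purely illustrative sketch (field names and values are assumptions, not Kensu's actual payload format) of what a single data observation could carry:

```python
# Hypothetical sketch of a single data observation; field names are illustrative only,
# not the actual payload schema of a data observability platform such as Kensu.
from datetime import datetime, timezone

observation = {
    # Contextual: who produced the data, from which code, where, and when
    "context": {
        "application": "green_taxi_kpi_job",                     # e.g. the Spark job name
        "code": {"repository": "git@example.org:pipelines.git",  # assumed repository
                 "commit": "abc1234"},
        "environment": "PROD",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    },
    # Synchronous: metadata and metrics computed while the data is being processed
    "data_source": {
        "location": "/volume/green_ingestion.csv",
        "schema": {"vendor_id": "string", "total_amount": "double"},
        "metrics": {"size": 12000, "total_amount.nulls": 0, "vendor_id.cardinality": 3},
    },
    # Continuously validated: expectations checked against the metrics above
    "validations": [
        {"rule": "size > 10000", "passed": True},
        {"rule": "address.nulls == 0", "passed": True},
        {"rule": "no skewed categories", "passed": True},
    ],
}
```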
Data Observable Pipeline/System

[Diagram: several data observable applications chained into a pipeline, all sending their data observations to a central component]

What is this thing? The Data Observability Platform.
Data Observability (in a Data Platform)

[Diagram: a Data Platform made data observable end to end]
● Data Observable Ingestion Tools (Airbyte)
● Data Observable Transformation Tools (dbt, Spark)
● Data Observable Serving Tools (BigQuery)
● Data Observable Analytics Tools (Scikit-Learn)

Supporting layers: Infrastructure, Orchestration Platform, Data Observability Platform, Data Catalog
Data Observability Fundamentals
Environment + Setup
Environment

Today’s session focuses on PySpark and pandas

Please clone the following repository:


https://github.com/Fundamentals-of-Data-Observability/handson
Environment

Prerequisites: Docker

Run the following commands in your terminal:

./install.sh
cd python_environment
docker run -it --rm -v $(pwd)/volume:/volume mypython /bin/bash


Data we will use

During this course, we will use a modified version of the NYC taxi trip dataset
(see https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

Starting from subsets of the data, we will:

● Read & join data with Python pandas and BigQuery
● Transform data and compute KPIs with Spark
● Compute and summarize KPIs with dbt
● Ingest data with Airbyte
● Orchestrate a pipeline with Airflow
Tools we will use

This week, we focus on Spark and Python:

● Everything runs in Docker
● To follow the exercises, you will need:
○ A Kensu Community (free) account
○ Optionally (this week): A (free) Google Cloud account with a service account

Next week, we focus on Airbyte, dbt, and Airflow

● To follow the exercises, you must have:


○ A Kensu Community account
○ A Google Cloud account with a service account (cf instructions in the code repository)
Data Observability Fundamentals
Spark and Databricks
Goals of the session

● Set up the environment for the course


● Make your first Spark integration
● Solve an issue thanks to data observability
● See how it can integrate with Databricks
Spark Job

What is the job doing?

- Transform the taxi_driver dataset for green taxis


- Compute a KPI on the money paid to the vendor

Structure of the job:

[Pipeline: green_ingestion.csv → green_ingestion_KPI → company_bill]
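
As an illustration of such a job (a minimal sketch only: the paths, the column names vendor_id and total_amount, and the exact aggregation are assumptions; the spark_example scripts in the repository are the reference), the transformation and KPI could look like:

```python
# Illustrative sketch only: column names (vendor_id, total_amount) and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("green_taxi_kpi").getOrCreate()

# Read the ingested green taxi trips
green = spark.read.option("header", True).csv("/volume/green_ingestion.csv")

# Transform: keep the columns needed for the KPI and cast amounts to numbers
trips = green.select("vendor_id", F.col("total_amount").cast("double"))

# Compute a KPI: money paid to each vendor
company_bill = trips.groupBy("vendor_id").agg(F.sum("total_amount").alias("amount_paid"))

company_bill.write.mode("overwrite").csv("/volume/company_bill")
```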
Spark Job: Code deep dive

Code walkthrough
How to make a Spark Job Data Observable? (1/3)

Framework: Spark
Foundation: DAG (RDDs and Deps)
[Diagram: a Spark application in its Execution Environment, reading several inputs and writing an output]

Generate Data Observations Contextually

- Application Job (Spark job name)
- Code location and version (Git)
- Environment (PROD) and time (`now`)
- Lineage (thanks GAD… ooops, DAG)
How to make a Spark Job Data Observable? (2/3)

Framework: Spark
Foundation: DAG (RDDs and Deps)
[Diagram: a Spark application in its Execution Environment, reading several inputs and writing an output]

Generate Data Observations Synchronously

- Data source Metadata: Location, Schema
- Compute metrics: Size, Null, Min, Max, Cardinality, …
- And more: Custom measures (Skew, Correlation, …)
How to make a Spark Job Data Observable? (3/3)

Framework: Spark
Foundation: DAG (RDDs and Deps)
[Diagram: a Spark application in its Execution Environment, reading several inputs and writing an output]

Generate Continuously Validated Data Observations

- Size > 10K
- No Nulls for Address
- No skewed categories
Getting started with the Spark DO Agent

In pyspark, you need to:

● Add the agent (a JAR file) to the Spark Session (driver) to generate the
observations and send them to the Kensu API
● Call the init function, which configures the agent’s behavior, context,
and API of the DO platform:
○ DO Platform API - URL and JWT Token
○ Project Name, Environment, Application
○ …

Note: All the parameters can also be added to a conf.ini file
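
A rough sketch of those two steps (the JAR path, the Spark option, and the init call shown here are assumptions, not necessarily the exact Kensu API; see the hands-on repository for the real setup):

```python
# Illustrative sketch only: the JAR path and the agent's init call are assumptions,
# not the exact Kensu API; refer to the hands-on repository for the real configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_example")
    # 1. Ship the DO agent JAR with the session so it can generate and send observations
    .config("spark.jars", "/volume/kensu-spark-agent.jar")
    .getOrCreate()
)

# 2. Configure the agent: DO platform API and the context of this run.
#    Shown schematically; the exact init function is provided by the agent,
#    and all of these parameters can also live in a conf.ini file.
agent_conf = {
    "api_url": "https://<your-kensu-instance>/api",   # DO Platform API - URL
    "api_token": "<JWT token>",                        # DO Platform API - token
    "project_name": "Data Observability Fundamentals",
    "environment": "PROD",
    "application": "spark_example",
}
# init(spark, **agent_conf)   # hypothetical call name; see the repository for the real one
```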


Hands-on

Add and configure the agent in the job


A first issue detected!

Now, let’s run 2 different versions of the code:

- spark_example_v1.py
- spark_example_v2.py
A first issue detected!

Here is the outcome of the company_bill dataset for both versions:

[Screenshots: company_bill output for V1 and V2]
A first issue detected!

This mistake did not generate any error in the code!

Nevertheless, we obviously see that the result is wrong… 😱

Let’s see in the data observability platform how this can be detected.
Databricks showcase

Same method as for Spark:

install the JAR and the kensu-py package


Data Observability Fundamentals
Pandas and BigQuery
Goals of the session

● Set up the environment for the course


● Turn pandas scripts data observable
● Understand the underlying concepts of the Python agent
● See how the same concepts apply to the BigQuery client
Pandas application

What is the script doing?

- Ingest 2 datasets into BigQuery


- Join them in a QueryJob

Structure of the job:

[Pipeline: green_ingestion.csv → green_ingestion BigQuery table; yellow_ingestion.csv → yellow_ingestion BigQuery table; both joined in a QueryJob]
Pandas application

An example with a simpler script:

Copy from CSV to CSV

Code walkthrough
Python DO agent: How does it work?

Monkey patch: the Python agent augments the native Python modules so that they also
generate and report data observations.
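
To illustrate the idea with a deliberately simplified sketch (this is not the kensu-py implementation), a monkey patch wraps a native function so that every call also produces an observation:

```python
# Simplified illustration of monkey patching; kensu-py does something far more complete.
import pandas

_original_read_csv = pandas.read_csv  # keep a reference to the native function

def observed_read_csv(path, *args, **kwargs):
    df = _original_read_csv(path, *args, **kwargs)
    # Generate a data observation as a side effect of the normal call
    print(f"[observation] read {path}: {len(df)} rows, schema={list(df.columns)}")
    return df

pandas.read_csv = observed_read_csv  # the rest of the script keeps calling pd.read_csv as usual
```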
Monkey patch in the script
Monkey patch in the script

Initialization of the client


Collection of the execution context
Monkey patch in the script

Monkey patch of the pandas module


Monkey patch in the script

Collection of the data source metadata


Creation of the lineage
Computation of the metrics
Monkey patch in the script

Different levels of abstraction can be implemented:

● Level 1: Manual reporting of all the observations
● Level 2: Best effort to activate data observability
● Level 3: Full monkey patch

[Pyramid diagram, with axes “More time consuming to implement” and “More complex to develop”]
Monkey patch in the script

To sum up, turning a Python script Data Observable requires you to:

1. Install the agent (kensu-py) in the environment


2. Configure the client with:
a. Kensu API URL and Token
b. Project Name, Environment, Application
c. …
3. Modify the imports to enable the monkey patches

import kensu.pandas as pd

import kensu.numpy as np

import kensu.json as json
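
Put together, a monkey-patched script barely differs from the original pandas code. A minimal sketch (file and column names are placeholders, and the client configuration is assumed to come from conf.ini):

```python
# Minimal sketch: only the import changes, the pandas API stays the same.
# Assumes kensu-py is installed and the client is configured (e.g. via conf.ini).
import kensu.pandas as pd

df = pd.read_csv("green_ingestion.csv")          # observed: data source metadata and metrics
kpi = df.groupby("vendor_id", as_index=False)["total_amount"].sum()
kpi.to_csv("company_bill.csv", index=False)      # observed: lineage from input to output
```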


BigQuery module

The BigQuery client is used to manipulate data in the BigQuery database.

By augmenting this client, information can be retrieved from the QueryJob.

from kensu.google.cloud import bigquery
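
With the augmented client, the usual google-cloud-bigquery calls stay the same. A sketch (the project, dataset, table names, and query are placeholders):

```python
# Sketch: the standard BigQuery client API, imported through the kensu wrapper.
from kensu.google.cloud import bigquery

client = bigquery.Client()  # credentials come from the service account (see repo instructions)

# The QueryJob produced by client.query() is what the agent inspects
# to derive lineage and metrics for the observed tables.
job = client.query(
    """
    SELECT g.vendor_id, g.total_amount, y.total_amount AS yellow_amount
    FROM `my_project.my_dataset.green_ingestion` AS g
    JOIN `my_project.my_dataset.yellow_ingestion` AS y
    USING (vendor_id)
    """
)
result = job.result()  # runs the job; observations are reported as a side effect
```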


BigQuery demonstration

Code walkthrough
