
Python Ireland Meetup, Sep 14th 2022

Jim Dowling, CEO @ Hopsworks and Assoc Prof @ KTH

Serverless ML in Python
Predict surf height at Lahinch Beach
Beyond Notebooks: Don’t just train models, build “Prediction Services”

❌ Static Datasets
💚 Data never stops coming

❌ Data is downloaded from a single URL
💚 Data comes from different heterogeneous data sources

❌ Features for ML are engineered, correct, and unbiased
💚 Write code to extract and validate the features from input data

❌ Use a model evaluation metric (accuracy) to communicate the value of your model
💚 Communicate the value of your model with a UI or app/service

💚 Build and deploy a reliable service around your model with MLOps
Serverless ML “Prediction Service”

[Diagram: Feature Pipelines & Batch Prediction Pipelines exchange Features and Models with HOPSWORKS.AI; a model is trained from the features; Predictions are published to a Github Pages UI. Annotations: "Once or Twice/day", "Twice/day".]

https://github.com/jimdowling/cjsurf
Alternatives to Github Actions for Serverless Python

Serverless Python Functions

● render.com
● pythonanywhere.com
● replit.com
● deta.sh
● linode.com
● hetzner.com
● digitalocean.com
● AWS Lambda functions
● Google Cloud Functions

Orchestration Platforms

● Astronomer (Airflow)
● Dagster
● Prefect
● Azure Data Factory
● Amazon Managed Workflows for Apache Airflow (MWAA)
● Google Cloud Composer
● Databricks Workflows
(Good) Bombs going off at Mullaghmore, Ireland
What height will the surf be at Lahinch this weekend?

When I lived in Dublin, I always wanted to know what I would do the next weekend…

Surf's up? No / Yes
We built a system called CJSurf to predict surf at Lahinch

Open Ocean Swell Predictions Lahinch Beach Surf Height Predictions


Swells/Waves have (1) height, (2) period, (3) direction

Direction

Height

Period is the time between waves

Wave height at the point is 4 times higher than wave height at the beach
Swell Predictions by NOAA Buoys with height, period, direction

https://polar.ncep.noaa.gov/waves/WEB/gfswave.latest_run/plots/gfswave.62108.bull
Accurate Surf Height Observations by Lahinch Surf Shop

Reports are published at 10am every day by https://www.lahinchsurfshop.com/
Can I write CJSurf from 2004 with free managed services?
Production Machine Learning in 2004!
[Diagram: a Java Data Collector & K-NN Predictions CronJob reads from lahinchsurfshop.com and noaa.gov (62081, 62105) and writes Features & Predictions to MySQL; a PHP Web App looks up the precomputed predictions.]

Can we rewrite a LAMP architecture as a free serverless Python architecture in 2022?


Serverless Analytical ML Application in Python (2022)
[Diagram, three columns:
SERVERLESS COMPUTE: surf-report-features.ipynb, swell-features.ipynb, batch-predict-surf.ipynb, and train-model.ipynb, with data from Lahinch and NOAA.
SERVERLESS STATE: Hopsworks Feature Store (insert DataFrames) and Hopsworks Model Registry (add model, download model).
SERVERLESS UI: Github Pages, showing latest_lahinch.png.]

https://github.com/jimdowling/cjsurf
Feature Store
Feature Engineering with Pandas/Spark/SQL/Flink

[Diagram: DataFrames → HOPSWORKS Feature Store → DataFrames/Files. Transformations shown: SQL, Aggregations, Normalization, Dimensionality Reductions, One-hot encoding, Validations.]
Feature Store: write to Feature Groups, read from Feature Views

[Diagram: HOPSWORKS FEATURE STORE. Write DataFrames to Feature Groups; read through Feature Views. Online API: Read Feature Vectors (Real-Time Features). Offline API: Read Files/DataFrames (Batch Data). Also shown: Search, Versioning, Metrics, Lineage, Source Code.]


Feature Engineering in Pandas
Feature Engineering: what time does the swell “hit_at” Lahinch?

[Diagram: Prediction at Time=0 → “hits_at” Lahinch at Time=?]

The swell velocity is calculated by multiplying the swell period by 1.5. But we also need to consider swell direction.
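As a minimal sketch of the rule above, the arrival time can be estimated from the swell period and the distance still to travel. The units are an assumption, since the slide does not state them: the 1.5 factor is treated here as giving a speed in knots from a period in seconds, with distance in nautical miles. The function names and the `distance_nm` parameter are hypothetical, and swell direction is ignored.

```python
from datetime import datetime, timedelta


def swell_speed(period_s: float) -> float:
    """Swell velocity = 1.5 x period (rule from the slide; knots is an assumption)."""
    return 1.5 * period_s


def estimate_hits_at(predicted_at: datetime, period_s: float, distance_nm: float) -> datetime:
    """Estimate when a swell observed at `predicted_at` reaches the beach.

    distance_nm is a hypothetical input: the distance from the buoy to the
    beach, assumed to be in nautical miles so distance / speed is in hours.
    """
    travel_hours = distance_nm / swell_speed(period_s)
    return predicted_at + timedelta(hours=travel_hours)


if __name__ == "__main__":
    # A 10 s swell moves at 15 (assumed) knots, so 150 nm takes 10 hours.
    print(estimate_hits_at(datetime(2004, 1, 1, 0, 0), period_s=10.0, distance_nm=150.0))
```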
Swell Direction and the Swell Window at Lahinch

[Map: SWELL WINDOW for Lahinch]

Swell directions that work for Lahinch: ~15-120 degrees
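One way to use the swell window as a feature-engineering step is to drop swells whose direction falls outside it. The sketch below assumes the window bounds are inclusive and that each swell is a dict with a `direction` key in degrees; both the function names and that row layout are hypothetical.

```python
SWELL_WINDOW_DEG = (15.0, 120.0)  # approximate window quoted on the slide


def in_swell_window(direction_deg: float, window=SWELL_WINDOW_DEG) -> bool:
    """True if the swell direction lies inside the (inclusive) window."""
    lo, hi = window
    return lo <= direction_deg % 360 <= hi


def filter_swells(swells):
    """Keep only the swells whose direction can reach the beach."""
    return [s for s in swells if in_swell_window(s["direction"])]


if __name__ == "__main__":
    swells = [{"direction": 88}, {"direction": 200}, {"direction": 100}]
    print(filter_swells(swells))
```

With a pandas DataFrame, `df[df["direction"].between(15, 120)]` expresses the same filter.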
Writing Pandas DataFrames to Hopsworks as Feature Groups
Data Validation for Feature Groups with Great Expectations
Data Validation with Great Expectations and Hopsworks
Hopsworks Alerting for Data Validation with Great Expectations

[Diagram: a DataFrame is checked with Great Expectations on its way to the Feature Group; a failed validation (❌) raises a Hopsworks Alert.]
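The validate-then-insert flow can be sketched in plain Python. This is a hand-rolled stand-in, not the Great Expectations or Hopsworks API: the expectation ranges, the list used as a "feature group", and the `alert` callback are all hypothetical.

```python
# Hypothetical expectations: column -> (min allowed, max allowed)
EXPECTATIONS = {
    "height": (0.0, 30.0),
    "period": (0.0, 30.0),
    "direction": (0.0, 360.0),
}


def validate_rows(rows, expectations):
    """Return (row_index, column, value) for every value outside its range."""
    failures = []
    for i, row in enumerate(rows):
        for column, (lo, hi) in expectations.items():
            value = row.get(column)
            if value is None or not (lo <= value <= hi):
                failures.append((i, column, value))
    return failures


def insert_if_valid(rows, expectations, feature_group, alert):
    """Insert rows only when every expectation holds; otherwise raise an alert."""
    failures = validate_rows(rows, expectations)
    if failures:
        alert(failures)
        return False
    feature_group.extend(rows)
    return True


if __name__ == "__main__":
    fg = []
    ok = insert_if_valid([{"height": 1.25, "period": 9.8, "direction": 88}], EXPECTATIONS, fg, print)
    bad = insert_if_valid([{"height": -1.0, "period": 9.8, "direction": 88}], EXPECTATIONS, fg, print)
    print(ok, bad, len(fg))
```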
Creating Training Data From Feature Groups

Join Features to create Point-in-time Correct Training Data
lahinch_surf_reports (updated every 24 hrs)

beach_id  obs_time          height  min  max
1         2004-01-01 10:00  1       1    1
1         2004-01-02 11:00  1.5     1    2
1         2004-01-03 12:00  3       2    4

noa_swells (updated every 6 hrs)

buoy_id  hits_at           height  direction  period
62105    2004-01-01 00:00  1.25    88         9.8
62105    2004-01-02 06:00  1.30    92         10.2
62105    2004-01-03 12:00  2.45    100        11.4

Point-in-time Correct JOIN (no future data leakage)

Training Data

obs_time => hits_at  height (swell)  direction  period  height (label)
2004-01-01 10:00     1.25            88         9.8     1
2004-01-02 11:00     1.30            92         10.2    1.5
2004-01-03 12:00     2.45            100        11.4    3
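To illustrate what a point-in-time correct join does, here is a small pure-Python sketch over the rows from the tables above. For each surf report it picks the most recent swell whose `hits_at` is not later than the report's `obs_time`, so no future swell leaks into a training row. The function name and the output column names are hypothetical, and this is not how Hopsworks implements the join.

```python
from bisect import bisect_right
from datetime import datetime


def ts(text):
    """Parse the timestamp format used in the tables."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


surf_reports = [
    {"obs_time": ts("2004-01-01 10:00"), "height": 1.0},
    {"obs_time": ts("2004-01-02 11:00"), "height": 1.5},
    {"obs_time": ts("2004-01-03 12:00"), "height": 3.0},
]
swells = [
    {"hits_at": ts("2004-01-01 00:00"), "height": 1.25, "direction": 88, "period": 9.8},
    {"hits_at": ts("2004-01-02 06:00"), "height": 1.30, "direction": 92, "period": 10.2},
    {"hits_at": ts("2004-01-03 12:00"), "height": 2.45, "direction": 100, "period": 11.4},
]


def point_in_time_join(reports, swells):
    """Join each report to the latest swell with hits_at <= obs_time."""
    swells = sorted(swells, key=lambda s: s["hits_at"])
    hit_times = [s["hits_at"] for s in swells]
    rows = []
    for report in reports:
        n_known = bisect_right(hit_times, report["obs_time"])
        if n_known == 0:
            continue  # no swell had arrived yet, so there are no features
        swell = swells[n_known - 1]
        rows.append({
            "obs_time": report["obs_time"],
            "swell_height": swell["height"],
            "direction": swell["direction"],
            "period": swell["period"],
            "label": report["height"],
        })
    return rows


if __name__ == "__main__":
    for row in point_in_time_join(surf_reports, swells):
        print(row)
```

On sorted DataFrames, `pandas.merge_asof` with `direction="backward"` performs the same kind of join.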
Python DSL for Point-in-Time JOINs, transpiled into SQL

# Select the label from the surf reports and join in the swell features.
query = (
    lahinch.select(['wave_height'])
    .join(swells.select(['height', 'period', 'direction']))
)

fv = fs.create_feature_view(
    name='lahinch_surf',
    description="Lahinch surf height features",
    version=1,
    labels=["wave_height"],
    query=query,
)
Avoid Training/Serving Skew with Online Models
Maximize Feature Reuse: Transformations after Feature Groups

[Diagram: DataFrames → HOPSWORKS Feature Store → DataFrames/Files. Transformations shown: SQL, Aggregations, Normalization, Dimensionality Reductions, One-hot encoding, Validations.]
Normalizing numerical features often improves model performance

Normalization of swell height, period, distance

RMSE 7.0 RMSE 5.11


Scikit-Learn Transformation Functions

Training Pipeline

import joblib
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
joblib.dump(scaler, 'scaler.pkl')

Online Inference Pipeline

import joblib

scaler = joblib.load("/path/to/scaler.pkl")
X_test_scaled = scaler.transform(X_test)  # reuse the fitted scaler; do not refit

Ensure consistency with versioning, code review, and testing.
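The reason the inference pipeline must reuse the scaler fitted in training can be shown with a tiny stand-in for min-max scaling. `TinyMinMaxScaler` is a hypothetical helper written for this sketch, not scikit-learn's class, and the height values are made up for illustration.

```python
class TinyMinMaxScaler:
    """Minimal min-max scaler for a single feature (assumes max > min)."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in values]


train_heights = [1.0, 2.0, 3.0]   # illustrative training values
serving_heights = [2.0, 2.5]      # illustrative values seen at inference time

# Consistent: scale serving data with the statistics learned in training.
scaler = TinyMinMaxScaler().fit(train_heights)
consistent = scaler.transform(serving_heights)

# Skewed: refitting on serving data changes the meaning of every value.
skewed = TinyMinMaxScaler().fit(serving_heights).transform(serving_heights)

if __name__ == "__main__":
    print(consistent, skewed)
```

The same raw height of 2.0 becomes 0.5 with the training statistics but 0.0 after refitting, which is the kind of training/serving skew the slide warns about.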
Online Transformation Functions

Training Pipeline

standard_scaler = fs.get_transformation_function(name="standard_scaler")
transformation_functions = {
    "height": standard_scaler,
    "period": standard_scaler,
    "direction": standard_scaler,
}
fv = fs.create_feature_view(
    name='lahinch_surf',
    # (other arguments omitted on the slide)
    transformation_functions=transformation_functions,
)
X_train, X_test, y_train, y_test = fv.train_test_split(0.1)

Online Inference Pipeline

keys = {"beach_id": 1}
feature_vector = fv.get_feature_vector(keys)

Transformation functions (UDFs) are consistent over training & serving.
Lesson Learned: Refactor Monolithic ML Pipelines into Production ML Services
Beyond Notebooks and Monolithic ML Pipelines

● Monolithic ML Pipelines are a single pipeline that transforms raw data into features and trains and scores the model in one single program
● No easy path to production, so often just thrown over the wall to ops :)

Raw Data → Feature Engineering → Train Model → Evaluate Model
Refactor Monolithic Pipelines into Feature, Training, Inference Pipelines

● A feature pipeline to create features from new live data or to backfill features
from historical data
● A training pipeline that can be run when a new model is needed
● An inference pipeline (either batch or online) that takes features from the feature
store, and if the model is online, combines them with online features.

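[Diagram: new data from the Data Source, plus a backfill from Historical Data, enters the Feature Pipeline, which writes features to Hopsworks. The Training Pipeline reads training data and outputs a model. The Inference Pipeline reads inference data and the model and outputs Batch predictions. Annotations: "Run on a schedule", "Run on-demand".]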
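The three-pipeline split can be sketched as three independent functions that communicate only through shared state. Everything here is a hypothetical stand-in: a Python list plays the feature store, the "model" is a 1-nearest-neighbour lookup over labeled rows, and the data values are illustrative.

```python
def feature_pipeline(raw_rows, feature_store):
    """Create features from new or historical (backfill) rows and store them."""
    for row in raw_rows:
        feature_store.append({
            "swell_height": row["height"],
            "period": row["period"],
            "surf_height": row.get("surf_height"),  # label; None for new data
        })


def training_pipeline(feature_store):
    """Build a 1-nearest-neighbour 'model': the labeled rows themselves."""
    return [r for r in feature_store if r["surf_height"] is not None]


def predict(model, features):
    """Return the label of the closest labeled row (squared distance)."""
    nearest = min(
        model,
        key=lambda r: (r["swell_height"] - features["swell_height"]) ** 2
        + (r["period"] - features["period"]) ** 2,
    )
    return nearest["surf_height"]


def batch_inference_pipeline(model, feature_store):
    """Score every stored row that does not have a label yet."""
    return [predict(model, r) for r in feature_store if r["surf_height"] is None]


if __name__ == "__main__":
    store = []
    historical = [
        {"height": 1.25, "period": 9.8, "surf_height": 1.0},
        {"height": 2.45, "period": 11.4, "surf_height": 3.0},
    ]
    feature_pipeline(historical, store)                        # backfill
    model = training_pipeline(store)                           # run when a new model is needed
    feature_pipeline([{"height": 2.3, "period": 11.0}], store)  # new live data
    print(batch_inference_pipeline(model, store))
```

Because each function only reads or writes the store, each one can be scheduled, tested, and rerun on its own.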
Online Inference Pipelines are part of Model Serving Infra

● Some features are pre-computed and retrieved from the feature store
(typically those that require history and context information)
● Some features are computed on-demand (at run-time) with
application-supplied data (and possibly also history/context)
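[Diagram: a Stream Source (Operational Service), a Batch Source, and a backfill from Historical Data feed the Feature Pipeline, which writes features to Hopsworks. The Training Pipeline reads training data and outputs a model. Model Serving combines precomputed features from Hopsworks with on-demand and request features from the Application or Service and returns a prediction. Annotations: "Run on a schedule", "Run on-demand".]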
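A rough sketch of how an online inference pipeline might assemble a feature vector from both kinds of features follows. The dict standing in for the online feature store, the request fields, and the on-demand feature `hours_since_hit` are all hypothetical.

```python
from datetime import datetime

# Hypothetical online store: precomputed features keyed by beach_id.
ONLINE_STORE = {
    1: {
        "swell_height": 2.45,
        "period": 11.4,
        "direction": 100,
        "hits_at": datetime(2004, 1, 3, 12, 0),
    },
}


def build_feature_vector(request, online_store):
    """Combine precomputed features with one computed at request time."""
    precomputed = online_store[request["beach_id"]]
    # On-demand feature: uses application-supplied data plus stored context.
    elapsed = request["request_time"] - precomputed["hits_at"]
    hours_since_hit = elapsed.total_seconds() / 3600.0
    return [
        precomputed["swell_height"],
        precomputed["period"],
        precomputed["direction"],
        hours_since_hit,
    ]


if __name__ == "__main__":
    request = {"beach_id": 1, "request_time": datetime(2004, 1, 3, 15, 0)}
    print(build_feature_vector(request, ONLINE_STORE))
```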
Case Study: Iris Flowers as a Batch Prediction Service

[Diagram: iris-feature-pipeline.ipynb takes new Synthetic Data and a backfill from iris.csv and writes features to Hopsworks. iris-train-knn-model.ipynb reads training data and registers iris_model. iris-batch-inference-pipeline.ipynb reads a features DataFrame and the model and feeds the Github Pages UI. Annotations: "GH Actions, Once/day" and "Colab, run on-demand".]

https://github.com/featurestoreorg/serverless-ml-course/tree/main/src/01-module
SERVERLESS ML
www.serverless-ml.org
September 2022
Serverless ML Flywheels with Hopsworks

PyData London Exclusive: limited registrations now available at:

https://app.hopsworks.ai

Our Promise to you:

Time Unlimited Free Tier


Twitter: @jim_dowling

Compliance
Governance
Efficiency At Scale
Open & modular
www.hopsworks.ai
