Serverless ML to predict surf heights at Lahinch Beach

Python Ireland Meetup, Sep 14th 2022
Jim Dowling, CEO @ Hopsworks and Assoc Prof @ KTH
Serverless ML in Python
Predict surf height at Lahinch Beach
Beyond Notebooks: Don’t just train models, build “Prediction Services”
❌ Static datasets 💚 Data never stops coming
❌ Data is downloaded from a single URL 💚 Data comes from different heterogeneous data sources
❌ Features for ML are engineered, correct, and unbiased 💚 Write code to extract and validate the features from input data
❌ Use a model evaluation metric (accuracy) to communicate the value of your model 💚 Communicate the value of your model with a UI or app/service
💚 Build and deploy a reliable service around your model with MLOps
Serverless ML “Prediction Service”
[Diagram labels: train model; Models; Feature Pipelines & Batch Prediction Pipelines; Features; HOPSWORKS.AI; once or twice/day; twice/day predictions; publish to UI; Github Pages UI]
https://github.com/jimdowling/cjsurf
Alternatives to Github Actions for Serverless Python
Serverless Python Functions
● render.com
● pythonanywhere.com
● replit.com
● deta.sh
● linode.com
● hetzner.com
● digitalocean.com
● AWS lambda functions
● Google Cloud Functions
Orchestration Platforms
● Astronomer (Airflow)
● Dagster
● Prefect
● Azure Data Factory
● Amazon Managed Workflows for Apache Airflow (MWAA)
● Google Cloud Composer
● Databricks Workflows
(Good) Bombs going off at Mullaghmore, Ireland
What height will the surf be at Lahinch this weekend?
When I lived in Dublin, I always wanted to know what I would do the next weekend…
Surf's up? No / Yes
We built a system called CJSurf to predict surf at Lahinch
Open Ocean Swell Predictions Lahinch Beach Surf Height Predictions

Swells/Waves have (1) height, (2) period, (3) direction
Direction
Height
Period is the time between waves
Wave height at the point is 4 times higher than wave height at the beach
Swell Predictions by NOAA Buoys with height, period, direction
https://polar.ncep.noaa.gov/waves/WEB/gfswave.latest_run/plots/gfswave.62108.bull
Accurate Surf Height Observations by Lahinch Surf Shop
Reports are published at 10am every day by https://www.lahinchsurfshop.com/
Can I write CJSurf from 2004 with free managed services?
Production Machine Learning in 2004!
Data sources: lahinchsurfshop.com, noaa.gov (62081, 62105)
Java Data Collector & K-NN Predictions (CronJob): write features & predictions to MySQL
PHP Web App: look up precomputed predictions in MySQL
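The K-NN prediction step named in this architecture can be sketched in a few lines of plain Python. This is a minimal illustration, not the original Java code: the function name `knn_predict` and the query point are hypothetical, and the three training rows are the ones shown on the training-data slide later in the deck. Features are left unscaled here, so direction (the largest numbers) dominates the distance.

```python
import math

def knn_predict(train, query, k=2):
    """Average the labels of the k training rows nearest to `query` (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(label for _, label in nearest) / len(nearest)

# (swell height, direction, period) -> surf height label, from the training-data slide
train = [
    ((1.25, 88, 9.8), 1.5),
    ((1.30, 92, 10.2), 2.0),
    ((2.45, 100, 11.4), 3.0),
]

# Hypothetical incoming swell; its two nearest neighbours are the first two rows
prediction = knn_predict(train, (1.28, 90, 10.0), k=2)
print(prediction)
```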
Can we rewrite a LAMP architecture to a free serverless Python architecture in 2022?

Serverless Analytical ML Application in Python (2022)
SERVERLESS COMPUTE: surf-report-features.ipynb, swell-features.ipynb, train-model.ipynb, batch-predict-surf.ipynb
SERVERLESS STATE: Hopsworks Feature Store, Hopsworks Model Registry
SERVERLESS UI: Github Pages (latest_lahinch.png)
Data sources: Lahinch, NOAA
Flows: insert DataFrames, add model, download model
https://github.com/jimdowling/cjsurf
Feature Store
Feature Engineering with Pandas/Spark/SQL/Flink
[Diagram: DataFrames → HOPSWORKS Feature Store → DataFrames/Files. Operations shown: SQL, aggregations, normalization, dimensionality reductions, one-hot encoding, validations.]
Feature Store: write to Feature Groups, read from Feature Views
HOPSWORKS FEATURE STORE
Write DataFrames → Feature Groups → Feature Views
Online API: read feature vectors (real-time features)
Offline API: read files/DataFrames (batch data)
Also: search, versioning, metrics, lineage, source code

Feature Engineering in Pandas
Feature Engineering: what time does the swell “hits_at” Lahinch?
Prediction “hits_at”
Time=0 Lahinch Time=?
The swell velocity is calculated by multiplying the swell period by 1.5. But, we also need to consider swell direction.
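The rule of thumb above can be turned into a small arrival-time calculation. This is a sketch under stated assumptions: the slide gives no units, so the code assumes the common deep-water approximation in which the period is in seconds and 1.5 × period is a speed in knots. The function names and the 150 nautical mile distance are hypothetical.

```python
from datetime import datetime, timedelta

def swell_speed_knots(period_s: float) -> float:
    """Rule of thumb from the slide: speed = 1.5 x period (assumed knots and seconds)."""
    return 1.5 * period_s

def hits_at(observed_at: datetime, period_s: float, distance_nm: float) -> datetime:
    """Estimate when a swell seen at the buoy at `observed_at` reaches the beach."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    hours = distance_nm / swell_speed_knots(period_s)
    return observed_at + timedelta(hours=hours)

# Hypothetical example: a 10 s swell observed 150 nautical miles from the beach.
# 150 nm / 15 knots = 10 hours of travel time.
arrival = hits_at(datetime(2004, 1, 1, 0, 0), period_s=10.0, distance_nm=150.0)
print(arrival)
```

Swell direction is not handled here; it only decides whether the swell reaches the beach at all.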
Swell Direction and the Swell Window at Lahinch
SWELL WINDOW for Lahinch
Swell directions that work for Lahinch: ~15-120 degrees
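A swell-window check like the one described can be written as a simple range test on direction. This is a minimal sketch: the function name `in_swell_window` is hypothetical, the 15 and 120 degree bounds are taken from the slide as given, and the third sample direction (200) is made up to show a rejected swell.

```python
def in_swell_window(direction_deg: float, start: float = 15.0, end: float = 120.0) -> bool:
    """True if the direction lies in [start, end] degrees; also handles windows that wrap past 360."""
    d = direction_deg % 360.0
    if start <= end:
        return start <= d <= end
    return d >= start or d <= end

# Directions 88 and 100 appear in the swell table; 200 is a hypothetical out-of-window swell
swells = [{"direction": 88}, {"direction": 100}, {"direction": 200}]
working = [s for s in swells if in_swell_window(s["direction"])]
print(working)
```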
Writing Pandas DataFrames to Hopsworks as Feature Groups
Data Validation for Feature Groups with Great Expectations
Data Validation with Great Expectations and Hopsworks
Hopsworks Alerting for Data Validation with Great Expectations
[Diagram: DataFrame → Great Expectations → Feature Group; a failed validation (❌) raises a Hopsworks Alert]
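The validate-then-alert flow can be illustrated without any external library. The sketch below is a plain-Python stand-in and does not use the Great Expectations or Hopsworks APIs: `RangeExpectation`, `validate`, the bounds, and the printed alert are all hypothetical. It only shows the idea of checking rows against declared expectations before they are written to a feature group.

```python
from dataclasses import dataclass

@dataclass
class RangeExpectation:
    """Expect every value in `column` to be present and within [min_value, max_value]."""
    column: str
    min_value: float
    max_value: float

    def failures(self, rows):
        """Return the rows whose value is missing or out of range."""
        bad = []
        for row in rows:
            value = row.get(self.column)
            if value is None or not (self.min_value <= value <= self.max_value):
                bad.append(row)
        return bad

def validate(rows, expectations):
    """Map each violated column to its failing rows; an empty dict means all checks passed."""
    report = {}
    for exp in expectations:
        bad = exp.failures(rows)
        if bad:
            report[exp.column] = bad
    return report

expectations = [
    RangeExpectation("height", 0.0, 30.0),      # hypothetical bounds
    RangeExpectation("direction", 0.0, 360.0),
    RangeExpectation("period", 0.0, 30.0),      # hypothetical bounds
]
rows = [
    {"height": 1.25, "direction": 88, "period": 9.8},
    {"height": -1.0, "direction": 92, "period": 10.2},  # deliberately invalid height
]

report = validate(rows, expectations)
if report:
    # Stand-in for raising an alert instead of inserting the rows
    print("ALERT: validation failed for", sorted(report))
```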
Creating Training Data From Feature Groups
Feature Store: write to Feature Groups, read from Feature Views
HOPSWORKS FEATURE STORE
Write DataFrames → Feature Groups → Feature Views
Online API: read feature vectors (real-time features)
Offline API: read files/DataFrames (batch data)
Also: search, versioning, metrics, lineage, source code

Join Features to create Point-in-time Correct Training Data
lahinch_surf_reports (updated every 24 hrs)
beach_id  obs_time          height  min  max
1         2004-01-01 10:00  1       1    1
1         2004-01-02 11:00  1.5     1    2
1         2004-01-03 12:00  3       2    4

noa_swells (updated every 6 hrs)
buoy_id  hits_at           height  direction  period
62105    2004-01-01 00:00  1.25    88         9.8
62105    2004-01-02 06:00  1.30    92         10.2
62105    2004-01-03 12:00  2.45    100        11.4

Point-in-time Correct JOIN (no future data leakage)

Training Data
obs_time => hits_at  height (swell)  direction  period  height (label)
2004-01-01 10:00     1.25            88         9.8     1.5
2004-01-02 11:00     1.30            92         10.2    2
2004-01-03 12:00     2.45            100        11.4    3
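The join shown in these tables pairs each surf report with the most recent swell whose hits_at is not later than the report's obs_time. Below is a minimal plain-Python sketch of that idea using `bisect`; it is not how Hopsworks implements it. The timestamps and swell values are from the tables, the labels are the report heights as listed in lahinch_surf_reports, and the helper names are hypothetical. In pandas, `merge_asof` with its default backward direction performs the same kind of join.

```python
from bisect import bisect_right
from datetime import datetime

def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")

# Swell rows: (hits_at, height, direction, period), sorted by hits_at
swells = [
    (ts("2004-01-01 00:00"), 1.25, 88, 9.8),
    (ts("2004-01-02 06:00"), 1.30, 92, 10.2),
    (ts("2004-01-03 12:00"), 2.45, 100, 11.4),
]
# Surf reports: (obs_time, reported height used as the label)
reports = [
    (ts("2004-01-01 10:00"), 1.0),
    (ts("2004-01-02 11:00"), 1.5),
    (ts("2004-01-03 12:00"), 3.0),
]

def point_in_time_join(reports, swells):
    """For each report, attach the latest swell with hits_at <= obs_time."""
    hit_times = [s[0] for s in swells]
    rows = []
    for obs_time, label in reports:
        i = bisect_right(hit_times, obs_time)  # swells[:i] all have hits_at <= obs_time
        if i == 0:
            continue  # no swell known yet; skip rather than leak a future one
        _, height, direction, period = swells[i - 1]
        rows.append((obs_time, height, direction, period, label))
    return rows

training_rows = point_in_time_join(reports, swells)
for row in training_rows:
    print(row)
```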
Python DSL for Point-in-Time JOINs, transpiled into SQL
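# Select the label from the surf reports and join the swell features to it
query = (
    lahinch.select(['wave_height'])
    .join(swells.select(['height', 'period', 'direction']))
)
# A feature view stores the query and marks which column is the label
fv = fs.create_feature_view(
    name='lahinch_surf',
    description="Lahinch surf height features",
    version=1,
    labels=["wave_height"],
    query=query,
)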
Avoid Training/Serving Skew with Online Models
Maximize Feature Reuse: Transformations after Feature Groups
[Diagram: DataFrames → HOPSWORKS Feature Store → DataFrames/Files. Operations shown: SQL, aggregations, normalization, dimensionality reductions, one-hot encoding, validations.]
Normalizing numerical features often improves model performance
Normalization of swell height, period, distance
RMSE 7.0 RMSE 5.11

Scikit-Learn Transformation Functions
Training Pipeline
import joblib
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then persist it for serving
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
joblib.dump(scaler, 'scaler.pkl')
Online Inference Pipeline
import joblib
from sklearn.preprocessing import MinMaxScaler

# Load the scaler fitted in training; only transform here, never refit
scaler = joblib.load("/path/to/scaler.pkl")
X_test_scaled = scaler.transform(X_test)
Ensure consistency with versioning, code review, and testing
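To see why the serving side must reuse the scaler fitted in training rather than fit a new one, here is a dependency-free sketch. The `MinMax` class is a tiny single-feature stand-in written for this example (it is not scikit-learn), and all values are hypothetical.

```python
class MinMax:
    """Tiny single-feature min-max scaler, for illustration only."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero for constant input
        return [(v - self.lo) / span for v in values]

train_heights = [0.0, 1.0, 2.0, 4.0]  # hypothetical training values
serving_heights = [1.0, 2.0]          # hypothetical values seen at serving time

# Consistent: reuse the min and max learned from the training data
fitted = MinMax().fit(train_heights)
consistent = fitted.transform(serving_heights)

# Skewed: refitting on serving data maps the same inputs to different values
skewed = MinMax().fit(serving_heights).transform(serving_heights)

print(consistent, skewed)
```

The same raw height of 1.0 becomes 0.25 in one case and 0.0 in the other, so a model trained on the first scale would receive inputs on a different scale.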
Online Transformation Functions
Training Pipeline
standard_scaler = fs.get_transformation_function(name="standard_scaler")
# Attach the same scaler to each numerical swell feature
transformation_functions = {
    "height": standard_scaler,
    "period": standard_scaler,
    "direction": standard_scaler,
}
fv = fs.create_feature_view(name='lahinch_surf',
    …
    transformation_functions=transformation_functions)
# The split returns both feature sets first, then both label sets
X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.1)
Online Inference Pipeline
keys = {"beach_id": 1}
# The feature view applies the same transformation functions at serving time
feature_vector = fv.get_feature_vector(keys)
Transformation functions (UDFs) consistent over training & serving
Lesson Learned: Refactor Monolithic ML Pipelines into Production ML Services
Beyond Notebooks and Monolithic ML Pipelines
● Monolithic ML Pipelines are a single pipeline that transforms raw data into features and trains and scores the model in one single program
● No easy path to production, so often just thrown over the wall to ops :)
Raw Data → Feature Engineering → Train Model → Evaluate Model
Refactor Monolithic Pipelines into Feature, Training, Inference Pipelines
● A feature pipeline to create features from new live data or to backfill features from historical data
● A training pipeline that can be run when a new model is needed
● An inference pipeline (either batch or online) that takes features from the feature store, and if the model is online, combines them with online features.
[Diagram: a Data Source (new data) and Historical Data (backfill) feed the Feature Pipeline, which writes features to Hopsworks. Hopsworks supplies training data to the Training Pipeline, which produces a model, and inference data to the Inference Pipeline, which uses the model to produce batch predictions. Run labels: "Run on a schedule", "Run on-demand".]
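The three-pipeline split can be sketched as three independent functions that communicate only through shared state. Everything here is a hypothetical stand-in: the in-memory list and dict replace the feature store and model registry, the "model" is a single ratio, and the data values are made up. The point is only that each pipeline can run on its own schedule.

```python
# In-memory stand-ins for the feature store and model registry (hypothetical)
feature_store = []   # list of feature rows
model_registry = {}  # model name -> model

def feature_pipeline(raw_rows):
    """Turn raw data (live or historical backfill) into feature rows and store them."""
    for row in raw_rows:
        feature_store.append(
            {"swell_height": row["swell"], "surf_height": row.get("surf")}
        )

def training_pipeline():
    """Fit a one-parameter model, surf = ratio * swell, on the labelled rows only."""
    labelled = [r for r in feature_store if r["surf_height"] is not None]
    total_surf = sum(r["surf_height"] for r in labelled)
    total_swell = sum(r["swell_height"] for r in labelled)
    model_registry["surf_model"] = total_surf / total_swell

def batch_inference_pipeline():
    """Score the unlabelled feature rows with the registered model."""
    ratio = model_registry["surf_model"]
    return [ratio * r["swell_height"] for r in feature_store if r["surf_height"] is None]

# Backfill from historical data, train on demand, then score newly arrived data
feature_pipeline([{"swell": 1.0, "surf": 2.0}, {"swell": 2.0, "surf": 4.0}])
training_pipeline()
feature_pipeline([{"swell": 1.5}])
predictions = batch_inference_pipeline()
print(predictions)
```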
Online Inference Pipelines are part of Model Serving Infra
● Some features are pre-computed and retrieved from the feature store (typically those that require history and context information)
● Some features are computed on-demand (at run-time) with application-supplied data (and possibly also history/context)
[Diagram: an Operational Service (Stream Source), a Batch Source, and Historical Data (backfill) feed the Feature Pipeline, which writes features to Hopsworks. Hopsworks supplies training data to the Training Pipeline, which produces a model for Model Serving. An Application or Service sends a request with on-demand features; Model Serving combines them with precomputed features and returns a prediction. Run labels: "Run on a schedule", "Run on-demand".]
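The two kinds of features described above can be combined into one feature vector at request time. This is a minimal sketch: the dict stands in for the online feature store, `beach_id` comes from the slides, and the on-demand feature names (`hour_of_day`, `is_weekend`) and the request shape are hypothetical.

```python
from datetime import datetime

# Precomputed features keyed by entity id (stand-in for the online feature store)
precomputed = {1: {"swell_height": 1.25, "period": 9.8}}

def on_demand_features(request):
    """Features computed at request time from application-supplied data."""
    when = request["requested_for"]
    return {"hour_of_day": when.hour, "is_weekend": when.weekday() >= 5}

def build_feature_vector(request):
    """Combine precomputed and on-demand features for one prediction request."""
    stored = precomputed[request["beach_id"]]
    return {**stored, **on_demand_features(request)}

# 2004-01-03 was a Saturday
request = {"beach_id": 1, "requested_for": datetime(2004, 1, 3, 12, 0)}
vector = build_feature_vector(request)
print(vector)
```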
Case Study: Iris Flowers as a Batch Prediction Service
[Diagram: iris-feature-pipeline.ipynb (GH Actions, once/day) takes new synthetic data, plus a backfill from iris.csv, and writes features to Hopsworks. iris-train-knn-model.ipynb (Colab, run on-demand) reads training data and registers iris_model. iris-batch-inference-pipeline.ipynb reads features and the model, and its output DataFrame is shown in a Github Pages UI.]
https://github.com/featurestoreorg/serverless-ml-course/tree/main/src/01-module
SERVERLESS ML
www.serverless-ml.org
September 2022
Serverless ML Flywheels with Hopsworks
PyData London Exclusive: limited registrations now available at:
https://app.hopsworks.ai
Our Promise to you:
Time Unlimited Free Tier

Twitter: @jim_dowling
Compliance, Governance, Efficiency, At Scale, Open & modular
www.hopsworks.ai
