
Building Hopsworks, a cloud-native managed feature store for machine learning


Jim Dowling
CEO, Logical Clocks

Cloud Native London Meetup, March 3 2021


Can we make a Monolith fly in the clouds?
The Hopsworks Feature Store - Available on all Platforms as Managed, Enterprise, and Community

Community Hopsworks** (self-hosted platform, 2016): Runs on any Platform* (On-premise, Cloud, VMs, etc)
Enterprise Hopsworks (self-hosted platform, 2018): Runs on any Platform* (On-premise, Cloud, VMs, etc)
hopsworks.ai (managed platform, 2020): Only Managed Feature Store available today on both AWS and Azure

*Supported operating systems: RHEL/CentOS 7.x and Ubuntu 18.04. Minimum Requirements: 32GB RAM, 100GB disk, 8 CPUs. Runs in air-gapped environments.
**Community Hopsworks does not include (1) Feature Store Connectors to Third-Party Platforms and (2) SSO with Active Directory/OAuth-2/Azure-AD/AWS.
When do I need a Feature Store for Machine Learning, and what is it anyway?
Business Problem: Use Machine Learning to Predict Money Laundering

Reference: Whitepaper, Webinar


What data can I use to solve my Anti-Money Laundering Problem with?

[Figure: enterprise data sources used to TRAIN and SERVE models]
● Data Warehouse: Know Your Customer Data
● Data Lake: Historical Financial Transactions
● Message Bus: Recent Financial Transactions
It is not always easy to get access to Enterprise data for training and serving.

[Figure: enterprise data sources needed to TRAIN and SERVE models]
● Data Warehouse: Know Your Customer Data
● Data Lake: Historical Financial Transactions
● Message Bus: Recent Financial Transactions
What data can I use to make predictions with?

[Figure: the data sources below, together with a Feature Store, used to TRAIN and SERVE models]
● Data Warehouse: Know Your Customer Data
● Data Lake: Historical Financial Transactions
● Message Bus: Recent Financial Transactions
Where does the Feature Store fit into the ML Pipeline?

[Figure: FEATURIZE, FEATURE STORE, TRAIN / SERVE]
Offline Feature Store - Create Training Data and Batch Predictions

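# Assumes fs (the feature store handle) and the feature groups kycFG, rftFG and hftFG already exist.
# Join all features of the three feature groups.
df = kycFG.select_all().join(rftFG.select_all()).join(hftFG.select_all())

# Define a training dataset in TFRecord format with train/test/validate splits.
td = fs.create_training_dataset("precipitation_training_dataset",
                                version=1,
                                data_format="tfrecord",
                                description="Precipitation Training dataset",
                                splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
# Materialize the joined features as the training dataset.
td.save(df)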
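
The `splits` argument above assigns 70% / 20% / 10% of the rows to train / test / validate. As a rough illustration of what such a proportional split means, here is a minimal Python sketch. It is not how Hopsworks implements splitting; `split_rows` is a hypothetical helper introduced only for this example.

```python
import random


def split_rows(rows, splits, seed=0):
    """Randomly partition rows according to the proportions in `splits`.

    Hypothetical helper for illustration; not part of the Hopsworks API.
    """
    if abs(sum(splits.values()) - 1.0) > 1e-9:
        raise ValueError("split proportions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    names = list(splits)
    result, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so no row is lost to rounding.
            end = len(shuffled)
        else:
            end = start + round(splits[name] * len(shuffled))
        result[name] = shuffled[start:end]
        start = end
    return result


parts = split_rows(range(100), {"train": 0.7, "test": 0.2, "validate": 0.1})
print({name: len(rows) for name, rows in parts.items()})
```

Every row lands in exactly one split, which is the property that matters when the test and validation sets are used to evaluate a model trained on the train split.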

[Figure: the Feature Store holds the feature groups kycFG, rftFG and hftFG; they produce Training Data (.tfrecord), which is used to train a Model.]

FG = Feature Group. Documentation: https://docs.hopsworks.ai/


Online Feature Store - the Data Layer for Operational (Online) Models

[Figure: RonDB1, RonDB2 and RonDB3, each alongside a Model, in US-West-1a, US-West-1b and US-West-1c. The Online Application performs step 1 (JDBC) and step 2 (Predict); latency labels in the figure: 2-20ms and ~5-50ms.]

1. Build Feature Vector Using Online Feature Store
2. Send Feature Vector to Model for Prediction
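
The two steps above can be sketched in Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the online feature store (the slide uses RonDB over JDBC), and the table names, column names, thresholds and the `predict` function are all hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the online feature store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kyc (customer_id INTEGER PRIMARY KEY, account_age_days REAL);
    CREATE TABLE recent_tx (customer_id INTEGER PRIMARY KEY,
                            tx_count_1h REAL, tx_sum_1h REAL);
    INSERT INTO kyc VALUES (42, 730.0);
    INSERT INTO recent_tx VALUES (42, 5.0, 12000.0);
""")


def build_feature_vector(conn, customer_id):
    """Step 1: look up precomputed features by primary key and join them."""
    row = conn.execute(
        """SELECT k.account_age_days, r.tx_count_1h, r.tx_sum_1h
           FROM kyc k JOIN recent_tx r ON r.customer_id = k.customer_id
           WHERE k.customer_id = ?""",
        (customer_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no features for customer {customer_id}")
    return list(row)


def predict(feature_vector):
    """Step 2: placeholder rule; a real system would call a model server."""
    account_age_days, tx_count_1h, tx_sum_1h = feature_vector
    return tx_sum_1h > 10000.0 and account_age_days < 1000.0


vector = build_feature_vector(conn, 42)
print(vector, predict(vector))
```

The application only supplies the entity key; the feature values themselves come from the store, so the model sees the same features online that it was trained on.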
Hopsworks End-to-End Machine Learning (ML) Pipelines

[Figure: Hopsworks end-to-end ML pipeline.
Pipeline stages: Data Lake / Warehouse / Kafka, Feature Engineering, Feature Store, Model Development, Model Training, Experiment Tracking, Model Registry, Model Deploy, A/B Test, Model Serving, Model Monitoring.
Supporting services: HopsFS (Programs, Code and configuration, Features, Experiments), Scaleout Metadata, Sync to Elasticsearch for Search (Artifacts, Provenance and Metadata).
Other labels: Feature Statistics, Model Serving Statistics, Training Data Statistics, Retrieve Features, Log Predictions.]


Hopsworks - Develop and Operate ML Applications at Scale

[Figure: Hopsworks platform architecture, spanning Data Preparation & Ingestion, Experimentation & Model Training, and Deploy & Productionalize]
● ORCHESTRATION: Airflow
● DATASOURCE: Apache Kafka
● BATCH: Apache Spark
● STREAMING: Apache Spark, Apache Flink
● HOPSWORKS FEATURE STORE
● ML DEVELOP AND TRAIN: Notebooks as Jobs, Tensorflow, Scikit-Learn, PyTorch, Tensorboard
● MODEL SERVING AND MONITORING: KFServing, TF-Serving, Flask
● APPLICATIONS: API, DASHBOARDS
● FILESYSTEM & METASTORE: HopsFS
Transitioning Security to the Cloud...
Project-Based Multi-Tenant Security Model

Moving to the Cloud - Connectors and Integrations

[Figure: Hopsworks with Project-Based Multi-Tenant Security]
● Authentication: User Login (LDAP, AD, OAuth2, 2FA), API KEY
● Projects: Dev Feature Store, Staging Feature Store, Prod Feature Store; Users, Jobs
● Integrations: databricks, SageMaker, Kubeflow, Amazon EMR
● Cloud access: IAM Profile or Federated IAM Role
● Storage: Amazon S3, Snowflake, Delta Lake, Amazon Redshift
Making Hopsworks Cloud-Native

Hopsworks Open Source            Cloud Native Service
Open-Source Docker Repository    ECR / ACR
Kubernetes                       EKS / AKS

Hopsworks Services               Rejected Cloud Native Versions
Spark-on-YARN                    Databricks / EMR
HopsFS                           S3
RonDB                            DynamoDB / ElastiCache
Kafka                            Managed Kafka
Elastic Open Distro              AWS Elastic
Developing Hopsworks.ai
The first European Company to provide a managed scale-out data and AI platform in the cloud
Hopsworks.ai

Early 2020

Nov 2020 (GA)

Serverless Platform on AWS - Amplify, Cognito, CloudFront, Lambdas, Route 53, DynamoDB

Lambdas

Integration with other Platforms - Databricks
Cloud-Native Kubernetes Integration

[Figure: Kubernetes (EKS, AKS) and Hopsworks.
1. On Project Creation, the Project_User X.509 certificate and JWT are registered through the Kubernetes Secrets API (API server).
2. Jobs started from the Jobs UI or Scheduler run as a Project-User in a Pod; the Docker Container receives the Project_User X.509 certificate and JWT.
The container accesses HopsFS, Hive, Kafka and Elastic using X.509 / JWT.]
https://www.logicalclocks.com/blog/how-we-secure-your-data-with-hopsworks
DynamoDB
● Expensive
● High Latency (~10ms lookup)
● Limited Query Support - Reporting a Problem
● Quotas, Hotspots
RonDB - a new open-source cloud-native distribution of NDB (MySQL Cluster)

RonDB vs Redis - RonDB outperforms on 1 CPU Core and Keeps on Scaling

MySQL Cluster (NDB) - the world’s highest throughput transactional datastore

200m ops/second with NDB - world’s fastest key-value store

Inventor of NDB (MySQL Cluster)
www.rondb.com


RonDB - the first LATS Database in the Cloud. Launched in private beta Feb 2021.

RonDB is a LATS Database

● low Latency: < 1ms KV lookup
● high Availability: >99.999% availability
● high Throughput: >10M KV Lookups/sec
● scalable Storage
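
To put the availability figure in perspective, the arithmetic below converts an availability percentage into the maximum downtime it allows per year. It is a small Python sketch assuming a 365-day year; the function name is hypothetical.

```python
def max_downtime_minutes_per_year(availability_percent):
    """Downtime budget implied by an availability target, in minutes per year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year
    return minutes_per_year * (1 - availability_percent / 100)


downtime = max_downtime_minutes_per_year(99.999)
print(f"{downtime:.2f} minutes per year")
```

At 99.999% ("five nines") the budget is a little over five minutes of downtime per year, which leaves essentially no room for manual failover.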
Lessons Learnt (so far)
Lessons Learnt (so far) in building a Cloud Native Managed Data/AI Platform

Shiny new toys are not always the best

● Lambda functions are poor for synchronous events (e.g. request-reply) due to slow response times
○ Unsuitable for "web" endpoints - 500-2000 ms response time
○ Caused by cold Lambdas, but also by the JS JIT
○ Parallel operations are difficult due to lack of support in Lambda

● “Amplifeck’d” is a common word on our Slack

● SQL > Key Value APIs
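
The slide does not elaborate on the last point. One possible reading, offered here only as an assumption, is that a SQL interface lets the database do joins and aggregations that a key-value API pushes into application code. The Python sketch below contrasts the two styles on hypothetical data, using plain dictionaries as a stand-in for a key-value store and in-memory SQLite as a stand-in for a SQL database.

```python
import sqlite3

# Hypothetical data: accounts keyed by id, transactions keyed by id.
accounts = {1: {"customer": "alice"}, 2: {"customer": "bob"}}
transactions = {
    10: {"account_id": 1, "amount": 50.0},
    11: {"account_id": 1, "amount": 70.0},
    12: {"account_id": 2, "amount": 5.0},
}


def totals_via_kv(accounts, transactions):
    """Key-value style: fetch items one by one and aggregate in the client."""
    totals = {}
    for tx in transactions.values():
        customer = accounts[tx["account_id"]]["customer"]  # extra lookup per item
        totals[customer] = totals.get(customer, 0.0) + tx["amount"]
    return totals


def totals_via_sql(accounts, transactions):
    """SQL style: the join and aggregation are expressed in a single query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute(
        "CREATE TABLE transactions "
        "(id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [(key, value["customer"]) for key, value in accounts.items()],
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(key, v["account_id"], v["amount"]) for key, v in transactions.items()],
    )
    rows = conn.execute(
        "SELECT a.customer, SUM(t.amount) FROM transactions t "
        "JOIN accounts a ON a.id = t.account_id GROUP BY a.customer"
    ).fetchall()
    conn.close()
    return dict(rows)


print(totals_via_kv(accounts, transactions))
print(totals_via_sql(accounts, transactions))
```

Both functions return the same totals; the difference is where the work happens and how many round trips a remote store would need.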

RonDB Competitors

[Figure: chart with four axes: Availability, Latency, Throughput, Scalability. Labelled items: Online Feature Stores, RonDB, Redis, and the group DynamoDB, Cassandra, BigTable.]
Demo Time.

github.com/logicalclocks/hopsworks
@logicalclocks
www.logicalclocks.com
Feature Engineering and Model Training Pipeline - With a Feature Store

[Figure: pipeline with a Feature Store.
Sources: Data Warehouse, KAFKA, Data Lake.
Feature path: Feature Engineering, Feature Store (Online Feature Store and Offline Feature Store).
Training path: Train/Test Data (S3, HDFS, etc), Model Training, Model Repository, Deploy.
Online path: Online Feature Store, Feature Vectors, Model Serving, Online Application, Monitor.
Batch path: Offline Feature Store, Batch Access, Batch Scoring, Result Sink (DB).]
