Runs on any Platform* (On-premise, Cloud, VMs, etc)
Runs on any Platform* (On-premise, Cloud, VMs, etc)
Only Managed Feature Available today on both AWS and Azure
*Supported operating systems: RHEL/CentOS 7.x and Ubuntu 18.04. Minimum requirements: 32 GB RAM, 100 GB disk, 8 CPUs. Runs in air-gapped environments.
**Community Hopsworks does not include (1) Feature Store Connectors to Third-Party Platforms and (2) SSO with Active Directory/OAuth-2/Azure-AD/AWS.
When do I need a Feature Store for Machine Learning, and what is it anyway?
Business Problem: Use Machine Learning to Predict Money Laundering
[Diagram labels: Data Warehouse, Data Lake, Message Bus; Know Your Customer Data; Recent Financial Transactions; TRAIN; SERVE]
It is not always easy to get access to Enterprise data for training and serving.
[Diagram labels: Data Warehouse, Data Lake, Message Bus; Recent Financial Transactions; TRAIN; SERVE]
What data can I use to make predictions with?
[Diagram labels: Data Warehouse, Data Lake, Message Bus; Feature Store; Historical Financial Transactions; Recent Financial Transactions; TRAIN; SERVE]
Where does the Feature Store fit into the ML Pipeline?
FEATURE STORE
Offline Feature Store - Create Training Data and Batch Predictions
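# fs, kycFG, rftFG and hftFG are assumed to be an existing feature store handle and feature groups.
# Join the three feature groups into a single query.
df = kycFG.select_all().join(rftFG.select_all()).join(hftFG.select_all())
# Define the training dataset: TFRecord files with a 70/20/10 split.
td = fs.create_training_dataset("precipitation_training_dataset",
                                version=1,
                                data_format="tfrecord",
                                description="Precipitation Training dataset",
                                splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
# Materialize the joined features as the training dataset.
td.save(df)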
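To make the join and the `splits` argument above concrete, here is a minimal pure-Python sketch of the same idea: join feature groups on a shared key, then cut the rows into proportional splits. It does not use the Hopsworks API and is not how Hopsworks implements either step. The helper names (`join_on_key`, `split_rows`), the `cust_id` key, and the toy feature values are all hypothetical.

```python
import random

def join_on_key(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

def split_rows(rows, splits, seed=0):
    """Shuffle rows and cut them into named splits by proportion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    names = list(splits)
    result, start = {}, 0
    for i, name in enumerate(names):
        # The last split takes the remainder so rounding never drops a row.
        if i == len(names) - 1:
            end = len(shuffled)
        else:
            end = start + round(splits[name] * len(shuffled))
        result[name] = shuffled[start:end]
        start = end
    return result

# Hypothetical toy feature groups, all keyed by a customer id.
kyc = [{"cust_id": i, "country": "SE"} for i in range(10)]
rft = [{"cust_id": i, "tx_count_1h": i % 3} for i in range(10)]
hft = [{"cust_id": i, "avg_tx_amount": 100.0 + i} for i in range(10)]

joined = join_on_key(join_on_key(kyc, rft, "cust_id"), hft, "cust_id")
parts = split_rows(joined, {"train": 0.7, "test": 0.2, "validate": 0.1})
print({name: len(rows) for name, rows in parts.items()})
```

In the sketch, a fixed seed keeps the split reproducible, which is one reason to store the split definition with the training dataset rather than re-splitting in each training job.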
[Diagram: Feature Store (kycFG, rftFG, hftFG) -> Training Data (.tfrecord) -> train -> Model]
[Diagram labels: RonDB; Model; US-West-1a, US-West-1b, US-West-1c]
1. Build Feature Vector Using Online Feature Store
2. Send Feature Vector to Model for Prediction
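The two steps above can be sketched as follows. This is a minimal illustration, not the Hopsworks or RonDB API: the online store is a plain in-memory dictionary, the model is a toy rule, and every name (`online_store`, `FEATURE_ORDER`, `build_feature_vector`, `predict`, the feature names) is hypothetical.

```python
# Hypothetical in-memory stand-in for an online feature store:
# one dict per feature group, keyed by customer id.
online_store = {
    "kyc": {42: {"account_age_days": 380, "is_pep": 0}},
    "recent_tx": {42: {"tx_count_1h": 5, "tx_sum_1h": 12500.0}},
}

# The model expects features in a fixed order, so the order is declared once.
FEATURE_ORDER = [
    ("kyc", "account_age_days"),
    ("kyc", "is_pep"),
    ("recent_tx", "tx_count_1h"),
    ("recent_tx", "tx_sum_1h"),
]

def build_feature_vector(store, cust_id):
    """Step 1: look up each feature by primary key, in model order."""
    return [store[group][cust_id][feature] for group, feature in FEATURE_ORDER]

def predict(vector):
    """Step 2: stand-in model (a toy threshold rule, not a trained classifier)."""
    _account_age_days, _is_pep, _tx_count_1h, tx_sum_1h = vector
    return 1 if tx_sum_1h > 10000 else 0

vector = build_feature_vector(online_store, 42)
print(vector, predict(vector))
```

The application only supplies the primary key; the feature values come from the store, which is why lookup latency of the online store matters for serving.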
Hopsworks End-to-End Machine Learning (ML) Pipelines
[Diagram labels: Model Registry, Deploy, A/B Test, Model Serving, Model Monitoring, Model Statistics, Retrieve Features]
[Diagram: Hopsworks platform components]
ORCHESTRATION: Airflow
DATASOURCE
Apache Kafka
BATCH: Apache Spark
STREAMING: Apache Spark, Apache Flink
FEATURE STORE
ML DEVELOP AND TRAIN: Notebooks as Jobs, Tensorflow, Scikit-Learn, PyTorch, Tensorboard
MODEL SERVING AND MONITORING: KFServing, TF-Serving, Flask
APPLICATIONS: API, DASHBOARDS
Project-Based Multi-Tenant Security Model
Moving to the Cloud - Connectors and Integrations
[Diagram labels: Hopsworks, Kubeflow, Amazon S3, Snowflake, Delta Lake, Amazon Redshift]
Making Hopsworks Cloud-Native
HopsFS S3
RonDB DynamoDB/ElastiCache
Early 2020
Serverless Platform on AWS - Amplify, Cognito, CloudFront, Lambdas, Route 53, DynamoDB
Lambdas
Integration with other Platforms - Databricks
Cloud-Native Kubernetes Integration
[Diagram labels: Jobs UI, Scheduler, Jobs; Project-User Pod (Docker Container); access using X.509 / JWT to HopsFS, Hive, Kafka, Elastic]
https://www.logicalclocks.com/blog/how-we-secure-your-data-with-hopsworks
DynamoDB: Expensive, High Latency (~10ms lookup), Limited Query Support - Reporting a Problem, Quotas, Hotspots
RonDB - a new open-source cloud-native distribution of NDB (MySQL Cluster)
Inventor of NDB (MySQL Cluster)
MySQL Cluster (NDB) - the world’s highest throughput transactional datastore
RonDB Competitors
[Chart: Online Feature Stores (DynamoDB, Cassandra, BigTable) compared on Availability and Scalability]
Demo Time.
github.com/logicalclocks/hopsworks
@logicalclocks
www.logicalclocks.com
Feature Engineering and Model Training Pipeline - With a Feature Store