• Vertica
  • MPP Database
  • Storage
  • Distributed Processing
  • Analytics
• Anomaly Detection
  • ML (Training, Serving, Model Registry)
• Kubeflow
  • Pipelines / Orchestration / Workflows
  • ML Toolkit
• Network Insight
  • Data Ingestion
  • KPI Calculation
  • Anomaly Detection
Model Creation

[Diagram: ML lifecycle. Feature engineering code and training code turn labelled data, testing data, and production data into a model; the model is built into an inference graph / pipeline, then deployed, served, and monitored, feeding a web application / dashboard (SME). Roles involved: Data Engineer, Data Scientists, ML engineers, Software Engineers.]
Model Training

Components: Kubeflow Pipelines (orchestration), Vertica (Anomaly Detection tables), Cold Storage, Feature Engineering, Training, Code Registry, Model Registry, all running on Kubernetes.

0. As a prerequisite, the data has already been processed by NI and extracted to a cold storage; the historical data is available there.
1. Training pipelines are executed on demand.
2. Feature engineering extracts features from the new KPIs.
3. The specific FE python script to extract features from KPIs is obtained from the code repository to train a model.
4. The resulting features are stored in the cold storage.
5. The training code is run to train a model with the obtained features.
6. A specific model version is created as a result of the training.
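The numbered steps above can be sketched end-to-end in plain Python. This is a toy stand-in, not the actual pipeline code: the data, the `extract_features` helper, and the trivial mean/stddev "model" are all illustrative, and in the real system each step would run as a Kubeflow Pipelines step against the cold storage and registries.

```python
import statistics

# Steps 0-1: historical KPI data is assumed to already sit in cold storage;
# a list of dicts stands in for the Parquet/ORC files.
cold_storage = [{"kpi": v} for v in (10, 12, 11, 13, 9, 50, 11, 10)]

# Steps 2-3: the FE script (normally fetched from the code registry)
# extracts features from the KPIs.
def extract_features(rows):
    return [r["kpi"] for r in rows]

# Step 4: the resulting features would be written back to cold storage.
features = extract_features(cold_storage)

# Step 5: training code fits a trivial anomaly model (mean / stddev bounds).
def train(values):
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}

# Step 6: a new model version is created as the result of training.
model_registry = {}
model_registry["anomaly-detector:v1"] = train(features)

print(model_registry["anomaly-detector:v1"])
```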
Model Serving Pipeline

Components: Kubeflow Pipelines (orchestration), Vertica (Network Insight tables, Anomaly Detection tables), Feature Engineering, Serving, Code Registry, Model Registry, all running on Kubernetes.

1. New data arrives to Network Insight on a periodic basis.
2. On a scheduled basis, AD checks if new data is available for predictions.
3. The specific FE python script to extract features from KPIs is obtained from the code repository.
4. Feature engineering extracts features from the new KPIs; the resulting features are sent to the predictor.
5. The specific model version is obtained from the model registry for serving.
6. The serving code runs in k8s as a deployment; it is called to serve predictions and can scale out.
7. Prediction results are stored back in Vertica.
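The serving steps can be illustrated with a minimal Python stand-in. The registry contents, feature extraction, and the 3-sigma scoring rule are assumptions for the sketch, not the real predictor logic:

```python
# Step 5: a specific model version is fetched from the (toy) model registry.
model_registry = {"anomaly-detector:v1": {"mean": 12.0, "stdev": 2.0}}
model = model_registry["anomaly-detector:v1"]

# Steps 3-4: the FE script extracts features from the new KPI rows.
def extract_features(rows):
    return [r["kpi"] for r in rows]

# Step 6: the predictor flags values outside mean +/- k * stdev as anomalies.
def predict(model, values, k=3.0):
    lo = model["mean"] - k * model["stdev"]
    hi = model["mean"] + k * model["stdev"]
    return [not (lo <= v <= hi) for v in values]

new_rows = [{"kpi": 11.5}, {"kpi": 31.0}]   # step 1: new data arrives in NI
predictions = predict(model, extract_features(new_rows))

# Step 7: results would be stored back in Vertica; here, an in-memory table.
results_table = list(zip(extract_features(new_rows), predictions))
print(results_table)
```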
Technology Stack

• Explicit request to not have Vertica in the center of the product/solution
• The DS team needs freedom to choose the libraries to use
• AD should run without Vertica

Alternatives:
• Exploration & Analysis → VM / DS laptop
• Orchestration → standalone Kubeflow Pipelines
• Model Registry → in house / MLflow

Components: Network Insight, Vertica, Cold Storage, Exploration (code), Model Registry, running on Kubernetes.

• Network Insight: historical data is extracted from files by Network Insight. This needs to be done in turns for each data range due to license limitations.
• Cold Storage: all historical data is accumulated in a cold storage in a file format for DS analysis and exploration.
• Exploration: Python code to explore and analyze data. A python-based environment supporting any DS library needs to be provided for this purpose.
• Model Registry: stores model metadata of runs during experimentation if needed. Shares models between DS team members. Model versioning.
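To make the "in house" model registry alternative concrete, here is a toy sketch of what the registry's responsibilities listed above (store run metadata, share models, version them) could look like. The class name, fields, and API are hypothetical, not a description of any existing component:

```python
import datetime

# Toy in-house model registry: keeps per-model version lists with run
# metadata so DS team members can share and retrieve specific versions.
class ModelRegistry:
    def __init__(self):
        self._store = {}  # model name -> list of version entries

    def register(self, name, model, metadata=None):
        versions = self._store.setdefault(name, [])
        entry = {
            "version": len(versions) + 1,   # monotonically increasing version
            "model": model,
            "metadata": metadata or {},
            "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        versions.append(entry)
        return entry["version"]

    def get(self, name, version=None):
        versions = self._store[name]
        if version is None:                 # default to the latest version
            return versions[-1]
        return versions[version - 1]

registry = ModelRegistry()
registry.register("anomaly-detector", {"mean": 12.0}, {"run": "exp-01"})
registry.register("anomaly-detector", {"mean": 12.4}, {"run": "exp-02"})
latest = registry.get("anomaly-detector")
print(latest["version"], latest["metadata"]["run"])
```

MLflow's Model Registry provides the same capabilities (versioning, stage transitions, run metadata) out of the box, which is why it appears as the alternative to building this in house.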
Model Training – Automated / On Demand

Components: Network Insight, Vertica, Cold Storage, Feature Engineering, Model Training, Model Registry, Orchestration, running on Kubernetes.

• Feature Engineering: Python code to extract features from KPIs.
• Code: in house / Git repo / Kubeflow pipelines.
• Files (cold storage): Parquet/ORC, Python scripts.
• Model metadata: MLflow Model / MLMD / in house.
Model Serving – Automated

Components: Orchestration, Vertica, Network Insight, running on Kubernetes.

• New data arrives to Network Insight on a periodic basis.
• On a scheduled basis, AD checks if new data is available for predictions.
• The predictor runs in k8s as a deployment and is called to serve predictions by the orchestrator.
• The specific model version is obtained from the model registry for serving.
• Data is read from Vertica, formatted as a request, sent to the predictor, and the results are stored back in Vertica.
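The "read from Vertica, format as a request, send to the predictor, store results back" hop can be sketched with stdlib JSON. The request/response shapes (`instances` / `predictions`) and the threshold are assumptions for illustration, not the actual predictor API:

```python
import json

# Rows read from Vertica, simulated as (id, kpi) tuples.
rows = [(1, 11.5), (2, 31.0)]

# Format the rows as a JSON prediction request for the predictor deployment.
request = json.dumps({"instances": [{"id": i, "kpi": v} for i, v in rows]})

def predictor(raw_request):
    # Stand-in for the predictor running in k8s; flags KPIs above a bound.
    payload = json.loads(raw_request)
    preds = [{"id": inst["id"], "anomaly": inst["kpi"] > 18.0}
             for inst in payload["instances"]]
    return json.dumps({"predictions": preds})

response = json.loads(predictor(request))

# "Stored back in Vertica", simulated here as an in-memory results table.
results_table = [(p["id"], p["anomaly"]) for p in response["predictions"]]
print(results_table)
```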
Data Exploration & Analysis – On Demand

• Historical data is moved out from Vertica to the cold storage for exploration purposes.
• Cold storage: S3-compatible FS; data available for exploration.
• Notebooks: Python code for data analysis and exploration.
• MLMD: metadata registry (experiments, runs, models, model versioning).
Training – Automated

• Historical data is moved out from Vertica to the cold storage for training purposes.
• Cold storage: S3-compatible FS; data available for training.
• Pipelines: training pipeline.
• MLMD: metadata registry (experiments, runs, models, model versioning).
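The metadata-registry role that MLMD plays here (experiments grouping runs, runs recording parameters, metrics, and produced model versions) can be illustrated with a toy stand-in. This is not the ml-metadata API; every name below is made up for the sketch:

```python
# Toy metadata registry: experiments -> runs -> params/metrics/model version.
metadata = {"experiments": {}}

def log_run(experiment, params, metrics, model_version):
    runs = metadata["experiments"].setdefault(experiment, [])
    runs.append({"run_id": len(runs) + 1,
                 "params": params,
                 "metrics": metrics,
                 "model_version": model_version})
    return runs[-1]["run_id"]

# Two training runs with different hyperparameters and their metrics.
log_run("ad-training", {"k": 3.0}, {"f1": 0.81}, "anomaly-detector:v1")
log_run("ad-training", {"k": 2.5}, {"f1": 0.84}, "anomaly-detector:v2")

# The registry lets us pick the best model version for serving.
best = max(metadata["experiments"]["ad-training"],
           key=lambda r: r["metrics"]["f1"])
print(best["model_version"])
```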
Serving – Automated

• The predictor is deployed from the Model Registry.
• New data is sent to the predictors for inference.
• S3-compatible FS: data available for training.
• MLMD: metadata registry (experiments, runs, models, model versioning).
Architecture Plan
• Iterative / in stages
• Each stage is built on top of the previous one

First Stage – The Minimalist Approach
• Have Vertica and a set of SQL and Python scripts with an orchestrator
• Use Kubeflow Pipelines as the orchestrator
• Data Engineers write pipelines using the Python DSL
• Advantages:
  • Provides an ML flavor for orchestration out of the box (experiments, runs, scheduling, etc.)
  • It is simpler to use with the Python SDK / DSL than pure YAML (as it would be with Argo Workflows)
  • Out-of-the-box components, and any function can be turned into a pod (no need for a Dockerfile, YAML, etc.)
• We can build on this approach later by adding other Kubeflow components
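The "pipelines are just Python functions" idea behind the minimalist approach can be sketched without Kubeflow at all. In Kubeflow Pipelines each function below would be wrapped as a component via the Python DSL and run in its own pod; here a toy orchestrator simply chains them, with made-up step names and data:

```python
# Each pipeline step is a plain Python function taking the previous result.
def extract(_):
    return [10, 12, 11, 50]              # stand-in for a Vertica SQL extract

def featurize(kpis):
    return [(v, v > 18) for v in kpis]   # stand-in for the FE python script

def store(features):
    return {"rows_written": len(features)}  # stand-in for writing to Vertica

# Toy orchestrator: runs the steps in order, passing results along.
def run_pipeline(steps):
    result = None
    for step in steps:
        result = step(result)
    return result

outcome = run_pipeline([extract, featurize, store])
print(outcome)
```

The payoff of Kubeflow Pipelines over this toy is exactly the advantages listed above: the same chain of functions gets experiments, runs, scheduling, and per-step pod isolation without hand-written Dockerfiles or YAML.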
Demo
• Vertica
• Kubeflow Pipelines
• Verticapy
• Elyra
Architecture Confluence Pages
• Base Page
• Architecture Questions
• Architecture Plan
• Vertica Based Architecture
Next Step
• We need a concrete use case to start implementing:
  • Input data (ideally in the form NI will provide it)
  • Data for training and testing a model, or an already trained model
  • Data for prediction
  • Feature Engineering logic/scripts
  • A model, trained or to be trained
  • Expected output