DeepLearning.AI
DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.
On Cloud
Model Servers
(Example: a client asks “What’s in this image?” and the model server returns the prediction “Dog: 0.99”.)
KFServing
TensorFlow Serving
● TF Models
● Non-TF Models
● Word Embeddings
● Vocabularies
● Feature Transformations
(TensorFlow Serving architecture: a Loader manages the aspired VERSIONS of each TF Servable; the DynamicManager applies a VersionPolicy to decide which versions to load and unload, and the CLIENT accesses a loaded version through a ServableHandle.)
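Once a servable version is loaded, clients typically query it over TensorFlow Serving's REST API. A minimal sketch of building such a request, assuming a hypothetical model name, input values, and the default REST port:

```python
import json

def build_predict_request(instances):
    """Build the JSON body for TensorFlow Serving's REST predict endpoint.

    The endpoint is POST /v1/models/<model_name>:predict with a body of the
    form {"instances": [...]} (row format).
    """
    return json.dumps({"instances": instances})

# Hypothetical input: two feature vectors for a model named "image_classifier".
body = build_predict_request([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
url = "http://localhost:8501/v1/models/image_classifier:predict"  # assumed host/port
print(body)
```

The response body contains a matching `{"predictions": [...]}` list, one entry per instance.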
(Diagram: HTTP/gRPC Inference Requests and Responses flow through per-model scheduler queues to framework backends, which load models from a Model Repository and run on multiple GPUs and CPUs.)
● A single GPU can serve multiple models from the same or different frameworks
● Per-model scheduler queues improve inference performance
● Supports model ensembles
Designed for Scalability
● Built on a Kubernetes stack: KFServing on top of Knative, Istio, and Kubernetes.
● Provides high-abstraction interfaces for common ML frameworks like TensorFlow, PyTorch, scikit-learn, etc.
Server
+ Increased Power
+ Upgrading
+ More RAM
+ Faster Storage
+ Scale up as needed
(Stack: Bin/Library → Operating System → Hardware)
Virtual Machine (VM) Architecture
VM Management
Hypervisor and Scaling
VM Limitations
Building Containers
Container Advantages
(Containers share the host Operating System and Hardware.)
Docker: Container Runtime
Container orchestration in production environments provides Service Management, Scheduling, and Resource Management:
● Manages scaling of containers
● Ensures reliability of containers
● Distributes resources between containers
(Diagram: several machines, each with its own Machine & OS and Container Runtime hosting containers.)
Optimizing ML Inference
(Observation → Prediction)
Metrics to optimize: Latency, Throughput, Cost
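Latency and throughput can be measured directly around a model call. A minimal sketch, where `fake_predict` is a hypothetical stand-in for a real model:

```python
import time

def fake_predict(batch):
    # Hypothetical stand-in for a real model call.
    return [x * 2 for x in batch]

def measure(batch, n_runs=100):
    """Time repeated calls and derive average latency and throughput."""
    start = time.perf_counter()
    for _ in range(n_runs):
        fake_predict(batch)
    elapsed = time.perf_counter() - start
    latency_ms = (elapsed / n_runs) * 1000        # average ms per request
    throughput = (n_runs * len(batch)) / elapsed  # examples served per second
    return latency_ms, throughput

latency_ms, throughput = measure(list(range(32)))
print(f"latency: {latency_ms:.3f} ms/request, throughput: {throughput:.0f} examples/s")
```

Note the tension the slides describe: larger batches usually raise throughput but also raise per-request latency, and both trade off against cost.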
Inference Optimization
Inference Optimization: Infrastructure
Inference Optimization: Model Architecture
Inference Optimization: Model Compilation
Additional Optimizations
(Observation → Data Store → Prediction: a data store is added between observation and prediction for caching and feature lookup.)

Popular products (NoSQL databases for caching and feature lookup):
● Single-digit-millisecond read latency; in-memory cache available
● In-memory cache; sub-millisecond read latency
● Scalable; handles dynamically changing data; milliseconds read latency
● Scalable; can handle slowly changing data; milliseconds read latency
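The caching-and-feature-lookup pattern can be sketched in a few lines, with a plain dict standing in for the external data store (the user key and feature values are hypothetical):

```python
# Hypothetical slow feature store (e.g., a remote database round trip).
SLOW_STORE = {"user_42": {"age_bracket": "25-34", "avg_session_min": 12.5}}

cache = {}  # in-memory cache: fast lookups for repeat requests

def get_features(key):
    """Return features for `key`, consulting the cache before the slow store."""
    if key in cache:
        return cache[key]                # fast path: in-memory read
    features = SLOW_STORE.get(key)       # slow path: remote read
    if features is not None:
        cache[key] = features            # populate cache for next time
    return features

feats = get_features("user_42")          # first call hits the slow store
feats_again = get_features("user_42")    # second call is served from cache
```

A production system would add eviction and expiry so cached features do not go stale, which is why the slide distinguishes stores for slowly versus dynamically changing data.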
Data Preprocessing and Inference
(Observation → Prediction)
Preprocessing Operations Needed Before Inference
● Change data format for the model
● One-hot encoding
● Vectorization
● Apply the same feature selection done during training on incoming data and features fetched from cache
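These steps can be sketched in plain Python. The category vocabulary and feature-selection indices below are hypothetical; a real pipeline would reuse the exact transformations fitted at training time:

```python
CATEGORIES = ["cat", "dog", "bird"]   # hypothetical training-time vocabulary
SELECTED = [1, 3]                     # hypothetical feature-selection indices

def one_hot(label):
    """One-hot encode a categorical label against the training vocabulary."""
    return [1.0 if c == label else 0.0 for c in CATEGORIES]

def vectorize(record):
    """Turn a raw record into a flat numeric feature vector."""
    return one_hot(record["species"]) + [float(record["weight_kg"])]

def select_features(vector):
    """Apply the same feature selection used during training."""
    return [vector[i] for i in SELECTED]

vec = vectorize({"species": "dog", "weight_kg": 8.2})
features = select_features(vec)
```

The key point from the slide: every transformation (vocabulary, selection indices, scaling) must match training exactly, or the model sees inputs from a different distribution.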
Processing After Obtaining Predictions
Batch Inference
(Diagram: inputs X are processed as a batch into predictions y1, y2, y3, …, yN.)
● Reduced cost of the ML system
● Complex machine learning models can be used for improved accuracy
● Caching not required
● Longer data retrieval: not a problem, as predictions are not in real time
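The batching step itself can be sketched as follows, where `fake_predict` is a hypothetical stand-in for a real (possibly expensive) model:

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list of inputs."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def fake_predict(batch):
    # Hypothetical stand-in: a real system would call the model here.
    return [x * 2 for x in batch]

inputs = list(range(10))   # X: the accumulated observations
predictions = []           # y1 ... yN
for batch in batched(inputs, batch_size=4):
    predictions.extend(fake_predict(batch))
```

Because no caller is waiting on any single request, the batch size can be tuned purely for throughput and cost rather than latency.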
Limitations of Batch Inference
(Example: recommendations drawn only from users in the same age bracket.)
Throughput
○ Transforming data
● From there, data can be consumed for training or for making predictions using ML models.
Distributed Processing
(Diagram: Data Sources → Extract → a Cluster Manager dispatches work to Workers → Transformation Engine → Load → Target; a Batch Prediction Job then runs on the Prepared Data.)
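The cluster-manager/worker split can be sketched in miniature with a thread pool standing in for the cluster; this is purely illustrative, with hypothetical data and a trivial transformation:

```python
from concurrent.futures import ThreadPoolExecutor

def transform(chunk):
    """Worker task: the transformation applied to one chunk of extracted data."""
    return [x + 1 for x in chunk]

# "Extract": split the source data into chunks for the workers.
chunks = [[1, 2], [3, 4], [5, 6]]

# "Cluster manager": a pool dispatching chunks to workers in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(transform, chunks))

# "Load": flatten the transformed chunks into the target.
target = [x for chunk in results for x in chunk]
```

Real distributed engines (e.g., Apache Beam or Spark) add what this sketch omits: fault tolerance, shuffling, and scheduling across many machines.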
ETL Pipeline Components
Extract (Data Sources):
● CSV files
● XML
● JSON
● APIs
● Data Lake (like cloud storage)
Transform: e.g., Cloud Dataflow
Load (Target):
● Data Warehouse (e.g., BigQuery)
● Data Mart
● Data Lake (e.g., Cloud Storage)

Stream Processing
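The Extract → Transform → Load flow can be sketched end-to-end in a few lines; the CSV content, schema, and filter threshold here are hypothetical:

```python
import csv
import io
import json

# "Extract": read rows from a CSV data source (an in-memory example here).
raw = "id,score\n1,0.9\n2,0.4\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# "Transform": cast types and keep only high-scoring rows.
transformed = [
    {"id": int(r["id"]), "score": float(r["score"])}
    for r in rows
    if float(r["score"]) >= 0.5
]

# "Load": serialize to the target format (JSON lines for a warehouse load job).
target = "\n".join(json.dumps(r) for r in transformed)
```

From the loaded target, the data can then be consumed for training or for batch prediction, as the slides describe.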
(Diagram: Streaming Data Sources → Transform (Transformation Engine) → Load → Target, e.g., another pipeline; the output is consumed by the ML pipeline for training/prediction.)
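The streaming variant differs from batch ETL in that events are transformed as they arrive. A minimal generator-based sketch, with hypothetical click events and an arbitrary enrichment rule:

```python
def stream_source():
    """Hypothetical streaming data source: yields events one at a time."""
    for event in [{"clicks": 1}, {"clicks": 3}, {"clicks": 2}]:
        yield event

def transform(events):
    """Transformation engine: enrich each event as it arrives."""
    for e in events:
        yield {**e, "high_engagement": e["clicks"] >= 2}

# "Load": the downstream target (another pipeline) consumes the transformed
# stream, e.g., for training or prediction in the ML pipeline.
loaded = list(transform(stream_source()))
```

Generators keep the pipeline lazy: nothing is processed until the target pulls, mirroring how stream processors handle unbounded sources.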