
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.

For the rest of the details of the license, see


https://creativecommons.org/licenses/by-sa/2.0/legalcode
Model Serving Architecture

Model Serving: Patterns and Infrastructure
ML Infrastructure

On Prem
● Train and deploy on your own hardware infrastructure
● Manually procure hardware (GPUs, CPUs, etc.)
● Cost-effective for large companies running ML projects over a long period

On Cloud
● Train and deploy on the cloud, choosing from several service providers
  ○ Amazon Web Services, Google Cloud Platform, Microsoft Azure, etc.
Model Serving

On Prem
● Can use open-source, pre-built model servers
  ○ TF Serving, KFServing, NVIDIA Triton, and more...

On Cloud
● Create VMs and use open-source, pre-built model servers
● Or use the managed ML workflow provided by the cloud platform
Model Servers

● Simplify the task of deploying machine learning models at scale.
● Can handle scaling, performance, and some model lifecycle management.

[Diagram: a client asks "What's in this image?"; the model file is loaded by the model server, which exposes REST / gRPC endpoints and returns "Dog 0.99"]
Model Servers

● TensorFlow Serving
● TorchServe
● KFServing
TensorFlow Serving

Supports many servables:
● TF models
● Non-TF models
● Word embeddings
● Vocabularies
● Feature transformations

● Out-of-the-box integration with TensorFlow models
● Batch and real-time inference
● Multi-model serving
● Exposes gRPC and REST inference endpoints
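As a concrete illustration of the REST endpoint, here is a minimal sketch of a prediction request to a running TensorFlow Serving instance. The model name ("image_classifier"), host, and input shape are illustrative assumptions; only the endpoint format and the "instances"/"predictions" JSON fields come from TensorFlow Serving's documented REST API.

import json
import requests  # third-party HTTP client

# TensorFlow Serving's REST prediction endpoint has the form:
#   http://<host>:8501/v1/models/<model_name>:predict
# "image_classifier" and the 224x224x3 input are illustrative assumptions.
SERVER_URL = "http://localhost:8501/v1/models/image_classifier:predict"

# One observation per entry in "instances"; here a dummy image of zeros.
payload = {"instances": [[[[0.0] * 3] * 224] * 224]}

response = requests.post(SERVER_URL, data=json.dumps(payload), timeout=10)
response.raise_for_status()

# The server answers with a JSON body containing a "predictions" field.
print(response.json()["predictions"])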
TensorFlow Serving Architecture

[Architecture diagram: a Source watches the file system and emits aspired versions; the DynamicManager in the core, guided by a VersionPolicy, uses Loaders to load TF Servables; the client retrieves a ServableHandle from the manager]

Legend:
Blue: Caller-specific code
Green: TF Serving core classes / APIs
Yellow: Pluggable implementations
NVIDIA Triton Inference Server

● Simplifies deployment of AI models at scale in production.
● Open-source inference serving software.
● Deploys trained models from any framework:
  ○ TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework
● Models can be stored on local storage, AWS S3, or Google Cloud Platform storage.
● Runs on any CPU-GPU architecture (cloud, data center, or edge).
● HTTP REST and gRPC endpoints are supported.
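A minimal sketch of querying Triton's HTTP endpoint with the official tritonclient library. The model name ("resnet50"), tensor names, and shapes are illustrative assumptions; they would have to match the config.pbtxt of the model in your Triton model repository.

import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to Triton's default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy batch of one image; shape/dtype are assumptions for this sketch.
image = np.zeros((1, 3, 224, 224), dtype=np.float32)

infer_input = httpclient.InferInput("input__0", image.shape, "FP32")
infer_input.set_data_from_numpy(image)

result = client.infer(model_name="resnet50", inputs=[infer_input])
print(result.as_numpy("output__0"))  # raw output tensor from the server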
Architecture

Triton Inference Server architecture supports:

● A single GPU serving multiple models from the same or different frameworks
● Multi-GPU for the same model
  ○ Can run instances of a model on multiple GPUs for increased inference performance
● Model ensembles
● Client apps can also link directly to the C API

[Architecture diagram: a client application (Python/C++ client library) sends inference requests over HTTP or gRPC, or links directly to the C API; Triton loads models from a model repository, performs model management, queues requests in per-model scheduler queues, dispatches them to framework backends (or a custom backend) running on GPUs and CPUs, and returns inference responses; status, health, and metrics are exported over HTTP]
Designed for Scalability

Data Center | Cloud | Edge

[Diagram: front-end client applications send requests, optionally through a load balancer, to multiple Triton Inference Server instances in an AI inference cluster (CPU | GPU) that share an AI model repository]

Can integrate with Kubeflow Pipelines for an end-to-end AI workflow.


TorchServe
● Model serving framework for PyTorch models
● A joint initiative from AWS and Facebook
TorchServe

● Batch and real-time inference
● Supports REST endpoints
● Default handlers for image classification, object detection, image segmentation, and text classification
● Multi-model serving
● Monitoring, detailed logs, and customized metrics
● A/B testing
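A minimal sketch of calling TorchServe's REST inference API. The model name ("resnet-18") and the local image path are illustrative assumptions; the endpoint format (POST to /predictions/<model_name> on port 8080) is TorchServe's documented default.

import requests

# TorchServe's inference API listens on port 8080 by default:
#   POST http://<host>:8080/predictions/<model_name>
url = "http://localhost:8080/predictions/resnet-18"

# Send the raw image bytes; the default image handler decodes them server-side.
with open("kitten.jpg", "rb") as f:
    response = requests.post(url, data=f, timeout=10)

response.raise_for_status()
print(response.json())  # e.g. class labels with probabilities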
TorchServe Architecture
KFServing

● Enables serverless inferencing on Kubernetes.
● Provides high-level abstraction interfaces for common ML frameworks like TensorFlow, PyTorch, scikit-learn, etc.

[Stack diagram: KFServing handles the pre-process, predict, post-process, and explain steps; it runs on Knative and Istio, on top of Kubernetes, on a compute cluster of CPUs, GPUs, and TPUs]
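A minimal sketch of querying a deployed KFServing InferenceService. KFServing exposes the same v1 prediction protocol shown earlier (POST /v1/models/<name>:predict with an "instances" payload); the ingress address, Host header, and the "sklearn-iris" model name below are illustrative assumptions for a cluster where such an InferenceService has been deployed.

import requests

ingress = "http://<INGRESS_IP>"                    # cluster ingress gateway (assumption)
host_header = "sklearn-iris.default.example.com"   # hostname assigned to the InferenceService (assumption)

payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}    # one observation

resp = requests.post(
    f"{ingress}/v1/models/sklearn-iris:predict",
    json=payload,
    headers={"Host": host_header},                 # routes the request to the service
    timeout=10,
)
print(resp.json())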


Scaling Infrastructure

Model Serving: Patterns and Infrastructure
Why is Scaling Important?

Vertical scaling: a more powerful server
+ Increased power by upgrading
+ More RAM
+ Faster storage
+ Adding or upgrading GPUs/TPUs

Horizontal scaling: more servers
+ More CPUs/GPUs instead of bigger ones
+ Scale up as needed
+ Scale back down to minimums


Why Horizontal Over Vertical Scaling

Benefit of elasticity
• Shrink or grow the number of nodes based on load, throughput, and latency requirements

Application never goes offline for scaling
• No need to take existing servers offline

No limit on hardware capacity
• Add more nodes at any time, at increased cost
Typical System Architecture

[Stack diagram: applications run on shared binaries/libraries, a single operating system, and the underlying hardware]
Virtual Machine (VM) Architecture

[Stack diagram: on top of the hardware and host operating system, a hypervisor runs multiple virtual machines; each VM has its own guest operating system, binaries/libraries, and applications]
VM Management

[Same VM architecture diagram]

Hypervisor and Scaling

[Same VM architecture diagram]

VM Limitations

[Same VM architecture diagram]
Building Containers

[Diagram: compared with VMs (hypervisor plus a guest OS per VM), containers package applications with their binaries/libraries and share a single operating system through a container runtime]
Containers Advantages

[Diagram: multiple containers, each with its own apps and binaries/libraries, share one container runtime, operating system, and hardware]

● Fewer OS requirements - more apps!
● Abstraction
● Easy deployment based on the container runtime
Docker: Container Runtime

● Open-source container technology
● Most popular container runtime
● Started as container technology for Linux
● Available for Windows applications as well
● Can be used in data centres, on personal machines, or in the public cloud
● Docker partners with major cloud services for containerization

[Diagram: containers with their apps and binaries/libraries run on Docker, which runs on the operating system and hardware]
Enter Container Orchestration

● Manages the life cycle of containers in production environments
● Manages scaling of containers
● Ensures reliability of containers
● Distributes resources between containers
● Monitors the health of containers

[Diagram: apps and services run on an orchestration layer (service management, scheduling, resource management) that sits above container runtimes on multiple machines in the machine infrastructure]


Popular Container Orchestration Tools

● Kubernetes
● Docker Swarm
ML Workflows on Kubernetes - Kubeflow

● Dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.
● Anywhere you are running Kubernetes, you should be able to run Kubeflow.
● Can run on premises or on managed Kubernetes engines from cloud offerings such as AWS, GCP, and Azure.
Online Inference

Model Serving: Patterns and Infrastructure
Online Inference

[Diagram: an observation is sent to the model through a REST API, and a prediction is returned]

● The process of generating machine learning predictions in real time, upon request.
● Predictions are generated on a single observation of data at runtime.
● Predictions can be generated on demand, at any time of day.
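A minimal sketch of an online-inference REST endpoint: one observation in, one prediction out. The FastAPI app, route name, and the fake_model function are illustrative assumptions; in practice you would load a trained model at startup (e.g. with joblib or a TensorFlow/PyTorch loader) instead of the hard-coded rule below.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Observation(BaseModel):
    features: list[float]   # a single observation's feature vector

def fake_model(features: list[float]) -> float:
    # Stand-in for a real model's predict() call.
    return sum(features) / max(len(features), 1)

@app.post("/predict")
def predict(obs: Observation):
    # One observation in, one prediction out, served on demand.
    return {"prediction": fake_model(obs.features)}

# Assuming this file is saved as online_inference.py, run with:
#   uvicorn online_inference:app --port 8000
# and query: POST http://localhost:8000/predict  {"features": [1.0, 2.0, 3.0]}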
Optimising ML Inference

Metrics to optimize:
● Latency
● Throughput
● Cost
Inference Optimization

● Infrastructure
● Model Architecture
● Model Compilation
Additional Optimizations

[Diagram: the REST API passes an observation to the model and returns a prediction; a data store sits alongside the serving path]
Additional Optimizations

[Diagram: the serving path (observation → REST API → model → prediction) backed by a data store; popular products for the data store are listed below]

NoSQL Databases: Caching and Feature Lookup

● In-memory cache available, single-digit-millisecond read latency
● In-memory cache, sub-millisecond read latency
● Scalable, handles dynamically changing data, millisecond read latency
● Scalable, can handle slowly changing data, millisecond read latency

These resources are expensive.
Carefully choose caching requirements based on your needs.
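A minimal sketch of the read-through caching idea, assuming an in-memory dict stands in for a managed cache (e.g. Redis/Memorystore) and slow_feature_lookup stands in for a NoSQL feature-store read; both names are hypothetical.

import time

_cache: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 60.0  # how long cached features stay fresh

def slow_feature_lookup(user_id: str) -> dict:
    # Placeholder for a millisecond-latency NoSQL read.
    time.sleep(0.01)
    return {"user_id": user_id, "avg_purchase": 42.0}

def get_features(user_id: str) -> dict:
    now = time.time()
    hit = _cache.get(user_id)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]                      # fast in-memory hit
    features = slow_feature_lookup(user_id)
    _cache[user_id] = (now, features)      # populate cache for later requests
    return features

print(get_features("user-123"))  # first call hits the store, later calls hit the cache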
Data Preprocessing

Model Serving: Patterns and Infrastructure
Data Preprocessing and Inference

[Diagram: observation → REST API → model → prediction]
Preprocessing Operations Needed Before Inference

Data Cleansing
● Correcting invalid values in incoming data

Feature Tuning
● Normalization
● Clipping outliers
● Imputing missing values

Feature Construction
● Combine inputs
● Feature crossing
● Polynomial expansion

Representation Transformation
● Change data format for the model
● One-hot encoding
● Vectorization

Feature Selection
● Apply the same feature selection done during training on incoming data and on features fetched from the cache
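A minimal sketch of applying the same preprocessing at inference time that was fitted at training time, using scikit-learn. The column names, example values, and the choice of imputation/scaling/encoding steps are illustrative assumptions.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train_df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000.0, 52000.0, 61000.0],
    "country": ["US", "IN", "US"],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # imputing missing values
        ("scale", StandardScaler()),                     # normalization
    ]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # one-hot encoding
])

preprocessor.fit(train_df)          # fit once on training data...

new_obs = pd.DataFrame({"age": [29], "income": [float("nan")], "country": ["US"]})
print(preprocessor.transform(new_obs))   # ...and reuse on data arriving for inference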
Processing After Obtaining Predictions

[Diagram: observation → REST API → model → prediction]
Batch Inference

Model Serving: Patterns and Infrastructure
Batch Inference

[Diagram: a batch of observations X1, X2, X3, ... XN is scored to produce predictions y1, y2, y3, ... yN]

● Generates predictions on a batch of observations.
● Batch jobs are often run on a recurring schedule.
● Predictions are stored and made available to developers or end users.
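A minimal sketch of a recurring batch-prediction job: read a batch of observations, score them in chunks, and persist the predictions for later retrieval. The predict function, chunk size, and output file name are illustrative assumptions standing in for a real model and storage system.

import numpy as np

def predict(batch: np.ndarray) -> np.ndarray:
    # Placeholder for model.predict(batch) on a trained model.
    return batch.sum(axis=1)

def run_batch_job(observations: np.ndarray, chunk_size: int = 1024) -> np.ndarray:
    predictions = []
    for start in range(0, len(observations), chunk_size):
        chunk = observations[start:start + chunk_size]
        predictions.append(predict(chunk))   # throughput matters more than per-row latency
    return np.concatenate(predictions)

X = np.random.rand(10_000, 8)          # X1 ... XN
y = run_batch_job(X)                   # y1 ... yN
np.save("predictions.npy", y)          # stored and made available to downstream users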
Advantages of Batch Inference

● Reduced cost of the ML system
● Complex machine learning models can be used for improved accuracy
● Caching is not required
● Longer data retrieval is acceptable, since predictions are not needed in real time
Limitations of Batch Inference

● Long update latency: predictions are based on older data
● Example: recommendations limited to the same age bracket or the same geolocation
Important Metrics to Optimize

The most important metric to optimize when performing batch predictions is throughput.

● The prediction service should be able to handle large volumes of inferences at a time.
● Predictions need not be available immediately.
● Latency can be compromised.


How to Increase Throughput?

● Use hardware accelerators like GPUs and TPUs.
● Increase the number of servers/workers.
  ○ Load several instances of the model on multiple workers to increase throughput.
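A minimal sketch of the second point: loading one model instance per worker process and splitting the batch across workers. The stand-in model, worker count, and chunking scheme are illustrative assumptions.

import numpy as np
from multiprocessing import Pool

_model = None

def _init_worker():
    global _model
    # Stand-in for loading a real model once per worker process.
    _model = lambda batch: batch.sum(axis=1)

def _score(chunk: np.ndarray) -> np.ndarray:
    return _model(chunk)

if __name__ == "__main__":
    X = np.random.rand(100_000, 8)
    chunks = np.array_split(X, 8)              # one chunk per worker

    with Pool(processes=8, initializer=_init_worker) as pool:
        parts = pool.map(_score, chunks)       # workers score their chunks in parallel

    y = np.concatenate(parts)
    print(y.shape)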


Use Case - Product Recommendations

[Diagram: items purchased by a user → similar items → recommendations]

● E-commerce sites generate new recommendations on a recurring schedule.
● The recommendations are cached for easy retrieval.
● Batch prediction enables the use of more predictors to train more complex models.
  ○ Helps personalization to a greater degree, but with delayed data.
Use Case - Sentiment Analysis

● User sentiment based on customer reviews.
● No need for real-time prediction for this problem.
● CNN, RNN, or LSTM models all work for this problem.
● These models are more complex, but more accurate for the problem.
● More cost-effective to use them with batch prediction.

[Diagram examples: "My experience so far has been fantastic" → POSITIVE; "The product is ok I guess" → NEUTRAL; "Your support team is pathetic" → NEGATIVE]
Use Case - Demand Forecasting

● Estimate the demand for products to optimize inventory and ordering.
● Predict the future based on historical data (time series).
● Many models are available, as this is a batch prediction problem.
Batch Inference

Using ML Models with Distributed Batch and Stream Processing Systems
Data Processing - Batch and Streaming

● Data can be of different types based on the source.
● Batch Data
  ○ Batch processing can be done on data available in huge volumes in data lakes, CSV files, log files, etc.
● Streaming Data
  ○ Real-time streaming data, like data from sensors.
ETL on Data

● Before data is used for making batch predictions, it has to be:
  ○ Extracted from multiple sources like log files, streaming sources, APIs, apps, etc.
  ○ Transformed
  ○ Loaded into a database for prediction

This is done using ETL pipelines.


ETL Pipelines

[Diagram: data sources → Extract → transformation engine (Transform) → Load → target, followed by a batch prediction job on the prepared data]

● A set of processes for:
  ○ Extracting data from data sources
  ○ Transforming the data
  ○ Loading it into an output destination, like a data warehouse
● From there, data can be consumed for training or for making predictions using ML models.
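A minimal sketch of a batch ETL pipeline with Apache Beam, the SDK that Cloud Dataflow (mentioned below) runs. File paths, field names, and the parse function are illustrative assumptions; with the DataflowRunner the same pipeline executes distributed across many workers.

import json
import apache_beam as beam

def parse_line(line: str) -> dict:
    # Extract fields from a CSV line; column layout is an assumption for this sketch.
    user_id, amount = line.split(",")
    return {"user_id": user_id, "amount": float(amount)}

with beam.Pipeline() as pipeline:              # local DirectRunner by default
    (
        pipeline
        | "Extract" >> beam.io.ReadFromText("input.csv")         # E: read raw records
        | "Transform" >> beam.Map(parse_line)                     # T: clean / reshape
        | "Serialize" >> beam.Map(json.dumps)
        | "Load" >> beam.io.WriteToText("prepared/part")          # L: write prepared data
    )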
Distributed Processing

[Diagram: a cluster manager coordinates extract, transform, and load work across multiple workers, from the data sources to the target]

● ETL can be performed on huge volumes of data in a distributed manner.
● Data is split into chunks and processed in parallel by multiple workers.
● The results of the ETL workflow are stored in a database.
● This results in lower latency and higher throughput of data processing.
ETL Pipeline Components: Batch Processing

Data sources
● CSV files
● XML
● JSON
● APIs
● Data lake (like cloud storage)

Transformation engine
● e.g., Cloud Dataflow

Load targets
● Data warehouse (e.g., BigQuery)
● Data mart
● Data lake (e.g., Cloud Storage)

The prepared data then feeds a batch prediction job.
ETL Pipeline Components: Stream Processing

Streaming data sources
● e.g., Pub/Sub

Transformation engine
● e.g., Cloud Dataflow (streaming)

Load targets
● Data warehouse (e.g., BigQuery)
● Data mart
● Data lake (e.g., Cloud Storage)
● Streaming data for another pipeline

The loaded data is consumed by the ML pipeline for training and prediction.
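A minimal sketch of the streaming variant with Apache Beam: read events from Pub/Sub, decode them, and load them into BigQuery for downstream training or prediction. The project, subscription, table, and schema names are illustrative assumptions.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)      # run in streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Extract" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Load" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="user_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )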
