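Koordinator: Design Considerations for Cloud-Native AI and Big Data Scenarios

Challenges that AI and big data scenarios pose to scheduling systems
曾凡松 (Zeng Fansong)
Cloud Native Container Service
Project Introduction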
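Why was Koordinator created?
Enterprise production relies on diverse computing resources and workloads.
Heterogeneous resources + Diverse workloads * Multi-dimensional requirements
Heterogeneous resources (GPU, FPGA)
• Heterogeneous computing power
• Topology dependencies
• Resource pooling
Diverse workloads
• Stateless
• Stateful
• Jobs
Multi-dimensional business requirements
• Delivery efficiency
• Workload priority
• Stability/QoS
• Resource utilization
• Cost optimization
• Security isolation
Diverse resources, diverse constraints, dynamic decisions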

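Evolution of Alibaba's scheduling system
[Figure: timeline of Alibaba's scheduling systems across Taobao, Tmall and Xianyu, covering e-commerce trading, big data and search]
Stages: independent development, platformization, cloud native, standardization
Periods shown: 2016 ~ 2019, 2018 ~ 2021, 2021 ~ now
• Independent development: trading on sigma, big data on fuxi, search on hippo
• Later stages: unified scheduling of trading and big data, Kubernetes, Koordinator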

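Koordinator Overview
A QoS-based scheduling system brings optimal layout and status to workloads
Diverse workloads (Pods), orchestrated in a unified way
• QoS-aware scheduling and rescheduling
• Differentiated SLO policies
• Job scheduling
Characteristics
• Standardized: the colocation system is fully compatible with standard Kubernetes
• Highly efficient: used at large scale in Alibaba's production systems, with CPU utilization above 50%
• Low onboarding cost: little intrusion into workloads; colocation policies are injected through configuration
Runs on Kubernetes, on bare-metal instances and virtual machine instances
Project site: koordinator.sh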
Koordinator Architecture
A QoS-based scheduling system brings optimal layout and status to workloads
Goals:
1. Flexible Scheduling Policies
2. Improved Resource Utilization
3. Enhanced Workload Performance
4. Easy Integration
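Koordinator: Design Considerations for AI and Big Data Scenarios
What is resource scheduling?
[Figure: User 1, User 2 and User 3 submit requests to a scheduler, which places them on Nodes]
Resource scheduling is the problem of matching limited resources against diverse demands. Specifically:
1. Whose demands should be satisfied first?
2. Resources have been allocated but utilization is low; how can efficiency be improved?
3. How can task deadlock be avoided?
4. Resources have topology attributes; how can the placement structure be optimized?
5. How can the upper-layer application ecosystem be supported?
6. How can security between applications be guaranteed?
7. …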
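Why based on the Scheduling Framework?
Build incrementally, push forward the technical evolution of the upstream community, and benefit a wider audience.
Pros
• Standardization: seamless compatibility with the community plugin ecosystem
• Flexibility: rich extension points that can implement diverse scheduling requirements
• Learning curve: abundant examples and documentation
Cons
• Performance: the framework's constraints put an upper bound on performance
• Completeness of extension points: queue management and compatibility handling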
Resource Model
The resource model is the foundation on which a scheduling system is built.
● Priority Class
○ Resources are divided into 4 levels: Prod, Mid, Batch and Free.
○ Resources that are requested but not used are made available to lower-priority pods.
● Runtime QoS
○ Lower-priority pods will be throttled or killed if there is resource contention.
Priority    Suitable Workloads
prod        Latency-sensitive services.
mid         Streaming jobs or near-line jobs.
batch       Offline batch jobs (big data, OLAP queries).
free        Offline testing jobs.
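The idea that resources requested by high-priority pods but left unused can be handed to lower-priority pods can be illustrated with a small calculation. This is a simplified sketch, not Koordinator's actual formula: the function name, the `reserve_percent` parameter and the numbers are assumptions chosen for illustration, and CPU is expressed in milli-cores.

```python
def batch_allocatable_milli(capacity, prod_requested, prod_used, reserve_percent=10):
    """Estimate CPU (milli-cores) that could be offered to batch pods.

    Hypothetical model: everything the node has, minus what prod pods
    actually use, minus a safety reserve held back for prod bursts.
    """
    # Safety reserve kept free so prod pods can burst (assumed policy).
    reserved = capacity * reserve_percent // 100
    # Capacity that no prod pod has requested at all.
    unallocated = capacity - prod_requested
    # Capacity that prod pods requested but are not using right now.
    requested_but_idle = max(0, prod_requested - prod_used)
    return max(0, unallocated + requested_but_idle - reserved)


# A 64-core node where prod pods request 48 cores but use only 16.
print(batch_allocatable_milli(64000, 48000, 16000))
```

In this example the batch tier is offered more than the 16 unallocated cores, because the idle part of the prod requests is reclaimed as well. That is why the runtime QoS layer must be able to throttle or evict batch pods when prod usage rises.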
Elastic Quota Scheduling
Tree-type resource quota management
Requirements:
● Fair sharing of available capacity between multiple tenants
● Ability to set policies: who can use what, and up to what limit
[Figure: quota tree]
Root | Quota=100
  Quota-1 | Min=20, Max=40
    Quota-1.1 | Min=10, Max=20
    Quota-1.2 | Min=10, Max=50
  Quota-2 | Min=40, Max=80
    Quota-2.1 | Min=10, Max=60
    Quota-2.2 | Min=30, Max=80
  Quota-3 | Min=30, Max=50
    Quota-3.1 | Min=10, Max=40
  …
  Quota-n | Min=x, Max=y
    Quota-n.1 | Min=a, Max=b
Legend: committed resources, elastic resources, unused resources, freely usable resources
Koordinator Elastic Quota
Multi-level quota scheduling: the quota tree supports scheduling across multiple levels, and resources can be quickly identified and preempted so that high-priority jobs obtain resources within a short time. This lets users finely manage resource usage across many job types.
Scheduling:
● Compatible with ElasticQuota CRD
● Support to configure Quota by tree structure
● Fair sharing and support to configure shared weight
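The Min/Max semantics and weighted fair sharing can be sketched for a single level of the quota tree. This is a simplified illustration under stated assumptions, not Koordinator's implementation: the function name `share_quota`, the `request` and `weight` fields, and the request values are hypothetical, and the sum of guaranteed amounts is assumed not to exceed the total. Only the Min/Max values come from the slide.

```python
def share_quota(total, quotas):
    """Split `total` among sibling quotas.

    Each quota first receives min(request, min). The remainder is then
    shared by weight among quotas that still want more, never exceeding
    min(request, max) for any quota.
    """
    alloc = {n: min(q["request"], q["min"]) for n, q in quotas.items()}
    cap = {n: min(q["request"], q["max"]) for n, q in quotas.items()}
    remaining = total - sum(alloc.values())
    eps = 1e-9
    while remaining > eps:
        hungry = [n for n in quotas if cap[n] - alloc[n] > eps]
        if not hungry:
            break  # everyone is satisfied; the rest stays unused
        weight_sum = sum(quotas[n]["weight"] for n in hungry)
        budget = remaining
        for n in hungry:
            offer = budget * quotas[n]["weight"] / weight_sum
            grant = min(offer, cap[n] - alloc[n])
            alloc[n] += grant
            remaining -= grant
    return alloc


quotas = {
    "quota-1": {"min": 20, "max": 40, "request": 40, "weight": 1},
    "quota-2": {"min": 40, "max": 80, "request": 30, "weight": 1},
    "quota-3": {"min": 30, "max": 50, "request": 50, "weight": 1},
}
print(share_quota(100, quotas))
```

Here quota-2 asks for less than its Min, so the 10 units it leaves idle, together with the 10 units not covered by any Min, are lent equally to quota-1 and quota-3. A full implementation would apply the same step recursively down the tree and reclaim the loan through preemption when quota-2's demand returns.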
Coscheduling/Gang Scheduling
Scheduling Pods in group
Requirements
• All pods of a job must be coordinated to start together
• AI jobs such as TFJob have multiple roles
• Each role needs its own minimum number of available replicas
• Resources are allocated together for at least the minimum available number of pods of a job
Koordinator Coscheduling
• Koordinator enhances scheduler-plugins/coscheduling
• Compatible with the PodGroup CRD
• Schedules multiple PodGroups together, and divides PodGroups by role
• Integrates with KubeDL, allowing a seamless switch to the Koordinator scheduler
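The all-or-nothing requirement with a per-role minimum can be sketched as a simple admission check. This is an illustrative model only: the function name, the role tuples and the notion of a "slot" (room for one pod) are assumptions, and a real scheduler evaluates per-node resources rather than a single counter.

```python
def gang_admissible(roles, free_slots):
    """Decide whether a multi-role job may start.

    roles maps a role name to (pending_replicas, min_available).
    The job is admitted only if every role can reach its minimum
    and the cluster has room for all the minimums at once.
    """
    needed = 0
    for replicas, min_available in roles.values():
        if replicas < min_available:
            return False  # this role can never reach its minimum
        needed += min_available
    return needed <= free_slots


# Hypothetical TFJob: 2 parameter servers (all required), 8 workers (4 required).
tfjob = {"ps": (2, 2), "worker": (8, 4)}
print(gang_admissible(tfjob, free_slots=6))
print(gang_admissible(tfjob, free_slots=5))
```

Rejecting the job as a whole when the minimums do not fit avoids the deadlock in which several jobs each hold part of the resources and none can make progress.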
Topology Aware
The origin of topology-aware scheduling in early big data jobs
[Figure: a file of about 1 PB is split into BLOCK-1 to BLOCK-5, with three replicas of each block spread across five Nodes under three ToR switches that connect to a core switch. A Job runs mapper-1, mapper-2, mapper-3 and a reducer.]
In a distributed job, each mapper processes a different data shard:
• Keep computation and data on the same machine where possible
• Otherwise keep them as close as possible in the network topology
Topology Aware
Big data computing under a storage-compute separated architecture; large-model training driven by AIGC
[Figure: region, DFS, two AZs, Nodes, alluxio/jindofs]
Topology Aware
Topology arrangement of pods will significantly affect training efficiency
Example with 4 workers: w0->w1->w2->w3->w0 (Ring AllReduce)
Performance gap: 30% ~ 40% at 512 GPUs
[Figure: two placements of workers W0 to W3 on GPU machines A to D under switches PSW-1, ASW-1 and ASW-2, comparing random scheduling with topology-order-aware scheduling]
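Why placement order matters for Ring AllReduce can be shown by counting how many ring links cross a switch boundary. The sketch below is illustrative: it assumes machines A and B sit under ASW-1 and machines C and D under ASW-2, and both placements are hypothetical examples rather than values read from the figure.

```python
def cross_switch_hops(placement, switch_of):
    """Count ring links w[i] -> w[i+1] whose two machines are under different switches.

    placement[i] is the machine that hosts worker i; the last worker
    links back to worker 0 to close the ring.
    """
    n = len(placement)
    hops = 0
    for i in range(n):
        src = switch_of[placement[i]]
        dst = switch_of[placement[(i + 1) % n]]
        if src != dst:
            hops += 1
    return hops


# Assumed topology: two machines per access switch.
switch_of = {"A": "ASW-1", "B": "ASW-1", "C": "ASW-2", "D": "ASW-2"}

random_placement = ["A", "C", "B", "D"]   # w0 on A, w1 on C, w2 on B, w3 on D
ordered_placement = ["A", "B", "C", "D"]  # ring neighbours share a switch where possible

print(cross_switch_hops(random_placement, switch_of))
print(cross_switch_hops(ordered_placement, switch_of))
```

With the random placement every ring link leaves its access switch, while the ordered placement needs only the two unavoidable crossings. Fewer crossings means less traffic through the upper-layer switch, which is the effect a topology-order-aware scheduler aims for.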
Resource Sharing
Koordinator Fine-grained Device Scheduling
• koord-scheduler is responsible for allocating devices
• koordlet is responsible for reporting device information
• The koordlet built-in plugin is responsible for applying the device allocation
• koord-runtime-proxy acts as a CRI proxy that forwards CRI requests to koordlet, triggering the built-in plugin
• Device information is described through a CRD
Resource Sharing
Future: NRI Mode Resource Management
[Figure: Runtime Proxy mode compared with NRI mode]
AI workload ecosystem
Job Queueing – kube-queue/kueue
• Job level priority
• Fairness across multi-tenant queues
• Extensible queuing strategies
• Quota based admission
• Priority
• Submission timestamp
• Production/BestEffort
• …
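One possible queuing strategy from the list above, ordering by job priority and then by submission timestamp, can be sketched with a heap. This is a generic illustration and does not use the kube-queue or kueue APIs; the class and method names are hypothetical.

```python
import heapq


class JobQueue:
    """Orders jobs by priority (higher first), then by submission time (earlier first)."""

    def __init__(self):
        self._heap = []

    def submit(self, name, priority, submitted_at):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, submitted_at, name))

    def pop(self):
        _, _, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


queue = JobQueue()
queue.submit("nightly-report", priority=1, submitted_at=1)
queue.submit("train-model-a", priority=5, submitted_at=2)
queue.submit("train-model-b", priority=5, submitted_at=3)
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

Quota-based admission and cross-tenant fairness would sit in front of such a queue, deciding which tenant's queue may dequeue next.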
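Bigdata workload ecosystem
Oriented to the open-source ecosystem
• Based on the open-source version of Hadoop, with non-intrusive changes
• YARN remains the submission entry point for big data jobs
Unified priority/QoS policies
• Uses Koordinator's Batch-priority resources
• Managed by koordlet's QoS policies
Ø CPU QoS, Memory QoS, LLC/memory bandwidth
Ø Single-node eviction
Node resource sharing
• Overcommitted resources can be used by both K8s Pods and YARN tasks
• YARN tasks and Batch Pods coexist on the same node
Support for integrating with Alibaba Cloud products
• Colocation of ACK containers and EMR jobs
• Built on Koordinator's colocation QoS capabilities
Expected to be officially released in Koordinator 1.4 in November
https://github.com/koordinator-sh/koordinator/discussions/1297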
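Koordinator Community Updates
Koordinator Community
Members come from many industries and actively contribute proposals.
Containerd NRI plugin integration (from Intel)
• Non-intrusive QoS plugins that plug seamlessly into the Containerd ecosystem
• Flexible, customizable and pluggable extension plugins
• Components shown: Containerd, NRI Server, NRI Plugins, Koordlet, Pods
QoS management for out-of-band applications (from 360)
• Supports colocating K8s workloads with non-containerized applications
• Reserves resources (CPUSet, memory) for non-containerized applications
• Koordlet manages the QoS of out-of-band application processes
Tree-structured Elastic Quota enhancements (from Xiaomi)
• Tree-structured quota groups for multi-tenancy (root, parent, child)
• Fair borrowing and preemption of resources between tenants
Persistent storage of node-level metrics (from 趣丸科技, Quwan Technology)
• Koordlet's Collector persists metrics to a local TSDB, so metrics are retained longer
• Supports a more stable resource overcommitment policy (Mid Priority)
CNCF Sandbox application in progress
https://github.com/cncf/sandbox/issues/51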
Koordinator Community
Community bi-weekly meeting (Chinese)
• Tuesdays, 19:30 GMT+8 (Beijing time)
• DingTalk meeting link: https://meeting.dingtalk.com/j/cgTTojEI8Zy
Thanks
