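Koordinator: Design Considerations for AI and Big Data Scenarios
Challenges that AI and big data scenarios pose to scheduling systems
曾凡松 (Zeng Fansong)
Cloud-native container service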
Project Overview
Why was Koordinator created?
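Enterprise production depends on diverse compute resources and workloads
Compute:
• GPU, FPGA
• Heterogeneous compute
• Topology dependencies
• Resource pooling
Workloads:
• Stateless
• Stateful
• Jobs
Concerns:
• Delivery efficiency
• Workload priority
• Stability/QoS
• Resource utilization
• Cost optimization
• Security isolation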
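[Figure: business workloads (e-commerce transactions, search, big data; Tmall, Xianyu) and the sigma and hippo schedulers lead into unified scheduling and Koordinator. Layers: diverse workloads; unified workload orchestration; bare-metal and virtual machine instances.]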
Project site: koordinator.sh
Koordinator Architecture
A QoS-based scheduling system brings optimal layout and status to workloads
Goals:
1. Flexible Scheduling Policies
2. Improved Resource Utilization
3. Enhanced Workload Performance
4. Easy Integration
Koordinator Design Considerations for AI and Big Data Scenarios
What is resource scheduling?
Pros
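• Standardization: seamlessly compatible with the community plugin ecosystem
• Flexibility: rich extension points that can satisfy diverse scheduling requirements
• Learning curve: plenty of examples and documentation
Cons
• Performance: the framework's constraints set an upper bound on performance
• Completeness of extension points: queue management, compatibility handling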
Resource Model
The resource model is the foundation on which a scheduling system is built
● Runtime QoS: batch, for offline batch jobs (big data, OLAP queries)
Requirements:
● Fair sharing of available capacity between multiple tenants
● Ability to set policies: who can use what and up to what limit

Quota tree (example):
Root | Quota=100
  Quota-1 | Min=20, Max=40 | children: Quota-1.1, Quota-1.2
  Quota-2 | Min=40, Max=80 | children: Quota-2.1, Quota-2.2
  Quota-3 | Min=30, Max=50 | children: Quota-3.1
  …
  Quota-n | Min=x, Max=y | children: Quota-n.1

Diagram labels: Scheduling; committed resources; elastic resources; unused resources; freely usable resources
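The Min/Max semantics of the quota tree can be illustrated with a small sketch. It is a simplified model, not Koordinator's actual algorithm: each quota first receives its committed share (up to Min), and the leftover capacity is then split evenly among quotas that still want more, capped at Max. The `request` values and the function name `runtime_quotas` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Quota:
    name: str
    min: float      # committed resources
    max: float      # upper limit, including elastic resources
    request: float  # hypothetical current demand of the tenant


def runtime_quotas(total: float, quotas: List[Quota]) -> Dict[str, float]:
    """Share `total` capacity: guarantee Min first, then lend out the rest."""
    # Step 1: every quota gets its demand up to Min (committed resources).
    alloc = {q.name: min(q.request, q.min) for q in quotas}
    remaining = total - sum(alloc.values())
    # A quota can never exceed its demand or its Max.
    cap = {q.name: min(q.request, q.max) for q in quotas}
    active = [q.name for q in quotas if alloc[q.name] < cap[q.name]]
    # Step 2: split unused capacity evenly among quotas that still want more.
    while remaining > 1e-9 and active:
        share = remaining / len(active)
        still_active = []
        for name in active:
            grant = min(share, cap[name] - alloc[name])
            alloc[name] += grant
            remaining -= grant
            if cap[name] - alloc[name] > 1e-9:
                still_active.append(name)
        active = still_active
    return alloc


if __name__ == "__main__":
    tenants = [
        Quota("Quota-1", min=20, max=40, request=50),
        Quota("Quota-2", min=40, max=80, request=30),
        Quota("Quota-3", min=30, max=50, request=60),
    ]
    print(runtime_quotas(100, tenants))
```

With these hypothetical requests, Quota-2 leaves 10 units of its committed share unused, and that capacity is lent to Quota-1 and Quota-3 alongside the 10 units not committed to anyone.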
Requirements
• All pods of the job must be coordinated to start together
• AI Jobs such as TFJob support multiple roles
• Each role needs to customize the minimum available replicas
• Resources must be allocated together for at least the job's minimum number of available pods
Koordinator Coscheduling
• Koordinator enhances scheduler-plugins/coscheduling
• Compatible with PodGroup CRDs
• Schedules multiple PodGroups together and divides PodGroups by role
• Integrates with KubeDL, which can switch seamlessly to the Koordinator scheduler
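The all-or-nothing requirement with a per-role minimum can be sketched as a trial placement that is committed only if every role reaches its minimum. This is an illustrative model under simple assumptions (CPU as the only resource, first-fit placement); `Role` and `try_schedule_gang` are hypothetical names and do not reflect the PodGroup CRD or the coscheduling plugin's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Role:
    name: str
    replicas: int
    min_available: int
    cpu_per_pod: int


def try_schedule_gang(
    roles: List[Role], free_cpu: Dict[str, int]
) -> Optional[Dict[str, List[Tuple[str, str]]]]:
    """Return {role: [(pod, node), ...]} or None if any role misses its minimum."""
    trial = dict(free_cpu)  # work on a copy so a failed attempt reserves nothing
    placement: Dict[str, List[Tuple[str, str]]] = {}
    for role in roles:
        placed: List[Tuple[str, str]] = []
        for i in range(role.replicas):
            # First-fit: take the first node with enough free CPU.
            node = next(
                (n for n, cpu in trial.items() if cpu >= role.cpu_per_pod), None
            )
            if node is None:
                break
            trial[node] -= role.cpu_per_pod
            placed.append((f"{role.name}-{i}", node))
        if len(placed) < role.min_available:
            return None  # all-or-nothing: reject the whole job
        placement[role.name] = placed
    return placement


if __name__ == "__main__":
    job = [
        Role("ps", replicas=1, min_available=1, cpu_per_pod=2),
        Role("worker", replicas=4, min_available=3, cpu_per_pod=4),
    ]
    print(try_schedule_gang(job, {"node-a": 8, "node-b": 8}))
    print(try_schedule_gang(job, {"node-a": 8, "node-b": 4}))
```

The second call returns `None` because only two workers fit, below the role's minimum of three, so no pod of the job is started.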
Topology Aware
Origins of topology-aware scheduling in early big data jobs
In a distributed job, each mapper processes a different data shard:
• Place computation on the same machine as its data where possible
• Otherwise keep it within a nearby part of the network topology
[Figure: a ~1PB file split into BLOCK-1 to BLOCK-5, with block replicas spread across nodes under three ToR switches and a core switch; a job runs mapper-1, mapper-2, mapper-3 and a reducer. Storage labels: DFS, alluxio/jindofs. Topology labels: region, AZ, Node.]
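The two placement preferences (same machine first, then nearby network topology) can be sketched as a three-level locality score. The node names, rack mapping, and function names below are hypothetical and are not read from the figure.

```python
from typing import Dict, List, Set

NODE_LOCAL, RACK_LOCAL, OFF_RACK = 2, 1, 0


def locality_score(
    candidate: str, replica_nodes: Set[str], rack_of: Dict[str, str]
) -> int:
    """Score a candidate node for a mapper that reads one data block."""
    if candidate in replica_nodes:
        return NODE_LOCAL  # the data is on the same machine
    replica_racks = {rack_of[n] for n in replica_nodes}
    if rack_of[candidate] in replica_racks:
        return RACK_LOCAL  # the data is under the same ToR switch
    return OFF_RACK  # the data must cross the core switch


def pick_node(
    candidates: List[str], replica_nodes: Set[str], rack_of: Dict[str, str]
) -> str:
    """Choose the candidate with the best locality (first one wins on ties)."""
    return max(candidates, key=lambda n: locality_score(n, replica_nodes, rack_of))


if __name__ == "__main__":
    racks = {"node-1": "tor-1", "node-2": "tor-1", "node-3": "tor-2", "node-4": "tor-3"}
    block_replicas = {"node-1"}  # hypothetical replica location of one block
    print(pick_node(["node-3", "node-2"], block_replicas, racks))
```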
Topology Aware
The topology arrangement of pods significantly affects training efficiency
[Figure: two placements of workers W0 to W3 under access switches ASW-1 and ASW-2, both beneath PSW-1. Left: random scheduling. Right: topology-order-aware scheduling.]
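One way to see why placement order matters is to count how many links of a communication ring cross switches. The sketch below assumes that workers exchange data with their neighbors in rank order (W0 to W1 to W2 to W3 and back to W0), which the slide does not state; the two placements are hypothetical examples, not read from the figure.

```python
from typing import List


def cross_switch_hops(switch_of_rank: List[str]) -> int:
    """Count ring links (rank i to rank i+1, wrapping) that span two switches."""
    n = len(switch_of_rank)
    return sum(
        1 for i in range(n) if switch_of_rank[i] != switch_of_rank[(i + 1) % n]
    )


if __name__ == "__main__":
    # Index = worker rank; value = access switch above the worker's node.
    random_placement = ["ASW-1", "ASW-2", "ASW-1", "ASW-2"]
    ordered_placement = ["ASW-1", "ASW-1", "ASW-2", "ASW-2"]
    print(cross_switch_hops(random_placement))
    print(cross_switch_hops(ordered_placement))
```

Under this assumption the interleaved placement sends every ring link through the upper switch layer, while the ordered placement needs only two such links.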
Resource Sharing
Koordinator Fine-grained Device Scheduling
• koord-scheduler is responsible for allocating devices
• koordlet is responsible for reporting device information
• The koordlet built-in plugin is responsible for applying the device allocation
• koord-runtime-proxy acts as a CRI proxy that forwards CRI requests to koordlet, triggering the built-in plugin
• Device information is described through a CRD
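The scheduler-side step of fine-grained allocation can be sketched as choosing a device for a fractional request. This is a minimal illustration assuming a single "percent of one GPU" resource and a tightest-fit policy; the function name and data layout are hypothetical and are not Koordinator's Device CRD or its allocation logic.

```python
from typing import Dict, Optional


def allocate_gpu_share(
    free_percent: Dict[int, int], request_percent: int
) -> Optional[int]:
    """Reserve `request_percent` of one GPU and return its minor number.

    Picks the device with the least free share that still fits, so that
    emptier devices stay available for larger requests.
    """
    candidates = [
        (free, minor)
        for minor, free in free_percent.items()
        if free >= request_percent
    ]
    if not candidates:
        return None  # no single device can satisfy the request
    free, minor = min(candidates)
    free_percent[minor] = free - request_percent
    return minor


if __name__ == "__main__":
    # Hypothetical node state: GPU minor number -> free share in percent.
    node_gpus = {0: 100, 1: 40, 2: 70}
    print(allocate_gpu_share(node_gpus, 50))
    print(node_gpus)
```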
Resource Sharing
Future: NRI Mode Resource Management
Runtime Proxy
NRI Mode
AI workload ecosystem
Job Queueing – kube-queue/kueue
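Built for the open-source ecosystem
• Based on open-source Hadoop, with non-intrusive modifications
• YARN remains the submission entry point for big data jobs
Unified priority/QoS policy
• Uses Koordinator's Batch-priority resources
• Managed by koordlet QoS policies
  Ø CPU QoS, memory QoS, LLC/memory bandwidth
  Ø Node-level eviction
Node resource sharing
• Overcommitted resources can be used by either K8s Pods or YARN tasks
• YARN tasks and Batch Pods coexist on the same node
Integration with Alibaba Cloud products
• Co-location of ACK containers and EMR tasks
• Built on Koordinator's co-location QoS capabilities
Expected to be officially released in Koordinator 1.4 in November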
https://github.com/koordinator-sh/koordinator/discussions/1297
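The node resource sharing idea, a single overcommitted pool consumed by both K8s Batch Pods and YARN tasks, can be sketched as shared accounting. The formula and names below are simplifying assumptions for illustration; the actual calculation in Koordinator involves additional inputs and thresholds.

```python
from dataclasses import dataclass, field
from typing import Dict


def batch_allocatable(capacity: int, reserved: int, prod_usage: int) -> int:
    """Assumed simplification: whatever production workloads do not use."""
    return max(capacity - reserved - prod_usage, 0)


@dataclass
class SharedBatchPool:
    allocatable: int
    used: Dict[str, int] = field(default_factory=dict)

    def try_allocate(self, consumer: str, amount: int) -> bool:
        """Admit a request from either consumer against the same pool."""
        if sum(self.used.values()) + amount > self.allocatable:
            return False
        self.used[consumer] = self.used.get(consumer, 0) + amount
        return True


if __name__ == "__main__":
    # Hypothetical node: 64 CPUs, 4 reserved, 20 in use by production pods.
    pool = SharedBatchPool(batch_allocatable(64, 4, 20))
    print(pool.try_allocate("k8s-batch-pod", 25))
    print(pool.try_allocate("yarn-task", 10))
    print(pool.try_allocate("yarn-task", 10))
```

Because both consumers draw from one pool, capacity granted to a Batch Pod is never promised a second time to a YARN task on the same node.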
Koordinator Community Updates
Koordinator Community
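Members come from many industries and actively contribute proposals

Containerd NRI plugin integration (from Intel)
• Non-intrusive QoS plugins that fit seamlessly into the Containerd architecture and ecosystem
• Flexible, customizable, and pluggable extension plugins
[Diagram labels: Node, Containerd, NRI Server, NRI Plugins, Koordlet, Pod]

QoS management for out-of-band applications, supporting co-location of K8s with non-containerized applications (from 360)
• Reserves resources (CPUSet, memory) for non-containerized applications
• Koordlet manages the QoS of application processes
[Diagram labels: out-of-band application processes, Koordlet]

Enhanced tree-structured Elastic Quota (from 小米 (Xiaomi))
• Tree-structured quota groups for multi-tenancy
• Fair resource borrowing and preemption between tenants
[Diagram labels: root, parent, child]

Persistent node-level metric storage (from 趣丸科技)
• Persistent storage in a local TSDB, so metrics are retained longer
• Supports a more stable resource overcommitment policy (Mid Priority)
[Diagram labels: Collector, local TSDB]

CNCF Sandbox application in progress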
https://github.com/cncf/sandbox/issues/51
Koordinator Community
Community biweekly meeting (Chinese)
• Tuesdays 19:30 GMT+8 (Beijing time)
• DingTalk meeting link: https://meeting.dingtalk.com/j/cgTTojEI8Zy
Thanks