
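Koordinator: Design Considerations for Cloud-Native AI and Big Data Scenarios
Challenges that AI and big data scenarios pose to scheduling systems

Zeng Fansong (曾凡松)
Cloud-Native Container Service
Project Introduction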
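Why was Koordinator created?
Enterprise production depends on diverse computing power and workloads.

Heterogeneous resources
• GPU, FPGA
• Heterogeneous compute
• Topology dependencies
• Resource pooling

Diverse workloads
• Stateless
• Stateful
• Jobs

Multi-dimensional requirements
• Delivery efficiency
• Workload priority
• Stability/QoS
• Resource utilization
• Cost optimization
• Security isolation

Diverse resources, diverse constraints, dynamic decisions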


Evolution of Alibaba’s scheduling system
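Stage 1, independent development: Taobao, Tmall, and Xianyu each ran their own transaction, big data, and search systems.
Stage 2, 2016 ~ 2019, platformization: e-commerce transactions on sigma, big data on fuxi, search on hippo.
Stage 3, 2018 ~ 2021, cloud native: transaction, big data, and search under unified scheduling on kubernetes.
Stage 4, 2021 ~ now, standardization: Koordinator on kubernetes.
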


Koordinator Overview
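A QoS-based scheduling system that brings optimal layout and status to workloads

Diverse workloads
Unified workload orchestration
• QoS-aware scheduling/rescheduling
• Differentiated SLO policies
• Job scheduling
Kubernetes (Pods)
Bare-metal instances, virtual machine instances

Standardized: the colocation system is fully compatible with standard Kubernetes.
High efficiency: applied at large scale in Alibaba's production systems, with CPU utilization above 50%.
Low adoption cost: little intrusion into workloads; colocation policies are injected through configuration.

Project site: koordinator.sh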
Koordinator Architecture
A QoS-based scheduling system that brings optimal layout and status to workloads

Goals:
1. Flexible Scheduling Policies
2. Improved Resource Utilization
3. Enhanced Workload Performance
4. Easy Integration
Koordinator Design Considerations for AI and Big Data Scenarios
What is resource scheduling?

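User 1, User 2, User 3 -> scheduler -> Node, Node, Node

Resource scheduling is a matching problem between limited resources and diverse demands. Specifically:
1. Whose demand is satisfied first?
2. Resources are allocated but poorly utilized; how can efficiency be improved?
3. How can task deadlock be avoided?
4. Resources have topology attributes; how can placement be optimized?
5. How can the upper-layer application ecosystem be supported?
6. How can security between applications be guaranteed?
7. …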
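Why build on the Scheduling Framework?
Build incrementally, advance the upstream community's technical evolution, and benefit a wider audience.

Pros
• Standardization: seamlessly compatible with the community plugin ecosystem
• Flexibility: rich extension points that can implement diverse scheduling requirements
• Learning curve: plenty of examples and documentation

Cons
• Performance: an upper bound imposed by the framework's constraints
• Extension-point completeness: queue management, compatibility handling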
Resource Model
The resource model is the foundation on which a scheduling system is built.

● Priority Class
  ○ Resources are divided into 4 levels: Product, Mid, Batch and Free.
  ○ Resources that are requested but not used are made available to lower-priority pods.
● Runtime QoS
  ○ Lower-priority pods will be throttled or killed if there is resource contention.

Priority   Suitable Workloads
prod       Latency-sensitive services.
mid        Streaming jobs or near-line jobs.
batch      Offline batch jobs (big data, OLAP queries).
free       Offline testing jobs.
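
The idea that requested-but-unused resources are handed to lower-priority pods can be sketched as follows. This is a minimal illustration under stated assumptions, not Koordinator's actual calculation: the function name, the `reserve_percent` safety margin, and the simplified formula (capacity minus reserve minus Prod usage) are all hypothetical.

```python
def batch_allocatable(capacity_milli, prod_used_milli, reserve_percent=10):
    """Hypothetical estimate of CPU (millicores) that can be offered to batch pods.

    Assumption: whatever prod pods do not actually use, minus a safety
    reserve, may be lent to lower-priority (batch) pods.
    """
    reserved = capacity_milli * reserve_percent // 100  # margin kept free
    return max(0, capacity_milli - reserved - prod_used_milli)


# A 32-core node whose prod pods currently use 8 cores.
print(batch_allocatable(32000, 8000))
```

Because the result follows actual usage rather than requests, it shrinks when Prod usage rises, which is why the Runtime QoS layer must be able to throttle or evict the lower-priority pods.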
Elastic Quota Scheduling
Tree-type resource quota management

Root | Quota=100
  Quota-1 | Min=20, Max=40
    Quota-1.1 | Min=10, Max=20
    Quota-1.2 | Min=10, Max=50
  Quota-2 | Min=40, Max=80
    Quota-2.1 | Min=10, Max=60
    Quota-2.2 | Min=30, Max=80
  Quota-3 | Min=30, Max=50
    Quota-3.1 | Min=10, Max=40
  …
  Quota-n | Min=x, Max=y
    Quota-n.1 | Min=a, Max=b
Legend: committed resources, elastic resources, unused resources, freely usable resources

Requirements:
● Fair sharing of available capacity between multiple tenants
● Ability to set policies: who can use what and up to what limit

Koordinator Elastic Quota Scheduling:
● Compatible with the ElasticQuota CRD
● Supports configuring quotas in a tree structure
● Fair sharing, with configurable shared weight

Multi-level quota scheduling: the Quota Tree provides multi-level scheduling and supports fast resource identification and preemption, so that resources for high-priority tasks are secured within a short time. This lets users finely manage resource usage across many kinds of jobs.
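
To make the Min/Max semantics concrete, here is a minimal sketch of weighted fair sharing among sibling quotas. It is an illustration only, not Koordinator's implementation: the function name, the `request` and `weight` fields, and the assumption that the sum of all Min values fits within the parent's total are hypothetical.

```python
def runtime_quotas(total, groups):
    """groups: name -> dict(min, max, request, weight). Returns name -> runtime quota.

    Assumption: the sum of all `min` values does not exceed `total`.
    """
    # A group can never receive more than its max, nor more than it asks for.
    cap = {n: min(g["max"], g["request"]) for n, g in groups.items()}
    # Step 1: the guaranteed part (min), limited to what the group requests.
    runtime = {n: min(g["min"], cap[n]) for n, g in groups.items()}
    remaining = total - sum(runtime.values())
    active = {n for n in groups if runtime[n] < cap[n]}
    # Step 2: lend the rest by weight until it runs out or everyone is satisfied.
    while remaining > 1e-9 and active:
        weight_sum = sum(groups[n]["weight"] for n in active)
        share = {n: remaining * groups[n]["weight"] / weight_sum for n in active}
        for n in list(active):
            grant = min(share[n], cap[n] - runtime[n])
            runtime[n] += grant
            remaining -= grant
            if cap[n] - runtime[n] <= 1e-9:
                active.discard(n)
    return runtime


# Min/Max taken from the tree above; requests and weights are made up.
quotas = {
    "Quota-1": {"min": 20, "max": 40, "request": 60, "weight": 1},
    "Quota-2": {"min": 40, "max": 80, "request": 70, "weight": 1},
    "Quota-3": {"min": 30, "max": 50, "request": 10, "weight": 1},
}
print(runtime_quotas(100, quotas))
```

In this example Quota-3 asks for less than its Min, so the unused part is lent to Quota-1 and Quota-2 in equal shares; if Quota-3's demand grows later, the borrowed part is what preemption would reclaim.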
Coscheduling/Gang Scheduling
Scheduling Pods in group

Requirements
• All pods of a job must be coordinated to start together
• AI jobs such as TFJob support multiple roles
• Each role needs its own minimum number of available replicas
• Resources are allocated together for the minimum available pods of a job

Koordinator Coscheduling
• Koordinator enhances scheduler-plugins/coscheduling
• Compatible with PodGroup CRDs
• Schedules multiple PodGroups together and divides PodGroups by role
• Integrates with KubeDL, which can switch seamlessly to the Koordinator scheduler
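
The all-or-nothing rule with a per-role minimum can be sketched as below. This is a simplified illustration, not the coscheduling plugin's code: the function name is hypothetical, and it assumes every pod occupies exactly one interchangeable "slot".

```python
def gang_schedule(free_slots, roles):
    """roles: role -> (replicas, min_available). Returns role -> pods placed, or None.

    Assumption: each pod needs one slot, and slots are interchangeable.
    """
    # The gang is admitted only if every role can reach its own minimum.
    needed = sum(min_avail for _, min_avail in roles.values())
    if needed > free_slots:
        return None  # all-or-nothing: place no pod and hold no resources
    placed = {role: min_avail for role, (_, min_avail) in roles.items()}
    spare = free_slots - needed
    # Leftover capacity may host replicas beyond the minimum.
    for role, (replicas, min_avail) in roles.items():
        extra = min(spare, replicas - min_avail)
        placed[role] += extra
        spare -= extra
    return placed


tfjob = {"ps": (2, 2), "worker": (8, 4)}  # hypothetical TFJob with two roles
print(gang_schedule(8, tfjob))
print(gang_schedule(5, tfjob))
```

Refusing to place any pod when the minimum cannot be met is what prevents two half-started jobs from holding resources and blocking each other.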
Topology Aware
Origins of topology-aware scheduling in early big data jobs

File (~1PB): BLOCK-1, BLOCK-2, BLOCK-3, BLOCK-4, BLOCK-5
Job: mapper-1, mapper-2, mapper-3, reducer
Network: core switch -> ToR switches -> Nodes, each Node storing replicas of several blocks

In a distributed job, each mapper processes a different data shard:
• Keep computation and data on the same machine where possible
• Otherwise place them within nearby network topology
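
The two placement preferences for mappers (same machine first, nearby network second) can be sketched as a simple locality search. The function name, the node and rack names, and the replica layout are hypothetical and only serve the illustration.

```python
def pick_node(block, replicas, rack_of, free_nodes):
    """Return (node, locality) for a mapper that reads `block`."""
    holders = replicas[block]
    # 1) Node-local: a free node that already stores the block.
    for node in free_nodes:
        if node in holders:
            return node, "node-local"
    # 2) Rack-local: a free node under the same ToR switch as a replica.
    holder_racks = {rack_of[n] for n in holders}
    for node in free_nodes:
        if rack_of[node] in holder_racks:
            return node, "rack-local"
    # 3) Fall back to any free node; the data then crosses the core switch.
    return (free_nodes[0], "remote") if free_nodes else (None, "none")


replicas = {"BLOCK-1": {"node-1", "node-2", "node-4"}}  # made-up replica layout
rack_of = {"node-1": "tor-1", "node-2": "tor-1", "node-3": "tor-2",
           "node-4": "tor-2", "node-5": "tor-3"}
print(pick_node("BLOCK-1", replicas, rack_of, ["node-3", "node-5"]))
```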


Topology Aware

Big data computing under storage-compute separation
Diagram labels: region, DFS, AZ, Node, alluxio/jindofs, mapper-1, mapper-2, mapper-3

Large model training driven by AIGC
Topology Aware
Topology arrangement of pods will significantly affect training efficiency

Example with 4 workers: w0->w1->w2->w3->w0 (Ring AllReduce)
Performance gap: 30%~40% at 512 GPUs

Diagram: random scheduling vs. topology-order-aware scheduling of W0 to W3 across GPU machine-A to GPU machine-D under PSW-1, ASW-1, and ASW-2
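
One way to see why pod order matters for Ring AllReduce is to count how many ring edges cross an access switch. The sketch below assumes machines A and B sit under ASW-1 and machines C and D under ASW-2, and the two placements are illustrative; none of these details are confirmed by the slide, and the function name is hypothetical.

```python
def cross_switch_hops(placement, switch_of):
    """Count ring edges w0->w1->...->w0 whose endpoints sit under different switches."""
    workers = sorted(placement, key=lambda w: int(w[1:]))  # rank order w0, w1, ...
    hops = 0
    for i, w in enumerate(workers):
        nxt = workers[(i + 1) % len(workers)]  # ring neighbour, wrapping around
        if switch_of[placement[w]] != switch_of[placement[nxt]]:
            hops += 1
    return hops


switch_of = {"A": "ASW-1", "B": "ASW-1", "C": "ASW-2", "D": "ASW-2"}  # assumed wiring
random_placement = {"w0": "A", "w2": "B", "w1": "C", "w3": "D"}
ordered_placement = {"w0": "A", "w1": "B", "w2": "C", "w3": "D"}
print(cross_switch_hops(random_placement, switch_of))
print(cross_switch_hops(ordered_placement, switch_of))
```

Under these assumptions the random placement sends every ring edge across switches, while the ordered placement halves that number; fewer cross-switch edges means less traffic through the upper switch tier.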
Resource Sharing
Koordinator Fine-grained Device Scheduling
• koord-scheduler is responsible for allocating devices
• koordlet is responsible for reporting device information
• The koordlet built-in plugin is responsible for applying the device allocation
• koord-runtime-proxy acts as a CRI proxy that forwards CRI requests to koordlet, triggering the built-in plugin
• Device information is described through a CRD
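
As an illustration of what fine-grained allocation means on the scheduler side, the sketch below places a fractional GPU request on the device with the least leftover share. The best-fit policy, the percentage unit, and the function name are assumptions made for this example, not a description of koord-scheduler's actual algorithm.

```python
def allocate_gpu(free_percent, request_percent):
    """free_percent: remaining share (0-100) per GPU. Returns a GPU index or None."""
    # Best fit: the GPU with the least free share that still satisfies the request.
    candidates = [(free, i) for i, free in enumerate(free_percent)
                  if free >= request_percent]
    if not candidates:
        return None
    _, index = min(candidates)
    free_percent[index] -= request_percent  # record the allocation
    return index


gpus = [100, 50, 30]  # hypothetical node with three partly used GPUs
print(allocate_gpu(gpus, 30), gpus)
```

Best fit keeps whole GPUs free for pods that need a full device, at the cost of packing small requests more tightly.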
Resource Sharing
Future: NRI Mode Resource Management

Runtime Proxy

NRI Mode
AI workload ecosystem
Job Queueing: kube-queue/kueue
• Job-level priority
• Fairness across multi-tenant queues
• Extensible queuing strategies
• Quota-based admission
• Priority
• Submission timestamp
• Production/BestEffort
• …
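
A queue that combines job-level priority, submission timestamp, and quota-based admission can be sketched as follows. This is not the kube-queue or kueue API; the function name, the tuple layout, and the choice to skip (rather than block on) a job that does not fit are assumptions for illustration.

```python
import heapq


def admit_jobs(jobs, quota):
    """jobs: list of (name, priority, submit_ts, demand). Returns admitted names in order."""
    # Higher priority first; an earlier submission time breaks ties.
    heap = [(-prio, ts, name, demand) for name, prio, ts, demand in jobs]
    heapq.heapify(heap)
    admitted, used = [], 0
    while heap:
        _, _, name, demand = heapq.heappop(heap)
        if used + demand <= quota:  # quota-based admission
            admitted.append(name)
            used += demand
        # Assumption: a job that does not fit is skipped, not blocking the queue.
    return admitted


jobs = [("train-a", 10, 100, 4), ("etl-b", 5, 90, 3),
        ("train-c", 10, 95, 4), ("test-d", 1, 80, 1)]
print(admit_jobs(jobs, quota=9))
```

Skipping jobs that do not fit improves utilization but can starve large jobs; a strict FIFO variant would stop at the first job that exceeds the quota instead.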
Bigdata workload ecosystem

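Open-source ecosystem
• Based on the open-source version of Hadoop, with non-intrusive modifications
• YARN remains the submission entry point for big data jobs

Unified priority/QoS policy
• Uses Koordinator's Batch-priority resources
• Managed by Koordlet's QoS policies
  Ø CPU QoS, Memory QoS, LLC/memory bandwidth
  Ø Node-level eviction

Node resource sharing
• Overcommitted resources can be used by K8s Pods and also by YARN tasks
• YARN tasks and Batch Pods coexist on the same node

Integration with Alibaba Cloud products
• Colocation of ACK containers and EMR jobs
• Built on Koordinator's colocation QoS capabilities

Expected to be officially released in Koordinator 1.4 in November
https://github.com/koordinator-sh/koordinator/discussions/1297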
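Koordinator Community Updates
Koordinator Community

Members come from many industries and actively contribute proposals.

Containerd NRI plugin integration (from Intel)
• Non-intrusive QoS plugin that fits seamlessly into the Containerd architecture and ecosystem
• Flexible, customizable, pluggable extension plugins
Diagram labels: Containerd, NRI Server, NRI Plugins, Koordlet

QoS management for out-of-band applications, supporting colocation of K8s and non-containerized applications (from 360)
• Reserves resources (CPUSet, memory) for non-containerized applications
• Brings application process QoS under management
Diagram labels: Node, Pod, out-of-band application process, Koordlet

Tree-structured Elastic Quota enhancements (from Xiaomi)
• Tree-structured quota grouping for multi-tenancy
• Fair resource borrowing and preemption between tenants
Diagram labels: root, parent, child

Node-level metric storage persistence (from 趣丸科技, Quwan Technology)
• Persistent storage in a local TSDB, so metrics are retained longer
• Supports a more stable resource overcommitment policy (Mid Priority)
Diagram labels: Koordlet, Collector, local TSDB

CNCF Sandbox proposal application in progress
https://github.com/cncf/sandbox/issues/51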
Koordinator Community
Community biweekly meeting (Chinese)
• Tuesdays 19:30 GMT+8 (Beijing time)
• DingTalk meeting link: https://meeting.dingtalk.com/j/cgTTojEI8Zy
Thanks
