Open Cluster Management: Scheduling AI workloads across multiple clusters
- Mentors
- Qing Hao, Jian Qiu
- Organization
- CNCF
- Technologies
- go, kubernetes, Scheduling, Kueue
- Topics
- cloud, kubernetes, Multi-clusters
Enhance Open Cluster Management (OCM) to efficiently schedule AI workloads across multiple Kubernetes clusters by optimizing the use of GPU/TPU resources.
Key components:
1. GPU/TPU Resource Evaluation Addon: Extend OCM's placement strategy to account for GPU/TPU resource availability. This project introduces an AddOnPlacementScore that assesses GPU/TPU resources across cluster sets. The score informs scheduling decisions so that AI workloads are placed on clusters that meet their specific GPU/TPU resource requirements.
2. OCM Kueue Admission Check Controller: Deliver a proposal for an external Kueue Admission Check controller that integrates OCM Placement results with MultiKueue. The controller reads OCM Placement decisions and generates the corresponding MultiKueueConfig and MultiKueueCluster resources, which streamlines MultiKueue setup and lets users select clusters based on custom criteria.
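The two components above can be sketched end to end in a few lines of Go. This is only an illustration under stated assumptions, not the project's design: the structs are simplified stand-ins rather than the real OCM or Kueue API types, and the scoring formula (free accelerators as a percentage of allocatable), the `<placement>-<cluster>` naming scheme, and the `<cluster>-kubeconfig` secret name are all hypothetical. The sketch scores each cluster, picks the top N as a placement decision would, and derives one MultiKueueCluster per selected cluster plus a MultiKueueConfig listing them.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterGPU is a hypothetical summary an addon agent might report per managed cluster.
type ClusterGPU struct {
	Name        string
	Allocatable int64 // accelerators (GPU/TPU) the cluster can schedule in total
	Requested   int64 // accelerators already requested by workloads
}

// gpuScore maps the free fraction of accelerators to 0..100 (assumed formula).
func gpuScore(c ClusterGPU) int32 {
	if c.Allocatable <= 0 {
		return 0
	}
	free := c.Allocatable - c.Requested
	if free < 0 {
		free = 0
	}
	return int32(free * 100 / c.Allocatable)
}

// selectClusters mimics a placement decision: the n highest-scoring clusters,
// with ties broken by cluster name so the result is deterministic.
func selectClusters(clusters []ClusterGPU, n int) []string {
	sorted := append([]ClusterGPU(nil), clusters...) // copy; leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		si, sj := gpuScore(sorted[i]), gpuScore(sorted[j])
		if si != sj {
			return si > sj
		}
		return sorted[i].Name < sorted[j].Name
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	if n < 0 {
		n = 0
	}
	names := make([]string, 0, n)
	for _, c := range sorted[:n] {
		names = append(names, c.Name)
	}
	return names
}

// Simplified stand-ins for the Kueue resources the controller would generate.
type MultiKueueCluster struct {
	Name             string
	KubeConfigSecret string // hypothetical: secret holding the worker cluster kubeconfig
}

type MultiKueueConfig struct {
	Name     string
	Clusters []string // names of MultiKueueCluster objects
}

// buildMultiKueue turns the decided cluster names into one MultiKueueCluster per
// cluster and a MultiKueueConfig that references all of them.
func buildMultiKueue(placement string, decided []string) (MultiKueueConfig, []MultiKueueCluster) {
	cfg := MultiKueueConfig{Name: placement, Clusters: make([]string, 0, len(decided))}
	mkcs := make([]MultiKueueCluster, 0, len(decided))
	for _, name := range decided {
		mkc := MultiKueueCluster{
			Name:             placement + "-" + name,
			KubeConfigSecret: name + "-kubeconfig",
		}
		mkcs = append(mkcs, mkc)
		cfg.Clusters = append(cfg.Clusters, mkc.Name)
	}
	return cfg, mkcs
}

func main() {
	clusters := []ClusterGPU{
		{Name: "cluster1", Allocatable: 8, Requested: 2},
		{Name: "cluster2", Allocatable: 4, Requested: 4},
		{Name: "cluster3", Allocatable: 16, Requested: 4},
	}
	decided := selectClusters(clusters, 2)
	cfg, mkcs := buildMultiKueue("gpu-placement", decided)
	fmt.Println("decided clusters:", decided)
	fmt.Println("MultiKueueConfig clusters:", cfg.Clusters)
	fmt.Println("MultiKueueCluster count:", len(mkcs))
}
```

In a real controller the score would be published through an AddOnPlacementScore object, the decision would be read from the Placement's decisions rather than computed locally, and the generated objects would be created through the Kubernetes API. The plain functions here only show the data flow between the two components.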
Deliverables: GPU/TPU Resource Evaluation Addon, OCM Kueue Admission Check Controller, Comprehensive Documentation and User Guides.