[Newbie] Supporting Yunikorn and Kueue #5915

131 changes: 131 additions & 0 deletions rfc/system/5575-supporting-yunikorn-and-kueue.md
# [Newbie] Supporting Yunikorn and Kueue

**Authors:**

- @yuteng

## 1 Executive Summary

Provide Kubernetes (k8s) resource management, gang scheduling, and preemption for Flyte applications through third-party schedulers, namely Apache Yunikorn and Kueue.
> **Member:** Could you please explain what preemption means here compared to what preemption means in the context of spot instances on e.g. AWS or GCP?


## 2 Motivation

Flyte supports multi-tenancy and various k8s plugins.

Kueue and Yunikorn support gang scheduling and preemption.
Gang scheduling guarantees that all pods of a K8s CRD workload, such as Spark or Ray, are scheduled together once sufficient resources are available, while preemption ensures that high-priority tasks execute immediately.
> **Member:**
> > guarantees the availability of certain K8s crd services, such as Spark
>
> I would rather say that gang scheduling guarantees that all worker pods derived from a CRD are scheduled at the same time. Would add that this is important to prevent waste of resources when jobs can partially start without being able to do any meaningful work.


Flyte doesn't provide resource management for multi-tenancy, a gap that Yunikorn's hierarchical resource queues can fill.

## 3 Proposed Implementation

The proposed queueconfig declares which batch scheduler to use and the per-job gang-scheduling options, for example:

```yaml
queueconfig:
scheduler: yunikorn
jobs:
- type: "ray"
gangscheduling: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=hard"
- type: "spark"
gangscheduling: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=hard"
```
> **Member:** Is this list complete or an example? I.e. will this also work for plugins like kubeflow pytorch, tf, mpi or dask, ...?

> **Author:** This is an example. Admins can set a default gang-scheduling configuration for each CRD in the Flyte k8s plugins.


The configuration above indicates what queues exist for an org.
> **Member:** Could you please explain what an org is in this context? It's not the same as this org, right?

Hierarchical queues will be structured as follows: root.org1.ray, root.org1.spark, and root.org1.default.
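
As a sketch, the corresponding Yunikorn queue hierarchy (following the upstream queue configuration format; the partition and queue names here are illustrative) could be declared like this:

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: org1            # one sub-tree per organization
            queues:
              - name: ray         # root.org1.ray
              - name: spark       # root.org1.spark
              - name: default     # root.org1.default
```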

A Kueue ResourceFlavor allocates resources based on node labels, which means that category-based resource allocation by organization label is possible.
> **Member (@fg91, Nov 5, 2024):** Could you please explain how the resource flavor will be determined? Is there a way to automatically derive this from the task decorator args `@task(resources=..., accelerator=...)`?
>
> It would be really nice if tasks that need e.g. an A100 GPU were automatically not in the same queue as tasks that need 2 x T4 GPUs. We're using kubeflow pytorch jobs with scheduler plugins' gang scheduling and have observed jobs being starved, even though the cluster had resources for them, because higher-priority jobs that were trying to get different GPU types couldn't be scheduled.

> **Author:** No, the Kueue CRDs need to be created first. A ClusterQueue defines a resource quota whose properties are defined by ResourceFlavors. I think creating ResourceFlavors to categorize resources under a ClusterQueue is a viable solution.
>
> ```yaml
> apiVersion: kueue.x-k8s.io/v1beta1
> kind: ResourceFlavor
> metadata:
>   name: "spot-t4"
> spec:
>   nodeLabels:
>     cloud.google.com/gke-accelerator: nvidia-tesla-t4
>   nodeTaints:
>   - effect: NoSchedule
>     key: cloud.google.com/gke-accelerator
>     value: "true"
>   tolerations:
>   - key: "spot-taint"
>     operator: "Exists"
>     effect: "NoSchedule"
> ---
> apiVersion: kueue.x-k8s.io/v1beta1
> kind: ClusterQueue
> metadata:
>   name: "cluster-queue"
> spec:
>   namespaceSelector: {} # match all.
>   resourceGroups:
>   - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
>     flavors:
>     - name: "spot-t4"
>       resources:
>       - name: "cpu"
>         nominalQuota: 9
>       - name: "memory"
>         nominalQuota: 36Gi
>       - name: "nvidia.com/gpu"
>         nominalQuota: 50
>     - name: "spot-a100"
>       resources:
>       - name: "cpu"
>         nominalQuota: 18
>       - name: "memory"
>         nominalQuota: 72Gi
>       - name: "nvidia.com/gpu"
>         nominalQuota: 100
> ```

On the other hand, Kueue preemption requires a Kueue WorkloadPriorityClass and patching the job with a label.
When the plugin receives the preemption label, it should propagate it to the pods belonging to the same job.
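
As a sketch of what that could look like with Kueue's published API (the class name and value below are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority          # illustrative name
value: 10000                   # higher value means higher priority
description: "Priority class for urgent Flyte executions"
---
# The job is then labeled with the class so Kueue can preempt lower-priority workloads:
metadata:
  labels:
    kueue.x-k8s.io/queue-name: org1-ray
    kueue.x-k8s.io/priority-class: high-priority
```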

Thus, a ClusterQueue that includes multiple resource flavors represents the total accessible resources for an organization.
> **Member:** I don't understand this sentence tbh, could you please explain/expand?

| clusterQueue | localQueue |
| --- | --- |
| Org | ray, spark, default |
A tenant can submit organization-specific tasks to queues such as org.ray, org.spark and org.default to track which job types are submittable.
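
As an illustration of the table above, one LocalQueue per submittable job type could be created in the tenant namespace and pointed at the organization's ClusterQueue (the names here are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: org1              # tenant namespace
  name: org1-ray               # one LocalQueue per submittable job type
spec:
  clusterQueue: cluster-queue  # the organization-wide quota defined above
```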


A SchedulerConfigManager maintains the configuration from the YAML above.
> **Member:** SchedulerConfigManager would be a Go struct, or are you suggesting a new backend service?

> **Member:** I don't see this in any of the code snippets below.

It patches labels or annotations onto k8s resources after they pass the rules specified in the configuration.
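
As a point of reference only, here is a minimal sketch of SchedulerConfigManager assuming it is a plain Go struct inside flytepropeller that holds the parsed queueconfig (all type and field names are illustrative, not part of the proposal):

```go
// JobConfig mirrors one entry of queueconfig.jobs.
type JobConfig struct {
	Type           string `yaml:"type"`
	GangScheduling string `yaml:"gangscheduling"`
}

// QueueConfig mirrors the queueconfig block shown earlier.
type QueueConfig struct {
	Scheduler string      `yaml:"scheduler"`
	Jobs      []JobConfig `yaml:"jobs"`
}

// SchedulerConfigManager holds the parsed configuration and answers whether a
// given plugin type has scheduling rules configured.
type SchedulerConfigManager struct {
	config QueueConfig
}

// RulesFor returns the gang-scheduling parameters for a plugin type, if any.
func (m *SchedulerConfigManager) RulesFor(pluginType string) (JobConfig, bool) {
	for _, job := range m.config.Jobs {
		if job.Type == pluginType {
			return job, true
		}
	}
	return JobConfig{}, false
}
```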

> **Member:** Why not have a single interface and two implementations of the same interface for Yunikorn and Kueue?

> **Member:** I would prefer that `func (e *PluginManager) launchResource(` not call Kueue- or Yunikorn-specific code (see the snippet below) but just a general interface whose implementation depends on the propeller config.

> **Author:** Agree, I updated the document.

```go
type YunikornScheduablePlugin interface {
	MutateResourceForYunikorn(ctx context.Context, object client.Object, taskTmpl *core.TaskTemplate) (client.Object, error)
	GetLabels(id core.Identifier) map[string]string
}

type KueueScheduablePlugin interface {
	MutateResourceForKueue(ctx context.Context, object client.Object, taskTmpl *core.TaskTemplate) (client.Object, error)
	GetLabels(id core.Identifier) map[string]string
}

// yunikornPlugin is a concrete type implementing YunikornScheduablePlugin.
func (h *yunikornPlugin) MutateResourceForYunikorn(ctx context.Context, object client.Object, taskTmpl *core.TaskTemplate) (client.Object, error) {
	rayJob := object.(*rayv1.RayJob)
	// TODO: patch the RayJob pod templates with the scheduling metadata.
	_ = rayJob
	return object, nil
}

func (h *yunikornPlugin) GetLabels(id core.Identifier) map[string]string {
	// Labels derived from:
	// 1. UserInfo
	// 2. QueueName
	// 3. ApplicationID
	return nil
}

func PatchPodSpec(target *v1.PodSpec, labels map[string]string) error {
	// Get the meta object from the target.
	// Add each label if it doesn't already exist.
	return nil
}
```

> **Member:** Do any additional k8s resources have to be created for the queues, or does a queue exist as soon as a pod has an annotation with a new queue name?

> **Author (@0yukali0, Nov 6, 2024):** Yes, Kueue CRDs describe the quota of a queue when adopting Kueue. On the other hand, queues are configured by setting the [Yunikorn configuration](https://yunikorn.apache.org/docs/user_guide/queue_config) if adopting Yunikorn.


A scheduler plugin is created according to queueconfig.scheduler.
Its basic responsibility is to validate whether a submitted application is accepted.
When a Yunikorn scheduler plugin is created, it generates the applicationID and queue name;
a Kueue scheduler plugin, on the other hand, constructs labels including the localQueueName and preemption settings.
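
For illustration, and assuming the upstream label/annotation conventions of the two schedulers, the metadata the plugins attach could look roughly like this (the concrete values are placeholders):

```yaml
# Yunikorn: annotations on the pod templates of the CRD
metadata:
  annotations:
    yunikorn.apache.org/app-id: "org1-user1-ray-example"   # placeholder application ID
    yunikorn.apache.org/queue: "root.org1.ray"
---
# Kueue: labels on the job
metadata:
  labels:
    kueue.x-k8s.io/queue-name: "org1-ray"
    kueue.x-k8s.io/priority-class: "high-priority"          # only when preemption is requested
```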

```go
func (e *PluginManager) launchResource(ctx context.Context, tCtx pluginsCore.TaskExecutionContext) (pluginsCore.Transition, error) {
	o, err := e.plugin.BuildResource(ctx, k8sTaskCtx)
	if err != nil {
		return pluginsCore.UnknownTransition, err
	}
	// Let the scheduler plugin mutate the built resource (e.g. attach queue
	// labels/annotations) before it is created on the cluster.
	if o, err = e.SchedulerPlugin.MutateResourceForKueue(o); err != nil {
		return pluginsCore.UnknownTransition, err
	}
	// ...
}
```

> **Member (@fg91, Nov 5, 2024):** Maybe `addObjectMetadata`, which is called by `launchResource`, would be a better place to inject the required metadata. Or do we need to inject something other than labels/annotations?

> **Member:** Would be nice to not have Kueue-specific code here but a general interface, see this comment.

> **Member (@fg91, Nov 5, 2024):** We should only mutate the resource if the plugin manager manages a plugin which the user configured a queue for, right? How will this matching be done? Just comparing this type string
>
> ```yaml
> queueconfig:
>   scheduler: yunikorn
>   jobs:
>     - type: "ray"
> ```
>
> to the name of the plugin?
When the batch scheduler in Flyte is Yunikorn, some examples are as follows.
This approach submits a Ray job owned by user1 in org1 to "root.org1.ray".
> **Member (@fg91, Nov 5, 2024):** Where does flytepropeller know the user from? Or does the user not matter, as the label "root.org1.ray" suggests?

A Spark application in ns1 submitted by user4 in org1 is placed in "root.org1.ns1".
When adopting Kueue, the resulting queues for these examples are "org1-ray" and "org1-ns1".

## 4 Metrics & Dashboards

1. The Yunikorn scheduler adds applications to a specific queue based on their user info and queue name, for any application type.
2. Yunikorn and Kueue provide gang scheduling through annotations for Ray and Spark.
3. Preemption behavior aligns with the user-defined configuration in Yunikorn.

## 5 Drawbacks

This approach doesn't offer a way to maintain consistency between the actual resource quotas of groups and the configuration in the scheduler.

## 6 Alternatives

## 7 Potential Impact and Dependencies

Flyte supports Spark, Ray, and Kubeflow CRDs, including PyTorch and TFJobs.
The Spark and Ray operators have supported Yunikorn gang scheduling since task-group calculation was implemented in those operators.
To support the Kubeflow CRDs, a task-group calculation over the pods must be implemented in Flyte or Kubeflow.
On the other hand, Kueue currently doesn't support the Spark CRD.
| Operator | Yunikorn | Kueue |
| --- | --- | --- |
| Spark | v | x |
| Ray | v | v |
| Kubeflow | x | v |

> **Member:** From what I understand, one only needs to add labels/annotations on the worker pods. Can't we do this purely from Flyte by modifying the pod template spec of the respective CRD? What do the operators have to do in addition to that?

> **Author:** Yes, the current implementation fetches the pod templates from the CRDs and patches labels on them. If operators implement a mechanism to patch the group label for their CRD to support gang scheduling in the future, we can start removing the code that generates group labels to reduce the maintenance overhead.

## 8 Unresolved questions


## 9 Conclusion

Yunikorn and Kueue support gang scheduling, allowing all necessary pods to run simultaneously when the required resources are available.
Yunikorn provides preemption, calculating the priority of applications from their priority class and the priority score of the queue to which they are submitted, so that high-priority or emergency applications can run immediately.
Yunikorn's hierarchical queues include guaranteed-resource settings and ACLs.
> **Member (@fg91, Nov 5, 2024):** Nit: Could you please run the doc through a spelling checker? Thank you 🙇

> **Author:** Yes, I ran `make spellcheck` in the latest commit :)