Gang-scheduled Multi-node Training with SkyPilot + Kueue

In this tutorial, we will run multi-node training jobs on a Verda Kubernetes instant cluster using SkyPilot as the submission layer, and add gang scheduling and queueing via Kueue.

You will end up with:

  • A SkyPilot setup that can launch distributed jobs against your cluster with a single sky launch.
  • Multi-node training jobs that are admitted as one atomic unit (all pods start together, or none do).
  • A FIFO queue that holds excess jobs when the cluster is full, instead of letting them sit half-scheduled and hog GPUs.
  • Hard quota enforcement across tenants, namespaces, or teams.
  • A runnable TorchTitan example that trains Llama 3 debug_model for 50 steps across 16 GPUs, gang-scheduled end-to-end.

Info

This tutorial is not a replacement for Kueue's upstream docs. For advanced topics (cohorts, preemption, fair sharing, multi-tenancy), see kueue.sigs.k8s.io.

Prerequisites

For this tutorial you need:

  1. A Verda Kubernetes instant cluster with at least two GPU nodes. The worked example assumes B300 nodes; adjust the accelerator type for other SKUs.
  2. Local kubectl configured to talk to the cluster. Running kubectl get nodes should list your GPU worker nodes in Ready state.
  3. Python 3.10+ on your workstation.
  4. cluster-admin rights on your cluster (Kueue installs cluster-scoped CRDs and webhooks).

Why bother with Kueue?

Out of the box, Kubernetes schedules pods one at a time. For a distributed training job — N pods that must start together — that model breaks down:

  • If only M < N pods fit, Kubernetes happily starts those M and leaves the rest pending. You now hold M GPUs doing nothing, waiting for capacity that may never come.
  • Two tenants submitting simultaneously can deadlock — each holds half their job's GPUs, neither can make progress.

Kueue sits in front of pod creation and fixes both problems:

  • Gang scheduling — a job's pods admit all-or-nothing. No partial starts.
  • Queueing — if the cluster is full, new jobs wait in line instead of taking GPUs that cannot satisfy the whole request.
  • Quotas — enforce per-namespace or per-team limits.

Install SkyPilot

Install into an isolated virtual environment so it does not interfere with any system Python:

python3 -m venv ~/sky-env
source ~/sky-env/bin/activate
pip install 'skypilot[kubernetes]'
sky --version

Expect something like skypilot, version 0.12.0 or newer.

Verify SkyPilot sees your cluster

sky check k8s

Expected output includes:

Kubernetes: enabled [compute]
    Allowed contexts:
    └── <your-context-name>: enabled.

Confirm SkyPilot can see your GPUs:

sky show-gpus --infra k8s

You should see your GPU type (e.g. B300) listed with per-node utilization info.

Warning

If this step fails, SkyPilot cannot reach the cluster. Re-check your kubeconfig and that your context is selected with kubectl config current-context.

Install Kueue

Pick a release

KUEUE_VERSION=v0.17.1   # check https://github.com/kubernetes-sigs/kueue/releases for latest
curl -sL "https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/manifests.yaml" \
  -o kueue-install.yaml

Kueue installs admission webhooks cluster-wide. By default they fail closed (failurePolicy: Fail), which means if Kueue's controller ever goes unhealthy, pod creations across your cluster briefly hang until the webhook times out.

On a shared cluster where other tenants' pod creation should not depend on Kueue's health, flip the webhooks to fail open:

sed -i 's/failurePolicy: Fail/failurePolicy: Ignore/g' kueue-install.yaml

Warning

Trade-off: if Kueue is unhealthy, workloads that should be gated by Kueue will instead be admitted as plain pods. Fine for low-stakes clusters; not acceptable in regulated multi-tenant environments.

Apply

kubectl apply --server-side -f kueue-install.yaml

Verify

kubectl -n kueue-system get pods
# kueue-controller-manager-xxxxx   1/1   Running

kubectl api-resources | grep kueue
# Should list clusterqueues, localqueues, resourceflavors, workloads, ...

Configure the namespace scope

By default, Kueue only manages jobs in namespaces it has been told to watch; in this install, the managedJobsNamespaceSelector selects only the default namespace. Pick the namespace where your training jobs will live and make sure it is selected.

Inspect the current selector:

kubectl -n kueue-system get configmap kueue-manager-config -o yaml \
  | grep -A8 managedJobsNamespaceSelector

You will see something like:

managedJobsNamespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: In
    values:
    - default

If you want Kueue to manage jobs in a different namespace (e.g. team-a), either:

  • Add the namespace to the selector values list (preferred — keeps Kueue narrowly scoped), or
  • Label the namespace to match the selector. For example if the selector is changed to matchLabels: kueue.x-k8s.io/managed: "true", you would kubectl label ns team-a kueue.x-k8s.io/managed=true.
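If you go the label route, the edited selector in kueue-manager-config would look something like this (a sketch — the label key is your choice; kueue.x-k8s.io/managed is just a common convention):

```yaml
managedJobsNamespaceSelector:
  matchLabels:
    kueue.x-k8s.io/managed: "true"   # only namespaces carrying this label are managed
```

Any namespace without the label is then invisible to Kueue — which is exactly the silent-ignore gotcha called out in the warning in this section.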

After editing the ConfigMap, restart Kueue so it picks up the change:

kubectl -n kueue-system rollout restart deployment kueue-controller-manager

Warning

Common gotcha: jobs submitted to a non-managed namespace are silently ignored by Kueue. They run as plain pods with no gating and no Workload is created. If your kubectl get workload is empty when you expect rows, check this first.

For the rest of this tutorial we will use the default namespace.

Define the quota

You need three Kueue objects:

  1. ResourceFlavor — labels a class of hardware. Can be as simple as "generic node" or as specific as "A100 40GB on us-east-1b."
  2. ClusterQueue — a pool of quota Kueue hands out. Cluster-scoped.
  3. LocalQueue — a namespace-scoped pointer to a ClusterQueue. This is what your jobs reference by name.

Find out what resources your nodes expose

Before you can write quotas, you need to know what schedulable resources exist on your Verda nodes — especially if you use RDMA or other custom hardware.

kubectl get nodes -o json | jq -r '
  .items[]
  | select(.metadata.labels["nvidia.com/gpu.count"])
  | .metadata.name as $name
  | .status.allocatable
  | to_entries[]
  | "\($name)  \(.key)=\(.value)"
' | sort -u

You should see cpu, memory, nvidia.com/gpu, and — depending on your cluster SKU — something like rdma/rdma_shared_device_a or nvidia.com/hostdev. Write down the resource names you will need to request; Kueue must cover all of them, or it will not admit your workload.

Decide quota sizing

Two common patterns:

  • Full-cluster quota — let any one tenant consume the entire cluster. Good for solo projects and smoke tests.
  • Fractional quotas — give each team a slice, leaving headroom for others. Good for shared clusters.

For the tutorial we will size to the full cluster. Adjust for your own SKU:

cluster has N nodes, each with:
  C vCPUs, M GB RAM, G GPUs, R RDMA devices

ClusterQueue nominalQuota:
  cpu:                      N * C
  memory:                   N * M Gi
  nvidia.com/gpu:           N * G
  rdma/rdma_shared_device_a: N * R   (if your cluster has this)
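The arithmetic is trivial, but worth scripting so you can regenerate quotas when the cluster grows. A shell sketch, pre-filled with numbers back-solved from the manifest in the next section (2 nodes at roughly 240 vCPU / 2000 GiB / 8 GPUs / 1 RDMA device each — substitute your own allocatable values from the kubectl output above):

```shell
# Node shape — replace with your cluster's actual per-node allocatable values.
N=2      # nodes
C=240    # vCPUs per node
M=2000   # GiB RAM per node
G=8      # GPUs per node
R=1      # RDMA devices per node

echo "cpu:                       $((N * C))"
echo "memory:                    $((N * M))Gi"
echo "nvidia.com/gpu:            $((N * G))"
echo "rdma/rdma_shared_device_a: $((N * R))"
```

With these inputs the script prints the exact nominalQuota values used in the example manifest.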

Write the manifest

Save as kueue-queue.yaml. Adjust nominalQuota values to match your cluster. Remove or add coveredResources lines to match what your pods actually request.

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: default-flavor
# Empty spec = matches any node. For heterogeneous clusters, add
# nodeLabels/taints/tolerations here to bind the flavor to specific nodes.
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: my-cluster-queue
spec:
  namespaceSelector: {}          # allow any namespace's LocalQueue to reference this
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - nvidia.com/gpu
    - rdma/rdma_shared_device_a  # omit if your cluster doesn't use this
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: "480"       # adjust
      - name: memory
        nominalQuota: "4000Gi"    # adjust
      - name: nvidia.com/gpu
        nominalQuota: "16"        # adjust
      - name: rdma/rdma_shared_device_a
        nominalQuota: "2"         # adjust
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: my-local-queue
  namespace: default              # must match a Kueue-managed namespace
spec:
  clusterQueue: my-cluster-queue

Apply it:

kubectl apply -f kueue-queue.yaml
kubectl get clusterqueue,localqueue,resourceflavor

Confirm the ClusterQueue reports Active: True:

kubectl get clusterqueue my-cluster-queue \
  -o jsonpath='{.status.conditions[0]}' | jq

Point SkyPilot at the LocalQueue

Align your kubeconfig namespace

Warning

This is the step people miss. SkyPilot submits pods into your kubeconfig's current-context namespace. For Kueue to admit them, that namespace must match both:

  • the namespace your LocalQueue lives in, and
  • a namespace selected by Kueue's managedJobsNamespaceSelector.

Set it:

kubectl config set-context --current --namespace=default
kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}'
# → default

SkyPilot has no namespace: config key — kubeconfig is the only control.

The Kueue stanza

In any SkyPilot task YAML, add config.kubernetes.kueue to route the job through your LocalQueue:

config:
  kubernetes:
    kueue:
      local_queue_name: my-local-queue   # from the previous section

When you sky launch, SkyPilot automatically adds these labels and annotations to each pod:

  • kueue.x-k8s.io/queue-name: my-local-queue (label)
  • kueue.x-k8s.io/pod-group-name: <unique-job-id> (label)
  • kueue.x-k8s.io/pod-group-total-count: "<N>" (annotation)

The last one is what enables gang scheduling — it tells Kueue "this pod is part of a group of N, do not admit any of us until all N can fit."
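Concretely, each pod's metadata ends up looking roughly like this (values illustrative — SkyPilot derives the real group name from the job's unique ID):

```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: my-local-queue
    kueue.x-k8s.io/pod-group-name: tt-tutorial-2ea4   # illustrative value
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "2"         # N pods in the gang
```

Once a job is running, you can confirm what actually landed with kubectl get pod <pod> -o jsonpath='{.metadata.labels}'.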

The next section puts this together with a real multi-node training job.

Worked example: Multi-node TorchTitan training

This runs a 2-node TorchTitan job on 16 GPUs (2 × 8 B300), submitted through SkyPilot and gated by Kueue. Loss should converge from ~8.2 down to ~5.7 over 50 steps.

Info

The debug_model config is a tiny toy model meant to validate the cluster and plumbing end-to-end — it is not a real training run. See Scaling up for pointers to real Llama 3 8B / 70B runs.

Write the task YAML

Save the following as torchtitan-sky.yaml:

name: torchtitan-tutorial

num_nodes: 2

resources:
  infra: k8s
  accelerators: B300:8           # adjust for your GPU SKU
  image_id: docker:nvcr.io/nvidia/pytorch:25.08-py3
  cpus: 32+
  memory: 192+

# Route the job through Kueue's LocalQueue and inject RDMA + IPC_LOCK.
# SkyPilot's short form covers CPU/memory/GPU but not custom cluster
# resources, so we pass RDMA as a raw pod spec fragment.
config:
  kubernetes:
    kueue:
      local_queue_name: my-local-queue
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                rdma/rdma_shared_device_a: "1"
              requests:
                rdma/rdma_shared_device_a: "1"
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]

envs:
  NCCL_DEBUG: INFO
  NCCL_SOCKET_IFNAME: eth0
  NCCL_IB_TIMEOUT: "22"
  NCCL_IB_RETRY_CNT: "7"
  NCCL_NET_GDR_LEVEL: "0"

setup: |
  set -ex
  cd ~
  if [ ! -d torchtitan ]; then
    git clone https://github.com/pytorch/torchtitan.git
  fi
  cd torchtitan
  # Pin to a commit whose torch API usage matches NGC 25.08's torch build.
  # Newer torchtitan commits import APIs (HuggingFaceStorageWriter,
  # DefaultStager) that are still private in this torch build.
  git checkout b0902b29
  # Install torchtitan's deps, minus torch* (NGC image already provides it).
  grep -E -v '^(torch|torchvision|torchaudio)([[:space:]]|$|[<>=])' \
    requirements.txt > /tmp/req-notorch.txt
  pip install --no-cache-dir -r /tmp/req-notorch.txt
  pip install --no-cache-dir tiktoken blobfile pyyaml
  pip install --no-cache-dir --no-deps -e .
  python -c "import torchtitan, torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"

run: |
  set -ex
  cd ~/torchtitan
  MASTER_ADDR=$(echo $SKYPILOT_NODE_IPS | awk '{print $1}')
  MASTER_PORT=29500
  echo "rank=$SKYPILOT_NODE_RANK nnodes=$SKYPILOT_NUM_NODES gpus/node=$SKYPILOT_NUM_GPUS_PER_NODE master=$MASTER_ADDR"

  CONFIG=./torchtitan/models/llama3/train_configs/debug_model.toml

  torchrun \
    --nproc-per-node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$SKYPILOT_NUM_NODES \
    --node-rank=$SKYPILOT_NODE_RANK \
    --rdzv-id=100 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m torchtitan.train \
    --job.config-file "$CONFIG" \
    --training.steps=50

What the YAML does

  • num_nodes: 2 + accelerators: B300:8 — request 2 pods, each with 8 B300 GPUs. SkyPilot translates this into the right nvidia.com/gpu request and nvidia.com/gpu.product nodeSelector automatically.
  • image_id — NGC PyTorch container as the base. Ships with a Blackwell-tuned torch build, so no CUDA install needed. The 25.08-py3 tag in particular has the torch==2.8 nightly that pairs with our pinned torchtitan commit.
  • config.kubernetes.kueue.local_queue_name — routes the job through the LocalQueue you defined earlier. SkyPilot tags all pods so Kueue admits them as an atomic group.
  • config.kubernetes.pod_config — SkyPilot's escape hatch for Kubernetes fields it does not model natively. Here we add the RDMA resource request and the IPC_LOCK Linux capability that RDMA requires for memory pinning.
  • envs — NCCL tuning for the RDMA fabric.
  • setup — runs once per node on first launch. Clones torchtitan, pins to a known-good commit, installs its deps (avoiding the torch that NGC already provides), and installs torchtitan itself in editable mode.
  • run — the actual training command. SkyPilot populates $SKYPILOT_NODE_IPS, $SKYPILOT_NODE_RANK, etc. on each pod so torchrun can wire up the distributed rendezvous.

Launch the job

sky launch -c tt-tutorial -y torchtitan-sky.yaml

This does four things in sequence:

  1. Provisions 2 pods on the cluster (takes ~30–60 s; mostly image pull).
  2. Syncs files and env.
  3. Runs setup on each pod (clones + installs torchtitan — ~1–2 min first time).
  4. Runs run — torchrun fires up 16 ranks and starts training.

Watch the live output in your terminal, or stream logs from a separate shell:

sky logs tt-tutorial

Verify Kueue admitted it

Kueue creates a Workload object for each gated job. From another shell:

kubectl get workload
# NAME                  QUEUE            RESERVED IN        ADMITTED  AGE
# tt-tutorial-xxxxxx   my-local-queue   my-cluster-queue   True      30s

Detailed conditions:

kubectl get workload <workload-name> -o jsonpath='{.status.conditions[*].type}={.status.conditions[*].status}{"\n"}'
# QuotaReserved=True  Admitted=True  PodsReady=True

Check ClusterQueue usage:

kubectl get clusterqueue my-cluster-queue -o jsonpath='{.status.flavorsUsage}' | jq

You should see non-zero total values for each resource — proof that Kueue is actually reserving capacity against this workload (as opposed to the workload running as a plain ungated pod, which would leave usage at zero).

What success looks like

After setup finishes, you should see per-rank training step logs like:

step:  1  loss:  8.1898  grad_norm:  0.2366  tps: 763      tflops: 0.05   mfu: 0.02%
step:  2  loss:  8.1361  grad_norm:  0.2334  tps: 405,648  tflops: 29.17  mfu: 9.35%
...
step: 36  loss:  5.6757  grad_norm:  0.1597  tps: 564,898  tflops: 40.62  mfu: 13.02%

And at the end:

✓ Job finished (status: SUCCEEDED).

Things to check:

  • All 16 ranks show the same loss at each step — NCCL collectives are correct.
  • Loss decreases monotonically from ~8.2 to ~5.7 — the model is actually training.
  • TFLOPS settles in the 30–45 range per rank — GPUs are doing work, not just sitting.
  • MFU around 10–15% is expected and low — debug_model is a toy (tens of thousands of parameters). Real Llama 3 8B/70B runs will show MFU in the 40–55% range.

Demo queueing behavior (optional)

The payoff of Kueue is most visible when jobs compete. With the first job still running and the cluster fully reserved, submit a second identical job:

sky launch -c tt-tutorial-2 -y torchtitan-sky.yaml

The second job's pods will not start — each carries a kueue.x-k8s.io/admission scheduling gate (visible under spec.schedulingGates):

kubectl get workload
# NAME                     QUEUE            ADMITTED  AGE
# tt-tutorial-xxxxxx      my-local-queue   True      2m
# tt-tutorial-2-yyyyyy    my-local-queue   False     10s    ← queued

When the first job finishes and releases its quota, Kueue will automatically admit the second one and its pods start. This is the deadlock-free behavior you cannot get without a gang scheduler.

Tear down

The cluster stays up after the job finishes so you can re-use it:

sky logs tt-tutorial                               # re-stream logs
sky exec tt-tutorial ...                           # run more commands
sky launch -c tt-tutorial torchtitan-sky.yaml      # re-run (setup is cached)

When you are done:

sky down tt-tutorial tt-tutorial-2     # tear down both SkyPilot clusters
kubectl delete -f kueue-queue.yaml     # remove the queue defs
kubectl delete -f kueue-install.yaml   # uninstall Kueue entirely (optional)

Scaling up

To go from the smoke test to a real run:

  • Bigger model — swap debug_model.toml for llama3_8b.toml or llama3_70b.toml under torchtitan/models/llama3/train_configs/. You will also need Hugging Face tokens for the tokenizer and dataset — set them as SkyPilot envs.
  • More nodes — increase num_nodes up to your cluster's GPU capacity. Bump the ClusterQueue nominalQuota to match. The rest of the YAML does not change.
  • More steps — bump --training.steps=50 in the run section.
  • Checkpointing — add --checkpoint.enable_checkpoint=true and mount shared storage (e.g. cephfs-pvc) via pod_config so ranks write to one volume.
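For the checkpointing bullet, a minimal sketch of the shared-storage wiring via pod_config — assuming your cluster already has a ReadWriteMany PVC named cephfs-pvc (the claim name and mount path here are placeholders; adjust both):

```yaml
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - volumeMounts:
              - name: ckpt
                mountPath: /checkpoints   # point torchtitan's checkpoint folder here
        volumes:
          - name: ckpt
            persistentVolumeClaim:
              claimName: cephfs-pvc       # assumption: an RWX PVC with this name exists
```

Every rank on every node then sees the same /checkpoints directory, which is what distributed checkpointing requires.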

Troubleshooting

The failure modes you will hit, in rough order of frequency:

kubectl get workload is empty, but my pods are running

Kueue does not see the pods at all. Almost always a namespace mismatch:

  1. Check kubeconfig: kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}'
  2. Check LocalQueue's namespace: kubectl get localqueue -A
  3. Check Kueue's managed namespaces: kubectl -n kueue-system get cm kueue-manager-config -o yaml | grep -A6 managedJobsNamespaceSelector

All three must agree. The quickest fix is usually kubectl config set-context --current --namespace=<ns-where-localqueue-lives>.

Workload exists but stays Admitted: False forever

The ClusterQueue cannot satisfy the request. Inspect the Workload's conditions:

kubectl get workload <name> -o jsonpath='{.status.conditions}' | jq

Common reasons:

  • Your pods request a resource (e.g. rdma/rdma_shared_device_a) that the ClusterQueue's coveredResources does not include. Fix: add it.
  • The requested quantity exceeds nominalQuota. Fix: bump the quota or submit a smaller job.
  • Another Workload is holding the quota. Either wait or expand the quota.

Pod spec looks wrong (missing RDMA, missing IPC_LOCK, etc.)

Your pod_config block did not merge. Re-check:

  • YAML indentation is correct.
  • You edited the task YAML (what sky launch reads) and not a stale copy.
  • Restart the sky API server if you suspect caching: sky api stop && sky api start (or just pkill -f "sky.server").

ImportError: cannot import name 'HuggingFaceStorageWriter'

TorchTitan main uses a torch API that a given NGC image does not expose yet. The pin git checkout b0902b29 in setup avoids this; if you adopt a newer torchtitan commit, you may need a newer NGC image (25.09-py3, 25.10-py3, etc.) to match.

Pods stuck Pending

Run kubectl describe pod <pod-name> — usually means insufficient GPUs or wrong nodeSelector. Verify sky show-gpus --infra k8s shows free GPUs matching your accelerators: request. If Kueue admitted the Workload but pods still won't schedule, the node-level resources don't match what the ClusterQueue promised.

NCCL hangs or bandwidth looks low

Confirm the RDMA resource request actually landed:

kubectl get pod <pod> -o json | jq '.spec.containers[0].resources'

Output should show rdma/rdma_shared_device_a. If missing, the config.kubernetes.pod_config block did not merge — check indentation.

What's next

  • Multiple queues per team — create a LocalQueue per namespace/team, each pointing at the same or different ClusterQueues.
  • Cohorts and borrowing — lets teams borrow each other's idle quota. See Kueue cohorts docs.
  • Preemption — interrupt low-priority jobs to make room for high-priority ones. Requires WorkloadPriorityClass.
  • Provisioning integration — on autoscaling clusters, Kueue can trigger scale-up (GKE, Karpenter, Nebius, etc.) via ProvisioningRequest.
  • Managed jobs — use sky jobs launch instead of sky launch to run the job under a controller that handles restarts, preemption recovery, etc.
  • Dynamo inference — once you have a checkpoint, deploy it with Dynamo. See the Dynamo inference tutorial.

Reference: how the pieces fit together

YOU                                           CLUSTER
───                                           ───────
~/.kube/config                                 kueue-system/
  └─ context.namespace ←─────────────────┐      ├─ kueue-controller-manager
                                         │      └─ configmap/kueue-manager-config
~/.sky/config.yaml         ─────────┐    │            └─ managedJobsNamespaceSelector
my-task.yaml                        │    │                     │
  config.kubernetes.kueue.     ──┐  │    │                     ▼
    local_queue_name: X          │  │    │       <ns>/localqueue/X ←── must exist
                                 │  │    │              │
                                 │  │    │              ▼
                                 │  │    │       clusterqueue/Y
                                 │  │    │              │
                                 │  │    │              ▼
                                 │  │    │       resourceflavor/Z
                                 │  │    │              │
                                 ▼  ▼    ▼              ▼
           sky launch ── SkyPilot CLI ── creates Pod in <ns> ──▶ Kueue webhook
                                                               creates Workload,
                                                               gates pod until quota fits

The invariant: kubeconfig.namespace == LocalQueue.namespace, and task.local_queue_name == LocalQueue.name, and the ClusterQueue it points at must cover every resource your pods request. Break any link and Kueue silently ignores the job.