Gang-scheduled Multi-node Training with SkyPilot + Kueue¶
This tutorial wires up SkyPilot and Kueue on a Verda Kubernetes Instant Cluster so you can launch multi-node training jobs with a single sky launch and have them gang-scheduled — all pods of a job admit together, or none do, and excess jobs wait in a FIFO queue instead of half-starting and hogging GPUs.
The worked example is a 2-node, 16-GPU TorchTitan Llama 3 debug_model run.
Prerequisites¶
- A Verda Kubernetes Instant Cluster with at least two GPU nodes. The example assumes B300; adjust the accelerator type for other SKUs.
- Local kubectl configured for the cluster (kubectl get nodes lists your workers).
- Python 3.10+ on your workstation.
- cluster-admin rights (Kueue installs cluster-scoped CRDs and webhooks).
Step 1 — Install SkyPilot¶
python3 -m venv ~/sky-env
source ~/sky-env/bin/activate
pip install 'skypilot[kubernetes]'
sky check k8s # should report "Kubernetes: enabled"
sky gpus list --infra k8s # should list your GPU type (e.g. B300)
If either check fails, SkyPilot cannot reach the cluster — re-check your kubeconfig (kubectl config current-context).
Step 2 — Install Kueue¶
KUEUE_VERSION=v0.17.1 # check https://github.com/kubernetes-sigs/kueue/releases for latest
kubectl apply --server-side -f \
"https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/manifests.yaml"
Verify:
kubectl -n kueue-system get pods
# kueue-controller-manager-xxxxx 1/1 Running
kubectl api-resources | grep kueue
# clusterqueues, localqueues, resourceflavors, workloads, ...
By default Kueue manages the default namespace. To use a different one, see Advanced tuning below.
Step 3 — Define the quota¶
You need three Kueue objects:
- ResourceFlavor — labels a class of hardware (or "any node").
- ClusterQueue — a pool of quota Kueue hands out.
- LocalQueue — the namespace-scoped handle your jobs reference.
Fetch the manifest from the examples repo:
curl -O https://raw.githubusercontent.com/datacrunch-research/instant-cluster-examples/main/kueue-skypilot/kueue-queue.yaml
Edit kueue-queue.yaml:
- ClusterQueue.spec.resourceGroups[0].flavors[0].resources[*].nominalQuota: set each value to your cluster's total capacity (N_nodes × per_node). Defaults are sized for a 2 × B300 cluster (480 CPU, 4000 Gi memory, 16 GPU, 2 RDMA).
- coveredResources: drop rdma/rdma_shared_device_a if your cluster does not expose it.
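The nominalQuota arithmetic is node count times per-node capacity. A minimal shell sketch, assuming per-node figures of 240 CPU, 2000 Gi memory, 8 GPUs, and 1 RDMA device (the 2 × B300 defaults divided by two; read your real values from kubectl describe node). The helper name quota_total is hypothetical:

```shell
# Hypothetical helper: total quota = number of nodes x per-node capacity.
quota_total() {
  local nodes="$1" per_node="$2"
  echo $(( nodes * per_node ))
}

N_NODES=2   # assumed node count; change to match your cluster

# Per-node figures below are assumptions derived from the tutorial defaults.
echo "cpu:    $(quota_total "$N_NODES" 240)"
echo "memory: $(quota_total "$N_NODES" 2000)Gi"
echo "gpu:    $(quota_total "$N_NODES" 8)"
echo "rdma:   $(quota_total "$N_NODES" 1)"
```

With N_NODES=2 this reproduces the defaults in the manifest. If system pods reserve part of a node, set the quota to what is actually allocatable rather than the raw hardware total.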
Apply and verify:
kubectl apply -f kueue-queue.yaml
kubectl get clusterqueue my-cluster-queue \
-o jsonpath='{.status.conditions[0]}' | jq
# expect: "type":"Active","status":"True"
Step 4 — Align your kubeconfig namespace¶
SkyPilot submits pods into your kubeconfig's current-context namespace, and Kueue only admits pods in namespaces it manages. Both must match the namespace your LocalQueue lives in.
Warning
This is the step people miss. If kubectl get workload is empty after a sky launch, kubeconfig namespace ≠ LocalQueue namespace is almost always the cause.
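One way to check the alignment before launching is to compare the two namespaces directly. This is a sketch, assuming the LocalQueue keeps the tutorial's default name my-local-queue; effective_ns is a hypothetical helper:

```shell
# An unset kubeconfig namespace means kubectl falls back to "default".
effective_ns() {
  echo "${1:-default}"
}

LOCAL_QUEUE="my-local-queue"   # assumed: the default LocalQueue name from this tutorial

if command -v kubectl >/dev/null 2>&1; then
  # Namespace of the current kubeconfig context.
  kube_ns="$(effective_ns "$(kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}')")"
  # Namespace(s) that contain a LocalQueue with the expected name.
  queue_ns="$(kubectl get localqueue -A \
    -o jsonpath="{.items[?(@.metadata.name==\"${LOCAL_QUEUE}\")].metadata.namespace}")"
  if [ "$kube_ns" = "$queue_ns" ]; then
    echo "OK: kubeconfig and LocalQueue both use namespace '$kube_ns'"
  else
    echo "MISMATCH: kubeconfig='$kube_ns' LocalQueue='${queue_ns:-<not found>}'"
    echo "fix: kubectl config set-context --current --namespace=<ns-where-localqueue-lives>"
  fi
fi
```

The check does not cover the third requirement, that Kueue manages the namespace; the Troubleshooting section shows how to inspect managedJobsNamespaceSelector.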
Step 5 — Launch the TorchTitan job¶
curl -O https://raw.githubusercontent.com/datacrunch-research/instant-cluster-examples/main/kueue-skypilot/torchtitan-sky.yaml
Edit torchtitan-sky.yaml:
- num_nodes: number of pods (default 2).
- resources.accelerators: B300:8, H100:8, A100-80GB:8, etc. Must match sky gpus list --infra k8s.
- config.kubernetes.kueue.local_queue_name: must match your LocalQueue (default my-local-queue).
Drop the rdma/rdma_shared_device_a request under config.kubernetes.pod_config if your cluster does not expose RDMA.
Launch:
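A launch command consistent with the names used elsewhere in this tutorial would look like the sketch below. The cluster name tt-tutorial is an assumption taken from the sky logs and sky down commands, and the task file is the one fetched above:

```shell
CLUSTER_NAME="tt-tutorial"        # assumed: the name sky logs / sky down use in this tutorial
TASK_YAML="torchtitan-sky.yaml"   # the task file fetched above

# Launch only if the task file is in the current directory.
if [ -f "$TASK_YAML" ]; then
  sky launch -c "$CLUSTER_NAME" "$TASK_YAML"
else
  echo "missing $TASK_YAML: fetch and edit it first" >&2
fi
```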
This provisions the pods, runs setup (clones and installs torchtitan, roughly 1-2 min the first time), and runs torchrun across 16 ranks. Stream logs from another shell with sky logs tt-tutorial.
Verify Kueue admitted it¶
kubectl get workload
# NAME                 QUEUE            RESERVED IN        ADMITTED   AGE
# tt-tutorial-xxxxxx   my-local-queue   my-cluster-queue   True       30s
If ADMITTED stays False for more than a few seconds, see Troubleshooting.
What success looks like¶
Per-rank training logs:
step: 1 loss: 8.1898 grad_norm: 0.2366 tps: 763 tflops: 0.05 mfu: 0.02%
step: 2 loss: 8.1361 grad_norm: 0.2334 tps: 405,648 tflops: 29.17 mfu: 9.35%
...
step: 36 loss: 5.6757 grad_norm: 0.1597 tps: 564,898 tflops: 40.62 mfu: 13.02%
Followed by ✓ Job finished (status: SUCCEEDED).
Checks:
- All 16 ranks show the same loss at each step → NCCL collectives are correct.
- Loss decreases monotonically from ~8.2 to ~5.7 → the model is training.
- TFLOPS settles in the 30–45 range per rank → GPUs are doing work.
- MFU around 10–15% is expected: debug_model is a toy, and real Llama 3 8B/70B runs reach 40–55%.
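The first two checks can be automated against captured logs. A minimal sketch, assuming each log line carries step: N and loss: X fields as in the sample above, and that the output of sky logs tt-tutorial has been saved to a file (check_losses and train.log are hypothetical names):

```shell
# Reads training log lines on stdin. Fails (exit 1) if two lines for the same
# step report different losses, or if the loss does not drop between steps.
check_losses() {
  awk '
    {
      step = ""; loss = ""
      for (i = 1; i < NF; i++) {
        if ($i == "step:") step = $(i + 1)
        if ($i == "loss:") loss = $(i + 1)
      }
      if (step == "" || loss == "") next          # not a training line
      if (step in seen) {                         # another rank, same step
        if (seen[step] != loss) { print "rank mismatch at step " step; bad = 1 }
        next
      }
      seen[step] = loss
      if (n > 0 && loss + 0 >= prev + 0) { print "loss did not decrease at step " step; bad = 1 }
      prev = loss; n++
    }
    END { if (n == 0 || bad) exit 1 }
  '
}

# Usage:
#   check_losses < train.log && echo "losses look healthy"
```

Longer runs can show small upticks between steps, so treat a failure as a prompt to read the logs, not as proof of a bug.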
(Optional) Demo queueing behavior¶
The payoff of Kueue is most visible when jobs compete. With the first job still running:
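A second launch that competes for the same quota could look like this sketch. The cluster name tt-tutorial-2 is an assumption taken from the tear-down command, and the task file is reused unchanged:

```shell
SECOND_CLUSTER="tt-tutorial-2"    # assumed: the second name passed to sky down in the tear-down step
TASK_YAML="torchtitan-sky.yaml"   # same task file as the first job

if [ -f "$TASK_YAML" ]; then
  # Same task under a new cluster name: with the default quota there are no
  # GPUs left for it until the first job releases its pods.
  sky launch -c "$SECOND_CLUSTER" "$TASK_YAML"
else
  echo "missing $TASK_YAML: fetch and edit it first" >&2
fi
```

Run it from a second shell so the first job's logs stay visible.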
The second job's pods stay gated:
kubectl get workload
# tt-tutorial-xxxxxx my-local-queue True 2m
# tt-tutorial-2-yyyyyy my-local-queue False 10s ← queued
When the first job's pods are torn down and quota releases, the second admits automatically.
Warning
sky launch -c <name> creates a persistent cluster — pods stay running after the training command exits, so quota stays reserved. To see the handoff, sky down tt-tutorial once the first run prints SUCCEEDED, or use sky jobs launch (auto-terminates pods on completion).
Tear down¶
sky down tt-tutorial tt-tutorial-2 # release the jobs
kubectl delete -f kueue-queue.yaml # remove queue defs
# Optional: uninstall Kueue
# kubectl delete -f https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/manifests.yaml
Scaling up¶
- Bigger model: swap debug_model.toml for llama3_8b.toml or llama3_70b.toml under torchtitan/models/llama3/train_configs/. Set HuggingFace tokens as SkyPilot envs.
- More nodes: bump num_nodes and the ClusterQueue nominalQuota to match.
- More steps: bump --training.steps=50 in the run block.
- Checkpointing: add --checkpoint.enable_checkpoint=true and mount shared storage via pod_config.
Troubleshooting¶
kubectl get workload is empty, but my pods are running¶
Kueue does not see the pods — almost always a namespace mismatch. The three namespaces must agree:
kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}' # kubeconfig
kubectl get localqueue -A # LocalQueue
kubectl -n kueue-system get cm kueue-manager-config -o yaml | grep -A6 managedJobsNamespaceSelector # Kueue managed namespaces
Quickest fix: kubectl config set-context --current --namespace=<ns-where-localqueue-lives>.
Workload exists but stays Admitted: False¶
The ClusterQueue cannot satisfy the request:
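One way to see why is to print the Workload's status conditions. This is a sketch: workload_conditions is a hypothetical helper, and the workload name comes from kubectl get workload. The message on the QuotaReserved condition typically names the resource or flavor that could not be satisfied:

```shell
# Print each condition of a Workload as TYPE=STATUS: message.
workload_conditions() {
  local workload="$1"
  kubectl get workload "$workload" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
}

# Usage (substitute the real name from `kubectl get workload`):
#   workload_conditions tt-tutorial-xxxxxx
#   kubectl describe clusterqueue my-cluster-queue   # quota vs. current usage
```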
Common causes: pods request a resource the ClusterQueue's coveredResources does not include (add it), or the request exceeds nominalQuota (bump quota or shrink job).
Workload Admitted=True but pods stuck Pending with Insufficient nvidia.com/gpu¶
Kueue's quota math says GPUs are free but the kube-scheduler refuses — something outside Kueue's view is holding them (Slurm/Slinky workers, raw Deployments). Find the holder:
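A sketch of one way to list GPU-requesting pods across all namespaces, assuming GPUs are exposed as nvidia.com/gpu and requested under resources.requests (gpu_holders is a hypothetical helper):

```shell
# Keep the header row plus rows whose last column (GPU) is not "<none>".
gpu_holders() {
  awk 'NR == 1 || $NF != "<none>"'
}

if command -v kubectl >/dev/null 2>&1; then
  # One row per pod: namespace, name, node, and per-container GPU requests.
  kubectl get pods -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,GPU:.spec.containers[*].resources.requests.nvidia\.com/gpu' \
    | gpu_holders
fi
```

Pods in namespaces Kueue does not manage, or pods without a matching Workload, are the likely holders. The listing also includes completed pods, which no longer hold GPUs.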
Fix by freeing those GPUs (e.g. kubectl -n slurm patch nodeset slurm-worker-slinky --type=merge -p '{"spec":{"replicas":0}}'), shrinking the ClusterQueue nominalQuota to match what is actually free, or bringing the other workload under Kueue.
Advanced tuning¶
Optional hardening that pays off on shared or production clusters. Skip on a first run.
Manage a non-default namespace¶
Kueue ships with managedJobsNamespaceSelector set to default. To watch a different namespace, edit the configmap and restart the controller:
kubectl -n kueue-system edit configmap kueue-manager-config
# add your namespace under managedJobsNamespaceSelector.matchExpressions[0].values
kubectl -n kueue-system rollout restart deployment kueue-controller-manager
Soften webhook failure policy¶
Kueue's admission webhooks default to failurePolicy: Fail — if the controller goes unhealthy, pod creations across the whole cluster briefly hang. On shared clusters, download the manifest, flip to Ignore, and re-apply:
curl -sL "https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/manifests.yaml" -o kueue-install.yaml
sed -i 's/failurePolicy: Fail/failurePolicy: Ignore/g' kueue-install.yaml # GNU sed; on macOS/BSD use: sed -i '' 's/.../.../g'
kubectl apply --server-side -f kueue-install.yaml
Trade-off: if Kueue is unhealthy, workloads that should be gated will be admitted as plain pods. Fine for low-stakes clusters; not acceptable in regulated multi-tenant environments.
Enable waitForPodsReady (runtime gang gating)¶
Pod-group annotations gate admission against quota, but Kueue does not enforce that admitted pods reach Ready. If half of an admitted gang fails to start (bad image pull, RDMA init flake, node taint), the workload sits deadlocked while holding the GPU quota.
Patch the Kueue manager configmap to add a waitForPodsReady block under controllerManager:
waitForPodsReady:
  timeout: 15m # generous: image pull + RDMA init on first run
  recoveryTimeout: 5m
  blockAdmission: true # sequential admission, prevents same-time deadlock
  requeuingStrategy:
    timestamp: Creation
    backoffLimitCount: 5
    backoffBaseSeconds: 60
    backoffMaxSeconds: 1800
blockAdmission: true serializes admission cluster-wide — slower under load but prevents two half-admitted gangs deadlocking on each other's quota. For parallel admission without deadlock risk, look at Topology Aware Scheduling instead.
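To get a feel for the requeuingStrategy numbers: the Kueue documentation describes the delay before the n-th requeue as roughly backoffBaseSeconds × 2^(n−1) plus random jitter, capped at backoffMaxSeconds. A minimal sketch of that schedule, ignoring jitter (requeue_delay is a hypothetical helper, not part of Kueue):

```shell
# Approximate delay in seconds before the n-th requeue, without jitter.
requeue_delay() {
  local base="$1" max="$2" n="$3"
  local delay=$(( base * (1 << (n - 1)) ))   # base * 2^(n-1)
  if [ "$delay" -gt "$max" ]; then
    delay="$max"                             # cap at backoffMaxSeconds
  fi
  echo "$delay"
}

BASE=60; MAX=1800; LIMIT=5   # values from the waitForPodsReady block
for n in $(seq 1 "$LIMIT"); do
  echo "requeue $n: ~$(requeue_delay "$BASE" "$MAX" "$n")s"
done
```

With these values the five permitted requeues wait about 60, 120, 240, 480, and 960 seconds, so the 1800-second cap only matters if you raise backoffLimitCount.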
What's next¶
- Per-team queues: one LocalQueue per namespace, all pointing at the same or different ClusterQueues.
- Cohorts and borrowing: teams borrow each other's idle quota. See Kueue cohorts.
- Preemption: WorkloadPriorityClass lets high-priority jobs evict low-priority ones.
- Managed jobs: sky jobs launch runs the job under a controller that handles restarts and preemption recovery.
This tutorial intentionally covers only the happy path. For multi-tenancy, fair sharing, and provisioning integration, see kueue.sigs.k8s.io.