Gang-scheduled Multi-node Training with SkyPilot + Kueue

In this tutorial, we will run multi-node training jobs on a Verda Kubernetes instant cluster using SkyPilot as the submission layer, and add gang scheduling and queueing via Kueue.

You will end up with:

  • A SkyPilot setup that can launch distributed jobs against your cluster with a single sky launch.
  • Multi-node training jobs that are admitted as one atomic unit (all pods start together, or none do).
  • A FIFO queue that holds excess jobs when the cluster is full, instead of letting them sit half-scheduled and hog GPUs.
  • Hard quota enforcement across tenants, namespaces, or teams.
  • A runnable TorchTitan example that trains Llama 3 debug_model for 50 steps across 16 GPUs, gang-scheduled end-to-end.

Info

This tutorial is not a replacement for Kueue's upstream docs. For advanced topics (cohorts, preemption, fair sharing, multi-tenancy), see kueue.sigs.k8s.io.

Prerequisites

For this tutorial you need:

  1. A Verda Kubernetes instant cluster with at least two GPU nodes. The worked example assumes B300 nodes; adjust the accelerator type for other SKUs.
  2. Local kubectl configured to talk to the cluster. Running kubectl get nodes should list your GPU worker nodes in Ready state.
  3. Python 3.10+ on your workstation.
  4. cluster-admin rights on your cluster (Kueue installs cluster-scoped CRDs and webhooks).

Why bother with Kueue?

Out of the box, Kubernetes schedules pods one at a time. For a distributed training job — N pods that must start together — that model breaks down:

  • If only M < N pods fit, Kubernetes happily starts those M and leaves the rest pending. You now hold M GPUs doing nothing, waiting for capacity that may never come.
  • Two tenants submitting simultaneously can deadlock — each holds half their job's GPUs, neither can make progress.

Kueue sits in front of pod creation and fixes both problems:

  • Gang scheduling — a job's pods admit all-or-nothing. No partial starts.
  • Queueing — if the cluster is full, new jobs wait in line instead of taking GPUs that cannot satisfy the whole request.
  • Quotas — enforce per-namespace or per-team limits.

Install SkyPilot

Install into an isolated virtual environment so it does not interfere with any system Python:

python3 -m venv ~/sky-env
source ~/sky-env/bin/activate
pip install 'skypilot[kubernetes]'
sky --version

Expect something like skypilot, version 0.12.0 or newer.

Verify SkyPilot sees your cluster

sky check k8s

Expected output includes:

Kubernetes: enabled [compute]
    Allowed contexts:
    └── <your-context-name>: enabled.

Confirm SkyPilot can see your GPUs:

sky show-gpus --infra k8s

You should see your GPU type (e.g. B300) listed with per-node utilization info.

Warning

If this step fails, SkyPilot cannot reach the cluster. Re-check your kubeconfig and that your context is selected with kubectl config current-context.

Install Kueue

Pick a release

KUEUE_VERSION=v0.17.1   # check https://github.com/kubernetes-sigs/kueue/releases for latest
curl -sL "https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/manifests.yaml" \
  -o kueue-install.yaml

Kueue installs admission webhooks cluster-wide. By default they fail closed (failurePolicy: Fail), which means if Kueue's controller ever goes unhealthy, pod creations across your cluster briefly hang until the webhook times out.

On a shared cluster where other tenants' pod creation should not depend on Kueue's health, flip the webhooks to fail open:

sed -i 's/failurePolicy: Fail/failurePolicy: Ignore/g' kueue-install.yaml

Warning

Trade-off: if Kueue is unhealthy, workloads that should be gated by Kueue will instead be admitted as plain pods. Fine for low-stakes clusters; not acceptable in regulated multi-tenant environments.

Apply

kubectl apply --server-side -f kueue-install.yaml

Verify

kubectl -n kueue-system get pods
# kueue-controller-manager-xxxxx   1/1   Running

kubectl api-resources | grep kueue
# Should list clusterqueues, localqueues, resourceflavors, workloads, ...

Configure the namespace scope

By default, Kueue only manages jobs in namespaces it has been told to watch; in this install, the managedJobsNamespaceSelector selects only the default namespace. Pick the namespace where your training jobs will live and make sure it is selected.

Inspect the current selector:

kubectl -n kueue-system get configmap kueue-manager-config -o yaml \
  | grep -A8 managedJobsNamespaceSelector

You will see something like:

managedJobsNamespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: In
    values:
    - default

If you want Kueue to manage jobs in a different namespace (e.g. team-a), either:

  • Add the namespace to the selector values list (preferred — keeps Kueue narrowly scoped), or
  • Label the namespace to match the selector. For example if the selector is changed to matchLabels: kueue.x-k8s.io/managed: "true", you would kubectl label ns team-a kueue.x-k8s.io/managed=true.
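If you go the label route, the edited selector in kueue-manager-config would look something like this (a sketch — the label key is your choice; kueue.x-k8s.io/managed is just a common convention):

```yaml
managedJobsNamespaceSelector:
  matchLabels:
    kueue.x-k8s.io/managed: "true"   # only namespaces carrying this label are managed
```

Any namespace without the label is then invisible to Kueue — which is exactly the silent-ignore gotcha called out in the warning in this section.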

After editing the ConfigMap, restart Kueue so it picks up the change:

kubectl -n kueue-system rollout restart deployment kueue-controller-manager

Warning

Common gotcha: jobs submitted to a non-managed namespace are silently ignored by Kueue. They run as plain pods with no gating and no Workload is created. If your kubectl get workload is empty when you expect rows, check this first.

For the rest of this tutorial we will use the default namespace.

Define the quota

You need three Kueue objects:

  1. ResourceFlavor — labels a class of hardware. Can be as simple as "generic node" or as specific as "A100 40GB on us-east-1b."
  2. ClusterQueue — a pool of quota Kueue hands out. Cluster-scoped.
  3. LocalQueue — a namespace-scoped pointer to a ClusterQueue. This is what your jobs reference by name.

Find out what resources your nodes expose

Before you can write quotas, you need to know what schedulable resources exist on your Verda nodes — especially if you use RDMA or other custom hardware.

kubectl get nodes -o json | jq -r '
  .items[]
  | select(.metadata.labels["nvidia.com/gpu.count"])
  | .metadata.name as $name
  | .status.allocatable
  | to_entries[]
  | "\($name)  \(.key)=\(.value)"
' | sort -u

You should see cpu, memory, nvidia.com/gpu, and — depending on your cluster SKU — something like rdma/rdma_shared_device_a or nvidia.com/hostdev. Write down the resource names you will need to request; Kueue must cover all of them, or it will not admit your workload.

Decide quota sizing

Two common patterns:

  • Full-cluster quota — let any one tenant consume the entire cluster. Good for solo projects and smoke tests.
  • Fractional quotas — give each team a slice, leaving headroom for others. Good for shared clusters.

For the tutorial we will size to the full cluster. Adjust for your own SKU:

cluster has N nodes, each with:
  C vCPUs, M GB RAM, G GPUs, R RDMA devices

ClusterQueue nominalQuota:
  cpu:                      N * C
  memory:                   N * M Gi
  nvidia.com/gpu:           N * G
  rdma/rdma_shared_device_a: N * R   (if your cluster has this)
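The arithmetic is trivial, but worth scripting so you can regenerate quotas when the cluster grows. A shell sketch, pre-filled with numbers back-solved from the manifest in the next section (2 nodes at roughly 240 vCPU / 2000 GiB / 8 GPUs / 1 RDMA device each — substitute your own allocatable values from the kubectl output above):

```shell
# Node shape — replace with your cluster's actual per-node allocatable values.
N=2      # nodes
C=240    # vCPUs per node
M=2000   # GiB RAM per node
G=8      # GPUs per node
R=1      # RDMA devices per node

echo "cpu:                       $((N * C))"
echo "memory:                    $((N * M))Gi"
echo "nvidia.com/gpu:            $((N * G))"
echo "rdma/rdma_shared_device_a: $((N * R))"
```

With these inputs the script prints the exact nominalQuota values used in the example manifest.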

Write the manifest

Save as kueue-queue.yaml. Adjust nominalQuota values to match your cluster. Remove or add coveredResources lines to match what your pods actually request.

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: default-flavor
# Empty spec = matches any node. For heterogeneous clusters, add
# nodeLabels/taints/tolerations here to bind the flavor to specific nodes.
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: my-cluster-queue
spec:
  namespaceSelector: {}          # allow any namespace's LocalQueue to reference this
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - nvidia.com/gpu
    - rdma/rdma_shared_device_a  # omit if your cluster doesn't use this
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: "480"       # adjust
      - name: memory
        nominalQuota: "4000Gi"    # adjust
      - name: nvidia.com/gpu
        nominalQuota: "16"        # adjust
      - name: rdma/rdma_shared_device_a
        nominalQuota: "2"         # adjust
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: my-local-queue
  namespace: default              # must match a Kueue-managed namespace
spec:
  clusterQueue: my-cluster-queue

Apply it:

kubectl apply -f kueue-queue.yaml
kubectl get clusterqueue,localqueue,resourceflavor

Confirm the ClusterQueue reports Active: True:

kubectl get clusterqueue my-cluster-queue \
  -o jsonpath='{.status.conditions[0]}' | jq

Point SkyPilot at the LocalQueue

Align your kubeconfig namespace

Warning

This is the step people miss. SkyPilot submits pods into your kubeconfig's current-context namespace. For Kueue to admit them, that namespace must match both:

  • the namespace your LocalQueue lives in, and
  • a namespace selected by Kueue's managedJobsNamespaceSelector.

Set it:

kubectl config set-context --current --namespace=default
kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}'
# → default

SkyPilot has no namespace: config key — kubeconfig is the only control.

The Kueue stanza

In any SkyPilot task YAML, add config.kubernetes.kueue to route the job through your LocalQueue:

config:
  kubernetes:
    kueue:
      local_queue_name: my-local-queue   # from the previous section

When you sky launch, SkyPilot automatically adds these labels and annotations to each pod:

  • kueue.x-k8s.io/queue-name: my-local-queue (label)
  • kueue.x-k8s.io/pod-group-name: <unique-job-id> (label)
  • kueue.x-k8s.io/pod-group-total-count: "<N>" (annotation)

The last one is what enables gang scheduling — it tells Kueue "this pod is part of a group of N, do not admit any of us until all N can fit."
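Concretely, each pod's metadata ends up looking roughly like this (values illustrative — SkyPilot derives the real group name from the job's unique ID):

```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: my-local-queue
    kueue.x-k8s.io/pod-group-name: tt-tutorial-2ea4   # illustrative value
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "2"         # N pods in the gang
```

Once a job is running, you can confirm what actually landed with kubectl get pod <pod> -o jsonpath='{.metadata.labels}'.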

The next section puts this together with a real multi-node training job.

Worked example: Multi-node TorchTitan training

This runs a 2-node TorchTitan job on 16 GPUs (2 × 8 B300), submitted through SkyPilot and gated by Kueue. Loss should converge from ~8.2 down to ~5.7 over 50 steps.

Info

The debug_model config is a tiny toy model meant to validate the cluster and plumbing end-to-end — it is not a real training run. See Scaling up for pointers to real Llama 3 8B / 70B runs.

Write the task YAML

Save the following as torchtitan-sky.yaml:

name: torchtitan-tutorial

num_nodes: 2

resources:
  infra: k8s
  accelerators: B300:8           # adjust for your GPU SKU
  image_id: docker:nvcr.io/nvidia/pytorch:25.08-py3
  cpus: 32+
  memory: 192+

# Route the job through Kueue's LocalQueue and inject RDMA + IPC_LOCK.
# SkyPilot's short form covers CPU/memory/GPU but not custom cluster
# resources, so we pass RDMA as a raw pod spec fragment.
config:
  kubernetes:
    kueue:
      local_queue_name: my-local-queue
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                rdma/rdma_shared_device_a: "1"
              requests:
                rdma/rdma_shared_device_a: "1"
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]

envs:
  NCCL_DEBUG: INFO
  NCCL_SOCKET_IFNAME: eth0
  NCCL_IB_TIMEOUT: "22"
  NCCL_IB_RETRY_CNT: "7"
  NCCL_NET_GDR_LEVEL: "0"

setup: |
  set -ex
  cd ~
  if [ ! -d torchtitan ]; then
    git clone https://github.com/pytorch/torchtitan.git
  fi
  cd torchtitan
  # Pin to a commit whose torch API usage matches NGC 25.08's torch build.
  # Newer torchtitan commits import APIs (HuggingFaceStorageWriter,
  # DefaultStager) that are still private in this torch build.
  git checkout b0902b29
  # Install torchtitan's deps, minus torch* (NGC image already provides it).
  grep -E -v '^(torch|torchvision|torchaudio)([[:space:]]|$|[<>=])' \
    requirements.txt > /tmp/req-notorch.txt
  pip install --no-cache-dir -r /tmp/req-notorch.txt
  pip install --no-cache-dir tiktoken blobfile pyyaml
  pip install --no-cache-dir --no-deps -e .
  python -c "import torchtitan, torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"

run: |
  set -ex
  cd ~/torchtitan
  MASTER_ADDR=$(echo $SKYPILOT_NODE_IPS | awk '{print $1}')
  MASTER_PORT=29500
  echo "rank=$SKYPILOT_NODE_RANK nnodes=$SKYPILOT_NUM_NODES gpus/node=$SKYPILOT_NUM_GPUS_PER_NODE master=$MASTER_ADDR"

  CONFIG=./torchtitan/models/llama3/train_configs/debug_model.toml

  torchrun \
    --nproc-per-node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$SKYPILOT_NUM_NODES \
    --node-rank=$SKYPILOT_NODE_RANK \
    --rdzv-id=100 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m torchtitan.train \
    --job.config-file "$CONFIG" \
    --training.steps=50

What the YAML does

  • num_nodes: 2 + accelerators: B300:8 — request 2 pods, each with 8 B300 GPUs. SkyPilot translates this into the right nvidia.com/gpu request and nvidia.com/gpu.product nodeSelector automatically.
  • image_id — NGC PyTorch container as the base. Ships with a Blackwell-tuned torch build, so no CUDA install needed. The 25.08-py3 tag in particular has the torch==2.8 nightly that pairs with our pinned torchtitan commit.
  • config.kubernetes.kueue.local_queue_name — routes the job through the LocalQueue you defined earlier. SkyPilot tags all pods so Kueue admits them as an atomic group.
  • config.kubernetes.pod_config — SkyPilot's escape hatch for Kubernetes fields it does not model natively. Here we add the RDMA resource request and the IPC_LOCK Linux capability that RDMA requires for memory pinning.
  • envs — NCCL tuning for the RDMA fabric.
  • setup — runs once per node on first launch. Clones torchtitan, pins to a known-good commit, installs its deps (avoiding the torch that NGC already provides), and installs torchtitan itself in editable mode.
  • run — the actual training command. SkyPilot populates $SKYPILOT_NODE_IPS, $SKYPILOT_NODE_RANK, etc. on each pod so torchrun can wire up the distributed rendezvous.

Launch the job

sky launch -c tt-tutorial -y torchtitan-sky.yaml

This does four things in sequence:

  1. Provisions 2 pods on the cluster (takes ~30–60 s; mostly image pull).
  2. Syncs files and env.
  3. Runs setup on each pod (clones + installs torchtitan — ~1–2 min first time).
  4. Runs run — torchrun fires up 16 ranks and starts training.

Watch the live output in your terminal, or stream logs from a separate shell:

sky logs tt-tutorial

Verify Kueue admitted it

Kueue creates a Workload object for each gated job. From another shell:

kubectl get workload
# NAME                  QUEUE            RESERVED IN        ADMITTED  AGE
# tt-tutorial-xxxxxx   my-local-queue   my-cluster-queue   True      30s

Detailed conditions:

kubectl get workload <workload-name> -o jsonpath='{.status.conditions[*].type}={.status.conditions[*].status}{"\n"}'
# QuotaReserved=True  Admitted=True  PodsReady=True

Check ClusterQueue usage:

kubectl get clusterqueue my-cluster-queue -o jsonpath='{.status.flavorsUsage}' | jq

You should see non-zero total values for each resource — proof that Kueue is actually reserving capacity against this workload (as opposed to the workload running as a plain ungated pod, which would leave usage at zero).

What success looks like

After setup finishes, you should see per-rank training step logs like:

step:  1  loss:  8.1898  grad_norm:  0.2366  tps: 763      tflops: 0.05   mfu: 0.02%
step:  2  loss:  8.1361  grad_norm:  0.2334  tps: 405,648  tflops: 29.17  mfu: 9.35%
...
step: 36  loss:  5.6757  grad_norm:  0.1597  tps: 564,898  tflops: 40.62  mfu: 13.02%

And at the end:

✓ Job finished (status: SUCCEEDED).

Things to check:

  • All 16 ranks show the same loss at each step — NCCL collectives are correct.
  • Loss decreases monotonically from ~8.2 to ~5.7 — the model is actually training.
  • TFLOPS settles in the 30–45 range per rank — GPUs are doing work, not just sitting.
  • MFU around 10–15% is expected and low — debug_model is a toy (tens of thousands of parameters). Real Llama 3 8B/70B runs will show MFU in the 40–55% range.

Demo queueing behavior (optional)

The payoff of Kueue is most visible when jobs compete. With the first job still running and the cluster fully reserved, submit a second identical job:

sky launch -c tt-tutorial-2 -y torchtitan-sky.yaml

The second job's pods will not start — each carries a kueue.x-k8s.io/admission scheduling gate (visible under spec.schedulingGates):

kubectl get workload
# NAME                     QUEUE            ADMITTED  AGE
# tt-tutorial-xxxxxx      my-local-queue   True      2m
# tt-tutorial-2-yyyyyy    my-local-queue   False     10s    ← queued

When the first job finishes and releases its quota, Kueue will automatically admit the second one and its pods start. This is the deadlock-free behavior you cannot get without a gang scheduler.

Tear down

The cluster stays up after the job finishes so you can re-use it:

sky logs tt-tutorial                               # re-stream logs
sky exec tt-tutorial ...                           # run more commands
sky launch -c tt-tutorial torchtitan-sky.yaml      # re-run (setup is cached)

When you are done:

sky down tt-tutorial tt-tutorial-2     # tear down both SkyPilot clusters
kubectl delete -f kueue-queue.yaml     # remove the queue defs
kubectl delete -f kueue-install.yaml   # uninstall Kueue entirely (optional)

Scaling up

To go from the smoke test to a real run:

  • Bigger model — swap debug_model.toml for llama3_8b.toml or llama3_70b.toml under torchtitan/models/llama3/train_configs/. You will also need Hugging Face tokens for the tokenizer and dataset — set them as SkyPilot envs.
  • More nodes — increase num_nodes up to your cluster's GPU capacity. Bump the ClusterQueue nominalQuota to match. The rest of the YAML does not change.
  • More steps — bump --training.steps=50 in the run section.
  • Checkpointing — add --checkpoint.enable_checkpoint=true and mount shared storage (e.g. cephfs-pvc) via pod_config so ranks write to one volume.
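For the checkpointing bullet, a minimal sketch of the shared-storage wiring via pod_config — assuming your cluster already has a ReadWriteMany PVC named cephfs-pvc (the claim name and mount path here are placeholders; adjust both):

```yaml
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - volumeMounts:
              - name: ckpt
                mountPath: /checkpoints   # point torchtitan's checkpoint folder here
        volumes:
          - name: ckpt
            persistentVolumeClaim:
              claimName: cephfs-pvc       # assumption: an RWX PVC with this name exists
```

Every rank on every node then sees the same /checkpoints directory, which is what distributed checkpointing requires.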

Troubleshooting

The failure modes you will hit, in rough order of frequency:

kubectl get workload is empty, but my pods are running

Kueue does not see the pods at all. Almost always a namespace mismatch:

  1. Check kubeconfig: kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}'
  2. Check LocalQueue's namespace: kubectl get localqueue -A
  3. Check Kueue's managed namespaces: kubectl -n kueue-system get cm kueue-manager-config -o yaml | grep -A6 managedJobsNamespaceSelector

All three must agree. The quickest fix is usually kubectl config set-context --current --namespace=<ns-where-localqueue-lives>.

Workload exists but stays Admitted: False forever

The ClusterQueue cannot satisfy the request. Inspect the Workload's conditions:

kubectl get workload <name> -o jsonpath='{.status.conditions}' | jq

Common reasons:

  • Your pods request a resource (e.g. rdma/rdma_shared_device_a) that the ClusterQueue's coveredResources does not include. Fix: add it.
  • The requested quantity exceeds nominalQuota. Fix: bump the quota or submit a smaller job.
  • Another Workload is holding the quota. Either wait or expand the quota.

Pod spec looks wrong (missing RDMA, missing IPC_LOCK, etc.)

Your pod_config block did not merge. Re-check:

  • YAML indentation is correct.
  • You edited the task YAML (what sky launch reads) and not a stale copy.
  • Restart the sky API server if you suspect caching: sky api stop && sky api start (or just pkill -f "sky.server").

ImportError: cannot import name 'HuggingFaceStorageWriter'

TorchTitan main uses a torch API that a given NGC image does not expose yet. The pin git checkout b0902b29 in setup avoids this; if you adopt a newer torchtitan commit, you may need a newer NGC image (25.09-py3, 25.10-py3, etc.) to match.

Pods stuck Pending

Run kubectl describe pod <pod-name> — usually means insufficient GPUs or wrong nodeSelector. Verify sky show-gpus --infra k8s shows free GPUs matching your accelerators: request. If Kueue admitted the Workload but pods still won't schedule, the node-level resources don't match what the ClusterQueue promised.

NCCL hangs or bandwidth looks low

Confirm the RDMA resource request actually landed:

kubectl get pod <pod> -o json | jq '.spec.containers[0].resources'

Output should show rdma/rdma_shared_device_a. If missing, the config.kubernetes.pod_config block did not merge — check indentation.

What's next

  • Multiple queues per team — create a LocalQueue per namespace/team, each pointing at the same or different ClusterQueues.
  • Cohorts and borrowing — lets teams borrow each other's idle quota. See Kueue cohorts docs.
  • Preemption — interrupt low-priority jobs to make room for high-priority ones. Requires WorkloadPriorityClass.
  • Provisioning integration — on autoscaling clusters, Kueue can trigger scale-up (GKE, Karpenter, Nebius, etc.) via ProvisioningRequest.
  • Managed jobs — use sky jobs launch instead of sky launch to run the job under a controller that handles restarts, preemption recovery, etc.
  • Dynamo inference — once you have a checkpoint, deploy it with Dynamo. See the Dynamo inference tutorial.

Reference: how the pieces fit together

YOU                                           CLUSTER
───                                           ───────
~/.kube/config                                 kueue-system/
  └─ context.namespace ←─────────────────┐      ├─ kueue-controller-manager
                                         │      └─ configmap/kueue-manager-config
~/.sky/config.yaml         ─────────┐    │            └─ managedJobsNamespaceSelector
my-task.yaml                        │    │                     │
  config.kubernetes.kueue.     ──┐  │    │                     ▼
    local_queue_name: X          │  │    │       <ns>/localqueue/X ←── must exist
                                 │  │    │              │
                                 │  │    │              ▼
                                 │  │    │       clusterqueue/Y
                                 │  │    │              │
                                 │  │    │              ▼
                                 │  │    │       resourceflavor/Z
                                 │  │    │              │
                                 ▼  ▼    ▼              ▼
           sky launch ── SkyPilot CLI ── creates Pod in <ns> ──▶ Kueue webhook
                                                               creates Workload,
                                                               gates pod until quota fits

The invariant: kubeconfig.namespace == LocalQueue.namespace, and task.local_queue_name == LocalQueue.name, and the ClusterQueue it points at must cover every resource your pods request. Break any link and Kueue silently ignores the job.