Kubernetes

Choosing the Kubernetes job orchestrator provisions the Instant Cluster with Kubernetes and Slinky (Slurm running on top of Kubernetes), ready to run multi-node GPU workloads over InfiniBand — no additional setup required.

Info

The Kubernetes orchestrator is still under development and has not yet reached feature parity with the native Slurm orchestrator.


What's Included

The following components are pre-installed via Helm and ready to use:

| Component | Purpose | Details |
| --- | --- | --- |
| MPI Operator | Distributed multi-node job orchestration | Provides the MPIJob custom resource for running distributed workloads across nodes |
| NVIDIA Device Plugin | GPU scheduling | Exposes GPUs as schedulable resources (nvidia.com/gpu) so Kubernetes can assign them to pods |
| NVIDIA Network Operator | InfiniBand / RDMA networking | Configures high-speed InfiniBand networking for GPU-to-GPU communication across nodes |
| Cilium | Pod networking (CNI) | Handles standard Ethernet-based pod-to-pod and pod-to-service communication |
| Slinky | Bundled Slurm on Kubernetes | Runs a Slurm cluster in the slurm namespace so you can also submit srun / sbatch jobs; see Slurm |
| Local Disk StorageClass | Node-local NVMe storage | Each worker node's /mnt/local_disk is available as a Kubernetes StorageClass for scratch data, model caches, and checkpoints |

Slinky Slurm reserves every GPU by default

Each Slinky worker pod requests nvidia.com/gpu: 8, and there is one Slinky worker pod per GPU node — so on a fresh cluster every GPU on the cluster is allocated to Slurm and there are no GPUs left for other Kubernetes pods. An MPIJob (or any pod) that requests nvidia.com/gpu will stay Pending with Insufficient nvidia.com/gpu until you free the GPUs.

To run GPU workloads directly through Kubernetes (e.g. an MPIJob), scale the Slinky worker NodeSet down first:

kubectl scale -n slurm nodeset slurm-worker-slinky --replicas=0

Bring Slurm back when you're done:

kubectl scale -n slurm nodeset slurm-worker-slinky --replicas=<num-gpu-nodes>

The slurm-controller, slurm-accounting, slurm-login-slinky and slurm-restapi pods do not hold GPUs and can be left running.
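
To choose the value for <num-gpu-nodes>, one option is to count the nodes that advertise nvidia.com/gpu. The sketch below is not part of the cluster tooling: count_gpu_nodes is a hypothetical helper, and it assumes two-column input (node name, allocatable GPU count) such as the kubectl custom-columns output shown in the comment.

```shell
# Hypothetical helper: count nodes whose allocatable nvidia.com/gpu is > 0.
# Expects two whitespace-separated columns on stdin: node name, GPU count.
count_gpu_nodes() {
  awk '$2 + 0 > 0 { n++ } END { print n + 0 }'
}

# Intended usage on the jumphost (not executed here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | count_gpu_nodes
# Nodes without GPUs print <none> in the GPU column, which awk treats as 0.
```

The printed number can then be passed to --replicas when scaling the NodeSet back up.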


Key Concepts

MPIJob (MPI Operator)

An MPIJob is a Kubernetes custom resource provided by the MPI Operator. It is the primary way to run distributed multi-node workloads on the cluster.

When you submit an MPIJob, the operator creates:

  • A launcher pod that coordinates the job (similar to mpirun)
  • One or more worker pods that perform the actual computation

The operator handles SSH key distribution and network setup between pods automatically. You define your container image, GPU resource requests, and the command to run — the operator takes care of the rest.

MPI Operator API versions: v1 vs v2beta1

The MPI Operator supports two API versions. Your cluster uses v2beta1, which is the recommended version.

| | kubeflow.org/v1 | kubeflow.org/v2beta1 |
| --- | --- | --- |
| Worker connectivity | kubectl exec (requires API server access) | SSH (direct pod-to-pod) |
| Image requirement | No sshd needed | Must include sshd |
| Launcher networking | Goes through Kubernetes API server | Direct SSH to workers (lower latency) |
| Hostfile | Managed via ConfigMap | Written to /etc/mpi/hostfile |
| Status | Stable but older | Actively developed, recommended for new clusters |

Info

All examples in this documentation use apiVersion: kubeflow.org/v2beta1. If you see v1 examples from external sources, the main difference to be aware of is the sshd requirement — v2beta1 workers must have an SSH server in the container image.

Use MPIJobs for:

  • NCCL communication tests (e.g. all_reduce_perf)
  • Distributed PyTorch training with torchrun
  • Any workload that needs to run across multiple nodes with GPU-to-GPU communication

Here is a complete example that runs an NCCL all_reduce_perf benchmark across 2 nodes with 8 GPUs each:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  generateName: nccl-test-2n-
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: vccr.io/nccl-tests/nccl-tests:cuda13.1.1-nccl2.29.3-1-v2.17.9
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              command: ["/bin/bash", "-c"]
              args:
                - |
                  echo "=== NCCL 16-GPU Test (2 nodes) ==="

                  # Wait for MPI hostfile
                  echo "Waiting for MPI hostfile..."
                  while [ ! -f /etc/mpi/hostfile ] || [ ! -s /etc/mpi/hostfile ]; do
                    sleep 2
                  done
                  echo "Hostfile:"
                  cat /etc/mpi/hostfile

                  # Wait for workers to be reachable via SSH
                  echo "Waiting for workers..."
                  for worker in $(awk '{print $1}' /etc/mpi/hostfile); do
                    retries=0
                    until ssh -o ConnectTimeout=2 "$worker" hostname >/dev/null 2>&1; do
                      retries=$((retries + 1))
                      if [ "$retries" -ge 60 ]; then
                        echo "TIMEOUT: $worker not reachable after 5 minutes"
                        exit 1
                      fi
                      echo "  Waiting for $worker... (attempt $retries)"
                      sleep 5
                    done
                    echo "  $worker ready"
                  done
                  echo "All workers ready"

                  echo ""
                  echo "=========================================="
                  echo "Running: all_reduce_perf"
                  echo "=========================================="
                  mpirun \
                    -np 16 \
                    -bind-to none \
                    -x NCCL_IB_PKEY=1 \
                    /opt/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1

                  echo ""
                  echo "=== NCCL test completed ==="
              resources:
                requests:
                  cpu: 2
                  memory: 256Mi
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            app: nccl-test
        spec:
          containers:
            - name: worker
              image: vccr.io/nccl-tests/nccl-tests:cuda13.1.1-nccl2.29.3-1-v2.17.9
              securityContext:
                capabilities:
                  add:
                    - IPC_LOCK
              resources:
                requests:
                  cpu: 32
                  memory: 128Gi
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1
                limits:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      app: nccl-test
                  topologyKey: kubernetes.io/hostname
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 64Gi

Key details in this manifest:

  • generateName instead of name: each kubectl create generates a unique job name. Submit with kubectl create; kubectl apply rejects manifests that use generateName
  • -np 16: equals slotsPerWorker (8) multiplied by the Worker replicas (2); update it whenever you change either value
  • rdma/rdma_shared_device_a: requests RDMA device access for InfiniBand communication
  • IPC_LOCK capability: required for RDMA memory registration
  • /dev/shm: large shared memory volume for NCCL inter-process communication
  • podAntiAffinity: ensures workers are scheduled on different physical nodes
  • sshd requirement: the MPI Operator uses SSH to launch processes on workers, so your container image must include an SSH server
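
The manifest hard-codes mpirun -np 16 to match 2 workers with 8 slots each. As a rough sketch, the launcher script could derive that number from the hostfile instead. This assumes the hostfile uses the Open MPI format of one "<hostname> slots=<n>" entry per line (check with cat /etc/mpi/hostfile); total_slots is a hypothetical helper, not something shipped with the image.

```shell
# Hypothetical helper: sum the slots=<n> values of an Open MPI style hostfile.
# Assumes each line looks like "<hostname> slots=<n>"; verify your hostfile first.
total_slots() {
  awk '{
    for (i = 2; i <= NF; i++)
      if ($i ~ /^slots=[0-9]+$/) { sub(/^slots=/, "", $i); total += $i }
  } END { print total + 0 }' "$1"
}

# Intended usage inside the launcher script (not executed here):
#   mpirun -np "$(total_slots /etc/mpi/hostfile)" -bind-to none -x NCCL_IB_PKEY=1 \
#     /opt/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```

With this in place, changing the Worker replicas or slotsPerWorker would not require editing the mpirun line.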

Warning

Your container image must include /usr/sbin/sshd. Standard NGC images (e.g. nvcr.io/nvidia/pytorch:...) do not ship with an SSH server and will fail with StartError when used in an MPIJob. Either build a custom image with sshd installed, or use PyTorchJob instead (see below).

InfiniBand and NCCL Configuration

The cluster uses InfiniBand for high-speed GPU-to-GPU communication across nodes via NCCL (NVIDIA Collective Communications Library). For NCCL to work correctly over InfiniBand, set the following environment variables in your job containers. Only NCCL_IB_PKEY is required; the other two are recommended or optional.

| Environment Variable | Value | Purpose |
| --- | --- | --- |
| NCCL_IB_PKEY | 1 | Required. Tells NCCL which InfiniBand Partition Key to use. Without this, cross-node GPU communication will fail. |
| NCCL_IB_HCA | ^mlx5_0 | Recommended. Excludes the management InfiniBand port so NCCL only uses the data ports. |
| NCCL_DEBUG | INFO | Optional. Enables verbose NCCL logging, useful for troubleshooting communication issues. |
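
When the job is started with Open MPI's mpirun, variables can be forwarded to the remote ranks with the -x flag, as the NCCL test manifest does for NCCL_IB_PKEY. The following is a minimal sketch that builds the -x flags for all three settings in the table; nccl_mpirun_flags is a hypothetical helper name, and NCCL_DEBUG can be dropped once troubleshooting is done.

```shell
# Values taken from the table above; NCCL_DEBUG is optional.
NCCL_IB_PKEY=1
NCCL_IB_HCA='^mlx5_0'
NCCL_DEBUG=INFO

# Hypothetical helper: print Open MPI -x flags for the NCCL settings.
nccl_mpirun_flags() {
  printf -- '-x NCCL_IB_PKEY=%s -x NCCL_IB_HCA=%s -x NCCL_DEBUG=%s\n' \
    "$NCCL_IB_PKEY" "$NCCL_IB_HCA" "$NCCL_DEBUG"
}

nccl_mpirun_flags

# Intended usage inside a launcher script (not executed here); the unquoted
# substitution is deliberate so the flags split into separate arguments:
#   mpirun -np 16 -bind-to none $(nccl_mpirun_flags) \
#     /opt/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```

For workloads that are not launched through mpirun, the same variables can instead be set in the env section of the container spec.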

Storage

Each worker node has fast NVMe storage mounted at /mnt/local_disk, which is exposed as a Kubernetes StorageClass. This is ideal for:

  • Model weight caches — avoid re-downloading large models on every job
  • Training checkpoints — fast local writes during training
  • Scratch data — temporary files during computation

For data that needs to be shared across nodes (datasets, final model outputs), use a Persistent Volume Claim backed by the Shared Filesystem (SFS).
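
As an illustration of using the node-local StorageClass, the sketch below writes a PersistentVolumeClaim manifest to a file. Everything named here is an assumption: the StorageClass name local-disk, the claim name scratch-cache, and the 100Gi size are placeholders, and the sketch presumes the StorageClass supports dynamic provisioning. Run kubectl get storageclass to find the real name on your cluster.

```shell
# ASSUMPTION: replace with the name shown by `kubectl get storageclass`.
STORAGE_CLASS="${STORAGE_CLASS:-local-disk}"   # hypothetical placeholder name

# Write a PVC manifest for node-local scratch space.
cat > scratch-pvc.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-cache            # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce              # node-local storage is tied to a single node
  storageClassName: ${STORAGE_CLASS}
  resources:
    requests:
      storage: 100Gi             # illustrative size
EOF

# Then, on the jumphost (not executed here):
#   kubectl create -f scratch-pvc.yaml
```

Because the volume lives on one node's disk, data written through this claim is only visible to pods scheduled on that node.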


Getting Started

Accessing the Cluster

SSH into the jumphost and use kubectl. Admin credentials are pre-configured at:

/root/.kube/config
/home/ubuntu/.kube/config

k9s is also available out of the box. Verify access with:

kubectl get nodes

You should see your worker nodes in Ready status.

Running Your First Job: NCCL all_reduce Test

A pre-configured example job is available on the jumphost. This runs an NCCL all_reduce_perf benchmark across 2 nodes — it's the standard way to verify that your cluster's InfiniBand networking is healthy and performing as expected. It sets the crucial NCCL_IB_PKEY=1 environment variable so the nodes know which InfiniBand P_Key to use.

Submit the job:

kubectl create -f /home/ubuntu/verda_k8s_all_reduce_perf_2_nodes.yml
mpijob.kubeflow.org/nccl-test-2n-8cnbc created

Check pod status:

kubectl get pods
NAME                                READY   STATUS      RESTARTS   AGE
nccl-test-2n-8cnbc-launcher-przrf   0/1     Completed   4          30m

Downloading the container image on all workers may take a few minutes on first run.

View the results:

kubectl logs -f nccl-test-2n-8cnbc-launcher-przrf | tail -10
  4294967296    1073741824     float     sum      -1  9230.25  465.31  872.46       0  9221.35  465.76  873.31       0
  8589934592    2147483648     float     sum      -1  18376.5  467.44  876.45       0  18337.6  468.43  878.31       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 827.441
#
# Collective test concluded: all_reduce_perf
#

=== NCCL test completed ===

Understanding the output:

The key metric is Avg bus bandwidth, reported in GB/s. It measures how fast the GPUs can collectively communicate across the InfiniBand fabric. For a healthy cluster with 400 Gb/s InfiniBand links, expect roughly 800 GB/s or more at large message sizes. A significantly lower number may indicate a network issue (bad cable, misconfigured IB port, or wrong P_Key).
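
For automated health checks, the summary line can be extracted from the launcher logs and compared against a threshold. This is a sketch only: avg_busbw and busbw_ok are hypothetical helpers, and the 800 GB/s threshold simply mirrors the guidance above rather than any official limit.

```shell
# Hypothetical helper: read NCCL test output on stdin and print the average
# bus bandwidth in GB/s; prints nothing if the summary line is missing.
avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Hypothetical helper: succeed if the value is at least the threshold (GB/s).
busbw_ok() {
  awk -v v="$1" -v min="$2" 'BEGIN { exit !(v + 0 >= min + 0) }'
}

# Example using the summary line from the sample output above.
bw="$(printf '# Avg bus bandwidth    : 827.441\n' | avg_busbw)"
if busbw_ok "$bw" 800; then
  echo "bus bandwidth ${bw} GB/s looks healthy"
else
  echo "bus bandwidth ${bw} GB/s is below 800 GB/s, check the fabric"
fi

# Intended usage on the jumphost (not executed here):
#   kubectl logs <launcher-pod-name> | avg_busbw
```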

Bundled Slurm (Slinky)

The Kubernetes orchestrator also runs a Slinky Slurm cluster in the slurm namespace. The Slurm controller, accounting, REST API, login pod and worker pods all run as Kubernetes workloads.

$ kubectl get pods -n slurm
NAME                                  READY   STATUS    RESTARTS   AGE
mariadb-0                             1/1     Running   ...        ...
slurm-accounting-0                    1/1     Running   ...        ...
slurm-controller-0                    3/3     Running   ...        ...
slurm-login-slinky-xxxxxxxxxx-xxxxx   1/1     Running   ...        ...
slurm-restapi-xxxxxxxxxx-xxxxx        1/1     Running   ...        ...
slurm-worker-slinky-0                 2/2     Running   ...        ...
slurm-worker-slinky-1                 2/2     Running   ...        ...

sinfo, sbatch, srun etc. are available from inside the slurm-login-slinky-* pod. Exec into it to submit jobs (full examples on the Slurm page):

$ kubectl get pods -n slurm | grep login-slinky
slurm-login-slinky-xxxxxxxxxx-xxxxx   1/1     Running   2 (3h37m ago)   8h

$ kubectl exec -it -n slurm slurm-login-slinky-xxxxxxxxxx-xxxxx -- bash

root@slurm-login-slinky-xxxxxxxxxx-xxxxx:/tmp# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle slinky-[0-1]

root@slurm-login-slinky-xxxxxxxxxx-xxxxx:/tmp# srun -N 2 nvidia-smi -L
GPU 0: NVIDIA B200 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: ..
...

Info

On Kubernetes clusters, sinfo and sbatch are available from the Slurm login pod, not from the login host directly. Making them available on the login host too is in progress. Usage is identical to the native Slurm cluster; see the Slurm page.

Warning

While the Slinky NodeSet is at its default size, it owns all the cluster's GPUs (8 per Slinky worker pod, one pod per GPU node). Kubernetes pods that request nvidia.com/gpu will stay Pending. Scale the NodeSet down as described in the warning at the top of this page before running an MPIJob.

If dpkg -l inside a Slurm worker pod does not list any nvidia-* packages, that is expected. See Good to know: NVIDIA userspace inside Slinky pods.

Monitoring Jobs

Use standard kubectl commands to monitor your workloads:

# List all pods and their status
kubectl get pods

# Follow logs from a specific pod
kubectl logs -f <pod-name>

# Describe a pod for detailed status and events
kubectl describe pod <pod-name>

# List all MPIJobs
kubectl get mpijobs

For cluster-level monitoring, Grafana dashboards are pre-configured with GPU, node, and Slurm metrics. See Monitoring for details on accessing the dashboards.


Container Registry

When running jobs on Kubernetes, your container images need to be pulled from a registry. It is recommended to use authenticated access when pulling images.

Verda provides a managed container registry — see Container Registries for setup instructions.

For quick experimentation, public images from NVIDIA NGC (nvcr.io/nvidia/...) and Docker Hub can also be used.


Optional: Installing the Kubeflow Training Operator

The cluster ships with the MPI Operator (for MPIJob). If you want a higher-level abstraction for distributed training — such as PyTorchJob, which automatically injects environment variables like MASTER_ADDR, WORLD_SIZE, and RANK — you can install the Kubeflow Training Operator yourself:

kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

Verify the installation:

kubectl get crd | grep kubeflow

You should see pytorchjobs.kubeflow.org alongside the existing mpijobs.kubeflow.org.

When to use MPIJob vs PyTorchJob

| | MPIJob | PyTorchJob |
| --- | --- | --- |
| Pre-installed | Yes | No (install Training Operator first) |
| Communication | MPI (mpirun launches processes via SSH) | PyTorch Distributed (torchrun / elastic) |
| Image requirement | Must include sshd | No sshd needed; standard NGC images work |
| Env vars | Manual (NCCL_IB_PKEY, etc.) | Auto-injected (MASTER_ADDR, RANK, WORLD_SIZE) |
| Best for | NCCL tests, MPI-native workloads | PyTorch training scripts using torch.distributed |

Tutorials