Validation
When an Instant Cluster is deployed, we perform a series of early checks:
- Kernel versions
- InfiniBand card port status, configuration, and firmware versions
- ECC configuration consistency across all GPUs within each node
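The consistency checks above all follow the same pattern: collect one value per node and fail if the values differ. A minimal sketch of that pattern, assuming values arrive one per line on stdin (in production they would come from e.g. `ssh node-N uname -r` for kernel versions, or `nvidia-smi --query-gpu=ecc.mode.current --format=csv,noheader` for ECC mode). The sample data at the end stands in for the real collection step; node names and the exact commands are assumptions, not the actual tooling:

```shell
# Fail unless every line of stdin is identical (kernel versions,
# firmware versions, ECC modes, ...).
assert_uniform() {
  local label="$1" values
  values=$(sort -u)
  if [ "$(printf '%s\n' "$values" | wc -l)" -ne 1 ]; then
    printf 'FAIL: %s differs across nodes:\n%s\n' "$label" "$values" >&2
    return 1
  fi
  printf 'OK: %s = %s\n' "$label" "$values"
}

# Sample data standing in for per-node collection (e.g. ssh + uname -r):
printf '5.15.0-122-generic\n5.15.0-122-generic\n' | assert_uniform "kernel version"
```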
While the cluster is in the validating (phase 1) status - before it transitions to running - we also verify that all worker nodes report as idle in SLURM or Ready in Kubernetes.
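The SLURM side of this check can be sketched as follows. The `sinfo -h -N -o '%N %T'` invocation is standard, but its output is stubbed here with sample data (hypothetical node names) so the snippet runs anywhere; the Kubernetes equivalent would inspect `kubectl get nodes` for the Ready condition:

```shell
# In production: sample=$(sinfo -h -N -o '%N %T')
# -h: no header, -N: one line per node, %N/%T: node name and state.
sample='node-1 idle
node-2 idle
node-3 drained'

# Collect any node whose state is not "idle".
not_idle=$(printf '%s\n' "$sample" | awk '$2 != "idle" {print $1}')
if [ -n "$not_idle" ]; then
  printf 'Nodes not idle: %s\n' "$not_idle"
else
  echo "All nodes idle"
fi
```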
Nodes become idle in SLURM once the health check (see `/etc/nhc/nhc.conf` for details) passes. The health check verifies that:
- The `/` partition has less than 90% disk usage
- `dcgmi diag -r 1 -n gpu:8` passes
- All NVLinks and InfiniBand ports are up
- There are 8 InfiniBand ports at NDR or faster, all sharing the same P_Key
- All gpud checks report Healthy
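For context, entries in `nhc.conf` use LBNL NHC's `<target> || <check>` syntax. A hypothetical fragment in the spirit of the checks above (the check names are real NHC built-ins, but the arguments here are illustrative, not the shipped configuration):

```
# Root partition below 90% used:
* || check_fs_used / 90%
# InfiniBand link present and at the expected rate (Gb/s; 400 = NDR):
* || check_hw_ib 400
```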
At the very end of phase 1 we will also fail if:
- A SLURM job that runs the nccl-tests `all_reduce_perf` benchmark across all the nodes fails outright or reports too little bandwidth.
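The bandwidth gate can be sketched by parsing the summary line that `all_reduce_perf` prints at the end of its log. The log tail below is stubbed sample output and the threshold is illustrative, not the real cutoff:

```shell
# Illustrative threshold, not the actual validation cutoff.
THRESHOLD_GBPS=100

# Stub of the tail of an all_reduce_perf log:
log='# Out of bounds values : 0 OK
# Avg bus bandwidth    : 185.42'

# Extract the average bus bandwidth (GB/s) from the summary line.
busbw=$(printf '%s\n' "$log" | awk -F: '/Avg bus bandwidth/ {print $2+0}')

if awk -v b="$busbw" -v t="$THRESHOLD_GBPS" 'BEGIN { exit !(b >= t) }'; then
  printf 'bandwidth OK: %s GB/s\n' "$busbw"
else
  printf 'bandwidth too low: %s GB/s\n' "$busbw"
fi
```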
In phase 2 we run additional SLURM jobs. These are currently informational and do not affect cluster status:
- iperf from every node to node-1 to measure Ethernet bandwidth
- An `srun nvidia-smi` job, checking that it completes in a timely manner
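The "completes in a timely manner" check can be sketched with coreutils `timeout`. In the real job the wrapped command is `srun nvidia-smi`; it is stubbed here with `sleep` so the sketch runs anywhere, and the 5-second limit is illustrative:

```shell
# Stand-in for: timeout 300 srun nvidia-smi
if timeout 5 sleep 1; then
  echo "completed in time"
else
  echo "timed out"     # timeout exits with status 124 on expiry
fi
```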
Details about these extra SLURM jobs can be found in `/home/ubuntu/slurm-*.out` or by running `journalctl -u verda-validation-phase-2.service`.