Validation

When each instant cluster is deployed, we perform a series of early checks:

  • Kernel versions
  • InfiniBand card port status, configuration, and firmware versions
  • ECC configuration consistency across all GPUs within each node
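
The ECC consistency check above can be sketched as a simple uniqueness test over per-GPU ECC modes. This is a hypothetical sketch, not the deployed check; on a real node the input would come from nvidia-smi, but here a captured sample stands in so the sketch is self-contained:

```shell
# Hypothetical sketch: verify the ECC mode is identical across all 8 GPUs
# in a node. In production the input would come from:
#   nvidia-smi --query-gpu=ecc.mode.current --format=csv,noheader
# The sample below stands in for that output.
ecc_modes=$(cat <<'EOF'
Enabled
Enabled
Enabled
Enabled
Enabled
Enabled
Enabled
Enabled
EOF
)

# Count distinct values: exactly one means the configuration is consistent.
distinct=$(printf '%s\n' "$ecc_modes" | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
  echo "ECC configuration consistent across GPUs"
else
  echo "ECC mismatch detected" >&2
fi
```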

While the cluster is in validating (phase 1) status, before it transitions to running, we also verify that all worker nodes report as idle in SLURM, or Ready in Kubernetes.
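
The SLURM side of that check amounts to confirming no node reports a state other than idle. A minimal sketch, using sample sinfo output in place of a live scheduler (the node names are illustrative):

```shell
# Hypothetical sketch: every worker must be idle before the cluster leaves
# the validating phase. Real input would come from:
#   sinfo -h -N -o "%N %T"
node_states=$(cat <<'EOF'
node-1 idle
node-2 idle
node-3 idle
EOF
)

# Count nodes whose state is anything other than "idle".
not_idle=$(printf '%s\n' "$node_states" | awk '$2 != "idle"' | wc -l)
if [ "$not_idle" -eq 0 ]; then
  echo "all nodes idle"
else
  echo "$not_idle node(s) not idle" >&2
fi
```

The Kubernetes equivalent would inspect the Ready condition from kubectl get nodes in the same way.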

Nodes become idle in SLURM once the NHC health check passes (see /etc/nhc/nhc.conf for details). The health check verifies:

  • The / partition has less than 90% disk usage
  • dcgmi diag -r 1 -n gpu:8 passes
  • All NVLinks and InfiniBand ports are up
  • There are 8 InfiniBand ports at NDR or faster, all sharing the same P_Key
  • All gpud checks report Healthy
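
The disk-usage item in the list above is a simple threshold test. A minimal sketch, assuming a captured df line in place of a live filesystem (the device name and usage figure are illustrative):

```shell
# Hypothetical sketch of the root-partition check: fail when / is 90% full
# or more. In production the figure would come from something like:
#   df --output=source,pcent / | tail -1
root_usage=$(cat <<'EOF'
/dev/nvme0n1p1 86
EOF
)

pct=$(printf '%s\n' "$root_usage" | awk '{print $2}')
if [ "$pct" -lt 90 ]; then
  echo "disk check passed: ${pct}% used"
else
  echo "disk check failed: ${pct}% used" >&2
fi
```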

At the very end of phase 1, validation also fails if:

  • the SLURM job that runs nccl-tests all_reduce_perf across all the nodes fails, or completes with insufficient bandwidth.
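
The bandwidth gate can be sketched as parsing the average bus bandwidth that all_reduce_perf prints and comparing it against a minimum. This is a hypothetical sketch: the sample output line and the 320 GB/s threshold are illustrative, not the values the real gate uses:

```shell
# Hypothetical sketch: extract the "Avg bus bandwidth" summary line that
# nccl-tests all_reduce_perf prints and check it against a floor.
# The captured line below stands in for real job output.
nccl_output=$(cat <<'EOF'
# Avg bus bandwidth    : 358.41
EOF
)

avg_busbw=$(printf '%s\n' "$nccl_output" \
  | awk -F: '/Avg bus bandwidth/ {gsub(/ /,"",$2); print $2}')
min_busbw=320  # GB/s; illustrative threshold, not the production value

# Floating-point comparison via awk.
pass=$(awk -v a="$avg_busbw" -v m="$min_busbw" \
  'BEGIN { if (a >= m) print 1; else print 0 }')
if [ "$pass" -eq 1 ]; then
  echo "bandwidth check passed: ${avg_busbw} GB/s"
else
  echo "bandwidth check failed: ${avg_busbw} GB/s" >&2
fi
```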

In phase 2 we run additional SLURM jobs. These are currently informational and do not affect cluster status:

  • iperf from every node to node-1, to measure Ethernet bandwidth
  • an srun nvidia-smi job, to verify it completes in a timely manner
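
The timeliness check above can be sketched as timing a command against a deadline. This is a hypothetical sketch: the 30-second deadline is illustrative, and a no-op stands in for the real srun invocation so the sketch runs anywhere:

```shell
# Hypothetical sketch: a fast diagnostic command must finish within a
# deadline or the job is flagged. On the cluster the command would be
# something like: srun -N <num_nodes> nvidia-smi
deadline=30  # seconds; illustrative value, not the production deadline
start=$(date +%s)
true  # stand-in for the real srun nvidia-smi job
end=$(date +%s)

elapsed=$((end - start))
if [ "$elapsed" -le "$deadline" ]; then
  echo "completed in ${elapsed}s (deadline ${deadline}s)"
else
  echo "timed out after ${elapsed}s" >&2
fi
```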

Details about these extra SLURM jobs can be found in /home/ubuntu/slurm-*.out or by running journalctl -u verda-validation-phase-2.service.