Validation
When an Instant Cluster is deployed, we perform a series of early checks:
- Kernel versions
- InfiniBand card port status, configuration, and firmware versions
- ECC configuration consistency across all GPUs within each node
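The consistency checks above all follow the same pattern: collect one value per node and fail if the values differ. A minimal sketch of that pattern, assuming values arrive one per line on stdin (in production they would come from e.g. `ssh node-N uname -r` for kernel versions, or `nvidia-smi --query-gpu=ecc.mode.current --format=csv,noheader` for ECC mode). The sample data at the end stands in for the real collection step; node names and the exact commands are assumptions, not the actual tooling:

```shell
# Fail unless every line of stdin is identical (kernel versions,
# firmware versions, ECC modes, ...).
assert_uniform() {
  local label="$1" values
  values=$(sort -u)
  if [ "$(printf '%s\n' "$values" | wc -l)" -ne 1 ]; then
    printf 'FAIL: %s differs across nodes:\n%s\n' "$label" "$values" >&2
    return 1
  fi
  printf 'OK: %s = %s\n' "$label" "$values"
}

# Sample data standing in for per-node collection (e.g. ssh + uname -r):
printf '5.15.0-122-generic\n5.15.0-122-generic\n' | assert_uniform "kernel version"
```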
While the cluster is in the validating (phase 1) status - before it transitions to running - we also verify that all worker nodes report as idle in SLURM or Ready in Kubernetes.
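The SLURM side of this check can be sketched as follows. The `sinfo -h -N -o '%N %T'` invocation is standard, but its output is stubbed here with sample data (hypothetical node names) so the snippet runs anywhere; the Kubernetes equivalent would inspect `kubectl get nodes` for the Ready condition:

```shell
# In production: sample=$(sinfo -h -N -o '%N %T')
# -h: no header, -N: one line per node, %N/%T: node name and state.
sample='node-1 idle
node-2 idle
node-3 drained'

# Collect any node whose state is not "idle".
not_idle=$(printf '%s\n' "$sample" | awk '$2 != "idle" {print $1}')
if [ -n "$not_idle" ]; then
  printf 'Nodes not idle: %s\n' "$not_idle"
else
  echo "All nodes idle"
fi
```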
Nodes become idle in SLURM once the health check (see `/etc/nhc/nhc.conf` for details) passes. The health check verifies that:
- The `/` partition has less than 90% disk usage
- `dcgmi diag -r 1 -n gpu:8` passes
- All NVLinks and InfiniBand ports are up
- There are 8 InfiniBand ports at NDR or faster, all sharing the same P_Key
- All gpud checks report Healthy
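For context, entries in `nhc.conf` use LBNL NHC's `<target> || <check>` syntax. A hypothetical fragment in the spirit of the checks above (the check names are real NHC built-ins, but the arguments here are illustrative, not the shipped configuration):

```
# Root partition below 90% used:
* || check_fs_used / 90%
# InfiniBand link present and at the expected rate (Gb/s; 400 = NDR):
* || check_hw_ib 400
```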
At the very end of phase 1 we will also fail if:
- A SLURM job that runs the nccl-tests `all_reduce_perf` benchmark across all the nodes fails outright or reports too little bandwidth.
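The bandwidth gate can be sketched by parsing the summary line that `all_reduce_perf` prints at the end of its log. The log tail below is stubbed sample output and the threshold is illustrative, not the real cutoff:

```shell
# Illustrative threshold, not the actual validation cutoff.
THRESHOLD_GBPS=100

# Stub of the tail of an all_reduce_perf log:
log='# Out of bounds values : 0 OK
# Avg bus bandwidth    : 185.42'

# Extract the average bus bandwidth (GB/s) from the summary line.
busbw=$(printf '%s\n' "$log" | awk -F: '/Avg bus bandwidth/ {print $2+0}')

if awk -v b="$busbw" -v t="$THRESHOLD_GBPS" 'BEGIN { exit !(b >= t) }'; then
  printf 'bandwidth OK: %s GB/s\n' "$busbw"
else
  printf 'bandwidth too low: %s GB/s\n' "$busbw"
fi
```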
In phase 2 we run additional SLURM jobs. These are currently informational and do not affect cluster status:
- iperf from every node to node-1 to measure Ethernet bandwidth
- An `srun nvidia-smi` job, checking that it completes in a timely manner
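The "completes in a timely manner" check can be sketched with coreutils `timeout`. In the real job the wrapped command is `srun nvidia-smi`; it is stubbed here with `sleep` so the sketch runs anywhere, and the 5-second limit is illustrative:

```shell
# Stand-in for: timeout 300 srun nvidia-smi
if timeout 5 sleep 1; then
  echo "completed in time"
else
  echo "timed out"     # timeout exits with status 124 on expiry
fi
```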
Details about these extra SLURM jobs can be found in `/home/ubuntu/slurm-*.out` or by running `journalctl -u verda-validation-phase-2.service`.