# Validation

When each instant cluster is deployed, we perform a series of early checks:

* Kernel versions
* Infiniband card port status, configuration and firmware versions
* ECC configuration consistency across all GPUs within each node

While the cluster is in <mark style="color:$warning;">**validating**</mark> **(phase 1)** status, before it transitions to <mark style="color:$success;">**running**</mark>, we also verify that all worker nodes report as idle in SLURM or Ready in Kubernetes.
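
To reproduce the SLURM side of this check by hand, a minimal sketch might parse `sinfo` output and list every node that is not idle. The helper names `non_idle_nodes` and `query_sinfo` are hypothetical, and this is not the code the validation service itself runs.

```python
import subprocess


def non_idle_nodes(sinfo_output: str) -> list[str]:
    """Return the names of nodes whose state is not exactly 'idle'.

    Expects one "<node> <state>" pair per line. States such as "idle*"
    (not responding) or "down" are deliberately treated as not idle.
    """
    bad: list[str] = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # A node in several partitions is listed once per partition,
        # so avoid reporting it twice.
        if state != "idle" and name not in bad:
            bad.append(name)
    return bad


def query_sinfo() -> str:
    """Run sinfo: -N one line per node, -h no header, -o '<name> <state>'."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On a login node, `non_idle_nodes(query_sinfo())` returning an empty list would correspond to every worker node reporting idle.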

Nodes become idle in SLURM once the node health check passes. The check, configured in `/etc/nhc/nhc.conf`, verifies:

* `/` partition has less than 90% disk usage
* `dcgmi diag -r 1 -n gpu:8` passes
* All NVLinks and InfiniBand ports are up
* There are 8 InfiniBand ports at NDR or faster, all sharing the same P\_Key
* All [gpud](https://github.com/leptonai/gpud) checks report Healthy
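
As an illustration of the first item in the list above, the disk usage condition can be sketched as follows. This is an assumption-laden approximation, not the contents of `/etc/nhc/nhc.conf`; the function names and the choice to compute the percentage the way `df` does are ours.

```python
import shutil


def usage_percent(used: int, available: int) -> float:
    """Percentage used, computed like df: used / (used + available)."""
    denominator = used + available
    if denominator == 0:
        return 0.0
    return 100.0 * used / denominator


def root_partition_ok(path: str = "/", limit_percent: float = 90.0) -> bool:
    """Return True when the filesystem holding `path` is below the limit."""
    usage = shutil.disk_usage(path)
    # usage.free is the space available to unprivileged users,
    # which matches the "Avail" column reported by df.
    return usage_percent(usage.used, usage.free) < limit_percent
```

Calling `root_partition_ok()` on a node gives a quick indication of whether that node would pass the 90% threshold.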

In **phase 2** we run additional SLURM jobs. These are currently *informational* and do not affect cluster status.

* nccl-tests all\_reduce\_perf across all the nodes
* iperf from every node to node-1 to measure Ethernet bandwidth
* An srun `nvidia-smi` job, to confirm that it completes in a timely manner

Details about these extra SLURM jobs can be found in `/home/ubuntu/slurm-*.out` or by running `journalctl -u datacrunch-validation-phase-2.service`.
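
To skim those output files quickly, a small sketch like the one below could collect the bandwidth summary that nccl-tests prints at the end of a run (a line of the form `# Avg bus bandwidth    : <value>`). It assumes the phase 2 job writes that summary into the `slurm-*.out` files, which is not confirmed here; the function names are hypothetical.

```python
import glob
import re

# Summary line printed by nccl-tests at the end of a run.
BUS_BW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def avg_bus_bandwidth(text: str) -> list[float]:
    """Return every 'Avg bus bandwidth' value (GB/s) found in the text."""
    return [float(value) for value in BUS_BW.findall(text)]


def scan_outputs(pattern: str = "/home/ubuntu/slurm-*.out") -> dict[str, list[float]]:
    """Map each matching file to the bandwidth values it contains.

    Files without a bandwidth summary (for example the iperf or
    nvidia-smi jobs) are left out of the result.
    """
    results: dict[str, list[float]] = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            values = avg_bus_bandwidth(handle.read())
        if values:
            results[path] = values
    return results
```

Running `scan_outputs()` as the `ubuntu` user would then return the reported bus bandwidth per output file, if any file contains such a summary.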
