Good to know

Cluster node naming convention

Cluster node names are based on the Hostname you specify when creating the cluster:

  • Jump host: hostname-jumphost

  • Worker nodes: hostname-1, hostname-2, etc.

Validation

The job scheduler regularly runs passive health checks on every worker node. The checks themselves are defined in /etc/slurm/nhc.conf
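For instance, you can inspect which checks NHC runs and see whether any node has been drained by a failed check (the commands are guarded, since sinfo only exists on cluster nodes):

```shell
# Show the health checks NHC runs on each worker node
if [ -f /etc/slurm/nhc.conf ]; then
  cat /etc/slurm/nhc.conf
fi

# List nodes Slurm has drained, with the reason (NHC sets this on failure)
if command -v sinfo >/dev/null 2>&1; then
  sinfo -R
else
  echo "sinfo not available on this machine"
fi
```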

As part of cluster creation, we run some active checks.

The first validation phase checks a set of basic node properties.

If the first phase does not pass, the cluster will not move past the validating status.

The results of the second validation phase do not (yet) affect cluster status; you can view them by running systemctl status *-validation-phase-2

For example, the second phase runs:

  • all_reduce_perf from nccl-tests

  • iperf to check Ethernet performance
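To reproduce the NCCL check by hand, you could launch all_reduce_perf yourself. The binary path and the node/GPU counts below are assumptions about your cluster layout; adjust them to match:

```shell
# Assumed install location of nccl-tests (adjust to your cluster)
NCCL_TESTS_BIN=/opt/nccl-tests/build/all_reduce_perf

# Run an 8-GPU all-reduce sweep from 8 B to 1 GiB across 2 nodes via Slurm
if command -v srun >/dev/null 2>&1 && [ -x "$NCCL_TESTS_BIN" ]; then
  srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 \
    "$NCCL_TESTS_BIN" -b 8 -e 1G -f 2 -g 1
else
  echo "srun or nccl-tests not available here"
fi
```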

With SLURM, the jobs are stored in /home/ubuntu and can be customized and re-submitted.

To re-submit our example SLURM jobs, one approach is to:
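A minimal sketch of one such approach, assuming one of the example jobs lives at a hypothetical path /home/ubuntu/nccl-test.sbatch: copy it, edit the copy, and submit it again with sbatch:

```shell
# Hypothetical example job script; substitute a real file from /home/ubuntu
EXAMPLE_JOB=/home/ubuntu/nccl-test.sbatch

if command -v sbatch >/dev/null 2>&1 && [ -f "$EXAMPLE_JOB" ]; then
  cp "$EXAMPLE_JOB" "$HOME/my-job.sbatch"   # take a private copy
  # ...edit $HOME/my-job.sbatch to taste, then re-submit:
  sbatch "$HOME/my-job.sbatch"
  squeue --me                               # confirm it is queued
else
  echo "sbatch or the example job script is not available here"
fi
```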

Storage

There is a shared network filesystem mounted at /home on every node in the cluster.

Each worker node has a local NVMe drive mounted at /mnt/local_disk for fast local I/O.
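For example, a job can stage its dataset onto the node-local NVMe drive before training to avoid repeated reads over the network filesystem (the dataset path below is an assumption):

```shell
# Hypothetical dataset on the shared /home filesystem
DATASET=$HOME/datasets/imagenet

# Stage it onto the node-local NVMe scratch disk once per node
SCRATCH=/mnt/local_disk/$USER
if [ -d /mnt/local_disk ] && [ -d "$DATASET" ]; then
  mkdir -p "$SCRATCH"
  rsync -a "$DATASET/" "$SCRATCH/imagenet/"
  echo "staged to $SCRATCH/imagenet"
else
  echo "local disk or dataset not present on this machine"
fi
```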

InfiniBand partitioning

Worker nodes are interconnected by a partitioned 400 Gb/s InfiniBand fabric protected with an M_KEY. For this reason, fabric-wide commands such as ibhosts will not work, while distributed workloads such as MPI jobs run correctly.

To use InfiniBand and NCCL from inside a Docker container, make sure to set the environment variable NCCL_IB_PKEY=1.

For example:
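A minimal invocation might look like the following; the image name and device mounts are assumptions, and only the NCCL_IB_PKEY=1 setting comes from the note above:

```shell
# Run a CUDA container with the InfiniBand devices exposed and the PKEY set.
# The image name is an assumption; use whatever image your workload needs.
if command -v docker >/dev/null 2>&1; then
  docker run --rm --gpus all \
    --device=/dev/infiniband \
    --env NCCL_IB_PKEY=1 \
    nvcr.io/nvidia/pytorch:24.05-py3 \
    python -c "import torch; print(torch.cuda.is_available())"
else
  echo "docker not available on this machine"
fi
```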

Other

Worker nodes use the jump host as their default gateway, NAT firewall, and Slurm controller.

CUDA, OpenMPI, doca-ofed, and NVIDIA drivers are installed on each server.

A PyTorch setup script is available at /home/pytorch.setup.sh.
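For example, to run the script and confirm PyTorch can see the GPUs afterwards (the script's exact behavior is an assumption; review it before running):

```shell
# Review, then run, the provided setup script
if [ -f /home/pytorch.setup.sh ]; then
  cat /home/pytorch.setup.sh         # inspect what it does first
  bash /home/pytorch.setup.sh
  python3 -c "import torch; print(torch.cuda.is_available())"
else
  echo "setup script not found on this machine"
fi
```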
