Good to know
Cluster node naming convention
Cluster node names are based on the hostname you specify when creating the cluster:
Jump host: hostname-jumphost
Worker nodes: hostname-1, hostname-2, etc.
Validation
The job scheduler regularly runs passive health checks on every worker node. The actual checks can be found in /etc/slurm/nhc.conf
As part of cluster creation, we run some active checks.
The first validation phase checks that nodes:
are ready / idle in the job scheduler
are UP from a monitoring point of view
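The first-phase conditions can also be checked by hand with standard Slurm commands. A minimal sketch, assuming the node-naming convention above (hostname-1 is a placeholder for one of your worker nodes):

```shell
# List every node with its scheduler state; workers should be "idle" (ready).
sinfo -N -l

# Inspect a single worker in detail (hypothetical node name).
scontrol show node hostname-1
```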
If the first phase does not pass, the cluster will not move on from the validating status.
The results of the second validation phase do not (yet) affect cluster status; they can be found by running systemctl status *-validation-phase-2
The second phase runs, for example:
all_reduce_perf from nccl-tests
iperf to check Ethernet performance
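You can also run these checks manually. A sketch only: the location of the all_reduce_perf binary, the node counts, and the node names are assumptions to adapt to your cluster.

```shell
# NCCL all-reduce benchmark across two workers, 8 GPUs each
# (assumes all_reduce_perf from nccl-tests is on the PATH).
srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 \
    all_reduce_perf -b 8 -e 8G -f 2 -g 8

# Ethernet throughput with iperf (hypothetical node names):
# on hostname-1:  iperf -s
# on hostname-2:  iperf -c hostname-1
```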
With SLURM, the example jobs are stored in /home/ubuntu and can be customized and re-submitted.
To re-submit our example SLURM jobs, one approach is to edit the job script and submit it again with sbatch.
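A minimal sketch of that approach; the job script name is a placeholder, so substitute one of the actual scripts found in /home/ubuntu:

```shell
cd /home/ubuntu
# Edit the script as needed, then submit it (hypothetical filename).
sbatch example-job.sbatch
# Confirm the job is queued or running.
squeue --me
```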
Storage
A shared network filesystem is mounted at /home on every node in the cluster.
Each worker node also has a local NVMe drive mounted at /mnt/local_disk for extra-fast I/O.
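A common pattern is to stage data onto the node-local NVMe for the duration of a job and copy results back to the shared filesystem afterwards. A sketch under assumptions: the dataset and results directory names are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=local-scratch-demo

# Per-job scratch directory on the fast local NVMe drive.
scratch=/mnt/local_disk/$SLURM_JOB_ID
mkdir -p "$scratch"

# Stage input from the shared /home filesystem (hypothetical dataset path).
cp -r "$HOME/dataset" "$scratch/"

# ... run the workload against $scratch ...

# Copy results back to shared storage and clean up the local drive.
cp -r "$scratch/results" "$HOME/"
rm -rf "$scratch"
```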
InfiniBand partitioning
Worker nodes are interconnected using a partitioned 400 Gb/s InfiniBand fabric protected with an M_KEY. For this reason, management commands like ibhosts will not work, while distributed workloads such as MPI work correctly.
To use InfiniBand and NCCL from inside a Docker container, make sure to set the environment variable NCCL_IB_PKEY=1.
For example:
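A minimal sketch of such a container invocation; the image name and command are placeholders, and the device flag assumes the standard /dev/infiniband device path:

```shell
docker run --rm --gpus all \
    --device=/dev/infiniband \
    -e NCCL_IB_PKEY=1 \
    my-training-image:latest \
    python train.py
```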
Other
Worker nodes use the jump host as their default gateway, NAT firewall, and Slurm controller.
CUDA, OpenMPI, doca-ofed, and nvidia-drivers are installed on each server.
A PyTorch installer setup script is available at /home/pytorch.setup.sh.
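To run the provided setup script and verify the result, a sketch (the script takes no documented arguments, and the verification line assumes the install makes torch importable):

```shell
bash /home/pytorch.setup.sh

# Sanity check: confirm PyTorch sees the GPUs.
python -c "import torch; print(torch.cuda.is_available())"
```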