Good to know
Cluster node naming convention
Cluster node names are based on the hostname you specify when creating the cluster:
Jump host: hostname-jumphost
Worker nodes: hostname-1, hostname-2, etc.
Validation
The job scheduler regularly runs passive health checks on every worker node. The actual checks can be found in /etc/slurm/nhc.conf
As part of cluster creation, we run some active checks.
The first validation phase checks that nodes:
are ready / idle in the job scheduler
are UP from a monitoring point of view
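The first-phase conditions can also be checked by hand with standard Slurm commands. A minimal sketch, assuming the node-naming convention above (hostname-1 is a placeholder for one of your worker nodes):

```shell
# List every node with its scheduler state; workers should be "idle" (ready).
sinfo -N -l

# Inspect a single worker in detail (hypothetical node name).
scontrol show node hostname-1
```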
If the first phase does not pass, the cluster will not move on from the validating status.
The results of the second validation phase do not (yet) affect cluster status; they can be found by running systemctl status *-validation-phase-2
The second phase runs, for example:
all_reduce_perf from nccl-tests
iperf to check Ethernet performance
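You can also run these checks manually. A sketch only: the location of the all_reduce_perf binary, the node counts, and the node names are assumptions to adapt to your cluster.

```shell
# NCCL all-reduce benchmark across two workers, 8 GPUs each
# (assumes all_reduce_perf from nccl-tests is on the PATH).
srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 \
    all_reduce_perf -b 8 -e 8G -f 2 -g 8

# Ethernet throughput with iperf (hypothetical node names):
# on hostname-1:  iperf -s
# on hostname-2:  iperf -c hostname-1
```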
With SLURM, the example jobs are stored in /home/ubuntu and can be customized and re-submitted.
To re-submit our example SLURM jobs, one approach is to edit the job script and submit it again with sbatch.
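A minimal sketch of that approach; the job script name is a placeholder, so substitute one of the actual scripts found in /home/ubuntu:

```shell
cd /home/ubuntu
# Edit the script as needed, then submit it (hypothetical filename).
sbatch example-job.sbatch
# Confirm the job is queued or running.
squeue --me
```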
Storage
A shared network filesystem is mounted at /home on every node in the cluster.
Each worker node also has a local NVMe drive mounted at /mnt/local_disk for extra-fast I/O.
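A common pattern is to stage data onto the node-local NVMe for the duration of a job and copy results back to the shared filesystem afterwards. A sketch under assumptions: the dataset and results directory names are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=local-scratch-demo

# Per-job scratch directory on the fast local NVMe drive.
scratch=/mnt/local_disk/$SLURM_JOB_ID
mkdir -p "$scratch"

# Stage input from the shared /home filesystem (hypothetical dataset path).
cp -r "$HOME/dataset" "$scratch/"

# ... run the workload against $scratch ...

# Copy results back to shared storage and clean up the local drive.
cp -r "$scratch/results" "$HOME/"
rm -rf "$scratch"
```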
InfiniBand partitioning
Worker nodes are interconnected using a partitioned 400 Gb/s InfiniBand fabric protected with an M_KEY. For this reason, management commands like ibhosts will not work, while distributed workloads such as MPI work correctly.
To use InfiniBand and NCCL from inside a Docker container, make sure to set the environment variable NCCL_IB_PKEY=1.
For example:
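A minimal sketch of such a container invocation; the image name and command are placeholders, and the device flag assumes the standard /dev/infiniband device path:

```shell
docker run --rm --gpus all \
    --device=/dev/infiniband \
    -e NCCL_IB_PKEY=1 \
    my-training-image:latest \
    python train.py
```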
Other
Worker nodes use the jump host as their default gateway, NAT firewall, and Slurm controller.
CUDA, OpenMPI, doca-ofed, and nvidia-drivers are installed on each server.
A PyTorch installer setup script is available at /home/pytorch.setup.sh.
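To run the provided setup script and verify the result, a sketch (the script takes no documented arguments, and the verification line assumes the install makes torch importable):

```shell
bash /home/pytorch.setup.sh

# Sanity check: confirm PyTorch sees the GPUs.
python -c "import torch; print(torch.cuda.is_available())"
```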