Good to know
Cluster node naming convention
Cluster node names will be based on the Hostname you specify when creating the cluster:
- Login / jump host: hostname-login (still labeled as jumphost in the Console and API)
- Service node: hostname-service, also reachable as auth.cluster.verda.internal
- Worker nodes: hostname-1, hostname-2, etc.
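The naming scheme above can be sketched as a small shell helper. This is only an illustration: the function name cluster_node_names, the example hostname mycluster, and the worker count are hypothetical, and the real number of workers depends on your cluster.

```shell
# Print the expected node names for a cluster, following the convention above.
# Hypothetical helper: $1 is the Hostname chosen at cluster creation,
# $2 is the number of worker nodes (an assumption; check your own cluster).
cluster_node_names() {
    local hostname="$1" workers="$2" i=1
    echo "${hostname}-login"      # login / jump host
    echo "${hostname}-service"    # service node
    while [ "$i" -le "$workers" ]; do
        echo "${hostname}-${i}"   # worker nodes, numbered from 1
        i=$((i + 1))
    done
}

# Example: cluster_node_names mycluster 2
```

For a hostname of mycluster and two workers, this prints mycluster-login, mycluster-service, mycluster-1 and mycluster-2, one per line.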
Storage
A shared network filesystem is mounted at /home on every node in the cluster.
Each worker node also has a local NVMe drive mounted at /mnt/local_disk for extra fast I/O.
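One way to use this layout is to copy a dataset from the shared /home filesystem to the local NVMe drive before a job reads it repeatedly. The following is a minimal sketch, not a prescribed workflow: the helper stage_to_local and the dataset path in the example are hypothetical, and the LOCAL_DISK variable exists only so the destination can be overridden.

```shell
# Copy a directory from shared storage to node-local NVMe scratch space.
# LOCAL_DISK defaults to the mount point described above.
LOCAL_DISK="${LOCAL_DISK:-/mnt/local_disk}"

stage_to_local() {
    local src="$1" dest
    dest="${LOCAL_DISK}/$(basename "$src")"
    mkdir -p "$dest"
    # Copy the contents of $src (including dotfiles) into $dest.
    cp -a "$src/." "$dest/"
    # Print the local path so callers can capture it.
    echo "$dest"
}

# Example (hypothetical path):
#   DATA_DIR=$(stage_to_local /home/datasets/my-dataset)
```

Because the NVMe drive is local to each worker, every node that needs the data has to stage its own copy.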
InfiniBand partitioning
Worker nodes are interconnected using a partitioned 400 Gb/s InfiniBand fabric with M_KEY. For this reason, commands such as ibhosts will not work, but distributed workloads such as MPI run correctly.
To use InfiniBand and NCCL from inside a Docker container, set the environment variable NCCL_IB_PKEY=1.
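A sketch of passing the variable to a container is shown here. Only NCCL_IB_PKEY=1 comes from the text above. The wrapper name run_nccl_container, the image name in the example, and the remaining docker run flags (GPU access, host networking, InfiniBand device passthrough, unlimited locked memory) are assumptions based on common practice for RDMA-capable containers and may need adjusting.

```shell
# Launch a container with the InfiniBand partition key set for NCCL.
# Only NCCL_IB_PKEY=1 is taken from the documentation; the other flags
# are common choices for RDMA-capable containers and are assumptions here.
run_nccl_container() {
    docker run --rm \
        --gpus all \
        --network host \
        --device /dev/infiniband \
        --ulimit memlock=-1 \
        -e NCCL_IB_PKEY=1 \
        "$@"
}

# Example (hypothetical image and command):
#   run_nccl_container my-training-image:latest python train.py
```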
NVIDIA userspace inside Slinky pods
On Kubernetes clusters, the bundled Slinky Slurm runs as pods. Inside the default Slurm worker pod, the NVIDIA driver userspace (e.g. nvidia-smi) is injected by the NVIDIA Container Toolkit: the binaries and libraries are bind-mounted from the host into the pod. They are not installed via apt / dpkg, so dpkg -l will not list them.
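The checks behind this observation can be sketched as follows, run from the login pod. The helper name inspect_nvidia_userspace and the -N1 option (a single node) are illustrative choices, not part of the original instructions.

```shell
# Report how the NVIDIA userspace appears on a Slurm worker pod.
# Each srun call runs the command on a worker; grep runs locally on its output.
inspect_nvidia_userspace() {
    local pkgs mounts
    # Path that nvidia-smi resolves to on the worker pod.
    echo "nvidia-smi path: $(srun -N1 which nvidia-smi)"
    # Number of dpkg entries mentioning nvidia (grep -c prints 0 on no match).
    pkgs=$(srun -N1 dpkg -l | grep -ci nvidia || true)
    echo "dpkg nvidia entries: ${pkgs}"
    # Number of mount entries mentioning nvidia (the injected bind-mounts).
    mounts=$(srun -N1 mount | grep -c nvidia || true)
    echo "nvidia bind-mounts: ${mounts}"
}

# Example (from the login pod): inspect_nvidia_userspace
```

On a worker pod set up as described, the path should be /usr/bin/nvidia-smi, the dpkg count should be zero, and the bind-mount count should be greater than zero.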
How it works:
- srun launches the command on a worker pod (slinky-0, slinky-1, ...).
- On the worker pod, /usr/bin/nvidia-smi and the matching libnvidia-*.so.<driver-version> libraries appear as individual bind-mounts from the host filesystem. You can see them in mount | grep nvidia.
- Because those files are bind-mounted at container start rather than installed from a .deb, the pod's dpkg database has no nvidia-* package entries, hence the empty dpkg -l | grep -i nvidi output.
- which nvidia-smi still resolves to /usr/bin/nvidia-smi because the bind-mount makes the binary present at that path.
The login pod has no GPUs attached and therefore no injected NVIDIA binaries — nvidia-smi is only available when running under srun on a worker.
Other
Worker nodes use the login (jump) host as their default gateway and NAT firewall.
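To confirm this routing on a worker node, you can inspect its default route. The helper below is a small sketch (the name default_gateway is hypothetical) that parses the output of ip route. The address it prints should belong to the login host on the cluster network.

```shell
# Print the default gateway of the current node by parsing `ip route`.
# A default route line has the form: default via <gateway> dev <interface> ...
default_gateway() {
    ip route show default | awk '/^default/ { print $3; exit }'
}

# Example (run on a worker node): default_gateway
```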
The Slurm controller, the Kubernetes control plane, and the monitoring/observability stack run on separate nodes from the login host. The Grafana UI is reachable at https://<login-ip>:443.
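A quick reachability check for the Grafana UI might look like the sketch below. The helper name grafana_status is hypothetical, and the -k flag assumes a self-signed certificate; drop it if the certificate is trusted by your system.

```shell
# Print the HTTP status code returned by the Grafana UI on the login host.
# -k skips TLS certificate verification (assumes a self-signed certificate).
grafana_status() {
    local login_ip="$1"
    curl -k -sS -o /dev/null -w '%{http_code}\n' "https://${login_ip}:443/"
}

# Example: grafana_status <login-ip>
```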
CUDA, OpenMPI, doca-ofed and nvidia-drivers are installed on each server.
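A simple way to see which of these tools are on your PATH is sketched below. The helper check_tools is hypothetical, and the binary names in the example are assumptions about what each package provides. A tool reported as missing may still be installed outside PATH.

```shell
# Report which of the given commands are available on PATH.
check_tools() {
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# Example; the binary names are assumptions about what each package provides:
#   check_tools nvcc mpirun ofed_info nvidia-smi
```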
A PyTorch installer setup script is available at /home/pytorch.setup.sh.