Good to know

Cluster node naming convention

Cluster node names will be based on the Hostname you specify when creating the cluster:

  • Login / jump host: hostname-login (still labeled as jumphost in the Console and API)
  • Service node: hostname-service, also reachable as auth.cluster.verda.internal
  • Worker nodes: hostname-1, hostname-2, etc.
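As a quick illustration of the convention above, the following sketch prints the node names implied by a given Hostname. The function name cluster_node_names, the example hostname mycluster and the worker count are hypothetical values chosen for illustration; the Console or API does not provide this helper.

```shell
#!/bin/sh
# Hypothetical helper: print the node names implied by the naming convention.
# $1 = the Hostname given at cluster creation, $2 = number of worker nodes.
cluster_node_names() {
  cn_host="$1"
  cn_workers="$2"
  printf '%s-login\n' "$cn_host"     # login / jump host
  printf '%s-service\n' "$cn_host"   # service node
  cn_i=1
  while [ "$cn_i" -le "$cn_workers" ]; do
    printf '%s-%s\n' "$cn_host" "$cn_i"   # worker nodes are numbered from 1
    cn_i=$((cn_i + 1))
  done
}

# Illustrative values only: a cluster created with Hostname "mycluster" and 2 workers.
cluster_node_names mycluster 2
```

A list like this can be handy when generating SSH config entries or host files for a new cluster.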

Storage

A shared network filesystem is mounted at /home on every node in the cluster.

Each worker node also has a local NVMe drive mounted at /mnt/local_disk for fast node-local I/O.
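A common pattern with this layout is to keep datasets on the shared /home filesystem and copy them to the local NVMe drive before an I/O-heavy job. The sketch below assumes that pattern; the function name stage_to_local and the dataset path in the comment are hypothetical, and only the /mnt/local_disk default comes from this page.

```shell
#!/bin/sh
# Hypothetical helper: copy a directory from shared storage to the node-local disk.
# $1 = source directory (e.g. somewhere under /home)
# $2 = local disk root, defaulting to /mnt/local_disk as described above
stage_to_local() {
  st_src="$1"
  st_root="${2:-/mnt/local_disk}"
  if [ ! -d "$st_root" ]; then
    # The local disk exists on worker nodes only, so fail clearly elsewhere.
    echo "local disk not found at $st_root (is this a worker node?)" >&2
    return 1
  fi
  cp -R "$st_src" "$st_root/" || return 1
  # Print the staged location so a job script can capture it.
  printf '%s/%s\n' "$st_root" "$(basename "$st_src")"
}

# Example call (hypothetical dataset path):
#   DATA_DIR="$(stage_to_local "$HOME/datasets/my_dataset")"
```

Because the local disk is per node, a multi-node job would need to run the copy once on each worker.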

InfiniBand partitioning

Worker nodes are interconnected using a partitioned 400 Gb/s InfiniBand fabric secured with an M_KEY. As a result, fabric discovery commands such as ibhosts will not work, but distributed workloads such as MPI run correctly.

To use InfiniBand and NCCL from inside a Docker container, set the environment variable NCCL_IB_PKEY=1.

For example:

docker run -e NCCL_IB_PKEY=1
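To avoid forgetting the variable, one option is a small wrapper that always passes it to docker run. This is a sketch only: the function name docker_run_ib and the DRY_RUN switch are hypothetical, and any further flags or image names you pass are your own.

```shell
#!/bin/sh
# Hypothetical wrapper: always pass NCCL_IB_PKEY=1 to `docker run`.
# Set DRY_RUN=1 to print the resulting command instead of executing it.
docker_run_ib() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo docker run -e NCCL_IB_PKEY=1 "$@"
  else
    docker run -e NCCL_IB_PKEY=1 "$@"
  fi
}

# Example (image name is a placeholder):
#   DRY_RUN=1 docker_run_ib my-image
```

All remaining arguments are forwarded unchanged, so the wrapper can be used wherever docker run would be.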

NVIDIA userspace inside Slinky pods

On Kubernetes clusters, the bundled Slinky Slurm runs as pods. Inside the default Slurm worker pod, the NVIDIA driver userspace (e.g. nvidia-smi) is injected by the NVIDIA Container Toolkit: the binaries and libraries are bind-mounted from the host into the pod. They are not installed via apt / dpkg, so dpkg -l will not list them, even though the binary is present on the PATH:

root@slurm-login-slinky-xxxxxxxxxx-xxxxx:/tmp# srun which nvidia-smi
/usr/bin/nvidia-smi

How it works:

  • srun launches the command on a worker pod (slinky-0, slinky-1, ...).
  • On the worker pod, /usr/bin/nvidia-smi and the matching libnvidia-*.so.<driver-version> libraries appear as individual bind-mounts from the host filesystem. You can list them with mount | grep nvidia.
  • Because those files are bind-mounted at container start rather than installed from a .deb, the pod's dpkg database has no nvidia-* package entries, which is why dpkg -l | grep -i nvidia prints nothing.
  • which nvidia-smi still resolves to /usr/bin/nvidia-smi because the bind-mount makes the binary present at that path.
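The points above can be checked with a short script. This is a minimal sketch: the function names and the script filename used below are hypothetical, and the script only combines the which / mount / dpkg checks already described.

```shell
#!/bin/sh
# Count input lines that mention "nvidia" (case-insensitive).
# grep -c exits non-zero when nothing matches, so "|| true" keeps the status clean.
count_nvidia_lines() {
  grep -ci 'nvidia' || true
}

# Report where nvidia-smi comes from on the current pod or node.
report_nvidia_provenance() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi path: $(command -v nvidia-smi)"
  else
    echo "nvidia-smi path: not found"
  fi
  echo "mount entries mentioning nvidia: $(mount 2>/dev/null | count_nvidia_lines)"
  if command -v dpkg >/dev/null 2>&1; then
    echo "dpkg entries mentioning nvidia: $(dpkg -l 2>/dev/null | count_nvidia_lines)"
  else
    echo "dpkg entries mentioning nvidia: dpkg not available"
  fi
}

report_nvidia_provenance
```

Saved under /home (for example as check_nvidia.sh, a hypothetical name) and run with srun sh ./check_nvidia.sh, a worker pod should, per the description above, report /usr/bin/nvidia-smi, a non-zero number of mount entries, and zero dpkg entries.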

The login pod has no GPUs attached and therefore no injected NVIDIA binaries — nvidia-smi is only available when running under srun on a worker.

Other

Worker nodes use the login (jump) host as their default gateway and NAT firewall.

The Slurm controller, the Kubernetes control plane and the monitoring/observability stack run on separate nodes from the login host. The Grafana UI is reachable at https://<login-ip>:443.

CUDA, OpenMPI, doca-ofed and nvidia-drivers are installed on each server.

A PyTorch installer setup script is available at /home/pytorch.setup.sh.