Good to know

Cluster architecture

A standard cluster has three node roles: a single login node (jumphost / bastion), a single service node, and N worker nodes (up to 16, each with 8 GPUs). The login node is the only node reachable from the internet — it is the SSH entry point and the NAT gateway for everything behind it. Worker nodes are interconnected by a high-speed InfiniBand fabric and all nodes share a /home filesystem.

graph TD
    Internet([Internet / User])
    Internet -->|"SSH :22 · HTTPS :443"| Login

    subgraph cluster [Cluster private network]
        Login["Login node<br/>jumphost / bastion<br/>NAT gateway · Nginx → Grafana"]
        Service["Service node<br/>auth.cluster.verda.internal<br/>Slurm controller · k8s control plane · monitoring"]
        W1["Worker 1<br/>8 GPUs · local NVMe"]
        W2["Worker 2<br/>8 GPUs · local NVMe"]
        Wn["Worker N<br/>(up to 16)<br/>8 GPUs · local NVMe"]

        Login --- Service
        Login --- W1
        Login --- W2
        Login --- Wn
        W1 -. InfiniBand .- W2
        W2 -. InfiniBand .- Wn
    end

The Slurm controller, the Kubernetes control plane, and the monitoring/observability stack run on the service node, not on the login node. See Cluster node naming convention for the exact hostnames.

The login node firewall (/usr/local/sbin/iptables-custom.sh, configured from /etc/default/verda_iptables) drops all inbound traffic on the public interface by default, except for the ports below. Traffic from the internal cluster network and ICMP (ping) are always allowed.

| Port | Purpose |
| --- | --- |
| `22` (TCP) | SSH: bastion shell. Also the seamless-SSH redirect target if `SEAMLESS_SSH_PORT=22`. |
| `80` (TCP) | HTTP: Let's Encrypt ACME (HTTP-01) challenge and redirect to `443`. |
| `443` (TCP) | HTTPS: Grafana via Nginx. |
| `2222` (TCP) | SSH: allowed by the firewall, but nothing listens unless seamless SSH is enabled (opt-in). |

Worker and service nodes are not reachable from the internet by default; reach them by SSHing to the login node first (or see External SSH to worker nodes).
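The two-hop login can usually be collapsed into one command with OpenSSH's ProxyJump option (-J). The helper below is a minimal sketch that only builds the command string. The function name, the use of the same user on both hops, and the example address are assumptions, so substitute your own values.

```shell
# Build an ssh command that reaches a worker through the login node.
# Assumptions: OpenSSH 7.3 or newer on the client (for -J), and the same
# user name on the login node and on the worker.
jump_cmd() {
    user="$1"; login_ip="$2"; worker="$3"
    printf 'ssh -J %s@%s %s@%s\n' "$user" "$login_ip" "$user" "$worker"
}

# Example: print the command for worker hostname-1 behind 203.0.113.10
jump_cmd ubuntu 203.0.113.10 hostname-1
```

With ProxyJump the worker name is resolved from the login node, so the internal hostname can be used as-is.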

Cluster node naming convention

Node names are derived from the Hostname you specify when creating the cluster:

Login / jump host: hostname-login (still labeled as jumphost in the Console and API)
Service node: hostname-service, also reachable as auth.cluster.verda.internal
Worker nodes: hostname-1, hostname-2, etc.
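As an illustration of the convention, the sketch below prints the node names for a given hostname and worker count. The function name and the example hostname mycluster are hypothetical.

```shell
# Print the node names for a cluster, following the naming convention.
# Usage: node_names <hostname> <worker-count>
node_names() {
    base="$1"; workers="$2"
    echo "${base}-login"
    echo "${base}-service"
    i=1
    while [ "$i" -le "$workers" ]; do
        echo "${base}-${i}"
        i=$((i + 1))
    done
}

# Example: a cluster created with hostname "mycluster" and two workers
node_names mycluster 2
```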

Storage

A shared network filesystem is mounted at /home on every node in the cluster.

Each worker node also has a local NVMe drive mounted at /mnt/local_disk for fast node-local I/O.
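A common pattern is to keep datasets on the shared /home filesystem and copy them to the local NVMe drive before an I/O-heavy job. The sketch below shows one way to do that. The function name, the LOCAL_DISK override, and the dataset path are hypothetical.

```shell
# Copy a file or directory from shared /home to the node-local NVMe and
# print the resulting local path. LOCAL_DISK is a hypothetical override
# for trying this outside the cluster; it defaults to the documented
# mount point.
stage_to_local() {
    src="$1"
    dest_root="${LOCAL_DISK:-/mnt/local_disk}"
    mkdir -p "$dest_root" \
        && cp -R "$src" "$dest_root/" \
        && echo "${dest_root}/$(basename "$src")"
}

# Usage on a worker node (the dataset path is illustrative):
#   data_dir=$(stage_to_local /home/datasets/my-dataset)
```

Because the drive is local to each worker, a multi-node job would need to run the copy on every node it uses.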

InfiniBand partitioning

Worker nodes are interconnected by a partitioned 400 Gb/s InfiniBand fabric protected with an M_KEY. As a result, fabric-discovery commands such as ibhosts do not work, but distributed workloads such as MPI jobs run normally.

On B300 clusters the partition key is assigned at index 0, so InfiniBand and NCCL work from inside a Docker container without any extra configuration.

On H200 and B200 clusters, set the environment variable NCCL_IB_PKEY=1 to use InfiniBand and NCCL from inside a Docker container.

For example:

docker run -e NCCL_IB_PKEY=1 <image> <command>
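If the same launch script is used across cluster types, the rule above can be encoded in a small helper. This is a sketch only; the function name is hypothetical and the GPU type has to be passed in by the caller.

```shell
# Print the docker flag needed for NCCL over InfiniBand on a given GPU
# type, following the rules described in this section.
nccl_pkey_flag() {
    case "$1" in
        H200|B200) echo "-e NCCL_IB_PKEY=1" ;;
        B300)      ;;  # partition key is at index 0, no flag needed
        *)         echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

# Usage (image and command are placeholders):
#   docker run $(nccl_pkey_flag H200) <image> <command>
```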

NVIDIA userspace inside Slinky pods

On Kubernetes clusters, the bundled Slinky Slurm runs as pods. Inside the default Slurm worker pod the NVIDIA driver userspace (e.g. nvidia-smi) is injected by the NVIDIA Container Toolkit — the binaries and libraries are bind-mounted from the host into the pod. They are not installed via apt / dpkg, so dpkg -l will not list them:

root@slurm-login-slinky-xxxxxxxxxx-xxxxx:/tmp# srun which nvidia-smi
/usr/bin/nvidia-smi

How it works:

srun launches the command on a worker pod (slinky-0, slinky-1, ...).
On the worker pod, /usr/bin/nvidia-smi and the matching libnvidia-*.so.<driver-version> libraries appear as individual bind-mounts from the host filesystem. You can see them in mount | grep nvidia.
Because those files are bind-mounted at container start rather than installed from a .deb, the pod's dpkg database has no nvidia-* package entries, which is why dpkg -l | grep -i nvidia prints nothing.
which nvidia-smi still resolves to /usr/bin/nvidia-smi because the bind-mount makes the binary present at that path.

The login pod has no GPUs attached and therefore no injected NVIDIA binaries — nvidia-smi is only available when running under srun on a worker.

Changing the Slurm configuration (Slinky)

On Kubernetes clusters Slurm runs as pods managed by the Slinky operator, so there is no /etc/slurm/slurm.conf on the service node to edit — the operator renders slurm.conf from the Slurm Helm release and keeps the running controller in sync. Hand-editing a slurm.conf file (on the service node or inside a pod) has no lasting effect.

Extra slurm.conf settings are appended through the controller's extraConf. There are two ways to set it:

Durable (survives Helm upgrades): set controller.extraConf in the Slurm Helm values and upgrade the slurm release in the slurm namespace. In Verda's deployment these values are generated by cluster-bootstrap.sh; edit the controller.extraConf block there and re-run it, or apply a values file directly:

# values.yaml
controller:
  extraConf: |
    AccountingStorageEnforce=associations,limits,qos

helm upgrade slurm <slurm-chart> -n slurm --reuse-values -f values.yaml

Quick (live, but reverted on the next Helm upgrade): edit the Controller resource directly.

kubectl -n slurm edit controller slurm
# extend spec.extraConf, e.g.:
#   extraConf: |
#     AccountingStorageEnforce=associations,limits,qos

Either way, the Slinky operator detects the change and reconfigures the cluster live (equivalent to scontrol reconfigure) with zero control-plane downtime, so no manual pod restart is needed. A small set of Slurm parameters requires a full slurmctld restart (these are noted in the slurm.conf reference); for those, delete the controller pod and let the operator recreate it:

kubectl -n slurm delete pod slurm-controller-0

Confirm the change took effect from the login node:

scontrol show config | grep AccountingStorageEnforce
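For scripted checks, for example after a Helm upgrade in CI, the value can be extracted rather than read by eye. The helper below is a sketch that assumes the "Key = value" layout printed by scontrol show config; the function name is hypothetical.

```shell
# Read `scontrol show config` output on stdin and print the value of
# one key. Splits on the first "=" only, so values that themselves
# contain "=" are returned whole.
slurm_conf_value() {
    awk -v key="$1" '{
        pos = index($0, "=")
        if (pos == 0) next
        k = substr($0, 1, pos - 1)
        v = substr($0, pos + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", k)
        gsub(/^[ \t]+|[ \t]+$/, "", v)
        if (k == key) { print v; exit }
    }'
}

# Usage from the login node:
#   scontrol show config | slurm_conf_value AccountingStorageEnforce
```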

External SSH to worker nodes (optional)

By default the login node only NATs outbound traffic from workers — workers are not reachable from the internet. The usual path is to SSH to the login node and then to a worker by name (e.g. ssh hostname-1).

If you want to reach worker SSH directly from outside the cluster, enable DNAT on the login node:

On the login node, edit /etc/default/verda_iptables and set WORKER_SSH_DNAT=1.
Apply: systemctl restart iptables-custom

The login node will then forward <login-ip>:1000N to hostname-N:22. For example, <login-ip>:10001 → hostname-1, <login-ip>:10002 → hostname-2, and so on.
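The mapping can be sketched as a one-line helper. It assumes the forwarded port is 10000 + N, which matches the two examples given here; how the pattern extends to workers 10 and above is not stated, so confirm it on your cluster. The function name is hypothetical.

```shell
# Map a worker index to its forwarded SSH port on the login node.
# Assumption: port = 10000 + N, consistent with 10001 and 10002 for
# workers 1 and 2. Verify the mapping for N >= 10 before relying on it.
worker_ssh_port() {
    echo $((10000 + $1))
}

# Usage (user and address are placeholders):
#   ssh -p "$(worker_ssh_port 2)" <user>@<login-ip>
```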

Warning

Enabling DNAT exposes worker SSH to the public internet. Make sure each worker's sshd only accepts key-based authentication.
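One rough way to check this is to inspect the effective sshd configuration with sshd -T. The helper below is a sketch, not a complete audit: it looks only at the two settings named in the comments, the function name is hypothetical, and older OpenSSH releases may also need challengeresponseauthentication checked.

```shell
# Read `sshd -T` output on stdin and succeed only if password and
# keyboard-interactive logins are both reported as disabled.
key_only_auth() {
    awk '
        $1 == "passwordauthentication"       && $2 == "no" { pw = 1 }
        $1 == "kbdinteractiveauthentication" && $2 == "no" { kbd = 1 }
        END { exit !(pw && kbd) }
    '
}

# Usage on a worker, as root:
#   sshd -T | key_only_auth && echo "key-only" || echo "password logins possible"
```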

Seamless SSH for kanidm users (optional)

Note

Slinky (Kubernetes) clusters only. The redirect targets the Slurm login pod, which does not exist on native-Slurm clusters.

By default the login node is a plain bastion: sshd answers on :22 only, and cluster users reach Slurm by first SSHing to the login node. You can optionally enable seamless SSH, which opens TCP :2222 on the login node and redirects it straight to the Slurm login pod, so a kanidm user lands directly in a Slurm-ready shell:

ssh <kanidm-user>@<login-ip> -p 2222

To enable it on a running cluster:

On the login node, edit /etc/default/verda_seamless_ssh and set SEAMLESS_SSH_ENABLED=1.
Apply: systemctl restart slinky-login-dnat.service

The :22 admin bastion shell (ubuntu@<login-ip>) keeps working as before. Once seamless SSH is enabled, members of the kanidm cluster_users group are restricted to the seamless port and can no longer open a plain bastion shell on :22.

SEAMLESS_SSH_PORT (default 2222) selects which port carries the redirect; the other port stays the bastion shell port. Set it to 22 to flip the roles, so seamless SSH uses the cleaner :22 and the admin bastion shell moves to :2222. Only 22 and 2222 are valid values.
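The two valid settings and the resulting port roles can be summarised in a short sketch. The function name is hypothetical; it only restates the rule above and does not read the cluster's configuration.

```shell
# Print which port serves which role for a given SEAMLESS_SSH_PORT.
# Defaults to 2222 when no argument is given; rejects any other value
# than 22 or 2222.
ssh_port_roles() {
    case "${1:-2222}" in
        2222) echo "seamless=2222 bastion=22" ;;
        22)   echo "seamless=22 bastion=2222" ;;
        *)    echo "invalid SEAMLESS_SSH_PORT: $1" >&2; return 1 ;;
    esac
}

# Example: roles after setting SEAMLESS_SSH_PORT=22
ssh_port_roles 22
```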

Other

Worker nodes use the login (jump) host as their default gateway and NAT firewall.

The Slurm controller, the Kubernetes control plane, and the monitoring/observability stack run on the service node, not on the login node. The Grafana UI is reachable at https://<login-ip>:443.

CUDA, OpenMPI, doca-ofed, and nvidia-drivers are installed on each server.

A PyTorch setup script is available at /home/pytorch.setup.sh.