# Kubernetes

When you choose Kubernetes as the job orchestrator, the Instant Cluster *also* has Kubernetes (k8s) installed.

<figure><img src="https://2529223994-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYO2RW7i8v8Hs8UzxSAdO%2Fuploads%2Fj2QJnAplsbNvWJannd2q%2Fimage.png?alt=media&#x26;token=fbe94c3b-23fa-42ab-8df4-236b1ba123c2" alt=""><figcaption></figcaption></figure>

### Features

The cluster is configured to run multi-node InfiniBand jobs out of the box.

* Each worker node's `/mnt/local_disk` is available as a StorageClass.
* mpi-operator is deployed.
* cilium, nvidia-device-plugin, and nvidia-network-operator are also installed with Helm.
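
To confirm that the components listed above are present, something like the following can be run from a machine with cluster access. This is a minimal sketch: `check_cluster_addons` is a hypothetical helper name, and it assumes `kubectl` and `helm` are on the `PATH` with a working kubeconfig. The CRD name `mpijobs.kubeflow.org` follows from the `mpijob.kubeflow.org` resource type that mpi-operator registers.

```shell
# Hypothetical helper: list the pre-installed pieces described above.
check_cluster_addons() {
  # StorageClasses, which should include the one backed by /mnt/local_disk
  kubectl get storageclass
  # The MPIJob custom resource definition registered by mpi-operator
  kubectl get crd mpijobs.kubeflow.org
  # Helm releases in every namespace (cilium, NVIDIA plugins, ...)
  helm list --all-namespaces
}
```

Calling `check_cluster_addons` prints the three listings one after another; a missing CRD or an empty release list would suggest the cluster is not fully provisioned yet.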

> Currently, both the Kubernetes and SLURM job orchestrators are available when you choose Kubernetes. The two orchestrators do not coordinate with each other. You can disable SLURM by running `systemctl disable --now slurmctld` on the jumphost.

### Using kubectl

Admin credentials can be found in `/root/.kube/config` and `/home/ubuntu/.kube/config`.
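
If `kubectl` does not pick up the credentials automatically, the file can be passed explicitly. The wrapper below is only a sketch: `kubectl_admin` and the `CLUSTER_KUBECONFIG` variable are hypothetical names, while `--kubeconfig` is a standard `kubectl` flag.

```shell
# Hypothetical wrapper: run kubectl against an explicit kubeconfig file.
# Defaults to the ubuntu user's admin config documented above;
# set CLUSTER_KUBECONFIG (hypothetical variable) to use another file.
kubectl_admin() {
  kubectl --kubeconfig "${CLUSTER_KUBECONFIG:-/home/ubuntu/.kube/config}" "$@"
}
```

For example, `kubectl_admin get nodes -o wide` lists the cluster nodes using that file.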

#### Submitting a job

`/home/ubuntu/verda_k8s_all_reduce_perf_2_nodes.yml` is provided as an example. It runs the nccl-tests `all_reduce_perf` benchmark and sets the crucial `NCCL_PKEY=1` environment variable, which tells the nodes which InfiniBand P\_KEY to use.

```
$ kubectl create -f /home/ubuntu/verda_k8s_all_reduce_perf_2_nodes.yml 
mpijob.kubeflow.org/nccl-test-2n-wcq4s created
```
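
Because the job name gets a generated suffix, it can be handy to capture it at submission time and wait on it. The following is a sketch rather than a supported workflow: `submit_and_wait` is a hypothetical helper, and it assumes the MPIJob exposes a `Succeeded` condition once the launcher finishes, as the Kubeflow mpi-operator normally does.

```shell
# Hypothetical helper: submit an MPIJob manifest and block until it finishes.
submit_and_wait() {
  local manifest="$1"
  local timeout="${2:-30m}"
  local job
  # -o name prints only the resource reference,
  # e.g. mpijob.kubeflow.org/<generated-name>
  job="$(kubectl create -f "$manifest" -o name)" || return 1
  echo "submitted ${job}"
  # Assumption: the MPIJob reports a "Succeeded" condition on completion
  kubectl wait --for=condition=Succeeded "$job" --timeout="$timeout"
}
```

Usage would look like `submit_and_wait /home/ubuntu/verda_k8s_all_reduce_perf_2_nodes.yml 30m`. The call returns non-zero if the job does not reach the condition within the timeout.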

#### Job Details

```
$ kubectl get pods 
NAME                                READY   STATUS      RESTARTS   AGE
nccl-test-2n-8cnbc-launcher-przrf   0/1     Completed   4          30m
$
$ kubectl logs -f nccl-test-2n-8cnbc-launcher-przrf | tail -10
  4294967296    1073741824     float     sum      -1  9230.25  465.31  872.46       0  9221.35  465.76  873.31       0
  8589934592    2147483648     float     sum      -1  18376.5  467.44  876.45       0  18337.6  468.43  878.31       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 827.441 
#
# Collective test concluded: all_reduce_perf
#


=== NCCL test completed ===
```
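
To pull just the headline number out of the log, the summary line can be filtered with `awk`. This is a small sketch: `avg_bus_bandwidth` is a hypothetical helper that reads nccl-tests output on standard input and relies on the `# Avg bus bandwidth    : <value>` line shown above.

```shell
# Hypothetical helper: read nccl-tests output on stdin and print the
# average bus bandwidth value from the summary line.
avg_bus_bandwidth() {
  awk -F: '/Avg bus bandwidth/ {
    gsub(/[ \t]+/, "", $2)   # strip padding around the number
    print $2
  }'
}
```

For the run above, `kubectl logs nccl-test-2n-8cnbc-launcher-przrf | avg_bus_bandwidth` would print `827.441`.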

> Downloading the container image to all workers can take a while, so pods may stay in `ContainerCreating` for some time before the job starts.
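
While the image is being downloaded, the pod's events show what it is waiting for. The helper below is a sketch with a hypothetical name, `pod_pull_status`; it only uses standard `kubectl get` options.

```shell
# Hypothetical helper: show what a not-yet-running pod is waiting for.
pod_pull_status() {
  local pod="$1"
  # Current status and the node the pod was scheduled on
  kubectl get pod "$pod" -o wide
  # Events for this pod only, oldest first; look for Pulling / Pulled
  kubectl get events \
    --field-selector "involvedObject.name=${pod}" \
    --sort-by=.lastTimestamp
}
```

For example, `pod_pull_status nccl-test-2n-8cnbc-launcher-przrf` would show the pull events for the launcher pod from the listing above.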

### Container Registry

> It is a good idea to use authentication when pulling images.

{% embed url="<https://docs.verda.com/containers/container-registries>" %}
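
As a rough illustration of authenticated pulls, an image pull secret can be created with `kubectl` and attached to a service account. Everything specific here is an assumption: `add_pull_secret` is a hypothetical helper, and the secret name, registry host, and credentials are placeholders. Whether the MPIJob pods actually run under the `default` service account depends on the manifest; alternatively, `imagePullSecrets` can be set directly in the pod templates.

```shell
# Hypothetical helper: create an image pull secret and attach it to the
# default service account of the current namespace.
# All four arguments are placeholders supplied by the caller.
add_pull_secret() {
  local name="$1" server="$2" user="$3" password="$4"
  kubectl create secret docker-registry "$name" \
    --docker-server="$server" \
    --docker-username="$user" \
    --docker-password="$password" || return 1
  # Pods that use the default service account will pull with this secret
  kubectl patch serviceaccount default \
    -p "{\"imagePullSecrets\": [{\"name\": \"${name}\"}]}"
}
```

A call might look like `add_pull_secret regcred registry.example.com myuser "$REGISTRY_TOKEN"`, where `REGISTRY_TOKEN` is an environment variable holding the credential so it does not end up in shell history.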
