
Deploying NVIDIA Dynamo on a Kubernetes Instant Cluster

NVIDIA Dynamo is an open-source, distributed inference-serving framework built to deploy LLMs and other generative models in multi-node environments at data-center scale. It supports multiple inference backends — SGLang, NVIDIA TensorRT-LLM, and vLLM — and disaggregates the prefill and decode stages of inference across pods so each can be scaled independently.

This tutorial walks through deploying Dynamo on a Verda Kubernetes Instant Cluster (B200 / B300 class hardware) end-to-end: prerequisites, installation, model deployment, and verification. It also documents the known issues we hit during validation, with their workarounds, so you can skip past them.

What you'll deploy

| Layer | Component | Role |
| --- | --- | --- |
| Kubernetes operator | NVIDIA GPU Operator | Manages driver, container toolkit, MIG, DCGM exporter |
| Kubernetes operator | NVIDIA Network Operator | RDMA over InfiniBand (already installed on Verda Instant Clusters) |
| Dynamo control plane | dynamo-crds chart | Defines the Dynamo CRDs (DynamoGraphDeployment, DynamoGraphDeploymentRequest, etc.) |
| Dynamo control plane | dynamo-platform chart | Operator + NATS messaging + planner job runner |
| Dynamo workload | DynamoGraphDeploymentRequest (DGDR) | Auto-profiles your hardware and generates an optimal DynamoGraphDeployment |
| Dynamo workload | DynamoGraphDeployment (DGD) | The actual inference graph: frontend, prefill workers, decode workers, router |

End users hit an OpenAI-compatible HTTP endpoint exposed by the frontend service; the operator handles everything below it.

Prerequisites

Before you start, confirm:

  • A Verda Kubernetes Instant Cluster with at least one GPU node, and kubectl access from the jumphost. See Deploying an Instant Cluster if you don't have one yet.
  • helm v3.12+ on the jumphost (helm version).
  • An NVIDIA NGC API key — generate at ngc.nvidia.com → top-right user menu → Setup → Generate API Key. The key needs NGC Catalog (Container Registry) access.
  • A HuggingFace token with read access to whichever model you plan to serve — generate at huggingface.co → Settings → Access Tokens.
  • Free GPU capacity at the Kubernetes layer. Confirm with the command below — Dynamo's inference workers will go Pending if no GPUs are allocatable. If your cluster has other workloads holding GPUs (e.g. Slurm via Slinky), scale them down before deploying Dynamo.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
  • A default StorageClass that supports ReadWriteMany (RWX) access mode. Dynamo's disaggregated architecture spreads prefill and decode workers across multiple nodes, and every worker mounts the same model-weights PVC so the weights are downloaded once and shared. This matters for two reasons:

    1. Multi-node correctness. With ReadWriteOnce (RWO) only one node can mount the volume at a time, so a multi-node DGD cannot bind its workers to a single shared PVC at all. The operator either fails to deploy or silently degrades to a single-node graph that wastes the rest of the cluster.
    2. Cold-start time and disk usage. Even on a single node, RWX lets you reuse one cached copy of the weights across re-deploys (prefill ↔ decode ratio changes, autoscaler events, planner regenerations). Without it, every worker re-downloads from HuggingFace — minutes to hours for large models, plus N× the disk.

    Identify your default StorageClass and its provisioner before installing Dynamo, then confirm that the provisioner supports RWX:

    kubectl get sc
    kubectl get sc <name> -o jsonpath='{.metadata.name}: {.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}'
    # A StorageClass object does not list access modes. Note the provisioner,
    # then confirm ReadWriteMany support in that driver's documentation:
    kubectl describe sc <name> | grep -iE 'provisioner|allowVolumeExpansion|VolumeBindingMode'
    

    Verda Instant Clusters ship with an RWX-capable default StorageClass out of the box (this is the same setup used for Verda's managed inference clusters), so on a stock cluster this just works. If you have swapped it for an RWO-only provisioner (e.g. local-path, raw block), switch back or add an RWX-capable class — common choices are CephFS, NFS, Longhorn RWX, or any CSI driver that exposes ReadWriteMany. To make the new class the default after installing:

    kubectl patch sc <new-rwx-class> -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
    kubectl patch sc <old-rwo-class> -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
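Because the StorageClass object itself does not state which access modes work, one practical check is to request a small RWX volume and see whether it gets provisioned. The sketch below only prints a manifest; the claim name rwx-probe and the 1Gi size are arbitrary illustrative choices, not something Dynamo requires.

```shell
# Print a throwaway PVC manifest that requests ReadWriteMany from the
# default StorageClass (no storageClassName is set on purpose).
# The default name "rwx-probe" and the 1Gi size are illustrative assumptions.
rwx_probe_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${1:-rwx-probe}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
}
```

Apply it with rwx_probe_manifest | kubectl apply -n default -f - and inspect it with kubectl describe pvc rwx-probe -n default. A class that uses WaitForFirstConsumer binding keeps the claim Pending until a pod mounts it, so read the events rather than the phase alone, and delete the claim when you are done.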
    

Set these as environment variables on the jumphost — every command in this tutorial assumes they're set:

export NAMESPACE=dynamo-system
export NGC_API_KEY='nvapi-...your-key-here...'
export HF_TOKEN='hf_...your-token-here...'
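Since every later command depends on these variables, a quick guard can catch a missing export before it turns into a confusing kubectl or helm error. This is a minimal sketch: check_dynamo_env is a hypothetical helper name, and the prefix checks simply mirror the placeholder values above, so a mismatch is reported as a warning rather than a failure.

```shell
# Return non-zero if any required variable is unset or empty.
# Prefix mismatches (nvapi-, hf_) only produce a warning, because the
# prefixes are a heuristic taken from the placeholders in this tutorial.
check_dynamo_env() {
  rc=0
  for name in NAMESPACE NGC_API_KEY HF_TOKEN; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      rc=1
    fi
  done
  case "${NGC_API_KEY:-}" in
    nvapi-*|"") ;;
    *) echo "warning: NGC_API_KEY does not start with nvapi-" >&2 ;;
  esac
  case "${HF_TOKEN:-}" in
    hf_*|"") ;;
    *) echo "warning: HF_TOKEN does not start with hf_" >&2 ;;
  esac
  return "$rc"
}
```

Run check_dynamo_env after the exports; a non-zero exit status means at least one variable still needs to be set.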

Step 1 — Pre-deployment check

The Dynamo repo ships a script that validates cluster readiness:

git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo/deploy/pre-deployment
bash pre-deployment-check.sh

A healthy cluster produces:

========================================
  Dynamo Pre-Deployment Check Script
========================================

--- Checking kubectl connectivity ---
✅ kubectl is available and cluster is accessible

--- Checking for default StorageClass ---
✅ Default StorageClass found

--- Checking cluster GPU resources ---
✅ Found 2 GPU node(s) in the cluster

--- Checking GPU operator ---
✅ GPU operator is running (1/1 pods)

Summary: 4 passed, 0 failed
🎉 All pre-deployment checks passed!

Two checks commonly fail on a fresh Instant Cluster:

"No GPU nodes found"

The script looks for the label nvidia.com/gpu.present=true. Verda Instant Clusters carry the standard nvidia.com/gpu.product label, which is set by GPU Feature Discovery running alongside the Device Plugin and which the Dynamo script doesn't currently recognize. Fix by labelling each GPU node:

for n in $(kubectl get nodes -l nvidia.com/gpu.product -o name); do
  kubectl label "$n" nvidia.com/gpu.present=true --overwrite
done

"GPU operator not found"

The Verda Instant Cluster image ships only the standalone NVIDIA Device Plugin and Network Operator — not the full GPU Operator. Dynamo expects GPU Operator's ClusterPolicy CRD to be present. Install it next.

Step 2 — Install NVIDIA GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
helm repo update nvidia

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --wait --timeout=600s

Verify:

kubectl get pods -n gpu-operator
kubectl get clusterpolicies.nvidia.com

You should see the operator pod Running and a ClusterPolicy named cluster-policy with state: ready.
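The ClusterPolicy can take a while to settle after helm returns, so polling its state is more reliable than a single check. The helper below is a sketch: wait_for_clusterpolicy is a hypothetical name, and it assumes the resource is called cluster-policy and reports readiness in .status.state, as described above.

```shell
# Poll the ClusterPolicy until .status.state is "ready" or attempts run out.
# $1 = number of polls (default 60), $2 = seconds between polls (default 10).
wait_for_clusterpolicy() {
  attempts="${1:-60}"
  interval="${2:-10}"
  i=0
  state=""
  while [ "$i" -lt "$attempts" ]; do
    state=$(kubectl get clusterpolicies.nvidia.com cluster-policy \
      -o jsonpath='{.status.state}' 2>/dev/null) || state=""
    if [ "$state" = "ready" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "ClusterPolicy state is '${state:-unknown}', not ready" >&2
  return 1
}
```

Calling wait_for_clusterpolicy with no arguments waits up to roughly ten minutes, which matches the helm timeout used above.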

Coexistence with the standalone Device Plugin

GPU Operator's default install detects the existing Device Plugin / NFD / GFD components shipped by the Instant Cluster image and coexists gracefully — the device plugin DaemonSet is owned by whichever chart installed it first, and GPU Operator's components fill in the gaps (DCGM Exporter, MIG Manager, validator). Confirm allocatable GPU counts haven't changed after install:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

If you see 0 or doubled counts (e.g. 16 on a worker that should have 8), the two device plugins conflicted — open a Verda support ticket.
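On clusters with many nodes it is easy to miss one bad line in that output. As a sketch, the hypothetical helper below reads the same node<TAB>gpu=N lines on stdin and prints only the nodes whose count differs from the value you expect per node.

```shell
# Read "node<TAB>gpu=N" lines on stdin and report nodes whose allocatable
# count differs from the expected per-node value passed as $1.
# Nodes with an empty count (no GPU resource at all) are skipped.
check_gpu_counts() {
  awk -F'\t' -v want="$1" '
    {
      split($2, kv, "=")
      got = kv[2]
      if (got == "") next
      if (got + 0 != want + 0) {
        printf "%s: allocatable=%s expected=%s\n", $1, got, want
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

Pipe the jsonpath command above into check_gpu_counts 8 (or whatever your nodes should expose); no output and a zero exit status mean every GPU node reports the expected count.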

Step 3 — Install Dynamo Platform

Dynamo's CRDs and platform are published as HTTPS Helm charts (not OCI) at helm.ngc.nvidia.com/nvidia/ai-dynamo. Install in two steps — CRDs first, then platform:

helm repo add nvidia-ai-dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo
helm repo update nvidia-ai-dynamo

# See what's published
helm search repo nvidia-ai-dynamo

# Install CRDs (use the latest dynamo-crds version)
helm install dynamo-crds nvidia-ai-dynamo/dynamo-crds \
  --version 0.9.1 \
  --namespace "$NAMESPACE" --create-namespace \
  --wait

# Install platform (use the latest dynamo-platform version)
helm install dynamo-platform nvidia-ai-dynamo/dynamo-platform \
  --version 1.1.0 \
  --namespace "$NAMESPACE" \
  --wait --timeout=600s

Pin to versions you've actually verified exist

The dynamo-crds, dynamo-platform, and dynamo-graph charts version independently — at time of writing the latest are 0.9.1, 1.1.0, and 0.8.1 respectively. Always run helm search repo nvidia-ai-dynamo --versions before installing rather than relying on hardcoded examples in tutorials (including this one).
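To avoid copying version numbers by hand, the lookup can be scripted. This sketch assumes the default table output of helm search repo, where the second column is the chart version and only the latest stable release is listed when --versions is omitted; latest_chart_version is a hypothetical helper name.

```shell
# Print the newest published version of one chart from an already-added repo.
# $1 is the full chart reference, e.g. nvidia-ai-dynamo/dynamo-crds.
# Matching on the exact name avoids picking up charts that merely contain it.
latest_chart_version() {
  helm search repo "$1" | awk -v chart="$1" '$1 == chart { print $2; exit }'
}
```

For example, passing --version "$(latest_chart_version nvidia-ai-dynamo/dynamo-crds)" to helm install picks up whatever is currently published, at the cost of no longer pinning a version you have tested.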

Verify:

kubectl get pods -n "$NAMESPACE"
kubectl get crd | grep dynamo

You should see two pods running (dynamo-platform-dynamo-operator-controller-manager and dynamo-platform-nats-0) and seven CRDs including dynamographdeployments.nvidia.com and dynamographdeploymentrequests.nvidia.com.

Step 4 — Create credentials secrets

Dynamo needs two distinct credentials at two different layers:

| Secret | Used by | Purpose |
| --- | --- | --- |
| nvcr-imagepullsecret | kubelet (before container start) | Pull Dynamo container images from nvcr.io |
| hf-token-secret | inference container (at runtime) | Download model weights from huggingface.co |

Create both:

# NGC image pull secret
kubectl create secret docker-registry nvcr-imagepullsecret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY" \
  -n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

# HuggingFace token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  -n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

Don't conflate these two secrets

HF_TOKEN and NGC_API_KEY are not interchangeable. NGC docs sometimes only mention HF_TOKEN because most NGC images are anonymous-pullable, but Dynamo's Job templates reference nvcr-imagepullsecret regardless. Creating both eliminates a class of confusing pull-error and warning messages.

Step 5 — Deploy a model

The simplest path is a DynamoGraphDeploymentRequest (DGDR) — Dynamo's auto-profiling resource. The operator runs a profiler job, determines optimal sharding/parallelism for your hardware, and auto-creates a DynamoGraphDeployment (DGD) that spawns the actual inference pods.

Fetch the quickstart manifest:

curl -O https://raw.githubusercontent.com/datacrunch-research/instant-cluster-examples/main/dynamo/qwen3-quickstart.yaml

Open qwen3-quickstart.yaml and edit the hardware block to match your cluster:

  • spec.model — any HuggingFace model readable by $HF_TOKEN (default: Qwen/Qwen3-0.6B).
  • spec.hardware.totalGpus — allocatable GPU count from kubectl get nodes. Default 8 assumes a single 8-GPU node.
  • spec.hardware.numGpusPerNode — GPUs per node, usually 8.
  • spec.hardware.vramMb — per-GPU VRAM in MiB. 275039 is correct for B300 SXM6.
  • spec.hardware.gpuSku — keep b200_sxm for both B200 and B300 (see Known issues below).
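The two GPU-count fields can be derived from the allocatable-GPU listing used in the prerequisites rather than counted by eye. The sketch below assumes homogeneous GPU nodes (it reports the largest per-node count) and reads node<TAB>gpu=N lines on stdin; gpu_hardware_summary is a hypothetical helper name.

```shell
# Summarise "node<TAB>gpu=N" lines from stdin into the two hardware fields.
# Nodes with no GPU resource (empty or zero N) are ignored.
gpu_hardware_summary() {
  awk -F'\t' '
    {
      split($2, kv, "=")
      n = kv[2] + 0
      if (n > 0) {
        total += n
        if (n > per_node) per_node = n
      }
    }
    END { printf "totalGpus=%d numGpusPerNode=%d\n", total, per_node }
  '
}
```

Feed it the output of the kubectl get nodes jsonpath command from the prerequisites, then copy the two values into spec.hardware.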

Apply:

kubectl apply -f qwen3-quickstart.yaml
kubectl get pods -n "$NAMESPACE" -w

What happens next:

  1. The operator creates a profile-qwen3-quickstart-... Job. Its profiler container runs hardware sweeps to find the best deployment shape.
  2. On success, the profiler emits a config to a ConfigMap, and the operator generates a DynamoGraphDeployment.
  3. The DGD spawns inference pods — typically a frontend, one or more prefill workers, decode workers, and a router. They pull their runtime image (e.g. tensorrtllm-runtime:1.1.0) and start.
  4. The frontend exposes an OpenAI-compatible HTTP endpoint via a Service.

Profiling takes 5–15 min for small models and 30+ min for larger ones. To watch progress, run kubectl logs -f <profiler-pod> -c profiler -n "$NAMESPACE".
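The profiler pod name carries a generated suffix, so looking it up by prefix saves a copy-and-paste step. This is a sketch that assumes the pod name begins with profile-<dgdr-name>, following the Job name pattern above, and that the container is named profiler; profiler_logs is a hypothetical helper name.

```shell
# Follow the profiler logs for a given DGDR without typing the pod name.
# $1 = DGDR name, e.g. qwen3-quickstart. Requires NAMESPACE to be set.
profiler_logs() {
  dgdr="$1"
  pod=$(kubectl get pods -n "$NAMESPACE" -o name \
    | grep "^pod/profile-${dgdr}" | head -n 1) || true
  if [ -z "$pod" ]; then
    echo "no profiler pod found for ${dgdr}" >&2
    return 1
  fi
  kubectl logs -f "$pod" -c profiler -n "$NAMESPACE"
}
```

With the quickstart manifest, profiler_logs qwen3-quickstart streams the sweep output until the Job finishes.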

Expected: startup probe warnings during first deploy

Worker pods will emit Startup probe failed events on port 9090 (/live) for several minutes after they start. This is normal — the runtime needs to pull a large image (TRT-LLM runtime is ~19 GB), load model weights, compile inference engines, and allocate KV cache before it's healthy. Total cold-start can be 10+ minutes on first deploy. The startup probe is configured with a long failureThreshold to tolerate this; the pod will become Ready once /live returns 200. Subsequent restarts on the same node are much faster because the image is cached.

Step 6 — Verify inference

Find the frontend service and port-forward:

kubectl get svc -n "$NAMESPACE"

FRONTEND_SVC=$(kubectl get svc -n "$NAMESPACE" -o name | grep -iE "frontend|router|api" | head -1)
kubectl port-forward "$FRONTEND_SVC" 8000:8000 -n "$NAMESPACE" &
PF_PID=$!

# Wait for the forward to be ready
for i in $(seq 1 20); do
  curl -s -m 1 http://localhost:8000/health >/dev/null 2>&1 && break
  sleep 1
done

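# OpenAI-compatible chat completion. "model" must match spec.model in your DGDR
# (Qwen/Qwen3-0.6B if you kept the quickstart default).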
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
    "max_tokens": 200
  }' | jq

kill "$PF_PID"

Example response (reasoning model — content and reasoning_content are returned as separate fields):

{
  "id": "chatcmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": null,
        "role": "assistant",
        "reasoning_content": "We need to answer: \"What is NVIDIA Dynamo?\" I need to recall or infer what NVIDIA Dynamo is. ... [model's chain-of-thought continues here]"
      },
      "finish_reason": "length",
      "logprobs": null
    }
  ],
  "created": 1778660379,
  "model": "deepseek-ai/DeepSeek-V4-Pro",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 200,
    "total_tokens": 210
  }
}

A successful response carries the generated answer in choices[0].message.content. Reasoning models first emit a reasoning_content trace and populate content only once the reasoning phase completes; in the example above the 200-token budget ran out first (finish_reason is length), so content is null. Raise max_tokens if you want to see content populated. If you get an empty body, the port-forward likely raced ahead of the service being ready; wait longer or extend the polling loop.
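When scripting against the endpoint, it helps to handle both response shapes in one place. The sketch below uses jq, which the commands above already rely on, to print content when it is present and fall back to reasoning_content otherwise; extract_answer is a hypothetical helper name.

```shell
# Read a chat-completion JSON document on stdin and print the final answer,
# falling back to the reasoning trace when content is null or missing.
extract_answer() {
  jq -r '.choices[0].message | (.content // .reasoning_content // "")'
}
```

Replace the trailing | jq in the curl command with | extract_answer to get just the text.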

Cleanup

To remove Dynamo and restore the cluster to its original state:

# 1. Delete the running deployment (DGD owns the pods, not DGDR)
kubectl get dgd -n dynamo-system
kubectl delete dgd qwen3-quickstart-dgd -n dynamo-system

# 2. Delete the request and any leftover output ConfigMaps
kubectl delete dgdr qwen3-quickstart -n dynamo-system 2>/dev/null
kubectl delete cm -l dgdr.nvidia.com/name -n dynamo-system

# Or nuke everything in one shot
kubectl delete dgd,dgdr,dynamocomponentdeployment --all -n dynamo-system

# 3. Uninstall the platform
helm uninstall dynamo-platform -n dynamo-system
helm uninstall dynamo-crds     -n dynamo-system
kubectl delete ns dynamo-system

# Optional: uninstall GPU Operator if you only added it for Dynamo
# helm uninstall gpu-operator -n gpu-operator
# kubectl delete ns gpu-operator

DGDR vs DGD lifecycle

The DynamoGraphDeploymentRequest (DGDR) is a one-shot object: the operator profiles hardware, generates a plan, creates a DynamoGraphDeployment (DGD), and is then done. The DGD is not owned by the DGDR — it has an independent lifecycle. Deleting the DGDR does not take down the running inference pods. You must delete the DGD to stop the workload. The DGD's name is the DGDR name with a -dgd suffix (e.g. DGDR qwen3-quickstart → DGD qwen3-quickstart-dgd).
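The naming rule and deletion order can be captured in two small functions. This is a sketch built only on the -dgd suffix convention described above; dgd_name_for and teardown_dgdr are hypothetical helper names.

```shell
# Derive the DGD name from a DGDR name using the -dgd suffix rule.
dgd_name_for() {
  printf '%s-dgd\n' "$1"
}

# Delete the running workload first, then the request that produced it.
# --ignore-not-found keeps the call safe to repeat.
teardown_dgdr() {
  kubectl delete dgd "$(dgd_name_for "$1")" -n "$NAMESPACE" --ignore-not-found
  kubectl delete dgdr "$1" -n "$NAMESPACE" --ignore-not-found
}
```

For the quickstart, teardown_dgdr qwen3-quickstart removes qwen3-quickstart-dgd and then the request itself.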

Known issues

B300 not in Dynamo's gpuSku enum

The profiler will reject B300 hardware with a Pydantic enum error because the operator auto-fills gpuSku from the nvidia.com/gpu.product node label (NVIDIA B300 SXM6 AC), which isn't in the allowed list.

Workaround: explicitly set spec.hardware.gpuSku: b200_sxm in your DGDR (as shown in Step 5). Keep vramMb at the actual B300 value (275039). B200 and B300 share the same chip family and NVLink5 bandwidth, so plans generated under the B200 profile run correctly. Track upstream at github.com/ai-dynamo/dynamo.
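If you template DGDR manifests across several clusters, the workaround can be encoded as a lookup from the node's product label to the gpuSku value. The sketch below covers only the B200/B300 rule stated in this tutorial and refuses to guess for anything else; gpu_sku_for_product is a hypothetical helper name.

```shell
# Map a nvidia.com/gpu.product label value to the gpuSku to set in the DGDR.
# Only the B200/B300 -> b200_sxm rule comes from this tutorial; any other
# product is reported as unknown instead of being guessed.
gpu_sku_for_product() {
  case "$1" in
    *B200*|*B300*) echo "b200_sxm" ;;
    *) echo "unknown product: $1" >&2; return 1 ;;
  esac
}
```

For example, gpu_sku_for_product "NVIDIA B300 SXM6 AC" prints b200_sxm, which is the value to place in spec.hardware.gpuSku.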

Upstream helm install examples use the wrong scheme

NVIDIA's docs sometimes show oci://helm.ngc.nvidia.com/... for Dynamo charts. That scheme returns "not found" — Dynamo's NGC repo is HTTPS-based. Use helm repo add (Step 3) or fetch the .tgz directly:

helm install dynamo-platform \
  https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-1.1.0.tgz \
  -n "$NAMESPACE"

Listing NGC image tags requires a Bearer token

docker login nvcr.io works, but raw curl -u '$oauthtoken':$NGC_API_KEY against tags/list returns 401 — the endpoint requires a Bearer token exchanged via /proxy_auth. Reusable helper:

ngc_tags() {
  local repo="$1" tok
  # Exchange the API key for a short-lived Bearer token scoped to this repo
  tok=$(curl -s -u '$oauthtoken':"$NGC_API_KEY" \
    "https://nvcr.io/proxy_auth?scope=repository:${repo}:pull&service=nvcr.io" \
    | jq -r '.token // .access_token')
  curl -s -H "Authorization: Bearer $tok" \
    "https://nvcr.io/v2/${repo}/tags/list" \
    | jq -r '.tags[]?' | grep -vE 'sha256-' | sort -V
}

ngc_tags nvidia/ai-dynamo/dynamo-planner
ngc_tags nvidia/ai-dynamo/tensorrtllm-runtime

References