Deploying NVIDIA Dynamo on a Kubernetes Instant Cluster
NVIDIA Dynamo is an open-source, distributed inference-serving framework built to deploy LLMs and other generative models in multi-node environments at data-center scale. It supports multiple inference backends — SGLang, NVIDIA TensorRT-LLM, and vLLM — and disaggregates the prefill and decode stages of inference across pods so each can be scaled independently.
This tutorial walks through deploying Dynamo on a Verda Kubernetes Instant Cluster (B200 / B300 class hardware) end-to-end: prerequisites, install, model deployment, and verification. It also documents the known issues and workarounds we hit during validation so you can skip past them.
What you'll deploy
| Layer | Component | Role |
|---|---|---|
| Kubernetes operator | NVIDIA GPU Operator | Manages driver, container toolkit, MIG, DCGM exporter |
| Kubernetes operator | NVIDIA Network Operator | RDMA over InfiniBand (already installed on Verda Instant Clusters) |
| Dynamo control plane | dynamo-crds chart | Defines the Dynamo CRDs (DynamoGraphDeployment, DynamoGraphDeploymentRequest, etc.) |
| Dynamo control plane | dynamo-platform chart | Operator + NATS messaging + planner job runner |
| Dynamo workload | DynamoGraphDeploymentRequest (DGDR) | Auto-profiles your hardware and generates an optimal DynamoGraphDeployment |
| Dynamo workload | DynamoGraphDeployment (DGD) | The actual inference graph: frontend, prefill workers, decode workers, router |
End users hit an OpenAI-compatible HTTP endpoint exposed by the frontend service; the operator handles everything below it.
Prerequisites
Before you start, confirm:
- A Verda Kubernetes Instant Cluster with at least one GPU node, and kubectl access from the jumphost. See Deploying an Instant Cluster if you don't have one yet.
- helm v3.12+ on the jumphost (helm version).
- An NVIDIA NGC API key: generate at ngc.nvidia.com → top-right user menu → Setup → Generate API Key. The key needs NGC Catalog (Container Registry) access.
- A HuggingFace token with read access to whichever model you plan to serve: generate at huggingface.co → Settings → Access Tokens.
- Free GPU capacity at the Kubernetes layer. Confirm with the command below; Dynamo's inference workers will go Pending if no GPUs are allocatable. If your cluster has other workloads holding GPUs (e.g. Slurm via Slinky), scale them down before deploying Dynamo.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
- A default StorageClass that supports ReadWriteMany (RWX) access mode. Dynamo's disaggregated architecture spreads prefill and decode workers across multiple nodes, and every worker mounts the same model-weights PVC so the weights are downloaded once and shared. This matters for two reasons:
  - Multi-node correctness. With ReadWriteOnce (RWO) only one node can mount the volume at a time, so a multi-node DGD cannot bind its workers to a single shared PVC at all. The operator either fails to deploy or silently degrades to a single-node graph that wastes the rest of the cluster.
  - Cold-start time and disk usage. Even on a single node, RWX lets you reuse one cached copy of the weights across re-deploys (prefill ↔ decode ratio changes, autoscaler events, planner regenerations). Without it, every worker re-downloads from HuggingFace, which takes minutes to hours for large models and uses N× the disk.

  Verify that your default StorageClass is backed by an RWX-capable provisioner before installing Dynamo:

  kubectl get sc
  kubectl get sc <name> -o jsonpath='{.metadata.name}: {.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}'
  # Identify the provisioner. StorageClass objects do not list access modes,
  # so check the provisioner's documentation for ReadWriteMany support:
  kubectl describe sc <name> | grep -iE 'provisioner|allowVolumeExpansion|VolumeBindingMode'

  Verda Instant Clusters ship with an RWX-capable default StorageClass out of the box (this is the same setup used for Verda's managed inference clusters), so on a stock cluster this just works. If you have swapped it for an RWO-only provisioner (e.g. local-path, raw block), switch back or add an RWX-capable class; common choices are CephFS, NFS, Longhorn RWX, or any CSI driver that exposes ReadWriteMany. After installing a new class, make it the default by setting its storageclass.kubernetes.io/is-default-class annotation to "true".
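Marking a class as the default is usually done with kubectl patch on the is-default-class annotation. The sketch below wraps that call in a helper; the function name set_default_storageclass and the class name in the usage comment are hypothetical, and it assumes the RWX-capable class is already installed.

```shell
# Mark one StorageClass as the cluster default by setting the
# standard is-default-class annotation to "true".
# Assumption: "$1" names an RWX-capable class that already exists.
set_default_storageclass() {
  sc_name="$1"
  kubectl patch storageclass "$sc_name" \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}

# Usage (hypothetical class name):
#   set_default_storageclass my-rwx-class
```

If another class still carries the same annotation with the value "true", patch it to "false" the same way so that only one default remains.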
Set these as environment variables on the jumphost — every command in this tutorial assumes they're set:
export NAMESPACE=dynamo-system
export NGC_API_KEY='nvapi-...your-key-here...'
export HF_TOKEN='hf_...your-token-here...'
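Because every later command depends on these three variables, a quick guard can catch a missing export before anything touches the cluster. This is a minimal sketch; check_dynamo_env is a hypothetical helper name, and it deliberately prints only variable names, never their values.

```shell
# Return non-zero if any variable this tutorial relies on is unset or empty.
# Only names are printed, so the API key and token never reach the terminal.
check_dynamo_env() {
  missing=0
  for var in NAMESPACE NGC_API_KEY HF_TOKEN; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_dynamo_env || echo "export the missing variables first"
```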
Step 1 — Pre-deployment check
The Dynamo repo ships a script that validates cluster readiness:
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo/deploy/pre-deployment
bash pre-deployment-check.sh
A healthy cluster produces:
========================================
Dynamo Pre-Deployment Check Script
========================================
--- Checking kubectl connectivity ---
✅ kubectl is available and cluster is accessible
--- Checking for default StorageClass ---
✅ Default StorageClass found
--- Checking cluster GPU resources ---
✅ Found 2 GPU node(s) in the cluster
--- Checking GPU operator ---
✅ GPU operator is running (1/1 pods)
Summary: 4 passed, 0 failed
🎉 All pre-deployment checks passed!
Two checks commonly fail on a fresh Instant Cluster:
"No GPU nodes found"
The script looks for the label nvidia.com/gpu.present=true. Verda Instant Clusters instead carry the standard nvidia.com/gpu.product label (set by GPU Feature Discovery, which ships alongside the Device Plugin), and the Dynamo script doesn't currently recognize it. Fix by labelling each GPU node:
for n in $(kubectl get nodes -l nvidia.com/gpu.product -o name); do
kubectl label "$n" nvidia.com/gpu.present=true --overwrite
done
"GPU operator not found"
The Verda Instant Cluster image ships only the standalone NVIDIA Device Plugin and Network Operator — not the full GPU Operator. Dynamo expects GPU Operator's ClusterPolicy CRD to be present. Install it next.
Step 2 — Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
helm repo update nvidia
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace \
--wait --timeout=600s
Verify:
You should see the operator pod Running and a ClusterPolicy named cluster-policy with state: ready.
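One way to run that check is sketched below. It assumes the release went into the gpu-operator namespace as in the install command above, and that the ClusterPolicy reports its state under .status.state; gpu_operator_ready is a hypothetical helper name.

```shell
# List the operator pods, then succeed only if the ClusterPolicy
# named cluster-policy reports state "ready".
gpu_operator_ready() {
  kubectl get pods -n gpu-operator
  state=$(kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}')
  echo "ClusterPolicy state: $state"
  [ "$state" = "ready" ]
}

# Usage:
#   gpu_operator_ready || echo "GPU Operator is not ready yet"
```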
Coexistence with the standalone Device Plugin
GPU Operator's default install detects the existing Device Plugin / NFD / GFD components shipped by the Instant Cluster image and coexists gracefully — the device plugin DaemonSet is owned by whichever chart installed it first, and GPU Operator's components fill in the gaps (DCGM Exporter, MIG Manager, validator). Confirm allocatable GPU counts haven't changed after install:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
If you see 0 or doubled counts (e.g. 16 on a worker that should have 8), the two device plugins conflicted — open a Verda support ticket.
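The comparison can be automated by piping the jsonpath output above into a small filter. This is a sketch under the assumption that every GPU worker should expose the same count; check_gpu_counts is a hypothetical helper, and nodes that report no GPU resource at all (for example control-plane nodes) are skipped rather than flagged.

```shell
# Read "node<TAB>gpu=N" lines on stdin and report every node whose
# allocatable GPU count differs from the expected per-node count.
# Exits non-zero if at least one node is off.
check_gpu_counts() {
  expected="$1"
  awk -F'\tgpu=' -v want="$expected" '
    $2 == ""   { next }   # node exposes no GPU resource; skip it
    $2 != want { printf "%s: got %s, expected %s\n", $1, $2, want; bad = 1 }
    END        { exit bad }
  '
}

# Usage (8 is only an example value):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}gpu={.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
#     | check_gpu_counts 8
```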
Step 3 — Install Dynamo Platform
Dynamo's CRDs and platform are published as HTTPS Helm charts (not OCI) at helm.ngc.nvidia.com/nvidia/ai-dynamo. Install in two steps — CRDs first, then platform:
helm repo add nvidia-ai-dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo
helm repo update nvidia-ai-dynamo
# See what's published
helm search repo nvidia-ai-dynamo
# Install CRDs (use the latest dynamo-crds version)
helm install dynamo-crds nvidia-ai-dynamo/dynamo-crds \
--version 0.9.1 \
--namespace "$NAMESPACE" --create-namespace \
--wait
# Install platform (use the latest dynamo-platform version)
helm install dynamo-platform nvidia-ai-dynamo/dynamo-platform \
--version 1.1.0 \
--namespace "$NAMESPACE" \
--wait --timeout=600s
Pin to versions you've actually verified exist
The dynamo-crds, dynamo-platform, and dynamo-graph charts version independently — at time of writing the latest are 0.9.1, 1.1.0, and 0.8.1 respectively. Always run helm search repo nvidia-ai-dynamo --versions before installing rather than relying on hardcoded examples in tutorials (including this one).
Verify:
You should see two pods running (dynamo-platform-dynamo-operator-controller-manager and dynamo-platform-nats-0) and seven CRDs including dynamographdeployments.nvidia.com and dynamographdeploymentrequests.nvidia.com.
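A possible way to list what was installed is sketched below. It assumes all Dynamo CRD names start with "dynamo" and live in the nvidia.com group, which holds for the two CRDs named above but is not confirmed for the rest; list_dynamo_crds is a hypothetical helper. The name filter matters because GPU Operator CRDs share the nvidia.com group.

```shell
# Print the CRDs that look like Dynamo's: name starts with "dynamo"
# and ends in ".nvidia.com". GPU Operator CRDs are filtered out.
list_dynamo_crds() {
  kubectl get crd -o name | grep -i '/dynamo.*\.nvidia\.com$'
}

# Usage:
#   kubectl get pods -n "$NAMESPACE"
#   list_dynamo_crds
```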
Step 4 — Create credentials secrets
Dynamo needs two distinct credentials at two different layers:
| Secret | Used by | Purpose |
|---|---|---|
| nvcr-imagepullsecret | kubelet (before container start) | Pull Dynamo container images from nvcr.io |
| hf-token-secret | inference container (at runtime) | Download model weights from huggingface.co |
Create both:
# NGC image pull secret
kubectl create secret docker-registry nvcr-imagepullsecret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password="$NGC_API_KEY" \
-n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
# HuggingFace token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="$HF_TOKEN" \
-n "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
Don't conflate these two secrets
HF_TOKEN and NGC_API_KEY are not interchangeable. NGC docs sometimes only mention HF_TOKEN because most NGC images are anonymous-pullable, but Dynamo's Job templates reference nvcr-imagepullsecret regardless. Creating both eliminates a class of confusing pull-error and warning messages.
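To confirm both secrets landed in the right namespace before deploying a model, a check along these lines could be used. check_dynamo_secrets is a hypothetical helper; it tests only that the secrets exist, not that the credentials inside them are valid.

```shell
# Report whether each of the two secrets exists in $NAMESPACE.
# Returns non-zero if either one is missing.
check_dynamo_secrets() {
  rc=0
  for s in nvcr-imagepullsecret hf-token-secret; do
    if kubectl get secret "$s" -n "$NAMESPACE" >/dev/null 2>&1; then
      echo "ok: $s"
    else
      echo "missing: $s"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_dynamo_secrets
```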
Step 5 — Deploy a model
The simplest path is a DynamoGraphDeploymentRequest (DGDR) — Dynamo's auto-profiling resource. The operator runs a profiler job, determines optimal sharding/parallelism for your hardware, and auto-creates a DynamoGraphDeployment (DGD) that spawns the actual inference pods.
Fetch the quickstart manifest:
curl -O https://raw.githubusercontent.com/datacrunch-research/instant-cluster-examples/main/dynamo/qwen3-quickstart.yaml
Open qwen3-quickstart.yaml and edit the hardware block to match your cluster:
- spec.model: any HuggingFace model readable by $HF_TOKEN (default: Qwen/Qwen3-0.6B).
- spec.hardware.totalGpus: allocatable GPU count from kubectl get nodes. Default 8 assumes a single 8-GPU node.
- spec.hardware.numGpusPerNode: GPUs per node, usually 8.
- spec.hardware.vramMb: per-GPU VRAM in MiB. 275039 is correct for B300 SXM6.
- spec.hardware.gpuSku: keep b200_sxm for both B200 and B300 (see Known issues below).
Apply:
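The apply step is presumably a plain kubectl apply of the edited manifest. The sketch below assumes the manifest does not pin a different namespace than $NAMESPACE; apply_quickstart is a hypothetical wrapper, and the dgdr short name is the one used in the Cleanup section.

```shell
# Apply the edited manifest into $NAMESPACE, then list DGDRs so the
# new request is visible. The manifest path defaults to the quickstart file.
apply_quickstart() {
  manifest="${1:-qwen3-quickstart.yaml}"
  kubectl apply -f "$manifest" -n "$NAMESPACE"
  kubectl get dgdr -n "$NAMESPACE"
}

# Usage:
#   apply_quickstart
```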
What happens next:
- The operator creates a profile-qwen3-quickstart-... Job. Its profiler container runs hardware sweeps to find the best deployment shape.
- On success, the profiler emits a config to a ConfigMap, and the operator generates a DynamoGraphDeployment.
- The DGD spawns inference pods: typically a frontend, one or more prefill workers, decode workers, and a router. They pull their runtime image (e.g. tensorrtllm-runtime:1.1.0) and start.
- The frontend exposes an OpenAI-compatible HTTP endpoint via a Service.
Profiling takes 5–15 min for small models and 30+ min for larger ones. To watch progress, run kubectl logs -f <profiler-pod> -c profiler -n "$NAMESPACE".
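Finding the profiler pod by hand gets tedious, so a lookup like the following may help. It assumes the pod name inherits the profile- prefix of the Job described above; profiler_pod is a hypothetical helper.

```shell
# Print the first pod in $NAMESPACE whose name starts with "profile-",
# in the pod/<name> form that kubectl logs accepts.
profiler_pod() {
  kubectl get pods -n "$NAMESPACE" -o name | grep '/profile-' | head -n 1
}

# Usage:
#   kubectl logs -f "$(profiler_pod)" -c profiler -n "$NAMESPACE"
```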
Expected: startup probe warnings during first deploy
Worker pods will emit Startup probe failed events on port 9090 (/live) for several minutes after they start. This is normal — the runtime needs to pull a large image (TRT-LLM runtime is ~19 GB), load model weights, compile inference engines, and allocate KV cache before it's healthy. Total cold-start can be 10+ minutes on first deploy. The startup probe is configured with a long failureThreshold to tolerate this; the pod will become Ready once /live returns 200. Subsequent restarts on the same node are much faster because the image is cached.
Step 6 — Verify inference
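Find the frontend service and port-forward to it. The model field in the request should name the model you actually deployed (spec.model in Step 5); the example below uses deepseek-ai/DeepSeek-V4-Pro: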
kubectl get svc -n "$NAMESPACE"
FRONTEND_SVC=$(kubectl get svc -n "$NAMESPACE" -o name | grep -iE "frontend|router|api" | head -1)
kubectl port-forward "$FRONTEND_SVC" 8000:8000 -n "$NAMESPACE" &
PF_PID=$!
# Wait for the forward to be ready
for i in $(seq 1 20); do
curl -s -m 1 http://localhost:8000/health >/dev/null 2>&1 && break
sleep 1
done
# OpenAI-compatible chat completion
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
"max_tokens": 200
}' | jq
kill $PF_PID
Example response (reasoning model — content and reasoning_content are returned as separate fields):
{
"id": "chatcmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"choices": [
{
"index": 0,
"message": {
"content": null,
"role": "assistant",
"reasoning_content": "We need to answer: \"What is NVIDIA Dynamo?\" I need to recall or infer what NVIDIA Dynamo is. ... [model's chain-of-thought continues here]"
},
"finish_reason": "length",
"logprobs": null
}
],
"created": 1778660379,
"model": "deepseek-ai/DeepSeek-V4-Pro",
"service_tier": null,
"system_fingerprint": null,
"object": "chat.completion",
"usage": {
"prompt_tokens": 10,
"completion_tokens": 200,
"total_tokens": 210
}
}
A successful response from a non-reasoning model includes a choices[0].message.content string with the generated answer. A reasoning model returns a reasoning_content trace first and populates content only once the reasoning phase completes. In the example above the 200-token budget ran out during reasoning (finish_reason is length), so content is null; raise max_tokens if you want to see content populated. If you get an empty body, the port-forward likely raced ahead of the service being ready: wait longer or extend the polling loop.
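Extending the polling loop is easier when the retry count is a parameter. The sketch below is one way to do that; wait_for_http is a hypothetical helper, and unlike the loop in the script above it passes -f to curl, so an HTTP error status counts as "not ready yet".

```shell
# Poll a URL once per second until it answers with a non-error status,
# giving up after the given number of attempts (default 60).
wait_for_http() {
  url="$1"
  tries="${2:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s -f -m 1 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_http http://localhost:8000/health 120 || echo "frontend never became ready"
```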
Cleanup
To remove Dynamo and restore the cluster to its original state:
# 1. Delete the running deployment (DGD owns the pods, not DGDR)
kubectl get dgd -n dynamo-system
kubectl delete dgd qwen3-quickstart-dgd -n dynamo-system
# 2. Delete the request and any leftover output ConfigMaps
kubectl delete dgdr qwen3-quickstart -n dynamo-system 2>/dev/null
kubectl delete cm -l dgdr.nvidia.com/name -n dynamo-system
# Or nuke everything in one shot
kubectl delete dgd,dgdr,dynamocomponentdeployment --all -n dynamo-system
# 3. Uninstall the platform
helm uninstall dynamo-platform -n dynamo-system
helm uninstall dynamo-crds -n dynamo-system
kubectl delete ns dynamo-system
# Optional: uninstall GPU Operator if you only added it for Dynamo
# helm uninstall gpu-operator -n gpu-operator
# kubectl delete ns gpu-operator
DGDR vs DGD lifecycle
The DynamoGraphDeploymentRequest (DGDR) is a one-shot object: the operator profiles hardware, generates a plan, creates a DynamoGraphDeployment (DGD), and is then done. The DGD is not owned by the DGDR — it has an independent lifecycle. Deleting the DGDR does not take down the running inference pods. You must delete the DGD to stop the workload. The DGD's name is the DGDR name with a -dgd suffix (e.g. DGDR qwen3-quickstart → DGD qwen3-quickstart-dgd).
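The naming rule makes it straightforward to tear down both objects from the DGDR name alone. This is a sketch; dgd_name_for and delete_workload are hypothetical helpers, and the -dgd suffix rule is taken from the note above.

```shell
# Derive the DGD name from a DGDR name using the "-dgd" suffix rule.
dgd_name_for() {
  printf '%s-dgd\n' "$1"
}

# Delete the DGD first (this stops the inference pods), then the DGDR.
delete_workload() {
  dgdr="$1"
  kubectl delete dgd "$(dgd_name_for "$dgdr")" -n "$NAMESPACE"
  kubectl delete dgdr "$dgdr" -n "$NAMESPACE"
}

# Usage:
#   delete_workload qwen3-quickstart
```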
Known issues
B300 not in Dynamo's gpuSku enum
The profiler will reject B300 hardware with a Pydantic enum error because the operator auto-fills gpuSku from the nvidia.com/gpu.product node label (NVIDIA B300 SXM6 AC), which isn't in the allowed list.
Workaround: explicitly set spec.hardware.gpuSku: b200_sxm in your DGDR (as shown in Step 5). Keep vramMb at the actual B300 value (275039). B200 and B300 share the same chip family and NVLink5 bandwidth, so plans generated under the B200 profile run correctly. Track upstream at github.com/ai-dynamo/dynamo.
Upstream helm install examples use the wrong scheme
NVIDIA's docs sometimes show oci://helm.ngc.nvidia.com/... for Dynamo charts. That scheme returns "not found" — Dynamo's NGC repo is HTTPS-based. Use helm repo add (Step 3) or fetch the .tgz directly:
helm install dynamo-platform \
https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-1.1.0.tgz \
-n "$NAMESPACE"
Listing NGC image tags requires a Bearer token
docker login nvcr.io works, but raw curl -u '$oauthtoken':$NGC_API_KEY against tags/list returns 401 — the endpoint requires a Bearer token exchanged via /proxy_auth. Reusable helper:
ngc_tags() {
local repo="$1"
local tok=$(curl -s -u '$oauthtoken':"$NGC_API_KEY" \
"https://nvcr.io/proxy_auth?scope=repository:${repo}:pull&service=nvcr.io" \
| jq -r '.token // .access_token')
curl -s -H "Authorization: Bearer $tok" \
"https://nvcr.io/v2/${repo}/tags/list" \
| jq -r '.tags[]?' | grep -vE 'sha256-' | sort -V
}
ngc_tags nvidia/ai-dynamo/dynamo-planner
ngc_tags nvidia/ai-dynamo/tensorrtllm-runtime