
Tutorial: deploying vLLM inference on an Instant Cluster using Ray

If you need to run inference on more than 8 GPUs, you can do so on our Instant Cluster using vLLM with Ray.

Warning

The vLLM command and the required steps may differ depending on the model you are deploying.

After you provision your Instant Cluster and SSH into the first node, set up the environment (repeat these steps on every node, since the worker nodes need the same virtual environment):

apt install python3-venv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env  # or restart shell

uv python install 3.12
uv venv --python 3.12
source .venv/bin/activate

# create pyproject.toml and add dependencies to it
cat << EOF > pyproject.toml
[project]
name = "vllm-ray"
version = "1.0.0"
dependencies = [
    "ray==2.52.0",
    "vllm",
]

[tool.uv.sources]
vllm = { url = "https://github.com/vllm-project/vllm/releases/download/v0.12.0/vllm-0.12.0+cu130-cp38-abi3-manylinux_2_31_x86_64.whl" }

[[tool.uv.index]]
url = "https://download.pytorch.org/whl/cu130"
EOF

uv sync --index-strategy unsafe-best-match

Then download the model you want to run; remember to replace YOUR_HF_TOKEN with your actual Hugging Face token:

export HF_TOKEN=YOUR_HF_TOKEN
hf auth login --token $HF_TOKEN
hf download deepseek-ai/deepseek-llm-7b-chat # or any other model
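The snapshot lands in the Hugging Face cache, which lives under ~/.cache/huggingface by default. If your root disk is small, you can relocate it with the standard HF_HOME override before downloading (the /workspace path below is just an illustrative example):

```shell
# default download location; hub snapshots go under <cache>/hub
echo "${HF_HOME:-$HOME/.cache/huggingface}/hub"
# to relocate the cache, export HF_HOME before running hf download, e.g.:
# export HF_HOME=/workspace/hf-cache
```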

After the model has been downloaded, we can start Ray on the first node:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export GLOO_SOCKET_IFNAME=eth0
ray start --head --num-gpus=8 --port=6379
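The GLOO_SOCKET_IFNAME export above assumes the inter-node interface is named eth0. If the serve step later fails with Gloo connection errors, list the interfaces on each node and pick the one carrying the cluster's private network (a quick check; the loopback lo will also appear and should be ignored):

```shell
# every network interface known to the kernel, one per line
ls /sys/class/net
```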

Then on each worker node we also start Ray; remember to replace FIRST_NODE_IP with the actual IP of the first node:

source .venv/bin/activate
export GLOO_SOCKET_IFNAME=eth0  # same interface setting as on the first node
ray start --address="FIRST_NODE_IP:6379" --num-gpus=8 --block

Finally, on the first node we can start serving with vLLM. Set pipeline-parallel-size to the number of nodes (including the head node) and tensor-parallel-size to the number of GPUs per node:

source .venv/bin/activate # if not in venv already
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-7b-chat \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
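As a sanity check before launching: tensor-parallel-size × pipeline-parallel-size is the total number of GPU workers vLLM will request from Ray, so it must not exceed the GPUs available in the cluster. A minimal sketch for the two-node, 8-GPUs-per-node example above (NODES and GPUS_PER_NODE are placeholders for your own cluster shape):

```shell
NODES=2            # total nodes, including the head node
GPUS_PER_NODE=8    # GPUs on each node
TENSOR_PARALLEL=$GPUS_PER_NODE    # matches --tensor-parallel-size above
PIPELINE_PARALLEL=$NODES          # matches --pipeline-parallel-size above
# the product must not exceed the total GPU count Ray can see
echo $((TENSOR_PARALLEL * PIPELINE_PARALLEL))
```

Once the server is up, it exposes an OpenAI-compatible API on port 8000 by default, so you can point any OpenAI client or a plain curl at the first node's /v1/completions endpoint.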