Tutorial: Deploying vLLM Inference on an Instant Cluster Using Ray

If you need to run your inference on more than 8 GPUs, you can do so on our Instant Cluster using vLLM with Ray.


After you grab your Instant Cluster and SSH into the first node, you need to set up the environment:

apt install python3-venv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env  # or restart shell

uv python install 3.12
uv venv --python 3.12
source .venv/bin/activate

# Create pyproject.toml and add dependencies to it
cat << EOF > pyproject.toml
[project]
name = "vllm-ray"
version = "1.0.0"
dependencies = [
    "ray==2.52.0",
    "vllm",
]

[tool.uv.sources]
vllm = { url = "https://github.com/vllm-project/vllm/releases/download/v0.12.0/vllm-0.12.0+cu130-cp38-abi3-manylinux_2_31_x86_64.whl" }

[[tool.uv.index]]
url = "https://download.pytorch.org/whl/cu130"
EOF

uv sync --index-strategy unsafe-best-match

Then download the model you want to run. Remember to replace YOUR_HF_TOKEN with your actual Hugging Face token:
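As a minimal sketch, assuming you use the huggingface-cli that comes with vLLM's huggingface_hub dependency and a placeholder model name (substitute the model you actually want to serve):

export HF_TOKEN=YOUR_HF_TOKEN  # replace with your actual Hugging Face token
# download the model weights into the local Hugging Face cache
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct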

After the model has been downloaded, we can start Ray on node 1 (the head node):
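A sketch of starting the Ray head process, assuming 6379 (Ray's default) as the head port and FIRST_NODE_IP as this node's IP on the cluster's private network:

source .venv/bin/activate
# start the Ray head process on node 1
ray start --head --node-ip-address=FIRST_NODE_IP --port=6379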

Then we also start Ray on each worker node. Remember to replace FIRST_NODE_IP with the actual IP of the first node:
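A sketch of joining the cluster from a worker node, assuming the same environment setup from above has been repeated on every worker:

source .venv/bin/activate
# connect this node to the Ray head running on the first node
ray start --address=FIRST_NODE_IP:6379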

Finally, on the first node we can start serving with vLLM. Set pipeline-parallel-size to the number of nodes you have available (including the head node) and tensor-parallel-size to the number of GPUs per node:
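As an illustration, a sketch for a 2-node cluster with 8 GPUs per node; the model name is a placeholder, substitute the one you downloaded:

# serve the model across 2 nodes x 8 GPUs, distributing work over Ray
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray

Once the server is up, it exposes an OpenAI-compatible API on port 8000 by default.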
