Quickstart: Deploy with vLLM

In this tutorial, we will deploy a vLLM endpoint in a few easy steps. vLLM has become one of the leading libraries for LLM serving and inference, supporting many architectures and the models that use them.

Model Weights

vLLM fetches model weights directly from Hugging Face.

In this tutorial, we load the deepseek-ai/deepseek-llm-7b-chat model from Hugging Face.


You will also need a User Access Token in order to fetch the weights. You can obtain one in your Hugging Face account by clicking the Profile icon (top right corner) and selecting Access Tokens.

For deploying the vLLM endpoint, a token with READ permission is sufficient.
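If you want to sanity-check the token before deploying, you can query the Hugging Face whoami endpoint. This is a quick sketch; the token value below is a placeholder for your own:

    # Placeholder token; substitute your own READ token
    export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
    # Returns your account details if the token is valid
    curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2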


Create the Deployment

In this tutorial, we will deploy deepseek-ai/deepseek-llm-7b-chat on a General Compute (24 GB VRAM) GPU type. For larger models, you may need to choose one of the other GPU types we offer.

  1. Log in to the Verda cloud console and go to Containers -> New deployment. Name your deployment and select the Compute Type.

  2. We will be using the official vLLM Docker image. Set Container Image to docker.io/vllm/vllm-openai.

  3. Toggle on the Public location for your image.

  4. Select the Tag to deploy.

  5. Set the Exposed HTTP port to 8000.

  6. Set the Healthcheck port to 8000.

  7. Set the Healthcheck path to /health.

  8. Toggle Start Command on.

  9. Add the following parameters to CMD (mirrored in the docker run sketch after this list): --model deepseek-ai/deepseek-llm-7b-chat --gpu-memory-utilization 0.9 --model-loader-extra-config '{"enable_multithread_load": true}'

  10. Add your Hugging Face User Access Token to the Environment Variables as HF_TOKEN.

  11. Deploy the container.

(You can leave the Scaling options at their default values.)
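For reference, the container settings above correspond roughly to the following docker run invocation, should you want to test the image locally first. This is a sketch only: it assumes a local NVIDIA GPU with the NVIDIA Container Toolkit installed, and the token value is a placeholder.

    docker run --gpus all \
        -e HF_TOKEN=hf_xxxxxxxxxxxxxxxx \
        -p 8000:8000 \
        docker.io/vllm/vllm-openai \
        --model deepseek-ai/deepseek-llm-7b-chat \
        --gpu-memory-utilization 0.9 \
        --model-loader-extra-config '{"enable_multithread_load": true}'

The arguments after the image name are passed to vLLM's OpenAI-compatible server, exactly as they are in the CMD field above.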

That's it! You should now have a running deployment.


Connect to the Endpoint

Before you can connect to the endpoint, you will need to generate an authentication token by going to Keys -> Inference API Keys and clicking Create.

The base endpoint URL for your deployment is in the Containers API section in the top left of the screen.

Test Request

Below is an example cURL command for running your test request:
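The endpoint URL and API key below are placeholders; substitute the base endpoint URL and Inference API Key from the previous section.

    curl https://<your-base-endpoint-url>/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer <your-inference-api-key>" \
        -d '{
              "model": "deepseek-ai/deepseek-llm-7b-chat",
              "messages": [
                {"role": "user", "content": "What is vLLM?"}
              ]
            }'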


Note the subpath /v1/chat/completions appended to the base endpoint URL.

Example Response

You should see a response that looks like this:
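This example is illustrative only: the IDs, timestamps, token counts, and generated text will differ on your deployment. The shape follows the OpenAI-compatible chat completion format that vLLM returns.

    {
      "id": "chatcmpl-123",
      "object": "chat.completion",
      "created": 1700000000,
      "model": "deepseek-ai/deepseek-llm-7b-chat",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "vLLM is an open-source library for fast LLM inference and serving..."
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 11,
        "completion_tokens": 26,
        "total_tokens": 37
      }
    }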
