In-Depth: Deploy with TGI
In this tutorial, we will deploy a Text Generation Inference (TGI) endpoint hosting the deepseek-ai/deepseek-llm-7b-chat large language model. TGI is one of the leading libraries for LLM serving and inference, supporting many model architectures.
You can find more information about the model itself on the Hugging Face model hub.
Prerequisites
For this example you need a Python environment on your local machine, a Hugging Face account (to create a Hugging Face token, which is used to fetch the model weights), and a Verda cloud account to create the deployment.
Model Weights
TGI deployment fetches the model weights from Hugging Face.
In this tutorial we are loading deepseek-ai/deepseek-llm-7b-chat model.
Some models on Hugging Face require the user to accept their usage policy, so please verify this for any model you are deploying. If you have not agreed to the policy previously, you will see a dialog similar to the following on the model page on Hugging Face:

You will also need a User Access Token to fetch the weights. You can obtain one in your Hugging Face account by clicking the Profile icon (top right corner) and selecting Access Tokens.
For deploying the TGI endpoint, READ permission is sufficient.

Please store the obtained token safely. You will need it for the next steps!
Create the deployment
In this example, we will deploy deepseek-ai/deepseek-llm-7b-chat on a General Compute (24 GB VRAM) GPU type. For larger models, you may need to choose one of the other GPU types we offer.
Log in to the Verda cloud console
Create a new project or use an existing one, then open the project
On the left you'll see a navigation menu. Go to Containers -> New deployment. Name your deployment and select the Compute Type.
We will be using the official TGI Docker image. Set Container Image to
ghcr.io/huggingface/text-generation-inference:3.0.2
You can select another version from the list if you prefer, or leave the version out of the URL and select the one you wish to use. For this example we use 3.0.2.
Toggle on the Public location for your image. You can use Private if you have a private registry paired with credentials. For this example we use the public registry.
Make sure your preferred tag is selected
Set the Exposed HTTP port to 80
Set the Healthcheck port to 80
Set the Healthcheck path to /health
Toggle Start Command on
Add the following parameters to CMD:
--model-id deepseek-ai/deepseek-llm-7b-chat
Add your Hugging Face User Access Token to the Environment Variables as HF_TOKEN. Note that in some examples you might see the HUGGING_FACE_HUB_TOKEN environment variable used. HF_TOKEN is the new name for this variable; the old name HUGGING_FACE_HUB_TOKEN is still supported, but going forward we recommend using the new name.
Deploy the container
You can leave the Scaling options at their default values. However, if you wish to enable LLM batching, set the Concurrent requests per replica option to a value greater than 1; this is the number of concurrent requests each replica accepts.
That's it! You have now created a deployment. You can check its logs from the Logs tab. When the deployment starts, it downloads the model weights from Hugging Face and starts the TGI server. This will take a few minutes to complete.
For production use, we recommend authenticating/using private registries to avoid potential rate limits imposed by public container registries.
Accessing the deployment
Before you can connect to the endpoint, you need to generate an authentication token: go to Keys -> Inference API Keys and click Create.

The base endpoint URL for your deployment is shown in the Containers API section in the top left of the screen. It will be of the form: https://containers.datacrunch.io/<NAME-OF-YOUR-DEPLOYMENT>/
Test Deployment
Once the deployment has been created and is ready to accept requests, you can test that it responds correctly by sending a List Models request to the endpoint.
TGI can be deployed as a server that implements the OpenAI API protocol, which allows it to be used as a drop-in replacement for applications built against the OpenAI API. More information about TGI in general and its available endpoints can be found in the official TGI documentation.
Below is an example cURL command for running your test deployment:
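A minimal sketch of such a request, assuming the OpenAI-compatible /v1/models route; the deployment name and API key are placeholders you substitute with your own values:

```shell
# Replace the placeholders with your deployment name and Inference API key.
DEPLOYMENT_URL="https://containers.datacrunch.io/<NAME-OF-YOUR-DEPLOYMENT>"
API_KEY="<YOUR_INFERENCE_API_KEY>"

# List the models served by the endpoint.
curl "${DEPLOYMENT_URL}/v1/models" \
  -H "Authorization: Bearer ${API_KEY}"
```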
This should return a response showing that the deepseek-ai/deepseek-llm-7b-chat model is available for use.
Sending inference requests
As the List Models request shows deepseek-ai/deepseek-llm-7b-chat, we are ready to send inference requests to the model.
Generate API
The Generate API /generate offers a quick way to get a completion for a given prompt.
Synchronous request
Below is a Python script that calls the completions endpoint /generate with a prompt and returns the completion. Save it to a file named test_request.py and run it with python test_request.py. Remember to replace <YOUR_CONTAINERS_API_URL> and <YOUR_INFERENCE_API_KEY> with the values from your deployment.
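One possible version of that script is sketched below, using only the Python standard library. TGI's /generate endpoint accepts an inputs string plus a parameters object; the prompt and max_new_tokens value here are illustrative choices, not required settings.

```python
import json
import urllib.request

# Replace these placeholders with the values from your deployment.
BASE_URL = "<YOUR_CONTAINERS_API_URL>"
API_KEY = "<YOUR_INFERENCE_API_KEY>"

def build_payload(prompt, max_new_tokens=100):
    """Build the JSON body expected by TGI's /generate endpoint."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}

def generate(prompt):
    """Send a synchronous completion request and return the generated text."""
    req = urllib.request.Request(
        BASE_URL.rstrip("/") + "/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

# Runs only after you have replaced the placeholders above.
if __name__ == "__main__" and "<" not in BASE_URL:
    print(generate("What is deep learning?"))
```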
Response
This returns a synchronous response with the completion of the prompt:
Streaming request
Same example as above, but streaming out the response using Generate API stream endpoint /generate_stream. Save it to a file named test_request.py and run it with python test_request.py.
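A sketch of the streaming variant follows. It assumes /generate_stream emits server-sent events, one `data: {...}` line per token, each carrying a token object with the generated text; the prompt is again illustrative.

```python
import json
import urllib.request

# Replace these placeholders with the values from your deployment.
BASE_URL = "<YOUR_CONTAINERS_API_URL>"
API_KEY = "<YOUR_INFERENCE_API_KEY>"

def parse_sse_line(raw_line):
    """Extract the token text from one 'data: {...}' server-sent-event line."""
    line = raw_line.strip()
    if not line.startswith("data:"):
        return None
    event = json.loads(line[len("data:"):].strip())
    return event["token"]["text"]

def generate_stream(prompt, max_new_tokens=100):
    """Stream generated tokens to stdout as they arrive from /generate_stream."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    req = urllib.request.Request(
        BASE_URL.rstrip("/") + "/generate_stream",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            token = parse_sse_line(raw.decode("utf-8"))
            if token is not None:
                print(token, end="", flush=True)

# Runs only after you have replaced the placeholders above.
if __name__ == "__main__" and "<" not in BASE_URL:
    generate_stream("What is deep learning?")
    print()
```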
Response
This returns a streaming response with the completion of the prompt:
Chat Completions API
The chat completions API /v1/chat/completions is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. Notice that the prompt format is different from the completions API.
Synchronous request
Below is a Python script that calls the chat completions endpoint /v1/chat/completions with a prompt and returns the completion. Save it to a file named test_request.py and run it with python test_request.py. Remember to replace <YOUR_CONTAINERS_API_URL> and <YOUR_INFERENCE_API_KEY> with the values from your deployment.
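One possible version of that script, a minimal standard-library sketch of the OpenAI-style request body (a messages list of role/content pairs); the model name, prompt, and max_tokens value are illustrative:

```python
import json
import urllib.request

# Replace these placeholders with the values from your deployment.
BASE_URL = "<YOUR_CONTAINERS_API_URL>"
API_KEY = "<YOUR_INFERENCE_API_KEY>"

def build_chat_payload(messages, max_tokens=100):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "deepseek-ai/deepseek-llm-7b-chat",
        "messages": messages,
        "max_tokens": max_tokens,
    }

def chat(messages):
    """Send a synchronous chat completion request and return the reply text."""
    req = urllib.request.Request(
        BASE_URL.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(messages)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Runs only after you have replaced the placeholders above.
if __name__ == "__main__" and "<" not in BASE_URL:
    history = [{"role": "user", "content": "What is deep learning?"}]
    print(chat(history))
```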
Response
This returns a synchronous response with the completion of the prompt.
Streaming request
Same example as above, but streaming out the response. Save it to a file named test_request.py and run it with python test_request.py.
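A sketch of the streaming variant, assuming the OpenAI-style convention: with "stream": true the server sends `data: {...}` chunks carrying content deltas and a final `data: [DONE]` sentinel.

```python
import json
import urllib.request

# Replace these placeholders with the values from your deployment.
BASE_URL = "<YOUR_CONTAINERS_API_URL>"
API_KEY = "<YOUR_INFERENCE_API_KEY>"

def parse_chat_chunk(raw_line):
    """Extract the content delta from one streamed 'data: {...}' line, if any."""
    line = raw_line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":  # OpenAI-style end-of-stream sentinel
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

def chat_stream(messages, max_tokens=100):
    """Stream a chat completion, printing content deltas as they arrive."""
    payload = {
        "model": "deepseek-ai/deepseek-llm-7b-chat",
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
    }
    req = urllib.request.Request(
        BASE_URL.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            delta = parse_chat_chunk(raw.decode("utf-8"))
            if delta:
                print(delta, end="", flush=True)

# Runs only after you have replaced the placeholders above.
if __name__ == "__main__" and "<" not in BASE_URL:
    chat_stream([{"role": "user", "content": "What is deep learning?"}])
    print()
```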
Response
This returns a streaming response with the completion of the prompt.
Conclusion
This concludes our tutorial on calling a TGI endpoint with the deepseek-ai/deepseek-llm-7b-chat model. You can now use the TGI endpoint to generate completions for your prompts.
Also check out the other standard TGI endpoints, such as /health, /info, or /metrics, to monitor the health of the deployment.