Production stack¶
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the vLLM production stack. Born out of a Berkeley-UChicago collaboration, vLLM production stack is an officially released, production-optimized codebase under the vLLM project, designed for LLM deployment with:
- Upstream vLLM compatibility – It wraps around upstream vLLM without modifying its code.
- Ease of use – Simplified deployment via Helm charts and observability through Grafana dashboards.
- High performance – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with LMCache, among others.
If you are new to Kubernetes, don't worry: in the vLLM production stack repo, we provide a step-by-step guide and a short video to set up everything and get started in 4 minutes!
Pre-requisite¶
Ensure that you have a running Kubernetes environment with GPUs (you can follow this tutorial to install a Kubernetes environment on a bare-metal GPU machine).
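To double-check that the cluster actually exposes GPUs, you can inspect the node resources; this assumes the NVIDIA device plugin is installed so GPUs appear as the nvidia.com/gpu resource (drop sudo if your user already has cluster access):
sudo kubectl describe nodes | grep nvidia.com/gpu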
Deployment using vLLM production stack¶
The standard vLLM production stack is installed using a Helm chart. You can run this bash script to install Helm on your GPU server.
To install the vLLM production stack, run the following commands on the machine where you installed Helm:
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
This will instantiate a vLLM-production-stack-based deployment named vllm that runs a small LLM (facebook/opt-125m).
Validate Installation¶
Monitor the deployment status using:
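A typical check, assuming kubectl is configured for this cluster (drop sudo if your user already has access):
sudo kubectl get pods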
You will see the pods for the vllm deployment transition to the Running state:
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-859d8fb668-2x2b7 1/1 Running 0 2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs 1/1 Running 0 2m38s
Note
It may take some time for the containers to download the Docker images and LLM weights.
Send a Query to the Stack¶
Forward the vllm-router-service port to the host machine:
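One way to do this, assuming the router service listens on port 80 inside the cluster (verify with sudo kubectl get svc if your chart values differ):
sudo kubectl port-forward svc/vllm-router-service 30080:80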
Then you can send a query to the OpenAI-compatible API to check the available models:
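For example, assuming the router serves the model list at /models (the same path style as the /completions endpoint used below):
curl http://localhost:30080/models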
To send an actual chat request, you can issue a curl request to the OpenAI-compatible /completions endpoint:
curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
Uninstall¶
To remove the deployment, run:
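Since the Helm release above was installed under the name vllm, this is typically:
sudo helm uninstall vllm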
(Advanced) Configuring vLLM production stack¶
The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
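A minimal sketch of what that file contains, based on the fields described below; the top-level servingEngineSpec key and the literal values here are illustrative, so check the file in the repo for the exact contents:
servingEngineSpec:
  modelSpec:
  - name: "opt125m"                  # nickname for the model
    repository: "vllm/vllm-openai"   # Docker repository of vLLM
    tag: "latest"                    # Docker image tag
    modelURL: "facebook/opt-125m"    # LLM model to serve
    replicaCount: 1                  # number of replicas
    requestCPU: 6                    # CPU request for the pod
    requestMemory: "16Gi"            # memory request for the pod
    requestGPU: 1                    # number of GPUs required
    pvcStorage: "10Gi"               # persistent storage for model weights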
In this YAML configuration:
- modelSpec includes:
    - name: A nickname that you prefer to call the model.
    - repository: Docker repository of vLLM.
    - tag: Docker image tag.
    - modelURL: The LLM model that you want to use.
- replicaCount: Number of replicas.
- requestCPU and requestMemory: Specify the CPU and memory resource requests for the pod.
- requestGPU: Specifies the number of GPUs required.
- pvcStorage: Allocates persistent storage for the model.
Note
If you intend to set up two pods, please refer to this YAML file.
Tip
vLLM production stack offers many more features (e.g. CPU offloading and a wide range of routing algorithms). Please check out these examples and tutorials and our repo for more details!