vLLM Deployment Guide
Step-by-step guide for deploying vLLM to the f3iai Kubernetes namespace using Helm or raw manifests, including air-gapped model loading.
Production Air-Gapped Deployment
This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.
Helm Chart Deployment
The vLLM project publishes an official Helm chart. For air-gapped environments, the chart should be mirrored to your internal Harbor registry or vendored into the GitOps repository.
Add the Helm Repository
On a workstation with internet access (or from within a connected build environment):
```shell
helm repo add vllm https://vllm-project.github.io/helm-charts/
helm repo update
helm pull vllm/vllm --version 0.7.3 --destination ./charts/
```
Push the chart to Harbor for air-gapped use:
```shell
helm push ./charts/vllm-0.7.3.tgz oci://harbor.vitro.lan/ffp/charts
```
Values Configuration
Create a values.yaml targeting Qwen2.5-7B-Instruct with tool calling enabled:
```yaml
replicaCount: 1
image:
  repository: harbor.vitro.lan/ffp/vllm-openai
  tag: "v0.7.3"
  pullPolicy: IfNotPresent
containerPort: 8000
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"
env:
  - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
    value: "1"
extraArgs:
  - "--model"
  - "/models/Qwen2.5-7B-Instruct"
  - "--served-model-name"
  - "Qwen/Qwen2.5-7B-Instruct"
  - "--max-model-len"
  - "32768"
  - "--gpu-memory-utilization"
  - "0.90"
  - "--enable-auto-tool-choice"
  - "--tool-call-parser"
  - "hermes"
  - "--chat-template"
  - "/models/Qwen2.5-7B-Instruct/tool_chat_template.jinja"
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: vllm-models
volumeMounts:
  - name: model-storage
    mountPath: /models
    readOnly: true
nodeSelector:
  nvidia.com/gpu.present: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3
service:
  type: ClusterIP
  port: 8000
```
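With --enable-auto-tool-choice and the hermes parser configured, the server accepts tool definitions in the OpenAI-compatible "tools" format. As a sketch of what a client would send, the payload below defines one hypothetical tool (the `get_weather` function and its schema are illustrative, not part of this deployment):

```python
import json

def build_tool_call_request(user_message: str) -> dict:
    """Build an OpenAI-style chat completion payload with one tool defined.

    The get_weather tool is a made-up example; the model name must match
    the --served-model-name flag from the vLLM configuration above.
    """
    return {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Look up current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        # Let the model decide whether to emit a tool call or answer directly
        "tool_choice": "auto",
    }

payload = build_tool_call_request("What is the weather in Oslo?")
print(json.dumps(payload, indent=2))
```

A POST of this body to the service's /v1/chat/completions endpoint should return either a plain assistant message or a tool_calls entry the client can dispatch.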
Install the Chart
```shell
helm install vllm oci://harbor.vitro.lan/ffp/charts/vllm \
  --namespace f3iai \
  --values values.yaml \
  --wait --timeout 10m
```
The `--timeout 10m` is important because model loading can take several minutes, especially for larger models loaded from a PVC.
Air-Gapped Model Loading
In production air-gapped environments, models cannot be pulled from HuggingFace at runtime. There are two approaches:
Option 1: PersistentVolumeClaim (Recommended)
Create a PVC and populate it with model weights using a one-time Kubernetes Job:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
  namespace: f3iai
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 50Gi
```
Populate the PVC from a build machine that has internet access, then copy the model directory into the NFS-backed volume. Alternatively, use a Job that pulls from Harbor’s OCI artifact storage:
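The Job-based approach could be sketched roughly as below. All names here are illustrative assumptions: it presumes the model was pushed to Harbor as an OCI artifact, that an image containing the `oras` CLI is available in the registry, and that the model volume can be mounted read-write for this one-time load (NFS-backed volumes typically allow this even when the serving pods mount it read-only):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: load-qwen2.5-7b   # hypothetical one-time loader Job
  namespace: f3iai
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: loader
          # Any image with the oras CLI works; this ref is an assumption
          image: harbor.vitro.lan/ffp/oras:latest
          command: ["sh", "-c"]
          args:
            # Artifact reference is illustrative; adjust to your Harbor layout
            - "oras pull harbor.vitro.lan/ffp/models/qwen2.5-7b-instruct:v1 -o /models"
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: vllm-models
```

Delete the Job once the weights are in place; the serving Deployment only ever mounts the volume read-only.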
```shell
# On a connected machine, download and tar the model
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./Qwen2.5-7B-Instruct
tar czf qwen2.5-7b-instruct.tar.gz Qwen2.5-7B-Instruct/

# Copy to NFS or upload as OCI artifact
```
Option 2: Bake into Container Image
Build a custom vLLM image with the model weights embedded:
```dockerfile
FROM harbor.vitro.lan/ffp/vllm-openai:v0.7.3
COPY ./Qwen2.5-7B-Instruct /models/Qwen2.5-7B-Instruct
```
This produces a large image (~15-20 GB for a 7B FP16 model) but is fully self-contained. Push to Harbor:
```shell
docker build -t harbor.vitro.lan/ffp/vllm-qwen2.5-7b:v1.0.0 .
docker push harbor.vitro.lan/ffp/vllm-qwen2.5-7b:v1.0.0
```
Raw Manifest Deployment (Without Helm)
If Helm is not available or not desired, deploy using standard Kubernetes manifests.
Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: f3iai
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: harbor.vitro.lan/ffp/vllm-openai:v0.7.3
          ports:
            - containerPort: 8000
              name: http
          args:
            - "--model"
            - "/models/Qwen2.5-7B-Instruct"
            - "--served-model-name"
            - "Qwen/Qwen2.5-7B-Instruct"
            - "--max-model-len"
            - "32768"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser"
            - "hermes"
          env:
            - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
              value: "1"
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: vllm-models
```
Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: f3iai
  labels:
    app: vllm
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: vllm
```
Verification
After deployment, verify the vLLM instance is running and serving the model:
```shell
# Check pod status
kubectl get pods -n f3iai -l app=vllm

# Check health endpoint
kubectl exec -n f3iai deploy/compass-api -- \
  curl -s http://vllm.f3iai.svc.cluster.local:8000/health

# Verify model is loaded
kubectl exec -n f3iai deploy/compass-api -- \
  curl -s http://vllm.f3iai.svc.cluster.local:8000/v1/models | jq .

# Test a completion
kubectl exec -n f3iai deploy/compass-api -- \
  curl -s http://vllm.f3iai.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'
```
The /v1/models response should list your served model name. The /health endpoint returns HTTP 200 when the model is loaded and ready to serve requests.
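The /v1/models payload follows the OpenAI-compatible list format, so automated checks (e.g. a smoke test in CI) can confirm the served model name before wiring up clients. A minimal sketch, with a sample payload mirroring the shape vLLM returns:

```python
import json

def served_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]

# Example response shape from the verification curl above (trimmed to
# the fields that matter for the check)
sample = json.loads(
    '{"object": "list", "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}]}'
)

assert "Qwen/Qwen2.5-7B-Instruct" in served_model_ids(sample)
print(served_model_ids(sample))  # ['Qwen/Qwen2.5-7B-Instruct']
```

The id to check against is whatever was passed as --served-model-name, not the on-disk path under /models.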
GPU Scheduling Notes
vLLM requires NVIDIA GPU resources. Ensure your cluster has:
- NVIDIA device plugin installed (`nvidia-device-plugin` DaemonSet)
- GPU nodes labeled with `nvidia.com/gpu.present=true`
- GPU taints applied so non-GPU workloads do not schedule on GPU nodes: `nvidia.com/gpu:NoSchedule`
- Node affinity or nodeSelector configured in the vLLM deployment to target GPU nodes specifically
If using multiple GPU types (e.g., A100 and T4), use additional labels like `nvidia.com/gpu.product` to target specific hardware.
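For example, pinning the vLLM pods to one GPU model might look like the fragment below. The product string is an illustrative assumption: use whatever value GPU Feature Discovery actually reports on your nodes (check with `kubectl get nodes --show-labels`):

```yaml
nodeSelector:
  nvidia.com/gpu.present: "true"
  # Example value only; must match the label on your GPU nodes exactly
  nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
```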