vLLM Deployment Guide

Step-by-step guide for deploying vLLM to the f3iai Kubernetes namespace using Helm or raw manifests, including air-gapped model loading.

Production Air-Gapped Deployment

This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.

Helm Chart Deployment

The vLLM project publishes an official Helm chart. For air-gapped environments, the chart should be mirrored to your internal Harbor registry or vendored into the GitOps repository.

Add the Helm Repository

On a workstation with internet access (or from within a connected build environment):

helm repo add vllm https://vllm-project.github.io/helm-charts/
helm repo update
helm pull vllm/vllm --version 0.7.3 --destination ./charts/

Push the chart to Harbor for air-gapped use:

helm push ./charts/vllm-0.7.3.tgz oci://harbor.vitro.lan/ffp/charts

Values Configuration

Create a values.yaml targeting Qwen2.5-7B-Instruct with tool calling enabled:

replicaCount: 1

image:
  repository: harbor.vitro.lan/ffp/vllm-openai
  tag: "v0.7.3"
  pullPolicy: IfNotPresent

containerPort: 8000

resources:
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: "1"

env:
  - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
    value: "1"

extraArgs:
  - "--model"
  - "/models/Qwen2.5-7B-Instruct"
  - "--served-model-name"
  - "Qwen/Qwen2.5-7B-Instruct"
  - "--max-model-len"
  - "32768"
  - "--gpu-memory-utilization"
  - "0.90"
  - "--enable-auto-tool-choice"
  - "--tool-call-parser"
  - "hermes"
  - "--chat-template"
  - "/models/Qwen2.5-7B-Instruct/tool_chat_template.jinja"

volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: vllm-models

volumeMounts:
  - name: model-storage
    mountPath: /models
    readOnly: true

nodeSelector:
  nvidia.com/gpu.present: "true"

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
  failureThreshold: 5

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3

service:
  type: ClusterIP
  port: 8000

Install the Chart

helm install vllm oci://harbor.vitro.lan/ffp/charts/vllm \
  --namespace f3iai \
  --values values.yaml \
  --wait --timeout 10m

The --timeout 10m is important because model loading can take several minutes, especially for larger models loaded from PVC.

Air-Gapped Model Loading

In production air-gapped environments, models cannot be pulled from HuggingFace at runtime. There are two approaches:

Option 1: Pre-Populate a PVC

Create a PVC and populate it with model weights using a one-time Kubernetes Job:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
  namespace: f3iai
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 50Gi

Populate the PVC from a build machine that has internet access, then copy the model directory into the NFS-backed volume. Alternatively, use a Job that pulls the weights from Harbor's OCI artifact storage. Either way, first stage the weights on a connected machine:

# On a connected machine, download and tar the model
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./Qwen2.5-7B-Instruct
tar czf qwen2.5-7b-instruct.tar.gz Qwen2.5-7B-Instruct/

# Copy to NFS or upload as OCI artifact
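The one-time population Job mentioned above might look like the following sketch. The staging path, the busybox mirror image, and the idea of mounting the model PVC writable are all assumptions, not part of the original setup; note that the PVC spec above declares ReadOnlyMany, so for this Job to work the claim must be created writable (e.g. ReadWriteMany on NFS, with the Deployment still mounting it readOnly), or the tarball must be extracted directly onto the NFS export out-of-band.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vllm-model-load
  namespace: f3iai
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: extract
          # Mirrored utility image -- assumed to exist in your Harbor project
          image: harbor.vitro.lan/ffp/busybox:1.36
          command: ["sh", "-c"]
          # Staging path is hypothetical; adjust to wherever the tarball was copied
          args:
            - tar xzf /staging/qwen2.5-7b-instruct.tar.gz -C /models && ls /models
          volumeMounts:
            - name: staging
              mountPath: /staging
              readOnly: true
            - name: models
              mountPath: /models
      volumes:
        - name: staging
          persistentVolumeClaim:
            claimName: nfs-staging   # hypothetical staging claim holding the tarball
        - name: models
          persistentVolumeClaim:
            claimName: vllm-models
```

Once the Job completes, the vLLM Deployment can mount vllm-models read-only as shown in the values above.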

Option 2: Bake into Container Image

Build a custom vLLM image with the model weights embedded:

FROM harbor.vitro.lan/ffp/vllm-openai:v0.7.3
COPY ./Qwen2.5-7B-Instruct /models/Qwen2.5-7B-Instruct

This produces a large image (~15-20 GB for a 7B FP16 model) but is fully self-contained. Push to Harbor:

docker build -t harbor.vitro.lan/ffp/vllm-qwen2.5-7b:v1.0.0 .
docker push harbor.vitro.lan/ffp/vllm-qwen2.5-7b:v1.0.0
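If you take the baked-image route, the Helm values change accordingly: the image points at the model-bearing build and the PVC plumbing goes away. A sketch of the overrides (everything else in values.yaml stays the same):

```yaml
image:
  repository: harbor.vitro.lan/ffp/vllm-qwen2.5-7b
  tag: "v1.0.0"

# Weights ship inside the image at /models/Qwen2.5-7B-Instruct,
# so no model PVC is needed
volumes: []
volumeMounts: []
```

The extraArgs are unchanged, since the COPY destination in the Dockerfile matches the --model path already in use.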

Raw Manifest Deployment (Without Helm)

If Helm is not available or not desired, deploy using standard Kubernetes manifests.

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: f3iai
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: harbor.vitro.lan/ffp/vllm-openai:v0.7.3
          ports:
            - containerPort: 8000
              name: http
          args:
            - "--model"
            - "/models/Qwen2.5-7B-Instruct"
            - "--served-model-name"
            - "Qwen/Qwen2.5-7B-Instruct"
            - "--max-model-len"
            - "32768"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser"
            - "hermes"
          env:
            - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
              value: "1"
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: vllm-models

Service

apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: f3iai
  labels:
    app: vllm
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: vllm

Verification

After deployment, verify the vLLM instance is running and serving the model:

# Check pod status
kubectl get pods -n f3iai -l app=vllm

# Check health endpoint
kubectl exec -n f3iai deploy/compass-api -- \
  curl -s http://vllm.f3iai.svc.cluster.local:8000/health

# Verify model is loaded
kubectl exec -n f3iai deploy/compass-api -- \
  curl -s http://vllm.f3iai.svc.cluster.local:8000/v1/models | jq .

# Test a completion
kubectl exec -n f3iai deploy/compass-api -- \
  curl -s http://vllm.f3iai.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'

The /v1/models response should list your served model name. The /health endpoint returns HTTP 200 when the model is loaded and ready to serve requests.
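If you script this verification (for example in a CI smoke test), the /v1/models check is easy to automate. A minimal sketch that extracts model IDs from the response; the sample payload mirrors the OpenAI-compatible model-list shape that vLLM returns:

```python
import json

def served_models(models_response: str) -> list[str]:
    """Extract model IDs from a /v1/models JSON response."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]

# Example response shape from vLLM's OpenAI-compatible endpoint
sample = '{"object": "list", "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}]}'

assert "Qwen/Qwen2.5-7B-Instruct" in served_models(sample)
```

In practice you would feed this the output of the curl command above and fail the pipeline if the served model name is missing.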

GPU Scheduling Notes

vLLM requires NVIDIA GPU resources. Ensure your cluster has:

  1. NVIDIA device plugin installed (nvidia-device-plugin DaemonSet)
  2. GPU nodes labeled with nvidia.com/gpu.present=true
  3. GPU taints applied so non-GPU workloads do not schedule on GPU nodes: nvidia.com/gpu:NoSchedule
  4. Node affinity or nodeSelector configured in the vLLM deployment to target GPU nodes specifically

If using multiple GPU types (e.g., A100 and T4), use additional labels like nvidia.com/gpu.product to target specific hardware.
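Targeting a specific GPU type then looks like the following; the product label value is illustrative, since the exact string depends on what GPU Feature Discovery reports on your nodes:

```yaml
nodeSelector:
  nvidia.com/gpu.present: "true"
  nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"  # illustrative; check your node labels
```

You can inspect what your nodes actually report with kubectl get nodes -L nvidia.com/gpu.product.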