Kubernetes Integration
How the F3Iai platform services connect to vLLM, including service configuration, Compass API integration, GPU scheduling, network policies, and monitoring.
Production Air-Gapped Deployment
This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.
Service Definition
The vLLM Kubernetes Service provides a stable DNS name for in-cluster consumers. All F3Iai services reference vLLM through this service rather than pod IPs.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: f3iai
  labels:
    app: vllm
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: vllm
```
This creates the internal DNS name vllm.f3iai.svc.cluster.local, resolving to the vLLM pod(s) on port 8000. ClusterIP is the correct service type: vLLM should never be exposed outside the cluster directly.
Compass API Configuration
The Compass API is the sole consumer of the vLLM endpoint. No other service in F3Iai calls vLLM directly. MCP servers receive tool invocations from the Compass API, not from vLLM.
Configure the Compass API deployment to point at the vLLM service:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: compass-api
  namespace: f3iai
spec:
  template:
    spec:
      containers:
        - name: compass-api
          image: harbor.vitro.lan/ffp/compass-api:v1.2.0
          env:
            - name: LLM_ENDPOINT
              value: "http://vllm.f3iai.svc.cluster.local:8000/v1"
            - name: LLM_MODEL
              value: "Qwen/Qwen2.5-7B-Instruct"
            - name: LLM_MAX_TOKENS
              value: "4096"
            - name: LLM_TEMPERATURE
              value: "0.1"
```
The LLM_ENDPOINT must include the /v1 path prefix. The Compass API appends /chat/completions, /models, etc. to this base URL. The LLM_MODEL must match the --served-model-name configured on the vLLM deployment.
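For reference, the matching flags on the vLLM side might look like the fragment below. This is a sketch: the container name, image tag, and model path are assumptions, but `--served-model-name` must equal the `LLM_MODEL` value above.

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest        # image tag is an assumption
    args:
      - "--model"
      - "/models/Qwen2.5-7B-Instruct"     # local path assumed for air-gapped use
      - "--served-model-name"
      - "Qwen/Qwen2.5-7B-Instruct"        # must match LLM_MODEL in compass-api
      - "--port"
      - "8000"
```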
Request Flow
The complete request flow through the platform:

- Client → Compass API (REST request)
- Compass API → vLLM (chat completion via the OpenAI-compatible /v1 API)
- vLLM → Compass API (generated response, possibly containing tool calls)
- Compass API → MCP servers (tool invocations; vLLM never calls MCP servers directly)
GPU Scheduling and Node Affinity
vLLM pods must run on GPU-equipped nodes. Use node affinity to ensure correct scheduling while keeping MCP servers and other CPU-only workloads on standard nodes.
Node Labels
GPU nodes should be labeled:
```bash
kubectl label node gpu-node-01 nvidia.com/gpu.present=true
kubectl label node gpu-node-01 node-role.kubernetes.io/gpu=true
```
Node Affinity for vLLM
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.present
              operator: In
              values:
                - "true"
```
GPU Tolerations
If GPU nodes have taints to prevent non-GPU workloads from scheduling:
```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
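Putting the pieces together, the vLLM pod spec would combine the affinity, the toleration, and a GPU resource limit. The fragment below is a sketch (container name is an assumption; other fields elided):

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                  - "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: vllm
      resources:
        limits:
          nvidia.com/gpu: "1"   # one GPU per replica
```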
Resource Quotas
Apply a ResourceQuota to the f3iai namespace to prevent over-commitment of GPU resources:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: f3iai
spec:
  hard:
    requests.nvidia.com/gpu: "2"
    limits.nvidia.com/gpu: "2"
```
A LimitRange can enforce per-pod GPU limits:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: f3iai
spec:
  limits:
    - type: Container
      max:
        nvidia.com/gpu: "2"
      default:
        nvidia.com/gpu: "0"
```
Health Probes
vLLM exposes a /health endpoint that returns HTTP 200 when the model is loaded and ready. Model loading can take 1-5 minutes depending on model size and storage speed, so probe timing must account for this.
Liveness Probe
The liveness probe detects if the vLLM process has crashed or hung. Set a generous initial delay to allow model loading:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
  failureThreshold: 5
  timeoutSeconds: 10
```
With these settings, Kubernetes will not kill the pod for at least 120 + (30 * 5) = 270 seconds (4.5 minutes) after startup, giving large models time to load.
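On Kubernetes 1.16 and later, a dedicated startupProbe is an alternative way to accommodate slow model loading: liveness checks are suspended until the startup probe first succeeds, so the liveness probe can then run on a tighter schedule. A sketch, assuming the same /health endpoint:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60   # allows up to 10 minutes for model loading
```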
Readiness Probe
The readiness probe controls when the pod receives traffic. It uses the same /health endpoint but can begin checking sooner:
```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5
```
The pod will not receive traffic from the Service until the readiness probe passes. This prevents the Compass API from sending requests to a vLLM instance that is still loading its model.
Network Policy
Restrict ingress to the vLLM pod to only the Compass API. No other service should be able to reach vLLM directly:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-ingress
  namespace: f3iai
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: compass-api
      ports:
        - protocol: TCP
          port: 8000
```
This policy ensures that only pods with the label app: compass-api in the f3iai namespace can reach vLLM on port 8000. All other ingress traffic is denied.
Monitoring
vLLM exposes Prometheus metrics at the /metrics endpoint. Configure a ServiceMonitor (if using Prometheus Operator) or scrape config to collect these metrics.
ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: f3iai
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
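Without the Prometheus Operator, an equivalent static scrape configuration might look like the following sketch (job name and service-discovery settings are assumptions):

```yaml
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [f3iai]
    relabel_configs:
      # Keep only endpoints backing the vllm Service
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: vllm
        action: keep
```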
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `vllm:num_requests_running` | Currently processing requests | > 50 sustained |
| `vllm:num_requests_waiting` | Queued requests | > 10 sustained (indicates capacity issue) |
| `vllm:avg_generation_throughput_toks_per_s` | Token generation speed | < 20 tok/s (performance degradation) |
| `vllm:gpu_cache_usage_perc` | KV cache utilization | > 95% (context window exhaustion) |
| `vllm:avg_prompt_throughput_toks_per_s` | Prompt processing speed | Baseline varies by model |
| `vllm:num_preemptions_total` | Preempted requests (OOM) | Any sustained increase |
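If the Prometheus Operator is in use, the queue-depth threshold from the table can be encoded as an alerting rule. A sketch (the alert name and `for:` duration are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: f3iai
spec:
  groups:
    - name: vllm
      rules:
        - alert: VllmQueueBacklog
          expr: vllm:num_requests_waiting > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM request queue is backing up (capacity issue)"
```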
Grafana Dashboard
Import the vLLM community Grafana dashboard (ID 21150) or create a custom dashboard tracking:
- Request rate and latency (p50, p95, p99)
- Token throughput (prompt and generation)
- GPU cache utilization over time
- Queue depth (waiting requests)
- Error rate by status code
Scaling
Horizontal Pod Autoscaler
For environments with multiple GPUs, scale vLLM horizontally based on request queue depth:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
  namespace: f3iai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600
```
Note that each vLLM replica requires its own GPU. The HPA maxReplicas must not exceed the number of available GPUs in the cluster. Scale-down is intentionally slow (10-minute stabilization) because model loading on startup is expensive.
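Note also that a Pods-type metric is not available to the HPA by default: a metrics adapter must expose it through the custom metrics API. With prometheus-adapter, a rule along these lines could map the vLLM gauge to the `vllm_num_requests_waiting` name used above. This is a sketch; the exact series name and label set depend on your vLLM and adapter versions:

```yaml
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: namespace}
          pod: {resource: pod}
      name:
        as: "vllm_num_requests_waiting"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```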
Vertical Scaling
For single-GPU environments, vertical scaling means increasing the context window or switching to a larger model. Adjust --max-model-len and --gpu-memory-utilization to balance concurrent request capacity against context length. A lower --max-model-len (e.g., 16384 instead of 32768) frees VRAM for more concurrent KV caches, increasing throughput at the cost of maximum conversation length.
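For example, trading context length for concurrency might look like this in the vLLM container args (values taken from the paragraph above; treat them as a starting point, not tuned settings):

```yaml
args:
  - "--max-model-len"
  - "16384"                 # halved from 32768 to free VRAM for more KV caches
  - "--gpu-memory-utilization"
  - "0.90"                  # fraction of VRAM vLLM may claim (assumed value)
```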