Kubernetes Integration

How the F3Iai platform services connect to vLLM, including service configuration, Compass API integration, GPU scheduling, network policies, and monitoring.

Production Air-Gapped Deployment: This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.

Service Definition

The vLLM Kubernetes Service provides a stable DNS name for in-cluster consumers. All F3Iai services reference vLLM through this service rather than pod IPs.

apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: f3iai
  labels:
    app: vllm
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: vllm

This creates the internal DNS name vllm.f3iai.svc.cluster.local, which resolves to the vLLM pod(s) on port 8000. ClusterIP is the correct service type: vLLM should never be exposed outside the cluster directly.

Compass API Configuration

The Compass API is the sole consumer of the vLLM endpoint. No other service in F3Iai calls vLLM directly. MCP servers receive tool invocations from the Compass API, not from vLLM.

Configure the Compass API deployment to point at the vLLM service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: compass-api
  namespace: f3iai
spec:
  template:
    spec:
      containers:
        - name: compass-api
          image: harbor.vitro.lan/ffp/compass-api:v1.2.0
          env:
            - name: LLM_ENDPOINT
              value: "http://vllm.f3iai.svc.cluster.local:8000/v1"
            - name: LLM_MODEL
              value: "Qwen/Qwen2.5-7B-Instruct"
            - name: LLM_MAX_TOKENS
              value: "4096"
            - name: LLM_TEMPERATURE
              value: "0.1"

The LLM_ENDPOINT must include the /v1 path prefix. The Compass API appends /chat/completions, /models, etc. to this base URL. The LLM_MODEL must match the --served-model-name configured on the vLLM deployment.
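To illustrate why the /v1 suffix matters, the sketch below (a hypothetical helper, not Compass API source) shows how a client joins the configured base URL with the OpenAI-style paths it calls:

```python
def endpoint(base: str, path: str) -> str:
    """Join the configured LLM_ENDPOINT base with an OpenAI-style path."""
    return base.rstrip("/") + "/" + path.lstrip("/")

BASE = "http://vllm.f3iai.svc.cluster.local:8000/v1"
print(endpoint(BASE, "chat/completions"))
# http://vllm.f3iai.svc.cluster.local:8000/v1/chat/completions
```

If the /v1 prefix is omitted from LLM_ENDPOINT, the resulting request path (/chat/completions) does not exist on vLLM's OpenAI-compatible server and returns 404.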

Request Flow

The complete request flow through the platform:

sequenceDiagram
    participant User as User (Compass UI)
    participant API as Compass API
    participant vLLM as vLLM
    participant MCP as MCP Servers
    User->>API: Chat message
    API->>vLLM: /v1/chat/completions
    vLLM-->>API: tool_calls in response
    API->>MCP: Dispatch to ffo-mcp, kolla-mcp, +10 others
    MCP-->>API: Tool results
    API->>vLLM: Follow-up with tool results
    vLLM-->>API: Final answer
    API-->>User: Formatted response
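The tool-call loop in this flow can be sketched in Python. This is a simplified illustration, not Compass API source: the LLM call and the MCP dispatch are stubbed, and the stub replies follow the OpenAI chat-completions message shape.

```python
def call_llm(messages):
    # Stub for POST /v1/chat/completions. The first turn returns a tool call;
    # once tool results are present, it returns a final answer.
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "Final answer"}
    return {"role": "assistant", "content": None,
            "tool_calls": [{"id": "t1",
                            "function": {"name": "ffo_lookup", "arguments": "{}"}}]}

def dispatch_to_mcp(tool_call):
    # Stub for dispatching a tool invocation to an MCP server (ffo-mcp, etc.).
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": "result"}

def handle_chat(user_message):
    # Loop until the model stops asking for tools, then return its answer.
    messages = [{"role": "user", "content": user_message}]
    reply = call_llm(messages)
    while reply.get("tool_calls"):
        messages.append(reply)
        for tc in reply["tool_calls"]:
            messages.append(dispatch_to_mcp(tc))
        reply = call_llm(messages)
    return reply["content"]

print(handle_chat("Chat message"))  # Final answer
```

The key point is that vLLM never calls MCP servers itself; it only emits tool_calls, and the Compass API performs the dispatch and the follow-up completion.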

GPU Scheduling and Node Affinity

vLLM pods must run on GPU-equipped nodes. Use node affinity to ensure correct scheduling while keeping MCP servers and other CPU-only workloads on standard nodes.

Node Labels

GPU nodes should be labeled:

kubectl label node gpu-node-01 nvidia.com/gpu.present=true
kubectl label node gpu-node-01 node-role.kubernetes.io/gpu=true

Node Affinity for vLLM

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.present
              operator: In
              values:
                - "true"

GPU Tolerations

If GPU nodes have taints to prevent non-GPU workloads from scheduling:

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Resource Quotas

Apply a ResourceQuota to the f3iai namespace to prevent over-commitment of GPU resources:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: f3iai
spec:
  hard:
    requests.nvidia.com/gpu: "2"
    limits.nvidia.com/gpu: "2"

A LimitRange can enforce per-pod GPU limits:

apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: f3iai
spec:
  limits:
    - type: Container
      max:
        nvidia.com/gpu: "2"
      default:
        nvidia.com/gpu: "0"
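As a back-of-envelope check (illustrative Python, not an admission controller), the namespace-level arithmetic the quota enforces when a new GPU pod is created is simply:

```python
def fits_quota(existing_gpu_requests, new_request, quota=2):
    # ResourceQuota check for requests.nvidia.com/gpu in the namespace:
    # the new pod is admitted only if total requests stay within the quota.
    return sum(existing_gpu_requests) + new_request <= quota

print(fits_quota([1], 1))     # True: a second single-GPU pod fits
print(fits_quota([1, 1], 1))  # False: the quota of 2 is already consumed
```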

Health Probes

vLLM exposes a /health endpoint that returns HTTP 200 when the model is loaded and ready. Model loading can take 1-5 minutes depending on model size and storage speed, so probe timing must account for this.

Liveness Probe

The liveness probe detects if the vLLM process has crashed or hung. Set a generous initial delay to allow model loading:

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
  failureThreshold: 5
  timeoutSeconds: 10

With these settings, Kubernetes will not kill the pod for at least 120 + (30 * 5) = 270 seconds (4.5 minutes) after startup, giving large models time to load.
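The worst-case time before a restart follows directly from the probe parameters. A small helper (illustrative only) makes the arithmetic explicit:

```python
def time_to_kill(initial_delay, period, failure_threshold):
    # Earliest moment the kubelet can restart the container after startup:
    # the initial delay plus one full period per allowed consecutive failure.
    return initial_delay + period * failure_threshold

print(time_to_kill(120, 30, 5))  # 270 seconds (4.5 minutes)
```

If a larger model needs more than 4.5 minutes to load, raise initialDelaySeconds or failureThreshold rather than periodSeconds, so that a genuinely hung process is still detected promptly once running.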

Readiness Probe

The readiness probe controls when the pod receives traffic. It starts checking sooner but uses the same /health endpoint:

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5

The pod will not receive traffic from the Service until the readiness probe passes. This prevents the Compass API from sending requests to a vLLM instance that is still loading its model.

Network Policy

Restrict ingress to the vLLM pod to only the Compass API. No other service should be able to reach vLLM directly:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-ingress
  namespace: f3iai
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: compass-api
      ports:
        - protocol: TCP
          port: 8000

This policy ensures that only pods with the label app: compass-api in the f3iai namespace can reach vLLM on port 8000. All other ingress traffic is denied.

Monitoring

vLLM exposes Prometheus metrics at the /metrics endpoint. Configure a ServiceMonitor (if using Prometheus Operator) or scrape config to collect these metrics.

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: f3iai
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

Key Metrics

Metric                                    | Description                   | Alert Threshold
vllm:num_requests_running                 | Currently processing requests | > 50 sustained
vllm:num_requests_waiting                 | Queued requests               | > 10 sustained (indicates capacity issue)
vllm:avg_generation_throughput_toks_per_s | Token generation speed        | < 20 tok/s (performance degradation)
vllm:gpu_cache_usage_perc                 | KV cache utilization          | > 95% (context window exhaustion)
vllm:avg_prompt_throughput_toks_per_s     | Prompt processing speed       | Baseline varies by model
vllm:num_preemptions_total                | Preempted requests (OOM)      | Any sustained increase

Grafana Dashboard

Import the vLLM community Grafana dashboard (ID 21150) or create a custom dashboard tracking:

  • Request rate and latency (p50, p95, p99)
  • Token throughput (prompt and generation)
  • GPU cache utilization over time
  • Queue depth (waiting requests)
  • Error rate by status code

Scaling

Horizontal Pod Autoscaler

For environments with multiple GPUs, scale vLLM horizontally based on request queue depth:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
  namespace: f3iai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600

Note that each vLLM replica requires its own GPU. The HPA maxReplicas must not exceed the number of available GPUs in the cluster. Scale-down is intentionally slow (10-minute stabilization) because model loading on startup is expensive.
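The AverageValue target behaves roughly like ceiling division of total queue depth by the target. A simplified model of the HPA calculation (ignoring stabilization windows, scale policies, and metric tolerance) looks like this:

```python
import math

def desired_replicas(total_waiting, target_avg=5, min_r=1, max_r=4):
    # Core HPA formula for a Pods metric with an AverageValue target:
    # scale so the per-pod average drops back to the target, clamped
    # to the min/max replica bounds.
    want = math.ceil(total_waiting / target_avg)
    return max(min_r, min(max_r, want))

print(desired_replicas(12))  # 3 replicas for 12 queued requests
print(desired_replicas(40))  # capped at maxReplicas = 4
```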

Vertical Scaling

For single-GPU environments, vertical scaling means increasing the context window or switching to a larger model. Adjust --max-model-len and --gpu-memory-utilization to balance concurrent request capacity against context length. A lower --max-model-len (e.g., 16384 instead of 32768) frees VRAM for more concurrent KV caches, increasing throughput at the cost of maximum conversation length.
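The VRAM trade-off can be estimated with back-of-envelope KV cache arithmetic. The sketch below assumes Qwen2.5-7B-Instruct's published shape (28 layers, 4 KV heads of dimension 128, fp16 cache); verify these numbers against the model's config before relying on them:

```python
def kv_cache_gib(max_model_len, layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    # Worst-case per-sequence KV cache: K and V tensors per layer, per KV
    # head, per token, at the given dtype width.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token * max_model_len / 2**30

print(round(kv_cache_gib(32768), 2))  # 1.75 GiB per full-length sequence
print(round(kv_cache_gib(16384), 2))  # 0.88 GiB, so roughly double the concurrency
```

Halving --max-model-len halves the worst-case per-sequence KV footprint, which is why it roughly doubles how many full-length conversations fit in the VRAM that --gpu-memory-utilization reserves for the cache.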