vLLM Production Inference

Overview of vLLM as the production LLM inference engine for the Federal Frontier Platform in air-gapped Kubernetes environments.

Production Air-Gapped Deployment: This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.

Why vLLM for Federal and Air-Gapped Environments

vLLM is the production inference engine for the Federal Frontier Platform (F3Iai). It was selected over alternatives for several key reasons that align with federal deployment constraints:

  • Helm chart deployable – vLLM ships an official Helm chart that integrates cleanly with ArgoCD-managed GitOps workflows, requiring no external package managers or runtime downloads.
  • No registry dependency at runtime – Unlike Ollama, which pulls models from an external registry on first use, vLLM models can be baked into container images or loaded from Kubernetes PersistentVolumeClaims. This eliminates the need for any outbound network access after deployment.
  • OpenAI-compatible API – vLLM exposes /v1/chat/completions, /v1/completions, and /v1/models endpoints that are drop-in compatible with OpenAI client libraries. Any application targeting the OpenAI API works with vLLM without code changes.
  • Tensor parallelism – For larger models (32B+), vLLM supports splitting model layers across multiple GPUs on the same node or across nodes, enabling deployment of models that exceed single-GPU memory.
  • FIPS-aligned container builds – The vLLM container can be built on a FIPS-validated base image (e.g., Iron Bank UBI or hardened Debian), satisfying compliance requirements for federal systems.
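Because the API is OpenAI-compatible, any HTTP client works without an SDK. A minimal stdlib sketch of building and sending a chat completion request (the in-cluster service hostname and port are illustrative assumptions, not the platform's actual values):

```python
import json
import urllib.request


def build_chat_request(model: str, messages: list[dict]) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {"model": model, "messages": messages}


def chat(base_url: str, payload: dict) -> dict:
    """POST a payload to a vLLM server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request(
    "Qwen/Qwen2.5-32B-Instruct-AWQ",
    [{"role": "user", "content": "List the running clusters."}],
)
# In-cluster call (hostname is a hypothetical example):
# chat("http://vllm.inference.svc.cluster.local:8000", payload)
```

The same payload works unchanged against the official OpenAI client libraries, which is the point of the compatibility guarantee.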

Supported Model Formats

vLLM supports loading models in several formats, each with different trade-offs:

  • SafeTensors – standard HuggingFace format with FP16/BF16 weights. Best quality, highest memory usage.
  • AWQ – 4-bit activation-aware quantization. Recommended for production: good quality at roughly 1/4 the memory.
  • GPTQ – 4-bit post-training quantization. Alternative to AWQ with slightly slower inference.
  • GGUF – llama.cpp format, CPU-optimized. Not recommended for GPU serving; use SafeTensors or AWQ instead.

For air-gapped deployments, SafeTensors and AWQ are the preferred formats because they load directly from disk without runtime conversion steps.
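Since air-gapped serving depends on the model directory already being complete on disk, a pre-flight check can catch a half-copied PVC before vLLM fails at startup. A hypothetical sketch (the exact file set varies by model; these three files are a common minimum, not an exhaustive list):

```python
import pathlib


def check_model_dir(path: str) -> list[str]:
    """Return a list of problems that would likely prevent an offline
    load of this model directory; an empty list means it looks servable."""
    p = pathlib.Path(path)
    problems = []
    if not (p / "config.json").is_file():
        problems.append("missing config.json")
    if not any(p.glob("*.safetensors")):
        problems.append("no *.safetensors weight shards found")
    if not (p / "tokenizer_config.json").is_file():
        problems.append("missing tokenizer_config.json")
    return problems
```

Running this in an initContainer before starting the server turns a confusing mid-load traceback into an actionable error message.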

Integration with F3Iai

vLLM serves as the central inference engine that powers the entire Federal Frontier Platform’s AI capabilities. The integration follows a clean separation of concerns:

Compass API  -->  vLLM (OpenAI-compatible)  -->  tool_calls  -->  MCP Servers (12)
    |                    |                                              |
LLM_ENDPOINT        /v1/chat/completions                      150+ tools across
env var              with tool definitions                     TypeDB, PostgreSQL,
                                                               OpenStack, etc.
  1. Compass API is the sole consumer of the vLLM endpoint. It is configured via the LLM_ENDPOINT environment variable to point to the vLLM Kubernetes service. No other service in the platform calls vLLM directly.
  2. Tool calling is the critical capability. The Compass API sends chat completion requests that include 150+ MCP tool definitions. The LLM returns structured tool_calls in its response, which the Compass API dispatches to the appropriate MCP server.
  3. MCP servers (12 in total) handle the actual data operations – querying TypeDB, managing Kubernetes resources, interacting with OpenStack, and more. They never communicate with vLLM; they only receive tool invocations from the Compass API.
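The dispatch step in point 2 can be sketched as a small routing function. This is not the Compass API's actual implementation; the tool names and the registry mapping tools to MCP servers below are illustrative assumptions, and only the OpenAI-style `tool_calls` response shape is taken from the source:

```python
import json

# Hypothetical registry mapping tool names to the MCP server that owns them.
TOOL_REGISTRY = {
    "typedb_query": "mcp-typedb",
    "k8s_list_pods": "mcp-kubernetes",
    "openstack_list_servers": "mcp-openstack",
}


def dispatch_tool_calls(completion: dict) -> list[tuple[str, str, dict]]:
    """Extract tool_calls from an OpenAI-style chat completion and
    resolve each to (mcp_server, tool_name, parsed_arguments)."""
    message = completion["choices"][0]["message"]
    dispatches = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])  # arrives as a JSON string
        server = TOOL_REGISTRY.get(name)
        if server is None:
            raise ValueError(f"no MCP server registered for tool {name!r}")
        dispatches.append((server, name, args))
    return dispatches
```

Note that the model returns tool arguments as a JSON-encoded string, so the dispatcher must parse them before forwarding the invocation.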

For tool-calling workloads in F3Iai, the Qwen 2.5 and Qwen 3 series models have proven to be the most reliable:

  • Qwen/Qwen2.5-7B-Instruct – 7B, ~4 GB VRAM (AWQ) – good for simple workflows
  • Qwen/Qwen2.5-14B-Instruct – 14B, ~8 GB VRAM (AWQ) – strong tool calling
  • Qwen/Qwen2.5-32B-Instruct-AWQ – 32B, ~18 GB VRAM – recommended for production
  • Qwen/Qwen3-8B – 8B, ~5 GB VRAM (AWQ) – excellent tool calling with thinking mode
  • Qwen/Qwen3-32B – 32B, ~18 GB VRAM (AWQ) – best overall for complex multi-tool chains

All Qwen models should be served with the following flags to enable structured tool calling:

--enable-auto-tool-choice --tool-call-parser hermes

For Qwen 2.5 models specifically, --tool-call-parser qwen25 may also be used if hermes produces inconsistent results.
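In a Kubernetes Deployment these flags end up in the container's args list. A small sketch of assembling them, assuming a PVC-mounted model path and a default port of 8000 (both illustrative); the parser choice follows the guidance above:

```python
def vllm_serve_args(model_path: str, *, use_qwen25_parser: bool = False) -> list[str]:
    """Assemble vllm serve arguments for a Qwen model with structured
    tool calling enabled. Path and port are illustrative defaults."""
    parser = "qwen25" if use_qwen25_parser else "hermes"
    return [
        model_path,
        "--port", "8000",
        "--enable-auto-tool-choice",
        "--tool-call-parser", parser,
    ]
```

Keeping the args in one place makes it easy to flip the parser per model family without touching the rest of the manifest.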

Key Advantage Over Ollama

The fundamental difference between vLLM and Ollama for production use is model delivery. Ollama expects to pull models from a registry at startup or on first request. In an air-gapped environment with no outbound internet access, models must either be:

  • Pre-loaded into an Ollama instance and its blob storage persisted (fragile, version-dependent)
  • Served from a local Ollama registry (adds another service to maintain)

With vLLM, models are simply files on disk. They can be:

  • Baked into the container image at build time, producing a self-contained deployment artifact
  • Loaded from a PVC that was populated during initial provisioning or via a Kubernetes Job
  • Stored in Harbor as an OCI artifact alongside the container images

This means a vLLM deployment in an air-gapped environment is fully deterministic – the exact model weights are pinned by the deployment manifests, versioned and reproducible.
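One way to make that determinism checkable is to pin a checksum of the weight shards and verify it at provisioning time. A hypothetical helper (the single-digest-over-sorted-shards scheme is an assumption, not an established convention):

```python
import hashlib
import pathlib


def weights_digest(model_dir: str) -> str:
    """Compute one SHA-256 over all weight shards in sorted filename
    order, so a deployment can pin the exact bytes it was tested with."""
    h = hashlib.sha256()
    for shard in sorted(pathlib.Path(model_dir).glob("*.safetensors")):
        h.update(shard.name.encode())  # include names so renames change the digest
        h.update(shard.read_bytes())
    return h.hexdigest()
```

The expected digest can live next to the image tag in the GitOps repo, and a Job that populates the PVC can refuse to finish if the computed value differs.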