vLLM Production Inference
Overview of vLLM as the production LLM inference engine for the Federal Frontier Platform in air-gapped Kubernetes environments.
Production Air-Gapped Deployment

This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.
Why vLLM for Federal and Air-Gapped Environments
vLLM is the production inference engine for the Federal Frontier Platform (F3Iai). It was selected over alternatives for several key reasons that align with federal deployment constraints:
- Helm chart deployable – vLLM ships an official Helm chart that integrates cleanly with ArgoCD-managed GitOps workflows, requiring no external package managers or runtime downloads.
- No registry dependency at runtime – Unlike Ollama, which pulls models from an external registry on first use, vLLM models can be baked into container images or loaded from Kubernetes PersistentVolumeClaims. This eliminates the need for any outbound network access after deployment.
- OpenAI-compatible API – vLLM exposes `/v1/chat/completions`, `/v1/completions`, and `/v1/models` endpoints that are drop-in compatible with OpenAI client libraries. Any application targeting the OpenAI API works with vLLM without code changes.
- Tensor parallelism – For larger models (32B+), vLLM supports splitting model layers across multiple GPUs on the same node or across nodes, enabling deployment of models that exceed single-GPU memory.
- FIPS-aligned container builds – The vLLM container can be built on a FIPS-validated base image (e.g., Iron Bank UBI or hardened Debian), satisfying compliance requirements for federal systems.
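Because the API is OpenAI-compatible, any HTTP client can talk to vLLM directly. The sketch below uses only the Python standard library; the service URL is an assumption (substitute the actual vLLM Kubernetes Service DNS name for your cluster):

```python
import json
import urllib.request

# Assumed in-cluster address for illustration; replace with your real
# vLLM Service DNS name and port.
VLLM_BASE_URL = "http://vllm.inference.svc.cluster.local:8000/v1"

def build_chat_payload(model: str, user_prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
    }

def chat(model: str, user_prompt: str, base_url: str = VLLM_BASE_URL) -> dict:
    """POST the payload to vLLM and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(model, user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The same endpoint works unchanged with the official OpenAI client libraries by overriding their base URL, which is what "drop-in compatible" means in practice.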
Supported Model Formats
vLLM supports loading models in several formats, each with different trade-offs:
| Format | Description | Use Case |
|---|---|---|
| SafeTensors | Standard HuggingFace format, FP16/BF16 weights | Best quality, highest memory usage |
| AWQ | 4-bit activation-aware quantization | Recommended for production – good quality at 1/4 memory |
| GPTQ | 4-bit post-training quantization | Alternative to AWQ, slightly slower inference |
| GGUF | llama.cpp format, CPU-optimized | Not recommended for GPU serving – use SafeTensors or AWQ instead |
For air-gapped deployments, SafeTensors and AWQ are the preferred formats because they load directly from disk without runtime conversion steps.
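Since air-gapped loading depends on the model directory being complete before vLLM starts, a pre-flight check can catch an incomplete PVC copy early. The file list below is a minimal illustrative set, not an exhaustive specification of what every checkpoint ships:

```python
from pathlib import Path

# Minimal files a SafeTensors/AWQ checkout needs for offline loading.
# Illustrative only -- real checkpoints often include more (e.g. tokenizer.json).
REQUIRED_FILES = ("config.json", "tokenizer_config.json")

def looks_servable(model_dir: str) -> bool:
    """Return True if the directory has weight shards plus core config files."""
    root = Path(model_dir)
    has_weights = any(root.glob("*.safetensors"))
    has_configs = all((root / name).is_file() for name in REQUIRED_FILES)
    return has_weights and has_configs
```

Running a check like this in an init container avoids a slow crash-loop when a weight shard is missing from the volume.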
Integration with F3Iai
vLLM serves as the central inference engine that powers the entire Federal Frontier Platform’s AI capabilities. The integration follows a clean separation of concerns:
```
Compass API --> vLLM (OpenAI-compatible) --> tool_calls --> MCP Servers (12)
     |                    |                                      |
LLM_ENDPOINT     /v1/chat/completions                    150+ tools across
  env var        with tool definitions                   TypeDB, PostgreSQL,
                                                         OpenStack, etc.
```
- Compass API is the sole consumer of the vLLM endpoint. It is configured via the `LLM_ENDPOINT` environment variable to point to the vLLM Kubernetes service. No other service in the platform calls vLLM directly.
- Tool calling is the critical capability. The Compass API sends chat completion requests that include 150+ MCP tool definitions. The LLM returns structured `tool_calls` in its response, which the Compass API dispatches to the appropriate MCP server.
- MCP servers (12 in total) handle the actual data operations – querying TypeDB, managing Kubernetes resources, interacting with OpenStack, and more. They never communicate with vLLM; they only receive tool invocations from the Compass API.
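The dispatch step can be sketched as a routing table from tool names to MCP servers. The prefixes and server names below are illustrative assumptions, not the actual F3Iai tool catalog; the `tool_call` shape matches the OpenAI chat completion response format that vLLM emits:

```python
import json

# Hypothetical prefix -> MCP-server routing table; the real platform's tool
# names and server identities may differ.
TOOL_ROUTES = {
    "typedb_": "mcp-typedb",
    "postgres_": "mcp-postgresql",
    "openstack_": "mcp-openstack",
}

def route_tool_call(tool_call: dict) -> tuple:
    """Pick the MCP server for one tool_call entry from a vLLM chat response."""
    name = tool_call["function"]["name"]
    # In the OpenAI response format, arguments arrive as a JSON-encoded string.
    args = json.loads(tool_call["function"]["arguments"])
    for prefix, server in TOOL_ROUTES.items():
        if name.startswith(prefix):
            return server, name, args
    raise KeyError(f"no MCP server registered for tool {name!r}")
```

Keeping this mapping in the Compass API preserves the separation of concerns: vLLM only produces structured `tool_calls`, and MCP servers only receive invocations.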
Recommended Models
For tool-calling workloads in F3Iai, the Qwen 2.5 and Qwen 3 series models have proven to be the most reliable:
| Model | Parameters | VRAM (AWQ) | Tool Calling Quality |
|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 7B | ~4 GB | Good for simple workflows |
| Qwen/Qwen2.5-14B-Instruct | 14B | ~8 GB | Strong tool calling |
| Qwen/Qwen2.5-32B-Instruct-AWQ | 32B | ~18 GB | Recommended for production |
| Qwen/Qwen3-8B | 8B | ~5 GB | Excellent tool calling with thinking |
| Qwen/Qwen3-32B | 32B | ~18 GB | Best overall for complex multi-tool chains |
All Qwen models should be served with the following flags to enable structured tool calling:
`--enable-auto-tool-choice --tool-call-parser hermes`
For Qwen 2.5 models specifically, `--tool-call-parser qwen25` may also be used if `hermes` produces inconsistent results.
Key Advantage Over Ollama
The fundamental difference between vLLM and Ollama for production use is model delivery. Ollama expects to pull models from a registry at startup or on first request. In an air-gapped environment with no outbound internet access, the model must either be:
- Pre-loaded into an Ollama instance and its blob storage persisted (fragile, version-dependent)
- Served from a local Ollama registry (adds another service to maintain)
With vLLM, models are simply files on disk. They can be:
- Baked into the container image at build time, producing a self-contained deployment artifact
- Loaded from a PVC that was populated during initial provisioning or via a Kubernetes Job
- Stored in Harbor as an OCI artifact alongside the container images
This means a vLLM deployment in an air-gapped environment is fully deterministic – the exact model weights are part of the deployment manifest, versioned and reproducible.
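One way to make that determinism verifiable is to pin a content digest of the model directory in the deployment manifest. This is a minimal sketch (not an official vLLM or Harbor mechanism) that hashes file paths and bytes in sorted order, so identical checkouts always yield the same digest:

```python
import hashlib
from pathlib import Path

def model_digest(model_dir: str) -> str:
    """Compute a reproducible SHA-256 digest over a model directory.

    Files are visited in sorted order and both the relative path and the
    file contents are hashed, so two directories with identical weights
    produce identical digests regardless of copy order or timestamps.
    """
    h = hashlib.sha256()
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_file():
            h.update(path.relative_to(model_dir).as_posix().encode("utf-8"))
            h.update(path.read_bytes())
    return h.hexdigest()
```

An init container can compare this digest against the value recorded in the manifest before vLLM starts, failing fast if the PVC contents drift from the versioned weights.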