# LLM Inference Overview

Choosing between developer workstation (Ollama on Apple Silicon) and production air-gapped (vLLM on Kubernetes) inference for the Federal Frontier Platform.


The Federal Frontier Platform uses large language models (LLMs) for agentic tool calling — the AI chat interface in Compass selects and invokes MCP tools based on natural language queries. Two inference deployment patterns exist, each optimized for different environments.

## Which Guide Do I Need?

| Deployment Context | Inference Server | Hardware | Guide |
| --- | --- | --- | --- |
| Operator workstation / local dev | Ollama | Apple Silicon Mac (M-series, 36GB+ unified memory) | Developer Workstation Guide |
| Production air-gapped / Kubernetes | vLLM | GPU node or bare metal with CUDA/ROCm, or CPU fallback | Production Inference Guide |
| CI/test pipeline | Ollama or vLLM | Any | Either guide |

## Why Two Paths?

Ollama excels at zero-configuration LLM serving on Apple Silicon. Metal GPU acceleration is automatic, models download from the Ollama registry with a single command, and the native `/api/chat` endpoint provides reliable tool calling. It is the fastest path from zero to a working AI-powered infrastructure chat on an operator's desk. However, Ollama is not designed for multi-tenant Kubernetes deployments, does not ship as a Helm chart, and depends on the Ollama model registry for model distribution.
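As a rough sketch, a tool-calling request to Ollama's native `/api/chat` endpoint looks like the following. The tool definition (`list_nodes`) and the helper function are illustrative assumptions, not the platform's actual MCP tool schema:

```python
# Illustrative only: the tool definition below is a stand-in for a real MCP tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_nodes",  # hypothetical MCP tool name
        "description": "List Kubernetes nodes in the cluster",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def build_chat_request(prompt: str, model: str = "qwen3.5:35b-a3b-q4_K_M") -> dict:
    """Assemble a request body for Ollama's /api/chat endpoint, attaching
    tool definitions so the model can emit tool calls."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "stream": False,  # one complete response; simpler to extract tool calls
    }
```

POSTing this body to `http://localhost:11434/api/chat` returns a message whose `tool_calls` field (when present) names the tool the model chose and the arguments it generated.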

vLLM is a production-grade inference server built for Kubernetes. It deploys via Helm chart, exposes an OpenAI-compatible API, supports tensor parallelism across GPU nodes, and serves models from local storage or container images — making it suitable for FIPS-aligned air-gapped environments where no external registry access is available. vLLM is the inference backend for production F3Iai deployments.
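Against vLLM, the same interaction goes through the OpenAI-compatible `/v1/chat/completions` endpoint. The helper below is a sketch of the request body Compass would send; note that server-side tool-call parsing usually also has to be enabled when vLLM is started (the exact flags vary by release):

```python
def build_vllm_request(prompt: str, tools: list,
                       model: str = "Qwen/Qwen2.5-7B-Instruct") -> dict:
    """Build an OpenAI-style chat completions body for vLLM's
    /v1/chat/completions endpoint, with tool definitions attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }
```

Because this is the standard OpenAI chat completions schema, any OpenAI-compatible client library can talk to it unchanged.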

## Model Selection

Both paths serve the same model families. The Federal Frontier Platform recommends Qwen 3.5 or Qwen 2.5 series models for agentic tool calling due to their reliable structured output and function-calling capabilities:

| Model | Parameters | Quantization | Min Memory | Use Case |
| --- | --- | --- | --- | --- |
| `qwen3.5:35b-a3b-q4_K_M` | 35B (3B active MoE) | Q4_K_M | 24GB | Dev workstation (Ollama) — production-quality tool calling |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | FP16 | 16GB | Production (vLLM) — lightweight, fast, good tool calling |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | AWQ/GPTQ | 24GB | Production (vLLM) — higher quality, needs GPU |

For detailed model selection guidance, see the Model Selection Guide.

## Architecture

```mermaid
graph TD
    UI[Compass UI<br/>Next.js Frontend] --> API[Compass API — FastAPI<br/>chat.py → intent classification<br/>→ template match OR LLM tool calling]
    API --> Dev[Ollama<br/>Mac · :11434]
    API --> Prod[vLLM<br/>K8s · :8000]
    Dev --> MCP[12 MCP Servers<br/>150+ tools]
    Prod --> MCP
    style UI fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style API fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style Dev fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style Prod fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style MCP fill:#1a365d,stroke:#4299e1,color:#e2e8f0
```

The Compass API backend switches between Ollama and vLLM via the `LLM_ENDPOINT` environment variable. No code changes are required: both serve compatible chat completion APIs.
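In practice the switch can be as small as resolving one base URL. The sketch below assumes the default Ollama base URL and an illustrative in-cluster vLLM hostname; the environment variable name `LLM_ENDPOINT` is from this document, the rest is hypothetical:

```python
import os

def chat_completions_url() -> str:
    """Resolve the chat completions URL from LLM_ENDPOINT, defaulting to a
    local Ollama instance. Both backends accept the same OpenAI-compatible
    path, so only the base URL differs between dev and production."""
    base = os.environ.get("LLM_ENDPOINT", "http://localhost:11434").rstrip("/")
    return f"{base}/v1/chat/completions"

# Pointing at an (illustrative) in-cluster vLLM service:
os.environ["LLM_ENDPOINT"] = "http://vllm.inference.svc.cluster.local:8000"
print(chat_completions_url())
# → http://vllm.inference.svc.cluster.local:8000/v1/chat/completions
```

Unset, the same function falls back to the local Ollama endpoint, which also serves the OpenAI-compatible path alongside its native `/api/chat`.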

## Next Steps