# LLM Inference Overview

Choosing between developer workstation (Ollama on Apple Silicon) and production air-gapped (vLLM on Kubernetes) inference for the Federal Frontier Platform.


The Federal Frontier Platform uses large language models (LLMs) for agentic tool calling — the AI chat interface in Compass selects and invokes MCP tools based on natural language queries. Two inference deployment patterns exist, each optimized for different environments.

## Which Guide Do I Need?

| Deployment Context | Inference Server | Hardware | Guide |
| --- | --- | --- | --- |
| Operator workstation / local dev | Ollama | Apple Silicon Mac (M-series, 36GB+ unified memory) | Developer Workstation Guide |
| Production air-gapped / Kubernetes | vLLM | GPU node or bare metal with CUDA/ROCm, or CPU fallback | Production Inference Guide |
| CI/test pipeline | Ollama or vLLM | Any | Either guide |

## Why Two Paths?

Ollama excels at zero-configuration LLM serving on Apple Silicon. Metal GPU acceleration is automatic, models download from the Ollama registry with a single command, and the native `/api/chat` endpoint provides reliable tool calling. It is the fastest path from zero to a working AI-powered infrastructure chat on an operator's desk. However, Ollama is not designed for multi-tenant Kubernetes deployments, does not ship as a Helm chart, and depends on the Ollama model registry for model distribution.
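As a rough sketch, a tool-calling request to Ollama's native `/api/chat` endpoint looks like the following. The tool definition (`list_nodes`) and the helper function are illustrative assumptions, not the platform's actual MCP tool schema:

```python
# Illustrative only: the tool definition below is a stand-in for a real MCP tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_nodes",  # hypothetical MCP tool name
        "description": "List Kubernetes nodes in the cluster",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def build_chat_request(prompt: str, model: str = "qwen3.5:35b-a3b-q4_K_M") -> dict:
    """Assemble a request body for Ollama's /api/chat endpoint, attaching
    tool definitions so the model can emit tool calls."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "stream": False,  # one complete response; simpler to extract tool calls
    }
```

POSTing this body to `http://localhost:11434/api/chat` returns a message whose `tool_calls` field (when present) names the tool the model chose and the arguments it generated.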

vLLM is a production-grade inference server built for Kubernetes. It deploys via Helm chart, exposes an OpenAI-compatible API, supports tensor parallelism across GPU nodes, and serves models from local storage or container images — making it suitable for FIPS-aligned air-gapped environments where no external registry access is available. vLLM is the inference backend for production F3Iai deployments.
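Against vLLM, the same interaction goes through the OpenAI-compatible `/v1/chat/completions` endpoint. The helper below is a sketch of the request body Compass would send; note that server-side tool-call parsing usually also has to be enabled when vLLM is started (the exact flags vary by release):

```python
def build_vllm_request(prompt: str, tools: list,
                       model: str = "Qwen/Qwen2.5-7B-Instruct") -> dict:
    """Build an OpenAI-style chat completions body for vLLM's
    /v1/chat/completions endpoint, with tool definitions attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }
```

Because this is the standard OpenAI chat completions schema, any OpenAI-compatible client library can talk to it unchanged.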

## Model Selection

Both paths serve the same model families. The Federal Frontier Platform recommends Qwen 3.5 or Qwen 2.5 series models for agentic tool calling due to their reliable structured output and function-calling capabilities:

| Model | Parameters | Quantization | Min Memory | Use Case |
| --- | --- | --- | --- | --- |
| `qwen3.5:35b-a3b-q4_K_M` | 35B (3B active MoE) | Q4_K_M | 24GB | Dev workstation (Ollama) — production-quality tool calling |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | FP16 | 16GB | Production (vLLM) — lightweight, fast, good tool calling |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | AWQ/GPTQ | 24GB | Production (vLLM) — higher quality, needs GPU |

For detailed model selection guidance, see the Model Selection Guide.

## Architecture

```mermaid
graph TD
    UI[Compass UI<br/>Next.js Frontend] --> API[Compass API — FastAPI<br/>chat.py → intent classification<br/>→ template match OR LLM tool calling]
    API --> Dev[Ollama<br/>Mac · :11434]
    API --> Prod[vLLM<br/>K8s · :8000]
    Dev --> MCP[12 MCP Servers<br/>150+ tools]
    Prod --> MCP
    style UI fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style API fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style Dev fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style Prod fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style MCP fill:#1a365d,stroke:#4299e1,color:#e2e8f0
```

The Compass API backend switches between Ollama and vLLM via the `LLM_ENDPOINT` environment variable. No code changes are required: both serve compatible chat completion APIs.
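In practice the switch can be as small as resolving one base URL. The sketch below assumes the default Ollama base URL and an illustrative in-cluster vLLM hostname; the environment variable name `LLM_ENDPOINT` is from this document, the rest is hypothetical:

```python
import os

def chat_completions_url() -> str:
    """Resolve the chat completions URL from LLM_ENDPOINT, defaulting to a
    local Ollama instance. Both backends accept the same OpenAI-compatible
    path, so only the base URL differs between dev and production."""
    base = os.environ.get("LLM_ENDPOINT", "http://localhost:11434").rstrip("/")
    return f"{base}/v1/chat/completions"

# Pointing at an (illustrative) in-cluster vLLM service:
os.environ["LLM_ENDPOINT"] = "http://vllm.inference.svc.cluster.local:8000"
print(chat_completions_url())
# → http://vllm.inference.svc.cluster.local:8000/v1/chat/completions
```

Unset, the same function falls back to the local Ollama endpoint, which also serves the OpenAI-compatible path alongside its native `/api/chat`.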

## Next Steps