Model Configuration
Guide for selecting, quantizing, and configuring LLM models for vLLM in air-gapped Federal Frontier Platform deployments.
Production Air-Gapped Deployment: This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.
Model Selection for Air-Gapped Environments
When selecting models for air-gapped deployment, apply the following criteria:
- SafeTensors format available – The model must be downloadable as SafeTensors files from HuggingFace (or a mirror). Avoid models that ship only as PyTorch .bin files; SafeTensors provides memory-mapped loading and tamper detection.
- No proprietary tokenizer dependencies – Some models require downloading additional tokenizer files or running custom code during loading. Prefer models where all tokenizer assets are included in the model directory.
- Proven tool-calling support – Not all instruction-tuned models handle structured tool calls reliably. The Qwen 2.5 Instruct series and Qwen 3 series have been validated extensively with the F3Iai tool set (150+ MCP tools).
- Mirror-friendly licensing – Ensure the model license permits hosting on an internal Harbor registry or NFS volume. Apache 2.0 and Qwen License models are safe choices.
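The first three criteria can be checked mechanically once a candidate model directory is on disk. A minimal sketch, assuming HuggingFace-conventional file names; `check_model_dir` is a hypothetical helper, not part of any vLLM or HuggingFace tooling:

```python
from pathlib import Path

# Tokenizer assets that should be present locally (HuggingFace convention).
REQUIRED_FILES = ["config.json", "tokenizer_config.json"]

def check_model_dir(path: str) -> list[str]:
    """Return a list of problems found in a candidate model directory."""
    d = Path(path)
    problems = []
    # Weights must be SafeTensors, not PyTorch .bin
    if not list(d.glob("*.safetensors")):
        problems.append("no .safetensors weight files")
    if list(d.glob("pytorch_model*.bin")):
        problems.append("ships PyTorch .bin weights; prefer SafeTensors")
    # All tokenizer/config assets should be included in the directory
    for name in REQUIRED_FILES:
        if not (d / name).exists():
            problems.append(f"missing {name}")
    return problems
```

An empty return value means the directory passes these basic checks; licensing still has to be reviewed by hand.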
Tool-Calling Configuration
Tool calling is the most critical capability for F3Iai. The LLM must reliably emit structured tool_calls in its response when presented with tool definitions in the system prompt.
Required vLLM Flags
--enable-auto-tool-choice # Enables automatic tool call detection in output
--tool-call-parser hermes # Parser for extracting tool calls from model output
The --tool-call-parser flag tells vLLM how to parse tool calls from the model’s raw text output. Available parsers:
| Parser | Compatible Models | Notes |
|---|---|---|
| hermes | Most instruction-tuned models, Qwen 2.5/3 | Recommended default – uses Hermes-style XML tags |
| qwen25 | Qwen 2.5 series specifically | Alternative if hermes has issues with Qwen 2.5 |
| mistral | Mistral/Mixtral models | Mistral-specific tool format |
| llama3_json | Llama 3.x Instruct models | JSON-based tool calling |
| jamba | AI21 Jamba models | Jamba-specific format |
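The parser choice can be encoded as a simple lookup keyed on model family. A sketch mirroring the table above; the mapping and the hermes fallback are this guide's recommendations, and `pick_tool_call_parser` is a hypothetical helper:

```python
# Maps a lowercase model-family keyword to a vLLM --tool-call-parser value.
PARSER_BY_FAMILY = {
    "qwen": "hermes",        # fall back to qwen25 if hermes misbehaves
    "mistral": "mistral",    # also matches "mistralai/Mixtral-..." org names
    "mixtral": "mistral",
    "llama-3": "llama3_json",
    "jamba": "jamba",
}

def pick_tool_call_parser(model_name: str) -> str:
    """Return a reasonable --tool-call-parser value for a model name."""
    lowered = model_name.lower()
    for family, parser in PARSER_BY_FAMILY.items():
        if family in lowered:
            return parser
    return "hermes"  # recommended default for most instruction-tuned models
```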
Chat Template Requirements
vLLM uses Jinja2 chat templates to format conversations. For tool calling to work, the chat template must support the tools parameter. Most Qwen and Llama instruct models ship with a compatible template in their repository (tokenizer_config.json or a standalone .jinja file).
If the default template does not handle tools correctly, you can override it:
--chat-template /models/Qwen2.5-7B-Instruct/tool_chat_template.jinja
Qwen models typically include a tool_chat_template.jinja alongside their standard template. Use this file for tool-calling workloads.
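Whether a given template handles tools can be checked with a rough heuristic: templates that accept tool definitions reference the `tools` variable in their Jinja source. A sketch that reads `chat_template` from `tokenizer_config.json` (a standalone `.jinja` file can be grepped the same way); `template_supports_tools` is a hypothetical helper:

```python
import json

def template_supports_tools(tokenizer_config_path: str) -> bool:
    """Heuristic: does the chat template reference the `tools` parameter?"""
    with open(tokenizer_config_path) as f:
        config = json.load(f)
    template = config.get("chat_template", "")
    # Tool-aware templates branch on or iterate over the `tools` variable.
    return "tools" in template
```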
Verifying Tool Calling
After deployment, test tool calling with a curl request:
curl -s http://vllm.f3iai.svc.cluster.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Search for all Kubernetes clusters in the ontology"}
],
"tools": [
{
"type": "function",
"function": {
"name": "ffo.search",
"description": "Search the Federal Frontier Ontology",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"entity_type": {"type": "string"}
},
"required": ["query"]
}
}
}
]
}'
A successful response will contain tool_calls in the assistant message rather than plain text.
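This check can be automated in a smoke test. A sketch that inspects the response body, assuming the OpenAI-compatible response shape that vLLM serves; `extract_tool_calls` and `tool_calling_works` are hypothetical helpers:

```python
def extract_tool_calls(response: dict) -> list[dict]:
    """Return the tool_calls list from a chat-completions response, or []."""
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []

def tool_calling_works(response: dict) -> bool:
    """Success means the assistant emitted tool_calls, not plain text."""
    return len(extract_tool_calls(response)) > 0
```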
Quantization Trade-Offs
Quantization reduces model memory requirements at the cost of some quality. For production F3Iai deployments, the trade-offs are:
AWQ (Activation-Aware Weight Quantization)
- Bits: 4-bit
- VRAM savings: ~75% compared to FP16
- Inference speed: Fast on NVIDIA GPUs (uses efficient CUDA kernels)
- Quality: Minimal degradation for instruction-following and tool calling
- Recommendation: Preferred for production. Best balance of quality, speed, and memory.
GPTQ (Post-Training Quantization)
- Bits: 4-bit (or 8-bit)
- VRAM savings: ~75% compared to FP16
- Inference speed: Slightly slower than AWQ on most hardware
- Quality: Comparable to AWQ
- Recommendation: Use as fallback if AWQ variant is unavailable for your chosen model.
FP16 / BF16 (Full Precision)
- Bits: 16-bit
- VRAM savings: None (baseline)
- Inference speed: Fastest per-token, but fewer concurrent requests due to memory
- Quality: Best possible
- Recommendation: Use only if GPU memory is abundant and quality is paramount.
GGUF
- Bits: Variable (2-bit to 8-bit)
- VRAM savings: Variable
- Inference speed: Optimized for CPU, suboptimal on GPU
- Quality: Depends on quantization level
- Recommendation: Not recommended for vLLM GPU serving. Use SafeTensors or AWQ instead. GGUF is better suited for Ollama on developer workstations.
Memory Planning
Use these estimates to plan GPU resource allocation:
| Model Size | FP16 VRAM | AWQ (4-bit) VRAM | GPUs (A100 80GB) |
|---|---|---|---|
| 7B | ~14 GB | ~4 GB | 1 |
| 14B | ~28 GB | ~8 GB | 1 |
| 32B | ~64 GB | ~18 GB | 1 |
| 70B | ~140 GB | ~36 GB | 2 (tensor parallel) |
| 72B | ~144 GB | ~38 GB | 2 (tensor parallel) |
These estimates include model weights only. Add 2-4 GB for KV cache and runtime overhead. The --gpu-memory-utilization flag (default 0.9) sets the total fraction of GPU memory vLLM reserves; the KV cache receives whatever remains after model weights are loaded.
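The table values can be reproduced with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter, plus runtime overhead. A sketch, assuming 2 bytes/param for FP16 and 0.5 bytes/param for 4-bit quantization (real AWQ/GPTQ checkpoints carry a little extra for quantization scales, which is why the table rounds up); the 4 GB overhead constant is a rough assumption from this guide:

```python
import math

# Approximate bytes per parameter by weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "awq": 0.5, "gptq": 0.5}

def estimate_weight_vram_gb(params_billion: float, quant: str) -> float:
    """Rough VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[quant]

def gpus_needed(params_billion: float, quant: str,
                gpu_vram_gb: float = 80.0, overhead_gb: float = 4.0) -> int:
    """GPUs required to hold weights plus KV-cache/runtime overhead."""
    total = estimate_weight_vram_gb(params_billion, quant) + overhead_gb
    return math.ceil(total / gpu_vram_gb)
```

For example, a 70B model in FP16 needs two A100 80GB GPUs with tensor parallelism, matching the table.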
Serving Parameters
Key vLLM serving parameters and their impact:
--max-model-len 32768 # Maximum context window (tokens). Reduce to save VRAM.
--gpu-memory-utilization 0.90 # Total fraction of GPU memory vLLM reserves (weights + KV cache). Lower = fewer concurrent requests.
--tensor-parallel-size 1 # Number of GPUs for model parallelism. Set to 2+ for large models.
--max-num-seqs 64 # Maximum concurrent sequences. Reduce if OOM.
--enforce-eager # Disable CUDA graphs. Use if hitting memory issues during warmup.
For F3Iai tool-calling workloads, a context window of 32768 tokens is recommended. Tool definitions for 150+ MCP tools consume approximately 8000-12000 tokens in the system prompt, leaving 20000+ tokens for conversation history and tool results.
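The context-budget arithmetic above can be sketched as a quick check; the token counts are the estimates from this guide, and `conversation_budget` is a hypothetical helper:

```python
def conversation_budget(max_model_len: int, tool_def_tokens: int) -> int:
    """Tokens remaining for conversation history and tool results."""
    return max_model_len - tool_def_tokens

# Worst case with the full 150+ MCP tool set loaded into the system prompt:
remaining = conversation_budget(32768, 12000)  # 20768 tokens, i.e. "20000+"
```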
Mirroring Models for Air-Gapped Use
To prepare a model for air-gapped deployment:
# 1. Download on a connected machine
pip install huggingface-hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
--local-dir ./Qwen2.5-7B-Instruct \
--local-dir-use-symlinks False
# 2. Verify all files are present
ls -la ./Qwen2.5-7B-Instruct/
# Should contain: config.json, tokenizer.json, tokenizer_config.json,
# model-*.safetensors, generation_config.json, etc.
# 3. Package for transfer
tar czf qwen2.5-7b-instruct.tar.gz Qwen2.5-7B-Instruct/
# 4. Transfer to air-gapped environment via approved media
# (USB, SFTP to jump host, etc.)
# 5. Extract to NFS volume or build into container image
tar xzf qwen2.5-7b-instruct.tar.gz -C /mnt/nfs/models/
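For transfers over approved media, it is worth generating a checksum manifest before packaging and verifying it after extraction, so corruption or tampering is caught before the model is served. A stdlib sketch; the manifest format and the helper names are this guide's own convention, not a vLLM or HuggingFace standard:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(model_dir: str, manifest_path: str) -> None:
    """Record a sha256 line for every file in the model directory."""
    lines = []
    for p in sorted(Path(model_dir).rglob("*")):
        if p.is_file():
            lines.append(f"{sha256_file(p)}  {p.relative_to(model_dir)}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")

def verify_manifest(model_dir: str, manifest_path: str) -> list[str]:
    """Return relative paths whose checksum does not match the manifest."""
    bad = []
    for line in Path(manifest_path).read_text().splitlines():
        digest, rel = line.split("  ", 1)
        if sha256_file(Path(model_dir) / rel) != digest:
            bad.append(rel)
    return bad
```

Run `write_manifest` on the connected machine before step 3, ship the manifest with the tarball, and run `verify_manifest` after step 5; an empty list means every file arrived intact.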
Alternatively, push the model as an OCI artifact to Harbor:
# Using ORAS to push model files as OCI artifact
oras push harbor.vitro.lan/ffp/models/qwen2.5-7b-instruct:v1.0 \
./Qwen2.5-7B-Instruct/:application/vnd.safetensors
This approach versions model weights alongside container images in the same registry, simplifying air-gapped delivery.