Model Configuration
Guide for selecting, quantizing, and configuring LLM models for vLLM in air-gapped Federal Frontier Platform deployments.
Production Air-Gapped Deployment: This guide covers deploying LLM inference via vLLM on Kubernetes for production and air-gapped environments. For developer workstation setup using Ollama on Apple Silicon, see the Developer Workstation Guide.
Model Selection for Air-Gapped Environments
When selecting models for air-gapped deployment, apply the following criteria:
- SafeTensors format available – The model must be downloadable as SafeTensors files from HuggingFace (or a mirror). Avoid models that ship only as PyTorch .bin files; SafeTensors provides memory-mapped loading and tamper detection.
- No proprietary tokenizer dependencies – Some models require downloading additional tokenizer files or running custom code during loading. Prefer models where all tokenizer assets are included in the model directory.
- Proven tool-calling support – Not all instruction-tuned models handle structured tool calls reliably. The Qwen 2.5 Instruct series and Qwen 3 series have been validated extensively with the F3Iai tool set (150+ MCP tools).
- Mirror-friendly licensing – Ensure the model license permits hosting on an internal Harbor registry or NFS volume. Apache 2.0 and Qwen License models are safe choices.
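The first three criteria can be checked mechanically once a candidate model directory is on disk. A minimal sketch, assuming HuggingFace-conventional file names; `check_model_dir` is a hypothetical helper, not part of any vLLM or HuggingFace tooling:

```python
from pathlib import Path

# Tokenizer assets that should be present locally (HuggingFace convention).
REQUIRED_FILES = ["config.json", "tokenizer_config.json"]

def check_model_dir(path: str) -> list[str]:
    """Return a list of problems found in a candidate model directory."""
    d = Path(path)
    problems = []
    # Weights must be SafeTensors, not PyTorch .bin
    if not list(d.glob("*.safetensors")):
        problems.append("no .safetensors weight files")
    if list(d.glob("pytorch_model*.bin")):
        problems.append("ships PyTorch .bin weights; prefer SafeTensors")
    # All tokenizer/config assets should be included in the directory
    for name in REQUIRED_FILES:
        if not (d / name).exists():
            problems.append(f"missing {name}")
    return problems
```

An empty return value means the directory passes these basic checks; licensing still has to be reviewed by hand.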
Tool-Calling Configuration
Tool calling is the most critical capability for F3Iai. The LLM must reliably emit structured tool_calls in its response when presented with tool definitions in the system prompt.
Required vLLM Flags
--enable-auto-tool-choice # Enables automatic tool call detection in output
--tool-call-parser hermes # Parser for extracting tool calls from model output
The --tool-call-parser flag tells vLLM how to parse tool calls from the model’s raw text output. Available parsers:
| Parser | Compatible Models | Notes |
|---|---|---|
| hermes | Most instruction-tuned models, Qwen 2.5/3 | Recommended default – uses Hermes-style XML tags |
| qwen25 | Qwen 2.5 series specifically | Alternative if hermes has issues with Qwen 2.5 |
| mistral | Mistral/Mixtral models | Mistral-specific tool format |
| llama3_json | Llama 3.x Instruct models | JSON-based tool calling |
| jamba | AI21 Jamba models | Jamba-specific format |
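The parser choice can be encoded as a simple lookup keyed on model family. A sketch mirroring the table above; the mapping and the hermes fallback are this guide's recommendations, and `pick_tool_call_parser` is a hypothetical helper:

```python
# Maps a lowercase model-family keyword to a vLLM --tool-call-parser value.
PARSER_BY_FAMILY = {
    "qwen": "hermes",        # fall back to qwen25 if hermes misbehaves
    "mistral": "mistral",    # also matches "mistralai/Mixtral-..." org names
    "mixtral": "mistral",
    "llama-3": "llama3_json",
    "jamba": "jamba",
}

def pick_tool_call_parser(model_name: str) -> str:
    """Return a reasonable --tool-call-parser value for a model name."""
    lowered = model_name.lower()
    for family, parser in PARSER_BY_FAMILY.items():
        if family in lowered:
            return parser
    return "hermes"  # recommended default for most instruction-tuned models
```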
Chat Template Requirements
vLLM uses Jinja2 chat templates to format conversations. For tool calling to work, the chat template must support the tools parameter. Most Qwen and Llama instruct models ship with a compatible template in their repository (tokenizer_config.json or a standalone .jinja file).
If the default template does not handle tools correctly, you can override it:
--chat-template /models/Qwen2.5-7B-Instruct/tool_chat_template.jinja
Qwen models typically include a tool_chat_template.jinja alongside their standard template. Use this file for tool-calling workloads.
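Whether a given template handles tools can be checked with a rough heuristic: templates that accept tool definitions reference the `tools` variable in their Jinja source. A sketch that reads `chat_template` from `tokenizer_config.json` (a standalone `.jinja` file can be grepped the same way); `template_supports_tools` is a hypothetical helper:

```python
import json

def template_supports_tools(tokenizer_config_path: str) -> bool:
    """Heuristic: does the chat template reference the `tools` parameter?"""
    with open(tokenizer_config_path) as f:
        config = json.load(f)
    template = config.get("chat_template", "")
    # Tool-aware templates branch on or iterate over the `tools` variable.
    return "tools" in template
```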
Verifying Tool Calling
After deployment, test tool calling with a curl request:
curl -s http://vllm.f3iai.svc.cluster.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Search for all Kubernetes clusters in the ontology"}
],
"tools": [
{
"type": "function",
"function": {
"name": "ffo.search",
"description": "Search the Federal Frontier Ontology",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"entity_type": {"type": "string"}
},
"required": ["query"]
}
}
}
]
}'
A successful response will contain tool_calls in the assistant message rather than plain text.
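This check can be automated in a smoke test. A sketch that inspects the response body, assuming the OpenAI-compatible response shape that vLLM serves; `extract_tool_calls` and `tool_calling_works` are hypothetical helpers:

```python
def extract_tool_calls(response: dict) -> list[dict]:
    """Return the tool_calls list from a chat-completions response, or []."""
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []

def tool_calling_works(response: dict) -> bool:
    """Success means the assistant emitted tool_calls, not plain text."""
    return len(extract_tool_calls(response)) > 0
```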
Quantization Trade-Offs
Quantization reduces model memory requirements at the cost of some quality. For production F3Iai deployments, the trade-offs are:
AWQ (Activation-Aware Weight Quantization)
- Bits: 4-bit
- VRAM savings: ~75% compared to FP16
- Inference speed: Fast on NVIDIA GPUs (uses efficient CUDA kernels)
- Quality: Minimal degradation for instruction-following and tool calling
- Recommendation: Preferred for production. Best balance of quality, speed, and memory.
GPTQ (Post-Training Quantization)
- Bits: 4-bit (or 8-bit)
- VRAM savings: ~75% compared to FP16
- Inference speed: Slightly slower than AWQ on most hardware
- Quality: Comparable to AWQ
- Recommendation: Use as fallback if AWQ variant is unavailable for your chosen model.
FP16 / BF16 (Full Precision)
- Bits: 16-bit
- VRAM savings: None (baseline)
- Inference speed: Fastest per-token, but fewer concurrent requests due to memory
- Quality: Best possible
- Recommendation: Use only if GPU memory is abundant and quality is paramount.
GGUF
- Bits: Variable (2-bit to 8-bit)
- VRAM savings: Variable
- Inference speed: Optimized for CPU, suboptimal on GPU
- Quality: Depends on quantization level
- Recommendation: Not recommended for vLLM GPU serving. Use SafeTensors or AWQ instead. GGUF is better suited for Ollama on developer workstations.
Memory Planning
Use these estimates to plan GPU resource allocation:
| Model Size | FP16 VRAM | AWQ (4-bit) VRAM | GPUs (A100 80GB) |
|---|---|---|---|
| 7B | ~14 GB | ~4 GB | 1 |
| 14B | ~28 GB | ~8 GB | 1 |
| 32B | ~64 GB | ~18 GB | 1 |
| 70B | ~140 GB | ~36 GB | 2 (tensor parallel) |
| 72B | ~144 GB | ~38 GB | 2 (tensor parallel) |
These estimates include model weights only. Add 2-4 GB for KV cache and runtime overhead. The --gpu-memory-utilization flag (default 0.9) sets the total fraction of GPU memory vLLM reserves; the KV cache receives whatever remains after model weights are loaded.
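The table values can be reproduced with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter, plus runtime overhead. A sketch, assuming 2 bytes/param for FP16 and 0.5 bytes/param for 4-bit quantization (real AWQ/GPTQ checkpoints carry a little extra for quantization scales, which is why the table rounds up); the 4 GB overhead constant is a rough assumption from this guide:

```python
import math

# Approximate bytes per parameter by weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "awq": 0.5, "gptq": 0.5}

def estimate_weight_vram_gb(params_billion: float, quant: str) -> float:
    """Rough VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[quant]

def gpus_needed(params_billion: float, quant: str,
                gpu_vram_gb: float = 80.0, overhead_gb: float = 4.0) -> int:
    """GPUs required to hold weights plus KV-cache/runtime overhead."""
    total = estimate_weight_vram_gb(params_billion, quant) + overhead_gb
    return math.ceil(total / gpu_vram_gb)
```

For example, a 70B model in FP16 needs two A100 80GB GPUs with tensor parallelism, matching the table.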
Serving Parameters
Key vLLM serving parameters and their impact:
--max-model-len 32768 # Maximum context window (tokens). Reduce to save VRAM.
--gpu-memory-utilization 0.90 # Total fraction of GPU memory vLLM reserves (weights + KV cache). Lower = fewer concurrent requests.
--tensor-parallel-size 1 # Number of GPUs for model parallelism. Set to 2+ for large models.
--max-num-seqs 64 # Maximum concurrent sequences. Reduce if OOM.
--enforce-eager # Disable CUDA graphs. Use if hitting memory issues during warmup.
For F3Iai tool-calling workloads, a context window of 32768 tokens is recommended. Tool definitions for 150+ MCP tools consume approximately 8000-12000 tokens in the system prompt, leaving 20000+ tokens for conversation history and tool results.
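The context-budget arithmetic above can be sketched as a quick check; the token counts are the estimates from this guide, and `conversation_budget` is a hypothetical helper:

```python
def conversation_budget(max_model_len: int, tool_def_tokens: int) -> int:
    """Tokens remaining for conversation history and tool results."""
    return max_model_len - tool_def_tokens

# Worst case with the full 150+ MCP tool set loaded into the system prompt:
remaining = conversation_budget(32768, 12000)  # 20768 tokens, i.e. "20000+"
```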
Mirroring Models for Air-Gapped Use
To prepare a model for air-gapped deployment:
# 1. Download on a connected machine
pip install huggingface-hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
--local-dir ./Qwen2.5-7B-Instruct \
--local-dir-use-symlinks False
# 2. Verify all files are present
ls -la ./Qwen2.5-7B-Instruct/
# Should contain: config.json, tokenizer.json, tokenizer_config.json,
# model-*.safetensors, generation_config.json, etc.
# 3. Package for transfer
tar czf qwen2.5-7b-instruct.tar.gz Qwen2.5-7B-Instruct/
# 4. Transfer to air-gapped environment via approved media
# (USB, SFTP to jump host, etc.)
# 5. Extract to NFS volume or build into container image
tar xzf qwen2.5-7b-instruct.tar.gz -C /mnt/nfs/models/
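For transfers over approved media, it is worth generating a checksum manifest before packaging and verifying it after extraction, so corruption or tampering is caught before the model is served. A stdlib sketch; the manifest format and the helper names are this guide's own convention, not a vLLM or HuggingFace standard:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(model_dir: str, manifest_path: str) -> None:
    """Record a sha256 line for every file in the model directory."""
    lines = []
    for p in sorted(Path(model_dir).rglob("*")):
        if p.is_file():
            lines.append(f"{sha256_file(p)}  {p.relative_to(model_dir)}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")

def verify_manifest(model_dir: str, manifest_path: str) -> list[str]:
    """Return relative paths whose checksum does not match the manifest."""
    bad = []
    for line in Path(manifest_path).read_text().splitlines():
        digest, rel = line.split("  ", 1)
        if sha256_file(Path(model_dir) / rel) != digest:
            bad.append(rel)
    return bad
```

Run `write_manifest` on the connected machine before step 3, ship the manifest with the tarball, and run `verify_manifest` after step 5; an empty list means every file arrived intact.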
Alternatively, push the model as an OCI artifact to Harbor:
# Using ORAS to push model files as OCI artifact
oras push harbor.vitro.lan/ffp/models/qwen2.5-7b-instruct:v1.0 \
./Qwen2.5-7B-Instruct/:application/vnd.safetensors
This approach versions model weights alongside container images in the same registry, simplifying air-gapped delivery.