Performance Tuning for Tool Calling
Optimizing Ollama inference parameters, system prompts, and resource allocation for agentic workloads with large tool payloads on the Federal Frontier Platform.
Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.
Overview
Tool-calling workloads are fundamentally different from general chat. The model receives a tools array that can exceed 10,000 tokens for 150+ tools, must select the correct tool from that list, generate valid JSON arguments, and then produce a coherent response from the tool results. Every parameter choice — temperature, context size, system prompt length — directly affects tool selection reliability.
This guide documents the specific tuning decisions made for the Federal Frontier Platform’s Compass AI interface, which serves qwen3.5:35b-a3b-q4_K_M with 150+ MCP tools.
Temperature
Recommended: 0.1
{
  "options": {
    "temperature": 0.1
  }
}
Temperature controls randomness in token selection. For tool calling:
- 0.0 (greedy): Fully deterministic. Always selects the highest-probability token. Can cause repetition loops on some models.
- 0.1 (near-deterministic): Slight variation prevents repetition while keeping tool selection highly reliable. This is the FFP production setting.
- 0.3-0.5: Acceptable for general chat but introduces occasional wrong tool selection with 150+ tools.
- 0.7-1.0: Too random for tool calling. The model will sometimes select plausible-but-wrong tools or generate malformed arguments.
Why 0.1 rather than 0.0: with greedy decoding, some models enter degenerate states where they repeat the same tool call or generate truncated JSON. The minimal randomness at 0.1 breaks these loops without meaningfully degrading tool selection accuracy.
System Prompt
Keep it to 1-2 lines. This is critical for Qwen 3.5.
Good — works reliably:
You are an infrastructure assistant. Use the provided tools to answer questions. Format results as markdown tables.
Bad — breaks tool calling:
You are an expert infrastructure assistant for the Federal Frontier Platform. You have access to a comprehensive ontology of infrastructure entities including clusters, deployments, services, images, vulnerabilities, and more. When a user asks about infrastructure, you should use the appropriate MCP tools to query the Federal Frontier Ontology (FFO) and return structured results. Always format your responses using markdown tables when presenting tabular data. If you are unsure which tool to use, prefer ffo.search for broad queries and ffo.entity.get for specific entity lookups. When traversing relationships, use ffo.traverse with appropriate depth parameters...
The long system prompt fills the model’s “instruction following” attention budget. When the system prompt is verbose and the tools array is large (150+ tools = 10K+ tokens), Qwen 3.5 begins to ignore the tool definitions and generates free-text responses instead of structured tool_calls. This failure mode is silent — the model does not error, it simply stops calling tools.
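Because the model returns a normal-looking message either way, the regression is easy to miss. A minimal client-side check can flag responses that came back as prose instead of tool calls (the helper names are hypothetical, not part of the Ollama API):

```python
def has_tool_calls(response: dict) -> bool:
    """True if a non-streaming /api/chat response contains structured tool
    calls; the assistant message carries an optional `tool_calls` list
    when the model chose a tool."""
    return bool(response.get("message", {}).get("tool_calls"))


def warn_if_freetext(response: dict) -> None:
    # Hypothetical guard: a prose answer to a query that should have hit a
    # tool is the signature of an oversized system prompt or truncated context.
    if not has_tool_calls(response):
        print("WARNING: no tool_calls in response; "
              "check system prompt length and num_ctx")
```

Wiring this into request logging makes the silent failure mode loud.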
Rules for system prompts with large tool payloads:
- Maximum 2 sentences.
- State the role and output format preference only.
- Do not describe individual tools — the tools array already does that.
- Do not include examples of tool usage — the model’s training handles this.
- Do not include behavioral guidelines beyond basic role framing.
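The rules above can be checked mechanically, for example in CI before a prompt change ships. A rough lint sketch; the thresholds are this guide's heuristics, not model-enforced limits:

```python
def lint_system_prompt(prompt: str, max_sentences: int = 2,
                       max_chars: int = 300) -> list[str]:
    """Flag system prompts likely to crowd out a large tools array.

    Returns a list of problems (empty means the prompt passes). Sentence
    counting is a crude split on terminal punctuation, which is enough
    for a guardrail.
    """
    problems = []
    sentences = [s for s in
                 prompt.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    if len(sentences) > max_sentences:
        problems.append(f"{len(sentences)} sentences (max {max_sentences})")
    if len(prompt) > max_chars:
        problems.append(f"{len(prompt)} chars (max {max_chars})")
    return problems
```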
Context Window (num_ctx)
Recommended: 16384
{
  "options": {
    "num_ctx": 16384
  }
}
The context window must be large enough to hold: system prompt + tools array + conversation history + model output.
For 150+ MCP tools, the tools array alone consumes approximately 8,000-12,000 tokens depending on how verbose the tool descriptions and parameter schemas are. With the default num_ctx of 8192, the tools array may be truncated, causing the model to lose awareness of tools near the end of the list.
| Tool Count | Approximate Tools Tokens | Recommended num_ctx |
|---|---|---|
| 10-20 | 1,000-2,000 | 8192 (default) |
| 20-50 | 2,000-5,000 | 8192-16384 |
| 50-100 | 5,000-8,000 | 16384 |
| 100-150+ | 8,000-12,000+ | 16384-32768 |
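The table maps to a simple sizing helper. This sketch uses the common ~4 characters-per-token approximation (run the real tokenizer for exact counts); the history and output budgets are assumptions to adjust for your workload:

```python
import json


def recommend_num_ctx(tools: list[dict], history_budget: int = 4096,
                      output_budget: int = 2048) -> int:
    """Pick the smallest power-of-two context that fits the payload.

    `tools` is the same list sent in the /api/chat `tools` field; its
    serialized length divided by 4 approximates its token count.
    """
    tools_tokens = len(json.dumps(tools)) // 4
    needed = tools_tokens + history_budget + output_budget
    num_ctx = 8192  # Ollama's default
    while num_ctx < needed:
        num_ctx *= 2
    return num_ctx
```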
Increasing num_ctx has a direct memory cost. Each doubling of context roughly doubles the KV cache memory. On a 35B Q4 model:
- 8192 context: ~1GB KV cache
- 16384 context: ~2GB KV cache
- 32768 context: ~4GB KV cache
This is per parallel request slot.
Max Output Tokens (num_predict)
Recommended: 2048
{
  "options": {
    "num_predict": 2048
  }
}
The default num_predict of 128 is far too low for tool-calling responses. A tool call response containing JSON arguments can easily exceed 128 tokens, causing truncated JSON that the client cannot parse. Set this to at least 2048.
For responses that include markdown tables built from tool results, 4096 may be appropriate. However, larger values increase the maximum possible response time, so balance against your timeout budget.
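Truncation is detectable without trying to parse the broken JSON: recent Ollama releases report a `done_reason` field in non-streaming responses, where "length" means the num_predict cap was hit (verify the field name against your Ollama version):

```python
def is_truncated(response: dict) -> bool:
    """Detect output cut off by num_predict.

    Non-streaming Ollama responses carry `done_reason` ("stop" for a
    natural finish, "length" when the token cap was reached). Treat
    "length" as a signal to retry with a larger num_predict.
    """
    return response.get("done_reason") == "length"
```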
Parallel Request Slots (OLLAMA_NUM_PARALLEL)
Recommended: 1 for single-user, 2-4 for multi-user
export OLLAMA_NUM_PARALLEL=2
Each parallel slot maintains its own KV cache, multiplying the memory overhead:
| OLLAMA_NUM_PARALLEL | KV Cache Overhead (16K ctx) | Use Case |
|---|---|---|
| 1 | ~2GB | Single user, development |
| 2 | ~4GB | 2-3 concurrent users |
| 4 | ~8GB | Production multi-user |
| 8 | ~16GB | High concurrency (need 64GB+ total) |
With qwen3.5:35b-a3b-q4_K_M (~20GB weights) and 2 parallel slots at 16K context, total memory usage is approximately 24GB. This fits on a 36GB Mac (M4 Pro) with headroom for the operating system.
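The sizing arithmetic generalizes to a quick budget check. The defaults below use this guide's estimates (~20GB weights, ~2GB KV cache per slot at 16K context); the OS headroom figure is an assumption:

```python
def fits_in_memory(total_gb: float, weights_gb: float = 20.0,
                   kv_per_slot_gb: float = 2.0, parallel: int = 2,
                   os_headroom_gb: float = 8.0) -> bool:
    """Rough unified-memory budget check for an Ollama host.

    Returns True if weights plus per-slot KV cache plus OS headroom fit
    in the machine's total memory.
    """
    needed = weights_gb + kv_per_slot_gb * parallel + os_headroom_gb
    return needed <= total_gb
```

On the 36GB M4 Pro above: 20 + 2×2 + 8 = 32GB, which fits; raising OLLAMA_NUM_PARALLEL to 8 at the same context would not.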
When all parallel slots are busy, additional requests are queued. The queue depth is controlled by OLLAMA_MAX_QUEUE (default 512). Queued requests will wait until a slot becomes available, which can take 60-120 seconds per request for tool-calling workloads.
Model Keep-Alive (OLLAMA_KEEP_ALIVE)
Recommended: 24h
export OLLAMA_KEEP_ALIVE=24h
The default keep-alive of 5 minutes means the model is unloaded from memory after 5 minutes of inactivity. Reloading a 20GB model takes 10-30 seconds depending on disk speed. For a production endpoint, set this to 24h or longer so the model stays resident between requests.
Valid formats: 5m, 1h, 24h, 0 (never unload), -1 (unload immediately after each request).
Client Timeout Configuration
Ollama can take 60-120 seconds to process a request when the tool payload is large. The breakdown:
- Model loading (if not cached): 10-30 seconds
- Prompt evaluation (processing the 10K+ token tools payload through the model): 10-30 seconds
- Token generation: 20-60 seconds
Configure your HTTP client accordingly. In Python with httpx:
import httpx
client = httpx.Client(timeout=httpx.Timeout(
    connect=5.0,
    read=120.0,
    write=10.0,
    pool=5.0,
))
response = client.post("http://<ollama-host>:11434/api/chat", json={
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": messages,
    "tools": tools,
    "stream": False,
    "options": {"temperature": 0.1, "num_ctx": 16384, "num_predict": 2048},
})
With requests:
import requests
response = requests.post(
    "http://<ollama-host>:11434/api/chat",
    json=payload,
    timeout=120,  # seconds
)
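Under load, a timeout often just means every parallel slot was busy, so a single retry frequently succeeds once a slot frees up. A hedged sketch that wraps any request callable (requests exceptions derive from IOError/OSError, so they are caught here too):

```python
import time


def chat_with_retry(post, attempts: int = 3):
    """Retry a flaky request callable with linear backoff.

    `post` is any zero-argument callable that performs the request and
    returns the parsed JSON, e.g.:
        lambda: requests.post(url, json=payload, timeout=120).json()
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return post()
        except (TimeoutError, ConnectionError, OSError) as exc:
            last_exc = exc
            time.sleep(5 * (attempt + 1))  # wait 5s, 10s, ... between tries
    raise last_exc
```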
In Kubernetes deployment configurations, also set appropriate timeouts on any ingress or service mesh sidecar:
# Traefik IngressRoute annotation
metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: f3iai-timeout@kubernetescrd
Benchmarking
Tokens Per Second
Measure raw token generation speed with a simple prompt (no tools):
curl -s http://<ollama-host>:11434/api/generate \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "prompt": "Write a detailed explanation of Kubernetes pod scheduling.",
    "stream": false
  }' | python3 -c "
import sys, json
d = json.load(sys.stdin)
eval_duration_s = d['eval_duration'] / 1e9
tokens = d['eval_count']
print(f'Generated {tokens} tokens in {eval_duration_s:.1f}s = {tokens/eval_duration_s:.1f} tok/s')
prompt_duration_s = d['prompt_eval_duration'] / 1e9
prompt_tokens = d['prompt_eval_count']
print(f'Prompt eval: {prompt_tokens} tokens in {prompt_duration_s:.1f}s = {prompt_tokens/prompt_duration_s:.1f} tok/s')
"
Expected ranges for qwen3.5:35b-a3b-q4_K_M on Apple Silicon:
| Hardware | Generation (tok/s) | Prompt Eval (tok/s) |
|---|---|---|
| M2 Ultra 192GB | 30-40 | 200-400 |
| M4 Max 128GB | 25-35 | 150-300 |
| M4 Pro 48GB | 20-30 | 100-200 |
Tool-Calling Latency
Measure end-to-end latency for a tool-calling request with the full tools payload:
time curl -s http://<ollama-host>:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {"role": "system", "content": "You are an infrastructure assistant. Use tools to answer questions."},
      {"role": "user", "content": "Show me all clusters"}
    ],
    "tools": [... your full 150+ tools array ...],
    "stream": false,
    "options": {"temperature": 0.1, "num_ctx": 16384, "num_predict": 2048}
  }' > /dev/null
With 150+ tools, expect 30-90 seconds for the first request (includes model loading if cold) and 10-30 seconds for subsequent requests (model already loaded).
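time measures the whole round trip, but the response body itself breaks the latency down: non-streaming responses carry nanosecond duration fields (load_duration, prompt_eval_duration, eval_duration, total_duration). A small converter makes them readable:

```python
def latency_breakdown(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds.

    Works on the parsed JSON of a non-streaming /api/chat or
    /api/generate response; missing fields default to 0.
    """
    ns = 1e9
    return {
        "load_s": response.get("load_duration", 0) / ns,
        "prompt_eval_s": response.get("prompt_eval_duration", 0) / ns,
        "generate_s": response.get("eval_duration", 0) / ns,
        "total_s": response.get("total_duration", 0) / ns,
    }
```

A large `prompt_eval_s` relative to `generate_s` points at the tools payload, not token generation, as the bottleneck.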
Memory Monitoring
ollama ps
Check currently loaded models and their memory usage:
ollama ps
Example output:
NAME                      ID            SIZE     PROCESSOR    UNTIL
qwen3.5:35b-a3b-q4_K_M    a1b2c3d4e5    22 GB    100% GPU     24 hours from now
The SIZE column shows total memory consumption including weights and KV cache. The PROCESSOR column shows what percentage runs on the GPU; on Apple Silicon this should read 100% GPU, and anything less means layers have spilled to the CPU and generation will be much slower.
Activity Monitor
On macOS, open Activity Monitor and check:
- Memory tab: Look at “Memory Used” and “Memory Pressure” gauge. Green is healthy. Yellow means the system is compressing memory. Red means swapping to disk — inference will be extremely slow.
- GPU tab: If available (macOS Ventura+), check GPU utilization during inference. It should spike to 80-100% during token generation.
Command-Line Memory Check
# Total system memory and pressure
vm_stat | head -5
# Ollama process memory
ps aux | grep ollama
# Unified memory usage (requires macOS 13+)
sudo powermetrics --samplers gpu_power -i 1000 -n 1 2>/dev/null | grep -i "gpu\|memory"
Recommended Production Configuration
Summary of all tuning parameters for the Federal Frontier Platform:
Environment variables (set in launchctl plist or shell profile):
OLLAMA_HOST=0.0.0.0:11434
OLLAMA_KEEP_ALIVE=24h
OLLAMA_NUM_PARALLEL=2
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_FLASH_ATTENTION=1
Per-request options (sent in API call):
{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "options": {
    "temperature": 0.1,
    "num_ctx": 16384,
    "num_predict": 2048
  },
  "stream": false
}
System prompt:
You are an infrastructure assistant. Use the provided tools to answer questions. Format results as markdown tables.
Client timeout: 120 seconds.