Performance Tuning for Tool Calling
Optimizing Ollama inference parameters, system prompts, and resource allocation for agentic workloads with large tool payloads on the Federal Frontier Platform.
Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.
Overview
Tool-calling workloads are fundamentally different from general chat. The model receives a tools array that can exceed 10,000 tokens for 150+ tools, must select the correct tool from that list, generate valid JSON arguments, and then produce a coherent response from the tool results. Every parameter choice — temperature, context size, system prompt length — directly affects tool selection reliability.
This guide documents the specific tuning decisions made for the Federal Frontier Platform’s Compass AI interface, which serves qwen3.5:35b-a3b-q4_K_M with 150+ MCP tools.
Temperature
Recommended: 0.1
{
  "options": {
    "temperature": 0.1
  }
}
Temperature controls randomness in token selection. For tool calling:
- 0.0 (greedy): Fully deterministic. Always selects the highest-probability token. Can cause repetition loops on some models.
- 0.1 (near-deterministic): Slight variation prevents repetition while keeping tool selection highly reliable. This is the FFP production setting.
- 0.3-0.5: Acceptable for general chat but introduces occasional wrong tool selection with 150+ tools.
- 0.7-1.0: Too random for tool calling. The model will sometimes select plausible-but-wrong tools or generate malformed arguments.
Why 0.1 rather than 0.0: with greedy decoding, some models enter degenerate states where they repeat the same tool call or generate truncated JSON. The minimal randomness at 0.1 breaks these loops without meaningfully degrading tool selection accuracy.
System Prompt
Keep it to 1-2 lines. This is critical for Qwen 3.5.
Good — works reliably:
You are an infrastructure assistant. Use the provided tools to answer questions. Format results as markdown tables.
Bad — breaks tool calling:
You are an expert infrastructure assistant for the Federal Frontier Platform. You have access to a comprehensive ontology of infrastructure entities including clusters, deployments, services, images, vulnerabilities, and more. When a user asks about infrastructure, you should use the appropriate MCP tools to query the Federal Frontier Ontology (FFO) and return structured results. Always format your responses using markdown tables when presenting tabular data. If you are unsure which tool to use, prefer ffo.search for broad queries and ffo.entity.get for specific entity lookups. When traversing relationships, use ffo.traverse with appropriate depth parameters...
The long system prompt fills the model’s “instruction following” attention budget. When the system prompt is verbose and the tools array is large (150+ tools = 10K+ tokens), Qwen 3.5 begins to ignore the tool definitions and generates free-text responses instead of structured tool_calls. This failure mode is silent — the model does not error, it simply stops calling tools.
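Because the model returns a normal-looking message either way, the regression is easy to miss. A minimal client-side check can flag responses that came back as prose instead of tool calls (the helper names are hypothetical, not part of the Ollama API):

```python
def has_tool_calls(response: dict) -> bool:
    """True if a non-streaming /api/chat response contains structured tool
    calls; the assistant message carries an optional `tool_calls` list
    when the model chose a tool."""
    return bool(response.get("message", {}).get("tool_calls"))


def warn_if_freetext(response: dict) -> None:
    # Hypothetical guard: a prose answer to a query that should have hit a
    # tool is the signature of an oversized system prompt or truncated context.
    if not has_tool_calls(response):
        print("WARNING: no tool_calls in response; "
              "check system prompt length and num_ctx")
```

Wiring this into request logging makes the silent failure mode loud.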
Rules for system prompts with large tool payloads:
- Maximum 2 sentences.
- State the role and output format preference only.
- Do not describe individual tools — the tools array already does that.
- Do not include examples of tool usage — the model’s training handles this.
- Do not include behavioral guidelines beyond basic role framing.
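The rules above can be checked mechanically, for example in CI before a prompt change ships. A rough lint sketch; the thresholds are this guide's heuristics, not model-enforced limits:

```python
def lint_system_prompt(prompt: str, max_sentences: int = 2,
                       max_chars: int = 300) -> list[str]:
    """Flag system prompts likely to crowd out a large tools array.

    Returns a list of problems (empty means the prompt passes). Sentence
    counting is a crude split on terminal punctuation, which is enough
    for a guardrail.
    """
    problems = []
    sentences = [s for s in
                 prompt.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    if len(sentences) > max_sentences:
        problems.append(f"{len(sentences)} sentences (max {max_sentences})")
    if len(prompt) > max_chars:
        problems.append(f"{len(prompt)} chars (max {max_chars})")
    return problems
```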
Context Window (num_ctx)
Recommended: 16384
{
  "options": {
    "num_ctx": 16384
  }
}
The context window must be large enough to hold: system prompt + tools array + conversation history + model output.
For 150+ MCP tools, the tools array alone consumes approximately 8,000-12,000 tokens depending on how verbose the tool descriptions and parameter schemas are. With the default num_ctx of 8192, the tools array may be truncated, causing the model to lose awareness of tools near the end of the list.
| Tool Count | Approximate Tools Tokens | Recommended num_ctx |
|---|---|---|
| 10-20 | 1,000-2,000 | 8192 (default) |
| 20-50 | 2,000-5,000 | 8192-16384 |
| 50-100 | 5,000-8,000 | 16384 |
| 100-150+ | 8,000-12,000+ | 16384-32768 |
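The table maps to a simple sizing helper. This sketch uses the common ~4 characters-per-token approximation (run the real tokenizer for exact counts); the history and output budgets are assumptions to adjust for your workload:

```python
import json


def recommend_num_ctx(tools: list[dict], history_budget: int = 4096,
                      output_budget: int = 2048) -> int:
    """Pick the smallest power-of-two context that fits the payload.

    `tools` is the same list sent in the /api/chat `tools` field; its
    serialized length divided by 4 approximates its token count.
    """
    tools_tokens = len(json.dumps(tools)) // 4
    needed = tools_tokens + history_budget + output_budget
    num_ctx = 8192  # Ollama's default
    while num_ctx < needed:
        num_ctx *= 2
    return num_ctx
```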
Increasing num_ctx has a direct memory cost. Each doubling of context roughly doubles the KV cache memory. On a 35B Q4 model:
- 8192 context: ~1GB KV cache
- 16384 context: ~2GB KV cache
- 32768 context: ~4GB KV cache
This is per parallel request slot.
Max Output Tokens (num_predict)
Recommended: 2048
{
  "options": {
    "num_predict": 2048
  }
}
The default num_predict of 128 is far too low for tool-calling responses. A tool call response containing JSON arguments can easily exceed 128 tokens, causing truncated JSON that the client cannot parse. Set this to at least 2048.
For responses that include markdown tables built from tool results, 4096 may be appropriate. However, larger values increase the maximum possible response time, so balance against your timeout budget.
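Truncation is detectable without trying to parse the broken JSON: recent Ollama releases report a `done_reason` field in non-streaming responses, where "length" means the num_predict cap was hit (verify the field name against your Ollama version):

```python
def is_truncated(response: dict) -> bool:
    """Detect output cut off by num_predict.

    Non-streaming Ollama responses carry `done_reason` ("stop" for a
    natural finish, "length" when the token cap was reached). Treat
    "length" as a signal to retry with a larger num_predict.
    """
    return response.get("done_reason") == "length"
```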
Parallel Request Slots (OLLAMA_NUM_PARALLEL)
Recommended: 1 for single-user, 2-4 for multi-user
export OLLAMA_NUM_PARALLEL=2
Each parallel slot maintains its own KV cache, multiplying the memory overhead:
| OLLAMA_NUM_PARALLEL | KV Cache Overhead (16K ctx) | Use Case |
|---|---|---|
| 1 | ~2GB | Single user, development |
| 2 | ~4GB | 2-3 concurrent users |
| 4 | ~8GB | Production multi-user |
| 8 | ~16GB | High concurrency (need 64GB+ total) |
With qwen3.5:35b-a3b-q4_K_M (~20GB weights) and 2 parallel slots at 16K context, total memory usage is approximately 24GB. This fits on a 36GB Mac (M4 Pro) with headroom for the operating system.
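The sizing arithmetic generalizes to a quick budget check. The defaults below use this guide's estimates (~20GB weights, ~2GB KV cache per slot at 16K context); the OS headroom figure is an assumption:

```python
def fits_in_memory(total_gb: float, weights_gb: float = 20.0,
                   kv_per_slot_gb: float = 2.0, parallel: int = 2,
                   os_headroom_gb: float = 8.0) -> bool:
    """Rough unified-memory budget check for an Ollama host.

    Returns True if weights plus per-slot KV cache plus OS headroom fit
    in the machine's total memory.
    """
    needed = weights_gb + kv_per_slot_gb * parallel + os_headroom_gb
    return needed <= total_gb
```

On the 36GB M4 Pro above: 20 + 2×2 + 8 = 32GB, which fits; raising OLLAMA_NUM_PARALLEL to 8 at the same context would not.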
When all parallel slots are busy, additional requests are queued. The queue depth is controlled by OLLAMA_MAX_QUEUE (default 512). Queued requests will wait until a slot becomes available, which can take 60-120 seconds per request for tool-calling workloads.
Model Keep-Alive (OLLAMA_KEEP_ALIVE)
Recommended: 24h
export OLLAMA_KEEP_ALIVE=24h
The default keep-alive of 5 minutes means the model is unloaded from memory after 5 minutes of inactivity. Reloading a 20GB model takes 10-30 seconds depending on disk speed. For a production endpoint, set this to 24h or longer so the model stays resident between requests.
Valid formats: 5m, 1h, 24h, 0 (never unload), -1 (unload immediately after each request).
Client Timeout Configuration
Ollama can take 60-120 seconds to process a request when the tool payload is large. The breakdown:
- Model loading (if not cached): 10-30 seconds
- Prompt evaluation (processing the 10K+ token tools payload through the model): 10-30 seconds
- Token generation: 20-60 seconds
Configure your HTTP client accordingly. In Python with httpx:
import httpx
client = httpx.Client(timeout=httpx.Timeout(
    connect=5.0,
    read=120.0,
    write=10.0,
    pool=5.0,
))
response = client.post("http://<ollama-host>:11434/api/chat", json={
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": messages,
    "tools": tools,
    "stream": False,
    "options": {"temperature": 0.1, "num_ctx": 16384, "num_predict": 2048},
})
With requests:
import requests
response = requests.post(
    "http://<ollama-host>:11434/api/chat",
    json=payload,
    timeout=120,  # seconds
)
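Under load, a timeout often just means every parallel slot was busy, so a single retry frequently succeeds once a slot frees up. A hedged sketch that wraps any request callable (requests exceptions derive from IOError/OSError, so they are caught here too):

```python
import time


def chat_with_retry(post, attempts: int = 3):
    """Retry a flaky request callable with linear backoff.

    `post` is any zero-argument callable that performs the request and
    returns the parsed JSON, e.g.:
        lambda: requests.post(url, json=payload, timeout=120).json()
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return post()
        except (TimeoutError, ConnectionError, OSError) as exc:
            last_exc = exc
            time.sleep(5 * (attempt + 1))  # wait 5s, 10s, ... between tries
    raise last_exc
```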
In Kubernetes deployment configurations, also set appropriate timeouts on any ingress or service mesh sidecar:
# Traefik IngressRoute annotation
metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: f3iai-timeout@kubernetescrd
Benchmarking
Tokens Per Second
Measure raw token generation speed with a simple prompt (no tools):
curl -s http://<ollama-host>:11434/api/generate \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "prompt": "Write a detailed explanation of Kubernetes pod scheduling.",
    "stream": false
  }' | python3 -c "
import sys, json
d = json.load(sys.stdin)
eval_duration_s = d['eval_duration'] / 1e9
tokens = d['eval_count']
print(f'Generated {tokens} tokens in {eval_duration_s:.1f}s = {tokens/eval_duration_s:.1f} tok/s')
prompt_duration_s = d['prompt_eval_duration'] / 1e9
prompt_tokens = d['prompt_eval_count']
print(f'Prompt eval: {prompt_tokens} tokens in {prompt_duration_s:.1f}s = {prompt_tokens/prompt_duration_s:.1f} tok/s')
"
Expected ranges for qwen3.5:35b-a3b-q4_K_M on Apple Silicon:
| Hardware | Generation (tok/s) | Prompt Eval (tok/s) |
|---|---|---|
| M2 Ultra 192GB | 30-40 | 200-400 |
| M4 Max 128GB | 25-35 | 150-300 |
| M4 Pro 48GB | 20-30 | 100-200 |
Tool-Calling Latency
Measure end-to-end latency for a tool-calling request with the full tools payload:
time curl -s http://<ollama-host>:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {"role": "system", "content": "You are an infrastructure assistant. Use tools to answer questions."},
      {"role": "user", "content": "Show me all clusters"}
    ],
    "tools": [... your full 150+ tools array ...],
    "stream": false,
    "options": {"temperature": 0.1, "num_ctx": 16384, "num_predict": 2048}
  }' > /dev/null
With 150+ tools, expect 30-90 seconds for the first request (includes model loading if cold) and 10-30 seconds for subsequent requests (model already loaded).
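time measures the whole round trip, but the response body itself breaks the latency down: non-streaming responses carry nanosecond duration fields (load_duration, prompt_eval_duration, eval_duration, total_duration). A small converter makes them readable:

```python
def latency_breakdown(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds.

    Works on the parsed JSON of a non-streaming /api/chat or
    /api/generate response; missing fields default to 0.
    """
    ns = 1e9
    return {
        "load_s": response.get("load_duration", 0) / ns,
        "prompt_eval_s": response.get("prompt_eval_duration", 0) / ns,
        "generate_s": response.get("eval_duration", 0) / ns,
        "total_s": response.get("total_duration", 0) / ns,
    }
```

A large `prompt_eval_s` relative to `generate_s` points at the tools payload, not token generation, as the bottleneck.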
Memory Monitoring
ollama ps
Check currently loaded models and their memory usage:
ollama ps
Example output:
NAME                      ID            SIZE     PROCESSOR    UNTIL
qwen3.5:35b-a3b-q4_K_M    a1b2c3d4e5    22 GB    100% GPU     24 hours from now
The SIZE column shows total memory consumption including weights and KV cache. The PROCESSOR column shows what percentage runs on the GPU; on Apple Silicon this should read 100% GPU, and anything less means layers have spilled to the CPU and generation will be much slower.
Activity Monitor
On macOS, open Activity Monitor and check:
- Memory tab: Look at “Memory Used” and “Memory Pressure” gauge. Green is healthy. Yellow means the system is compressing memory. Red means swapping to disk — inference will be extremely slow.
- GPU tab: If available (macOS Ventura+), check GPU utilization during inference. It should spike to 80-100% during token generation.
Command-Line Memory Check
# Total system memory and pressure
vm_stat | head -5
# Ollama process memory
ps aux | grep ollama
# Unified memory usage (requires macOS 13+)
sudo powermetrics --samplers gpu_power -i 1000 -n 1 2>/dev/null | grep -i "gpu\|memory"
Recommended Production Configuration
Summary of all tuning parameters for the Federal Frontier Platform:
Environment variables (set in launchctl plist or shell profile):
OLLAMA_HOST=0.0.0.0:11434
OLLAMA_KEEP_ALIVE=24h
OLLAMA_NUM_PARALLEL=2
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_FLASH_ATTENTION=1
Per-request options (sent in API call):
{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "options": {
    "temperature": 0.1,
    "num_ctx": 16384,
    "num_predict": 2048
  },
  "stream": false
}
System prompt:
You are an infrastructure assistant. Use the provided tools to answer questions. Format results as markdown tables.
Client timeout: 120 seconds.