Model Selection Guide

Choosing the right model and quantization level for your workload, with specific guidance for agentic tool calling on the Federal Frontier Platform.

Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.

Model Families

Llama 3.1 / 3.2 / 3.3 (Meta)

The most widely used open-weight model family. Llama 3.1 is available in 8B, 70B, and 405B parameter sizes. Llama 3.2 added 1B and 3B sizes for edge deployment, plus multimodal (vision) variants at 11B and 90B. Llama 3.3 is a 70B model with improved instruction following.

Best for: General-purpose chat, instruction following, reasoning, code generation. The 8B variant is an excellent development and testing model. The 70B variant is competitive with GPT-4 class models on many benchmarks.

Tool calling: Llama 3.1+ supports tool calling natively. The 70B and 405B variants handle large tool payloads well. The 8B variant struggles with more than 20-30 tools.

ollama pull llama3.1:8b
ollama pull llama3.3:70b-instruct-q4_K_M

Qwen 2.5 / 3.5 (Alibaba)

Qwen models excel at structured output and tool calling. Qwen 2.5 is available from 0.5B to 72B. Qwen 3.5 introduced Mixture-of-Experts (MoE) architectures, with the 35B-A3B variant activating only 3 billion parameters per token while having 35 billion total — dramatically reducing inference cost while maintaining quality.

This is the model family the Federal Frontier Platform uses in production. The qwen3.5:35b-a3b-q4_K_M model handles 150+ MCP tools reliably, returning properly structured tool_calls via the native Ollama API.

Best for: Tool calling, function calling, structured JSON output, multilingual tasks. Qwen 3.5 MoE variants offer the best quality-per-VRAM ratio for agentic workloads.

ollama pull qwen3.5:35b-a3b-q4_K_M    # FFP production model
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull qwen2.5:72b-instruct-q4_K_M

Mistral / Mixtral (Mistral AI)

Mistral 7B punches well above its weight class. Mixtral 8x7B is a MoE model with ~47B total parameters that activates 2 of its 8 expert networks (~13B parameters) per token, delivering quality well beyond its 13B inference cost. Mistral Large (123B) competes with Llama 3.1 70B.

Best for: Fast inference, code generation, short-context tasks. Mixtral is excellent when you need speed over maximum quality.

ollama pull mistral:7b
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M

DeepSeek (DeepSeek AI)

DeepSeek R1 and DeepSeek V3 are strong reasoning models. DeepSeek Coder V2 is specialized for code generation and understanding. These models use chain-of-thought reasoning internally, which improves accuracy on complex tasks but increases output token count.

Best for: Complex reasoning, mathematical problem solving, code analysis. Not ideal for tool calling due to verbose chain-of-thought output that can interfere with structured responses.

ollama pull deepseek-r1:14b
ollama pull deepseek-coder-v2:16b

Phi-3 / Phi-4 (Microsoft)

Small but highly capable models. Phi-3 Mini (3.8B) and Phi-3 Medium (14B) achieve remarkable quality for their size. Phi-4 (14B) further improves reasoning. These are trained on synthetic data and curated textbook-quality content.

Best for: Resource-constrained environments, edge deployment, situations where you need a capable model in under 10GB of memory.

ollama pull phi3:14b
ollama pull phi4:14b

Gemma 2 (Google)

Available in 2B, 9B, and 27B sizes. Gemma 2 27B is competitive with much larger models. Good instruction following and safety alignment.

Best for: General-purpose tasks where you want a Google-ecosystem model. The 27B variant is a strong step up from Llama 3.1 8B when you have the memory.

ollama pull gemma2:27b

Command R+ (Cohere)

A 104B parameter model specifically designed for retrieval-augmented generation (RAG). It excels at grounded generation — producing responses that cite specific passages from provided context.

Best for: RAG pipelines where citation accuracy matters. Not recommended for tool calling.

ollama pull command-r-plus:104b-q4_K_M

Quantization Levels Explained

Quantization reduces model precision from 16-bit floating point to lower bit widths, reducing memory requirements and increasing inference speed at the cost of some quality. Ollama models are distributed in GGUF format with various quantization levels.

| Quantization | Bits/weight | Quality | Speed | Memory savings vs F16 | Notes |
|---|---|---|---|---|---|
| Q2_K | ~2.5 | Poor | Fastest | ~84% | Significant quality degradation. Only for experimentation. |
| Q3_K_M | ~3.5 | Fair | Very fast | ~78% | Noticeable quality loss on complex tasks. |
| Q4_K_M | ~4.5 | Good | Fast | ~72% | Recommended sweet spot. FFP production choice. |
| Q5_K_M | ~5.5 | Very good | Moderate | ~66% | Marginal quality improvement over Q4_K_M. |
| Q6_K | ~6.5 | Excellent | Moderate | ~59% | Near-lossless for most tasks. |
| Q8_0 | 8 | Near-perfect | Slower | ~50% | Best quantized option. Use when memory allows. |
| F16 | 16 | Perfect | Slowest | Baseline | Full precision. Rarely needed for inference. |

The _K_M suffix indicates “K-quant Medium” — a quantization method that uses different bit widths for different layers based on their sensitivity. Attention layers get higher precision, feedforward layers get lower precision. This preserves quality better than uniform quantization.

For tool calling workloads, Q4_K_M is the recommended quantization. The quality difference between Q4_K_M and Q8_0 is negligible for structured output tasks like function calling, and Q4_K_M uses roughly half the memory.
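As a sanity check on these figures, weight size can be estimated as parameters × bits-per-weight ÷ 8. A rough sketch using the approximate bit widths from the table above (real GGUF files carry some additional metadata overhead):

```shell
# Estimate quantized weight size in GB: params (billions) * bits/weight / 8.
# Bit widths are approximate effective values, not exact file sizes.
size_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'; }

size_gb 35 4.5   # Q4_K_M for a 35B model -> ~19.7 GB
size_gb 35 8     # Q8_0 for a 35B model   -> ~35.0 GB
```

The two results show where the "roughly half the memory" figure comes from: Q4_K_M's ~4.5 effective bits are a little over half of Q8_0's 8 bits.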

Memory Requirements

These are approximate values for the model weights only. Actual memory usage is higher due to the KV cache, which grows with context length and parallel request slots.

| Parameters | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| 1B | ~0.7GB | ~0.8GB | ~1GB | ~1.2GB | ~2GB |
| 3B | ~2GB | ~2.5GB | ~3GB | ~3.5GB | ~6GB |
| 7B | ~4GB | ~5GB | ~6GB | ~7.5GB | ~14GB |
| 13B | ~8GB | ~9.5GB | ~11GB | ~14GB | ~26GB |
| 34B (dense) | ~20GB | ~24GB | ~28GB | ~36GB | ~68GB |
| 35B MoE (A3B) | ~20GB | ~24GB | ~28GB | ~36GB | ~68GB |
| 70B | ~40GB | ~48GB | ~56GB | ~75GB | ~140GB |
| 104B | ~60GB | ~72GB | ~84GB | ~110GB | ~208GB |

KV cache overhead: Add approximately 1-2GB for 8K context, 2-4GB for 16K context, and 4-8GB for 32K context, per parallel request slot. With OLLAMA_NUM_PARALLEL=2 and 32K context, budget an additional 8-16GB beyond model weight size.
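The budget arithmetic is simple enough to sketch directly. The numbers below are illustrative midpoints of the ranges above, not measured values:

```shell
# Total memory budget = model weights + KV cache per parallel request slot.
weights_gb=20      # e.g. a 35B-class model at Q4_K_M (approx.)
kv_per_slot_gb=6   # midpoint of the 4-8GB range quoted for 32K context
slots=2            # OLLAMA_NUM_PARALLEL=2
echo "$(( weights_gb + kv_per_slot_gb * slots )) GB total"
```

With these assumptions the total lands at 32GB, squarely inside the 8-16GB-of-overhead range quoted above.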

Memory Planning for the Federal Frontier Platform

The FFP production configuration:

  • Model: qwen3.5:35b-a3b-q4_K_M (~20GB weights)
  • Context: 16384 tokens (150+ tools produce large prompts)
  • Parallel slots: 1-2
  • Estimated total: 24-28GB unified memory

This fits comfortably on a Mac with 36GB+ unified memory (M4 Pro or higher).
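One way to apply this configuration for local development is via Ollama's server environment variables. This is a sketch: `OLLAMA_NUM_PARALLEL` is long-standing, but `OLLAMA_CONTEXT_LENGTH` requires a reasonably recent Ollama release; on older versions, set context per-request through the API's `num_ctx` option instead.

```shell
# Size the Ollama server for the FFP-style workload described above.
export OLLAMA_NUM_PARALLEL=2          # two parallel request slots
export OLLAMA_CONTEXT_LENGTH=16384    # 16K context for large tool prompts
ollama serve
```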

Choosing a Model for Tool Calling

Tool calling is the most demanding use case for model selection because the model must:

  1. Parse a tools array containing JSON Schema definitions for every available tool.
  2. Decide which tool(s) to call based on the user’s natural language request.
  3. Generate valid JSON arguments matching the tool’s schema.
  4. Process tool results and generate a coherent final response.

With 150+ tools (as in the Federal Frontier Platform), steps 1-2 are especially challenging. The tools array alone can consume 10,000+ tokens of context.
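A back-of-envelope check on that figure (the per-tool token average below is an assumption for illustration, not a measured value):

```shell
# Rough context cost of the tools array: tool count * avg tokens per schema.
tools=150
tokens_per_tool=70   # assumed average for one serialized JSON Schema definition
echo "$(( tools * tokens_per_tool )) tokens before the user message is even added"
```

At ~70 tokens per definition, 150 tools already consume over 10,000 tokens, which is why the FFP configuration reserves a 16K context.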

| Model | Tool calling quality | Max tools tested | Notes |
|---|---|---|---|
| qwen3.5:35b-a3b-q4_K_M | Excellent | 150+ | FFP production model. Best tool selection accuracy. |
| qwen2.5:32b-instruct-q4_K_M | Very good | 100+ | Slightly less reliable than Qwen 3.5 with large tool counts. |
| llama3.1:70b-instruct-q4_K_M | Very good | 100+ | Good tool calling, but requires 40GB+ memory. |
| llama3.1:8b | Fair | 20-30 | Degrades rapidly beyond 30 tools. Fine for development. |
| mistral:7b | Fair | 10-20 | Limited tool calling support. |
| deepseek-r1:14b | Poor for tools | N/A | Chain-of-thought interferes with structured tool output. |

Why Qwen 3.5 for Tool Calling

Qwen 3.5 returns properly structured tool_calls objects when used with Ollama’s native /api/chat endpoint. This is critical — the model does not just generate text that looks like a function call; it returns machine-parseable tool call objects with function.name and function.arguments fields that client code can dispatch programmatically.
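A minimal request against the native endpoint looks like the sketch below. The get_system_status tool and its schema are hypothetical, purely for illustration:

```shell
# Call the native /api/chat endpoint with a tools array; a capable model
# responds with a structured message.tool_calls list, not free text.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "messages": [{"role": "user", "content": "Is the ingest service healthy?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_system_status",
      "description": "Return health status for a named service",
      "parameters": {
        "type": "object",
        "properties": {"service": {"type": "string"}},
        "required": ["service"]
      }
    }
  }],
  "stream": false
}'
# The response carries message.tool_calls[0].function.name and
# .function.arguments, ready for programmatic dispatch.
```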

The MoE architecture of qwen3.5:35b-a3b-q4_K_M is particularly well-suited for tool calling. Despite having 35B total parameters, only ~3B are activated per token, which means:

  • Fast token generation: Inference speed comparable to a dense 3B model.
  • Large capacity: The full 35B parameter space gives the model enough knowledge to correctly select from 150+ tools.
  • Low memory for quality: 20GB of VRAM for 35B-class quality, versus 40GB for a dense 70B model.

See the API Reference for details on the native vs OpenAI-compatible API differences that affect tool calling.