Model Selection Guide

Choosing the right model and quantization level for your workload, with specific guidance for agentic tool calling on the Federal Frontier Platform.

Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.

Model Families

Llama 3.1 / 3.2 / 3.3 (Meta)

The most widely used open-weight model family. Llama 3.1 is available in 8B, 70B, and 405B parameter sizes. Llama 3.2 added 1B and 3B sizes for edge deployment, plus multimodal (vision) variants at 11B and 90B. Llama 3.3 is a 70B model with improved instruction following.

Best for: General-purpose chat, instruction following, reasoning, code generation. The 8B variant is an excellent development and testing model. The 70B variant is competitive with GPT-4 class models on many benchmarks.

Tool calling: Llama 3.1+ supports tool calling natively. The 70B and 405B variants handle large tool payloads well. The 8B variant struggles with more than 20-30 tools.

ollama pull llama3.1:8b
ollama pull llama3.3:70b-instruct-q4_K_M

Qwen 2.5 / 3.5 (Alibaba)

Qwen models excel at structured output and tool calling. Qwen 2.5 is available from 0.5B to 72B. Qwen 3.5 introduced Mixture-of-Experts (MoE) architectures, with the 35B-A3B variant activating only 3 billion parameters per token while having 35 billion total — dramatically reducing inference cost while maintaining quality.

This is the model family the Federal Frontier Platform uses in production. The qwen3.5:35b-a3b-q4_K_M model handles 150+ MCP tools reliably, returning properly structured tool_calls via the native Ollama API.

Best for: Tool calling, function calling, structured JSON output, multilingual tasks. Qwen 3.5 MoE variants offer the best quality-per-VRAM ratio for agentic workloads.

ollama pull qwen3.5:35b-a3b-q4_K_M    # FFP production model
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull qwen2.5:72b-instruct-q4_K_M

Mistral / Mixtral (Mistral AI)

Mistral 7B punches well above its weight class. Mixtral 8x7B is a MoE model with ~47B total parameters that activates 2 of its 8 expert networks (~13B parameters) per token, delivering quality well beyond its 13B inference cost. Mistral Large (123B) competes with Llama 3.1 70B.

Best for: Fast inference, code generation, short-context tasks. Mixtral is excellent when you need speed over maximum quality.

ollama pull mistral:7b
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M

DeepSeek (DeepSeek AI)

DeepSeek R1 and DeepSeek V3 are strong reasoning models. DeepSeek Coder V2 is specialized for code generation and understanding. These models use chain-of-thought reasoning internally, which improves accuracy on complex tasks but increases output token count.

Best for: Complex reasoning, mathematical problem solving, code analysis. Not ideal for tool calling due to verbose chain-of-thought output that can interfere with structured responses.

ollama pull deepseek-r1:14b
ollama pull deepseek-coder-v2:16b

Phi-3 / Phi-4 (Microsoft)

Small but highly capable models. Phi-3 Mini (3.8B) and Phi-3 Medium (14B) achieve remarkable quality for their size. Phi-4 (14B) further improves reasoning. These are trained on synthetic data and curated textbook-quality content.

Best for: Resource-constrained environments, edge deployment, situations where you need a capable model in under 10GB of memory.

ollama pull phi3:14b
ollama pull phi4:14b

Gemma 2 (Google)

Available in 2B, 9B, and 27B sizes. Gemma 2 27B is competitive with much larger models. Good instruction following and safety alignment.

Best for: General-purpose tasks where you want a Google-ecosystem model. The 27B variant is a strong step up from Llama 3.1 8B when you have the memory.

ollama pull gemma2:27b

Command R+ (Cohere)

A 104B parameter model specifically designed for retrieval-augmented generation (RAG). It excels at grounded generation — producing responses that cite specific passages from provided context.

Best for: RAG pipelines where citation accuracy matters. Not recommended for tool calling.

ollama pull command-r-plus:104b-q4_K_M

Quantization Levels Explained

Quantization reduces model precision from 16-bit floating point to lower bit widths, reducing memory requirements and increasing inference speed at the cost of some quality. Ollama models are distributed in GGUF format with various quantization levels.

| Quantization | Bits/weight | Quality | Speed | Memory savings vs F16 | Notes |
|---|---|---|---|---|---|
| Q2_K | ~2.5 | Poor | Fastest | ~84% | Significant quality degradation. Only for experimentation. |
| Q3_K_M | ~3.5 | Fair | Very fast | ~78% | Noticeable quality loss on complex tasks. |
| Q4_K_M | ~4.5 | Good | Fast | ~72% | Recommended sweet spot. FFP production choice. |
| Q5_K_M | ~5.5 | Very good | Moderate | ~66% | Marginal quality improvement over Q4_K_M. |
| Q6_K | ~6.5 | Excellent | Moderate | ~59% | Near-lossless for most tasks. |
| Q8_0 | 8 | Near-perfect | Slower | ~50% | Best quantized option. Use when memory allows. |
| F16 | 16 | Perfect | Slowest | Baseline | Full precision. Rarely needed for inference. |

The _K_M suffix indicates “K-quant Medium” — a quantization method that uses different bit widths for different layers based on their sensitivity. Attention layers get higher precision, feedforward layers get lower precision. This preserves quality better than uniform quantization.

For tool calling workloads, Q4_K_M is the recommended quantization. The quality difference between Q4_K_M and Q8_0 is negligible for structured output tasks like function calling, and Q4_K_M uses roughly half the memory.
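As a sanity check on these figures, weight size can be estimated as parameters × bits-per-weight ÷ 8. A rough sketch using the approximate bit widths from the table above (real GGUF files carry some additional metadata overhead):

```shell
# Estimate quantized weight size in GB: params (billions) * bits/weight / 8.
# Bit widths are approximate effective values, not exact file sizes.
size_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'; }

size_gb 35 4.5   # Q4_K_M for a 35B model -> ~19.7 GB
size_gb 35 8     # Q8_0 for a 35B model   -> ~35.0 GB
```

The two results show where the "roughly half the memory" figure comes from: Q4_K_M's ~4.5 effective bits are a little over half of Q8_0's 8 bits.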

Memory Requirements

These are approximate values for the model weights only. Actual memory usage is higher due to the KV cache, which grows with context length and parallel request slots.

| Parameters | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| 1B | ~0.7GB | ~0.8GB | ~1GB | ~1.2GB | ~2GB |
| 3B | ~2GB | ~2.5GB | ~3GB | ~3.5GB | ~6GB |
| 7B | ~4GB | ~5GB | ~6GB | ~7.5GB | ~14GB |
| 13B | ~8GB | ~9.5GB | ~11GB | ~14GB | ~26GB |
| 34B (dense) | ~20GB | ~24GB | ~28GB | ~36GB | ~68GB |
| 35B MoE (A3B) | ~20GB | ~24GB | ~28GB | ~36GB | ~68GB |
| 70B | ~40GB | ~48GB | ~56GB | ~75GB | ~140GB |
| 104B | ~60GB | ~72GB | ~84GB | ~110GB | ~208GB |

KV cache overhead: Add approximately 1-2GB for 8K context, 2-4GB for 16K context, and 4-8GB for 32K context, per parallel request slot. With OLLAMA_NUM_PARALLEL=2 and 32K context, budget an additional 8-16GB beyond model weight size.
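The budget arithmetic is simple enough to sketch directly. The numbers below are illustrative midpoints of the ranges above, not measured values:

```shell
# Total memory budget = model weights + KV cache per parallel request slot.
weights_gb=20      # e.g. a 35B-class model at Q4_K_M (approx.)
kv_per_slot_gb=6   # midpoint of the 4-8GB range quoted for 32K context
slots=2            # OLLAMA_NUM_PARALLEL=2
echo "$(( weights_gb + kv_per_slot_gb * slots )) GB total"
```

With these assumptions the total lands at 32GB, squarely inside the 8-16GB-of-overhead range quoted above.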

Memory Planning for the Federal Frontier Platform

The FFP production configuration:

  • Model: qwen3.5:35b-a3b-q4_K_M (~20GB weights)
  • Context: 16384 tokens (150+ tools produce large prompts)
  • Parallel slots: 1-2
  • Estimated total: 24-28GB unified memory

This fits comfortably on a Mac with 36GB+ unified memory (M4 Pro or higher).
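One way to apply this configuration for local development is via Ollama's server environment variables. This is a sketch: `OLLAMA_NUM_PARALLEL` is long-standing, but `OLLAMA_CONTEXT_LENGTH` requires a reasonably recent Ollama release; on older versions, set context per-request through the API's `num_ctx` option instead.

```shell
# Size the Ollama server for the FFP-style workload described above.
export OLLAMA_NUM_PARALLEL=2          # two parallel request slots
export OLLAMA_CONTEXT_LENGTH=16384    # 16K context for large tool prompts
ollama serve
```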

Choosing a Model for Tool Calling

Tool calling is the most demanding use case for model selection because the model must:

  1. Parse a tools array containing JSON Schema definitions for every available tool.
  2. Decide which tool(s) to call based on the user’s natural language request.
  3. Generate valid JSON arguments matching the tool’s schema.
  4. Process tool results and generate a coherent final response.

With 150+ tools (as in the Federal Frontier Platform), steps 1-2 are especially challenging. The tools array alone can consume 10,000+ tokens of context.
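A back-of-envelope check on that figure (the per-tool token average below is an assumption for illustration, not a measured value):

```shell
# Rough context cost of the tools array: tool count * avg tokens per schema.
tools=150
tokens_per_tool=70   # assumed average for one serialized JSON Schema definition
echo "$(( tools * tokens_per_tool )) tokens before the user message is even added"
```

At ~70 tokens per definition, 150 tools already consume over 10,000 tokens, which is why the FFP configuration reserves a 16K context.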

| Model | Tool calling quality | Max tools tested | Notes |
|---|---|---|---|
| qwen3.5:35b-a3b-q4_K_M | Excellent | 150+ | FFP production model. Best tool selection accuracy. |
| qwen2.5:32b-instruct-q4_K_M | Very good | 100+ | Slightly less reliable than Qwen 3.5 with large tool counts. |
| llama3.1:70b-instruct-q4_K_M | Very good | 100+ | Good tool calling, but requires 40GB+ memory. |
| llama3.1:8b | Fair | 20-30 | Degrades rapidly beyond 30 tools. Fine for development. |
| mistral:7b | Fair | 10-20 | Limited tool calling support. |
| deepseek-r1:14b | Poor for tools | N/A | Chain-of-thought interferes with structured tool output. |

Why Qwen 3.5 for Tool Calling

Qwen 3.5 returns properly structured tool_calls objects when used with Ollama’s native /api/chat endpoint. This is critical — the model does not just generate text that looks like a function call; it returns machine-parseable tool call objects with function.name and function.arguments fields that client code can dispatch programmatically.
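A minimal request against the native endpoint looks like the sketch below. The get_system_status tool and its schema are hypothetical, purely for illustration:

```shell
# Call the native /api/chat endpoint with a tools array; a capable model
# responds with a structured message.tool_calls list, not free text.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "messages": [{"role": "user", "content": "Is the ingest service healthy?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_system_status",
      "description": "Return health status for a named service",
      "parameters": {
        "type": "object",
        "properties": {"service": {"type": "string"}},
        "required": ["service"]
      }
    }
  }],
  "stream": false
}'
# The response carries message.tool_calls[0].function.name and
# .function.arguments, ready for programmatic dispatch.
```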

The MoE architecture of qwen3.5:35b-a3b-q4_K_M is particularly well-suited for tool calling. Despite having 35B total parameters, only ~3B are activated per token, which means:

  • Fast token generation: Inference speed comparable to a dense 3B model.
  • Large capacity: The full 35B parameter space gives the model enough knowledge to correctly select from 150+ tools.
  • Low memory for quality: 20GB of VRAM for 35B-class quality, versus 40GB for a dense 70B model.

See the API Reference for details on the native vs OpenAI-compatible API differences that affect tool calling.