Ollama API Reference
Complete reference for the Ollama native API and OpenAI-compatible API, with specific guidance on tool calling behavior differences.
Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.
Two APIs, Different Capabilities
Ollama exposes two HTTP APIs:
| API | Base Path | Tool Calling | Streaming | Use Case |
|---|---|---|---|---|
| Native | /api/ | Full support | Yes | Production tool calling, embeddings, model management |
| OpenAI-compatible | /v1/ | Incomplete | Yes | Drop-in replacement for OpenAI SDK clients |
For tool calling, you MUST use the native /api/chat endpoint. The OpenAI-compatible /v1/chat/completions endpoint does not reliably return tool_calls in the response — it may drop them entirely or return them as plain text. This is a known limitation and the single most common integration mistake.
Native API Endpoints
POST /api/chat — Chat Completion
The primary endpoint for conversational inference with tool calling support.
Basic chat request:
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {"role": "system", "content": "You are an infrastructure assistant."},
      {"role": "user", "content": "What pods are running in the f3iai namespace?"}
    ],
    "stream": false
  }' | python3 -m json.tool
Response:
{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "created_at": "2026-03-21T12:00:00.000Z",
  "message": {
    "role": "assistant",
    "content": "I'll check the pods in the f3iai namespace for you..."
  },
  "done": true,
  "total_duration": 2500000000,
  "load_duration": 50000000,
  "prompt_eval_count": 42,
  "prompt_eval_duration": 800000000,
  "eval_count": 35,
  "eval_duration": 1600000000
}
POST /api/chat — Chat with Tool Calling
This is the critical flow for the Federal Frontier Platform. The request includes a tools array defining available functions, and the model responds with tool_calls when it determines a tool should be invoked.
Step 1: Send message with tools
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {
        "role": "system",
        "content": "You are an infrastructure assistant. Use tools to answer questions."
      },
      {
        "role": "user",
        "content": "Show me all clusters in the ontology"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "ffo_search",
          "description": "Search the Federal Frontier Ontology for entities by type and/or name pattern",
          "parameters": {
            "type": "object",
            "properties": {
              "entity_type": {
                "type": "string",
                "description": "The ontology entity type to search for (e.g., cluster, deployment, service)"
              },
              "name_pattern": {
                "type": "string",
                "description": "Optional name pattern to filter results"
              },
              "limit": {
                "type": "integer",
                "description": "Maximum number of results to return",
                "default": 25
              }
            },
            "required": ["entity_type"]
          }
        }
      }
    ],
    "stream": false
  }' | python3 -m json.tool
Response with tool_calls:
{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "created_at": "2026-03-21T12:00:01.000Z",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "ffo_search",
          "arguments": {
            "entity_type": "cluster",
            "limit": 25
          }
        }
      }
    ]
  },
  "done": true
}
Note that content is empty when the model decides to call a tool. The structured tool_calls array contains the function name and the parsed arguments as a JSON object, not a string (the OpenAI API, by contrast, returns arguments as a JSON-encoded string that the client must parse itself).
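Because arguments arrives pre-parsed, client code can consume it directly. A small helper (extract_tool_calls is a hypothetical name written for this guide, not an Ollama API) makes the handling explicit and also tolerates OpenAI-style string-encoded arguments defensively:

```python
import json


def extract_tool_calls(message: dict) -> list:
    """Return (name, arguments) pairs from an /api/chat assistant message.

    Ollama's native API returns arguments as a parsed JSON object; this
    helper also accepts OpenAI-style JSON-encoded strings just in case.
    """
    calls = []
    for call in message.get("tool_calls") or []:
        func = call["function"]
        args = func["arguments"]
        if isinstance(args, str):  # OpenAI-style string-encoded arguments
            args = json.loads(args)
        calls.append((func["name"], args))
    return calls


# The assistant message from the response shown above
message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"function": {"name": "ffo_search",
                      "arguments": {"entity_type": "cluster", "limit": 25}}}
    ],
}
print(extract_tool_calls(message))
# prints: [('ffo_search', {'entity_type': 'cluster', 'limit': 25})]
```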
Step 2: Execute the tool and send results back
After executing the tool, send the result back as a tool role message:
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {
        "role": "system",
        "content": "You are an infrastructure assistant. Use tools to answer questions."
      },
      {
        "role": "user",
        "content": "Show me all clusters in the ontology"
      },
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "function": {
              "name": "ffo_search",
              "arguments": {
                "entity_type": "cluster",
                "limit": 25
              }
            }
          }
        ]
      },
      {
        "role": "tool",
        "content": "[{\"name\": \"texas-dell-04\", \"type\": \"cluster\", \"provider\": \"bare-metal\"}, {\"name\": \"gke-prod-east\", \"type\": \"cluster\", \"provider\": \"gcp\"}]"
      }
    ],
    "stream": false
  }' | python3 -m json.tool
Response — final answer:
{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "created_at": "2026-03-21T12:00:02.000Z",
  "message": {
    "role": "assistant",
    "content": "Here are the clusters in the ontology:\n\n| Name | Provider |\n|---|---|\n| texas-dell-04 | bare-metal |\n| gke-prod-east | gcp |\n\nThere are 2 clusters registered."
  },
  "done": true
}
POST /api/generate — Text Completion
For non-conversational, single-prompt completion. Does not support tool calling.
curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "prompt": "Explain Kubernetes network policies in one paragraph.",
    "stream": false
  }' | python3 -m json.tool
POST /api/embeddings — Generate Embeddings
Generate vector embeddings for text. Requires an embedding model.
curl -s http://localhost:11434/api/embeddings \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Kubernetes cluster running in the f3iai namespace"
  }' | python3 -m json.tool
Response contains an embedding array of floats (768 dimensions for nomic-embed-text).
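Embedding vectors are typically compared by cosine similarity (e.g., for semantic search over ontology entities). A minimal stdlib-only sketch; the 3-dimensional vectors here are toy stand-ins for real 768-dimensional nomic-embed-text output:

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy stand-ins for real embedding output from /api/embeddings
doc_vec = [0.1, 0.3, 0.5]
query_vec = [0.2, 0.2, 0.6]
print(round(cosine_similarity(doc_vec, query_vec), 3))
# prints: 0.968
```

In practice you would embed each document once, store the vectors, and rank documents by similarity to the embedded query.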
GET /api/tags — List Models
Returns all downloaded models with their sizes and modification dates.
curl -s http://localhost:11434/api/tags | python3 -m json.tool
This is the standard health check endpoint — if it returns a valid JSON response, the Ollama server is running and accessible.
POST /api/show — Model Information
Get detailed information about a specific model, including its parameters, template, and license.
curl -s http://localhost:11434/api/show \
-d '{"name": "qwen3.5:35b-a3b-q4_K_M"}' | python3 -m json.tool
POST /api/pull — Pull a Model
Download a model from the registry. Useful for scripting model deployment.
curl -s http://localhost:11434/api/pull \
-d '{"name": "qwen3.5:35b-a3b-q4_K_M", "stream": false}'
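With "stream": true (the default), /api/pull emits newline-delimited JSON status objects, including "completed" and "total" byte counts while layers download. A small formatter for those lines, written under that assumed shape (format_pull_progress is a hypothetical helper for this guide):

```python
import json


def format_pull_progress(line: bytes) -> str:
    """Render one streamed /api/pull status line as a progress string.

    Streamed pull responses are newline-delimited JSON objects with a
    "status" field and, while downloading, "completed"/"total" byte counts.
    """
    chunk = json.loads(line)
    status = chunk.get("status", "")
    if "total" in chunk and "completed" in chunk:
        pct = 100 * chunk["completed"] / chunk["total"]
        return f"{status}: {pct:.1f}%"
    return status


# Sample status lines in the shape the streaming endpoint emits
print(format_pull_progress(b'{"status": "pulling manifest"}'))
print(format_pull_progress(
    b'{"status": "downloading", "completed": 524288000, "total": 1048576000}'))
# prints: pulling manifest
# prints: downloading: 50.0%
```

Feed each line from response.iter_lines() through this to get a live progress display when scripting model deployment.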
DELETE /api/delete — Delete a Model
Remove a model from local storage.
curl -s -X DELETE http://localhost:11434/api/delete \
-d '{"name": "llama3.1:8b"}'
OpenAI-Compatible API
POST /v1/chat/completions
Drop-in compatible with the OpenAI Chat Completions API. Useful for tools and libraries that expect the OpenAI API format.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {"role": "user", "content": "Hello, what can you help me with?"}
    ]
  }' | python3 -m json.tool
GET /v1/models
List available models in OpenAI-compatible format.
curl -s http://localhost:11434/v1/models | python3 -m json.tool
Do NOT use the /v1 endpoint for tool calling. Even if you pass a tools array to /v1/chat/completions, the response may not include the tool_calls field. Use /api/chat instead.
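One way to catch this mistake early is a guard in your client code that rejects tool-bearing requests aimed at the /v1 path. A defensive sketch (validate_tool_request is a hypothetical helper written for this guide):

```python
def validate_tool_request(url: str, payload: dict) -> None:
    """Raise if a request carrying tools targets the OpenAI-compatible path."""
    if payload.get("tools") and "/v1/" in url:
        raise ValueError(
            "Tool calling requires the native /api/chat endpoint; "
            f"got OpenAI-compatible URL: {url}"
        )


# Passes: tools routed to the native endpoint
validate_tool_request("http://localhost:11434/api/chat", {"tools": [{}]})

# Raises ValueError: tools sent to /v1/chat/completions
try:
    validate_tool_request("http://localhost:11434/v1/chat/completions",
                          {"tools": [{}]})
except ValueError as err:
    print(err)
```

Calling this once before every POST turns a silent loss of tool_calls into an immediate, debuggable error.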
Python Client Examples
Basic Chat with requests
import requests
response = requests.post("http://<ollama-host>:11434/api/chat", json={
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
        {"role": "user", "content": "List Kubernetes namespaces"}
    ],
    "stream": False
})
data = response.json()
print(data["message"]["content"])
Tool Calling Loop with requests
import requests
import json

OLLAMA_URL = "http://<ollama-host>:11434/api/chat"
MODEL = "qwen3.5:35b-a3b-q4_K_M"

tools = [
    {
        "type": "function",
        "function": {
            "name": "ffo_search",
            "description": "Search the ontology for entities",
            "parameters": {
                "type": "object",
                "properties": {
                    "entity_type": {"type": "string"},
                    "limit": {"type": "integer", "default": 25}
                },
                "required": ["entity_type"]
            }
        }
    }
]

# Tool implementation registry
def execute_tool(name: str, arguments: dict) -> str:
    """Dispatch tool calls to actual implementations."""
    if name == "ffo_search":
        # Call your actual MCP tool here
        return json.dumps([{"name": "texas-dell-04", "type": "cluster"}])
    raise ValueError(f"Unknown tool: {name}")

# Conversation loop
messages = [
    {"role": "system", "content": "You are an infrastructure assistant. Use tools to answer questions."},
    {"role": "user", "content": "What clusters exist?"}
]

while True:
    response = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": messages,
        "tools": tools,
        "stream": False,
        "options": {"temperature": 0.1}
    }, timeout=120)
    data = response.json()
    assistant_message = data["message"]
    messages.append(assistant_message)

    # Check if the model wants to call tools
    if not assistant_message.get("tool_calls"):
        # No tool calls — this is the final response
        print(assistant_message["content"])
        break

    # Execute each tool call
    for tool_call in assistant_message["tool_calls"]:
        func = tool_call["function"]
        result = execute_tool(func["name"], func["arguments"])
        messages.append({
            "role": "tool",
            "content": result
        })
Streaming Chat
import requests
import json

response = requests.post("http://<ollama-host>:11434/api/chat", json={
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
        {"role": "user", "content": "Explain service meshes"}
    ],
    "stream": True
}, stream=True)

for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if chunk.get("message", {}).get("content"):
            print(chunk["message"]["content"], end="", flush=True)
        if chunk.get("done"):
            print()  # Final newline
Request Options
Both /api/chat and /api/generate accept an options object for controlling inference parameters:
{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "messages": [...],
  "options": {
    "temperature": 0.1,
    "num_ctx": 16384,
    "num_predict": 2048,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1
  }
}
| Option | Default | Description |
|---|---|---|
| temperature | 0.8 | Randomness. Use 0.1 for tool calling. |
| num_ctx | 8192 | Context window size in tokens. |
| num_predict | 128 | Maximum tokens to generate. Increase for long responses. |
| top_p | 0.9 | Nucleus sampling threshold. |
| top_k | 40 | Top-K sampling. |
| repeat_penalty | 1.1 | Penalty for repeating tokens. |
| seed | random | Set for reproducible output. |
| stop | none | Array of stop sequences. |
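The values above can be bundled into a reusable options builder for tool-calling requests. A sketch based on the defaults and recommendations in this table (tool_calling_options is a hypothetical helper, not part of any SDK):

```python
def tool_calling_options(num_ctx: int = 16384, seed=None) -> dict:
    """Build an options dict tuned for deterministic tool calling.

    Low temperature keeps function names and arguments stable; num_ctx
    must be large enough to hold tool schemas plus the conversation.
    """
    options = {
        "temperature": 0.1,
        "num_ctx": num_ctx,
        "num_predict": 2048,
    }
    if seed is not None:
        options["seed"] = seed  # fixed seed for reproducible output
    return options


# Merge into a request body alongside model, messages, and tools
print(tool_calling_options(seed=42))
# prints: {'temperature': 0.1, 'num_ctx': 16384, 'num_predict': 2048, 'seed': 42}
```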
See the Performance Tuning guide for recommended values for tool-calling workloads.