Ollama API Reference

Complete reference for the Ollama native API and OpenAI-compatible API, with specific guidance on tool calling behavior differences.

Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.

Two APIs, Different Capabilities

Ollama exposes two HTTP APIs:

| API               | Base Path | Tool Calling | Streaming | Use Case                                              |
|-------------------|-----------|--------------|-----------|-------------------------------------------------------|
| Native            | /api/     | Full support | Yes       | Production tool calling, embeddings, model management |
| OpenAI-compatible | /v1/      | Incomplete   | Yes       | Drop-in replacement for OpenAI SDK clients            |

For tool calling, you MUST use the native /api/chat endpoint. The OpenAI-compatible /v1/chat/completions endpoint does not reliably return tool_calls in the response — it may drop them entirely or return them as plain text. This is a known limitation and the single most common integration mistake.

Native API Endpoints

POST /api/chat — Chat Completion

The primary endpoint for conversational inference with tool calling support.

Basic chat request:

curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {"role": "system", "content": "You are an infrastructure assistant."},
      {"role": "user", "content": "What pods are running in the f3iai namespace?"}
    ],
    "stream": false
  }' | python3 -m json.tool

Response:

{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "created_at": "2026-03-21T12:00:00.000Z",
  "message": {
    "role": "assistant",
    "content": "I'll check the pods in the f3iai namespace for you..."
  },
  "done": true,
  "total_duration": 2500000000,
  "load_duration": 50000000,
  "prompt_eval_count": 42,
  "prompt_eval_duration": 800000000,
  "eval_count": 35,
  "eval_duration": 1600000000
}
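All `*_duration` fields in the response are reported in nanoseconds. A small helper (a sketch, not part of any SDK) turns the response stats into a throughput figure:

```python
def tokens_per_second(stats: dict) -> float:
    """Compute generation throughput from an /api/chat response.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds, so divide by 1e9 to get seconds.
    """
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)
```

For the sample response above (35 tokens in 1.6 s) this works out to about 21.9 tokens/sec.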

POST /api/chat — Chat with Tool Calling

This is the critical flow for the Federal Frontier Platform. The request includes a tools array defining available functions, and the model responds with tool_calls when it determines a tool should be invoked.

Step 1: Send message with tools

curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {
        "role": "system",
        "content": "You are an infrastructure assistant. Use tools to answer questions."
      },
      {
        "role": "user",
        "content": "Show me all clusters in the ontology"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "ffo_search",
          "description": "Search the Federal Frontier Ontology for entities by type and/or name pattern",
          "parameters": {
            "type": "object",
            "properties": {
              "entity_type": {
                "type": "string",
                "description": "The ontology entity type to search for (e.g., cluster, deployment, service)"
              },
              "name_pattern": {
                "type": "string",
                "description": "Optional name pattern to filter results"
              },
              "limit": {
                "type": "integer",
                "description": "Maximum number of results to return",
                "default": 25
              }
            },
            "required": ["entity_type"]
          }
        }
      }
    ],
    "stream": false
  }' | python3 -m json.tool

Response with tool_calls:

{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "created_at": "2026-03-21T12:00:01.000Z",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "ffo_search",
          "arguments": {
            "entity_type": "cluster",
            "limit": 25
          }
        }
      }
    ]
  },
  "done": true
}

Note that content is empty when the model decides to call a tool. The structured tool_calls array contains the function name and parsed arguments as a JSON object (not a string).
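Because the native API returns arguments as a parsed object while OpenAI-style responses encode them as a JSON string, a client that must handle both can normalize with a small helper (a sketch, not part of any SDK):

```python
import json


def normalize_arguments(raw):
    """Return tool-call arguments as a dict, whether the server sent a
    parsed object (Ollama /api/chat) or a JSON string (OpenAI style)."""
    if isinstance(raw, str):
        return json.loads(raw)
    return raw
```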

Step 2: Execute the tool and send results back

After executing the tool, send the result back as a tool role message:

curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {
        "role": "system",
        "content": "You are an infrastructure assistant. Use tools to answer questions."
      },
      {
        "role": "user",
        "content": "Show me all clusters in the ontology"
      },
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "function": {
              "name": "ffo_search",
              "arguments": {
                "entity_type": "cluster",
                "limit": 25
              }
            }
          }
        ]
      },
      {
        "role": "tool",
        "content": "[{\"name\": \"texas-dell-04\", \"type\": \"cluster\", \"provider\": \"bare-metal\"}, {\"name\": \"gke-prod-east\", \"type\": \"cluster\", \"provider\": \"gcp\"}]"
      }
    ],
    "stream": false
  }' | python3 -m json.tool

Response — final answer:

{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "created_at": "2026-03-21T12:00:02.000Z",
  "message": {
    "role": "assistant",
    "content": "Here are the clusters in the ontology:\n\n| Name | Provider |\n|---|---|\n| texas-dell-04 | bare-metal |\n| gke-prod-east | gcp |\n\nThere are 2 clusters registered."
  },
  "done": true
}

POST /api/generate — Text Completion

For non-conversational, single-prompt completion. Does not support tool calling.

curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "prompt": "Explain Kubernetes network policies in one paragraph.",
    "stream": false
  }' | python3 -m json.tool

POST /api/embeddings — Generate Embeddings

Generate vector embeddings for text. Requires an embedding model.

curl -s http://localhost:11434/api/embeddings \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Kubernetes cluster running in the f3iai namespace"
  }' | python3 -m json.tool

Response contains an embedding array of floats (768 dimensions for nomic-embed-text).
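Embedding vectors are typically compared with cosine similarity. A minimal pure-Python sketch (no numpy) that works on the returned float arrays:

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```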

GET /api/tags — List Models

Returns all downloaded models with their sizes and modification dates.

curl -s http://localhost:11434/api/tags | python3 -m json.tool

This is the standard health check endpoint — if it returns a valid JSON response, the Ollama server is running and accessible.
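A scripted health check based on this endpoint might look like the following sketch (host and timeout values are illustrative):

```python
import requests


def ollama_healthy(base_url: str = "http://localhost:11434",
                   timeout: float = 5.0) -> bool:
    """Return True if GET /api/tags answers with the expected JSON shape."""
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=timeout)
        resp.raise_for_status()
        # A healthy server returns {"models": [...]}
        return "models" in resp.json()
    except (requests.RequestException, ValueError):
        return False
```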

POST /api/show — Model Information

Get detailed information about a specific model, including its parameters, template, and license.

curl -s http://localhost:11434/api/show \
  -d '{"name": "qwen3.5:35b-a3b-q4_K_M"}' | python3 -m json.tool

POST /api/pull — Pull a Model

Download a model from the registry. Useful for scripting model deployment.

curl -s http://localhost:11434/api/pull \
  -d '{"name": "qwen3.5:35b-a3b-q4_K_M", "stream": false}'

DELETE /api/delete — Delete a Model

Remove a model from local storage.

curl -s -X DELETE http://localhost:11434/api/delete \
  -d '{"name": "llama3.1:8b"}'

OpenAI-Compatible API

POST /v1/chat/completions

Drop-in compatible with the OpenAI Chat Completions API for plain chat and streaming. Useful for tools and libraries that expect the OpenAI API format.

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
      {"role": "user", "content": "Hello, what can you help me with?"}
    ]
  }' | python3 -m json.tool

GET /v1/models

List available models in OpenAI-compatible format.

curl -s http://localhost:11434/v1/models | python3 -m json.tool

Do NOT use the /v1 endpoint for tool calling. Even if you pass a tools array to /v1/chat/completions, the response may not include the tool_calls field. Use /api/chat instead.

Python Client Examples

Basic Chat with requests

import requests

response = requests.post("http://<ollama-host>:11434/api/chat", json={
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
        {"role": "user", "content": "List Kubernetes namespaces"}
    ],
    "stream": False
})

data = response.json()
print(data["message"]["content"])

Tool Calling Loop with requests

import requests
import json

OLLAMA_URL = "http://<ollama-host>:11434/api/chat"
MODEL = "qwen3.5:35b-a3b-q4_K_M"

tools = [
    {
        "type": "function",
        "function": {
            "name": "ffo_search",
            "description": "Search the ontology for entities",
            "parameters": {
                "type": "object",
                "properties": {
                    "entity_type": {"type": "string"},
                    "limit": {"type": "integer", "default": 25}
                },
                "required": ["entity_type"]
            }
        }
    }
]

# Tool implementation registry
def execute_tool(name: str, arguments: dict) -> str:
    """Dispatch tool calls to actual implementations."""
    if name == "ffo_search":
        # Call your actual MCP tool here
        return json.dumps([{"name": "texas-dell-04", "type": "cluster"}])
    raise ValueError(f"Unknown tool: {name}")

# Conversation loop
messages = [
    {"role": "system", "content": "You are an infrastructure assistant. Use tools to answer questions."},
    {"role": "user", "content": "What clusters exist?"}
]

MAX_ROUNDS = 10  # guard against a model that keeps calling tools forever

for _ in range(MAX_ROUNDS):
    response = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": messages,
        "tools": tools,
        "stream": False,
        "options": {"temperature": 0.1}
    }, timeout=120)

    data = response.json()
    assistant_message = data["message"]
    messages.append(assistant_message)

    # Check if the model wants to call tools
    if not assistant_message.get("tool_calls"):
        # No tool calls — this is the final response
        print(assistant_message["content"])
        break

    # Execute each tool call
    for tool_call in assistant_message["tool_calls"]:
        func = tool_call["function"]
        result = execute_tool(func["name"], func["arguments"])
        messages.append({
            "role": "tool",
            "content": result
        })

Streaming Chat

import requests
import json

response = requests.post("http://<ollama-host>:11434/api/chat", json={
    "model": "qwen3.5:35b-a3b-q4_K_M",
    "messages": [
        {"role": "user", "content": "Explain service meshes"}
    ],
    "stream": True
}, stream=True)

for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if chunk.get("message", {}).get("content"):
            print(chunk["message"]["content"], end="", flush=True)
        if chunk.get("done"):
            print()  # Final newline
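Each streamed line is an NDJSON chunk, and the final chunk (`"done": true`) also carries the eval statistics. A helper that collects a stream into the full message plus stats, assuming the chunk shape shown above:

```python
import json


def accumulate_stream(lines):
    """Collect streamed NDJSON chunks into (full_content, stats).

    `lines` is an iterable of raw JSON lines (str or bytes). The stats
    dict holds the *_count / *_duration fields from the final chunk.
    """
    parts = []
    stats = {}
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            stats = {k: v for k, v in chunk.items()
                     if k.endswith(("_count", "_duration"))}
    return "".join(parts), stats
```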

Request Options

Both /api/chat and /api/generate accept an options object for controlling inference parameters:

{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "messages": [...],
  "options": {
    "temperature": 0.1,
    "num_ctx": 16384,
    "num_predict": 2048,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1
  }
}
| Option         | Default | Description                                           |
|----------------|---------|-------------------------------------------------------|
| temperature    | 0.8     | Randomness. Use 0.1 for tool calling.                 |
| num_ctx        | 8192    | Context window size in tokens.                        |
| num_predict    | 128     | Maximum tokens to generate. Increase for long responses. |
| top_p          | 0.9     | Nucleus sampling threshold.                           |
| top_k          | 40      | Top-K sampling.                                       |
| repeat_penalty | 1.1     | Penalty for repeating tokens.                         |
| seed           | random  | Set for reproducible output.                          |
| stop           | none    | Array of stop sequences.                              |
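For tool-calling workloads it can help to wrap these options in a small payload builder so every request uses the same conservative settings. A sketch (the defaults below are illustrative choices, not requirements):

```python
def build_chat_request(model, messages, tools=None, *,
                       temperature=0.1, num_ctx=16384, num_predict=2048):
    """Assemble an /api/chat payload with tool-calling-friendly options:
    low temperature for deterministic tool selection, a larger context
    window for tool results, and room for long final answers."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {
            "temperature": temperature,
            "num_ctx": num_ctx,
            "num_predict": num_predict,
        },
    }
    if tools:
        payload["tools"] = tools
    return payload
```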

See the Performance Tuning guide for recommended values for tool-calling workloads.