Installing Ollama

Step-by-step installation, configuration, and model management for Ollama on macOS, including environment variables and auto-start setup.

Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.

Installation Methods

Direct Download (Recommended)

Download the macOS application from ollama.com/download. This installs the Ollama application and the ollama CLI tool. The application runs as a menu bar item and starts the server automatically.

Homebrew

brew install ollama

This installs only the CLI binary. You will need to start the server manually with ollama serve or configure a launch agent.

Curl Install

curl -fsSL https://ollama.com/install.sh | sh

This is the standard Linux install method. On macOS, prefer the direct download or Homebrew.

Verify Installation

ollama --version
# ollama version is 0.9.x

If the command is not found after installing the macOS application, ensure /usr/local/bin is in your PATH. The application installs the CLI there.
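If you need to add it manually, one way for zsh (the default shell on modern macOS; adapt for bash if that is your shell):

```shell
# Add /usr/local/bin to PATH for zsh; use ~/.bash_profile if you use bash
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
command -v ollama   # should now print /usr/local/bin/ollama
```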

Starting the Server

Manual Start

ollama serve

The server starts on localhost:11434 by default and runs in the foreground, blocking the terminal until you stop it (Ctrl+C).
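If you want a quick way to keep the server running without tying up a terminal (and without setting up a launch agent yet), one option is to background it. This is a sketch; the log path is arbitrary:

```shell
# Run the server detached from the terminal, capturing output to a log file
nohup ollama serve > /tmp/ollama-serve.log 2>&1 &
echo "server PID: $!"
```

For anything longer-lived than an interactive session, prefer the launch agent described below, which restarts the server on crash and on reboot.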

macOS Application

If you installed via direct download, the Ollama application runs the server automatically when launched. It appears as a llama icon in the menu bar. To start at login, enable “Launch at Login” in the menu bar dropdown.

launchctl (Headless Auto-Start)

For headless Mac servers (like the FFP inference host), create a launch agent that starts Ollama automatically on boot:

cat > ~/Library/LaunchAgents/com.ollama.server.plist << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>EnvironmentVariables</key>
    <dict>
        <key>OLLAMA_HOST</key>
        <string>0.0.0.0:11434</string>
        <key>OLLAMA_KEEP_ALIVE</key>
        <string>24h</string>
    </dict>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.out.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err.log</string>
</dict>
</plist>
EOF

Load the agent:

launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

On recent macOS versions, launchctl load is legacy (but still functional) syntax; the modern equivalent is launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.ollama.server.plist.

Verify it is running:

curl http://localhost:11434/api/tags

To stop:

launchctl unload ~/Library/LaunchAgents/com.ollama.server.plist

Note: If you are using the Ollama macOS application, it manages its own launch agent. Do not create a second one — they will conflict on port 11434.
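You can also ask launchd directly whether it is supervising the agent (label assumed to match the plist above):

```shell
# A PID in the first column means the server process is running;
# a "-" means launchd knows the job but it is not currently running
launchctl list | grep com.ollama.server
```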

Pulling Models

Download models from the Ollama registry:

# General purpose
ollama pull llama3.1:8b

# Recommended for tool calling (FFP production model)
ollama pull qwen3.5:35b-a3b-q4_K_M

# Larger general purpose
ollama pull llama3.1:70b-instruct-q4_K_M

# Code generation
ollama pull deepseek-coder-v2:16b

# Embeddings
ollama pull nomic-embed-text

List downloaded models:

ollama list

Example output:

NAME                        ID            SIZE      MODIFIED
qwen3.5:35b-a3b-q4_K_M      a1b2c3d4e5    20 GB     2 days ago
llama3.1:8b                 f6g7h8i9j0    4.7 GB    5 days ago
nomic-embed-text:latest     k1l2m3n4o5    274 MB    1 week ago

Remove a model:

ollama rm llama3.1:8b

Environment Variables

Set these before starting ollama serve, or include them in the launchctl plist.

Variable                      Default            Description
OLLAMA_HOST                   localhost:11434    Bind address. Set to 0.0.0.0:11434 for LAN access.
OLLAMA_MODELS                 ~/.ollama/models   Model storage directory. Change if you need models on a different volume.
OLLAMA_NUM_PARALLEL           1                  Number of concurrent request slots. Each slot consumes additional memory for KV cache.
OLLAMA_MAX_LOADED_MODELS      1                  Maximum models loaded in memory simultaneously.
OLLAMA_KEEP_ALIVE             5m                 How long to keep a model loaded after the last request. Set to 24h for always-on serving.
OLLAMA_GPU_MEMORY_FRACTION    0.9                Fraction of GPU memory Ollama can use. On Apple Silicon, this refers to the unified memory pool.
OLLAMA_DEBUG                  0                  Set to 1 for verbose debug logging.
OLLAMA_FLASH_ATTENTION        1                  Enable flash attention (enabled by default on supported hardware).
OLLAMA_MAX_QUEUE              512                Maximum number of queued requests before rejecting.
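For a one-off manual start, the same variables can also be set inline on the command line (values here are illustrative):

```shell
# Per-process environment: applies only to this ollama serve invocation
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=24h OLLAMA_NUM_PARALLEL=2 ollama serve
```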

Setting Environment Variables for the macOS Application

If you use the Ollama macOS application (menu bar), environment variables must be set using launchctl setenv because the application does not read shell profile files:

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_KEEP_ALIVE "24h"
launchctl setenv OLLAMA_NUM_PARALLEL "2"

After setting these, restart the Ollama application for changes to take effect.
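To confirm the values took effect in the user session, read them back with launchctl getenv:

```shell
# Read back the session-level values; empty output means the variable is unset
launchctl getenv OLLAMA_HOST
launchctl getenv OLLAMA_KEEP_ALIVE
```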

Firewall Configuration

If OLLAMA_HOST is set to 0.0.0.0, macOS may prompt to allow incoming connections. To configure manually:

  1. Open System Settings > Network > Firewall.
  2. Ensure the firewall allows incoming connections for Ollama, or add an explicit rule.
  3. Verify from another machine:
curl http://<ollama-host>:11434/api/tags

If the connection is refused, check that the firewall is not blocking port 11434 and that OLLAMA_HOST is set to 0.0.0.0 (not localhost or 127.0.0.1).
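A quick way to distinguish "port blocked" from "server bound to the wrong interface" (the LAN address below is illustrative):

```shell
# From another machine: test raw TCP reachability of the port,
# independent of HTTP (replace 192.168.1.50 with your Ollama host)
nc -vz 192.168.1.50 11434

# From the Ollama host itself: if this works but the LAN check fails,
# the firewall or the OLLAMA_HOST binding is the problem
curl -s http://localhost:11434/api/version
```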

Model Storage

Models are stored in ~/.ollama/models by default. Each model consists of:

  • Blob files: The actual GGUF weight files, stored content-addressed by SHA256.
  • Manifests: JSON files mapping model tags to blob digests.

A 35B Q4 model consumes approximately 20GB of disk space. Ensure your boot volume has sufficient free space, or set OLLAMA_MODELS to point to an external volume.

To check total disk usage:

du -sh ~/.ollama/models
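A sketch of relocating model storage to an external volume, as mentioned above (paths are illustrative; restart the server afterwards):

```shell
# Create the new storage directory and point Ollama at it
NEW=/Volumes/ExternalSSD/ollama-models        # illustrative path
mkdir -p "$NEW"
# Optionally copy existing blobs and manifests across first:
# cp -R ~/.ollama/models/ "$NEW"
launchctl setenv OLLAMA_MODELS "$NEW"         # for the menu bar application
# or: OLLAMA_MODELS="$NEW" ollama serve       # for a manual start
```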

Verifying the Installation

Run a quick inference test:

ollama run llama3.1:8b "What is 2+2?"

This pulls the model (if not already downloaded), loads it into memory, runs inference, and prints the response. If this completes successfully, your Ollama installation is working correctly.

For API-level verification:

# Check server is running
curl -s http://localhost:11434/api/tags | python3 -m json.tool

# Check a model loads and responds
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}' \
  | python3 -m json.tool
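To pull out just the generated text rather than the full JSON document, you can extract the response field (which holds the completion when stream is false):

```shell
# Extract only the "response" field from the /api/generate JSON reply
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["response"])'
```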