Installing Ollama
Step-by-step installation, configuration, and model management for Ollama on macOS, including environment variables and auto-start setup.
Developer Workstation Setup — This guide describes running LLM inference on an Apple Silicon Mac for local development, operator tooling, and Compass AI chat. This is not the production inference architecture. For production air-gapped deployments, see the vLLM on Kubernetes Production Inference Guide.
Installation Methods
Direct Download (Recommended)
Download the macOS application from ollama.com/download. This installs the Ollama application and the ollama CLI tool. The application runs as a menu bar item and starts the server automatically.
Homebrew
brew install ollama
This installs only the CLI binary. You will need to start the server manually with ollama serve or configure a launch agent.
Curl Install
curl -fsSL https://ollama.com/install.sh | sh
This is the standard Linux install method. On macOS, prefer the direct download or Homebrew.
Verify Installation
ollama --version
# ollama version is 0.9.x
If the command is not found after installing the macOS application, ensure /usr/local/bin is in your PATH. The application installs the CLI there.
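A quick check, assuming the default install location used by the macOS application:

```shell
# The macOS app installs the CLI at /usr/local/bin/ollama; confirm it exists
ls -l /usr/local/bin/ollama 2>/dev/null || echo "CLI not found at /usr/local/bin/ollama"

# Ensure /usr/local/bin is on PATH for this shell; prepend it if missing
case ":$PATH:" in
  *":/usr/local/bin:"*) echo "/usr/local/bin already on PATH" ;;
  *) export PATH="/usr/local/bin:$PATH" ;;
esac
```

To make the PATH change permanent, add the `export` line to your shell profile (for example `~/.zshrc`).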
Starting the Server
Manual Start
ollama serve
The server listens on localhost:11434 by default. It runs in the foreground and blocks the terminal; use the macOS application or a launch agent for background operation.
macOS Application
If you installed via direct download, the Ollama application runs the server automatically when launched. It appears as a llama icon in the menu bar. To start at login, enable “Launch at Login” in the menu bar dropdown.
launchctl (Headless Auto-Start)
For headless Mac servers (like the FFP inference host), create a launch agent that starts Ollama automatically on boot:
cat > ~/Library/LaunchAgents/com.ollama.server.plist << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.ollama.server</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/ollama</string>
<string>serve</string>
</array>
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_HOST</key>
<string>0.0.0.0:11434</string>
<key>OLLAMA_KEEP_ALIVE</key>
<string>24h</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/ollama.out.log</string>
<key>StandardErrorPath</key>
<string>/tmp/ollama.err.log</string>
</dict>
</plist>
EOF
Load the agent:
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist
Verify it is running:
curl http://localhost:11434/api/tags
To stop:
launchctl unload ~/Library/LaunchAgents/com.ollama.server.plist
Note: If you are using the Ollama macOS application, it manages its own launch agent. Do not create a second one — they will conflict on port 11434.
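To check whether the agent is loaded and see its status, `launchctl list` can be filtered by the label from the plist above (a minimal sketch):

```shell
# First column is the PID ("-" if not running), second is the last exit status,
# third is the label defined in the plist
launchctl list | grep com.ollama.server || echo "agent not loaded"
```

If the agent shows a nonzero exit status, check the log paths defined in the plist (`/tmp/ollama.out.log` and `/tmp/ollama.err.log`).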
Pulling Models
Download models from the Ollama registry:
# General purpose
ollama pull llama3.1:8b
# Recommended for tool calling (FFP production model)
ollama pull qwen3.5:35b-a3b-q4_K_M
# Larger general purpose
ollama pull llama3.1:70b-instruct-q4_K_M
# Code generation
ollama pull deepseek-coder-v2:16b
# Embeddings
ollama pull nomic-embed-text
List downloaded models:
ollama list
Example output:
NAME                       ID             SIZE      MODIFIED
qwen3.5:35b-a3b-q4_K_M     a1b2c3d4e5     20 GB     2 days ago
llama3.1:8b                f6g7h8i9j0     4.7 GB    5 days ago
nomic-embed-text:latest    k1l2m3n4o5     274 MB    1 week ago
Remove a model:
ollama rm llama3.1:8b
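Before removing a model, `ollama show` can be used to inspect what you have: architecture, parameter count, quantization, and context length (a sketch; the exact output format varies by Ollama version):

```shell
# Print model metadata; falls through with a message if the model or CLI is absent
ollama show llama3.1:8b || echo "model not found or ollama not on PATH"
```

`ollama show llama3.1:8b --modelfile` prints the Modelfile the model was built from, which is useful when creating customized variants.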
Environment Variables
Set these before starting ollama serve, or include them in the launchctl plist.
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | localhost:11434 | Bind address. Set to 0.0.0.0:11434 for LAN access. |
| OLLAMA_MODELS | ~/.ollama/models | Model storage directory. Change if you need models on a different volume. |
| OLLAMA_NUM_PARALLEL | 1 | Number of concurrent request slots. Each slot consumes additional memory for KV cache. |
| OLLAMA_MAX_LOADED_MODELS | 1 | Maximum models loaded in memory simultaneously. |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep a model loaded after the last request. Set to 24h for always-on serving. |
| OLLAMA_GPU_MEMORY_FRACTION | 0.9 | Fraction of GPU memory Ollama can use. On Apple Silicon, this refers to the unified memory pool. |
| OLLAMA_DEBUG | 0 | Set to 1 for verbose debug logging. |
| OLLAMA_FLASH_ATTENTION | 1 | Enable flash attention (enabled by default on supported hardware). |
| OLLAMA_MAX_QUEUE | 512 | Maximum number of queued requests before rejecting. |
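For a manual shell session, export the variables before starting the server (a minimal sketch using the LAN-serving values discussed above):

```shell
# Bind to all interfaces, keep models resident for 24 hours, allow 2 request slots
export OLLAMA_HOST="0.0.0.0:11434"
export OLLAMA_KEEP_ALIVE="24h"
export OLLAMA_NUM_PARALLEL="2"

# Confirm the values the server process will inherit
env | grep '^OLLAMA_'
```

Then run ollama serve in the same shell so the server inherits these values.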
Setting Environment Variables for the macOS Application
If you use the Ollama macOS application (menu bar), environment variables must be set using launchctl setenv because the application does not read shell profile files:
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_KEEP_ALIVE "24h"
launchctl setenv OLLAMA_NUM_PARALLEL "2"
After setting these, restart the Ollama application for changes to take effect.
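You can confirm what GUI applications will inherit with `launchctl getenv` (sketch; empty output means the variable is unset):

```shell
# Print the value visible to GUI applications such as the Ollama menu bar app
launchctl getenv OLLAMA_HOST 2>/dev/null || echo "launchctl not available in this shell"
```

To remove a variable later, use `launchctl unsetenv OLLAMA_HOST` and restart the application. Note that `launchctl setenv` values do not survive a reboot; re-apply them at login if needed.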
Firewall Configuration
If OLLAMA_HOST is set to 0.0.0.0, macOS may prompt to allow incoming connections. To configure manually:
- Open System Settings > Network > Firewall.
- Ensure the firewall allows incoming connections for Ollama, or add an explicit rule.
- Verify from another machine:
curl http://<ollama-host>:11434/api/tags
If the connection is refused, check that the firewall is not blocking port 11434 and that OLLAMA_HOST is set to 0.0.0.0 (not localhost or 127.0.0.1).
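A port-level check with `nc` can distinguish a firewall block from a bind-address problem (sketch; `ollama-host.local` is a placeholder for your Mac's hostname or IP):

```shell
# -z: connect-only scan, -v: verbose, -w 3: three-second timeout.
# A success message means the firewall permits the connection.
nc -zv -w 3 ollama-host.local 11434 || echo "port 11434 not reachable"
```

If `nc` succeeds but the curl check fails, the server is reachable but may be returning an error; check the Ollama logs on the host.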
Model Storage
Models are stored in ~/.ollama/models by default. Each model consists of:
- Blob files: The actual GGUF weight files, stored content-addressed by SHA256.
- Manifests: JSON files mapping model tags to blob digests.
A 35B Q4 model consumes approximately 20GB of disk space. Ensure your boot volume has sufficient free space, or set OLLAMA_MODELS to point to an external volume.
To check total disk usage:
du -sh ~/.ollama/models
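Relocating storage is a matter of pointing OLLAMA_MODELS at the new directory and restarting the server (a sketch; the path below is illustrative, substitute your external volume's mount point):

```shell
# Illustrative target directory; substitute your external volume's mount point
export OLLAMA_MODELS="$HOME/ExternalModels/ollama"
mkdir -p "$OLLAMA_MODELS"

# Existing models can be copied across before restarting the server:
# cp -R ~/.ollama/models/ "$OLLAMA_MODELS/"
du -sh "$OLLAMA_MODELS"
```

For the macOS application, set the variable with `launchctl setenv OLLAMA_MODELS <path>` instead, then restart the app.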
Verifying the Installation
Run a quick inference test:
ollama run llama3.1:8b "What is 2+2?"
This pulls the model (if not already downloaded), loads it into memory, runs inference, and prints the response. If this completes successfully, your Ollama installation is working correctly.
For API-level verification:
# Check server is running
curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Check a model loads and responds
curl -s http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}' \
| python3 -m json.tool
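The /api/chat endpoint can be exercised the same way for multi-turn message formats (sketch; assumes the server is running and llama3.1:8b is pulled):

```shell
# Chat-style request: messages array instead of a bare prompt
curl -s http://localhost:11434/api/chat \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Say hello"}], "stream": false}' \
  | python3 -m json.tool || echo "server not reachable or model not pulled"
```

With "stream": false the full response arrives as a single JSON object; omit it to receive newline-delimited JSON chunks as tokens are generated.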