Autonomous SRE Agent — Prevention Before Failure

InfrastructureAI acts at the warning threshold — before a system fails, before a human is paged, before there is an incident. A Grafana alert fires at 60% CPU or 80% disk. The autonomous agent investigates, identifies root cause, remediates, verifies, and writes the post-mortem — without waking anyone up. The SRE reads the post-mortem in the morning. This is prevention, not recovery.

Phase 1 — Human-Directed, Agent-Executed (Validated Today)

Phase 1 requires zero custom code. An operator provides an alert to Claude Code, which enters a ReAct loop — reasoning about what to check, calling MCP tools to gather data, synthesizing findings, and executing remediation. The MCP tool surface varies by deployment:

  • AWS EKS deployment — AWS MCP (SSM Run Command for host access), Grafana MCP, Atlassian MCP
  • FMC/VitroAI on-premise deployment — Ceph, OpenStack, Kolla, Keycloak, Grafana, ArgoCD, Gitea, Atlassian, and FFO MCP servers

The investigation and remediation logic is identical — the agent reasons the same way regardless of which MCP tools are available. Only the tool surface changes.

Note: The investigation workflow and ReAct example below use the AWS EKS deployment variant (AWS MCP with SSM Run Command). For the on-premise FMC/VitroAI deployment validated in Phase 2, the same pattern applies using the on-premise MCP fleet — see the callout after the example.

Investigation Workflow

The operator pastes an alert into Claude Code. From that point, the agent autonomously drives the investigation:

| Step | MCP Server | Tool Call | Purpose |
|------|------------|-----------|---------|
| 1 | AWS MCP | ssm send-command → df -h | Check current disk usage |
| 2 | AWS MCP | ssm send-command → du -sh /* | Find largest directories |
| 3 | Grafana MCP | query_prometheus → node_filesystem_avail_bytes[24h] | Get 24-hour disk trend |
| 4 | Grafana MCP | query_loki → {instance=...} \|= "No space left" | Check for space-related errors |
| 5 | AWS MCP | ssm send-command → journalctl --vacuum-size=1G | Execute remediation |
| 6 | Atlassian MCP | create-jira-issue | Document incident and resolution |

The agent does not follow a script. It reasons about each result before deciding the next tool call. If step 2 reveals that /var/log/journal is consuming 12GB, the agent correlates that with the Prometheus trend from step 3 and the Loki errors from step 4 before deciding that journal vacuuming is the correct remediation. If the root cause were different — say, a log-shipping pipeline filling /tmp — the agent would take a completely different remediation path.
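That branching behavior can be sketched as a minimal ReAct-style loop. Everything below is an illustrative stand-in: the tool functions return canned results and the thresholds are invented for the sketch, not Claude Code internals.

```python
# Minimal ReAct-style loop: observe, reason about the result,
# pick the next tool call, stop when a remediation is chosen.
# All tool results here are canned stand-ins for real MCP responses.

def run_tool(call, fake_results):
    """Stand-in for an MCP tool invocation."""
    return fake_results[call]

def investigate_disk_alert(fake_results):
    trace = []
    usage = run_tool("df -h", fake_results)          # step 1: current usage
    trace.append(("df -h", usage))
    top_dirs = run_tool("du -sh /*", fake_results)   # step 2: largest dirs
    trace.append(("du -sh /*", top_dirs))

    biggest = max(top_dirs, key=top_dirs.get)
    if biggest == "/var/log":
        # correlate with the metric trend before committing to a remediation
        trend = run_tool("query_prometheus", fake_results)
        trace.append(("query_prometheus", trend))
        if trend["daily_growth_gb"] > 1.0:
            return "journalctl --vacuum-size=1G", trace
    # a different root cause (e.g. /tmp filling) takes a different path
    return f"inspect {biggest}", trace

fake = {
    "df -h": {"used_pct": 84},
    "du -sh /*": {"/var/log": 12.0, "/var/lib": 8.4, "/usr": 6.2},
    "query_prometheus": {"daily_growth_gb": 1.4},
}
action, trace = investigate_disk_alert(fake)
print(action)  # for this evidence, the loop settles on journal vacuuming
```

The point of the sketch is the data-dependent control flow: change the du result and the loop takes a different remediation path without any script change.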

Safety Model

Not everything executes autonomously. The safety model defines boundaries:

| Action Category | Autonomous | Requires Approval |
|---|---|---|
| Read-only investigation (df, du, ps, top, PromQL queries) | Yes | — |
| Log cleanup (journalctl vacuum, logrotate, /tmp cleanup) | Yes | — |
| Non-production service restart | Yes | — |
| Production service restart | — | Yes |
| Package installation or upgrade | — | Yes |
| Data deletion beyond logs | — | Yes |
| Configuration file modification | — | Yes |
| Security-related changes | — | Yes |
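The boundary amounts to a small allowlist check. The category names below mirror the table; the function itself is an illustrative sketch, not the product's actual enforcement code.

```python
# Action categories the agent may execute without approval,
# mirroring the safety-model table above. Names are illustrative.
AUTONOMOUS = {
    "read_only_investigation",   # df, du, ps, top, PromQL queries
    "log_cleanup",               # journalctl vacuum, logrotate, /tmp cleanup
    "nonprod_service_restart",
}

def requires_approval(category: str) -> bool:
    """Fail closed: anything outside the autonomous set pauses for a human."""
    return category not in AUTONOMOUS

print(requires_approval("log_cleanup"))            # False: executes autonomously
print(requires_approval("prod_service_restart"))   # True: waits for approval
```

Note the fail-closed default: an unrecognized category requires approval rather than slipping through.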

Example Session — Complete Disk Remediation Cycle

Below is the actual Claude Code ReAct loop output from a disk pressure remediation on a monitoring instance. This is not a mockup — this is the real agent output.

Alert: DiskSpaceCritical on ip-10-0-47-122 (83% used, threshold 80%)

> Thinking: Disk alert on ip-10-0-47-122. I need to check current usage,
> find what's consuming space, and look at the trend.

Tool call: aws-mcp → ssm send-command → df -h (ip-10-0-47-122)
Result: /dev/xvda1  50G   42G  8.0G  84% /

> Thinking: 84% used, 8GB free. Not emergency yet but trending up.
> Let me find where the space is going.

Tool call: aws-mcp → ssm send-command → du -sh /* (ip-10-0-47-122)
Result:
  12G   /var/log
  8.4G  /var/lib
  6.2G  /usr
  ...

> Thinking: /var/log is 12GB — that's the primary consumer. Let me check
> what's in there.

Tool call: aws-mcp → ssm send-command → du -sh /var/log/* (ip-10-0-47-122)
Result:
  9.8G  /var/log/journal
  1.1G  /var/log/syslog
  ...

> Thinking: Journal is 9.8GB. Normal retention is 1-2GB. Let me check
> the trend to see how fast it's growing.

Tool call: grafana-mcp → query_prometheus →
  node_filesystem_avail_bytes{instance="ip-10-0-47-122"}[7d]
Result: Linear decline from 18GB to 8GB over 7 days (~1.4GB/day)

> Thinking: At 1.4GB/day, this instance hits 95% in ~5 days. Journal
> vacuuming is safe and will recover ~8GB. This is a low-risk action
> on a monitoring instance — executing autonomously.

Tool call: aws-mcp → ssm send-command →
  journalctl --vacuum-size=1G (ip-10-0-47-122)
Result: Vacuuming done, freed 8.7GB

Tool call: aws-mcp → ssm send-command → df -h (ip-10-0-47-122)
Result: /dev/xvda1  50G   33G  17G  66% /

> Thinking: Down from 84% to 66%. Alert will clear. Let me document this
> and create a ticket for the underlying issue — why is journal retention
> not configured?

Tool call: atlassian-mcp → create-jira-issue
  Summary: "Configure journal retention on ip-10-0-47-122"
  Description: "Journal accumulated 9.8GB due to missing retention
  policy. Vacuumed to 1GB. Configure MaxRetentionSec=7d and
  SystemMaxUse=2G in /etc/systemd/journald.conf to prevent recurrence."

Remediation complete. Disk at 66%. Jira ticket FF-342 created for
permanent fix. No human intervention required.

On-Premise Variant (FMC/VitroAI)

The example above uses the AWS MCP tool surface (SSM Run Command for host access, CloudWatch/Prometheus for metrics). For on-premise FMC/VitroAI deployments, the same ReAct pattern applies but uses the on-premise MCP fleet:

| AWS EKS Tool | FMC/VitroAI Equivalent | Example |
|---|---|---|
| AWS MCP — ssm send-command | OpenStack MCP / Kolla MCP | openstack server list, kolla service-logs nova-compute |
| Grafana MCP — query_prometheus | Grafana MCP — query_prometheus | Same tool, different Prometheus instance |
| Grafana MCP — query_loki | Grafana MCP — query_loki | Same tool, different Loki instance |
| (no AWS equivalent) | Ceph MCP — ceph_health, ceph_osd_status | Storage health, OSD state, pool usage |
| (no AWS equivalent) | Keycloak MCP | Identity and access state |
| (no AWS equivalent) | ArgoCD MCP | GitOps deployment state and sync status |
| (no AWS equivalent) | FFO MCP — ffo.context.for_action | Entity relationships, dependencies, prior incidents |
| Atlassian MCP — create-jira-issue | Atlassian MCP — create-jira-issue | Same tool, same Jira instance |

The agent does not need to be told which deployment variant it is operating in. The MCP tool surface determines what actions are available — the agent discovers tools at startup and reasons about which ones to call based on the alert context.
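Discovery-driven operation can be sketched like this: the agent enumerates whatever tools the connected servers advertise and filters by relevance to the alert, never branching on which deployment it is in. Server and tool names here are illustrative.

```python
# The agent never checks "AWS vs on-prem" — it works with whatever
# tool surface the connected MCP servers advertise at startup.

def discover_tools(servers):
    """Flatten the tool lists advertised by each connected server."""
    return {f"{name}.{tool}" for name, tools in servers.items() for tool in tools}

def tools_for_alert(alert_keywords, available):
    """Pick candidate tools whose names match the alert context."""
    return {t for t in available if any(k in t for k in alert_keywords)}

aws_fleet = {
    "aws-mcp": ["ssm_send_command"],
    "grafana-mcp": ["query_prometheus", "query_loki"],
}
onprem_fleet = {
    "ceph-mcp": ["ceph_health", "ceph_osd_status"],
    "grafana-mcp": ["query_prometheus", "query_loki"],
}

# Same alert, two deployments: the reasoning is identical,
# only the discovered surface differs.
for fleet in (aws_fleet, onprem_fleet):
    print(sorted(tools_for_alert(["prometheus", "ceph"], discover_tools(fleet))))
```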

Lines of Custom Code for Phase 1: Zero

Phase 1 uses Claude Code as the orchestrator and off-the-shelf MCP servers as the tool layer. There is no custom application code. The entire capability is configuration: which MCP servers to connect, what credentials to provide, and what safety boundaries to enforce.
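As an illustration of that configuration-only surface, a Phase 1 setup reduces to an mcp.json file. The exact schema depends on the Claude Code version, and the server commands and env vars below are placeholders, so treat this as a sketch rather than a drop-in file.

```python
import json

# Hypothetical Phase 1 configuration: which MCP servers to connect
# and what credentials to hand them. Command names, env vars, and
# URLs are placeholders, not real packages or endpoints.
mcp_config = {
    "mcpServers": {
        "aws-mcp": {
            "command": "aws-mcp-server",
            "env": {"AWS_PROFILE": "sre-agent"},
        },
        "grafana-mcp": {
            "command": "grafana-mcp-server",
            "env": {"GRAFANA_URL": "https://grafana.internal"},
        },
        "atlassian-mcp": {
            "command": "atlassian-mcp-server",
            "env": {"JIRA_URL": "https://jira.internal"},
        },
    }
}

print(json.dumps(mcp_config, indent=2))
```

Swapping the deployment variant means swapping entries in this file, nothing more.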

Phase 2 — Autonomous Dispatch Loop (Validated March 29, 2026)

Phase 1 requires a human to paste the alert. Phase 2 removes that requirement. The trigger becomes autonomous — Grafana Alertmanager fires at the warning threshold, and the system dispatches a Claude Code agent in a Kubernetes Job to investigate and remediate without human initiation.

Architecture

The dispatch loop runs as the Dispatch Controller — a FastAPI application deployed on the Frontier Management Cluster (FMC). When an alert arrives, the controller evaluates risk, gathers context from the FFO knowledge graph, renders a severity-appropriate prompt, and spawns a short-lived Kubernetes Job running Claude Code with access to the full MCP tool surface.

```mermaid
graph TD
    Alert[Grafana Alertmanager<br/>Warning Threshold] --> DC[Dispatch Controller<br/>FastAPI]
    DC --> OPA[OPA Risk Evaluation<br/>Rego Policy — Wasm]
    OPA -->|LOW| Auto[Autonomous Dispatch]
    OPA -->|MEDIUM/HIGH| Gate[Human Approval Gate]
    Gate -->|Approved| Auto
    DC --> FFO[FFO MCP<br/>Context Injection]
    DC --> Registry[Postgres MCP Registry<br/>Dynamic mcp.json]
    Auto --> Job[K8s Job<br/>Claude Code Runner]
    Job --> MCP[MCP Server Fleet<br/>13 servers · 153+ tools]
    Job --> Writeback[FFO Write-Back<br/>Outcome + Post-Mortem]
    style Alert fill:#c53030,stroke:#fc8181,color:#fff
    style DC fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style OPA fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style Auto fill:#2b6cb0,stroke:#4299e1,color:#fff
    style Gate fill:#d69e2e,stroke:#ecc94b,color:#1a202c
    style FFO fill:#2c7a7b,stroke:#38b2ac,color:#e2e8f0
    style Registry fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style Job fill:#2b6cb0,stroke:#4299e1,color:#fff
    style MCP fill:#1a365d,stroke:#4299e1,color:#e2e8f0
    style Writeback fill:#2c7a7b,stroke:#38b2ac,color:#e2e8f0
```
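The flow in the diagram can be sketched end to end. Every helper below is a stub with illustrative names standing in for the OPA policy, the FFO MCP call, and the Kubernetes API; the simplified risk rule is for the sketch only.

```python
# Dispatch pipeline sketch: alert -> risk -> context -> prompt -> Job.
# All helpers are stubs for OPA, ffo.context.for_action, and the K8s API.

def evaluate_risk(alert):
    # stand-in for the OPA Wasm Rego policy (grossly simplified)
    return "LOW" if alert["severity"] == "warning" else "HIGH"

def fetch_context(alert):
    # stand-in for ffo.context.for_action
    return {"entity": alert["instance"], "dependents": [], "prior_incidents": 1}

def render_prompt(alert, context):
    return (f"Alert {alert['name']} on {context['entity']}; "
            f"prior incidents: {context['prior_incidents']}. "
            f"Investigate and remediate.")

def dispatch(alert):
    risk = evaluate_risk(alert)
    if risk != "LOW":
        # MEDIUM/HIGH pauses at the human approval gate
        return {"status": "awaiting_approval", "risk": risk}
    prompt = render_prompt(alert, fetch_context(alert))
    # real controller: spawn a K8s Job running the claude-runner image
    return {"status": "dispatched", "risk": risk, "prompt": prompt}

result = dispatch({"name": "DiskSpaceWarning", "severity": "warning",
                   "instance": "ip-10-0-47-122"})
print(result["status"])
```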

Dispatch Controller Components

The Dispatch Controller implements eight components defined in ADR-005 (Agent Harness Pattern):

| Component | Implementation | Purpose |
|---|---|---|
| Trigger Layer | FastAPI /dispatch endpoint | Receives Alertmanager webhooks, validates payload |
| Risk Governance | OPA Wasm (in-process Rego) | Classifies risk as LOW/MEDIUM/HIGH based on resource type, severity, and blast radius |
| Context Injection | FFO MCP ffo.context.for_action | Fetches entity relationships, dependencies, classification, and prior incident history from the knowledge graph |
| Tool Surface | Postgres mcp_servers table | Generates dynamic mcp.json from the MCP server registry — 13 servers, 153+ tools |
| Output Parsing | Job exit code + structured output | Captures remediation outcome, evidence, and post-mortem |
| Audit Trail | K8s Job logs + FFO write-back | Every dispatch is recorded — prompt, tools called, outcome, duration |
| Escalation Routing | Risk-based gating | MEDIUM/HIGH risk operations pause for human approval before execution |
| World Model Write-Back | FFO MCP ffo.write | Remediation outcomes written back to the knowledge graph as institutional memory |

Claude Code Runner

The agent runtime is a container image (claude-runner) based on node:20-slim with Claude Code installed. The container’s entrypoint is claude itself — the Dispatch Controller passes the rendered prompt directly as Job args. Each dispatched agent is an isolated, short-lived Kubernetes Job with:

  • Bedrock inference via AWS VPC PrivateLink (Claude Sonnet 4.6 or Opus 4.6)
  • Full MCP tool access — the dynamic mcp.json gives the agent access to all 13 registered MCP servers
  • RBAC-scoped permissions — the Job’s service account limits what Kubernetes operations the agent can perform
  • No persistent state — the agent reads context from FFO, acts via MCP tools, writes outcomes back to FFO, and terminates
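The runner's properties (short-lived, claude as the entrypoint taking the prompt as args, RBAC-scoped service account) can be illustrated as the kind of Job manifest the controller might submit. The image name, service account name, and TTL are assumptions for the sketch.

```python
# Illustrative Kubernetes Job spec for a dispatched agent.
# Image, serviceAccountName, and TTL values are hypothetical.

def render_job(job_id: str, prompt: str) -> dict:
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"sre-agent-{job_id}"},
        "spec": {
            "ttlSecondsAfterFinished": 600,  # short-lived, no persistent state
            "backoffLimit": 0,               # no retries; failures escalate
            "template": {
                "spec": {
                    "serviceAccountName": "sre-agent",  # RBAC-scoped permissions
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "claude-runner",
                        "image": "claude-runner:latest",
                        "args": [prompt],  # entrypoint is `claude` itself
                    }],
                }
            },
        },
    }

job = render_job("a1b2c3", "Investigate DiskSpaceCritical on ip-10-0-47-122")
print(job["metadata"]["name"])
```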

Risk Classification

OPA evaluates every incoming alert against a Rego policy that considers:

  • Resource type — VitroAI infrastructure types (hypervisor, network, storage) are classified higher than application-level resources
  • Severity — Critical and warning thresholds map to different risk levels
  • Blast radius — Resources with many dependents in the FFO graph are classified higher

LOW risk operations (log cleanup, non-production restarts, cache invalidation) dispatch autonomously. MEDIUM and HIGH risk operations require human approval via the escalation gate.
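A Python equivalent of the policy's logic might look like the following. The actual policy is Rego compiled to Wasm; the scoring weights, thresholds, and type lists here are invented for illustration.

```python
# Risk classification mirroring the three factors above:
# resource type, severity, blast radius. Weights are illustrative.
INFRA_TYPES = {"hypervisor", "network", "storage"}

def classify_risk(resource_type: str, severity: str, dependents: int) -> str:
    score = 0
    if resource_type in INFRA_TYPES:
        score += 2   # VitroAI infrastructure types rank higher
    if severity == "critical":
        score += 2   # critical vs warning threshold
    if dependents > 10:
        score += 1   # wide blast radius in the FFO graph
    if score >= 4:
        return "HIGH"
    return "MEDIUM" if score >= 2 else "LOW"

print(classify_risk("application", "warning", 0))   # LOW  -> autonomous dispatch
print(classify_risk("hypervisor", "critical", 25))  # HIGH -> approval gate
```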

Validation Results (March 29, 2026)

Phase 2 was validated end-to-end on the Frontier Management Cluster:

  • Dispatch Controller deployed and accepting webhook payloads
  • OPA risk evaluation correctly classifying alerts by resource type and severity
  • FFO context injection fetching entity relationships and prior history
  • Claude Code Jobs spawning successfully with Bedrock inference
  • Bedrock inference validated with both Claude Sonnet 4.6 and Claude Opus 4.6
  • Dynamic tool discovery — agents receiving the full 153+ tool surface from the Postgres registry

The MCP tool ecosystem from Phase 1 carries forward unchanged. The same tool calls that worked in Phase 1 are the same calls the dispatched agent executes in Phase 2. Only the trigger changes — from a human pasting an alert to Alertmanager posting a webhook.

For the full agent harness architecture, see Agent Harness Pattern (ADR-005).

Why No Runbook Library Is Required

Traditional AIOps products require customers to author and maintain a library of runbooks — step-by-step procedures that the automation follows mechanically. InfrastructureAI rejects this approach (ADR-003).

The LLM already knows standard infrastructure operational procedures. Disk pressure remediation, OOM investigation, pod crashloop diagnosis, certificate expiry handling, log accumulation cleanup — these are well-documented procedures that exist extensively in the model’s training data. Asking customers to rewrite what the model already knows as structured runbooks is wasted effort.

FFO provides the environment-specific context the LLM cannot know from training. What is running on this instance. What depends on this service. What happened last time this alert fired. What classification level governs this workload. What change window is currently active. This is the context that makes the difference between a generic remediation and a correct one — and it lives in the knowledge graph, not in a runbook.

Human-authored guidance is required only for two categories:

  • Custom application failure modes that are unique to the customer’s software
  • Organizational policy constraints that override standard operational practice

Everything else — the vast majority of infrastructure operational procedures — the LLM handles from its training, informed by the real-time context in FFO.

Runs Sovereign at Every Classification Level

InfrastructureAI inference never traverses the public internet for classified workloads. The same autonomous SRE capability operates at every classification level, with the inference backend selected to match sovereignty requirements.

| IL Level | Inference Path | Notes |
|---|---|---|
| IL2-IL4 (CUI) | Claude via AWS Bedrock VPC PrivateLink | Sovereign — no public internet, no Anthropic visibility into CUI |
| IL5 | Claude via Bedrock GovCloud | FedRAMP High / DoD SRG |
| IL6 | Air-gapped vLLM on VitroAI | US-origin models only, agency RMF |
| Tactical Edge | Ollama on Ampere ARM64 | Pre-certified TFO playbooks, disconnected operation |
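Backend selection reduces to a lookup keyed on classification level. The backend identifiers below are placeholders, not real endpoints; the design point is that selection fails closed rather than defaulting to any backend.

```python
# Classification level -> inference backend, per the table above.
# Backend identifiers are placeholders, not real endpoints.
INFERENCE_BACKENDS = {
    "IL2-IL4": "bedrock-vpc-privatelink",
    "IL5": "bedrock-govcloud",
    "IL6": "airgapped-vllm-vitroai",
    "tactical-edge": "ollama-ampere-arm64",
}

def select_backend(il_level: str) -> str:
    """Fail closed: an unknown level gets no backend rather than a default."""
    if il_level not in INFERENCE_BACKENDS:
        raise ValueError(f"no sovereign inference path defined for {il_level}")
    return INFERENCE_BACKENDS[il_level]

print(select_backend("IL6"))
```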

See Sovereign Inference for the full architecture at each classification level.