Agent Architecture — TrailbossAI, Posses, and Autonomous Remediation
How InfrastructureAI uses two-layer orchestration with TrailbossAI (LangGraph) and Posses (CrewAI) to autonomously prevent infrastructure failures through four specialized agents.
InfrastructureAI uses a two-layer orchestration model. TrailbossAI (LangGraph) is the outer orchestrator — it manages the full mission lifecycle from alert ingestion through remediation and post-mortem. Within each mission, TrailbossAI dispatches a Posse (CrewAI) — a coordinated team of domain-specific agents that execute the actual investigation, analysis, and remediation work. This separation means the mission lifecycle logic never changes regardless of what the agents inside the Posse are doing, and new agent capabilities can be added to Posses without modifying the orchestration layer.
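The two-layer split can be sketched in plain Python. All class and method names below (`Trailboss`, `Posse`, `run_mission`) are illustrative assumptions, not actual TrailbossAI or CrewAI identifiers:

```python
# Sketch of the two-layer orchestration: the outer Trailboss owns the
# mission lifecycle; the inner Posse owns the domain agents. Adding a new
# agent changes only the Posse, never the lifecycle logic.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Posse:
    # Ordered agent callables; each takes and returns a shared context dict.
    agents: list[Callable[[dict], dict]] = field(default_factory=list)

    def run(self, context: dict) -> dict:
        for agent in self.agents:
            context = agent(context)
        return context

class Trailboss:
    """Outer orchestrator: lifecycle logic is fixed regardless of agents."""
    def run_mission(self, alert: dict, posse: Posse) -> dict:
        context = {"alert": alert, "state": "ASSESSING"}
        context = posse.run(context)      # inner layer does the actual work
        context["state"] = "COMPLETED"
        return context

# New agent capabilities extend the Posse without touching Trailboss.
posse = Posse(agents=[lambda c: {**c, "validated": True}])
result = Trailboss().run_mission({"id": "alert-1"}, posse)
```

The design choice being illustrated: `Trailboss.run_mission` never inspects what the agents do, so swapping or extending the agent roster cannot break the lifecycle.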
The Four Agents
Every Posse is composed of four agents, each with a distinct role and its own FFO read and write patterns.
| Agent | Role | FFO Reads | FFO Writes |
|---|---|---|---|
| Marshal | Policy enforcement — validates authorization, classification, and change windows before any action proceeds | Boundaries, classifications, clearances, policies | Validation decisions, policy exceptions |
| Scout | Discovery — gathers current infrastructure state from live systems via MCP tools | Minimal (checks what is already known to avoid redundant discovery) | Infrastructure state, findings, workload inventory |
| Sage | Analysis and planning — cross-domain reasoning, root cause identification, remediation planning | Everything — cross-domain graph traversal across all entity and relation types | Analysis results, remediation plans, risk assessments |
| Wrangler | Execution — applies changes, verifies results, writes outcomes and post-mortem | Plans from Sage, current state from Scout | Implementation status, evidence, change records, post-mortem |
The agent ordering is intentional. Marshal validates policy constraints before Scout touches any live system. Scout gathers state before Sage reasons about it. Sage produces a plan before Wrangler executes anything. This sequencing is enforced by the Posse orchestration, not by convention.
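One way the orchestration layer could enforce that sequencing (rather than trusting convention) is a pipeline that refuses to run a stage until its prerequisite output exists in the shared context. The stage and key names here are hypothetical, not the actual Posse implementation:

```python
# Hypothetical sketch: each stage declares what it produces and what it
# requires, so Wrangler structurally cannot run before Sage's plan exists.
PIPELINE = [
    # (agent, produces, requires)
    ("marshal",  "validation", None),
    ("scout",    "state",      "validation"),
    ("sage",     "plan",       "state"),
    ("wrangler", "outcome",    "plan"),
]

def run_posse(agents: dict, context: dict) -> dict:
    for name, produces, requires in PIPELINE:
        if requires is not None and requires not in context:
            raise RuntimeError(f"{name} cannot run before '{requires}' exists")
        context[produces] = agents[name](context)
    return context

# Stub agents standing in for the real Marshal/Scout/Sage/Wrangler work.
agents = {
    "marshal":  lambda c: "within policy",
    "scout":    lambda c: {"cpu": 0.97},
    "sage":     lambda c: "restart service",
    "wrangler": lambda c: "verified",
}
result = run_posse(agents, {"alert": "cpu-high"})
```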
Mission Lifecycle State Machine
Every mission follows a deterministic state machine. TrailbossAI advances the mission through states based on agent outputs and policy gates.
- PENDING — Alert received, WorkItem created in FFO. Mission queued for processing.
- ASSESSING — Scout gathers current state, Sage analyzes root cause and produces a remediation plan. Marshal validates that the plan falls within policy bounds.
- AWAITING APPROVAL — For medium- and high-risk operations, the mission pauses at a human approval gate. Low-risk operations skip this state entirely.
- EXECUTING — Wrangler applies the remediation plan using MCP tools against live infrastructure.
- VERIFYING — Wrangler confirms the remediation succeeded by re-checking the metrics and conditions that triggered the original alert. If verification fails, the mission returns to ASSESSING for re-analysis.
- COMPLETED — Remediation verified. Post-mortem written to FFO. Mission closed.
- FAILED — Mission could not proceed due to policy rejection or unresolvable conditions. Escalation to human operator.
- ROLLBACK — Verification detected that the remediation made things worse. Wrangler reverses changes and escalates.
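The deterministic state machine above can be expressed as an explicit transition table. Event names here are assumptions for illustration; only the states come from the lifecycle described above:

```python
# Illustrative transition table for the mission lifecycle. Any
# (state, event) pair not listed is an illegal transition.
TRANSITIONS = {
    ("PENDING", "start"):                    "ASSESSING",
    ("ASSESSING", "plan_low_risk"):          "EXECUTING",   # skips approval
    ("ASSESSING", "plan_needs_approval"):    "AWAITING_APPROVAL",
    ("ASSESSING", "policy_rejected"):        "FAILED",
    ("AWAITING_APPROVAL", "approved"):       "EXECUTING",
    ("AWAITING_APPROVAL", "rejected"):       "FAILED",
    ("EXECUTING", "applied"):                "VERIFYING",
    ("VERIFYING", "verified"):               "COMPLETED",
    ("VERIFYING", "not_resolved"):           "ASSESSING",   # re-analysis loop
    ("VERIFYING", "made_worse"):             "ROLLBACK",
}

def advance(state: str, event: str) -> str:
    """Advance the mission; reject anything outside the table."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event!r} from {state!r}")
    return TRANSITIONS[key]
```

The point of the table form: there is no code path by which a mission reaches EXECUTING without passing through ASSESSING, which is what makes the lifecycle deterministic.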
FFO as Shared Context
Every agent reads from FFO before acting and writes results back after. This is how the system accumulates operational knowledge without humans writing runbooks.
Agents do not cache state locally. When Scout needs to know what is running on an instance, it queries FFO for the current state. When Sage needs to understand what depends on a service, it traverses FFO relationships. When Wrangler finishes a remediation, it writes the outcome, the evidence, and the post-mortem back to FFO.
The post-mortem that Wrangler generates after each remediation *is* the runbook for the next occurrence. When the same alert fires again next month, Sage will find the previous post-mortem in FFO — what was tried, what worked, what the environment looked like at the time. This accumulated knowledge is stored in the graph, not in a document someone has to find and read. The system gets better at handling each class of problem every time it encounters one.
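The read-before-act pattern reduces to a simple contract: Wrangler writes post-mortems keyed by alert class, and Sage reads them on the next occurrence. The `FFO` interface shown here is a minimal hypothetical stand-in, not the real FFO API:

```python
# Minimal sketch of FFO as shared context for post-mortems. Wrangler
# appends a record after each remediation; Sage queries prior records
# instead of a human-maintained runbook.
class FFO:
    def __init__(self) -> None:
        self._postmortems: dict[str, list[dict]] = {}

    def write_postmortem(self, alert_class: str, record: dict) -> None:
        """Wrangler's write path: accumulate knowledge per alert class."""
        self._postmortems.setdefault(alert_class, []).append(record)

    def prior_postmortems(self, alert_class: str) -> list[dict]:
        """Sage's read path: everything previously tried for this class."""
        return self._postmortems.get(alert_class, [])

ffo = FFO()
# First occurrence: Wrangler records what was tried and whether it worked.
ffo.write_postmortem("disk-full", {"remediation": "rotate logs", "worked": True})
# Next occurrence: Sage finds the accumulated knowledge in the graph.
history = ffo.prior_postmortems("disk-full")
```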
Human-on-the-Loop
InfrastructureAI operates with a human-on-the-loop model, not human-in-the-loop. The distinction matters: agents execute autonomously within policy bounds, and humans approve at gates rather than directing every action.
Risk-based routing determines which operations require human approval:
- Low risk — Log cleanup, routine restarts of non-production services, cache invalidation, temporary scaling adjustments. These execute autonomously. The SRE reads the post-mortem in the morning.
- Medium risk — Production service restarts, configuration changes, scaling operations that affect cost. These pause at AWAITING APPROVAL and notify the on-call engineer via OutpostAI.
- High risk — Production deployments, data-destructive actions, changes to security controls, cross-boundary operations. These require explicit approval from an authorized operator via OutpostAI before proceeding.
Marshal enforces these boundaries before any Posse begins execution. The risk classification comes from FFO — the entity’s classification level, its dependency graph, the current change window status, and organizational policy constraints all factor into the risk determination. This is not a static runbook mapping alerts to risk levels. It is a dynamic assessment based on the current state of the infrastructure graph.
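A dynamic assessment like this could look roughly as follows. The field names, thresholds, and routing targets are illustrative assumptions, not Marshal's actual logic — the point is only that risk is computed from graph state, not looked up in a static alert-to-risk map:

```python
# Hedged sketch of risk-based routing: the level is derived from the
# entity's graph state (classification, dependents, change window, flags),
# then mapped to the lifecycle gate it must pass through.
def classify_risk(entity: dict, in_change_window: bool) -> str:
    # Data-destructive actions and security-control changes are always high.
    if entity.get("data_destructive") or entity.get("security_control"):
        return "high"
    # Production entities, or any change outside an approved window,
    # escalate further when many things depend on them.
    if entity["classification"] == "production" or not in_change_window:
        return "high" if entity["dependents"] >= 10 else "medium"
    return "low"

def route(risk: str) -> str:
    """Map risk level to the mission state the operation enters next."""
    return {
        "low":    "EXECUTING",          # autonomous; SRE reads the post-mortem
        "medium": "AWAITING_APPROVAL",  # notify on-call via OutpostAI
        "high":   "AWAITING_APPROVAL",  # explicit operator approval required
    }[risk]
```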