Monitoring and Alert-Driven Dispatch

How Grafana, Prometheus, and Alertmanager integrate with the Agent Harness Dispatch Controller to close the autonomous SRE loop — from metric to remediation with no human trigger.

The monitoring stack is the Trigger Layer of the Agent Harness (ADR-005). Prometheus scrapes infrastructure metrics. Alertmanager evaluates alert rules and routes firing alerts to the Dispatch Controller as webhook payloads. The Dispatch Controller translates each alert into a governed, context-enriched Claude Code Job that investigates and remediates autonomously.

This page covers the monitoring stack deployment, how alerts flow into the dispatch loop, and how the system prevents the classes of failures that triggered the deployment.

Architecture

```mermaid
graph LR
    Kubelet[Kubelet<br/>Volume Stats] --> Prom[Prometheus<br/>50Gi · 15d retention]
    NodeExp[Node Exporter<br/>CPU · Memory · Disk] --> Prom
    KSM[kube-state-metrics<br/>Pod · Deployment · PVC] --> Prom
    Prom --> AM[Alertmanager<br/>Route + Inhibit]
    AM -->|webhook POST| DC[Dispatch Controller<br/>/alertmanager endpoint]
    DC --> OPA[OPA Risk Eval]
    DC --> FFO[FFO Context]
    DC --> Job[Claude Code Job<br/>K8s Job]
    Prom --> Grafana[Grafana 12.4.2<br/>grafana.vitro.lan]

    style Kubelet fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style NodeExp fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style KSM fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style Prom fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style AM fill:#c53030,stroke:#fc8181,color:#fff
    style DC fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style OPA fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style FFO fill:#2c7a7b,stroke:#38b2ac,color:#e2e8f0
    style Job fill:#2b6cb0,stroke:#4299e1,color:#fff
    style Grafana fill:#2d3748,stroke:#4299e1,color:#e2e8f0
```

Stack Components

The monitoring stack runs in the monitoring namespace on the Frontier Management Cluster (FMC), deployed via the kube-prometheus-stack Helm chart.

| Component | Version | Storage | Purpose |
|---|---|---|---|
| Prometheus | | 50Gi (ceph-rbd), 15-day retention | Metrics collection, PromQL evaluation, alert rule evaluation |
| Alertmanager | | 5Gi (ceph-rbd) | Alert routing, grouping, inhibition, webhook dispatch |
| Grafana | 12.4.2 | 10Gi (ceph-rbd) | Dashboards, visualization, manual exploration |
| Node Exporter | | | Host-level metrics (CPU, memory, disk, network) |
| kube-state-metrics | | | Kubernetes object metrics (pods, deployments, PVCs, nodes) |
| Prometheus Operator | | | Manages Prometheus and Alertmanager via CRDs |

Access: Grafana is exposed via Traefik IngressRoute at http://grafana.vitro.lan.
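A Traefik IngressRoute for this exposure might be shaped like the sketch below. The service name (kube-prom-stack-grafana), entrypoint name (web), and port are assumptions based on kube-prometheus-stack defaults, not taken from the deployed manifests:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  entryPoints:
    - web                                   # HTTP entrypoint; name is an assumption
  routes:
    - match: Host(`grafana.vitro.lan`)
      kind: Rule
      services:
        - name: kube-prom-stack-grafana     # assumed Helm release service name
          port: 80
```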

Alert Flow: Metric to Remediation

The alert flow is fully autonomous for LOW-risk operations. No human initiates the investigation.

1. Prometheus Evaluates Alert Rules

Prometheus evaluates PrometheusRule CRDs. Example — PVC usage alerts:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-usage-alerts
  namespace: monitoring
spec:
  groups:
    - name: pvc-usage
      rules:
        - alert: PVCUsageHigh
          expr: |
            (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 75
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full"
        - alert: PVCUsageCritical
          expr: |
            (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full"
```

2. Alertmanager Routes to Dispatch Controller

When an alert fires, Alertmanager groups it by namespace and alertname, then routes it to the sre-dispatch receiver:

```yaml
receivers:
  - name: sre-dispatch
    webhook_configs:
      - url: http://sre-dispatch.f3iai.svc.cluster.local/alertmanager
        send_resolved: false

route:
  group_by: [namespace, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - alertname =~ "PVCUsageHigh|PVCUsageCritical"
      receiver: sre-dispatch
```

send_resolved: false — the Dispatch Controller only acts on firing alerts, not on resolved notifications.
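With send_resolved disabled, resolved notifications should never arrive, but a receiver can still filter defensively on each alert's status field. A minimal sketch of that filtering (the payload shape follows Alertmanager's version-4 webhook format; firing_alerts is a hypothetical helper, not the controller's actual code):

```python
def firing_alerts(payload: dict) -> list[dict]:
    """Return only the firing alerts from an Alertmanager webhook payload.

    Alertmanager posts {"version": "4", "status": ..., "alerts": [...]};
    each alert carries its own "status" of "firing" or "resolved".
    """
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]

payload = {
    "version": "4",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "PVCUsageHigh"}},
        {"status": "resolved", "labels": {"alertname": "PVCUsageHigh"}},
    ],
}
print(len(firing_alerts(payload)))  # 1
```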

3. Dispatch Controller Translates and Dispatches

The Dispatch Controller’s /alertmanager endpoint receives the Alertmanager webhook payload and translates each firing alert into an SREEvent:

| Alertmanager Field | SREEvent Field | Translation |
|---|---|---|
| labels.severity | severity | critical → high, warning → medium, info → low |
| labels.persistentvolumeclaim / pod / node / instance | resource_id | First non-empty label wins |
| Alert label content | resource_type | Heuristic pattern matching (PVC → k8s_pvc, pod → k8s_pod, ceph → ceph, etc.) |
| annotations.summary | description | Prefixed with [namespace] if available |
| "grafana" | source | All Alertmanager alerts are sourced as Grafana |
| Full alert object | raw_payload | Preserved for audit trail |

Each translated event then flows through the standard dispatch pipeline: OPA risk evaluation → FFO context injection → Jinja2 prompt rendering → K8s Job creation.
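A minimal sketch of the translation rules described above, assuming a dict-shaped SREEvent; the exact field names and heuristics are simplified from this page's description, not copied from the controller's source:

```python
SEVERITY_MAP = {"critical": "high", "warning": "medium", "info": "low"}
RESOURCE_ID_LABELS = ["persistentvolumeclaim", "pod", "node", "instance"]

def to_sre_event(alert: dict) -> dict:
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})

    # resource_id: first non-empty label wins
    resource_id = next(
        (labels[k] for k in RESOURCE_ID_LABELS if labels.get(k)), "unknown"
    )

    # resource_type: heuristic pattern matching on label content
    blob = " ".join(labels.values()).lower()
    if labels.get("persistentvolumeclaim") or "pvc" in blob:
        resource_type = "k8s_pvc"
    elif "ceph" in blob:
        resource_type = "ceph"
    elif labels.get("pod"):
        resource_type = "k8s_pod"
    else:
        resource_type = "unknown"

    # description: summary, prefixed with [namespace] if available
    description = annotations.get("summary", "")
    if labels.get("namespace"):
        description = f"[{labels['namespace']}] {description}"

    return {
        "severity": SEVERITY_MAP.get(labels.get("severity", ""), "low"),
        "resource_id": resource_id,
        "resource_type": resource_type,
        "description": description,
        "source": "grafana",   # all Alertmanager alerts sourced as grafana
        "raw_payload": alert,  # preserved for audit trail
    }
```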

4. Claude Code Investigates and Remediates

The spawned Claude Code Job receives the full MCP tool surface and a severity-appropriate prompt. For a PVC usage alert, the agent might:

  1. Query Prometheus via Grafana MCP for the usage trend
  2. Identify what’s consuming space (WAL files, logs, data growth)
  3. Remediate if LOW risk (clean WAL, expand PVC, rotate logs)
  4. Create a Jira ticket documenting the investigation and remediation
  5. Write the outcome back to FFO for institutional memory

Severity Mapping

Alertmanager’s severity labels do not map 1:1 to dispatch behavior. The OPA risk policy evaluates the full context — resource type, blast radius, and organizational constraints — before determining the effective severity.

| Alertmanager Severity | Initial SREEvent Severity | OPA May Elevate To | Dispatch Behavior |
|---|---|---|---|
| info | LOW | MEDIUM or HIGH | Autonomous remediation (if OPA confirms LOW) |
| warning | MEDIUM | HIGH | Create merge request, do not execute directly |
| critical | HIGH | CRITICAL (blocked) | Investigate only, document in Jira |

The Alertmanager critical label maps to SREEvent HIGH, not CRITICAL. This is intentional — CRITICAL in the dispatch system means “autonomous dispatch prohibited entirely.” Only the OPA policy can elevate to CRITICAL based on destructive keyword detection.
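The mapping and its resulting dispatch behavior, restated as a minimal sketch (a simplification of the policy outcome described above, not the OPA policy itself):

```python
# Alertmanager label -> initial SREEvent severity. critical deliberately maps
# to HIGH: CRITICAL is reserved for OPA's destructive-keyword detection.
INITIAL_SEVERITY = {"info": "LOW", "warning": "MEDIUM", "critical": "HIGH"}

# Dispatch behavior per effective (post-OPA) severity.
DISPATCH_BEHAVIOR = {
    "LOW": "autonomous remediation",
    "MEDIUM": "create merge request, do not execute directly",
    "HIGH": "investigate only, document in Jira",
    "CRITICAL": "autonomous dispatch prohibited",
}

print(DISPATCH_BEHAVIOR[INITIAL_SEVERITY["critical"]])
# investigate only, document in Jira
```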

Built-in Dashboards

The kube-prometheus-stack deploys 28 pre-configured Grafana dashboards:

| Dashboard | What It Shows |
|---|---|
| Kubernetes / Persistent Volumes | PVC usage, capacity, available bytes — the dashboard that would have caught the TypeDB disk-full incident |
| Kubernetes / Compute Resources / Namespace (Pods) | CPU and memory usage by pod in each namespace |
| Kubernetes / Compute Resources / Cluster | Cluster-wide resource utilization |
| Node Exporter / Nodes | Host-level CPU, memory, disk, network for texas-dell-04 |
| Kubernetes / Kubelet | Kubelet health, volume operations, pod lifecycle |
| Prometheus / Overview | Prometheus scrape performance, rule evaluation duration |
| Alertmanager / Overview | Alert groups, notification rate, silences |

Network Policies

The monitoring namespace has a default-deny-ingress policy. The following policies allow the required traffic:

| Policy | From | To | Port |
|---|---|---|---|
| allow-grafana-ingress | Any namespace | Grafana pods | 3000 |
| allow-prometheus-ingress | Any namespace | Prometheus pods | 9090 |
| allow-alertmanager-ingress | Any namespace | Alertmanager pods | 9093 |
| allow-prometheus-scrape-egress | Prometheus pods | Any (scrape targets) | All |
| allow-grafana-to-prometheus | Grafana pods | Any (data sources) | All |
| allow-alertmanager-to-dispatch | Alertmanager (monitoring ns) | sre-dispatch (f3iai ns) | 8080 |
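As one example, the allow-alertmanager-to-dispatch egress rule might be shaped like the sketch below; the pod and namespace label selectors are assumptions, since the table above records only the endpoints and port:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alertmanager-to-dispatch
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alertmanager    # assumed pod label
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: f3iai
          podSelector:
            matchLabels:
              app: sre-dispatch               # assumed pod label
      ports:
        - protocol: TCP
          port: 8080
```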

Incident That Drove This Deployment

On April 1, 2026, the TypeDB pod in the f3iai namespace entered CrashLoopBackOff. Investigation revealed two cascading issues:

  1. Rook-Ceph CSI driver pods were scaled to 0 — the kubelet could not mount RBD-backed PVCs. All pods depending on Ceph RBD storage were stuck in ContainerCreating.

  2. TypeDB’s 20Gi PVC was 100% full — 15GB of WAL files (962 files that had never been compacted because TypeDB kept crashing before it could checkpoint), 4.2GB checkpoint, 4.1GB storage. When TypeDB tried to recover, it needed temporary disk space for WAL replay and had none.

Resolution

  1. Scaled Rook-Ceph operator back to 1 replica — it reconciled the CSI DaemonSets and provisioners
  2. Removed an empty/corrupt checkpoint directory (1775044273271069) that was causing a panic on startup
  3. Expanded the PVC from 20Gi to 40Gi using online Ceph RBD volume expansion
  4. TypeDB replayed 962 WAL files from the good checkpoint — zero data loss

Prevention

The monitoring stack prevents this class of failure:

  • PVCUsageHigh fires at 75% → the Dispatch Controller investigates and can expand the volume or clean up space before it fills
  • PVCUsageCritical fires at 90% → immediate escalation
  • Grafana dashboards provide visual PVC monitoring for manual review
  • Prometheus retention (15 days) provides historical trend data for capacity planning

Without this monitoring stack, the PVC filled silently over weeks until TypeDB crashed. With it, the autonomous SRE agent would have received a warning alert at 75% and taken action — days before the crash.

Validation

The end-to-end flow was validated on April 2, 2026:

  1. Prometheus confirmed scraping kubelet_volume_stats_available_bytes for all PVCs in f3iai
  2. Alertmanager confirmed routing PVCUsageHigh and PVCUsageCritical alerts to the sre-dispatch webhook receiver
  3. Dispatch Controller /alertmanager endpoint confirmed translating Alertmanager webhook payloads into SREEvent objects
  4. Test dispatch confirmed: alert webhook → OPA risk evaluation → Claude Code Job created and running

Adding New Alert Routes

To route additional alerts to the Dispatch Controller, add matchers to the Alertmanager config:

```yaml
routes:
  - matchers:
      - alertname =~ "PVCUsageHigh|PVCUsageCritical|NodeDiskPressure|PodCrashLooping"
    receiver: sre-dispatch
```

Update via Helm:

```bash
helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f monitoring-values.yaml \
  -f alertmanager-config-patch.yaml
```

New PrometheusRule CRDs are automatically picked up by Prometheus — no restart required.