Monitoring and Alert-Driven Dispatch

How Grafana, Prometheus, and Alertmanager integrate with the Agent Harness Dispatch Controller to close the autonomous SRE loop — from metric to remediation with no human trigger.

The monitoring stack is the Trigger Layer of the Agent Harness (ADR-005). Prometheus scrapes infrastructure metrics. Alertmanager evaluates alert rules and routes firing alerts to the Dispatch Controller as webhook payloads. The Dispatch Controller translates each alert into a governed, context-enriched Claude Code Job that investigates and remediates autonomously.

This page covers the monitoring stack deployment, how alerts flow into the dispatch loop, and how the system prevents the classes of failures that triggered the deployment.

Architecture

```mermaid
graph LR
    Kubelet[Kubelet<br/>Volume Stats] --> Prom[Prometheus<br/>50Gi · 15d retention]
    NodeExp[Node Exporter<br/>CPU · Memory · Disk] --> Prom
    KSM[kube-state-metrics<br/>Pod · Deployment · PVC] --> Prom
    Prom --> AM[Alertmanager<br/>Route + Inhibit]
    AM -->|webhook POST| DC[Dispatch Controller<br/>/alertmanager endpoint]
    DC --> OPA[OPA Risk Eval]
    DC --> FFO[FFO Context]
    DC --> Job[Claude Code Job<br/>K8s Job]
    Prom --> Grafana[Grafana 12.4.2<br/>grafana.vitro.lan]

    style Kubelet fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style NodeExp fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style KSM fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style Prom fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style AM fill:#c53030,stroke:#fc8181,color:#fff
    style DC fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style OPA fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style FFO fill:#2c7a7b,stroke:#38b2ac,color:#e2e8f0
    style Job fill:#2b6cb0,stroke:#4299e1,color:#fff
    style Grafana fill:#2d3748,stroke:#4299e1,color:#e2e8f0
```

Stack Components

The monitoring stack runs in the monitoring namespace on the Frontier Management Cluster (FMC), deployed via the kube-prometheus-stack Helm chart.

| Component | Version | Storage | Purpose |
|---|---|---|---|
| Prometheus | | 50Gi (ceph-rbd), 15-day retention | Metrics collection, PromQL evaluation, alert rule evaluation |
| Alertmanager | | 5Gi (ceph-rbd) | Alert routing, grouping, inhibition, webhook dispatch |
| Grafana | 12.4.2 | 10Gi (ceph-rbd) | Dashboards, visualization, manual exploration |
| Node Exporter | | | Host-level metrics (CPU, memory, disk, network) |
| kube-state-metrics | | | Kubernetes object metrics (pods, deployments, PVCs, nodes) |
| Prometheus Operator | | | Manages Prometheus and Alertmanager via CRDs |

Access: Grafana is exposed via Traefik IngressRoute at http://grafana.vitro.lan.
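A Traefik IngressRoute for this exposure might be shaped like the sketch below. The service name (kube-prom-stack-grafana), entrypoint name (web), and port are assumptions based on kube-prometheus-stack defaults, not taken from the deployed manifests:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  entryPoints:
    - web                                   # HTTP entrypoint; name is an assumption
  routes:
    - match: Host(`grafana.vitro.lan`)
      kind: Rule
      services:
        - name: kube-prom-stack-grafana     # assumed Helm release service name
          port: 80
```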

Alert Flow: Metric to Remediation

The alert flow is fully autonomous for LOW-risk operations. No human initiates the investigation.

1. Prometheus Evaluates Alert Rules

Prometheus evaluates PrometheusRule CRDs. Example — PVC usage alerts:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-usage-alerts
  namespace: monitoring
spec:
  groups:
    - name: pvc-usage
      rules:
        - alert: PVCUsageHigh
          expr: |
            (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 75
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full"
        - alert: PVCUsageCritical
          expr: |
            (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full"
```

2. Alertmanager Routes to Dispatch Controller

When an alert fires, Alertmanager groups it by namespace and alertname, then routes it to the sre-dispatch receiver:

```yaml
receivers:
  - name: sre-dispatch
    webhook_configs:
      - url: http://sre-dispatch.f3iai.svc.cluster.local/alertmanager
        send_resolved: false

route:
  group_by: [namespace, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - alertname =~ "PVCUsageHigh|PVCUsageCritical"
      receiver: sre-dispatch
```

send_resolved: false — the Dispatch Controller only acts on firing alerts, not on resolved notifications.
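With send_resolved disabled, resolved notifications should never arrive, but a receiver can still filter defensively on each alert's status field. A minimal sketch of that filtering (the payload shape follows Alertmanager's version-4 webhook format; firing_alerts is a hypothetical helper, not the controller's actual code):

```python
def firing_alerts(payload: dict) -> list[dict]:
    """Return only the firing alerts from an Alertmanager webhook payload.

    Alertmanager posts {"version": "4", "status": ..., "alerts": [...]};
    each alert carries its own "status" of "firing" or "resolved".
    """
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]

payload = {
    "version": "4",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "PVCUsageHigh"}},
        {"status": "resolved", "labels": {"alertname": "PVCUsageHigh"}},
    ],
}
print(len(firing_alerts(payload)))  # 1
```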

3. Dispatch Controller Translates and Dispatches

The Dispatch Controller’s /alertmanager endpoint receives the Alertmanager webhook payload and translates each firing alert into an SREEvent:

| Alertmanager Field | SREEvent Field | Translation |
|---|---|---|
| labels.severity | severity | critical → high, warning → medium, info → low |
| labels.persistentvolumeclaim / pod / node / instance | resource_id | First non-empty label wins |
| Alert label content | resource_type | Heuristic pattern matching (PVC → k8s_pvc, pod → k8s_pod, ceph → ceph, etc.) |
| annotations.summary | description | Prefixed with [namespace] if available |
| "grafana" | source | All Alertmanager alerts are sourced as Grafana |
| Full alert object | raw_payload | Preserved for audit trail |

Each translated event then flows through the standard dispatch pipeline: OPA risk evaluation → FFO context injection → Jinja2 prompt rendering → K8s Job creation.
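A minimal sketch of the translation rules described above, assuming a dict-shaped SREEvent; the exact field names and heuristics are simplified from this page's description, not copied from the controller's source:

```python
SEVERITY_MAP = {"critical": "high", "warning": "medium", "info": "low"}
RESOURCE_ID_LABELS = ["persistentvolumeclaim", "pod", "node", "instance"]

def to_sre_event(alert: dict) -> dict:
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})

    # resource_id: first non-empty label wins
    resource_id = next(
        (labels[k] for k in RESOURCE_ID_LABELS if labels.get(k)), "unknown"
    )

    # resource_type: heuristic pattern matching on label content
    blob = " ".join(labels.values()).lower()
    if labels.get("persistentvolumeclaim") or "pvc" in blob:
        resource_type = "k8s_pvc"
    elif "ceph" in blob:
        resource_type = "ceph"
    elif labels.get("pod"):
        resource_type = "k8s_pod"
    else:
        resource_type = "unknown"

    # description: summary, prefixed with [namespace] if available
    description = annotations.get("summary", "")
    if labels.get("namespace"):
        description = f"[{labels['namespace']}] {description}"

    return {
        "severity": SEVERITY_MAP.get(labels.get("severity", ""), "low"),
        "resource_id": resource_id,
        "resource_type": resource_type,
        "description": description,
        "source": "grafana",   # all Alertmanager alerts sourced as grafana
        "raw_payload": alert,  # preserved for audit trail
    }
```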

4. Claude Code Investigates and Remediates

The spawned Claude Code Job receives the full MCP tool surface and a severity-appropriate prompt. For a PVC usage alert, the agent might:

  1. Query Prometheus via Grafana MCP for the usage trend
  2. Identify what’s consuming space (WAL files, logs, data growth)
  3. Remediate if LOW risk (clean WAL, expand PVC, rotate logs)
  4. Create a Jira ticket documenting the investigation and remediation
  5. Write the outcome back to FFO for institutional memory

Severity Mapping

Alertmanager’s severity labels do not map 1:1 to dispatch behavior. The OPA risk policy evaluates the full context — resource type, blast radius, and organizational constraints — before determining the effective severity.

| Alertmanager Severity | Initial SREEvent Severity | OPA May Elevate To | Dispatch Behavior |
|---|---|---|---|
| info | LOW | MEDIUM or HIGH | Autonomous remediation (if OPA confirms LOW) |
| warning | MEDIUM | HIGH | Create merge request, do not execute directly |
| critical | HIGH | CRITICAL (blocked) | Investigate only, document in Jira |

The Alertmanager critical label maps to SREEvent HIGH, not CRITICAL. This is intentional — CRITICAL in the dispatch system means “autonomous dispatch prohibited entirely.” Only the OPA policy can elevate to CRITICAL based on destructive keyword detection.
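The mapping and its resulting dispatch behavior, restated as a minimal sketch (a simplification of the policy outcome described above, not the OPA policy itself):

```python
# Alertmanager label -> initial SREEvent severity. critical deliberately maps
# to HIGH: CRITICAL is reserved for OPA's destructive-keyword detection.
INITIAL_SEVERITY = {"info": "LOW", "warning": "MEDIUM", "critical": "HIGH"}

# Dispatch behavior per effective (post-OPA) severity.
DISPATCH_BEHAVIOR = {
    "LOW": "autonomous remediation",
    "MEDIUM": "create merge request, do not execute directly",
    "HIGH": "investigate only, document in Jira",
    "CRITICAL": "autonomous dispatch prohibited",
}

print(DISPATCH_BEHAVIOR[INITIAL_SEVERITY["critical"]])
# investigate only, document in Jira
```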

Built-in Dashboards

The kube-prometheus-stack deploys 28 pre-configured Grafana dashboards:

| Dashboard | What It Shows |
|---|---|
| Kubernetes / Persistent Volumes | PVC usage, capacity, available bytes — the dashboard that would have caught the TypeDB disk-full incident |
| Kubernetes / Compute Resources / Namespace (Pods) | CPU and memory usage by pod in each namespace |
| Kubernetes / Compute Resources / Cluster | Cluster-wide resource utilization |
| Node Exporter / Nodes | Host-level CPU, memory, disk, network for texas-dell-04 |
| Kubernetes / Kubelet | Kubelet health, volume operations, pod lifecycle |
| Prometheus / Overview | Prometheus scrape performance, rule evaluation duration |
| Alertmanager / Overview | Alert groups, notification rate, silences |

Network Policies

The monitoring namespace has a default-deny-ingress policy. The following policies allow the required traffic:

| Policy | From | To | Port |
|---|---|---|---|
| allow-grafana-ingress | Any namespace | Grafana pods | 3000 |
| allow-prometheus-ingress | Any namespace | Prometheus pods | 9090 |
| allow-alertmanager-ingress | Any namespace | Alertmanager pods | 9093 |
| allow-prometheus-scrape-egress | Prometheus pods | Any (scrape targets) | All |
| allow-grafana-to-prometheus | Grafana pods | Any (data sources) | All |
| allow-alertmanager-to-dispatch | Alertmanager (monitoring ns) | sre-dispatch (f3iai ns) | 8080 |
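As one example, the allow-alertmanager-to-dispatch egress rule might be shaped like the sketch below; the pod and namespace label selectors are assumptions, since the table above records only the endpoints and port:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alertmanager-to-dispatch
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alertmanager    # assumed pod label
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: f3iai
          podSelector:
            matchLabels:
              app: sre-dispatch               # assumed pod label
      ports:
        - protocol: TCP
          port: 8080
```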

Incident That Drove This Deployment

On April 1, 2026, the TypeDB pod in the f3iai namespace entered CrashLoopBackOff. Investigation revealed two cascading issues:

  1. Rook-Ceph CSI driver pods were scaled to 0 — the kubelet could not mount RBD-backed PVCs. All pods depending on Ceph RBD storage were stuck in ContainerCreating.

  2. TypeDB’s 20Gi PVC was 100% full — 15GB of WAL files (962 files that had never been compacted because TypeDB kept crashing before it could checkpoint), 4.2GB checkpoint, 4.1GB storage. When TypeDB tried to recover, it needed temporary disk space for WAL replay and had none.

Resolution

  1. Scaled Rook-Ceph operator back to 1 replica — it reconciled the CSI DaemonSets and provisioners
  2. Removed an empty/corrupt checkpoint directory (1775044273271069) that was causing a panic on startup
  3. Expanded the PVC from 20Gi to 40Gi using online Ceph RBD volume expansion
  4. TypeDB replayed 962 WAL files from the good checkpoint — zero data loss

Prevention

The monitoring stack prevents this class of failure:

  • PVCUsageHigh fires at 75% → the Dispatch Controller investigates and can expand the volume or clean up space before it fills
  • PVCUsageCritical fires at 90% → immediate escalation
  • Grafana dashboards provide visual PVC monitoring for manual review
  • Prometheus retention (15 days) provides historical trend data for capacity planning

Without this monitoring stack, the PVC filled silently over weeks until TypeDB crashed. With it, the autonomous SRE agent would have received a warning alert at 75% and taken action — days before the crash.

Validation

The end-to-end flow was validated on April 2, 2026:

  1. Prometheus confirmed scraping kubelet_volume_stats_available_bytes for all PVCs in f3iai
  2. Alertmanager confirmed routing PVCUsageHigh and PVCUsageCritical alerts to the sre-dispatch webhook receiver
  3. Dispatch Controller /alertmanager endpoint confirmed translating Alertmanager webhook payloads into SREEvent objects
  4. Test dispatch confirmed: alert webhook → OPA risk evaluation → Claude Code Job created and running

Adding New Alert Routes

To route additional alerts to the Dispatch Controller, add matchers to the Alertmanager config:

```yaml
routes:
  - matchers:
      - alertname =~ "PVCUsageHigh|PVCUsageCritical|NodeDiskPressure|PodCrashLooping"
    receiver: sre-dispatch
```

Update via Helm:

```bash
helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f monitoring-values.yaml \
  -f alertmanager-config-patch.yaml
```

New PrometheusRule CRDs are automatically picked up by Prometheus — no restart required.