Monitoring and Alert-Driven Dispatch
How Grafana, Prometheus, and Alertmanager integrate with the Agent Harness Dispatch Controller to close the autonomous SRE loop — from metric to remediation with no human trigger.
The monitoring stack is the Trigger Layer of the Agent Harness (ADR-005). Prometheus scrapes infrastructure metrics. Alertmanager evaluates alert rules and routes firing alerts to the Dispatch Controller as webhook payloads. The Dispatch Controller translates each alert into a governed, context-enriched Claude Code Job that investigates and remediates autonomously.
This page covers the monitoring stack deployment, how alerts flow into the dispatch loop, and how the system prevents the classes of failures that triggered the deployment.
Architecture
```mermaid
flowchart LR
    Kubelet["Kubelet<br/>Volume Stats"] --> Prom["Prometheus<br/>50Gi · 15d retention"]
    NodeExp["Node Exporter<br/>CPU · Memory · Disk"] --> Prom
    KSM["kube-state-metrics<br/>Pod · Deployment · PVC"] --> Prom
    Prom --> AM["Alertmanager<br/>Route + Inhibit"]
    AM -->|webhook POST| DC["Dispatch Controller<br/>/alertmanager endpoint"]
    DC --> OPA["OPA Risk Eval"]
    DC --> FFO["FFO Context"]
    DC --> Job["Claude Code Job<br/>K8s Job"]
    Prom --> Grafana["Grafana 12.4.2<br/>grafana.vitro.lan"]
    style Kubelet fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style NodeExp fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style KSM fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style Prom fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style AM fill:#c53030,stroke:#fc8181,color:#fff
    style DC fill:#2d3748,stroke:#4299e1,color:#e2e8f0
    style OPA fill:#553c9a,stroke:#805ad5,color:#e2e8f0
    style FFO fill:#2c7a7b,stroke:#38b2ac,color:#e2e8f0
    style Job fill:#2b6cb0,stroke:#4299e1,color:#fff
    style Grafana fill:#2d3748,stroke:#4299e1,color:#e2e8f0
```
Stack Components
The monitoring stack runs in the monitoring namespace on the Frontier Management Cluster (FMC), deployed via the kube-prometheus-stack Helm chart.
| Component | Version | Storage | Purpose |
|---|---|---|---|
| Prometheus | — | 50Gi (ceph-rbd), 15-day retention | Metrics collection, PromQL evaluation, alert rule evaluation |
| Alertmanager | — | 5Gi (ceph-rbd) | Alert routing, grouping, inhibition, webhook dispatch |
| Grafana | 12.4.2 | 10Gi (ceph-rbd) | Dashboards, visualization, manual exploration |
| Node Exporter | — | — | Host-level metrics (CPU, memory, disk, network) |
| kube-state-metrics | — | — | Kubernetes object metrics (pods, deployments, PVCs, nodes) |
| Prometheus Operator | — | — | Manages Prometheus and Alertmanager via CRDs |
Access: Grafana is exposed via Traefik IngressRoute at http://grafana.vitro.lan.
Alert Flow: Metric to Remediation
The alert flow is fully autonomous for LOW-risk operations. No human initiates the investigation.
1. Prometheus Evaluates Alert Rules
Prometheus evaluates PrometheusRule CRDs. Example — PVC usage alerts:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-usage-alerts
  namespace: monitoring
spec:
  groups:
    - name: pvc-usage
      rules:
        - alert: PVCUsageHigh
          expr: |
            (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 75
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full'
        - alert: PVCUsageCritical
          expr: |
            (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full'
```
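The threshold logic in the rule expressions is easy to sanity-check outside Prometheus. The following is a minimal Python sketch of the same classification; the function name is hypothetical, and note that Prometheus itself additionally applies the `for:` hold-down windows before an alert fires.

```python
def pvc_usage_severity(used_bytes, capacity_bytes):
    """Mirror the PromQL thresholds: >90% critical, >75% warning.

    Illustrative helper only -- in production, Prometheus evaluates the
    recorded expressions and enforces the `for:` durations.
    """
    pct = used_bytes / capacity_bytes * 100
    if pct > 90:
        return "critical"
    if pct > 75:
        return "warning"
    return None

# A 20Gi PVC holding 19Gi is 95% full
print(pvc_usage_severity(19 * 2**30, 20 * 2**30))  # critical
```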
2. Alertmanager Routes to Dispatch Controller
When an alert fires, Alertmanager groups it by namespace and alertname, then routes it to the sre-dispatch receiver:
```yaml
receivers:
  - name: sre-dispatch
    webhook_configs:
      - url: http://sre-dispatch.f3iai.svc.cluster.local/alertmanager
        send_resolved: false

route:
  group_by: [namespace, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - alertname =~ "PVCUsageHigh|PVCUsageCritical"
      receiver: sre-dispatch
```
`send_resolved: false` — the Dispatch Controller only acts on firing alerts, not on resolved notifications.
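For reference, a firing batch delivered to the webhook follows Alertmanager's version-4 webhook JSON format. The sketch below shows that shape as a Python dict; the PVC name, timestamps, and annotation text are illustrative values, not captured traffic.

```python
# Illustrative Alertmanager webhook payload (version "4" format) as the
# Dispatch Controller's /alertmanager endpoint would receive it.
payload = {
    "version": "4",
    "status": "firing",
    "receiver": "sre-dispatch",
    "groupLabels": {"namespace": "f3iai", "alertname": "PVCUsageHigh"},
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "PVCUsageHigh",
                "severity": "warning",
                "namespace": "f3iai",
                "persistentvolumeclaim": "typedb-data",  # example PVC name
            },
            "annotations": {"summary": "PVC typedb-data in f3iai is 78% full"},
            "startsAt": "2026-04-02T14:03:00Z",
        }
    ],
}

# With send_resolved: false, only batches of firing alerts arrive.
firing = [a for a in payload["alerts"] if a["status"] == "firing"]
```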
3. Dispatch Controller Translates and Dispatches
The Dispatch Controller’s /alertmanager endpoint receives the Alertmanager webhook payload and translates each firing alert into an SREEvent:
| Alertmanager Field | SREEvent Field | Translation |
|---|---|---|
| `labels.severity` | `severity` | `critical` → `high`, `warning` → `medium`, `info` → `low` |
| `labels.persistentvolumeclaim` / `pod` / `node` / `instance` | `resource_id` | First non-empty label wins |
| Alert label content | `resource_type` | Heuristic pattern matching (PVC → `k8s_pvc`, pod → `k8s_pod`, ceph → `ceph`, etc.) |
| `annotations.summary` | `description` | Prefixed with `[namespace]` if available |
| `"grafana"` | `source` | All Alertmanager alerts are sourced as Grafana |
| Full alert object | `raw_payload` | Preserved for audit trail |
Each translated event then flows through the standard dispatch pipeline: OPA risk evaluation → FFO context injection → Jinja2 prompt rendering → K8s Job creation.
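The translation table can be pictured as a small pure function. This is a sketch of the mapping rules only, not the controller's actual source; the function and constant names are hypothetical, and the `resource_type` heuristics shown are a subset of the real ones.

```python
SEVERITY_MAP = {"critical": "high", "warning": "medium", "info": "low"}
RESOURCE_ID_LABELS = ["persistentvolumeclaim", "pod", "node", "instance"]

def to_sre_event(alert):
    """Approximate the Alertmanager-alert -> SREEvent translation table."""
    labels = alert.get("labels", {})
    # First non-empty resource label wins
    resource_id = next(
        (labels[k] for k in RESOURCE_ID_LABELS if labels.get(k)), "unknown"
    )
    # Heuristic resource_type matching (illustrative subset)
    blob = " ".join(labels.values()).lower()
    if "persistentvolumeclaim" in labels or "pvc" in blob:
        resource_type = "k8s_pvc"
    elif "ceph" in blob:
        resource_type = "ceph"
    elif "pod" in labels:
        resource_type = "k8s_pod"
    else:
        resource_type = "unknown"
    summary = alert.get("annotations", {}).get("summary", "")
    ns = labels.get("namespace")
    return {
        "source": "grafana",
        "severity": SEVERITY_MAP.get(labels.get("severity"), "low"),
        "resource_id": resource_id,
        "resource_type": resource_type,
        "description": f"[{ns}] {summary}" if ns else summary,
        "raw_payload": alert,  # preserved for the audit trail
    }
```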
4. Claude Code Investigates and Remediates
The spawned Claude Code Job receives the full MCP tool surface and a severity-appropriate prompt. For a PVC usage alert, the agent might:
- Query Prometheus via Grafana MCP for the usage trend
- Identify what’s consuming space (WAL files, logs, data growth)
- Remediate if LOW risk (clean WAL, expand PVC, rotate logs)
- Create a Jira ticket documenting the investigation and remediation
- Write the outcome back to FFO for institutional memory
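The trend-analysis step amounts to a linear extrapolation of usage samples, similar in spirit to PromQL's `predict_linear()`. A minimal sketch, assuming the agent has fetched `(timestamp, used_bytes)` samples from Prometheus (the function name is hypothetical):

```python
def hours_until_full(samples, capacity_bytes):
    """Least-squares linear fit over (unix_ts, used_bytes) samples.

    Returns the estimated hours until usage reaches capacity, or None
    when usage is flat or shrinking. Comparable to PromQL predict_linear().
    """
    n = len(samples)
    ts = [t for t, _ in samples]
    ys = [y for _, y in samples]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    slope = sum((t - t_mean) * (y - y_mean)
                for t, y in zip(ts, ys)) / denom  # bytes per second
    if slope <= 0:
        return None
    return (capacity_bytes - ys[-1]) / slope / 3600

# Growing 1Gi per hour with 5Gi headroom -> roughly 5 hours to full
samples = [(0, 14 * 2**30), (3600, 15 * 2**30)]
print(hours_until_full(samples, 20 * 2**30))
```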
Severity Mapping
Alertmanager’s severity labels do not map 1:1 to dispatch behavior. The OPA risk policy evaluates the full context — resource type, blast radius, and organizational constraints — before determining the effective severity.
| Alertmanager Severity | Initial SREEvent Severity | OPA May Elevate To | Dispatch Behavior |
|---|---|---|---|
| `info` | LOW | MEDIUM or HIGH | Autonomous remediation (if OPA confirms LOW) |
| `warning` | MEDIUM | HIGH | Create merge request, do not execute directly |
| `critical` | HIGH | CRITICAL (blocked) | Investigate only, document in Jira |
The Alertmanager critical label maps to SREEvent HIGH, not CRITICAL. This is intentional — CRITICAL in the dispatch system means “autonomous dispatch prohibited entirely.” Only the OPA policy can elevate to CRITICAL based on destructive keyword detection.
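The elevation step can be pictured as follows. This Python sketch only illustrates the shape of the policy logic (the real policy is Rego evaluated by OPA, and the keyword list here is an assumption, not the deployed set):

```python
# Hypothetical destructive-keyword set; the deployed OPA policy's list
# may differ.
DESTRUCTIVE_KEYWORDS = {"delete", "drop", "purge", "wipe", "destroy"}

def effective_severity(initial, description):
    """Sketch of the elevation rule: only the policy layer may raise an
    event to CRITICAL (autonomous dispatch prohibited), and it does so
    on destructive-keyword detection."""
    words = set(description.lower().split())
    if words & DESTRUCTIVE_KEYWORDS:
        return "critical"
    return initial
```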
Built-in Dashboards
The kube-prometheus-stack deploys 28 pre-configured Grafana dashboards:
| Dashboard | What It Shows |
|---|---|
| Kubernetes / Persistent Volumes | PVC usage, capacity, available bytes — the dashboard that would have caught the TypeDB disk-full incident |
| Kubernetes / Compute Resources / Namespace (Pods) | CPU and memory usage by pod in each namespace |
| Kubernetes / Compute Resources / Cluster | Cluster-wide resource utilization |
| Node Exporter / Nodes | Host-level CPU, memory, disk, network for texas-dell-04 |
| Kubernetes / Kubelet | Kubelet health, volume operations, pod lifecycle |
| Prometheus / Overview | Prometheus scrape performance, rule evaluation duration |
| Alertmanager / Overview | Alert groups, notification rate, silences |
Network Policies
The monitoring namespace has a default-deny-ingress policy. The following policies allow the required traffic:
| Policy | From | To | Port |
|---|---|---|---|
| `allow-grafana-ingress` | Any namespace | Grafana pods | 3000 |
| `allow-prometheus-ingress` | Any namespace | Prometheus pods | 9090 |
| `allow-alertmanager-ingress` | Any namespace | Alertmanager pods | 9093 |
| `allow-prometheus-scrape-egress` | Prometheus pods | Any (scrape targets) | All |
| `allow-grafana-to-prometheus` | Grafana pods | Any (data sources) | All |
| `allow-alertmanager-to-dispatch` | Alertmanager (monitoring ns) | sre-dispatch (f3iai ns) | 8080 |
Incident That Drove This Deployment
On April 1, 2026, the TypeDB pod in the f3iai namespace entered CrashLoopBackOff. Investigation revealed two cascading issues:
1. **Rook-Ceph CSI driver pods were scaled to 0** — the kubelet could not mount RBD-backed PVCs. All pods depending on Ceph RBD storage were stuck in ContainerCreating.
2. **TypeDB's 20Gi PVC was 100% full** — 15GB of WAL files (962 files that had never been compacted, because TypeDB kept crashing before it could checkpoint), a 4.2GB checkpoint, and 4.1GB of storage. When TypeDB tried to recover, it needed temporary disk space for WAL replay and had none.
Resolution
- Scaled Rook-Ceph operator back to 1 replica — it reconciled the CSI DaemonSets and provisioners
- Removed an empty/corrupt checkpoint directory (`1775044273271069`) that was causing a panic on startup
- Expanded the PVC from 20Gi to 40Gi using online Ceph RBD volume expansion
- TypeDB replayed 962 WAL files from the good checkpoint — zero data loss
Prevention
The monitoring stack prevents this class of failure:
- `PVCUsageHigh` fires at 75% → the Dispatch Controller investigates and can expand the volume or clean up space before it fills
- `PVCUsageCritical` fires at 90% → immediate escalation
- Grafana dashboards provide visual PVC monitoring for manual review
- Prometheus retention (15 days) provides historical trend data for capacity planning
Without this monitoring stack, the PVC filled silently over weeks until TypeDB crashed. With it, the autonomous SRE agent would have received a warning alert at 75% and taken action — days before the crash.
Validation
The end-to-end flow was validated on April 2, 2026:
- Prometheus confirmed scraping `kubelet_volume_stats_available_bytes` for all PVCs in `f3iai`
- Alertmanager confirmed routing `PVCUsageHigh` and `PVCUsageCritical` alerts to the `sre-dispatch` webhook receiver
- The Dispatch Controller's `/alertmanager` endpoint confirmed translating Alertmanager webhook payloads into `SREEvent` objects
- Test dispatch confirmed: alert webhook → OPA risk evaluation → Claude Code Job created and running
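The end-to-end path can be re-exercised at any time by posting a synthetic firing alert to the webhook endpoint. A sketch, assuming in-cluster access to the service URL from the Alertmanager config; the builder function, PVC name, and percentage are illustrative test values:

```python
import json
import urllib.request

def build_synthetic_alert(pvc, namespace, pct):
    """Build a minimal firing batch in Alertmanager's webhook format
    for exercising the /alertmanager endpoint."""
    return {
        "version": "4",
        "status": "firing",
        "receiver": "sre-dispatch",
        "alerts": [{
            "status": "firing",
            "labels": {
                "alertname": "PVCUsageHigh",
                "severity": "warning",
                "namespace": namespace,
                "persistentvolumeclaim": pvc,
            },
            "annotations": {"summary": f"PVC {pvc} in {namespace} is {pct}% full"},
        }],
    }

payload = build_synthetic_alert("test-pvc", "f3iai", 78)
req = urllib.request.Request(
    "http://sre-dispatch.f3iai.svc.cluster.local/alertmanager",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment when running inside the cluster
```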
Adding New Alert Routes
To route additional alerts to the Dispatch Controller, add matchers to the Alertmanager config:
```yaml
routes:
  - matchers:
      - alertname =~ "PVCUsageHigh|PVCUsageCritical|NodeDiskPressure|PodCrashLooping"
    receiver: sre-dispatch
```
Update via Helm:
```shell
helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f monitoring-values.yaml \
  -f alertmanager-config-patch.yaml
```
New PrometheusRule CRDs are automatically picked up by Prometheus — no restart required.