Cluster Bootstrap Flow — From OutpostAI Click to Working PVCs

End-to-end walkthrough of what happens when an operator provisions a new RKE2 workload cluster through OutpostAI on Vitro: CAPO manifests, Application Credentials, cloud-config Secret, and the moment the first pod successfully mounts a Cinder-backed PVC.

This page traces the full lifecycle of a CAPO workload cluster from the moment an operator clicks Create Cluster in OutpostAI to the moment a pod in the new cluster successfully mounts a Cinder-backed PVC. It exists because the path involves three Git repos, two Kubernetes clusters, OpenStack APIs, Keystone, ArgoCD, CAPO, cloud-init, RKE2, and a half-dozen Helm charts — and the only way to debug it is to understand the whole picture.

High-Level Flow

graph TD
    Op[Operator clicks<br/>Create Cluster in OutpostAI] --> Wiz[Wizard collects<br/>name, infra, addons,<br/>customized values.yaml]
    Wiz --> Trail[Trailboss API<br/>POST /api/v1/clusters/create]
    Trail --> CAPOFn[create_capo_cluster_gitops]
    CAPOFn --> Gitea1["Gitea: write CAPO manifests<br/>+ ArgoCD Applications<br/>+ values.yaml per addon"]
    Gitea1 --> ArgoFMC[ArgoCD on FMC syncs]
    ArgoFMC --> CAPOctrl[CAPO controller in FMC<br/>creates Cluster, OpenStackCluster,<br/>MachineTemplates, KubeadmControlPlane]
    CAPOctrl --> Nova[Nova boots VMs from<br/>rke2-node Glance image]
    Nova --> Cloudinit[cloud-init runs on each VM<br/>configures hostname,<br/>writes /etc/rancher/rke2/config.yaml]
    Cloudinit --> RKE2[RKE2 starts<br/>imports airgap images<br/>joins cluster]
    RKE2 --> Ready[Workload cluster<br/>API server reachable]
    Ready --> Sync[ArgoCD on FMC syncs<br/>cluster's addon Applications]
    Sync --> CSI[cinder-csi-controller +<br/>cinder-csi-nodeplugin start<br/>using pre-pulled images]
    CSI --> CloudCfg[CSI reads cloud-config Secret<br/>with cluster's Application Credentials]
    CloudCfg --> First["First Pod creates a PVC<br/>storageClassName: cinder-rbd"]
    First --> CinderAPI[CSI calls Cinder API]
    CinderAPI --> Ceph[Cinder rbd driver<br/>creates RBD image in Ceph]
    Ceph --> Attach[Cinder asks Nova to attach<br/>volume to worker VM]
    Attach --> Mount[kubelet mounts /dev/vdN<br/>into the pod]
    Mount --> Done[Pod is Running with<br/>persistent storage]

    style Wiz fill:#2b6cb0,stroke:#4299e1,color:#fff
    style Gitea1 fill:#553c9a,stroke:#805ad5,color:#fff
    style ArgoFMC fill:#553c9a,stroke:#805ad5,color:#fff
    style Sync fill:#553c9a,stroke:#805ad5,color:#fff
    style CinderAPI fill:#2c7a7b,stroke:#38b2ac,color:#fff
    style Ceph fill:#2c7a7b,stroke:#38b2ac,color:#fff
    style Done fill:#2f855a,stroke:#48bb78,color:#fff

Stage 1 — Operator Input (OutpostAI Wizard)

The operator opens OutpostAI, clicks Create Cluster, and walks through the four-step wizard:

  1. Basic — cluster name, namespace, provider (openstack/aws), Kubernetes distribution (RKE2/K3s)
  2. Infrastructure — control plane / worker counts, OpenStack flavors, FIPS toggle
  3. Add-ons — drag-and-drop selection of CNI, CSI, ingress, monitoring, security add-ons. Defaults to Canal CNI + Cinder CSI. Each selected add-on can be customized via the Customize values.yaml gear button, which opens a Monaco YAML editor pre-filled with the templated defaults from GET /api/v1/clusters/addon-defaults/{addon_id}.
  4. Review — final confirmation, then Create Cluster

The wizard sends a POST to Trailboss:

POST /api/v1/clusters/create
{
  "cluster_name": "satellite-workload-1",
  "namespace": "workloads",
  "provider": "openstack",
  "kubernetes_type": "rke2",
  "control_plane_count": 1,
  "worker_count": 3,
  "flavor_control_plane": "m1.large",
  "flavor_worker": "m1.xlarge",
  "image_name": "rke2-node-v1.28-20260407",
  "fips_enabled": false,
  "create_korc_resources": true,
  "addons": ["canal", "cinder-csi", "metallb", "prometheus"],
  "addon_values": {
    "prometheus": "prometheus:\n  prometheusSpec:\n    retention: 30d\n  ..."
  }
}

Stage 2 — Trailboss Generates GitOps Artifacts

The wizard calls POST /api/v1/cluster-templates/{id}/render against the TrailbossAI API. The Cluster Template System (ADR-008) handles the rest:

  1. JSON Schema validation of the operator-supplied values payload — fail-fast on bad input
  2. Jinja2 render of the chosen template (e.g. capo-rke2-default) with StrictUndefined
  3. Multi-doc YAML splitting via # FILE: markers → {file_path: content} map
  4. Audit row insertion in cluster_renders recording template id, version, input values, rendered files
  5. Gitea push via the gitops_writer module — POSTs each file with PUT fallback if it exists
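The multi-doc split in step 3 can be sketched in a few lines. This is a hypothetical illustration of the described behavior, not the actual Trailboss code — the function name and the exact marker handling are assumptions:

```python
# Hypothetical sketch of the "# FILE:" multi-doc split described above.
# Function name and marker handling are assumptions, not the Trailboss source.
def split_rendered_output(rendered: str) -> dict[str, str]:
    """Split a rendered template into a {file_path: content} map,
    using "# FILE: <path>" marker lines as file boundaries."""
    files: dict[str, str] = {}
    current_path = None
    buf: list[str] = []
    for line in rendered.splitlines():
        if line.startswith("# FILE:"):
            if current_path is not None:
                files[current_path] = "\n".join(buf).strip() + "\n"
            current_path = line.removeprefix("# FILE:").strip()
            buf = []
        elif current_path is not None:
            buf.append(line)
    if current_path is not None:
        files[current_path] = "\n".join(buf).strip() + "\n"
    return files


rendered = """\
# FILE: clusters/demo/cluster.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
# FILE: clusters/demo/addons/cinder-csi/values.yaml
storageClass:
  name: cinder-rbd
"""
files = split_rendered_output(rendered)
print(sorted(files))
# → ['clusters/demo/addons/cinder-csi/values.yaml', 'clusters/demo/cluster.yaml']
```

Each entry in the resulting map then becomes one Gitea write in step 5.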

The same render path serves all four infrastructure providers. CAPO templates render OpenStack manifests; CAPA templates render AWS EKS manifests; CAPZ renders Azure AKS; CAPOCI renders Oracle OKE. Adding a new provider is a Postgres template insert with no code changes.

The files written to Gitea (admin/federal-frontier-platform.git) for a CAPO cluster:

CAPO cluster manifests

  • Cluster (the CAPI core resource)
  • OpenStackCluster (CAPO infrastructure provider resource)
  • KubeadmControlPlane (or RKE2ControlPlane)
  • OpenStackMachineTemplate for control plane nodes (referencing the Glance image)
  • OpenStackMachineTemplate for worker nodes
  • MachineDeployment for workers
  • The cloud-config Secret referenced by OpenStackCluster.spec.identityRef, containing the Application Credentials scoped to the cluster’s project

Add-on artifacts

For each selected add-on, two files are written under clusters/<cluster-name>/addons/<addon-id>/:

  • values.yaml — either the templated defaults from addon_values.py or, if the operator customized it in the wizard, the raw YAML they typed
  • application.yaml — an ArgoCD Application pointing at the cluster, the chart in Harbor, and the values file in Gitea
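One plausible shape for such an application.yaml, assuming ArgoCD multi-source Applications (2.6+) to combine the Harbor chart with the Gitea values file. Repo URLs, chart name, and version here are placeholders, not copied from the real file:

```yaml
# Illustrative sketch only — URLs, chart name, and targetRevision are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: satellite-workload-1-cinder-csi
  namespace: argocd
spec:
  project: default
  destination:
    name: satellite-workload-1        # workload cluster registered in ArgoCD
    namespace: kube-system
  sources:
    - repoURL: https://harbor.vitro.lan/chartrepo/addons   # chart in Harbor (assumed URL)
      chart: openstack-cinder-csi
      targetRevision: 2.29.0                               # pinned chart version (placeholder)
      helm:
        valueFiles:
          - $values/clusters/satellite-workload-1/addons/cinder-csi/values.yaml
    - repoURL: https://gitea.vitro.lan/admin/federal-frontier-platform.git  # values in Gitea (assumed URL)
      targetRevision: main
      ref: values
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```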

A typical add-ons directory:

clusters/satellite-workload-1/
├── cluster.yaml
├── machine-templates.yaml
├── cloud-config-secret.yaml
└── addons/
    ├── canal/
    │   └── application.yaml      # Canal is a ClusterResourceSet, no values
    ├── cinder-csi/
    │   ├── application.yaml
    │   └── values.yaml
    ├── metallb/
    │   ├── application.yaml
    │   └── values.yaml
    └── prometheus/
        ├── application.yaml
        └── values.yaml           # Customized: 30d retention

Stage 3 — ArgoCD on the FMC Syncs the CAPO Manifests

The Fleet Management Cluster (FMC) runs ArgoCD with an App-of-Apps pattern that watches the Gitea clusters/ directory. When Trailboss pushes new files for satellite-workload-1, ArgoCD picks them up within seconds and syncs them into the FMC’s f3iai namespace.

This causes the CAPO controller running in the FMC to see new Cluster and OpenStackCluster resources, which trigger the actual provisioning.

Stage 4 — CAPO Provisions OpenStack Resources

The CAPO controller in the FMC translates the CAPI resources into OpenStack API calls:

  1. Network — creates a tenant network for the cluster (or uses an existing one if network_name was supplied)
  2. Router — creates a router from the tenant network to the external network
  3. Security groups — creates k8s-appropriate security groups (allows api-server port, kubelet, CNI, etc.)
  4. Application Credentials — creates an Application Credential in Keystone scoped to the cluster’s project. This is the credential the workload cluster’s Cinder CSI will use to talk to Cinder.
  5. Server groups — anti-affinity groups so control plane VMs spread across hypervisors
  6. VM creation — calls Nova to boot VMs from the rke2-node-v1.28-YYYYMMDD Glance image, attached to the tenant network with the right security groups, with cloud-init userdata containing the cluster’s join token and cloud-config for the CSI

The cloud-config Secret that CAPO creates in the workload cluster (once the API server is up) looks like this:

apiVersion: v1
kind: Secret
metadata:
  name: cloud-config
  namespace: kube-system
type: Opaque
stringData:
  cloud.conf: |
    [Global]
    auth-url = https://keystone.vitro.lan/v3
    application-credential-id = 7c4e8f1a...
    application-credential-secret = aBcD3Fg...
    region = RegionOne

    [BlockStorage]
    bs-version = v3
    ignore-volume-az = true

    [LoadBalancer]
    use-octavia = true

This Secret is referenced by both the OpenStack Cloud Controller Manager and Cinder CSI Helm charts (via secret.create: false, secret.name: cloud-config). The Application Credentials are revocable and tied to the cluster’s project — when the cluster is deleted, CAPO revokes them. No long-lived passwords. No shared service accounts.
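The wiring in the Cinder CSI values.yaml would look roughly like this — key names follow the upstream openstack-cinder-csi chart convention cited above, so verify them against the pinned chart version:

```yaml
# Sketch of the relevant cinder-csi values.yaml keys (names assumed from the
# upstream openstack-cinder-csi chart; verify against the pinned version).
secret:
  enabled: true
  create: false        # don't render a Secret from chart values...
  name: cloud-config   # ...mount the CAPO-created cloud-config Secret instead
```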

Stage 5 — VMs Boot from the Packer Image

Each VM that Nova schedules boots from the Packer-built rke2-node-v1.28-YYYYMMDD Glance image. This image already contains:

  • RKE2 binaries
  • The RKE2 airgap image tarball (kubelet imports core k8s images on first boot)
  • All Cinder CSI / CCM / sidecar container images pre-pulled into containerd
  • Required OS packages (qemu-guest-agent, multipath-tools, open-iscsi, etc.)
  • Kernel modules and sysctl tuning for k8s networking
  • A build manifest at /etc/packer-build/image-manifest.json

cloud-init runs the user-data CAPO supplied:

  1. Sets the hostname (e.g., satellite-workload-1-md-0-abc123)
  2. Writes /etc/rancher/rke2/config.yaml with the cluster join token and node role
  3. Writes any per-cluster CNI configuration
  4. Starts rke2-server (control plane) or rke2-agent (worker)

RKE2 imports the airgap image tarball on first start, joins the cluster, and registers with the API server. Total time from Nova “booting” to “Node Ready” is typically 2-3 minutes because there are zero external image pulls.
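The config.yaml cloud-init writes in step 2 would look roughly like this for a worker node — the server URL and token are placeholders, not real cluster values (RKE2's registration endpoint listens on port 9345):

```yaml
# Illustrative /etc/rancher/rke2/config.yaml for a worker — values are placeholders.
server: https://satellite-workload-1-control-plane.vitro.lan:9345
token: <cluster-join-token>
node-name: satellite-workload-1-md-0-abc123
```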

Stage 6 — Add-ons Install via ArgoCD

Once the workload cluster’s API server is reachable, the ArgoCD Applications that Trailboss wrote to Gitea (clusters/satellite-workload-1/addons/) start syncing into the workload cluster.

The order matters:

  1. Canal CNI is applied via a CAPI ClusterResourceSet, not Helm — it must be available before any pod can schedule
  2. OpenStack Cloud Controller Manager initializes nodes (sets the OpenStack provider ID, fetches addresses, manages routes)
  3. Cinder CSI controller and node-plugin start, mount the cloud-config Secret, and register the CSI driver with kubelet
  4. MetalLB for LoadBalancer services (if selected)
  5. Prometheus + Grafana monitoring stack (if selected)
  6. Everything else

Because all the container images for steps 2-6 were pre-pulled into the Packer image, the ArgoCD sync completes in seconds rather than waiting for image pulls. In airgapped deployments this is the difference between “cluster works” and “cluster never finishes booting”.

Stage 7 — First PVC Successfully Mounts

A user (or a Helm chart) creates a PVC in the workload cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  storageClassName: cinder-rbd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
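For reference, the cinder-rbd StorageClass this PVC names would look roughly like the sketch below — the provisioner name is the standard Cinder CSI driver, but the parameters and binding mode are assumptions, not the cluster's actual definition:

```yaml
# Sketch of a cinder-rbd StorageClass — parameters are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-rbd
provisioner: cinder.csi.openstack.org
parameters:
  type: rbd                          # Cinder volume type backed by Ceph rbd (assumed name)
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```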

What happens:

  1. cinder-csi-controller sees the new PVC and reads cloud-config from the Secret to get Application Credentials
  2. CSI calls Cinder API: POST https://cinder.vitro.lan:8776/v3/<project_id>/volumes with the size and project scope
  3. Cinder authenticates the request against Keystone using the Application Credentials
  4. Cinder checks the project’s volume quota
  5. Cinder rbd driver creates an RBD image in the Ceph volumes pool
  6. Cinder returns the volume ID; CSI binds it to the PVC
  7. A pod schedules that mounts the PVC. kubelet asks cinder-csi-nodeplugin on the worker to publish the volume
  8. CSI node plugin asks Cinder to attach the volume to this worker’s Nova instance
  9. Cinder calls Nova to perform the attach. Nova does the QEMU virtio-blk attach handshake (this is why qemu-guest-agent matters in the image)
  10. /dev/vdb appears inside the worker VM
  11. kubelet bind-mounts /dev/vdb into the pod’s filesystem at the requested mountPath
  12. The pod is Running with persistent storage

Every layer of OpenStack is doing exactly what it’s supposed to do. Nothing was bypassed. The whole chain is auditable from the Keystone logs (auth), Cinder logs (provisioning), Ceph logs (RBD operations), and the workload cluster’s CSI logs.

Failure Modes (and Where to Look)

| Symptom | First place to look |
| --- | --- |
| Cluster never reaches Ready in OpenStack | CAPO controller logs in FMC: kubectl -n capo-system logs deploy/capo-controller-manager |
| VMs boot but RKE2 never starts | SSH to a VM, check /var/lib/rancher/rke2/agent/logs/ and journalctl -u rke2-server |
| RKE2 starts but no nodes Ready | Cloud Controller Manager logs in workload cluster: kubectl -n kube-system logs deploy/openstack-cloud-controller-manager |
| Nodes Ready but PVC stuck Pending | cinder-csi-controllerplugin logs in workload cluster |
| PVC bound but pod can’t mount | cinder-csi-nodeplugin logs on the worker that scheduled the pod, plus Nova logs for the attach call |
| Cinder API returns 401 | The cloud-config Secret has wrong/revoked Application Credentials. Check Keystone audit log |
| Cinder API returns 403 | Project quota exceeded. Check openstack quota show <project> |
| Ceph returns ENOSPC | Pool full or no PGs available. Check ceph status and ceph osd df |

Why This Architecture Works

Three things conspire to make the whole flow boring (in a good way):

  1. The Packer image is the source of identity. Every node boots from the same hardened image with the same pre-cached dependencies. No drift, no surprises, no first-boot installs.

  2. CAPI manages the lifecycle. Scale-up, scale-down, rolling upgrade, replace-on-failure — all of this is CAPI’s job. Operators never SSH to a node.

  3. OpenStack stays in the path. Cinder CSI (not direct Ceph CSI) means Keystone authenticates every storage operation, Cinder enforces quotas, Nova handles the attach, and Ceph stays where it belongs (as the backing block layer).

When something goes wrong, the layers can be debugged independently. That’s the whole point.