Packer Image Build Process — RKE2 Node Images for CAPO

How Vitro builds reproducible, hardened, pre-cached RKE2 cloud images with Packer for use by CAPO when provisioning workload clusters.

This page describes how Federal Frontier Vitro builds the OpenStack cloud images that CAPO uses to provision RKE2 worker and control-plane nodes for workload Kubernetes clusters.

The short version: every RKE2 node in a workload cluster boots from a Packer-built Glance image with everything pre-installed and pre-cached. No first-boot installs, no external registry pulls, no SSH-and-pray. CAPI can scale, replace, and rebuild nodes without human intervention.

Why Packer?

The temptation in a Kubernetes-on-OpenStack environment is to start from a stock Ubuntu cloud image and use cloud-init at first boot to install RKE2, pull container images, and configure the node. We tried that. It fails for three reasons:

  1. Boot time. Pulling images at first boot adds 5-15 minutes per node. A 5-node cluster takes an extra 30+ minutes to come up.
  2. Drift. Cloud-init at first boot means each node potentially gets a slightly different version of every dependency, depending on what’s in registries that day. Reproducibility goes out the window.
  3. Airgap impossibility. First-boot installs require external network access. In airgapped IL5/IL6 deployments, there is no path out. Cluster boot fails outright.

The fix is the standard immutable-infrastructure pattern: bake everything into the image at build time. CAPI sees identical, drift-free, registry-independent nodes that boot in 2-3 minutes.

The rule, from CLAUDE.md:

If you find yourself SSHing into a node to install things, STOP. You are doing it wrong. Fix the Packer template instead.

Pipeline Overview

The Packer build runs on texas-dell-04 (or any host with the OpenStack Packer plugin and credentials to the VitroAI control plane). The high-level flow:

graph TD
    Start[packer build] --> Boot[Boot Ubuntu 22.04 cloud image<br/>in OpenStack as a build VM]
    Boot --> Wait[Wait for cloud-init complete]
    Wait --> Pkg[01: Install OS packages<br/>qemu-guest-agent, multipath, iSCSI,<br/>NFS, cryptsetup, conntrack, chrony]
    Pkg --> RKE2[02: Install RKE2 offline<br/>from files/ artifacts]
    RKE2 --> Pull[03: Pre-pull CSI/CCM/cert-manager<br/>images into containerd image store]
    Pull --> Harden[04: Kernel modules, sysctl,<br/>SSH hardening, optional FIPS]
    Harden --> Manifest[06: Emit image-manifest.json]
    Manifest --> Cleanup[05: Generalize for imaging<br/>machine-id, SSH keys, logs, fstrim]
    Cleanup --> Snapshot[Snapshot VM to Glance]
    Snapshot --> Done[New image:<br/>rke2-node-v1.28-YYYYMMDD]
    style Pull fill:#2b6cb0,stroke:#4299e1,color:#fff
    style Manifest fill:#553c9a,stroke:#805ad5,color:#fff

A typical build takes 10-20 minutes end to end. The dominant cost is the image pre-pull step (5-10 minutes depending on registry latency).

Build Stage Reference

Stage 1 — System Packages (01-system-packages.sh)

Installs the OS packages required for RKE2 and Cinder CSI to function:

  • qemu-guest-agent — Lets Nova perform clean attach/detach handshakes when Cinder CSI asks Nova to attach an RBD-backed volume. Without it, attach can hang or partially fail.
  • multipath-tools — Required for any block-storage CSI that returns multiple paths (Cinder/RBD over iSCSI, NVMe-oF). Cinder expects multipathd running on the worker.
  • open-iscsi — iSCSI initiator. Required for iSCSI-backed Cinder volumes (some Cinder backends use iSCSI even when Ceph is the block layer underneath).
  • nfs-common — NFS client for any NFS-backed PVCs.
  • cryptsetup — LUKS support for Longhorn or LUKS-encrypted PVCs.
  • conntrack, socat, ebtables, ethtool — Required by kube-proxy, kubelet, and most CNIs.
  • chrony — Time sync. Kubernetes is unforgiving about clock drift, and Ceph requires <50ms skew between nodes.

Apt fallback: VitroAI provider networks have intermittent HTTP egress issues. The script tries apt-get with retries; if all fail, it falls back to a pre-staged offline .deb directory at /tmp/packer-debs/. This is a workaround pending a longer-term fix to the VitroAI build network.

The script also disables swap permanently — kubelet refuses to start with swap on.
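The retry-then-fallback and swap-disable logic can be sketched as below. This is illustrative, not the actual 01-system-packages.sh: the function names are made up, while the package list and the /tmp/packer-debs/ fallback path come from this section.

```shell
# Sketch of stage 1 (hypothetical helper names; package list and
# /tmp/packer-debs/ path are from the doc). Run as root in the build VM.
set -u

PKGS="qemu-guest-agent multipath-tools open-iscsi nfs-common cryptsetup \
conntrack socat ebtables ethtool chrony"

install_packages() {
  for attempt in 1 2 3; do
    if apt-get update && apt-get install -y $PKGS; then
      return 0
    fi
    echo "apt attempt $attempt failed, retrying..." >&2
    sleep 10
  done
  # All online attempts failed: fall back to the pre-staged offline .debs
  dpkg -i /tmp/packer-debs/*.deb
}

# Comment out any swap entry so it never returns after reboot; parametrized
# on the fstab path so it can be exercised against a scratch copy.
disable_swap() {
  sed -i -E 's|^([^#].*[[:space:]]swap[[:space:]].*)$|#\1|' "$1"
}

# In the real provisioner:
#   install_packages
#   swapoff -a
#   disable_swap /etc/fstab
```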

Stage 2 — RKE2 Install (02-install-rke2.sh)

Installs RKE2 from the offline artifacts in files/. The expected files:

files/
├── rke2.linux-amd64.tar.gz           # RKE2 binaries
├── rke2-images.linux-amd64.tar.zst   # Airgap container image tarball
├── sha256sum-amd64.txt               # Verification
└── rke2-install.sh                   # Installer script

The installer is invoked with INSTALL_RKE2_ARTIFACT_PATH=/tmp/rke2-artifacts so it never reaches out to the internet. The image tarball is placed at /var/lib/rancher/rke2/agent/images/, from which RKE2 imports the core Kubernetes images into containerd on first boot with zero registry access.
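Sketched as shell, the offline install amounts to staging the artifacts, verifying checksums, and pointing the installer at the staging directory. INSTALL_RKE2_ARTIFACT_PATH is RKE2's documented airgap-install variable; verify_artifacts is a hypothetical helper, added here to make the checksum step explicit.

```shell
# Stage-2 sketch; verify_artifacts is a hypothetical helper.
set -u

# Check the staged files against sha256sum-amd64.txt before installing.
verify_artifacts() {
  (cd "$1" && sha256sum --check --ignore-missing sha256sum-amd64.txt)
}

# In the real provisioner (as root):
#   mkdir -p /tmp/rke2-artifacts
#   cp files/rke2.linux-amd64.tar.gz files/rke2-images.linux-amd64.tar.zst \
#      files/sha256sum-amd64.txt /tmp/rke2-artifacts/
#   verify_artifacts /tmp/rke2-artifacts
#   INSTALL_RKE2_ARTIFACT_PATH=/tmp/rke2-artifacts sh files/rke2-install.sh
#   mkdir -p /var/lib/rancher/rke2/agent/images
#   cp files/rke2-images.linux-amd64.tar.zst /var/lib/rancher/rke2/agent/images/
```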

Stage 3 — Image Pre-pull (03-prepull-images.sh)

This is the stage that makes Cinder CSI (and everything else) work without external registry access. The RKE2 airgap tarball only contains the core Kubernetes images: kube-apiserver, kube-controller-manager, kube-scheduler, kube-proxy, kubelet, coredns, metrics-server, and the bundled CNI options. It does not contain:

  • OpenStack Cloud Controller Manager
  • Cinder CSI plugin
  • The CSI sidecars (provisioner, attacher, resizer, snapshotter, node-driver-registrar)
  • kube-vip
  • cert-manager
  • Any add-on the operator might enable at cluster create time

Stage 3 starts containerd in standalone mode (without RKE2), pulls the pinned set of images from Harbor (harbor.vitro.lan/k8s/...), and shuts containerd back down. The images persist in containerd’s content store at /var/lib/rancher/rke2/agent/containerd, which becomes part of the Glance image.

The pinned image list lives in the IMAGES array at the top of 03-prepull-images.sh. Bumping a version is a deliberate change requiring a new image build.

The script writes /etc/packer-build/prepulled-images.txt so the manifest stage can include it.
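The heart of the stage is a pull loop that records failures instead of aborting, matching the prepull-failures.txt behavior covered under Troubleshooting. A sketch, with the pull command injected so the loop itself is the focus; the real script would run a ctr pull against the temporarily started containerd.

```shell
# Stage-3 sketch: record successes and failures, never abort the build.
set -u

prepull_images() {
  local pull_cmd="$1" record="$2" fail_log="$3" img; shift 3
  : > "$record"; : > "$fail_log"
  for img in "$@"; do
    if $pull_cmd "$img"; then
      echo "$img" >> "$record"      # feeds the manifest stage
    else
      echo "$img" >> "$fail_log"    # logged, build continues
    fi
  done
}

# The real pull command would be along the lines of:
#   ctr -n k8s.io images pull "$img"
```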

Stage 4 — Hardening (04-harden.sh)

This is not full CIS hardening — it’s the minimum needed for the cluster to function correctly under load. CIS-level hardening (login banners, audit rules, password policies) is layered on at cluster bootstrap by cloud-init when the operator requests FIPS / IL5+ mode.

What stage 4 does:

  • Loads kernel modules required by k8s networking and storage: br_netfilter, overlay, nf_conntrack, ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh. These are also added to /etc/modules-load.d/rke2.conf so they load on every boot.

  • Sets sysctls for Kubernetes:
    • Bridge / IP forwarding
    • nf_conntrack_max=1048576 (small clusters with many short-lived connections blow through the default 65536 quickly)
    • vm.overcommit_memory=1
    • kernel.panic=10 (panic-and-reboot fast so CAPI can replace the node)
    • inotify limits raised to 8192 instances / 524288 watches (required by node-exporter, fluentd, etc.)
  • Disables conflicting services: ufw, firewalld, apparmor (RKE2 manages its own seccomp/apparmor profiles).

  • SSH hardening: no root login, no password auth, no X11 forwarding.

  • FIPS (if FIPS_ENABLED=true): runs ua enable fips-updates if the source image is Ubuntu Pro. If FIPS is requested but the source image is not Ubuntu Pro, the build prints a warning and continues — the resulting image is not FIPS-validated.
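The module and sysctl settings above persist as drop-in files. A sketch of how stage 4 might write them: the helper name is hypothetical, the values are taken from this section, and the root-dir parameter exists only so the function can be dry-run against a scratch directory.

```shell
# Stage-4 sketch; write_node_tuning is a hypothetical helper mirroring the
# module and sysctl values documented above.
set -u

write_node_tuning() {
  local root="$1"
  mkdir -p "$root/etc/modules-load.d" "$root/etc/sysctl.d"

  # Modules needed by k8s networking/storage, loaded on every boot
  cat > "$root/etc/modules-load.d/rke2.conf" <<'EOF'
br_netfilter
overlay
nf_conntrack
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
EOF

  # Sysctls for Kubernetes
  cat > "$root/etc/sysctl.d/90-rke2.conf" <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.netfilter.nf_conntrack_max = 1048576
vm.overcommit_memory = 1
kernel.panic = 10
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
EOF
}

# In the real provisioner: write_node_tuning "" (the live /etc), then
# modprobe each module and `sysctl --system` to apply immediately.
```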

Stage 6 — Manifest (06-manifest.sh)

Emits /etc/packer-build/image-manifest.json:

{
  "schema_version": "1.0",
  "built_by": "packer",
  "built_at": "2026-04-07T11:23:14Z",
  "os": {
    "pretty_name": "Ubuntu 22.04.4 LTS",
    "kernel": "5.15.0-101-generic"
  },
  "kubernetes": {
    "distribution": "rke2",
    "version": "v1.28.15+rke2r1"
  },
  "fips_enabled": false,
  "prepulled_images": [
    "harbor.vitro.lan/k8s/provider-os/openstack-cloud-controller-manager:v1.30.0",
    "harbor.vitro.lan/k8s/provider-os/cinder-csi-plugin:v1.30.0",
    "harbor.vitro.lan/k8s/sig-storage/csi-provisioner:v5.0.1",
    ...
  ]
}

This manifest is consumed by:

  1. Operators auditing what’s in a given Glance image (CIS / ATO evidence).
  2. OutpostAI’s cluster create wizard, which can read the manifest from the running CAPO cluster’s first node and report what versions are deployed.
  3. CI verification — comparing actual installed versions against what the build was supposed to produce.
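A minimal sketch of how this manifest could be assembled from stage 3's prepulled-images.txt in plain shell; the real 06-manifest.sh may differ (e.g. by using jq), and the os block and other fields are elided for brevity.

```shell
# Manifest-stage sketch; write_manifest is a hypothetical helper.
set -u

write_manifest() {
  local out="$1" imglist="$2" rke2_version="$3" fips="$4"
  {
    printf '{\n'
    printf '  "schema_version": "1.0",\n'
    printf '  "built_by": "packer",\n'
    printf '  "built_at": "%s",\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf '  "kubernetes": { "distribution": "rke2", "version": "%s" },\n' "$rke2_version"
    printf '  "fips_enabled": %s,\n' "$fips"
    printf '  "prepulled_images": [\n'
    # Quote each image line; comma-separate all but the last
    sed -e 's/^/    "/' -e 's/$/",/' "$imglist" | sed -e '$ s/,$//'
    printf '  ]\n}\n'
  } > "$out"
}
```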

Stage 5 — Cleanup (05-cleanup.sh)

Runs last. Generalizes the VM for imaging:

  • cloud-init clean --logs --machine-id
  • Truncates /etc/machine-id so it regenerates on first boot
  • Removes SSH host keys (regenerated by sshd on first boot)
  • Clears bash history
  • Truncates all logs in /var/log (preserves file ownership)
  • Cleans apt cache and /tmp
  • fstrim -av — discards unused rootfs blocks so they read back as zeros, letting Glance sparsify the image during upload. Drops final image size by 30-50%.

Note that /etc/packer-build/ is not deleted — the image manifest is part of the deliverable.
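A sketch of the generalization pass; truncate_logs is a hypothetical helper, split out so the "empty the files, keep ownership" behavior is explicit and checkable against a scratch directory.

```shell
# Stage-5 sketch; truncate_logs is a hypothetical helper.
set -u

truncate_logs() {
  # Truncate in place rather than delete, preserving ownership and mode
  find "$1" -type f -exec truncate -s 0 {} +
}

# In the real provisioner (as root):
#   cloud-init clean --logs --machine-id
#   truncate -s 0 /etc/machine-id
#   rm -f /etc/ssh/ssh_host_*
#   truncate_logs /var/log          # /etc/packer-build/ is deliberately kept
#   apt-get clean
#   fstrim -av
```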

Output

A successful build produces a Glance image:

$ openstack image list --name "rke2-node-*"
+--------------------------------------+--------------------------+--------+
| ID                                   | Name                     | Status |
+--------------------------------------+--------------------------+--------+
| a4f3...                              | rke2-node-v1.28-20260407 | active |
+--------------------------------------+--------------------------+--------+

The image is referenced from CAPO OpenStackMachineTemplate resources:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha8
kind: OpenStackMachineTemplate
metadata:
  name: my-cluster-control-plane
spec:
  template:
    spec:
      image:
        filter:
          name: rke2-node-v1.28-20260407
      flavor: m1.xlarge
      cloudName: vitroai

OutpostAI’s cluster create wizard automatically picks the most recent rke2-node-v* image when an operator provisions a new cluster, unless image_name is explicitly overridden in the request.
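Given the rke2-node-<k8s-version>-<YYYYMMDD> naming scheme, "most recent" can be computed by sorting on the date suffix. A sketch (not OutpostAI's actual implementation):

```shell
# Pick the newest rke2-node-v* image by its YYYYMMDD suffix.
# latest_image is a hypothetical helper reading image names on stdin.
set -u

latest_image() {
  grep '^rke2-node-v' | sort -t- -k4,4n | tail -n1
}

# Example feed (in practice: openstack image list -f value -c Name):
#   printf 'rke2-node-v1.28-20260301\nrke2-node-v1.28-20260407\n' | latest_image
```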

Versioning and Image Lifecycle

  • RKE2 version — Bumped explicitly. New RKE2 = new Packer build = new image tag.
  • Pre-pulled image versions — Pinned in 03-prepull-images.sh. Bumping is a deliberate change.
  • OS packages — Pinned to whatever Ubuntu 22.04 ships at build time. Rebuilds capture security updates.
  • Kernel — Whatever the source Ubuntu cloud image ships. Rebuild monthly to capture security updates.
  • Image name — rke2-node-<k8s-version>-<YYYYMMDD>; the date is the build date, not the RKE2 release date.

Old images should be retained in Glance for at least 90 days so existing clusters can rebuild nodes against the same image they were originally provisioned with. Mass deletion of old images breaks CAPI scale-up on existing clusters.
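A retention check can be derived from the same date suffix. The helper below is hypothetical and deliberately only lists candidates; given the scale-up warning above, deletion should stay a human decision.

```shell
# List images whose YYYYMMDD name suffix is older than a cutoff; does NOT
# delete anything. images_older_than is a hypothetical helper.
set -u

images_older_than() {
  local cutoff="$1" name suffix
  while read -r name; do
    suffix="${name##*-}"
    case "$suffix" in
      [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
        if [ "$suffix" -lt "$cutoff" ]; then echo "$name"; fi ;;
    esac
  done
}

# Example (cutoff = 90 days ago, GNU date):
#   openstack image list -f value -c Name | grep '^rke2-node-v' \
#     | images_older_than "$(date -d '90 days ago' +%Y%m%d)"
```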

Operator Workflow

Build a new image

ssh ubuntu@texas-dell-04
cd ~/builds/f3iai/packer
source openrc.vitroai
packer build rke2-node.pkr.hcl

Build a FIPS-validated variant

packer build -var fips_enabled=true rke2-node.pkr.hcl

Inspect a running node’s image manifest

ssh ubuntu@<workload-node>
jq . /etc/packer-build/image-manifest.json

Use a specific image in OutpostAI

In the cluster create wizard, expand “Advanced” and set the Image Name field. Defaults to the most recent rke2-node-v* image if left blank.

What’s NOT in the image (and why)

  • Cluster identity (hostname, machine-id, SSH keys) — Generated at first boot; pre-baking would create snowflakes.
  • OpenStack credentials (cloud.conf) — Provided per-cluster by CAPO bootstrap as a Kubernetes Secret.
  • Per-cluster CNI configuration — Applied by ClusterResourceSet at cluster bootstrap.
  • Per-cluster add-on Helm values — Applied by ArgoCD from Gitea after cluster boot.
  • TLS certificates — Generated by RKE2 / cert-manager at runtime.

These are the things that legitimately differ per cluster. Everything else is identical across every node, every cluster.

Troubleshooting

apt-get fails during build

The VitroAI provider network has intermittent HTTPS egress. Pre-stage required .deb files at /tmp/packer-debs/ on the build VM (or in a custom Packer file provisioner) and the script will fall back automatically.

Image pre-pull fails for some images

The script logs failures to /etc/packer-build/prepull-failures.txt but does not abort the build. After the build completes, check this file:

cat /etc/packer-build/prepull-failures.txt

Common causes:

  • Image doesn’t exist in Harbor (wrong tag in IMAGES array)
  • Harbor TLS issue (the script uses --skip-verify but still requires Harbor reachability)
  • Disk full on build VM (Cinder CSI plugin images are 200-400MB each)

Build VM hangs at “Waiting for cloud-init”

Usually a stuck cloud-init data source. Check the build VM console in Horizon and look for cloud-init errors. Often resolved by setting a different network_id (the build network must have working DHCP and metadata access).

FIPS variant boots but FIPS is not actually enabled

The source image must be Ubuntu Pro. Check with pro status on a built image:

ssh ubuntu@<test-vm>
sudo pro status

If the source image is regular Ubuntu, switch to ubuntu-pro-22.04-fips in Glance and rebuild.