Network Policy Architecture

How the Federal Frontier Platform secures pod-to-pod traffic — default-deny ingress via Kyverno, explicit allow policies, the egress model, and how to add a new service without getting blocked.

The Federal Frontier Platform runs a zero-trust pod network: by default, nothing can connect to a pod until an explicit policy allows it. This page documents exactly how that is enforced, why connections sometimes fail with a timeout, and the precise steps to expose a new service without being blocked.

Read this first when debugging connectivity. A large fraction of “service A can’t reach service B” incidents on the platform are not bugs in A or B — they are a missing allow NetworkPolicy. The symptom is almost always a connection timeout, not “connection refused.”

The model in one paragraph

The CNI is Canal (Calico policy engine + Flannel networking). A Kyverno ClusterPolicy automatically injects a default-deny-ingress NetworkPolicy into every namespace, so all ingress is denied by default. To let traffic reach a pod you add an explicit allow-* NetworkPolicy. Egress is open by default — a pod can dial out anywhere unless it is selected by an egress policy, at which point its egress is restricted to what that policy permits.

Default-deny ingress (Kyverno)

A Kyverno ClusterPolicy named baseline-default-deny-ingress watches for Namespaces and generates a default-deny-ingress NetworkPolicy into each one:

# Generated automatically into (almost) every namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}          # all pods in the namespace
  policyTypes:
    - Ingress              # deny all ingress; egress untouched

Key properties:

Scope: all namespaces except a small infrastructure allow-list: kube-system, kube-public, kube-node-lease, kyverno, metallb-system, rook-ceph, ingress-nginx.
synchronize: true — Kyverno keeps the generated policy in sync. If you delete default-deny-ingress, Kyverno recreates it. You cannot “turn off” the deny by deleting it; you must add an allow policy alongside it.
New namespaces are locked automatically. The moment a namespace is created (e.g. vault, external-secrets), it receives default-deny-ingress. This is the single most common cause of “my freshly deployed thing can’t be reached.”
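
The actual baseline-default-deny-ingress manifest is not reproduced on this page. The following is a minimal sketch of a Kyverno generate policy with the properties listed above. The rule name and the exact match/exclude layout are assumptions, not the deployed definition:

```yaml
# Sketch only: approximates the behaviour described above, not the deployed policy.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-default-deny-ingress
spec:
  rules:
    - name: generate-default-deny-ingress   # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Namespace                  # fires for each Namespace
      exclude:
        any:
          - resources:
              namespaces:                    # infrastructure allow-list
                - kube-system
                - kube-public
                - kube-node-lease
                - kyverno
                - metallb-system
                - rook-ceph
                - ingress-nginx
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-ingress
        namespace: "{{request.object.metadata.name}}"
        synchronize: true                    # recreate the policy if it is deleted or edited
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
```

Because synchronize is set, the generated NetworkPolicy is owned by Kyverno; any lasting change has to be made in the ClusterPolicy itself.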

Two further Kyverno policies harden Keycloak specifically. The table lists them alongside the platform-wide baseline:

ClusterPolicy	Generates	Purpose
`baseline-default-deny-ingress`	`default-deny-ingress` (all ns)	Platform-wide ingress lockdown
`keycloak-netpols`	`keycloak-default-deny`, `keycloak-allow-from-ingress`	Lock Keycloak, allow only ingress controller
`keycloak-postgres-locked`	`pg-deny-all-ingress`, `pg-allow-from-keycloak-only`	Keycloak’s Postgres accepts connections only from Keycloak

How NetworkPolicy semantics actually work

NetworkPolicy is additive and per-direction, which trips people up:

A pod is “isolated” for a direction only once some policy selects it for that direction. With default-deny-ingress selecting every pod, every pod outside the excluded infrastructure namespaces is ingress-isolated.
Allow policies are purely additive — they punch holes in the deny. There is no “deny” rule to write; you only ever add allow policies. The union of all matching allow policies is what’s permitted.
Egress is open until you write an egress policy. The platform has no default-deny-egress. But the instant a pod is selected by any egress policy (e.g. allow-compass-api-typedb-egress), that pod’s egress is restricted to only what its egress policies allow. So adding one egress allow can inadvertently block a pod’s other outbound traffic.
Denied traffic times out; it is not refused. Canal silently drops the packets, so the client waits for its connection timeout. connection refused means the pod/port is wrong; context deadline exceeded / i/o timeout almost always means a NetworkPolicy is dropping the packet.
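
The egress pitfall is easiest to see in a manifest. The sketch below is an egress allow for a hypothetical caller; the policy name and the caller label are invented for illustration, and it assumes cluster DNS runs in kube-system and listens on port 53. Once this policy selects the pod, every destination not listed is dropped, including DNS if the second rule were omitted:

```yaml
# Sketch only: names and labels marked "hypothetical" do not exist on the platform.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-example-caller-egress      # hypothetical
  namespace: f3iai
spec:
  podSelector:
    matchLabels:
      app: example-caller                # hypothetical caller label
  policyTypes: [Egress]
  egress:
    # Rule 1: the destination the policy was written for (TypeDB gRPC).
    - to:
        - podSelector:
            matchLabels:
              app: typedb
      ports:
        - protocol: TCP
          port: 1729
    # Rule 2: DNS. Without it, name resolution times out as soon as
    # the pod becomes egress-isolated by rule 1.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # assumed DNS location
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

When a pod that used to work starts timing out on unrelated calls right after an egress policy lands, check whether that policy now selects the pod.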

The allow-policy convention

In the f3iai namespace (43 policies and counting) allow policies are named allow-<component>-<direction>, with a few path-style names such as allow-traefik-to-mcp-servers:

allow-typedb-ingress              allow-compass-api-typedb-egress
allow-mcp-server-ingress          allow-compass-api-mcp-egress
allow-traefik-to-mcp-servers      allow-ollama-egress
allow-prometheus-scrape-dispatch  allow-dispatch-egress

A representative ingress policy (allow-typedb-ingress): TypeDB accepts traffic on its HTTP (8000) and gRPC (1729) ports from any pod in a namespace labeled name: traefik or from Traefik pods in its own namespace, and on gRPC (1729) only from the Compass API pod:

spec:
  podSelector:
    matchLabels:
      app: typedb
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { name: traefik } }
        - podSelector: { matchLabels: { app.kubernetes.io/name: traefik } }
      ports: [{ port: 8000 }, { port: 1729 }]
    - from:
        - podSelector: { matchLabels: { app: ffo-compass, component: api } }
      ports: [{ port: 1729 }]
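
In the first rule above, namespaceSelector and podSelector are separate entries in the from list, so Kubernetes treats them as alternatives: a source only has to match one of them. If the intent were to admit only Traefik pods that also live in the Traefik namespace, both selectors would have to sit in a single entry. The sketch below shows that stricter form. It assumes the namespace is literally named traefik and that its pods carry app.kubernetes.io/name: traefik, neither of which this page confirms:

```yaml
# Sketch only: stricter variant of the first rule in allow-typedb-ingress.
spec:
  podSelector:
    matchLabels:
      app: typedb
  policyTypes: [Ingress]
  ingress:
    - from:
        # One list entry holding both selectors = logical AND:
        # the source must be a Traefik pod AND run in the traefik namespace.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: traefik   # assumed namespace name
          podSelector:
            matchLabels:
              app.kubernetes.io/name: traefik        # assumed pod label
      ports:
        - protocol: TCP
          port: 8000
        - protocol: TCP
          port: 1729
```

The only textual difference is the missing dash before podSelector, which is easy to overlook in review.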

Adding a new service without getting blocked

When you deploy a new workload that must receive traffic, follow this checklist:

Expect the deny. The namespace already has (or will get) default-deny-ingress. Your pod is unreachable until you add an allow.
Write an allow-<component>-ingress NetworkPolicy in the target namespace selecting your pod and the ports it serves, with from: listing the callers (by podSelector and/or namespaceSelector).
Cross-namespace callers need a namespaceSelector. Match on the automatic label kubernetes.io/metadata.name: <namespace> (every namespace has it). Pod selectors alone only match pods in the same namespace as the policy.
If the caller is selected by an egress policy, allow the new destination there too. Otherwise the caller’s own egress policy will block the new call even though the destination allows ingress.
Commit the policy to Gitea so ArgoCD manages it — never leave it as a one-off kubectl apply (it will not survive and it is invisible to the next engineer).
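
The second and third checklist items might look like the following skeleton. The component name, labels, port, and caller namespace are placeholders to be replaced with the real values:

```yaml
# Skeleton only: every value marked "hypothetical" is a placeholder.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-my-service-ingress        # hypothetical, follows allow-<component>-ingress
  namespace: f3iai                      # namespace where the new pod runs
spec:
  podSelector:
    matchLabels:
      app: my-service                   # hypothetical label on the new pod
  policyTypes: [Ingress]
  ingress:
    - from:
        # Same-namespace caller: a bare podSelector is enough.
        - podSelector:
            matchLabels:
              app: some-caller          # hypothetical caller label
        # Cross-namespace caller: requires a namespaceSelector.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: other-ns   # hypothetical caller namespace
      ports:
        - protocol: TCP
          port: 8080                    # hypothetical port served by the pod
```

The ports listed are the pod's container ports, not the Service port, since policy is evaluated after the Service address has been translated to a pod address.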

Worked example — ESO → Vault (2026-06-13)

When Vault and External Secrets Operator were deployed, both namespaces immediately received default-deny-ingress. ESO’s Vault login failed with context deadline exceeded — the classic netpol-timeout signature. The fix was a single allow policy in the vault namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-eso-to-vault
  namespace: vault
spec:
  podSelector: {}          # every pod in the vault namespace
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: external-secrets
      ports:
        - protocol: TCP
          port: 8200
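
Because podSelector: {} selects every pod in the vault namespace, the policy above opens port 8200 on all of them. A narrower variant is sketched below. It assumes the Vault server pods carry the label app.kubernetes.io/name: vault (common with the HashiCorp Helm chart, but not confirmed on this page), and the policy name is hypothetical:

```yaml
# Sketch only: same allow as above, scoped to the Vault pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-eso-to-vault-scoped       # hypothetical name
  namespace: vault
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault     # assumed label on Vault server pods
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: external-secrets
      ports:
        - protocol: TCP
          port: 8200
```

Whether the extra scoping is worth it depends on what else runs in the namespace; if only Vault server pods live there, the two forms behave the same.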

Troubleshooting connectivity

# 1. Confirm the destination namespace has default-deny-ingress (it almost always does)
kubectl get netpol -n <dst-ns>

# 2. List what IS allowed into the destination
kubectl get netpol -n <dst-ns> -o yaml | less

# 3. Symptom check: timeout => netpol; refused => wrong pod/port
#    "context deadline exceeded" / "i/o timeout"  -> missing allow policy
#    "connection refused"                          -> service/pod/port problem

# 4. Remember egress: is the CALLER selected by an egress policy?
kubectl get netpol -n <src-ns> -o json | \
  jq -r '.items[] | select(.spec.policyTypes[]=="Egress") | .metadata.name'

Observability of blocked traffic

Current state: Canal/Calico is running in open-source mode and Felix Prometheus metrics are not enabled, so there is no native “packets denied by NetworkPolicy” metric today. There is also no Loki/Splunk log pipeline and no blackbox prober deployed. Blocked traffic is therefore currently only visible as client-side timeouts in application logs and as failed readiness probes.

Practical ways to gain visibility, in order of effort/value:

Synthetic path probes (recommended). Deploy blackbox-exporter and define Prometheus Probe objects for critical cross-namespace paths (ESO→Vault, Compass→TypeDB, Traefik→MCP servers, etc.). A probe failure is a direct, alertable signal that a required path is broken, including when a NetworkPolicy is the cause. This plugs into the existing Prometheus + Alertmanager + SRE-dispatch stack without introducing a new platform.
Enable Calico Felix Prometheus metrics for dataplane health (FelixConfiguration.spec.prometheusMetricsEnabled: true). Note: per-rule denied-packet counters are a Calico Enterprise feature; open-source Felix exposes dataplane health, not per-deny counts.
Hubble (Cilium) gives per-flow allow/deny visibility natively, but that is a CNI migration off Canal — a large change, not justified for this alone.
Splunk is not deployed on the platform. If it were, Felix/iptables drop logs or application timeout logs could be forwarded and alerted on. A lightweight blackbox-exporter probe is a smaller, more direct solution than standing up Splunk for this purpose.

The recommended next step is option 1: synthetic probes for the handful of must-work cross-namespace paths, surfaced on a Grafana dashboard and wired to an Alertmanager rule.
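
As a starting point, a probe and its alert could be declared roughly as follows. This is a sketch under several assumptions: the Prometheus Operator CRDs are installed, blackbox-exporter is reachable at the Service name shown, the monitoring stack lives in a namespace called monitoring, and Vault is exposed as vault.vault.svc. All resource names are hypothetical:

```yaml
# Sketch only: names, namespaces, and Service addresses are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: probe-vault-8200
  namespace: monitoring
spec:
  interval: 30s
  module: tcp_connect                    # must be defined in the blackbox-exporter config
  prober:
    url: blackbox-exporter.monitoring.svc:9115   # assumed exporter Service and port
  targets:
    staticConfig:
      static:
        - vault.vault.svc:8200           # assumed Vault Service DNS name
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: netpol-path-probes
  namespace: monitoring
spec:
  groups:
    - name: netpol-path-probes
      rules:
        - alert: RequiredPathUnreachable
          expr: probe_success == 0       # blackbox-exporter reports 0 on a failed probe
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Synthetic probe to {{ $labels.instance }} is failing"
```

Two caveats apply. The connection originates from the blackbox-exporter pod, so the destination needs an ingress allow for the exporter, and the probe exercises the exporter-to-target path rather than the real caller's path (running the exporter in the caller's namespace gets closer to the real thing). The Probe and PrometheusRule may also need whatever labels the Prometheus instance's probeSelector and ruleSelector expect.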

Quick reference

Question	Answer
What enforces policy?	Canal (Calico Felix enforces policy, Flannel carries traffic), the CNI on RKE2
Default ingress	Denied in every namespace except the excluded ones (Kyverno `baseline-default-deny-ingress`)
Default egress	Open, until a pod is selected by an egress policy
Can I delete `default-deny-ingress`?	No — Kyverno regenerates it (`synchronize: true`)
Excluded namespaces	`kube-system`, `kube-public`, `kube-node-lease`, `kyverno`, `metallb-system`, `rook-ceph`, `ingress-nginx`
Symptom of a netpol block	Connection timeout (`context deadline exceeded`), not “refused”
Cross-namespace allow	Use `namespaceSelector` on `kubernetes.io/metadata.name`
Where do policies live?	Gitea (GitOps), reconciled by ArgoCD
Denied-traffic metrics today	None native; use synthetic probes (blackbox-exporter)