Network Policy Architecture
How the Federal Frontier Platform secures pod-to-pod traffic — default-deny ingress via Kyverno, explicit allow policies, the egress model, and how to add a new service without getting blocked.
The Federal Frontier Platform runs a zero-trust pod network: by default, nothing can connect to a pod until an explicit policy allows it. This page documents exactly how that is enforced, why connections sometimes fail with a timeout, and the precise steps to expose a new service without being blocked.
Read this first when debugging connectivity. A large fraction of “service A can’t reach service B” incidents on the platform are not bugs in A or B; they are caused by a missing
allow NetworkPolicy. The symptom is almost always a connection timeout, not “connection refused.”
The model in one paragraph
The CNI is Canal (Calico policy engine + Flannel networking). A Kyverno
ClusterPolicy automatically injects a default-deny-ingress NetworkPolicy into
every namespace outside a small infrastructure allow-list, so ingress is denied
by default. To let traffic reach a pod you add an explicit allow-* NetworkPolicy.
Egress is open by default: a pod can dial out anywhere unless it is selected by
an egress policy, at which point its egress is restricted to what that policy
permits.
Default-deny ingress (Kyverno)
A Kyverno ClusterPolicy named baseline-default-deny-ingress watches for
Namespaces and generates a default-deny-ingress NetworkPolicy into each one:
# Generated automatically into (almost) every namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}   # all pods in the namespace
  policyTypes:
    - Ingress       # deny all ingress; egress untouched
Key properties:
- Scope: all namespaces except a small infrastructure allow-list: kube-system, kube-public, kube-node-lease, kyverno, metallb-system, rook-ceph, ingress-nginx.
- synchronize: true. Kyverno keeps the generated policy in sync. If you delete default-deny-ingress, Kyverno recreates it. You cannot “turn off” the deny by deleting it; you must add an allow policy alongside it.
- New namespaces are locked automatically. The moment a namespace is created (e.g. vault, external-secrets), it receives default-deny-ingress. This is the single most common cause of “my freshly deployed thing can’t be reached.”
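For orientation, a Kyverno generate rule of roughly the following shape would produce the behavior described above. This is a minimal sketch, not the platform's actual manifest: the policy name, the synchronize setting, and the excluded namespaces come from this page, while the rule name and the match/exclude layout are assumptions.

```yaml
# Sketch only: the rule name and the match/exclude layout are assumptions,
# not copied from the platform's real ClusterPolicy.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-default-deny-ingress
spec:
  rules:
    - name: generate-default-deny-ingress   # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Namespace
      exclude:
        any:
          - resources:
              kinds:
                - Namespace
              names:                         # infrastructure allow-list from this page
                - kube-system
                - kube-public
                - kube-node-lease
                - kyverno
                - metallb-system
                - rook-ceph
                - ingress-nginx
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-ingress
        namespace: "{{request.object.metadata.name}}"   # the namespace being created
        synchronize: true                    # Kyverno restores the policy if it is deleted or edited
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
```

Because the deny is generated rather than hand-written, the only supported way to open a path is to add an allow policy next to it.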
Two further Kyverno policies harden Keycloak specifically:
| ClusterPolicy | Generates | Purpose |
|---|---|---|
| baseline-default-deny-ingress | default-deny-ingress (all ns) | Platform-wide ingress lockdown |
| keycloak-netpols | keycloak-default-deny, keycloak-allow-from-ingress | Lock Keycloak, allow only ingress controller |
| keycloak-postgres-locked | pg-deny-all-ingress, pg-allow-from-keycloak-only | Keycloak’s Postgres accepts connections only from Keycloak |
How NetworkPolicy semantics actually work
NetworkPolicy is additive and per-direction, which trips people up:
- A pod is “isolated” for a direction only once some policy selects it for that direction. With default-deny-ingress selecting every pod, every pod in every covered namespace is ingress-isolated.
- Allow policies are purely additive: they punch holes in the deny. There is no “deny” rule to write; you only ever add allow policies. The union of all matching allow policies is what’s permitted.
- Egress is open until you write an egress policy. The platform has no default-deny-egress. But the instant a pod is selected by any egress policy (e.g. allow-compass-api-typedb-egress), that pod’s egress is restricted to only what its egress policies allow. So adding one egress allow can inadvertently block a pod’s other outbound traffic.
- Denied traffic times out; it is not refused. Canal silently drops the packets, so the client waits for its connection timeout. connection refused points to a service/pod/port problem (nothing is listening where you dialed); context deadline exceeded / i/o timeout almost always means a NetworkPolicy is dropping the packet.
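The egress point is the one most likely to cause a surprise, so here is a hedged sketch of an egress policy that allows its intended destination and also keeps DNS working. The policy name, the caller label, and the DNS selector (k8s-app: kube-dns in kube-system) are assumptions for illustration; this is not one of the platform's actual policies.

```yaml
# Illustrative only: policy name, caller label, and DNS selector are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-example-api-egress        # hypothetical name
  namespace: f3iai
spec:
  podSelector:
    matchLabels:
      app: example-api                  # hypothetical caller label
  policyTypes: [Egress]
  egress:
    # Intended destination: TypeDB gRPC in the same namespace.
    - to:
        - podSelector: { matchLabels: { app: typedb } }
      ports:
        - { protocol: TCP, port: 1729 }
    # Once the pod is selected for Egress, anything not listed is dropped,
    # including DNS lookups. This rule keeps Service name resolution working.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns         # assumed cluster DNS label
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```

If the first rule were applied on its own, the pod would reach TypeDB by IP but every other outbound call, name resolution included, would start timing out.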
The allow-policy convention
In the f3iai namespace (43 policies and counting) every allow policy follows
the naming convention allow-<component>-<direction>:
allow-typedb-ingress allow-compass-api-typedb-egress
allow-mcp-server-ingress allow-compass-api-mcp-egress
allow-traefik-to-mcp-servers allow-ollama-egress
allow-prometheus-scrape-dispatch allow-dispatch-egress
A representative ingress policy (allow-typedb-ingress): TypeDB accepts traffic
on its gRPC (1729) and HTTP (8000) ports from the Traefik namespace, from the
Compass API pod, and from any pod in its own namespace:
spec:
  podSelector:
    matchLabels:
      app: typedb
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { name: traefik } }
        - podSelector: { matchLabels: { app.kubernetes.io/name: traefik } }
      ports: [{ port: 8000 }, { port: 1729 }]
    - from:
        - podSelector: { matchLabels: { app: ffo-compass, component: api } }
      ports: [{ port: 1729 }]
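One detail of the from list is easy to misread: separate list items are ORed, while a namespaceSelector and a podSelector inside the same list item are ANDed. The two fragments below are a generic illustration of that Kubernetes rule, reusing labels from the excerpt above; they are not additional platform policies.

```yaml
# Variant A: two list items, so EITHER selector may match (OR).
ingress:
  - from:
      # any pod in a namespace labeled name=traefik
      - namespaceSelector: { matchLabels: { name: traefik } }
      # OR a pod with this label in the policy's own namespace
      - podSelector: { matchLabels: { app.kubernetes.io/name: traefik } }
---
# Variant B: one list item, so BOTH selectors must match (AND).
ingress:
  - from:
      # only pods with this label that also live in a namespace labeled name=traefik
      - namespaceSelector: { matchLabels: { name: traefik } }
        podSelector: { matchLabels: { app.kubernetes.io/name: traefik } }
```

The only textual difference is one leading dash, so it is worth checking the indentation whenever a cross-namespace allow does not behave as expected.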
Adding a new service without getting blocked
When you deploy a new workload that must receive traffic, follow this checklist:
- Expect the deny. The namespace already has (or will get) default-deny-ingress. Your pod is unreachable until you add an allow.
- Write an allow-<component>-ingress NetworkPolicy in the target namespace selecting your pod and the ports it serves, with from: listing the callers (by podSelector and/or namespaceSelector).
- Cross-namespace callers need a namespaceSelector. Match on the automatic label kubernetes.io/metadata.name: <namespace> (every namespace has it). Pod selectors alone only match pods in the same namespace as the policy.
- If the caller is selected by an egress policy, allow the new destination there too. Otherwise the caller’s own egress policy will block the new call even though the destination allows ingress.
- Commit the policy to Gitea so ArgoCD manages it. Never leave it as a one-off kubectl apply (it will not survive and it is invisible to the next engineer).
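The checklist can be sketched as a pair of manifests for a hypothetical service called reports. The namespace, labels, and port of the new service are placeholders; the caller labels are borrowed from the Compass API selector shown earlier. Treat this as a template under those assumptions rather than a policy that exists on the platform.

```yaml
# Hypothetical service "reports": namespace, labels, and port are placeholders.
# Ingress allow in the TARGET namespace, selecting the new pod and its port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-reports-ingress
  namespace: reports
spec:
  podSelector:
    matchLabels:
      app: reports
  policyTypes: [Ingress]
  ingress:
    - from:
        # Cross-namespace caller: namespaceSelector AND podSelector in one item.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: f3iai
          podSelector:
            matchLabels: { app: ffo-compass, component: api }
      ports:
        - { protocol: TCP, port: 8080 }
---
# The caller is already selected by egress policies, so the new destination
# is added as one more egress allow in the CALLER's namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-compass-api-reports-egress
  namespace: f3iai
spec:
  podSelector:
    matchLabels: { app: ffo-compass, component: api }
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: reports
          podSelector:
            matchLabels:
              app: reports
      ports:
        - { protocol: TCP, port: 8080 }
```

Because policies are additive, the egress allow can live in its own object instead of being edited into an existing one, which keeps each Git change small and easy to review.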
Worked example — ESO → Vault (2026-06-13)
When Vault and External Secrets Operator were deployed, both namespaces
immediately received default-deny-ingress. ESO’s Vault login failed with
context deadline exceeded — the classic netpol-timeout signature. The fix was
a single allow policy in the vault namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-eso-to-vault
  namespace: vault
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: external-secrets
      ports:
        - protocol: TCP
          port: 8200
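The policy above selects every pod in the vault namespace and admits every pod in external-secrets. If a narrower rule is wanted, a variant along these lines would restrict both ends by label. The label values are assumptions (they follow common Helm chart defaults and have not been checked against this platform's deployments), and the policy name is hypothetical.

```yaml
# Narrower variant, sketch only: label values and policy name are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-eso-to-vault-scoped              # hypothetical name
  namespace: vault
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault            # assumed Vault pod label
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: external-secrets
          podSelector:
            matchLabels:
              app.kubernetes.io/name: external-secrets   # assumed ESO pod label
      ports:
        - protocol: TCP
          port: 8200
```

Verify the real labels with kubectl get pods --show-labels before adopting a variant like this; a selector that matches nothing fails silently and looks exactly like a missing policy.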
Troubleshooting connectivity
# 1. Confirm the destination namespace has default-deny-ingress (it almost always does)
kubectl get netpol -n <dst-ns>
# 2. List what IS allowed into the destination
kubectl get netpol -n <dst-ns> -o yaml | less
# 3. Symptom check: timeout => netpol; refused => wrong pod/port
# "context deadline exceeded" / "i/o timeout" -> missing allow policy
# "connection refused" -> service/pod/port problem
# 4. Remember egress: is the CALLER selected by an egress policy?
kubectl get netpol -n <src-ns> -o json | \
jq -r '.items[] | select(.spec.policyTypes[]=="Egress") | .metadata.name'
Observability of blocked traffic
Current state: Canal/Calico is running in open-source mode and Felix Prometheus metrics are not enabled, so there is no native “packets denied by NetworkPolicy” metric today. There is also no Loki/Splunk log pipeline and no blackbox prober deployed. Blocked traffic is therefore currently only visible as client-side timeouts in application logs and as failed readiness probes.
Practical ways to gain visibility, in order of effort/value:
1. Synthetic path probes (recommended). Deploy blackbox-exporter and define Prometheus Probe objects for critical cross-namespace paths (ESO→Vault, Compass→TypeDB, Traefik→MCP servers, etc.). A probe failure is a direct, alertable signal that a required path is broken, including when a NetworkPolicy is the cause. It integrates with the existing Prometheus + Alertmanager + SRE-dispatch stack with no new platform.
2. Enable Calico Felix Prometheus metrics for dataplane health (FelixConfiguration.spec.prometheusMetricsEnabled: true). Note: per-rule denied-packet counters are a Calico Enterprise feature; open-source Felix exposes dataplane health, not per-deny counts.
3. Hubble (Cilium) gives per-flow allow/deny visibility natively, but that is a CNI migration off Canal: a large change, not justified for this alone.
4. Splunk is not deployed on the platform. If it were, Felix/iptables drop logs or application timeout logs could be forwarded and alerted on. A lightweight blackbox-exporter probe is a smaller, more direct solution than standing up Splunk for this purpose.
The recommended next step is option 1: synthetic probes for the handful of must-work cross-namespace paths, surfaced on a Grafana dashboard and wired to an Alertmanager rule.
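A synthetic probe for one such path might look like the sketch below, using the Prometheus Operator Probe resource. All names here are assumptions: the monitoring namespace, the blackbox-exporter Service address, the Vault Service name, and the job and label values. Note also that the probe traffic originates from the blackbox-exporter pod, so that pod needs its own allow policy into the destination, and the probe approximates the real caller's path rather than exercising it directly.

```yaml
# Sketch only: namespace, Service addresses, and names are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: vault-tcp-8200                # hypothetical name
  namespace: monitoring               # assumed namespace
spec:
  jobName: netpath-vault              # hypothetical job name
  interval: 30s
  module: tcp_connect                 # must be defined in the blackbox-exporter config
  prober:
    url: blackbox-exporter.monitoring.svc:9115   # assumed exporter Service address
  targets:
    staticConfig:
      static:
        - vault.vault.svc:8200        # assumed Vault Service name
      labels:
        path: eso-to-vault            # hypothetical label for dashboards and alerts
```

An Alertmanager rule on probe_success == 0 for this job would then turn a dropped path into an alert instead of a client-side timeout buried in application logs.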
Quick reference
| Question | Answer |
|---|---|
| What enforces policy? | Canal (Calico Felix), CNI for RKE2 |
| Default ingress | Denied in every namespace except the excluded ones (Kyverno baseline-default-deny-ingress) |
| Default egress | Open, until a pod is selected by an egress policy |
| Can I delete default-deny-ingress? | No, Kyverno regenerates it (synchronize: true) |
| Excluded namespaces | kube-system, kube-public, kube-node-lease, kyverno, metallb-system, rook-ceph, ingress-nginx |
| Symptom of a netpol block | Connection timeout (context deadline exceeded), not “refused” |
| Cross-namespace allow | Use namespaceSelector on kubernetes.io/metadata.name |
| Where do policies live? | Gitea (GitOps), reconciled by ArgoCD |
| Denied-traffic metrics today | None native; use synthetic probes (blackbox-exporter) |