boma/docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md
sjat 4142bb15f8 docs(spec): M5 mesh-enrollment design (reachability-only)
base 'mesh' concern enrols NetBird agents on ubongo + askari via a reusable scoped
setup key (vault); laptops enrolled by the operator. Reachability via the default
peer policy; the base nftables default-deny on ubongo + ACL tightening are deferred
to a follow-on. Resolves ROADMAP M5 design; next: writing-plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:44:13 +02:00

M5 — Mesh enrollment (NetBird agents) → mobile access · design

Status: Design (2026-06-17). Implements ROADMAP M5, the last milestone of Phase 1 (remote access). Builds on M4b (the netbird_coordinator is live on askari). Design resolved by ADR-016 (mesh, agent-per-host) and ADR-021 (SSH ladder); this spec shapes the build for those decisions. Next: writing-plans.

Goal

ubongo reachable from anywhere over the self-hosted NetBird mesh — the Phase-1 mobile-access goal. Reachability only. The host-firewall lockdown and NetBird ACL-tightening are deliberately deferred (see §6).

Decisions (settled in brainstorming)

  1. Scope = reachability, not lockdown. The goal needs only: agents enrolled + the laptops on the mesh + a peer policy permitting laptop→ubongo. ubongo's SSH is already open, so reachability requires no firewall change. Applying the base nftables default-deny to ubongo is the lockout-risky part on the control node and is split into a follow-on (§6).
  2. One reusable, scoped, expiring setup key. A single reusable key in vault.netbird.setup_key, scoped to auto-assign peers to a boma-hosts group, with an expiry. base re-runs idempotently across hosts. Matches ADR-016's single vault path; blast radius is limited by scope + expiry + the fact that joining the mesh grants no access on its own (peer policy gates that). Rejected: per-host one-off ephemeral keys — more operator toil and they don't fit a single vault key for a re-runnable role.
  3. askari is enrolled as a peer (ADR-016: it runs the stack and is a peer). The agent coexists with the coordinator container on the same host. Enables later moving askari's SSH off the Hetzner-firewall WAN allow onto wt0, and gives a host-to-host mesh link verifiable from ubongo.

Architecture

1. New base concern: mesh (agent enrollment)

A new roles/base/tasks/mesh.yml is included from base/tasks/main.yml via include_tasks. The include is tagged mesh and also passes apply: { tags: [mesh] }, because a tag on a dynamic include does not propagate to the included tasks on its own (the dynamic-include tag-propagation gotcha; see existing concerns). A new mesh entry is added to the closed tag vocabulary in tests/tags.yml.
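A minimal sketch of that include, assuming the concern is gated on base__mesh_enabled at the include itself (the task name and the exact placement of the when are illustrative, not taken from the existing role):

```yaml
# roles/base/tasks/main.yml (excerpt, illustrative)
- name: Mesh enrollment (NetBird agent)
  ansible.builtin.include_tasks:
    file: mesh.yml
    # apply pushes the tag down onto every task inside mesh.yml
    apply:
      tags: [mesh]
  # the tag on the include itself makes --tags mesh select this task at all
  tags: [mesh]
  # assumption: the opt-in gate lives here, so false skips the whole concern
  when: base__mesh_enabled | bool
```

Both tag declarations are needed: without the outer tag the include is never reached under --tags mesh, and without apply the included tasks carry no tag and are filtered out.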

The concern:

  • installs a pinned NetBird agent from the official NetBird apt repo (repo + key added like docker_host does for Docker; exact package + version verified in the plan per ADR-014). Version-pinned (ADR-011).
  • enrolls idempotently: run netbird up --management-url {{ base__mesh_management_url }} --setup-key <key> only when netbird status reports the host is not already connected (guard on a command check, changed_when accordingly). The setup key is passed with no_log: true.
  • does NOT touch the host firewall. Enrollment is purely additive: wt0 comes up, sshd keeps listening on all interfaces exactly as today. No lockout risk in M5.
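The install and enrollment guard described above could be sketched roughly as follows. This is not the final task file: the package name netbird, the registered variable base__mesh_status, and the "Management: Connected" string matched in the netbird status output are assumptions to verify against the pinned version in the plan (ADR-014). The apt repo and key setup is omitted.

```yaml
# roles/base/tasks/mesh.yml (sketch under the assumptions stated above)
- name: Install the pinned NetBird agent
  ansible.builtin.apt:
    # assumption: the package is called "netbird" in the official repo
    name: "netbird={{ base__mesh_version }}"
    state: present
    update_cache: true

- name: Check whether the agent is already connected
  ansible.builtin.command: netbird status
  register: base__mesh_status   # hypothetical variable name
  changed_when: false           # read-only probe, never reports a change
  failed_when: false            # a stopped or unenrolled daemon is not an error here
  check_mode: false

- name: Enroll the agent against the coordinator
  ansible.builtin.command:
    argv:
      - netbird
      - up
      - --management-url
      - "{{ base__mesh_management_url }}"
      - --setup-key
      - "{{ base__mesh_setup_key }}"
  no_log: true                  # keeps the setup key out of logs
  when:
    - base__mesh_manage | bool
    # assumption: this substring marks a connected agent in the status output
    - "'Management: Connected' not in (base__mesh_status.stdout | default(''))"
```

Because the enroll task only runs when the probe says the host is not connected, a second run skips it and reports no change, which is what the idempotency check relies on.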

Knobs (base__mesh_*, defaults in roles/base/defaults/main.yml):

| Var | Default | Purpose |
| --- | --- | --- |
| base__mesh_enabled | false | Policy/opt-in gate. false ⇒ the whole concern is skipped, so applying base to a host not ready to join the mesh is a no-op. Set true per host/group (ubongo, askari) to enrol. |
| base__mesh_manage | true | Test gate for the live daemon step. true ⇒ run netbird up; Molecule sets false so the concern can be exercised without a real coordinator/key (mirrors reverse_proxy__manage / netbird_coordinator__manage). |
| base__mesh_management_url | https://netbird.askari.wingu.me | The M4b coordinator. |
| base__mesh_setup_key | "{{ vault.netbird.setup_key }}" | Reusable scoped key (vault). |
| base__mesh_version | pinned (plan) | NetBird agent version (ADR-011). |
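Written out as role defaults, the knobs would look roughly like this. The version is deliberately left unset because it is pinned in the plan, not in this spec.

```yaml
# roles/base/defaults/main.yml (mesh knobs only, sketch)
base__mesh_enabled: false        # opt-in gate; false skips the whole concern
base__mesh_manage: true          # Molecule overrides this to false
base__mesh_management_url: "https://netbird.askari.wingu.me"
base__mesh_setup_key: "{{ vault.netbird.setup_key }}"
# base__mesh_version: pinned in the plan (ADR-011); no value is fixed by this spec
```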

2. Vault

Add vault.netbird.setup_key: CHANGEME with a comment stating it is a reusable, scoped (boma-hosts), expiring setup key minted in the NetBird dashboard after first-boot /setup. The agent cannot mint it — the operator supplies it via make edit-vault. make check-vault lists the outstanding CHANGEME until then. base/tasks/mesh.yml wires to {{ vault.netbird.setup_key }}.
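As a sketch, the vault entry could look like the fragment below. The nesting is inferred from the vault.netbird.setup_key path; the surrounding vault layout is an assumption.

```yaml
# vault file, edited via make edit-vault (layout inferred from the variable path)
vault:
  netbird:
    # Reusable, scoped (auto-assigns peers to boma-hosts), expiring setup key.
    # Minted by the operator in the NetBird dashboard after first-boot /setup.
    setup_key: CHANGEME
```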

3. Enrollment scope

  • ubongo: base mesh concern applied (tagged), bringing up wt0. Its other base concerns (firewall, hardening) stay unapplied; TAGS=mesh scopes the run to enrollment only, so no default-deny lands on the control node.
  • askari: base mesh concern applied; the agent enrols against its own public coordinator URL and coexists with the coordinator container.
  • mamba + work laptop: the operator installs the NetBird client and logs in via the dashboard (embedded Dex SSO). Not Ansible-managed; out of automation scope.
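For the two Ansible-managed hosts, the per-host opt-in could be expressed as below. The host_vars file names are hypothetical; the repository's actual inventory layout may place these in group vars instead.

```yaml
# host_vars/ubongo.yml (hypothetical path)
base__mesh_enabled: true
---
# host_vars/askari.yml (hypothetical path)
base__mesh_enabled: true
```

Every other host keeps the default of false, so a run of base against it leaves the mesh concern untouched.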

4. Reachability

M5 relies on NetBird's default peer policy for laptop→ubongo reachability. The plan verifies the pinned version's default-policy behaviour (ADR-014); if it is not allow-by-default, the plan adds one minimal policy permitting the admin group → ubongo SSH. ACL-tightening to default-deny + scoped policies (ADR-016 intent) is deferred (§6).

Testing

  • Automated (I do, needs nothing from operator): Molecule for the base mesh concern with base__mesh_enabled: true, base__mesh_manage: false, and a dummy vault.netbird.setup_key, so the install and enrollment-guard tasks are exercised while the live netbird up (which needs a real coordinator + key) is gated off. This concern is an install plus a daemon command, so its render-only surface is thin (the "render-only tests miss the real call" gotcha): Molecule asserts that the enroll command is constructed correctly and that the idempotency guard works; full enrollment is proven in the live step below. Molecule also asserts that base__mesh_enabled: false is a clean no-op. Plus make lint (incl. check-tags for the new mesh tag).
  • Live (gated, after the operator handoff): apply base with TAGS=mesh to ubongo + askari; verify wt0 is up and the ubongo↔askari mesh link works from ubongo (both are peers I manage; e.g. netbird status shows the peer, and a ping to the peer's mesh IP succeeds).
  • Goal verification (operator): from a laptop on the mesh, SSH ubongo over its NetBird/wt0 address. This is the mobile-access goal landing.
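The automated step could be sketched as the Molecule plays below. The dummy key value, the split into converge and verify plays, and the netbird package name in the assertion are assumptions; how the scenario restricts base to the mesh concern is left to the plan.

```yaml
# converge play (sketch): exercise the concern with the live daemon step gated off
- name: Converge
  hosts: all
  become: true
  vars:
    base__mesh_enabled: true
    base__mesh_manage: false      # skips the real netbird up
    vault:
      netbird:
        setup_key: dummy-setup-key   # placeholder, never a real key
  roles:
    - role: base
---
# verify play (sketch): confirm the agent package landed
- name: Verify
  hosts: all
  become: true
  tasks:
    - name: Gather installed packages
      ansible.builtin.package_facts:
        manager: auto

    - name: Assert the NetBird agent is installed
      ansible.builtin.assert:
        that:
          # assumption: the package is named "netbird"
          - "'netbird' in ansible_facts.packages"
```

The clean no-op case would be a second scenario (or a second converge run) with base__mesh_enabled: false, asserting that no mesh task reports a change.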

Operator handoff (the steps only the operator can do)

  1. Dashboard /setup (one-time) → create the admin user.
  2. Mint a reusable, scoped (boma-hosts), expiring setup key → make edit-vault to replace the CHANGEME → re-encrypt. (make check-vault confirms.)
  3. Install the NetBird client on mamba + the work laptop, log in via the dashboard.
  4. Confirm SSH to ubongo over the mesh.

Out of scope / deferred (the "mesh hardening" follow-on)

  • base nftables default-deny on ubongo (SSH only on wt0 + the base__firewall_control_addr LAN fallback, ADR-021/020). Built + dormant today; applying it to the control node is the lockout-risky step and gets its own deliberate change after the mesh path to ubongo is proven solid.
  • NetBird ACL tightening to default-deny + scoped per-group policies (ADR-016: admin peers → srv+mgmt, clients least-privilege). M5 uses the default policy.
  • askari SSH onto wt0 (retiring the Hetzner-firewall WAN SSH allow) — enabled by askari now being a peer, but a separate change.

Maps to

ADR-016 (mesh, agent-per-host, setup keys in vault), ADR-021 (SSH ladder — wt0 primary + ssh-from-control; the lockdown that uses this is deferred), ADR-020 (host firewall — default-deny deferred), ADR-002 (security baseline), ADR-011 (version-pinned agent), ADR-004 (enrollment lives in base, not a new role), ADR-014 (verify agent version/package + default-policy behaviour in the plan).