From 4142bb15f8b3be2771d343bec9c136a62ad19c64 Mon Sep 17 00:00:00 2001
From: sjat
Date: Wed, 17 Jun 2026 15:44:13 +0200
Subject: [PATCH] docs(spec): M5 mesh-enrollment design (reachability-only)

base 'mesh' concern enrols NetBird agents on ubongo + askari via a
reusable scoped setup key (vault); laptops enrolled by the operator.
Reachability via the default peer policy; the base nftables default-deny
on ubongo + ACL tightening are deferred to a follow-on. Resolves ROADMAP
M5 design; next: writing-plans.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../2026-06-17-m5-mesh-enrollment-design.md | 131 ++++++++++++++++++
 1 file changed, 131 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md

diff --git a/docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md b/docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md
new file mode 100644
index 0000000..b0b277f
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md
@@ -0,0 +1,131 @@
+# M5 — Mesh enrollment (NetBird agents) → mobile access · design
+
+**Status:** Design (2026-06-17). Implements ROADMAP **M5**, the last milestone of Phase 1
+(remote access). Builds on M4b (the `netbird_coordinator` is live on `askari`). Design
+resolved by **ADR-016** (mesh, agent-per-host) and **ADR-021** (SSH ladder); this spec is
+the build-shaping for that decision. Next: `writing-plans`.
+
+## Goal
+
+`ubongo` reachable from anywhere over the self-hosted NetBird mesh — the Phase-1
+mobile-access goal. **Reachability only.** The host-firewall lockdown and NetBird
+ACL-tightening are deliberately **deferred** (see §6).
+
+## Decisions (settled in brainstorming)
+
+1. **Scope = reachability, not lockdown.** The goal needs only: agents enrolled + the
+   laptops on the mesh + a peer policy permitting laptop→`ubongo`. `ubongo`'s SSH is
+   already open, so reachability requires **no firewall change**.
+   Applying the `base` nftables default-deny to `ubongo` is the lockout-risky part on the
+   control node and is split into a follow-on (§6).
+2. **One reusable, scoped, expiring setup key.** A single reusable key in
+   `vault.netbird.setup_key`, scoped to auto-assign peers to a `boma-hosts` group, with an
+   expiry. `base` re-runs idempotently across hosts. Matches ADR-016's single vault path;
+   blast radius is limited by scope + expiry + the fact that joining the mesh grants no
+   access on its own (peer policy gates that). Rejected: per-host one-off ephemeral keys —
+   more operator toil and they don't fit a single vault key for a re-runnable role.
+3. **`askari` is enrolled as a peer** (ADR-016: it runs the stack *and* is a peer). The
+   agent coexists with the coordinator container on the same host. Enables later moving
+   `askari`'s SSH off the Hetzner-firewall WAN allow onto `wt0`, and gives a host-to-host
+   mesh link verifiable from `ubongo`.
+
+## Architecture
+
+### 1. New `base` concern: `mesh` (agent enrollment)
+
+A new `roles/base/tasks/mesh.yml`, included from `base/tasks/main.yml` via
+`include_tasks` with `apply: { tags: [mesh] }` (the dynamic-include tag-propagation
+gotcha — see existing concerns), tagged `mesh`. A new `mesh` entry is added to the closed
+tag vocabulary in `tests/tags.yml`.
+
+The concern:
+
+- **installs a pinned NetBird agent** from the official NetBird apt repo (repo + key added
+  like `docker_host` does for Docker; exact package + version **verified in the plan** per
+  ADR-014). Version-pinned (ADR-011).
+- **enrolls idempotently:** run `netbird up --management-url {{ base__mesh_management_url }}
+  --setup-key ` **only when** `netbird status` reports the host is not already
+  connected (guard on a `command` check, `changed_when` accordingly). The setup key is
+  passed with `no_log: true`.
+- **does NOT touch the host firewall.** Enrollment is purely additive: `wt0` comes up,
+  `sshd` keeps listening on all interfaces exactly as today. No lockout risk in M5.
+
+**Knobs (`base__mesh_*`, defaults in `roles/base/defaults/main.yml`):**
+
+| Var | Default | Purpose |
+|---|---|---|
+| `base__mesh_enabled` | `false` | **Policy/opt-in gate.** `false` ⇒ the whole concern is skipped, so applying `base` to a host not ready to join the mesh is a no-op. Set `true` per host/group (`ubongo`, `askari`) to enrol. |
+| `base__mesh_manage` | `true` | **Test gate** for the live daemon step. `true` ⇒ run `netbird up`; Molecule sets `false` so the concern can be exercised without a real coordinator/key (mirrors `reverse_proxy__manage` / `netbird_coordinator__manage`). |
+| `base__mesh_management_url` | `https://netbird.askari.wingu.me` | The M4b coordinator. |
+| `base__mesh_setup_key` | `"{{ vault.netbird.setup_key }}"` | Reusable scoped key (vault). |
+| `base__mesh_version` | pinned (plan) | NetBird agent version (ADR-011). |
+
+### 2. Vault
+
+Add `vault.netbird.setup_key: CHANGEME` with a comment stating it is a **reusable, scoped
+(`boma-hosts`), expiring** setup key minted in the NetBird dashboard after first-boot
+`/setup`. The agent cannot mint it — the operator supplies it via `make edit-vault`.
+`make check-vault` lists the outstanding `CHANGEME` until then. `base/tasks/mesh.yml` wires
+to `{{ vault.netbird.setup_key }}`.
+
+### 3. Enrollment scope
+
+- **`ubongo`** — `base` `mesh` concern applied (tagged), bringing up `wt0`. Its other
+  `base` concerns (`firewall`, `hardening`) stay unapplied — `TAGS=mesh` scopes the run to
+  enrollment only, so no default-deny lands on the control node.
+- **`askari`** — `base` `mesh` concern applied; agent enrols against its own public
+  coordinator URL and coexists with the coordinator container.
+- **`mamba` + work laptop** — **operator** installs the NetBird client and logs in via the
+  dashboard (embedded Dex SSO).
+  Not Ansible-managed; out of automation scope.
+
+### 4. Reachability
+
+M5 relies on NetBird's **default peer policy** for laptop→`ubongo` reachability. The plan
+**verifies the pinned version's default-policy behaviour** (ADR-014); if it is not
+allow-by-default, the plan adds one minimal policy permitting the admin group → `ubongo`
+SSH. ACL-tightening to default-deny + scoped policies (ADR-016 intent) is **deferred** (§6).
+
+## Testing
+
+- **Automated (I do, needs nothing from operator):** Molecule for the `base` `mesh`
+  concern with `base__mesh_enabled: true`, `base__mesh_manage: false`, and a dummy
+  `vault.netbird.setup_key` — so the install/enrol tasks are exercised but the live
+  `netbird up` (which needs a real coordinator + key) is gated off. Note: this concern is
+  install + a daemon command, so its render-only surface is thin (the "render-only tests
+  miss the real call" gotcha) — Molecule asserts the enrol command is constructed
+  correctly + idempotency guard works; full enrollment is proven in the live step below.
+  Also assert `base__mesh_enabled: false` is a clean no-op. `make lint` (incl.
+  `check-tags` for the new `mesh` tag).
+- **Live (gated, after the operator handoff):** apply `base` `TAGS=mesh` to `ubongo` +
+  `askari`; verify `wt0` is up and the **`ubongo`↔`askari` mesh link** works from `ubongo`
+  (both are peers I manage — e.g. `netbird status` shows the peer, ping the peer's mesh IP).
+- **Goal verification (operator):** from a laptop on the mesh, SSH `ubongo` over its
+  NetBird/`wt0` address. This is the mobile-access goal landing.
+
+## Operator handoff (the steps only the operator can do)
+
+1. Dashboard `/setup` (one-time) → create the admin user.
+2. Mint a **reusable, scoped (`boma-hosts`), expiring** setup key → `make edit-vault` to
+   replace the `CHANGEME` → re-encrypt. (`make check-vault` confirms.)
+3. Install the NetBird client on `mamba` + the work laptop, log in via the dashboard.
+4.
+   Confirm SSH to `ubongo` over the mesh.
+
+## Out of scope / deferred (the "mesh hardening" follow-on)
+
+- **`base` nftables default-deny on `ubongo`** (SSH only on `wt0` + the
+  `base__firewall_control_addr` LAN fallback, ADR-021/020). Built + dormant today; applying
+  it to the control node is the lockout-risky step and gets its own deliberate change
+  **after** the mesh path to `ubongo` is proven solid.
+- **NetBird ACL tightening** to default-deny + scoped per-group policies (ADR-016: admin
+  peers → `srv`+`mgmt`, clients least-privilege). M5 uses the default policy.
+- **`askari` SSH onto `wt0`** (retiring the Hetzner-firewall WAN SSH allow) — enabled by
+  `askari` now being a peer, but a separate change.
+
+## Maps to
+
+ADR-016 (mesh, agent-per-host, setup keys in vault), ADR-021 (SSH ladder — `wt0` primary +
+`ssh-from-control`; the lockdown that *uses* this is deferred), ADR-020 (host firewall — default-deny
+deferred), ADR-002 (security baseline), ADR-011 (version-pinned agent), ADR-004 (enrollment lives in
+`base`, not a new role), ADR-014 (verify agent version/package + default-policy behaviour in the plan).
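
Review note: the gated include and the idempotent enrol guard that section 1 of the spec describes might look roughly like the sketch below. It is an illustration under stated assumptions, not the planned implementation. The task names, the registered variable `base__mesh_status`, and the `Management: Connected` string matched in the `netbird status` output are all assumptions; per ADR-014 the plan would need to verify the actual status output of the pinned agent version before relying on it.

```yaml
# roles/base/tasks/main.yml (excerpt; task wording is hypothetical)
- name: Include the mesh concern
  ansible.builtin.include_tasks:
    file: mesh.yml
    apply:
      tags: [mesh]   # propagate the tag onto the dynamically included tasks
  tags: [mesh]       # tag the include itself so a mesh-only run reaches it
  when: base__mesh_enabled | bool

# roles/base/tasks/mesh.yml (excerpt; task wording is hypothetical)
- name: Check whether the agent is already connected
  ansible.builtin.command:
    argv: [netbird, status]
  register: base__mesh_status   # hypothetical variable name
  changed_when: false           # a read-only probe never reports a change
  failed_when: false            # a non-zero exit is treated as "not enrolled yet"

- name: Enrol the host against the coordinator
  ansible.builtin.command:
    argv:
      - netbird
      - up
      - --management-url
      - "{{ base__mesh_management_url }}"
      - --setup-key
      - "{{ base__mesh_setup_key }}"
  no_log: true                  # keep the setup key out of task output
  when:
    - base__mesh_manage | bool
    # ASSUMPTION: the pinned agent prints this line once it is connected.
    - "'Management: Connected' not in base__mesh_status.stdout"
```

Passing the command as `argv` avoids shell quoting of the setup key, and leaving the status probe ungated means a Molecule run with `base__mesh_manage: false` still exercises everything except the live `netbird up`.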