docs(spec): M5 mesh-enrollment design (reachability-only)

base 'mesh' concern enrols NetBird agents on ubongo + askari via a reusable scoped
setup key (vault); laptops enrolled by the operator. Reachability via the default
peer policy; the base nftables default-deny on ubongo + ACL tightening are deferred
to a follow-on. Resolves ROADMAP M5 design; next: writing-plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-17 15:44:13 +02:00
parent 94dd6da14c
commit 4142bb15f8


@ -0,0 +1,131 @@
# M5 — Mesh enrollment (NetBird agents) → mobile access · design
**Status:** Design (2026-06-17). Implements ROADMAP **M5**, the last milestone of Phase 1
(remote access). Builds on M4b (the `netbird_coordinator` is live on `askari`). Design
resolved by **ADR-016** (mesh, agent-per-host) and **ADR-021** (SSH ladder); this spec is
the build-shaping for that decision. Next: `writing-plans`.
## Goal
`ubongo` reachable from anywhere over the self-hosted NetBird mesh — the Phase-1
mobile-access goal. **Reachability only.** The host-firewall lockdown and NetBird
ACL-tightening are deliberately **deferred** (see §6).
## Decisions (settled in brainstorming)
1. **Scope = reachability, not lockdown.** The goal needs only: agents enrolled + the
laptops on the mesh + a peer policy permitting laptop→`ubongo`. `ubongo`'s SSH is
already open, so reachability requires **no firewall change**. Applying the `base`
nftables default-deny to `ubongo` is the lockout-risky part on the control node and is
split into a follow-on (§6).
2. **One reusable, scoped, expiring setup key.** A single reusable key in
`vault.netbird.setup_key`, scoped to auto-assign peers to a `boma-hosts` group, with an
expiry. `base` re-runs idempotently across hosts. Matches ADR-016's single vault path;
blast radius is limited by scope + expiry + the fact that joining the mesh grants no
access on its own (peer policy gates that). Rejected: per-host one-off ephemeral keys —
more operator toil and they don't fit a single vault key for a re-runnable role.
3. **`askari` is enrolled as a peer** (ADR-016: it runs the stack *and* is a peer). The
agent coexists with the coordinator container on the same host. Enables later moving
`askari`'s SSH off the Hetzner-firewall WAN allow onto `wt0`, and gives a host-to-host
mesh link verifiable from `ubongo`.
## Architecture
### 1. New `base` concern: `mesh` (agent enrollment)
A new `roles/base/tasks/mesh.yml`, included from `roles/base/tasks/main.yml` via
`include_tasks`. The include is tagged `mesh` and also carries `apply: { tags: [mesh] }`,
because a tag on a dynamic include does not propagate to the included tasks on its own
(the dynamic-include tag-propagation gotcha; see existing concerns). A new `mesh` entry is
added to the closed tag vocabulary in `tests/tags.yml`.
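A minimal sketch of how that include might look. The task name and the placement of the
`base__mesh_enabled` gate on the include are assumptions, not taken from the existing role;
the `apply` block is what pushes the `mesh` tag down onto the included tasks.

```yaml
# roles/base/tasks/main.yml (excerpt, illustrative only)
- name: Mesh enrollment (NetBird agent)   # task name is hypothetical
  ansible.builtin.include_tasks:
    file: mesh.yml
    apply:
      tags: [mesh]                  # propagate the tag to every task inside mesh.yml
  tags: [mesh]                      # tag the include itself so a mesh-only run selects it
  when: base__mesh_enabled | bool   # assumed location of the opt-in gate
```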
The concern:
- **installs a pinned NetBird agent** from the official NetBird apt repo (repo + key added
like `docker_host` does for Docker; exact package + version **verified in the plan** per
ADR-014). Version-pinned (ADR-011).
- **enrolls idempotently:** run `netbird up --management-url {{ base__mesh_management_url }}
--setup-key <key>` **only when** `netbird status` reports the host is not already
connected (guard on a `command` check, `changed_when` accordingly). The setup key is
passed with `no_log: true`.
- **does NOT touch the host firewall.** Enrollment is purely additive: `wt0` comes up,
`sshd` keeps listening on all interfaces exactly as today. No lockout risk in M5.
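The install and guarded-enrol steps above could be sketched roughly as follows. This is
not the final task file: the package name `netbird`, the `name=version` apt pin, the
registered variable `base__mesh_status`, and the `Management: Connected` string matched in
the `netbird status` output are all assumptions to verify against the pinned agent version
in the plan. Repo and signing-key setup is omitted because the plan still has to verify it.

```yaml
# roles/base/tasks/mesh.yml (excerpt, illustrative only)
- name: Install the pinned NetBird agent
  ansible.builtin.apt:
    name: "netbird={{ base__mesh_version }}"   # package name and pin syntax assumed
    state: present
    update_cache: true

- name: Check whether the agent is already connected
  ansible.builtin.command:
    argv: [netbird, status]
  register: base__mesh_status        # hypothetical variable name
  changed_when: false                # a read-only probe never reports a change
  failed_when: false                 # a non-zero exit just means "not connected yet"
  when: base__mesh_manage | bool

- name: Enrol the agent against the coordinator
  ansible.builtin.command:
    argv:
      - netbird
      - up
      - --management-url
      - "{{ base__mesh_management_url }}"
      - --setup-key
      - "{{ base__mesh_setup_key }}"
  no_log: true                       # keep the setup key out of logs
  when:
    - base__mesh_manage | bool
    # Assumed marker of a connected agent; confirm the exact output in the plan.
    - "'Management: Connected' not in (base__mesh_status.stdout | default(''))"
```

Because the enrol task only runs when the guard is false, a second run against an
already-connected host skips it and reports no change.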
**Knobs (`base__mesh_*`, defaults in `roles/base/defaults/main.yml`):**
| Var | Default | Purpose |
|---|---|---|
| `base__mesh_enabled` | `false` | **Policy/opt-in gate.** `false` ⇒ the whole concern is skipped, so applying `base` to a host not ready to join the mesh is a no-op. Set `true` per host/group (`ubongo`, `askari`) to enrol. |
| `base__mesh_manage` | `true` | **Test gate** for the live daemon step. `true` ⇒ run `netbird up`; Molecule sets `false` so the concern can be exercised without a real coordinator/key (mirrors `reverse_proxy__manage` / `netbird_coordinator__manage`). |
| `base__mesh_management_url` | `https://netbird.askari.wingu.me` | The M4b coordinator. |
| `base__mesh_setup_key` | `"{{ vault.netbird.setup_key }}"` | Reusable scoped key (vault). |
| `base__mesh_version` | pinned (plan) | NetBird agent version (ADR-011). |
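
As a sketch, the table above would translate into defaults along these lines. The values
are the ones listed in the table; the version is left unset here because the plan pins it.

```yaml
# roles/base/defaults/main.yml (excerpt, illustrative only)
base__mesh_enabled: false     # opt-in gate; false skips the whole concern
base__mesh_manage: true       # test gate; Molecule sets false to skip `netbird up`
base__mesh_management_url: "https://netbird.askari.wingu.me"
base__mesh_setup_key: "{{ vault.netbird.setup_key }}"
# base__mesh_version: pinned in the plan (ADR-011); deliberately not filled in here
```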
### 2. Vault
Add `vault.netbird.setup_key: CHANGEME` with a comment stating it is a **reusable, scoped
(`boma-hosts`), expiring** setup key minted in the NetBird dashboard after first-boot
`/setup`. The agent cannot mint it — the operator supplies it via `make edit-vault`.
`make check-vault` lists the outstanding `CHANGEME` until then. `roles/base/tasks/mesh.yml`
reads the key through `base__mesh_setup_key`, which defaults to
`{{ vault.netbird.setup_key }}`.
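For illustration, the decrypted vault entry might look like this. The nesting follows the
`vault.netbird.setup_key` path named above; the comment wording is only a suggestion.

```yaml
# Vault content as seen through `make edit-vault` (illustrative only)
vault:
  netbird:
    # Reusable, scoped (auto-assigns peers to boma-hosts), expiring setup key.
    # Minted by the operator in the NetBird dashboard after first-boot /setup.
    setup_key: CHANGEME
```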
### 3. Enrollment scope
- **`ubongo`** — `base` `mesh` concern applied (tagged), bringing up `wt0`. Its other
`base` concerns (`firewall`, `hardening`) stay unapplied — `TAGS=mesh` scopes the run to
enrollment only, so no default-deny lands on the control node.
- **`askari`** — `base` `mesh` concern applied; agent enrols against its own public
coordinator URL and coexists with the coordinator container.
- **`mamba` + work laptop** — **operator** installs the NetBird client and logs in via the
dashboard (embedded Dex SSO). Not Ansible-managed; out of automation scope.
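The per-host opt-in for the two Ansible-managed peers could be as small as the following.
The `host_vars` file paths are hypothetical; the repo may carry these overrides in group
vars or elsewhere.

```yaml
# host_vars/ubongo.yml (hypothetical path)
base__mesh_enabled: true
---
# host_vars/askari.yml (hypothetical path)
base__mesh_enabled: true
```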
### 4. Reachability
M5 relies on NetBird's **default peer policy** for laptop→`ubongo` reachability. The plan
**verifies the pinned version's default-policy behaviour** (ADR-014); if it is not
allow-by-default, the plan adds one minimal policy permitting the admin group → `ubongo`
SSH. ACL-tightening to default-deny + scoped policies (ADR-016 intent) is **deferred**
(§6).
## Testing
- **Automated (I do, needs nothing from operator):** Molecule for the `base` `mesh`
concern with `base__mesh_enabled: true`, `base__mesh_manage: false`, and a dummy
`vault.netbird.setup_key` — so the install/enrol tasks are exercised but the live
`netbird up` (which needs a real coordinator + key) is gated off. Note: this concern is
install + a daemon command, so its render-only surface is thin (the "render-only tests
miss the real call" gotcha) — Molecule asserts the enrol command is constructed
correctly + idempotency guard works; full enrollment is proven in the live step below.
Also assert `base__mesh_enabled: false` is a clean no-op. `make lint` (incl.
`check-tags` for the new `mesh` tag).
- **Live (gated, after the operator handoff):** apply `base` `TAGS=mesh` to `ubongo` +
`askari`; verify `wt0` is up and the **`ubongo`↔`askari` mesh link** works from `ubongo`
(both are peers I manage — e.g. `netbird status` shows the peer, ping the peer's mesh IP).
- **Goal verification (operator):** from a laptop on the mesh, SSH to `ubongo` over its
NetBird/`wt0` address. This is the mobile-access goal landing.
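
A minimal sketch of the Molecule converge play for the automated case above, assuming a
dedicated scenario. The scenario name, file path, and dummy key value are hypothetical; the
three variable settings are the ones the automated test calls for.

```yaml
# molecule/mesh/converge.yml (scenario name and path are hypothetical)
- name: Converge the base mesh concern without a live coordinator
  hosts: all
  become: true
  vars:
    base__mesh_enabled: true      # exercise the concern
    base__mesh_manage: false      # gate off the live `netbird up`
    vault:
      netbird:
        setup_key: dummy-setup-key   # placeholder, never a real key
  roles:
    - role: base
```

A second scenario or play with `base__mesh_enabled: false` would cover the clean no-op
assertion.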
## Operator handoff (the steps only the operator can do)
1. Dashboard `/setup` (one-time) → create the admin user.
2. Mint a **reusable, scoped (`boma-hosts`), expiring** setup key → `make edit-vault` to
replace the `CHANGEME` → re-encrypt. (`make check-vault` confirms.)
3. Install the NetBird client on `mamba` + the work laptop, log in via the dashboard.
4. Confirm SSH to `ubongo` over the mesh.
## Out of scope / deferred (the "mesh hardening" follow-on)
- **`base` nftables default-deny on `ubongo`** (SSH only on `wt0` + the
`base__firewall_control_addr` LAN fallback, ADR-021/020). Built + dormant today; applying
it to the control node is the lockout-risky step and gets its own deliberate change
**after** the mesh path to `ubongo` is proven solid.
- **NetBird ACL tightening** to default-deny + scoped per-group policies (ADR-016: admin
peers → `srv`+`mgmt`, clients least-privilege). M5 uses the default policy.
- **`askari` SSH onto `wt0`** (retiring the Hetzner-firewall WAN SSH allow) — enabled by
`askari` now being a peer, but a separate change.
## Maps to
ADR-016 (mesh, agent-per-host, setup keys in vault), ADR-021 (SSH ladder — `wt0` primary +
`ssh-from-control`; the lockdown that *uses* this is deferred), ADR-020 (host firewall —
default-deny deferred), ADR-002 (security baseline), ADR-011 (version-pinned agent),
ADR-004 (enrollment lives in `base`, not a new role), ADR-014 (verify agent
version/package + default-policy behaviour in the plan).