boma/docs/superpowers/plans/2026-06-17-m5-mesh-enrollment.md
sjat 55776fb03c docs(plan): M5 mesh-enrollment implementation plan
8 tasks: build the base 'mesh' concern + tag + vault stub + per-host opt-in
(autonomous), operator handoff for /setup + setup key, gated live enrol of
ubongo + askari, operator laptop enrol, docs. Reachability-only; lockdown deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:49:28 +02:00

234 lines
14 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# M5 — Mesh enrollment (NetBird agents) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax.
**Goal:** `ubongo` reachable from anywhere over the NetBird mesh — enrol NetBird agents on `ubongo` + `askari` via a new opt-in `base` `mesh` concern; the operator enrols the laptops.
**Architecture:** A new `base` concern (`roles/base/tasks/mesh.yml`) installs a pinned NetBird agent and runs `netbird up` with a reusable scoped setup key from vault. Gated by `base__mesh_enabled` (per-host opt-in) and `base__mesh_manage` (skips network/daemon actions for Molecule). **No firewall change** — enrollment is additive (`wt0` comes up, SSH keeps listening), so there is zero lockout risk. The host nftables default-deny + NetBird ACL tightening are a separate, deferred follow-on.
**Tech Stack:** NetBird agent (apt, pinned), Ansible (`base` role), Molecule, the M4b coordinator at `https://netbird.askari.wingu.me`.
**Spec:** `docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md`
**Execution context:** Tasks 14 author + commit (need nothing from the operator). **Task 5 is an operator handoff** (dashboard `/setup` + mint key). **Task 6 applies live to `ubongo` + `askari`** (gated). Task 7 is operator-only (laptops). Task 8 docs.
---
## File structure
| File | Change | Responsibility |
|---|---|---|
| `tests/tags.yml` | modify | add the `mesh` concern to the closed tag vocabulary |
| `roles/base/defaults/main.yml` | modify | `base__mesh_*` knobs |
| `roles/base/tasks/mesh.yml` | **create** | the enrollment concern (install + `netbird up`) |
| `roles/base/tasks/main.yml` | modify | include `mesh.yml` (gated, tagged) |
| `roles/base/README.md` | modify | document the `mesh` concern + knobs |
| `roles/base/molecule/default/converge.yml` | modify | enable mesh (manage off) + dummy key |
| `roles/base/molecule/default/verify.yml` | modify | assert mesh wiring / no-op |
| `inventories/production/group_vars/control/vars.yml` | modify | `base__mesh_enabled: true` (ubongo) |
| `inventories/production/group_vars/offsite_hosts/vars.yml` | **create** | `base__mesh_enabled: true` (askari) |
| `inventories/production/group_vars/all/vault.yml` | modify (vault) | `vault.netbird.setup_key: CHANGEME` |
| `STATUS.md`, `docs/ROADMAP.md`, `docs/FRICTION.md` | modify | M5 done; deferred hardening; friction note |
---
### Task 1: Verify + pin the NetBird agent; add the `mesh` tag
- [ ] **Step 1 (ADR-014 verification — record the answers):** confirm against current NetBird docs/repo (WebFetch `docs.netbird.io`, `pkgs.netbird.io`):
- the **apt repo** URL + signing-key URL + suite/component (the install-script publishes an apt source — capture the exact `deb` line and key URL);
- the **package name** (headless agent — expected `netbird`) and that **version `0.72.4`** (matching the coordinator) is installable, plus the apt **version-pin syntax**;
- the exact **`netbird status`** output string that indicates an established management connection (for the idempotency guard — e.g. `Management: Connected`);
- the **`netbird up`** flags (`--management-url`, `--setup-key`);
- whether the pinned NetBird's **default peer policy is allow-by-default** (decides §Task 6 step 4). Record all of this in the commit message / a note block.
- [ ] **Step 2:** add `mesh` to `tests/tags.yml` under `concerns:`:
```yaml
- mesh # NetBird agent enrollment (ADR-016)
```
- [ ] **Step 3:** `make lint` → expect `check-tags: OK` (an unused vocab entry is allowed; nothing references it yet). Expected: 0 failures.
- [ ] **Step 4:** commit `feat(base): add the 'mesh' concern tag (NetBird agent, ADR-016)`.
---
### Task 2: `base` `mesh` concern — defaults + tasks + include + README
**Files:** `roles/base/defaults/main.yml`, `roles/base/tasks/mesh.yml` (create), `roles/base/tasks/main.yml`, `roles/base/README.md`.
- [ ] **Step 1:** append the knobs to `roles/base/defaults/main.yml`:
```yaml
# NetBird mesh agent enrollment (ADR-016). Opt-in: default off so applying `base` to a
# host not (yet) on the mesh is a no-op for this concern. The live actions (apt install
# over the network, `netbird up` against the coordinator) are additionally gated by
# base__mesh_manage so Molecule can exercise the wiring without a coordinator.
base__mesh_enabled: false
base__mesh_manage: true
base__mesh_management_url: "https://netbird.askari.wingu.me"
base__mesh_setup_key: "{{ vault.netbird.setup_key }}" # noqa: var-naming[no-role-prefix] is NOT needed — this carries the base__ prefix
base__mesh_version: "0.72.4" # match the coordinator; confirmed installable in Task 1
```
- [ ] **Step 2:** create `roles/base/tasks/mesh.yml` (use the Task-1-verified repo URL/key/pin; the values below are the expected ones to confirm):
```yaml
---
# NetBird agent enrollment (ADR-016). Additive only — no firewall change here.
- name: Ensure /etc/apt/keyrings exists
ansible.builtin.file:
path: /etc/apt/keyrings
state: directory
mode: "0755"
tags: [mesh]
- name: Add the NetBird APT GPG key
ansible.builtin.get_url:
url: https://pkgs.netbird.io/debian/public.key # confirm in Task 1
dest: /etc/apt/keyrings/netbird.asc
mode: "0644"
when: base__mesh_manage | bool
tags: [mesh]
- name: Add the NetBird APT repository
ansible.builtin.apt_repository:
repo: >-
deb [signed-by=/etc/apt/keyrings/netbird.asc]
https://pkgs.netbird.io/debian stable main # confirm in Task 1
filename: netbird
state: present
when: base__mesh_manage | bool
tags: [mesh]
- name: Install the NetBird agent (pinned)
ansible.builtin.apt:
name: "netbird={{ base__mesh_version }}" # confirm pin syntax in Task 1
state: present
update_cache: true
when: base__mesh_manage | bool
tags: [mesh]
- name: Check current NetBird connection status
ansible.builtin.command: netbird status
register: _netbird_status
changed_when: false
failed_when: false
when: base__mesh_manage | bool
tags: [mesh]
- name: Enrol this host in the mesh
ansible.builtin.command: >-
netbird up
--management-url {{ base__mesh_management_url }}
--setup-key {{ base__mesh_setup_key }}
register: _netbird_up
changed_when: _netbird_up.rc == 0
when:
- base__mesh_manage | bool
- "'Management: Connected' not in (_netbird_status.stdout | default(''))" # confirm string in Task 1
no_log: true # setup key is on the argv
tags: [mesh]
```
- [ ] **Step 3:** in `roles/base/tasks/main.yml`, add the include (after the existing concerns), gated by `base__mesh_enabled`:
```yaml
- name: NetBird mesh enrollment
ansible.builtin.include_tasks:
file: mesh.yml
apply:
tags: [mesh]
when: base__mesh_enabled | bool
tags: [mesh]
```
- [ ] **Step 4:** document the concern in `roles/base/README.md` (purpose; the `base__mesh_*` knobs table; that it is additive/no-firewall; that the setup key comes from `vault.netbird.setup_key`; the `enabled`/`manage` gating).
- [ ] **Step 5:** `make lint` → 0 failures. Commit `feat(base): NetBird agent enrollment concern (mesh)`.
---
### Task 3: Molecule coverage
**Files:** `roles/base/molecule/default/converge.yml`, `roles/base/molecule/default/verify.yml`.
> The concern is install + a daemon command needing a live coordinator, so the hermetic Molecule surface is thin (the known "render-only misses the real call" gotcha). Molecule proves: (a) enabling mesh with `manage: false` does not break the base converge and is idempotent; (b) `base__mesh_enabled: false` (the default, already exercised by the existing firewall test) is a clean no-op. Full install+enrol is proven live in Task 6.
- [ ] **Step 1:** in `converge.yml` add to `vars:`:
```yaml
base__mesh_enabled: true
base__mesh_manage: false # skip network/daemon actions
base__mesh_setup_key: "dummy-molecule-key"
```
- [ ] **Step 2:** in `verify.yml` add a task asserting the concern is a clean no-op under `manage: false``netbird` is NOT installed and `wt0` does not exist (since all live actions are gated off):
```yaml
- name: Confirm mesh manage=false did not install/enrol
ansible.builtin.command: which netbird
register: _nb
changed_when: false
failed_when: false
- name: Assert netbird absent under manage=false
ansible.builtin.assert:
that:
- _nb.rc != 0
fail_msg: "netbird should not be installed when base__mesh_manage is false"
```
- [ ] **Step 3:** `make test ROLE=base` → converge + idempotence + verify pass (`failed=0`). The existing firewall assertions still pass (mesh vars don't affect them).
- [ ] **Step 4:** commit `test(base): molecule coverage for the mesh concern (manage-off no-op)`.
---
### Task 4: Vault stub + per-host opt-in
- [ ] **Step 1 (vault — needs `rbw` unlocked):** `make decrypt FILE=inventories/production/group_vars/all/vault.yml`; add under `vault.netbird` (alongside `auth_secret`/`datastore_key`):
```yaml
# Reusable, scoped (group "boma-hosts"), expiring NetBird setup key. Mint it in the
# dashboard (Setup Keys) AFTER the first-boot /setup admin exists. Consumed by the
# base 'mesh' concern. CHANGEME until the operator supplies it via `make edit-vault`.
setup_key: CHANGEME
```
`make encrypt FILE=...`; `make check-vault` → confirms structure + lists the `setup_key` CHANGEME.
- [ ] **Step 2:** set the opt-in. In `inventories/production/group_vars/control/vars.yml` add `base__mesh_enabled: true` (ubongo). Create `inventories/production/group_vars/offsite_hosts/vars.yml`:
```yaml
---
# askari is a NetBird peer as well as the coordinator host (ADR-016).
base__mesh_enabled: true
```
- [ ] **Step 3:** `make lint` → 0 failures. Commit `feat(base): vault setup_key stub + enable mesh on ubongo + askari`.
---
### Task 5: Operator handoff — first-boot admin + setup key (GATED, operator does this)
> Nothing here is automatable — the agent cannot create a dashboard admin or mint a key.
- [ ] **Step 1 (operator):** browse `https://netbird.askari.wingu.me`, complete the one-time `/setup` to create the admin user, log in.
- [ ] **Step 2 (operator):** create a **reusable** setup key, **scoped** to auto-assign peers to a `boma-hosts` group, with an **expiry**. Copy the key value.
- [ ] **Step 3 (operator):** `make edit-vault` → replace `vault.netbird.setup_key`'s `CHANGEME` with the real key → `:wq` (re-encrypts) → `make check-vault` shows no outstanding CHANGEME. The key never enters the chat.
- [ ] **Step 4:** no repo commit beyond the (already-encrypted) vault, which is unchanged on disk structure.
---
### Task 6: Enrol `ubongo` + `askari` (GATED, live — needs Task 5 done + `rbw` unlocked)
- [ ] **Step 1:** `make check PLAYBOOK=site LIMIT=askari TAGS=mesh` — review (askari is `ansible`-user managed; cleaner first target than the control node). Then `make deploy PLAYBOOK=site LIMIT=askari TAGS=mesh`.
- [ ] **Step 2:** verify on askari: `netbird status` shows `Management: Connected`; `ip link show wt0` exists. (Agent coexists with the coordinator container; it reaches the coordinator via the public URL.)
- [ ] **Step 3:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=mesh` — review. Note: ubongo is managed as `sjat` with `become: true` (same path `dev_env` used via `playbooks/workstation.yml`); confirm `sjat` sudo works (the run will prompt/fail clearly if a become password is needed). Then `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=mesh`.
- [ ] **Step 4:** verify the mesh link from ubongo: `netbird status` shows `ubongo` connected and lists `askari` as a peer; ping askari's NetBird (`100.x`) address. If the pinned NetBird is NOT allow-by-default (Task 1, Step 1), add one minimal dashboard policy permitting the admin group → `ubongo` SSH (or temporarily the default policy) so Task 7 can connect.
- [ ] **Step 5:** no repo commit (host state).
---
### Task 7: Enrol the road-warrior clients → goal lands (operator)
- [ ] **Step 1 (operator):** install the NetBird client on `mamba` + the work laptop; log in via the dashboard (Dex SSO) so they join the mesh.
- [ ] **Step 2 (operator):** from a laptop (anywhere), `ssh sjat@<ubongo-netbird-ip>` (or the mesh hostname) — connection succeeds. **← the mobile-access goal lands here.**
- [ ] **Step 3:** confirm with the operator that remote access works end-to-end.
---
### Task 8: Docs
- [ ] **Step 1:** `STATUS.md` — move "NetBird agent enrollment in `base`" to **built + applied** (ubongo + askari enrolled; reachability achieved). Note the `mesh` concern + opt-in. ubongo row: mesh-enrolled (its other base concerns still pending). askari row: NetBird peer.
- [ ] **Step 2:** `docs/ROADMAP.md`**M5 ✅ DONE**; Phase 1 (remote access) complete. Next: the **Procurement gate** (`/capacity-review` → buy cluster hardware). Record the deferred "mesh hardening" follow-on (ubongo nftables default-deny + NetBird ACL tightening + askari SSH→`wt0`).
- [ ] **Step 3:** `docs/FRICTION.md` — add a signal: a **docs-only commit still tripped the `rbw`-locked pre-commit guard** (2026-06-17), although the 2026-06-10 kaizen fix was meant to let docs-/config-only commits through without vault — the hook scoping or a blanket guard needs a look.
- [ ] **Step 4:** `make lint`; commit `docs: M5 done — Phase 1 remote access complete`.
---
## Self-Review (completed)
- **Spec coverage:** `mesh` concern (spec §1) → Tasks 13; vault stub (spec §2) → Task 4; ubongo+askari enrol (spec §3) → Tasks 4,6; laptops (spec §3) → Task 7; reachability via default policy (spec §4) → Task 6 step 4; deferred hardening (spec §6) → recorded in Task 8; operator handoff (spec) → Task 5. Testing (spec) → Task 3 (hermetic) + Task 6 (live). All covered.
- **Placeholder scan:** the "confirm in Task 1" markers are ADR-014 verification points executed in Task 1 (the repo URL/key/pin/status-string), not vague TODOs — Task 2's code carries the expected values to confirm, matching how M4a/M4b pinned versions in-plan.
- **Consistency:** `base__mesh_enabled` (opt-in) vs `base__mesh_manage` (test gate) used consistently across defaults, tasks, include, converge, and the no-op assertion; `vault.netbird.setup_key` matches between defaults, vault stub, and Task 5; `mesh` tag added (Task 1) before it is used (Task 2).
- **Risk:** the only live risk is Task 6 on the control node — mitigated because the `mesh` concern makes **no firewall change** (SSH stays open on all paths), askari is enrolled first as the lower-risk rehearsal, and the host nftables lockdown is explicitly out of scope.