diff --git a/docs/superpowers/plans/2026-06-17-m5-mesh-enrollment.md b/docs/superpowers/plans/2026-06-17-m5-mesh-enrollment.md
new file mode 100644
index 0000000..ec80bb1
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-17-m5-mesh-enrollment.md
@@ -0,0 +1,234 @@
+# M5 — Mesh enrollment (NetBird agents) Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax.
+
+**Goal:** `ubongo` reachable from anywhere over the NetBird mesh — enrol NetBird agents on `ubongo` + `askari` via a new opt-in `base` `mesh` concern; the operator enrols the laptops.
+
+**Architecture:** A new `base` concern (`roles/base/tasks/mesh.yml`) installs a pinned NetBird agent and runs `netbird up` with a reusable scoped setup key from vault. Gated by `base__mesh_enabled` (per-host opt-in) and `base__mesh_manage` (skips network/daemon actions for Molecule). **No firewall change** — enrollment is additive (`wt0` comes up, SSH keeps listening), so there is zero lockout risk. The host nftables default-deny + NetBird ACL tightening are a separate, deferred follow-on.
+
+**Tech Stack:** NetBird agent (apt, pinned), Ansible (`base` role), Molecule, the M4b coordinator at `https://netbird.askari.wingu.me`.
+
+**Spec:** `docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md`
+
+**Execution context:** Tasks 1–4 author + commit (need nothing from the operator). **Task 5 is an operator handoff** (dashboard `/setup` + mint key). **Task 6 applies live to `ubongo` + `askari`** (gated). Task 7 is operator-only (laptops). Task 8 docs.
+
+---
+
+## File structure
+
+| File | Change | Responsibility |
+|---|---|---|
+| `tests/tags.yml` | modify | add the `mesh` concern to the closed tag vocabulary |
+| `roles/base/defaults/main.yml` | modify | `base__mesh_*` knobs |
+| `roles/base/tasks/mesh.yml` | **create** | the enrollment concern (install + `netbird up`) |
+| `roles/base/tasks/main.yml` | modify | include `mesh.yml` (gated, tagged) |
+| `roles/base/README.md` | modify | document the `mesh` concern + knobs |
+| `roles/base/molecule/default/converge.yml` | modify | enable mesh (manage off) + dummy key |
+| `roles/base/molecule/default/verify.yml` | modify | assert mesh wiring / no-op |
+| `inventories/production/group_vars/control/vars.yml` | modify | `base__mesh_enabled: true` (ubongo) |
+| `inventories/production/group_vars/offsite_hosts/vars.yml` | **create** | `base__mesh_enabled: true` (askari) |
+| `inventories/production/group_vars/all/vault.yml` | modify (vault) | `vault.netbird.setup_key: CHANGEME` |
+| `STATUS.md`, `docs/ROADMAP.md`, `docs/FRICTION.md` | modify | M5 done; deferred hardening; friction note |
+
+---
+
+### Task 1: Verify + pin the NetBird agent; add the `mesh` tag
+
+- [ ] **Step 1 (ADR-014 verification — record the answers):** confirm against current NetBird docs/repo (WebFetch `docs.netbird.io`, `pkgs.netbird.io`):
+  - the **apt repo** URL + signing-key URL + suite/component (the install-script publishes an apt source — capture the exact `deb` line and key URL);
+  - the **package name** (headless agent — expected `netbird`) and that **version `0.72.4`** (matching the coordinator) is installable, plus the apt **version-pin syntax**;
+  - the exact **`netbird status`** output string that indicates an established management connection (for the idempotency guard — e.g. `Management: Connected`);
+  - the **`netbird up`** flags (`--management-url`, `--setup-key`);
+  - whether the pinned NetBird's **default peer policy is allow-by-default** (decides §Task 6 step 4). Record all of this in the commit message / a note block.
+- [ ] **Step 2:** add `mesh` to `tests/tags.yml` under `concerns:`:
+```yaml
+  - mesh  # NetBird agent enrollment (ADR-016)
+```
+- [ ] **Step 3:** `make lint` → expect `check-tags: OK` and 0 failures (an unused vocab entry is allowed; nothing references it yet).
+- [ ] **Step 4:** commit `feat(base): add the 'mesh' concern tag (NetBird agent, ADR-016)`.
+
+---
+
+### Task 2: `base` `mesh` concern — defaults + tasks + include + README
+
+**Files:** `roles/base/defaults/main.yml`, `roles/base/tasks/mesh.yml` (create), `roles/base/tasks/main.yml`, `roles/base/README.md`.
+
+- [ ] **Step 1:** append the knobs to `roles/base/defaults/main.yml`:
+```yaml
+# NetBird mesh agent enrollment (ADR-016). Opt-in: default off so applying `base` to a
+# host not (yet) on the mesh is a no-op for this concern. The live actions (apt install
+# over the network, `netbird up` against the coordinator) are additionally gated by
+# base__mesh_manage so Molecule can exercise the wiring without a coordinator.
+base__mesh_enabled: false
+base__mesh_manage: true
+base__mesh_management_url: "https://netbird.askari.wingu.me"
+base__mesh_setup_key: "{{ vault.netbird.setup_key }}"  # from vault; carries the base__ prefix, so no var-naming lint skip is needed
+base__mesh_version: "0.72.4"  # match the coordinator; confirmed installable in Task 1
+```
+- [ ] **Step 2:** create `roles/base/tasks/mesh.yml` (use the Task-1-verified repo URL/key/pin; the values below are the expected ones to confirm):
+```yaml
+---
+# NetBird agent enrollment (ADR-016). Additive only — no firewall change here.
+- name: Ensure /etc/apt/keyrings exists
+  ansible.builtin.file:
+    path: /etc/apt/keyrings
+    state: directory
+    mode: "0755"
+  tags: [mesh]
+
+- name: Add the NetBird APT GPG key
+  ansible.builtin.get_url:
+    url: https://pkgs.netbird.io/debian/public.key  # confirm in Task 1
+    dest: /etc/apt/keyrings/netbird.asc
+    mode: "0644"
+  when: base__mesh_manage | bool
+  tags: [mesh]
+
+- name: Add the NetBird APT repository
+  ansible.builtin.apt_repository:
+    repo: >-  # confirm in Task 1 (comment kept on the header so it stays out of the folded string)
+      deb [signed-by=/etc/apt/keyrings/netbird.asc]
+      https://pkgs.netbird.io/debian stable main
+    filename: netbird
+    state: present
+  when: base__mesh_manage | bool
+  tags: [mesh]
+
+- name: Install the NetBird agent (pinned)
+  ansible.builtin.apt:
+    name: "netbird={{ base__mesh_version }}"  # confirm pin syntax in Task 1
+    state: present
+    update_cache: true
+  when: base__mesh_manage | bool
+  tags: [mesh]
+
+- name: Check current NetBird connection status
+  ansible.builtin.command: netbird status
+  register: _netbird_status
+  changed_when: false
+  failed_when: false
+  when: base__mesh_manage | bool
+  tags: [mesh]
+
+- name: Enrol this host in the mesh
+  ansible.builtin.command: >-
+    netbird up
+    --management-url {{ base__mesh_management_url }}
+    --setup-key {{ base__mesh_setup_key }}
+  register: _netbird_up
+  changed_when: _netbird_up.rc == 0
+  when:
+    - base__mesh_manage | bool
+    - "'Management: Connected' not in (_netbird_status.stdout | default(''))"  # confirm string in Task 1
+  no_log: true  # setup key is on the argv
+  tags: [mesh]
+```
+- [ ] **Step 3:** in `roles/base/tasks/main.yml`, add the include (after the existing concerns), gated by `base__mesh_enabled`:
+```yaml
+- name: NetBird mesh enrollment
+  ansible.builtin.include_tasks:
+    file: mesh.yml
+    apply:
+      tags: [mesh]
+  when: base__mesh_enabled | bool
+  tags: [mesh]
+```
+- [ ] **Step 4:** document the concern in `roles/base/README.md` (purpose; the `base__mesh_*` knobs table; that it is additive/no-firewall; that the setup key comes from `vault.netbird.setup_key`; the `enabled`/`manage` gating).
+- [ ] **Step 5:** `make lint` → 0 failures. Commit `feat(base): NetBird agent enrollment concern (mesh)`.
+
+---
+
+### Task 3: Molecule coverage
+
+**Files:** `roles/base/molecule/default/converge.yml`, `roles/base/molecule/default/verify.yml`.
+
+> The concern is install + a daemon command needing a live coordinator, so the hermetic Molecule surface is thin (the known "render-only misses the real call" gotcha). Molecule proves: (a) enabling mesh with `manage: false` does not break the base converge and is idempotent; (b) `base__mesh_enabled: false` (the default, already exercised by the existing firewall test) is a clean no-op. Full install+enrol is proven live in Task 6.
+
+- [ ] **Step 1:** in `converge.yml` add to `vars:`:
+```yaml
+    base__mesh_enabled: true
+    base__mesh_manage: false  # skip network/daemon actions
+    base__mesh_setup_key: "dummy-molecule-key"
+```
+- [ ] **Step 2:** in `verify.yml` add a task asserting the concern is a clean no-op under `manage: false` — `netbird` is NOT installed and `wt0` does not exist (since all live actions are gated off):
+```yaml
+    - name: Confirm mesh manage=false did not install/enrol
+      ansible.builtin.command: which netbird
+      register: _nb
+      changed_when: false
+      failed_when: false
+    - name: Assert netbird absent under manage=false
+      ansible.builtin.assert:
+        that:
+          - _nb.rc != 0
+        fail_msg: "netbird should not be installed when base__mesh_manage is false"
+```
+- [ ] **Step 3:** `make test ROLE=base` → converge + idempotence + verify pass (`failed=0`). The existing firewall assertions still pass (mesh vars don't affect them).
+- [ ] **Step 4:** commit `test(base): molecule coverage for the mesh concern (manage-off no-op)`.
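Task 3 Step 2 names two conditions (no `netbird` binary and no `wt0` interface), while the assertion shown there checks only the binary. A minimal sketch of the second check, assuming the NetBird interface would appear as `/sys/class/net/wt0` on a Linux host; the register name `_wt0` is illustrative and not part of the plan:

```yaml
# Sketch only: extra verify.yml tasks; indent to match the surrounding task list.
- name: Look for the NetBird interface
  ansible.builtin.stat:
    path: /sys/class/net/wt0  # assumed location of the interface entry
  register: _wt0

- name: Assert wt0 absent under manage=false
  ansible.builtin.assert:
    that:
      - not _wt0.stat.exists
    fail_msg: "wt0 should not exist when base__mesh_manage is false"
```

Using `stat` rather than `ip link show wt0` avoids depending on `iproute2` being present in the Molecule image.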
+
+---
+
+### Task 4: Vault stub + per-host opt-in
+
+- [ ] **Step 1 (vault — needs `rbw` unlocked):** `make decrypt FILE=inventories/production/group_vars/all/vault.yml`; add under `vault.netbird` (alongside `auth_secret`/`datastore_key`):
+```yaml
+    # Reusable, scoped (group "boma-hosts"), expiring NetBird setup key. Mint it in the
+    # dashboard (Setup Keys) AFTER the first-boot /setup admin exists. Consumed by the
+    # base 'mesh' concern. CHANGEME until the operator supplies it via `make edit-vault`.
+    setup_key: CHANGEME
+```
+`make encrypt FILE=...`; `make check-vault` → confirms structure + lists the `setup_key` CHANGEME.
+- [ ] **Step 2:** set the opt-in. In `inventories/production/group_vars/control/vars.yml` add `base__mesh_enabled: true` (ubongo). Create `inventories/production/group_vars/offsite_hosts/vars.yml`:
+```yaml
+---
+# askari is a NetBird peer as well as the coordinator host (ADR-016).
+base__mesh_enabled: true
+```
+- [ ] **Step 3:** `make lint` → 0 failures. Commit `feat(base): vault setup_key stub + enable mesh on ubongo + askari`.
+
+---
+
+### Task 5: Operator handoff — first-boot admin + setup key (GATED, operator does this)
+
+> Nothing here is automatable — the agent cannot create a dashboard admin or mint a key.
+
+- [ ] **Step 1 (operator):** browse `https://netbird.askari.wingu.me`, complete the one-time `/setup` to create the admin user, log in.
+- [ ] **Step 2 (operator):** create a **reusable** setup key, **scoped** to auto-assign peers to a `boma-hosts` group, with an **expiry**. Copy the key value.
+- [ ] **Step 3 (operator):** `make edit-vault` → replace `vault.netbird.setup_key`'s `CHANGEME` with the real key → `:wq` (re-encrypts) → `make check-vault` shows no outstanding CHANGEME. The key never enters the chat.
+- [ ] **Step 4:** no repo commit beyond the vault file itself, which stays encrypted and keeps the same structure (only the `setup_key` value changes).
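Task 5 relies on `make check-vault` to show that no CHANGEME remains. The same condition could also be checked at run time, so a deploy that still carries the stub fails before `netbird up` is attempted. A minimal sketch, assuming it is placed at the top of `roles/base/tasks/mesh.yml`; this task is an illustration and not part of the plan above:

```yaml
# Sketch only: pre-flight guard against the vault stub value.
- name: Refuse to enrol with the vault stub key
  ansible.builtin.assert:
    that:
      - base__mesh_setup_key | length > 0
      - base__mesh_setup_key != 'CHANGEME'
    fail_msg: "vault.netbird.setup_key is unset or still CHANGEME (see Task 5)"
    quiet: true
  when: base__mesh_manage | bool
  tags: [mesh]
```

The conditions reference the variable by name only, so the key value itself should not appear in the task output; the Molecule dummy key passes the guard unchanged.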
+
+---
+
+### Task 6: Enrol `ubongo` + `askari` (GATED, live — needs Task 5 done + `rbw` unlocked)
+
+- [ ] **Step 1:** `make check PLAYBOOK=site LIMIT=askari TAGS=mesh` — review (askari is `ansible`-user managed; cleaner first target than the control node). Then `make deploy PLAYBOOK=site LIMIT=askari TAGS=mesh`.
+- [ ] **Step 2:** verify on askari: `netbird status` shows `Management: Connected`; `ip link show wt0` exists. (Agent coexists with the coordinator container; it reaches the coordinator via the public URL.)
+- [ ] **Step 3:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=mesh` — review. Note: ubongo is managed as `sjat` with `become: true` (same path `dev_env` used via `playbooks/workstation.yml`); confirm `sjat` sudo works (the run will prompt/fail clearly if a become password is needed). Then `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=mesh`.
+- [ ] **Step 4:** verify the mesh link from ubongo: `netbird status` shows `ubongo` connected and lists `askari` as a peer; ping askari's NetBird (`100.x`) address. If the pinned NetBird is NOT allow-by-default (Task 1, Step 1), add one minimal dashboard policy permitting the admin group → `ubongo` SSH (or temporarily the default policy) so Task 7 can connect.
+- [ ] **Step 5:** no repo commit (host state).
+
+---
+
+### Task 7: Enrol the road-warrior clients → goal lands (operator)
+
+- [ ] **Step 1 (operator):** install the NetBird client on `mamba` + the work laptop; log in via the dashboard (Dex SSO) so they join the mesh.
+- [ ] **Step 2 (operator):** from a laptop (anywhere), `ssh sjat@` (or the mesh hostname) — connection succeeds. **← the mobile-access goal lands here.**
+- [ ] **Step 3:** confirm with the operator that remote access works end-to-end.
+
+---
+
+### Task 8: Docs
+
+- [ ] **Step 1:** `STATUS.md` — move "NetBird agent enrollment in `base`" to **built + applied** (ubongo + askari enrolled; reachability achieved). Note the `mesh` concern + opt-in. ubongo row: mesh-enrolled (its other base concerns still pending). askari row: NetBird peer.
+- [ ] **Step 2:** `docs/ROADMAP.md` — **M5 ✅ DONE**; Phase 1 (remote access) complete. Next: the **Procurement gate** (`/capacity-review` → buy cluster hardware). Record the deferred "mesh hardening" follow-on (ubongo nftables default-deny + NetBird ACL tightening + askari SSH→`wt0`).
+- [ ] **Step 3:** `docs/FRICTION.md` — add a signal: a **docs-only commit still tripped the `rbw`-locked pre-commit guard** (2026-06-17), although the 2026-06-10 kaizen fix was meant to let docs-/config-only commits through without vault — the hook scoping or a blanket guard needs a look.
+- [ ] **Step 4:** `make lint`; commit `docs: M5 done — Phase 1 remote access complete`.
+
+---
+
+## Self-Review (completed)
+
+- **Spec coverage:** `mesh` concern (spec §1) → Tasks 1–3; vault stub (spec §2) → Task 4; ubongo+askari enrol (spec §3) → Tasks 4,6; laptops (spec §3) → Task 7; reachability via default policy (spec §4) → Task 6 step 4; deferred hardening (spec §6) → recorded in Task 8; operator handoff (spec) → Task 5. Testing (spec) → Task 3 (hermetic) + Task 6 (live). All covered.
+- **Placeholder scan:** the "confirm in Task 1" markers are ADR-014 verification points executed in Task 1 (the repo URL/key/pin/status-string), not vague TODOs — Task 2's code carries the expected values to confirm, matching how M4a/M4b pinned versions in-plan.
+- **Consistency:** `base__mesh_enabled` (opt-in) vs `base__mesh_manage` (test gate) used consistently across defaults, tasks, include, converge, and the no-op assertion; `vault.netbird.setup_key` matches between defaults, vault stub, and Task 5; `mesh` tag added (Task 1) before it is used (Task 2).
+- **Risk:** the only live risk is Task 6 on the control node — mitigated because the `mesh` concern makes **no firewall change** (SSH stays open on all paths), askari is enrolled first as the lower-risk rehearsal, and the host nftables lockdown is explicitly out of scope.
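The manual checks in Task 6 and the no-lockout claim in the Risk note could be captured as a small verification play. A sketch under stated assumptions: `Management: Connected` is the status string still to be confirmed in Task 1, sshd is assumed to listen on port 22, and the play, its `hosts` pattern, and the register names are illustrative rather than part of the plan:

```yaml
# Sketch only: post-enrol checks for the two enrolled hosts.
- name: Post-enrol checks (sketch)
  hosts: "ubongo:askari"
  become: true
  gather_facts: false
  tasks:
    - name: Read agent status
      ansible.builtin.command: netbird status
      register: _nb_status
      changed_when: false

    - name: Assert the management connection is up
      ansible.builtin.assert:
        that:
          - "'Management: Connected' in _nb_status.stdout"  # string to confirm in Task 1

    - name: Assert the wt0 interface exists
      ansible.builtin.stat:
        path: /sys/class/net/wt0
      register: _wt0
      failed_when: not _wt0.stat.exists

    - name: Confirm SSH is still listening locally
      ansible.builtin.wait_for:
        port: 22  # assumed sshd port
        timeout: 5
```

Running it read-only after each `make deploy` would give the same evidence as Steps 2 and 4 of Task 6 without changing host state.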