boma/docs/superpowers/plans/2026-06-17-m5-mesh-enrollment.md
sjat 55776fb03c docs(plan): M5 mesh-enrollment implementation plan
8 tasks: build the base 'mesh' concern + tag + vault stub + per-host opt-in
(autonomous), operator handoff for /setup + setup key, gated live enrol of
ubongo + askari, operator laptop enrol, docs. Reachability-only; lockdown deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:49:28 +02:00


M5 — Mesh enrollment (NetBird agents) Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax.

Goal: ubongo reachable from anywhere over the NetBird mesh — enrol NetBird agents on ubongo + askari via a new opt-in base mesh concern; the operator enrols the laptops.

Architecture: A new base concern (roles/base/tasks/mesh.yml) installs a pinned NetBird agent and runs netbird up with a reusable scoped setup key from vault. Gated by base__mesh_enabled (per-host opt-in) and base__mesh_manage (skips network/daemon actions for Molecule). No firewall change — enrollment is additive (wt0 comes up, SSH keeps listening), so there is zero lockout risk. The host nftables default-deny + NetBird ACL tightening are a separate, deferred follow-on.

Tech Stack: NetBird agent (apt, pinned), Ansible (base role), Molecule, the M4b coordinator at https://netbird.askari.wingu.me.

Spec: docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md

Execution context: Tasks 1-4 author + commit (need nothing from the operator). Task 5 is an operator handoff (dashboard /setup + mint key). Task 6 applies live to ubongo + askari (gated). Task 7 is operator-only (laptops). Task 8 is docs.


File structure

| File | Change | Responsibility |
| --- | --- | --- |
| tests/tags.yml | modify | add the mesh concern to the closed tag vocabulary |
| roles/base/defaults/main.yml | modify | base__mesh_* knobs |
| roles/base/tasks/mesh.yml | create | the enrollment concern (install + netbird up) |
| roles/base/tasks/main.yml | modify | include mesh.yml (gated, tagged) |
| roles/base/README.md | modify | document the mesh concern + knobs |
| roles/base/molecule/default/converge.yml | modify | enable mesh (manage off) + dummy key |
| roles/base/molecule/default/verify.yml | modify | assert mesh wiring / no-op |
| inventories/production/group_vars/control/vars.yml | modify | base__mesh_enabled: true (ubongo) |
| inventories/production/group_vars/offsite_hosts/vars.yml | create | base__mesh_enabled: true (askari) |
| inventories/production/group_vars/all/vault.yml | modify (vault) | vault.netbird.setup_key: CHANGEME |
| STATUS.md, docs/ROADMAP.md, docs/FRICTION.md | modify | M5 done; deferred hardening; friction note |

Task 1: Verify + pin the NetBird agent; add the mesh tag

  • Step 1 (ADR-014 verification — record the answers): confirm against current NetBird docs/repo (WebFetch docs.netbird.io, pkgs.netbird.io):
    • the apt repo URL + signing-key URL + suite/component (the install-script publishes an apt source — capture the exact deb line and key URL);
    • the package name (headless agent — expected netbird) and that version 0.72.4 (matching the coordinator) is installable, plus the apt version-pin syntax;
    • the exact netbird status output string that indicates an established management connection (for the idempotency guard — e.g. Management: Connected);
    • the netbird up flags (--management-url, --setup-key);
    • whether the pinned NetBird's default peer policy is allow-by-default (decides §Task 6 step 4). Record all of this in the commit message / a note block.
  • Step 2: add mesh to tests/tags.yml under concerns::
  - mesh         # NetBird agent enrollment (ADR-016)
  • Step 3: make lint → expect check-tags: OK (an unused vocab entry is allowed; nothing references it yet). Expected: 0 failures.
  • Step 4: commit feat(base): add the 'mesh' concern tag (NetBird agent, ADR-016).
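
The "is 0.72.4 installable" check in Step 1 could also be scripted. The play below is only a sketch: the host name `scratch` and the variable `netbird_pin` are hypothetical, and it assumes the NetBird apt source has already been added by hand on that host. It lists the versions apt offers and asserts that the pin is among them.

```yaml
---
# Hypothetical one-off check; not part of this plan's file list.
# Assumes the NetBird apt source is already configured on the target.
- name: Check that the pinned NetBird version is installable
  hosts: scratch            # hypothetical inventory name for a throwaway host
  become: true
  gather_facts: false
  vars:
    netbird_pin: "0.72.4"   # the version this plan expects to pin
  tasks:
    - name: Refresh the apt cache
      ansible.builtin.apt:
        update_cache: true

    - name: List the netbird versions the configured sources offer
      ansible.builtin.command: apt-cache madison netbird
      register: _netbird_madison
      changed_when: false

    - name: Assert the pinned version is among them
      ansible.builtin.assert:
        that:
          - netbird_pin in _netbird_madison.stdout
        fail_msg: "netbird {{ netbird_pin }} is not offered by the configured apt sources"
```

The substring match is deliberately loose; if the repo publishes versions with a suffix or epoch, the exact string to pin is whatever `apt-cache madison` prints in its second column.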

Task 2: base mesh concern — defaults + tasks + include + README

Files: roles/base/defaults/main.yml, roles/base/tasks/mesh.yml (create), roles/base/tasks/main.yml, roles/base/README.md.

  • Step 1: append the knobs to roles/base/defaults/main.yml:
# NetBird mesh agent enrollment (ADR-016). Opt-in: default off so applying `base` to a
# host not (yet) on the mesh is a no-op for this concern. The live actions (apt install
# over the network, `netbird up` against the coordinator) are additionally gated by
# base__mesh_manage so Molecule can exercise the wiring without a coordinator.
base__mesh_enabled: false
base__mesh_manage: true
base__mesh_management_url: "https://netbird.askari.wingu.me"
base__mesh_setup_key: "{{ vault.netbird.setup_key }}"   # carries the base__ prefix, so no var-naming[no-role-prefix] noqa is needed
base__mesh_version: "0.72.4"   # match the coordinator; confirmed installable in Task 1
  • Step 2: create roles/base/tasks/mesh.yml (use the Task-1-verified repo URL/key/pin; the values below are the expected ones to confirm):
---
# NetBird agent enrollment (ADR-016). Additive only — no firewall change here.
- name: Ensure /etc/apt/keyrings exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"
  tags: [mesh]

- name: Add the NetBird APT GPG key
  ansible.builtin.get_url:
    url: https://pkgs.netbird.io/debian/public.key          # confirm in Task 1
    dest: /etc/apt/keyrings/netbird.asc
    mode: "0644"
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Add the NetBird APT repository
  ansible.builtin.apt_repository:
    repo: >-
      deb [signed-by=/etc/apt/keyrings/netbird.asc]
      https://pkgs.netbird.io/debian stable main             # confirm in Task 1
    filename: netbird
    state: present
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Install the NetBird agent (pinned)
  ansible.builtin.apt:
    name: "netbird={{ base__mesh_version }}"                 # confirm pin syntax in Task 1
    state: present
    update_cache: true
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Check current NetBird connection status
  ansible.builtin.command: netbird status
  register: _netbird_status
  changed_when: false
  failed_when: false
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Enrol this host in the mesh
  ansible.builtin.command: >-
    netbird up
    --management-url {{ base__mesh_management_url }}
    --setup-key {{ base__mesh_setup_key }}
  register: _netbird_up
  changed_when: _netbird_up.rc == 0
  when:
    - base__mesh_manage | bool
    - "'Management: Connected' not in (_netbird_status.stdout | default(''))"   # confirm string in Task 1
  no_log: true   # setup key is on the argv
  tags: [mesh]
  • Step 3: in roles/base/tasks/main.yml, add the include (after the existing concerns), gated by base__mesh_enabled:
- name: NetBird mesh enrollment
  ansible.builtin.include_tasks:
    file: mesh.yml
    apply:
      tags: [mesh]
  when: base__mesh_enabled | bool
  tags: [mesh]
  • Step 4: document the concern in roles/base/README.md (purpose; the base__mesh_* knobs table; that it is additive/no-firewall; that the setup key comes from vault.netbird.setup_key; the enabled/manage gating).
  • Step 5: make lint → 0 failures. Commit feat(base): NetBird agent enrollment concern (mesh).
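
One thing the pinned install does not cover by itself: `state: present` with `netbird=<version>` installs that version, but a later unattended `apt upgrade` on the host could still move the package. A possible complement, shown here purely as an optional sketch and not as part of the spec, is to hold the package after installing it. The task name and its placement after the install task are assumptions.

```yaml
# Optional sketch: would sit after "Install the NetBird agent (pinned)" in mesh.yml.
- name: Hold the NetBird agent at the pinned version
  ansible.builtin.dpkg_selections:
    name: netbird
    selection: hold
  when: base__mesh_manage | bool
  tags: [mesh]
```

The trade-off is that a deliberate bump of base__mesh_version then has to deal with the hold, either by releasing it first or by letting the apt task change held packages, so this is only worth adding if drift from the coordinator's version is a real concern.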

Task 3: Molecule coverage

Files: roles/base/molecule/default/converge.yml, roles/base/molecule/default/verify.yml.

The concern is install + a daemon command needing a live coordinator, so the hermetic Molecule surface is thin (the known "render-only misses the real call" gotcha). Molecule proves: (a) enabling mesh with manage: false does not break the base converge and is idempotent; (b) base__mesh_enabled: false (the default, already exercised by the existing firewall test) is a clean no-op. Full install+enrol is proven live in Task 6.

  • Step 1: in converge.yml add to vars::
    base__mesh_enabled: true
    base__mesh_manage: false          # skip network/daemon actions
    base__mesh_setup_key: "dummy-molecule-key"
  • Step 2: in verify.yml add a task asserting the concern is a clean no-op under manage: false, i.e. netbird is NOT installed and wt0 does not exist (since all live actions are gated off):
    - name: Confirm mesh manage=false did not install/enrol
      ansible.builtin.command: which netbird
      register: _nb
      changed_when: false
      failed_when: false
    - name: Assert netbird absent under manage=false
      ansible.builtin.assert:
        that:
          - _nb.rc != 0
        fail_msg: "netbird should not be installed when base__mesh_manage is false"
  • Step 3: make test ROLE=base → converge + idempotence + verify pass (failed=0). The existing firewall assertions still pass (mesh vars don't affect them).
  • Step 4: commit test(base): molecule coverage for the mesh concern (manage-off no-op).
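
The verify tasks in Step 2 check only that the netbird binary is absent. If the "wt0 does not exist" half of the claim should be asserted as well, something along these lines could be appended to verify.yml. This is a sketch: the register name `_wt0` is illustrative, and it assumes the Molecule instance exposes /sys/class/net.

```yaml
    # Sketch only: complements the netbird-absent assertion in verify.yml.
    - name: Look for the NetBird interface
      ansible.builtin.stat:
        path: /sys/class/net/wt0
      register: _wt0

    - name: Assert wt0 absent under manage=false
      ansible.builtin.assert:
        that:
          - not _wt0.stat.exists
        fail_msg: "wt0 should not exist when base__mesh_manage is false"
```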

Task 4: Vault stub + per-host opt-in

  • Step 1 (vault — needs rbw unlocked): make decrypt FILE=inventories/production/group_vars/all/vault.yml; add under vault.netbird (alongside auth_secret/datastore_key):
    # Reusable, scoped (group "boma-hosts"), expiring NetBird setup key. Mint it in the
    # dashboard (Setup Keys) AFTER the first-boot /setup admin exists. Consumed by the
    # base 'mesh' concern. CHANGEME until the operator supplies it via `make edit-vault`.
    setup_key: CHANGEME

make encrypt FILE=...; make check-vault → confirms structure + lists the setup_key CHANGEME.

  • Step 2: set the opt-in. In inventories/production/group_vars/control/vars.yml add base__mesh_enabled: true (ubongo). Create inventories/production/group_vars/offsite_hosts/vars.yml:
---
# askari is a NetBird peer as well as the coordinator host (ADR-016).
base__mesh_enabled: true
  • Step 3: make lint → 0 failures. Commit feat(base): vault setup_key stub + enable mesh on ubongo + askari.
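
Because this task ships a CHANGEME placeholder while the opt-in is already set to true, a live run before Task 5 would hand the placeholder to `netbird up`. One way to make that fail early and clearly is a guard at the top of mesh.yml. The task below is a hypothetical addition, not something the spec asks for.

```yaml
# Hypothetical guard for the top of roles/base/tasks/mesh.yml.
- name: Refuse to enrol with the placeholder setup key
  ansible.builtin.assert:
    that:
      - base__mesh_setup_key | length > 0
      - base__mesh_setup_key != 'CHANGEME'
    fail_msg: "vault.netbird.setup_key is still a placeholder; set it with `make edit-vault`"
    quiet: true
  when: base__mesh_manage | bool
  tags: [mesh]
```

Gating on base__mesh_manage keeps Molecule unaffected: with manage off the guard is skipped, and the dummy key from converge.yml would pass it anyway.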

Task 5: Operator handoff — first-boot admin + setup key (GATED, operator does this)

Nothing here is automatable — the agent cannot create a dashboard admin or mint a key.

  • Step 1 (operator): browse https://netbird.askari.wingu.me, complete the one-time /setup to create the admin user, log in.
  • Step 2 (operator): create a reusable setup key, scoped to auto-assign peers to a boma-hosts group, with an expiry. Copy the key value.
  • Step 3 (operator): make edit-vault → replace vault.netbird.setup_key's CHANGEME with the real key → :wq (re-encrypts) → make check-vault shows no outstanding CHANGEME. The key never enters the chat.
  • Step 4: no repo commit beyond the re-encrypted vault file; its structure on disk is unchanged (only the setup_key value differs).

Task 6: Enrol ubongo + askari (GATED, live — needs Task 5 done + rbw unlocked)

  • Step 1: make check PLAYBOOK=site LIMIT=askari TAGS=mesh — review (askari is ansible-user managed; cleaner first target than the control node). Then make deploy PLAYBOOK=site LIMIT=askari TAGS=mesh.
  • Step 2: verify on askari: netbird status shows Management: Connected; ip link show wt0 exists. (Agent coexists with the coordinator container; it reaches the coordinator via the public URL.)
  • Step 3: make check PLAYBOOK=site LIMIT=ubongo TAGS=mesh — review. Note: ubongo is managed as sjat with become: true (same path dev_env used via playbooks/workstation.yml); confirm sjat sudo works (the run will prompt/fail clearly if a become password is needed). Then make deploy PLAYBOOK=site LIMIT=ubongo TAGS=mesh.
  • Step 4: verify the mesh link from ubongo: netbird status shows ubongo connected and lists askari as a peer; ping askari's NetBird (100.x) address. If the pinned NetBird is NOT allow-by-default (Task 1, Step 1), add one minimal dashboard policy permitting the admin group → ubongo SSH (or temporarily the default policy) so Task 7 can connect.
  • Step 5: no repo commit (host state).
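
The manual checks in Steps 2 and 4 could be captured as a small ad-hoc play so both hosts are verified the same way. This is a sketch under assumptions: the play itself is hypothetical, and the status string is the expected one that Task 1 still has to confirm.

```yaml
---
# Sketch of an ad-hoc verification play; not part of this plan's file list.
- name: Verify mesh enrollment
  hosts: askari:ubongo
  become: true
  gather_facts: false
  tasks:
    - name: Read NetBird status
      ansible.builtin.command: netbird status
      register: _nb_status
      changed_when: false

    - name: Look for the NetBird interface
      ansible.builtin.stat:
        path: /sys/class/net/wt0
      register: _wt0

    - name: Assert the agent is connected and wt0 exists
      ansible.builtin.assert:
        that:
          - "'Management: Connected' in _nb_status.stdout"   # string to be confirmed in Task 1
          - _wt0.stat.exists
        fail_msg: "mesh enrollment looks incomplete on {{ inventory_hostname }}"
```

The peer listing and the ping to askari's NetBird address from Step 4 are left as manual checks here, since the address is only known after enrollment.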

Task 7: Enrol the road-warrior clients → goal lands (operator)

  • Step 1 (operator): install the NetBird client on mamba + the work laptop; log in via the dashboard (Dex SSO) so they join the mesh.
  • Step 2 (operator): from a laptop (anywhere), ssh sjat@<ubongo-netbird-ip> (or the mesh hostname) — connection succeeds. ← the mobile-access goal lands here.
  • Step 3: confirm with the operator that remote access works end-to-end.

Task 8: Docs

  • Step 1: STATUS.md — move "NetBird agent enrollment in base" to built + applied (ubongo + askari enrolled; reachability achieved). Note the mesh concern + opt-in. ubongo row: mesh-enrolled (its other base concerns still pending). askari row: NetBird peer.
  • Step 2: docs/ROADMAP.md: M5 DONE; Phase 1 (remote access) complete. Next: the Procurement gate (/capacity-review → buy cluster hardware). Record the deferred "mesh hardening" follow-on (ubongo nftables default-deny + NetBird ACL tightening + askari SSH→wt0).
  • Step 3: docs/FRICTION.md — add a signal: a docs-only commit still tripped the rbw-locked pre-commit guard (2026-06-17), although the 2026-06-10 kaizen fix was meant to let docs-/config-only commits through without vault — the hook scoping or a blanket guard needs a look.
  • Step 4: make lint; commit docs: M5 done — Phase 1 remote access complete.

Self-Review (completed)

  • Spec coverage: mesh concern (spec §1) → Tasks 1-3; vault stub (spec §2) → Task 4; ubongo+askari enrol (spec §3) → Tasks 4, 6; laptops (spec §3) → Task 7; reachability via default policy (spec §4) → Task 6 step 4; deferred hardening (spec §6) → recorded in Task 8; operator handoff (spec) → Task 5. Testing (spec) → Task 3 (hermetic) + Task 6 (live). All covered.
  • Placeholder scan: the "confirm in Task 1" markers are ADR-014 verification points executed in Task 1 (the repo URL/key/pin/status-string), not vague TODOs — Task 2's code carries the expected values to confirm, matching how M4a/M4b pinned versions in-plan.
  • Consistency: base__mesh_enabled (opt-in) vs base__mesh_manage (test gate) used consistently across defaults, tasks, include, converge, and the no-op assertion; vault.netbird.setup_key matches between defaults, vault stub, and Task 5; mesh tag added (Task 1) before it is used (Task 2).
  • Risk: the only live risk is Task 6 on the control node — mitigated because the mesh concern makes no firewall change (SSH stays open on all paths), askari is enrolled first as the lower-risk rehearsal, and the host nftables lockdown is explicitly out of scope.