docs(plan): M5 mesh-enrollment implementation plan

8 tasks: build the base 'mesh' concern + tag + vault stub + per-host opt-in (autonomous), operator handoff for /setup + setup key, gated live enrol of ubongo + askari, operator laptop enrol, docs. Reachability-only; lockdown deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 4142bb15f8
commit 55776fb03c
1 changed file with 234 additions and 0 deletions

docs/superpowers/plans/2026-06-17-m5-mesh-enrollment.md
@@ -0,0 +1,234 @@
# M5: Mesh enrollment (NetBird agents) Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax.

**Goal:** `ubongo` reachable from anywhere over the NetBird mesh: enrol NetBird agents on `ubongo` + `askari` via a new opt-in `base` `mesh` concern; the operator enrols the laptops.

**Architecture:** A new `base` concern (`roles/base/tasks/mesh.yml`) installs a pinned NetBird agent and runs `netbird up` with a reusable scoped setup key from vault. Gated by `base__mesh_enabled` (per-host opt-in) and `base__mesh_manage` (skips network/daemon actions for Molecule). **No firewall change:** enrollment is additive (`wt0` comes up, SSH keeps listening), so there is zero lockout risk. The host nftables default-deny + NetBird ACL tightening are a separate, deferred follow-on.

**Tech Stack:** NetBird agent (apt, pinned), Ansible (`base` role), Molecule, the M4b coordinator at `https://netbird.askari.wingu.me`.

**Spec:** `docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md`

**Execution context:** Tasks 1–4 author + commit (need nothing from the operator). **Task 5 is an operator handoff** (dashboard `/setup` + mint key). **Task 6 applies live to `ubongo` + `askari`** (gated). Task 7 is operator-only (laptops). Task 8 is docs.

---
## File structure

| File | Change | Responsibility |
|---|---|---|
| `tests/tags.yml` | modify | add the `mesh` concern to the closed tag vocabulary |
| `roles/base/defaults/main.yml` | modify | `base__mesh_*` knobs |
| `roles/base/tasks/mesh.yml` | **create** | the enrollment concern (install + `netbird up`) |
| `roles/base/tasks/main.yml` | modify | include `mesh.yml` (gated, tagged) |
| `roles/base/README.md` | modify | document the `mesh` concern + knobs |
| `roles/base/molecule/default/converge.yml` | modify | enable mesh (manage off) + dummy key |
| `roles/base/molecule/default/verify.yml` | modify | assert mesh wiring / no-op |
| `inventories/production/group_vars/control/vars.yml` | modify | `base__mesh_enabled: true` (ubongo) |
| `inventories/production/group_vars/offsite_hosts/vars.yml` | **create** | `base__mesh_enabled: true` (askari) |
| `inventories/production/group_vars/all/vault.yml` | modify (vault) | `vault.netbird.setup_key: CHANGEME` |
| `STATUS.md`, `docs/ROADMAP.md`, `docs/FRICTION.md` | modify | M5 done; deferred hardening; friction note |

---

### Task 1: Verify + pin the NetBird agent; add the `mesh` tag

- [ ] **Step 1 (ADR-014 verification: record the answers):** confirm against current NetBird docs/repo (WebFetch `docs.netbird.io`, `pkgs.netbird.io`):
  - the **apt repo** URL + signing-key URL + suite/component (the install script publishes an apt source; capture the exact `deb` line and key URL);
  - the **package name** (headless agent, expected `netbird`) and that **version `0.72.4`** (matching the coordinator) is installable, plus the apt **version-pin syntax**;
  - the exact **`netbird status`** output string that indicates an established management connection (for the idempotency guard, e.g. `Management: Connected`);
  - the **`netbird up`** flags (`--management-url`, `--setup-key`);
  - whether the pinned NetBird's **default peer policy is allow-by-default** (decides Task 6, Step 4). Record all of this in the commit message / a note block.
- [ ] **Step 2:** add `mesh` to `tests/tags.yml` under `concerns:`:
```yaml
  - mesh  # NetBird agent enrollment (ADR-016)
```
- [ ] **Step 3:** `make lint` → expect `check-tags: OK` with 0 failures (an unused vocab entry is allowed; nothing references it yet).
- [ ] **Step 4:** commit `feat(base): add the 'mesh' concern tag (NetBird agent, ADR-016)`.
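
> A possible shape for the Step 1 note block, sketched as YAML. The key names are hypothetical (the plan does not define them), and every value is only this plan's *expected* answer, to be overwritten with whatever the verification actually finds.

```yaml
# Hypothetical note-block layout for Task 1, Step 1. Values are expectations
# taken from this plan, NOT verified facts; correct them during verification.
netbird_agent_verification:
  apt_repo_line: "deb [signed-by=/etc/apt/keyrings/netbird.asc] https://pkgs.netbird.io/debian stable main"
  apt_key_url: "https://pkgs.netbird.io/debian/public.key"
  package_name: "netbird"
  version: "0.72.4"
  apt_pin_syntax: "netbird=0.72.4"
  status_connected_string: "Management: Connected"
  up_flags:
    - "--management-url"
    - "--setup-key"
  default_policy_allow_all: null  # unknown until checked; decides Task 6, Step 4
```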

---

### Task 2: `base` `mesh` concern (defaults + tasks + include + README)

**Files:** `roles/base/defaults/main.yml`, `roles/base/tasks/mesh.yml` (create), `roles/base/tasks/main.yml`, `roles/base/README.md`.

- [ ] **Step 1:** append the knobs to `roles/base/defaults/main.yml`:
```yaml
# NetBird mesh agent enrollment (ADR-016). Opt-in: default off so applying `base` to a
# host not (yet) on the mesh is a no-op for this concern. The live actions (apt install
# over the network, `netbird up` against the coordinator) are additionally gated by
# base__mesh_manage so Molecule can exercise the wiring without a coordinator.
base__mesh_enabled: false
base__mesh_manage: true
base__mesh_management_url: "https://netbird.askari.wingu.me"
# Carries the base__ prefix, so no var-naming lint exemption is needed here.
base__mesh_setup_key: "{{ vault.netbird.setup_key }}"
base__mesh_version: "0.72.4"  # match the coordinator; confirmed installable in Task 1
```
- [ ] **Step 2:** create `roles/base/tasks/mesh.yml` (use the Task-1-verified repo URL/key/pin; the values below are the expected ones to confirm):
```yaml
---
# NetBird agent enrollment (ADR-016). Additive only: no firewall change here.
- name: Ensure /etc/apt/keyrings exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"
  tags: [mesh]

- name: Add the NetBird APT GPG key
  ansible.builtin.get_url:
    url: https://pkgs.netbird.io/debian/public.key  # confirm in Task 1
    dest: /etc/apt/keyrings/netbird.asc
    mode: "0644"
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Add the NetBird APT repository
  ansible.builtin.apt_repository:
    # Confirm the repo line in Task 1. The note lives here because a `#` inside
    # the folded scalar below would become part of the repo string.
    repo: >-
      deb [signed-by=/etc/apt/keyrings/netbird.asc]
      https://pkgs.netbird.io/debian stable main
    filename: netbird
    state: present
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Install the NetBird agent (pinned)
  ansible.builtin.apt:
    name: "netbird={{ base__mesh_version }}"  # confirm pin syntax in Task 1
    state: present
    update_cache: true
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Check current NetBird connection status
  ansible.builtin.command: netbird status
  register: _netbird_status
  changed_when: false
  failed_when: false
  when: base__mesh_manage | bool
  tags: [mesh]

- name: Enrol this host in the mesh
  ansible.builtin.command: >-
    netbird up
    --management-url {{ base__mesh_management_url }}
    --setup-key {{ base__mesh_setup_key }}
  register: _netbird_up
  changed_when: _netbird_up.rc == 0
  when:
    - base__mesh_manage | bool
    # Skip when already connected; confirm the exact string in Task 1.
    - "'Management: Connected' not in (_netbird_status.stdout | default(''))"
  no_log: true  # setup key is on the argv
  tags: [mesh]
```
- [ ] **Step 3:** in `roles/base/tasks/main.yml`, add the include (after the existing concerns), gated by `base__mesh_enabled`:
```yaml
- name: NetBird mesh enrollment
  ansible.builtin.include_tasks:
    file: mesh.yml
    apply:
      tags: [mesh]
  when: base__mesh_enabled | bool
  tags: [mesh]
```
- [ ] **Step 4:** document the concern in `roles/base/README.md` (purpose; the `base__mesh_*` knobs table; that it is additive/no-firewall; that the setup key comes from `vault.netbird.setup_key`; the `enabled`/`manage` gating).
- [ ] **Step 5:** `make lint` → 0 failures. Commit `feat(base): NetBird agent enrollment concern (mesh)`.
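
> One gap the pinned install may leave open: `netbird=<version>` selects the version at install time, but a later routine `apt upgrade` on the host could still move the agent off the coordinator's version. A minimal sketch of one way to guard against that, assuming a dpkg hold is acceptable on these hosts. This task is not part of the plan as written; a later version bump would need the hold lifted first, or `allow_change_held_packages: true` on the install task.

```yaml
# Sketch only (not in the plan): would follow "Install the NetBird agent (pinned)"
# in roles/base/tasks/mesh.yml. A held package is skipped by routine upgrades.
- name: Hold the NetBird agent at the pinned version
  ansible.builtin.dpkg_selections:
    name: netbird
    selection: hold
  when: base__mesh_manage | bool
  tags: [mesh]
```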

---

### Task 3: Molecule coverage

**Files:** `roles/base/molecule/default/converge.yml`, `roles/base/molecule/default/verify.yml`.

> The concern is install + a daemon command needing a live coordinator, so the hermetic Molecule surface is thin (the known "render-only misses the real call" gotcha). Molecule proves: (a) enabling mesh with `manage: false` does not break the base converge and is idempotent; (b) `base__mesh_enabled: false` (the default, already exercised by the existing firewall test) is a clean no-op. Full install+enrol is proven live in Task 6.

- [ ] **Step 1:** in `converge.yml` add to `vars:`:
```yaml
    base__mesh_enabled: true
    base__mesh_manage: false  # skip network/daemon actions
    base__mesh_setup_key: "dummy-molecule-key"
```
- [ ] **Step 2:** in `verify.yml` add a task asserting the concern is a clean no-op under `manage: false`: `netbird` is NOT installed and `wt0` does not exist (since all live actions are gated off):
```yaml
    - name: Confirm mesh manage=false did not install/enrol
      ansible.builtin.command: which netbird
      register: _nb
      changed_when: false
      failed_when: false

    - name: Assert netbird absent under manage=false
      ansible.builtin.assert:
        that:
          - _nb.rc != 0
        fail_msg: "netbird should not be installed when base__mesh_manage is false"
```
- [ ] **Step 3:** `make test ROLE=base` → converge + idempotence + verify pass (`failed=0`). The existing firewall assertions still pass (mesh vars don't affect them).
- [ ] **Step 4:** commit `test(base): molecule coverage for the mesh concern (manage-off no-op)`.
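
> Step 2 also calls for `wt0` to be absent, which the `which netbird` check does not cover. A minimal sketch of that second assertion, assuming the Molecule instance exposes `/sys/class/net` as a normal Linux host does. The `_wt0` register name is illustrative, and the indentation should match the surrounding `tasks:` list in `verify.yml`.

```yaml
# Sketch: companion check for the interface half of Step 2.
- name: Look for the NetBird interface
  ansible.builtin.stat:
    path: /sys/class/net/wt0
  register: _wt0

- name: Assert wt0 absent under manage=false
  ansible.builtin.assert:
    that:
      - not _wt0.stat.exists
    fail_msg: "wt0 should not exist when base__mesh_manage is false"
```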

---

### Task 4: Vault stub + per-host opt-in

- [ ] **Step 1 (vault; needs `rbw` unlocked):** `make decrypt FILE=inventories/production/group_vars/all/vault.yml`; add under `vault.netbird` (alongside `auth_secret`/`datastore_key`):
```yaml
    # Reusable, scoped (group "boma-hosts"), expiring NetBird setup key. Mint it in the
    # dashboard (Setup Keys) AFTER the first-boot /setup admin exists. Consumed by the
    # base 'mesh' concern. CHANGEME until the operator supplies it via `make edit-vault`.
    setup_key: CHANGEME
```
  Then `make encrypt FILE=...`; `make check-vault` → confirms structure + lists the `setup_key` CHANGEME.
- [ ] **Step 2:** set the opt-in. In `inventories/production/group_vars/control/vars.yml` add `base__mesh_enabled: true` (ubongo). Create `inventories/production/group_vars/offsite_hosts/vars.yml`:
```yaml
---
# askari is a NetBird peer as well as the coordinator host (ADR-016).
base__mesh_enabled: true
```
- [ ] **Step 3:** `make lint` → 0 failures. Commit `feat(base): vault setup_key stub + enable mesh on ubongo + askari`.
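
> For orientation, a sketch of how the decrypted `vault.netbird` block would look once the stub is in place, assuming `vault` is a top-level mapping (as the `{{ vault.netbird.setup_key }}` lookup implies). The two existing entries are shown as elided placeholders, not real values.

```yaml
# Decrypted view (sketch). Only setup_key is added by this task.
vault:
  netbird:
    auth_secret: "<existing, unchanged>"
    datastore_key: "<existing, unchanged>"
    setup_key: CHANGEME  # replaced by the operator in Task 5
```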

---

### Task 5: Operator handoff, first-boot admin + setup key (GATED, operator does this)

> Nothing here is automatable: the agent cannot create a dashboard admin or mint a key.

- [ ] **Step 1 (operator):** browse `https://netbird.askari.wingu.me`, complete the one-time `/setup` to create the admin user, log in.
- [ ] **Step 2 (operator):** create a **reusable** setup key, **scoped** to auto-assign peers to a `boma-hosts` group, with an **expiry**. Copy the key value.
- [ ] **Step 3 (operator):** `make edit-vault` → replace `vault.netbird.setup_key`'s `CHANGEME` with the real key → `:wq` (re-encrypts) → `make check-vault` shows no outstanding CHANGEME. The key never enters the chat.
- [ ] **Step 4:** no repo commit other than the (already-encrypted) vault, whose on-disk structure is unchanged.
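
> Because the vault carries `CHANGEME` until Step 3 is done, a live run started too early would hand the placeholder to `netbird up`. A minimal sketch of a preflight guard, assuming it is placed at the top of `roles/base/tasks/mesh.yml`. This task is hypothetical and not part of the plan; under Molecule it is skipped because `base__mesh_manage` is false.

```yaml
# Sketch only (not in the plan): fail early instead of enrolling with the stub.
# The assertion output reports the expression, not the key value itself.
- name: Refuse to enrol with the placeholder setup key
  ansible.builtin.assert:
    that:
      - base__mesh_setup_key | length > 0
      - base__mesh_setup_key != 'CHANGEME'
    fail_msg: "vault.netbird.setup_key is still CHANGEME; complete Task 5 first"
    quiet: true
  when: base__mesh_manage | bool
  tags: [mesh]
```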

---

### Task 6: Enrol `ubongo` + `askari` (GATED, live; needs Task 5 done + `rbw` unlocked)

- [ ] **Step 1:** `make check PLAYBOOK=site LIMIT=askari TAGS=mesh`, then review the output (askari is `ansible`-user managed, so it is a cleaner first target than the control node). Then `make deploy PLAYBOOK=site LIMIT=askari TAGS=mesh`.
- [ ] **Step 2:** verify on askari: `netbird status` shows `Management: Connected`; `ip link show wt0` succeeds (the interface exists). (Agent coexists with the coordinator container; it reaches the coordinator via the public URL.)
- [ ] **Step 3:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=mesh`, then review the output. Note: ubongo is managed as `sjat` with `become: true` (same path `dev_env` used via `playbooks/workstation.yml`); confirm `sjat` sudo works (the run will prompt/fail clearly if a become password is needed). Then `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=mesh`.
- [ ] **Step 4:** verify the mesh link from ubongo: `netbird status` shows `ubongo` connected and lists `askari` as a peer; ping askari's NetBird (`100.x`) address. If the pinned NetBird is NOT allow-by-default (Task 1, Step 1), add one minimal dashboard policy permitting the admin group → `ubongo` SSH (or, temporarily, the default policy) so Task 7 can connect.
- [ ] **Step 5:** no repo commit (host state).
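
> The manual checks in Steps 2 and 4 could also be captured as tasks. A minimal sketch, assuming the `Management: Connected` string is confirmed in Task 1 and that `wt0` shows up under `/sys/class/net`. Where such a verification file would live, and the register names, are illustrative and not defined by the plan.

```yaml
# Sketch: post-enrol verification, read-only on the host.
- name: Read NetBird status
  ansible.builtin.command: netbird status
  register: _netbird_status
  changed_when: false

- name: Look for the wt0 interface
  ansible.builtin.stat:
    path: /sys/class/net/wt0
  register: _wt0

- name: Assert the host is enrolled
  ansible.builtin.assert:
    that:
      - "'Management: Connected' in _netbird_status.stdout"  # string to confirm in Task 1
      - _wt0.stat.exists
    fail_msg: "host does not look enrolled (no management connection or no wt0)"
```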

---

### Task 7: Enrol the road-warrior clients → goal lands (operator)

- [ ] **Step 1 (operator):** install the NetBird client on `mamba` + the work laptop; log in via the dashboard (Dex SSO) so they join the mesh.
- [ ] **Step 2 (operator):** from a laptop (anywhere), `ssh sjat@<ubongo-netbird-ip>` (or the mesh hostname); the connection succeeds. **← the mobile-access goal lands here.**
- [ ] **Step 3:** confirm with the operator that remote access works end-to-end.

---

### Task 8: Docs

- [ ] **Step 1:** `STATUS.md`: move "NetBird agent enrollment in `base`" to **built + applied** (ubongo + askari enrolled; reachability achieved). Note the `mesh` concern + opt-in. ubongo row: mesh-enrolled (its other base concerns still pending). askari row: NetBird peer.
- [ ] **Step 2:** `docs/ROADMAP.md`: **M5 ✅ DONE**; Phase 1 (remote access) complete. Next: the **Procurement gate** (`/capacity-review` → buy cluster hardware). Record the deferred "mesh hardening" follow-on (ubongo nftables default-deny + NetBird ACL tightening + askari SSH→`wt0`).
- [ ] **Step 3:** `docs/FRICTION.md`: add a signal that a **docs-only commit still tripped the `rbw`-locked pre-commit guard** (2026-06-17), although the 2026-06-10 kaizen fix was meant to let docs-/config-only commits through without vault. The hook scoping or a blanket guard needs a look.
- [ ] **Step 4:** `make lint`; commit `docs: M5 done — Phase 1 remote access complete`.

---

## Self-Review (completed)

- **Spec coverage:** `mesh` concern (spec §1) → Tasks 1–3; vault stub (spec §2) → Task 4; ubongo+askari enrol (spec §3) → Tasks 4, 6; laptops (spec §3) → Task 7; reachability via default policy (spec §4) → Task 6, Step 4; deferred hardening (spec §6) → recorded in Task 8; operator handoff (spec) → Task 5. Testing (spec) → Task 3 (hermetic) + Task 6 (live). All covered.
- **Placeholder scan:** the "confirm in Task 1" markers are ADR-014 verification points executed in Task 1 (the repo URL/key/pin/status-string), not vague TODOs; Task 2's code carries the expected values to confirm, matching how M4a/M4b pinned versions in-plan.
- **Consistency:** `base__mesh_enabled` (opt-in) vs `base__mesh_manage` (test gate) used consistently across defaults, tasks, include, converge, and the no-op assertion; `vault.netbird.setup_key` matches between defaults, vault stub, and Task 5; `mesh` tag added (Task 1) before it is used (Task 2).
- **Risk:** the only live risk is Task 6 on the control node. It is mitigated because the `mesh` concern makes **no firewall change** (SSH stays open on all paths), askari is enrolled first as the lower-risk rehearsal, and the host nftables lockdown is explicitly out of scope.