docs(adr/security): record claude NOPASSWD sudo model (ADR-015 amend + R7)

The integration-testing shakedown reversed ADR-015's "no local sudo"
sub-decision: the claude AI-worker now has NOPASSWD:ALL sudo on ubongo.
Without it, virsh, nft, and journalctl are all blocked during VM
diagnosis. Compensating controls: password-locked account, auditd/Loki
attribution, repo-managed revocable drop-in.

ADR-015: dated amendment note in Status + expanded AI-worker identity
section.
ADR-021: new §Sudo model (amendment 2026-06-18): claude=NOPASSWD,
sjat=password required; former sjat NOPASSWD drop-in removed 2026-06-18
(least-privilege cleanup).
accepted-risks.md: R7 added (claude NOPASSWD:ALL on ubongo);
last-reviewed updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:39:20 +02:00 · cc772ff845
commit cc772ff845
parent 3fe6f68316
3 changed files with 66 additions and 9 deletions
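
The sudo model this commit records (claude holds NOPASSWD, sjat needs a password via the sudo group) could be spot-checked with a small script. The following is a minimal sketch in Python, not the repository's verifier. The sudoers lines in it are assumptions for illustration, since the commit names the drop-ins but does not show their contents, and the parser handles only simple one-line rules (no aliases, line continuations, or includes).

```python
import re

# Assumed sudoers lines, for illustration only: the commit does not show
# the actual contents of the repo-managed drop-in or the sudo group rule.
CLAUDE_DROPIN = "claude ALL=(ALL:ALL) NOPASSWD: ALL\n"
SUDO_GROUP_RULE = "%sudo ALL=(ALL:ALL) ALL\n"

# Matches simple rules of the form "<user-or-%group> <host> = <spec>".
RULE = re.compile(r"^(?P<who>%?[\w.-]+)\s+\S+?\s*=\s*(?P<spec>.+)$")


def nopasswd_principals(sudoers_text):
    """Return the users/groups that have at least one NOPASSWD-tagged rule."""
    found = set()
    for raw in sudoers_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments/include directives, and Defaults settings.
        if not line or line.startswith("#") or line.startswith("Defaults"):
            continue
        match = RULE.match(line)
        if match and "NOPASSWD:" in match.group("spec"):
            found.add(match.group("who"))
    return found


def matches_recorded_model(sudoers_text):
    """True when claude is the only principal holding a NOPASSWD rule."""
    return nopasswd_principals(sudoers_text) == {"claude"}
```

With the assumed lines, `matches_recorded_model(CLAUDE_DROPIN + SUDO_GROUP_RULE)` evaluates to `True`. Appending a NOPASSWD rule for sjat, as in the removed interim drop-in, makes it `False`.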
--- a/docs/decisions/015-control-host.md
+++ b/docs/decisions/015-control-host.md
@@ -2,7 +2,10 @@

 ## Status

-Accepted (2026-06-05)
+Accepted (2026-06-05). **Amended 2026-06-18:** the `claude` AI-worker account now has
+`NOPASSWD:ALL` sudo on `ubongo` — reversing the original "no local sudo" sub-decision.
+The amendment is recorded in §Access & security below; rationale and accepted risk are
+in ADR-021 and `docs/security/accepted-risks.md` (R7).

 ## Context

@@ -88,12 +91,33 @@ Manual, on bare metal:
  only** — key-only, with password auth and root login disabled — until the NetBird mesh
  (ADR-016) is stood up.
 - **AI-worker identity:** `ubongo` runs the AI worker under a dedicated,
-  password-locked `claude` user (in the `docker` group for Molecule; **no local sudo** —
-  boma deploys reach the fleet over SSH as the `ansible` user, not via local root). It is
-  reached via `sudo -iu claude` or its own SSH key. The rationale is **attribution +
-  revocation, not containment**: auditd/Loki (ADR-018) can separate human from agent
-  actions, and the account/key can be revoked without touching the operator's access.
-  (ADR-021 left the on-`ubongo` agent identity unspecified; this records it.)
+  password-locked `claude` user (in the `docker` and `libvirt` groups; **`NOPASSWD:ALL`
+  sudo** via a repo-managed drop-in — see amendment below). It is reached via `sudo -iu
+  claude` or its own SSH key. The rationale is **attribution + revocation, not
+  containment**: auditd/Loki (ADR-018) can separate human from agent actions, and the
+  account/key can be revoked without touching the operator's access. (ADR-021 left the
+  on-`ubongo` agent identity unspecified; this records it.)
+
+  **Amendment (2026-06-18) — `claude` now has `NOPASSWD:ALL` sudo.** During the
+  integration-testing harness shakedown, the original "no local sudo" sub-decision was
+  reversed. No-sudo blocked the AI-worker from diagnosing a failed VM: `virsh`,
+  `virt-install`, `cloud-localds`, `journalctl`, `nft` — nearly all low-level
+  diagnostic commands — require root. The AI-worker must autonomously spin up,
+  inspect, and tear down test VMs without operator hand-holding; that is the harness's
+  core value proposition. Compensating controls make the risk acceptable:
+
+  1. `claude`'s password is **locked** (no interactive login, no `su claude` without the
+     operator's own credentials) — `NOPASSWD` sudo is the *only* sudo path.
+  2. `auditd` + Loki attribution (ADR-018) separates human from agent root actions.
+  3. The drop-in is **repo-managed** via `base__ai_worker_user` — revocable in one commit
+     and one deploy.
+  4. Single-operator homelab: everything in git, off-machine backups (ADR-022).
+
+  The operator (`sjat`) uses **password-required sudo** via the `sudo` group; their
+  former `NOPASSWD` drop-in was removed 2026-06-18 as redundant once `claude` had sudo
+  (least-privilege cleanup). The accepted risk is registered as R7 in
+  `docs/security/accepted-risks.md`. ADR-021 records the resulting sudo model for both
+  accounts.
 - **Disk encryption:** `ubongo`'s SSD is **not encrypted at rest** — the SanDisk X600 is
  TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in
  `docs/security/accepted-risks.md` (control-node disk not encrypted at rest),
--- a/docs/decisions/021-operational-access.md
+++ b/docs/decisions/021-operational-access.md
@@ -3,7 +3,9 @@
 ## Status

 Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access
-will be rare) and TODO 3.2 (the service admin-API access question).
+will be rare) and TODO 3.2 (the service admin-API access question). **Amended
+2026-06-18:** the on-`ubongo` sudo model for the two local accounts is now settled
+(see §Sudo model on `ubongo` below).

 **Doctrine ADR.** It pins the operational-access doctrine, the declarative `access__*`
 data model, the rendered `ACCESS.md` record, and the `/check-access` verifier. It does
@@ -163,6 +165,36 @@ exists and `/check-access` is green (or a deviation is recorded in `accepted-ris
 No scaffold change — same manual-copy-plus-review pattern the sibling records
 (`SECURITY.md`/`VERIFY.md`) use.

+### Sudo model on `ubongo` (amendment 2026-06-18)
+
+The original ADR left on-`ubongo` local sudo unspecified. The integration-testing
+harness shakedown settled it:
+
+| Account | Role | Sudo |
+|---|---|---|
+| `claude` | Automated AI-worker | `NOPASSWD:ALL` via repo-managed drop-in (`base__ai_worker_user`) |
+| `sjat` | Human operator | Password-required sudo via the `sudo` group |
+
+**Rationale for `claude NOPASSWD`.** No-sudo blocked the AI-worker from diagnosing a
+failed test VM: `virsh`, `virt-install`, `cloud-localds`, `nft`, `journalctl` —
+almost every low-level diagnostic tool — require root. The harness's core value is
+autonomous spin-up → apply → reboot → assert → diagnose; that loop collapses without
+local root access.
+
+**Compensating controls (R7 in `docs/security/accepted-risks.md`):**
+- `claude`'s password is locked — `NOPASSWD` is the account's *only* sudo path; no
+  interactive login is possible.
+- `auditd` + Loki attribution (ADR-018) separates human from agent root actions in the
+  audit trail.
+- The drop-in is repo-managed and revocable in one commit + one deploy.
+- Single-operator homelab; everything in git; off-machine backups (ADR-022).
+
+**`sjat` NOPASSWD removed.** The operator's former `NOPASSWD` drop-in
+(`/etc/sudoers.d/sjat-ansible`, added as an interim measure during M5 NetBird
+enrolment) was removed 2026-06-18. It was redundant once `claude` held sudo, and its
+removal restores least-privilege for the human operator. `sjat` retains full sudo
+capability via the `sudo` group (password required).
+
 ## Consequences

 - Every host and service has at least one documented, verifiable way in — and a verifier
--- a/docs/security/accepted-risks.md
+++ b/docs/security/accepted-risks.md
@@ -19,8 +19,9 @@ revisit (trigger).
 | R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
 | R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
 | R6 | **`le-prod-wildcard` integration runs** — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified | Scope is **on-demand only** — `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants | The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked |
+| R7 | **`claude` AI-worker has `NOPASSWD:ALL` sudo on `ubongo`** — the automated AI-worker account can execute any command as root on the control node without a password prompt. A compromised or misbehaving agent session could make arbitrary root-level changes to ubongo | The account is **password-locked** (no interactive `claude` login; `NOPASSWD` sudo is the account's only escalation path, so there is no "su to claude + sudo" attack). `auditd` + Loki attribution (ADR-018) logs every `sudo` invocation with the originating user. The drop-in (`/etc/sudoers.d/claude-ai-worker`) is repo-managed via `base__ai_worker_user` — revocable in one commit + one deploy. Single-operator homelab; all changes in git; off-machine backups (ADR-022). Full rationale: ADR-015 amendment (2026-06-18) + ADR-021 §Sudo model. | The AI-worker executes a destructive action that cannot be rolled back via git; the account key is compromised; the threat model shifts toward targeted remote attackers |

-_Last reviewed: 2026-06-11. The prior gaps (full CIS hardening, SELinux/AppArmor,
+_Last reviewed: 2026-06-18. The prior gaps (full CIS hardening, SELinux/AppArmor,
 IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
 Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
 part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build