docs(spec): tagging standard design (TODO 3.7/3.11 → ADR-019)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-06 09:15:44 +02:00
parent 9bdb3017bb
commit 4ed9e9a8bf

@ -0,0 +1,188 @@
# Design — Ansible tagging standard (targeted, predictable runs)
- **Date:** 2026-06-06
- **Status:** Approved design — pending implementation plan
- **Resolves:** TODO 3.7 ("Define a tagging standard that lets us target runs without
over-tagging") and TODO 3.11 ("Deliberate tagging strategy") — the same thread
- **Becomes:** ADR-019 (this design is the basis for that ADR)
---
## Problem
boma wants to run playbooks **targeted** — a single service, a single layer, or a
single cross-cutting concern — and to do so **transparently and predictably**: you
should be able to look at a `--tags` invocation and know exactly what it will and won't
touch. CLAUDE.md already mandates that every task be tag-filterable, but no *vocabulary*
or *naming convention* exists. Without one, tags proliferate ad-hoc per role and the
"predictable" property is lost — and the TODO explicitly warns against the opposite
failure mode, **over-tagging**.
The repo is effectively greenfield for this: `base` and `docker_host` are empty, and the
only tags in existence are `[base]`/`[docker]` in `site.yml` and `[bootstrap]` in
`bootstrap.yml`. So we can bake the standard into role-authoring conventions *before*
there are a dozen service roles to retrofit.
## Targeting axes (what we want to slice by)
1. **Layer / role**: `--tags base`, `--tags docker`
2. **Single service**: `--tags photoprism`, `--tags traefik`
3. **Concern / function**: `--tags firewall`, `--tags logging`, …
Lifecycle phases (bootstrap/config/deploy) are **not** a tag axis — `bootstrap.yml` vs
`site.yml` already separate those as whole playbooks.
Key simplification: because of ADR-004 (*one service = one role*, role name = service
name), axes 1 and 2 are the **same mechanism** — a tag equal to the role name. Only the
concern axis needs a curated vocabulary.
## Approach (chosen): two-tier tagging
**Tier 1 — role/service tag (mechanical).** The tag *equals the role name*, applied
**once** at the role-import level in the playbook:
```yaml
roles:
- role: photoprism
tags: [photoprism]
```
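Ansible propagates the tag to every task in the role when the role is loaded statically
(`roles:` or `import_role`); a dynamic `include_role` applies the tag only to the include
itself. This covers both the layer/role and single-service axes with one rule and
**zero per-task burden**.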
**Tier 2 — concern tag (curated).** A small **closed, documented list** of cross-cutting
concern tags, applied per-task/block **only where a task genuinely belongs to that
concern**. `--tags firewall` then hits firewall tasks in `base` and in every service
role.
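
As a rough mental model of the two tiers (not Ansible's actual implementation), a task's effective tags can be read as the role tag inherited from the play plus any concern tags set on the task itself. The sketch below uses hypothetical role and task names, and the helpers `effective_tags` and `select` are illustrative only.

```python
# Toy model of two-tier tagging; role and task names are illustrative only.
ROLES = {
    "base": [
        {"name": "install apt packages", "tags": {"packages"}},
        {"name": "render nftables ruleset", "tags": {"firewall"}},
    ],
    "photoprism": [
        {"name": "open photoprism port", "tags": {"firewall"}},
        {"name": "render compose file", "tags": {"config"}},
        {"name": "create data directory", "tags": set()},  # no concern tag needed
    ],
}


def effective_tags(role, task):
    """Tier 1 role tag (inherited) plus Tier 2 concern tags (per task)."""
    return {role} | task["tags"]


def select(requested):
    """Return (role, task name) pairs whose effective tags match any requested tag."""
    return [
        (role, task["name"])
        for role, tasks in ROLES.items()
        for task in tasks
        if effective_tags(role, task) & set(requested)
    ]


print(select(["photoprism"]))  # every photoprism task, concern-tagged or not
print(select(["firewall"]))    # firewall tasks in base and in photoprism
```

The untagged "create data directory" task is still reachable through the role tag, which is why Tier 1 removes the need to hand-tag every task.
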
Rejected alternatives: *concern-only/flat* (loses natural `--tags <service>` ergonomics);
*rich multi-dimensional* (role+service+concern+lifecycle+ad-hoc per task) — that is
precisely the over-tagging the TODO warns against.
## The closed concern list
Litmus test for earning a spot: a concern must (a) appear in **2+ roles**, (b) be
something you'd realistically want to run as a slice on its own, and (c) not overlap
confusingly with another.
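
Only criterion (a) lends itself to a mechanical check; (b) and (c) remain judgment calls. A minimal sketch of that count, assuming a hypothetical mapping from role name to the concern tags its tasks use (the data and function names below are not part of the repo):

```python
from collections import defaultdict


def roles_per_concern(role_tags):
    """Map each concern tag to the sorted list of roles that use it."""
    usage = defaultdict(set)
    for role, tags in role_tags.items():
        for tag in tags:
            usage[tag].add(role)
    return {tag: sorted(roles) for tag, roles in usage.items()}


def passes_criterion_a(role_tags, candidate, minimum=2):
    """True if the candidate concern appears in at least `minimum` roles."""
    return len(roles_per_concern(role_tags).get(candidate, [])) >= minimum


# Hypothetical usage data, for illustration only.
example = {
    "base": {"firewall", "packages"},
    "photoprism": {"firewall", "config", "backup"},
    "traefik": {"config", "proxy"},
}
print(passes_criterion_a(example, "firewall"))  # True: used in two roles
print(passes_criterion_a(example, "backup"))    # False: only one role uses it
```
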
**Baseline concerns** (mostly in `base`, some echoed in service roles):
| Tag | Covers |
|-----|--------|
| `packages` | apt package install/management |
| `users` | accounts, groups, sudo |
| `firewall` | nftables rulesets & port definitions (ADR-002) |
| `hardening` | security baseline — sshd config, fail2ban, auditd, sysctl |
| `logging` | Alloy / log-shipping config (ADR-018) |
| `monitoring` | metric exporters / health checks |
**Service concerns** (in every service role, ADR-004):
| Tag | Covers |
|-----|--------|
| `config` | render templated config/compose files to disk — **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
Nine tags total. The `config`/`deploy` split is deliberate and high-value: `--tags
config` re-renders and lets you diff configuration without bouncing services; `--tags
deploy` does the restart.
`backup` and `secrets` are **intentionally omitted** until the roles that need them
exist — they enter via the extend process, not speculative reservation.
## `always` / `never` policy
boma uses Ansible's two built-in special tags, narrowly:
- **`always`** — reserved strictly for **cheap preflight assertions** (vault unlocked,
OS is Debian 13, required vars present). Ensures even `--tags config` runs its safety
guards.
- **`never`** — reserved for **destructive/expensive opt-in tasks**, each paired with a
descriptive tag (e.g. `never, force_pull` or `never, restore`). They never run unless
explicitly named, keeping dangerous actions out of normal runs. The descriptive
partner tag is a documented `never`-paired opt-in (allowed by the linter).
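
The interaction between `always`, `never`, and a `--tags` request can be sketched as a small predicate. This is a simplified model of the selection rules as described in Ansible's tag documentation, not its implementation: it ignores `--skip-tags` and the `tagged`/`untagged` selectors, and the function name `should_run` and the example tasks are hypothetical.

```python
def should_run(task_tags, requested=None):
    """Simplified model of tag selection with the special tags always/never.

    task_tags: all tags on the task, including inherited ones.
    requested: the tags passed via --tags, or None when --tags is not used.
    """
    tags = set(task_tags)
    only = set(requested) if requested else {"all"}
    if "always" in tags:
        return True
    if "all" in only and "never" not in tags:
        return True
    return bool(tags & only)


preflight = {"always"}
render = {"photoprism", "config"}
restore = {"photoprism", "never", "restore"}

print(should_run(preflight, ["config"]))    # True: guards run on every slice
print(should_run(restore))                  # False: skipped on a plain run
print(should_run(restore, ["restore"]))     # True: explicitly requested
print(should_run(restore, ["photoprism"]))  # True: inherited role tag matches
```

The last line is worth verifying against real Ansible before relying on `never` inside a role: in this model a `never` task is selected whenever any of its other tags is requested, and that includes the role tag inherited under Tier 1, so `--tags photoprism` would also pick up `restore`.
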
## Predictability principle: tags are union-only
`--tags a,b` runs tasks tagged a **OR** b — Ansible has no native AND. Rather than fight
this, we make it an explicit principle: **boma targets one axis at a time**, *either* a
role/service (`--tags photoprism`) *or* a concern (`--tags firewall`), never an
intersection like "photoprism's firewall only." If that is ever genuinely needed, the
answer is "just run `--tags photoprism`" (idempotent and fast). Designing for
intersection is the over-tagging trap; we decline it on purpose.
## Reconciling the existing CLAUDE.md rule
CLAUDE.md currently says *"every task must have at least one tag."* Under the two-tier
model the role tag is applied **once at the play/import level** and **inherited** by
every task, so tasks are always reachable without hand-tagging each one. The rule is
**reworded** to:
> Import each role with its role-name tag (once, at the play level). Within a role, tag a
> task/block with a concern tag from the approved list **only where it genuinely belongs
> to that concern** — don't invent tags or tag for tagging's sake.
This directly resolves the "without over-tagging" tension.
## Terraform / Proxmox VM tags (metadata only)
Formalize the convention that already half-exists in `staging/main.tf`
(`tags = ["staging", each.value.group]`). Every TF-managed VM gets exactly three tags:
| Tag | Value | Purpose |
|-----|-------|---------|
| env | `staging` \| `production` | which environment |
| role/group | `docker_hosts`, `proxmox_hosts`, … | matches the inventory group |
| managed-by | `terraform` | distinguishes IaC VMs from hand-made ones |
Set as `tags = ["${env}", each.value.group, "managed-by=terraform"]` in the env
`main.tf` (env is constant per directory).
**Explicit non-goals** (stated so nobody wires them up later): these tags are **pure
metadata for transparency** — glanceable in the Proxmox UI. They do **not** drive
run-targeting and do **not** feed inventory. `scripts/tf_to_inventory.py` keeps building
groups from the `group` output field, which stays the single source of truth.
## Enforcement
A small **lint check wired into `make lint`**: a script collects every `tags:` value
across `roles/` and `playbooks/` and fails if any tag is not in the allowed set:
```
{role names} ∪ {9 concern tags} ∪ {always, never} ∪ {documented never-paired opt-ins}
```
The allowed concern list (and the `never`-paired opt-ins) lives in **one
machine-readable file, `tests/tags.yml`**, which the linter reads and the ADR
documents, so the enforced set has a single source of truth and doc and enforcement are
less likely to drift. This is a stricter check than ansible-lint's limited built-in tags
rule provides. A unit test (mirroring `tests/test_capacity_scan.py`) covers the checker.
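
A minimal sketch of what the checker's core could look like, operating on already-parsed YAML (plain dicts and lists) so it stays independent of any particular YAML library. Everything here is an assumption for illustration: the shape of `ALLOWED` (standing in for the contents of `tests/tags.yml`), the function names, and the sample data.

```python
# Hypothetical stand-in for the parsed contents of tests/tags.yml.
ALLOWED = {
    "concerns": ["packages", "users", "firewall", "hardening", "logging",
                 "monitoring", "config", "deploy", "proxy"],
    "never_opt_ins": ["force_pull", "restore"],
}
SPECIAL = {"always", "never"}


def collect_tags(node):
    """Yield every tag found under a `tags:` key in parsed YAML data."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tags":
                # Handle a single string or a list of strings.
                if isinstance(value, str):
                    yield value
                elif isinstance(value, list):
                    yield from value
            else:
                yield from collect_tags(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_tags(item)


def unknown_tags(documents, role_names, allowed=ALLOWED):
    """Return the sorted tags that fall outside the allowed set."""
    permitted = (set(role_names) | set(allowed["concerns"]) | SPECIAL
                 | set(allowed["never_opt_ins"]))
    found = {tag for doc in documents for tag in collect_tags(doc)}
    return sorted(found - permitted)


# Sample parsed playbook and task file, for illustration only.
play = [{"hosts": "all", "roles": [{"role": "photoprism", "tags": ["photoprism"]}]}]
tasks = [
    {"name": "render compose file", "tags": ["config"]},
    {"name": "open port", "tags": "firewal"},  # typo on purpose
    {"block": [{"name": "restore data", "tags": ["never", "restore"]}]},
]
print(unknown_tags([play, tasks], role_names=["base", "photoprism"]))  # ['firewal']
```

One limitation of this sketch: it treats any key named `tags` as Ansible tags, so a variable or module argument that happens to be called `tags` would be flagged too; a real checker would likely need to restrict itself to play, role, block, and task level keys.
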
## The "propose to extend" process
To add a concern tag: (1) add it to `tests/tags.yml`; (2) add a row to the ADR-019 table
with a one-line justification showing it passes the litmus test (cross-cutting, 2+
roles, distinct). That is the whole gate — lightweight, but it leaves a paper trail.
## Deliverables
- **New `docs/decisions/019-tagging.md`** — the standard: rationale, two-tier model,
concern table, union-only principle, `always`/`never` policy, Proxmox tag convention,
extend process.
- **`tests/tags.yml`** — machine-readable allowed concern list + `never`-paired opt-ins.
- **Lint checker script** (e.g. `scripts/check-tags.py`) + **`make lint`** wiring +
**`tests/test_check_tags.py`**.
- **CLAUDE.md** — reword the tag bullet under *Ansible conventions*; add the Proxmox tag
convention under *Terraform conventions*; add ADR-019 to *Further reading*.
- **`terraform/environments/{staging,production}/main.tf`** — apply the three-tag
convention.
- **`docs/TODO.md`** — mark 3.7 and 3.11 DECIDED (ADR-019).
- **`docs/CAPABILITIES.md`** — note targeted runs as a capability, if it fits.
## Out of scope
- Intersection targeting (role ∩ concern) — declined on purpose (see principle).
- Lifecycle-phase tags — handled by separate playbooks.
- Proxmox tags feeding inventory or run-targeting — metadata only.
- `backup`/`secrets` concern tags — added later via the extend process.