# Design — boma's DNS home: a new domain at Gandi (DNS-as-code)
- **Date:** 2026-06-11 · **Revised:** 2026-06-12 (Option B — boma gets its own new domain;
supersedes this spec's original "migrate `baobab.band` off Cloudflare" framing)
- **Status:** Draft for review — design settled in brainstorming; pending user review,
then implementation plan
- **Roadmap milestone:** M1 (`docs/ROADMAP.md`)
- **Resolves:** TODO 4 (split-horizon FQDN — with/without `nyumbani`); review finding O12
- **Amends:** ADR-007 — boma's public zone is a **new domain at Gandi LiveDNS, managed as
code**; the three-tier naming scheme; `nyumbani` removed; mesh/LAN-only default
- **Becomes:** an ADR-007 amendment (no new ADR unless `public_dns` grows its own concerns)
---
## Problem
boma needs a DNS home. Investigating the obvious candidates ruled them out as *boma's*
home:
- **`baobab.band`** is the **live legacy homelab** (on Cloudflare): `vaultwarden`,
`nextcloud`, `matrix`/`element`, `collabora`, `ntfy`, `radio`, … in daily use, much of
it riding `*.baobab.band` / `*.nyumbani.baobab.band` wildcards. Moving its authoritative
DNS risks breaking production.
- **`ziethen.dk`** is the **family's primary email** (Fastmail). Moving a live email
domain's DNS is the highest-stakes DNS operation there is — worse, not better.
**Decision: register a NEW Swahili-themed domain at Gandi for boma.** Greenfield,
zero-risk, *born at Gandi* — so it satisfies the DNS-as-code + sovereignty goal natively
with **no migration at all**. The existing domains are decoupled: `baobab.band`'s
Cloudflare exit / V4 decommission is a **separate, later track** (handled when boma
replaces what it hosts), and `ziethen.dk` is untouched.
boma's domain is **`wingu.me`** (registered at Gandi 2026-06-14; *wingu* = Swahili for
*cloud*). The `public_dns` role keeps it as a variable (`public_dns__domain`) so it stays
swappable.
**Starting state (verified 2026-06-14):** Gandi auto-seeded the zone with **13 default
records** — apex parking `A`, `www` web-redirect, and a full Gandi mailbox set (`MX`, SPF,
three `*._domainkey` DKIM CNAMEs, `webmail`, IMAP/POP/submission `SRV`). None are boma's;
wingu.me sends no mail (email stays at `ziethen.dk`). See the setup sequence for the
one-time purge + anti-spoof baseline.
## Decisions (as settled)
1. **New domain, registered at Gandi.** No transfer, no migration, no Cloudflare/Fastmail
entanglement. (Human registers + pays — see division of labour.)
2. **Three-tier naming scheme** (re-homed to `wingu.me`) — see table. `nyumbani`
**dropped**.
3. **Mesh/LAN-only by default.** Home/cluster services have **no public record**; reached
over LAN or the NetBird mesh. Public Gandi records only for deliberate exceptions.
4. **DNS-as-code via a control-node `public_dns` role** driven by record data in
`group_vars` (same pattern as the firewall catalog). Name is provider-agnostic.
5. **Tooling: `community.general.gandi_livedns` with `personal_access_token`** (PAT).
Re-adds `community.general` to `requirements.yml` (collections-on-demand; a committed
role uses `gandi_livedns`), pinned `>=9.0.0`, with the naming comment.
6. **Cert scope: DNS + PAT only.** M1 ends at the zone + PAT in vault, which *enables*
ACME DNS-01 later. No cert issuance in M1 (reverse proxy → askari M4 / home Phase 2).
7. **Human/agent division of labour** (see table) — register + pay + PAT are human; all
record/IaC work is the agent's, from `ubongo`.
8. **Explicitly out of scope:** `baobab.band` (and its Cloudflare exit / V4 decommission)
and `ziethen.dk` — separate later tracks.
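
Decision 5 could translate into a `requirements.yml` entry along these lines. This is only a sketch: the collection name and the `>=9.0.0` floor come from this spec, while the comment wording is illustrative.

```yaml
# requirements.yml (sketch)
collections:
  # community.general: required by the public_dns role (gandi_livedns module).
  # 9.0.0 is the floor because personal_access_token support landed there.
  - name: community.general
    version: ">=9.0.0"
```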
## Verified facts (ADR-014)
> verified: `community.general.gandi_livedns` requires `personal_access_token` (PAT);
> `api_key` is deprecated and **rejected** by Gandi (Bearer auth replaced Apikey) ·
> WebFetch docs.ansible.com + WebSearch (Gandi PAT announcement 2023-09; community.general
> issue #7926) · PAT param added in **community.general 9.0.0**, **13.0.1** current ·
> 2026-06-11
> - Module params: `domain`, `record`, `type`, `values` (list), `ttl`, `state`
> (`present`/`absent`). Supports **check mode + diff**.
> - Auth is per-task: `personal_access_token: "{{ vault.gandi.pat }}"`.
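
As an illustration of the parameters listed above, a single task might look like the following. It is a minimal sketch that uses only the module parameters named in the verified note; the task name is illustrative.

```yaml
# Sketch: one record managed with the verified parameter set.
- name: Publish the null MX for the no-mail domain
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "@"
    type: MX
    values: ["0 ."]
    ttl: 3600
    state: present
    # Auth is per task; api_key is not used.
    personal_access_token: "{{ vault.gandi.pat }}"
  delegate_to: localhost
```

Because the module supports check mode and diff, the same task can be previewed before anything is written to the zone.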
## Naming scheme (the convention)
| Tier | Pattern | Authoritative source | Public? |
|---|---|---|---|
| Infrastructure / hosts | `<host>.boma.wingu.me` | internal zone (`dns1`/`dns2`, Phase 2) | never |
| Home / cluster services | `<service>.wingu.me` | internal zone (split-horizon) | only deliberate exceptions |
| Off-site / VPS services | `<service>.askari.wingu.me` | Gandi LiveDNS | yes (askari has a stable public IP) |
- **`nyumbani` removed** — home is the default; only the exception (`askari`) needs naming.
- **The mesh carries "internal" to road-warriors.** NetBird pushes `dns1`/`dns2` (over
`wt0`) as resolver for the `wingu.me` match-domain → on-LAN-or-on-mesh resolves
internal; truly public resolves at Gandi (ties M1 ↔ ADR-016 / M5).
- **Wildcard TLS later.** `*.wingu.me` ACME DNS-01 (Gandi PAT) gives even unexposed
services real TLS without a public A record. Enabled by M1, issued in M4/Phase 2.
## Architecture — two deliverables
### (A) One-time setup — a short runbook (`docs/runbooks/`)
Greenfield, so this is small and low-risk (contrast the abandoned migration framing):
register the domain, create the LiveDNS zone, issue the PAT. No transfer, no live-zone
cutover.
### (B) `public_dns` — the reusable IaC role
- Runs **from the control node** (`delegate_to: localhost`, or a `dns.yml` play targeting
`control`) against the Gandi LiveDNS API — no managed *host*, only API calls.
- Reconciles records from **`group_vars` data** via `community.general.gandi_livedns`,
PAT from `vault.gandi.pat`. **Check-mode/diff first**, always.
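
One of the two wiring options named above, a dedicated `dns.yml` play targeting `control`, could be sketched as follows. The file location and `gather_facts: false` are assumptions; the final wiring is left to the implementation plan.

```yaml
# dns.yml (hypothetical location in the repo)
- name: Reconcile boma's public DNS zone at Gandi LiveDNS
  hosts: control          # the control node; no managed host is touched
  gather_facts: false     # assumption: API-only work needs no host facts
  roles:
    - role: public_dns
```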
#### Data model (sketch)
```yaml
# inventories/production/group_vars/all/public_dns.yml
public_dns__domain: "wingu.me"
public_dns__records:
# Anti-spoof baseline for a no-mail domain (replaces Gandi's seeded mail set):
- { record: "@", type: MX, values: ["0 ."], ttl: 3600 }
- { record: "@", type: TXT, values: ['"v=spf1 -all"'], ttl: 3600 }
- { record: _dmarc, type: TXT, values: ['"v=DMARC1; p=reject;"'], ttl: 3600 }
# Service records appear as public-tier needs arise; near-empty at M1.
# askari / NetBird records land in M4, e.g.:
# - { record: askari, type: A, values: ["<hetzner-ip>"], ttl: 1800 }
# mesh/LAN-only services are intentionally ABSENT — internal zone only.
# PAT referenced as {{ vault.gandi.pat }} (nested vault.<service>.<key>, CLAUDE.md).
```
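
A minimal sketch of how the role could reconcile the data above, assuming a `roles/public_dns/tasks/main.yml` layout. The optional per-record `state` key and the TTL default are hypothetical additions, not part of the data model shown.

```yaml
# roles/public_dns/tasks/main.yml (hypothetical layout)
- name: Reconcile declared public records at Gandi LiveDNS
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    # item['values'] rather than item.values: in Jinja2 the dotted form
    # resolves to the dict method, not the key.
    values: "{{ item['values'] | default(omit) }}"
    ttl: "{{ item.ttl | default(3600) }}"            # default is an assumption
    state: "{{ item.state | default('present') }}"   # optional key, hypothetical
    personal_access_token: "{{ vault.gandi.pat }}"
  loop: "{{ public_dns__records }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  delegate_to: localhost
```

The `loop_control.label` keeps the run output to one short line per record, which makes the check-mode diff easier to review.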
#### Open design nuance — additive vs authoritative
`gandi_livedns` is **per-record** (`present`/`absent`), not whole-zone sync. Gandi seeded
`wingu.me` with 13 default records (above), so M1 needs a **one-time purge** of those to a
clean baseline (declare them `state: absent`, or a one-shot scripted delete), then manage
**additively**. Full-zone authoritative sync (GET existing → remove undeclared — the
proper end-state, and TODO 8.3's prune question) is flagged as a later enhancement.
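
The declarative variant of the one-time purge might look like the sketch below. The record names and types are assumptions based on Gandi's usual seeded defaults and must be checked against the live zone before use. The apex `MX` and SPF `TXT` are deliberately not listed, on the assumption that `state: present` in the baseline replaces the values of an existing name/type pair.

```yaml
# One-shot purge sketch. Names and types below are ASSUMED, not verified.
- name: Purge Gandi's seeded default records
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    state: absent
    personal_access_token: "{{ vault.gandi.pat }}"
  loop:
    - { record: "@", type: A }                  # apex parking
    - { record: www, type: CNAME }              # web-redirect
    - { record: webmail, type: CNAME }
    - { record: gm1._domainkey, type: CNAME }   # DKIM
    - { record: gm2._domainkey, type: CNAME }
    - { record: gm3._domainkey, type: CNAME }
    - { record: _imap._tcp, type: SRV }
    - { record: _imaps._tcp, type: SRV }
    - { record: _pop3._tcp, type: SRV }
    - { record: _pop3s._tcp, type: SRV }
    - { record: _submission._tcp, type: SRV }
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  delegate_to: localhost
```

Running this in check mode first shows which of the listed records actually exist in the zone.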
## Setup sequence (the runbook)
Legend: **[H]** human · **[A]** agent (from `ubongo`, committed code + check-mode).
1. **[H]** Register `wingu.me` at Gandi; pay. **[H]** Issue a **LiveDNS-scoped PAT**
for it; store in vault (`vault.gandi.pat`) via rbw.
2. **[A]** Author the `public_dns` role + `public_dns__records` data (incl. the anti-spoof
baseline); add `community.general` to `requirements.yml` (≥9.0.0, with comment); commit.
3. **[A]** One-time: **purge Gandi's 13 seeded defaults** (parking `A`, `www` redirect,
Gandi mail `MX`/SPF/DKIM/`webmail`/`SRV`) down to the boma baseline.
4. **[A]** `make check` (diff vs live Gandi) → `make deploy` to load records → `dig`
verify. Re-run `make deploy` to confirm idempotence.
5. Thereafter the zone is reconciled as code; M4 adds the `askari`/NetBird records.
No registrar transfer, no nameserver flip of a live zone, no service-preservation,
no Forgejo rename — all of that belonged to the abandoned `baobab.band` framing.
## Division of labour & access (security posture)
| Task | Who | How |
|---|---|---|
| Register domain + pay | Human | Identity/billing/ToS — not automatable. |
| Issue + store the PAT | Human | LiveDNS-scoped, single-domain; into vault via rbw. |
| `public_dns` role + record data | Agent | Committed IaC; `make check` diff. |
| Create zone + load records + reconcile | Agent | `public_dns` on `ubongo`, PAT from vault, check-mode first. |
- **Minimal token scope.** Gandi PAT: **LiveDNS-only**, restricted to `wingu.me`.
- **Token in vault** (`vault.gandi.pat`) via rbw — never pasted in chat.
- **Execution on `ubongo`**, committed role + `make check` → `make deploy`. No agent
sandbox holds production credentials.
## Testing & verification
External-API reconciliation does not fit container Molecule cleanly (a nuance against
ADR-008). Instead: **`make check` (check-mode + diff)**, **idempotence** (second deploy =
no changes), **`dig` assertions** post-load, and optionally a small pytest over the
`public_dns__records` data shape (mirrors `test_firewall_rules.py`).
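
The `dig` assertions could be expressed as post-load tasks, roughly as sketched here. The register names are hypothetical, and the checks assume `dig` is installed on the control node and that the resolver in use already sees the new records.

```yaml
# Post-load verification sketch; read-only, so safe to run in check mode too.
- name: Query the apex MX from public DNS
  ansible.builtin.command: dig +short MX {{ public_dns__domain }}
  register: public_dns__dig_mx
  changed_when: false
  check_mode: false
  delegate_to: localhost

- name: Query the apex TXT records from public DNS
  ansible.builtin.command: dig +short TXT {{ public_dns__domain }}
  register: public_dns__dig_txt
  changed_when: false
  check_mode: false
  delegate_to: localhost

- name: Assert the anti-spoof baseline is published
  ansible.builtin.assert:
    that:
      - "'0 .' in public_dns__dig_mx.stdout"
      - "'v=spf1 -all' in public_dns__dig_txt.stdout"
    fail_msg: "Anti-spoof baseline not visible in public DNS yet"
```

Caching of the old seeded records can make these checks fail briefly after the first deploy, so a retry after the previous TTL expires may be needed.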
## Scope boundaries — what M1 is NOT
- **Not** a migration of `baobab.band` or `ziethen.dk` — and **not** the Cloudflare exit /
V4 decommission. Those are separate, later tracks.
- **Not** the internal split-horizon `dns` role (renders `<service>.wingu.me`
privately) — that needs the `dns` role + actual home services → **Phase 2**.
- **Not** certificate issuance or the reverse proxy — **M4 (askari) / Phase 2 (home)**.
- **Not** authoritative whole-zone pruning — additive for now.
## ADR work
Amend **ADR-007**: boma's public zone is **`wingu.me` at Gandi LiveDNS, managed as
code** (replaces "Cloudflare or equivalent"); record the **three-tier naming scheme**;
remove the `nyumbani` example; state the **mesh/LAN-only default**; note `public_dns` as
the control-node role rendering the public zone (sibling to the internal `dns` role). Note
that `baobab.band` (legacy, Cloudflare) is **not** boma's zone and is out of ADR-007's
scope going forward.
## Open items (resolve during the plan / implementation)
- ~~Pick the domain~~ **DONE:** `wingu.me` registered at Gandi; LiveDNS PAT verified
(2026-06-14) and stored in vault as `vault.gandi.pat`.
- **Pin** the `community.general` version in `requirements.yml` (≥9.0.0).
- **Play wiring:** a dedicated `dns.yml` play (control-targeted) vs folding into an
existing play — decide in the plan.