boma/docs/superpowers/specs/2026-06-11-public-dns-gandi-migration-design.md
sjat 7a47dd9dec docs(spec): M1 — public DNS migration to Gandi (DNS-as-code) design
Settles the M1 design: full registrar transfer Cloudflare -> Gandi; three-tier
naming scheme (host.boma / service.bare / service.askari), nyumbani dropped,
mesh/LAN-only default; public-DNS-as-code via a control-node `public_dns` role
driven by group_vars data, using community.general.gandi_livedns with a PAT
(api_key is deprecated/rejected by Gandi — verified per ADR-014). Stale records +
unused MX cleaned by omission. Cert scope is DNS+PAT only (issuance deferred to
M4/Phase 2). Human/agent division of labour + token-scoping recorded.

Resolves TODO 4 and review finding O12 once the ADR-007 amendment lands. Point
ROADMAP.md M1 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 23:17:19 +02:00

# Design — Public DNS migration to Gandi (DNS-as-code)
- **Date:** 2026-06-11
- **Status:** Draft for review — design settled in brainstorming; pending user review,
then implementation plan
- **Roadmap milestone:** M1 (`docs/ROADMAP.md`)
- **Resolves:** TODO 4 (split-horizon FQDN — with/without `nyumbani`); review finding
O12 (ADR-007 FQDN convention contradicts its own example)
- **Amends:** ADR-007 — public DNS provider → **Gandi LiveDNS, managed as code**; the
three-tier naming scheme; `nyumbani` removed; mesh/LAN-only default exposure
- **Becomes:** an ADR-007 amendment (no new ADR unless the `public_dns` role grows
concerns of its own)
---
## Problem
Move `baobab.band` authoritative DNS **and registration** off Cloudflare to **Gandi**.
The driver is values/sovereignty (Gandi over Cloudflare) — it is **not** a NetBird
prerequisite, but it is sequenced first (roadmap M1) so `askari`'s records are born at
Gandi and Cloudflare is never touched again. Do it **as code**, consistent with boma's
grain: internal DNS is already Ansible-rendered and Terraform owns *no* DNS (CLAUDE.md).
While in here, settle the long-open naming question (`nyumbani`, TODO 4 / O12).
## Decisions (as settled)
1. **Full registrar transfer.** Registration *and* authoritative DNS move to Gandi —
fully exits Cloudflare. (DNS-only would strand the registration at Cloudflare and is
likely impossible anyway, since Cloudflare Registrar requires Cloudflare nameservers.)
2. **Three-tier naming scheme** (the convention — see table below). `nyumbani` is
**dropped**.
3. **Mesh/LAN-only by default.** Home/cluster services have **no public record**; they
are reached over LAN or the NetBird mesh. Public Gandi records exist only for
deliberate exceptions (today: `forgejo`, the `askari` tier).
4. **DNS-as-code via a control-node `public_dns` role** driven by structured record data
in `group_vars` — the same pattern as the firewall catalog, and exactly what ADR-007
already calls "service/alias/split-horizon records … explicit zone data in
`group_vars`." Name is provider-agnostic on purpose.
5. **Tooling: `community.general.gandi_livedns` with `personal_access_token`** (PAT).
Re-adds `community.general` to `requirements.yml` under the collections-on-demand
policy (a committed role now uses `gandi_livedns`), pinned `>=9.0.0`, with the naming
comment.
6. **Clean by omission.** Stale records and the (unused) MX are *not* deleted at
Cloudflare — the zone is abandoned. Only wanted records are carried to Gandi.
7. **Cert scope: DNS + PAT only.** M1 ends at the migrated zone + the PAT in vault, which
*enables* ACME DNS-01 later. **No certificate issuance in M1** — that lands with a
reverse proxy (askari in M4, home in Phase 2).
8. **Human/agent division of labour** (see table) — account, payment, registrar
transfer, and the go-live nameserver flip are human; all record-wrangling, the IaC,
and the post-flip cutover are the agent's, executed from `ubongo`.
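
Decision 5's pin could look like the following `requirements.yml` entry. This is a sketch: the comment wording is an assumption, and whether to add an upper bound is left to the implementation plan.

```yaml
# requirements.yml
collections:
  # community.general: needed by the public_dns role (gandi_livedns module).
  # PAT support (personal_access_token) arrived in 9.0.0, hence the lower bound.
  - name: community.general
    version: ">=9.0.0"
```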
## Verified facts (ADR-014)
> verified: `community.general.gandi_livedns` requires `personal_access_token` (PAT);
> `api_key` is deprecated and **rejected** by Gandi (Bearer auth replaced Apikey) ·
> WebFetch docs.ansible.com + WebSearch (Gandi PAT announcement 2023-09; community.general
> issue #7926) · PAT param added in **community.general 9.0.0**, **13.0.1** current ·
> 2026-06-11
> - Module params: `domain`, `record`, `type`, `values` (list), `ttl`, `state`
> (`present`/`absent`). Supports **check mode + diff**.
> - Auth is per-task: pass `personal_access_token: "{{ vault.gandi.pat }}"`.
> unverified (from memory — confirm during implementation): the current registrar of
> `baobab.band` (WHOIS) — determines whether the transfer is Cloudflare→Gandi or
> elsewhere→Gandi, and the exact unlock/EPP steps.
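
As a minimal sketch of the module call these facts describe, a single-record task might look like the following. The IP is a placeholder from the 203.0.113.0/24 documentation range, not a real value, and the task name is illustrative.

```yaml
# Sketch only: one record, authenticated per task with the PAT.
- name: Ensure the forgejo A record exists at Gandi LiveDNS
  community.general.gandi_livedns:
    personal_access_token: "{{ vault.gandi.pat }}"  # api_key is rejected by Gandi
    domain: baobab.band
    record: forgejo
    type: A
    values:
      - 203.0.113.10  # placeholder (documentation range), not the real IP
    ttl: 1800
    state: present
  delegate_to: localhost  # API call only; there is no managed host
```

Because the module supports check mode and diff, running this with `--check --diff` should report the pending change without touching the zone.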
## Naming scheme (the convention)
| Tier | Pattern | Authoritative source | Public? |
|---|---|---|---|
| Infrastructure / hosts | `<host>.boma.baobab.band` | internal zone (`dns1`/`dns2`, Phase 2) | never |
| Home / cluster services | `<service>.baobab.band` | internal zone (split-horizon) | only deliberate exceptions |
| Off-site / VPS services | `<service>.askari.baobab.band` | Gandi LiveDNS | yes (askari has a stable public IP) |
- **`nyumbani` removed.** It namespaced "home," but home is the default; only the
*exception* needs naming, and `askari.baobab.band` does that, self-documenting.
- **The mesh carries "internal" to road-warriors.** NetBird pushes `dns1`/`dns2` (over
`wt0`) as the resolver for the `baobab.band` match-domain, so on-LAN-or-on-mesh →
internal answer; truly public → Gandi (ties M1 ↔ ADR-016 / M5).
- **Wildcard TLS later.** A `*.baobab.band` (and `*.askari.baobab.band`) ACME **DNS-01**
cert via the Gandi PAT gives even unexposed services real public-CA TLS — without a
public A record. Enabled by M1, issued in M4/Phase 2.
## Architecture — two deliverables (kept separate on purpose)
### (A) One-time migration — a runbook (`docs/runbooks/`)
Registrar transfers and the nameserver flip cannot be IaC'd. This is a human-gated
procedure (sequence below), executed once.
### (B) `public_dns` — the reusable IaC role
- Runs **from the control node** (`delegate_to: localhost`, or a `dns.yml` play targeting
`control`) against the Gandi LiveDNS API — there is no managed *host*, only API calls.
- Reconciles records from **`group_vars` data** via `community.general.gandi_livedns`,
PAT from `vault.gandi.pat`.
- **Check-mode/diff first**, always (boma's check-before-deploy; the module supports it).
- Carries only the public-tier records (exceptions + `askari` tier); the mesh/LAN-only
default keeps this set small.
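
One of the two wiring options named above, a dedicated control-targeted play, could be sketched as follows. The file name `dns.yml` and the `control` group come from this document; everything else (play name, `gather_facts` setting) is an assumption, and the choice between this and folding into an existing play remains open.

```yaml
# dns.yml (sketch; play wiring is still an open item)
- name: Reconcile public DNS at Gandi LiveDNS
  hosts: control          # the control node; no remote host is managed
  gather_facts: false     # assumption: the role needs no host facts
  roles:
    - role: public_dns
```

The raw check-before-deploy invocation for such a play would be `ansible-playbook dns.yml --check --diff`; how `make check` wraps it is not specified here.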
#### Data model (sketch)
```yaml
# inventories/production/group_vars/all/public_dns.yml
public_dns__domain: baobab.band
public_dns__records:
- { record: forgejo, type: A, values: ["<home-ingress-ip>"], ttl: 1800 }
- { record: askari, type: A, values: ["<hetzner-ip>"], ttl: 1800 }
# mesh/LAN-only services are intentionally ABSENT — they live only in the internal zone.
# PAT referenced as {{ vault.gandi.pat }} (nested vault.<service>.<key>, CLAUDE.md).
```
#### Open design nuance — additive vs authoritative
`gandi_livedns` is **per-record** (`present`/`absent`); it does not sync a whole zone. To
make the repo *authoritative* (prune undeclared records — cf. TODO 8.3's prune question),
the role would need to GET existing records and remove those not declared. **M1 decision:**
start **additive** (declare what we want; remove the old via explicit `absent` entries
during cutover); flag full-zone pruning as a possible later enhancement. This avoids
accidentally deleting a record someone added out-of-band before the repo is the single
source of truth.
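
The additive approach could be sketched as below: records default to `present`, and removals are spelled out as explicit `absent` entries. The record name `old-service` and the file path in the comment are hypothetical. Unverified (confirm during implementation): that the module requires `values` and `ttl` only when `state` is `present`.

```yaml
# Data: wanted records plus explicit removals (old-service is a made-up name).
public_dns__records:
  - { record: forgejo, type: A, values: ["<home-ingress-ip>"], ttl: 1800 }
  - { record: old-service, type: A, state: absent }
```

```yaml
# roles/public_dns/tasks/main.yml (hypothetical path)
- name: Reconcile declared records (additive; removals are explicit)
  community.general.gandi_livedns:
    personal_access_token: "{{ vault.gandi.pat }}"
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    # item['values'], not item.values: the dotted form resolves to the
    # dict method in Jinja2 rather than the key.
    values: "{{ item['values'] | default(omit) }}"
    ttl: "{{ item.ttl | default(omit) }}"
    state: "{{ item.state | default('present') }}"
  loop: "{{ public_dns__records }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  delegate_to: localhost
```

Nothing here lists or prunes undeclared records, so a record added out-of-band survives until someone declares it `absent`.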
## Cutover sequence (the runbook)
Legend: **[H]** human · **[A]** agent (from `ubongo`, committed code + check-mode).
1. **[A]** Inventory: parse the **Cloudflare zone export** (BIND file the user downloads,
tokenless) → full record list; classify keep / rename / drop (incl. unused MX + stale).
2. **[A]** Draft `public_dns__records` (new scheme) + the `public_dns` role; PR/commit;
`make check` shows the intended Gandi state as a diff.
3. **[H]** Create/verify the Gandi account; issue a **LiveDNS-scoped PAT** for
`baobab.band`; store it in vault (`vault.gandi.pat`) via rbw. **[H]** Lower TTLs on the
*old* Cloudflare zone ~24-48h ahead.
4. **[A]** Create the zone in Gandi LiveDNS and load records (`make deploy`, after a clean
`make check`). Validate with `dig @<gandi-ns>`.
5. **[H]** Initiate the **registrar transfer** to Gandi (unlock at Cloudflare, get
EPP/auth code, start at Gandi, ACK to expedite; ~5 days — DNS keeps resolving).
6. **[H, go-live]** **Flip nameservers** to Gandi LiveDNS. (Irreversible/outward-facing —
explicit human go.)
7. **[A]** Post-flip: validate resolution; **rename the Forgejo remote + CI**
(`forgejo.nyumbani.baobab.band` → `forgejo.baobab.band`); verify a push.
8. **[A/H]** Confirm propagation; **[H]** decommission the Cloudflare zone.
## Division of labour & access (security posture)
| Task | Who | How |
|---|---|---|
| Zone inventory | Agent | From the Cloudflare **export** (tokenless). |
| New record set + `public_dns` role + data | Agent | Committed IaC; `make check` diff. |
| Gandi account, transfer, payment | Human | Identity/billing/e-mail/ToS — not automatable. |
| Create zone + load records + reconcile | Agent | `public_dns` role on `ubongo`, PAT from vault, check-mode first. |
| Nameserver flip / go-live | Human-gated | Agent preps + validates; human flips. |
| Forgejo remote + CI cutover | Agent | After flip; verify push. |
| Delete stale Cloudflare records | Nobody | Cleaned by omission. |
- **Minimal token scope.** Gandi PAT: **LiveDNS-only**, restricted to `baobab.band`.
Cloudflare: prefer the **tokenless export**; if an API token is used, **read-only,
single-zone, throwaway** — revoke once inventory is captured.
- **Tokens live in boma's vault** (`vault.gandi.pat`) via rbw — never pasted in chat.
- **Execution on `ubongo`**, not in any agent sandbox: committed role + `make check` →
`make deploy`. Irreversible/outward steps (NS flip, go-live) require explicit human
confirmation.
## Testing & verification
External-API reconciliation does not fit container Molecule cleanly (a nuance against
ADR-008 — not every role gets a converge-in-a-container scenario). Instead:
- **`make check` (check-mode + diff)** against live Gandi before any apply.
- **Idempotence:** a second `make deploy` reports no changes.
- **`dig` assertions** post-cutover: new names resolve to expected values; a Forgejo
push over `forgejo.baobab.band` succeeds.
- Optionally a small pytest over the `public_dns__records` data shape (types, no
duplicate record/type pairs), mirroring `test_firewall_rules.py`.
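
As an in-playbook alternative to the pytest mentioned above (a different technique, not a replacement for it), the same data-shape checks could be sketched as pre-flight `assert` tasks. Task names and the failure message are assumptions; the checks assume every entry is a `present` record with `values` and `ttl` set.

```yaml
# Hypothetical pre-flight tasks, run before any Gandi API call.
- name: Check each declared record has the expected shape
  ansible.builtin.assert:
    that:
      - item.record is string
      - item.type is string
      # item['values'], not item.values (the latter is the dict method in Jinja2).
      - item['values'] is sequence
      - item['values'] is not string
      - item.ttl is number
    quiet: true
  loop: "{{ public_dns__records }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"

- name: Reject duplicate record/type pairs
  vars:
    # Build "record/type" strings, e.g. "forgejo/A", then compare counts.
    pairs: >-
      {{ public_dns__records | map(attribute='record')
         | zip(public_dns__records | map(attribute='type'))
         | map('join', '/') | list }}
  ansible.builtin.assert:
    that:
      - pairs | length == pairs | unique | length
    fail_msg: "Duplicate record/type pair in public_dns__records"
```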
## Scope boundaries — what M1 is NOT
- **Not** the internal split-horizon `dns` role (renders `<service>.baobab.band`
privately) — that needs the `dns` role + actual home services → **Phase 2**.
- **Not** certificate issuance or the reverse proxy — **M4 (askari) / Phase 2 (home)**.
- **Not** authoritative whole-zone pruning — additive for now (see nuance above).
## ADR work
Amend **ADR-007**: public zone provider → **Gandi LiveDNS, managed as code** (replaces
"Cloudflare or equivalent"); record the **three-tier naming scheme**; remove the
`nyumbani` example; state the **mesh/LAN-only default**. Note `public_dns` as the
control-node role that renders the public zone (sibling to the internal `dns` role).
## Open items (resolve during the plan / implementation)
- **Cloudflare zone export** → the exact record list (execution input, not a design gap).
- **WHOIS** the current registrar → confirm transfer source + unlock/EPP steps.
- **Pin** the `community.general` version in `requirements.yml` (≥9.0.0).
- **Play wiring:** a dedicated `dns.yml` play (control-targeted) vs folding into an
existing play — decide in the plan.