boma/docs/superpowers/specs/2026-06-11-public-dns-gandi-migration-design.md
sjat 7a47dd9dec docs(spec): M1 — public DNS migration to Gandi (DNS-as-code) design
Settles the M1 design: full registrar transfer Cloudflare -> Gandi; three-tier
naming scheme (host.boma / service.bare / service.askari), nyumbani dropped,
mesh/LAN-only default; public-DNS-as-code via a control-node `public_dns` role
driven by group_vars data, using community.general.gandi_livedns with a PAT
(api_key is deprecated/rejected by Gandi — verified per ADR-014). Stale records +
unused MX cleaned by omission. Cert scope is DNS+PAT only (issuance deferred to
M4/Phase 2). Human/agent division of labour + token-scoping recorded.

Resolves TODO 4 and review finding O12 once the ADR-007 amendment lands. Point
ROADMAP.md M1 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 23:17:19 +02:00

Design — Public DNS migration to Gandi (DNS-as-code)

  • Date: 2026-06-11
  • Status: Draft for review — design settled in brainstorming; pending user review, then implementation plan
  • Roadmap milestone: M1 (docs/ROADMAP.md)
  • Resolves: TODO 4 (split-horizon FQDN — with/without nyumbani); review finding O12 (ADR-007 FQDN convention contradicts its own example)
  • Amends: ADR-007 — public DNS provider → Gandi LiveDNS, managed as code; the three-tier naming scheme; nyumbani removed; mesh/LAN-only default exposure
  • Becomes: an ADR-007 amendment (no new ADR unless the public_dns role grows concerns of its own)

Problem

Move baobab.band authoritative DNS and registration off Cloudflare to Gandi. The driver is values/sovereignty (Gandi over Cloudflare) — it is not a NetBird prerequisite, but it is sequenced first (roadmap M1) so askari's records are born at Gandi and Cloudflare is never touched again. Do it as code, consistent with boma's grain: internal DNS is already Ansible-rendered and Terraform owns no DNS (CLAUDE.md). While in here, settle the long-open naming question (nyumbani, TODO 4 / O12).

Decisions (as settled)

  1. Full registrar transfer. Registration and authoritative DNS move to Gandi — fully exits Cloudflare. (DNS-only would strand the registration at Cloudflare and is likely impossible anyway, since Cloudflare Registrar requires Cloudflare nameservers.)
  2. Three-tier naming scheme (the convention — see table below). nyumbani is dropped.
  3. Mesh/LAN-only by default. Home/cluster services have no public record; they are reached over LAN or the NetBird mesh. Public Gandi records exist only for deliberate exceptions (today: forgejo, the askari tier).
  4. DNS-as-code via a control-node public_dns role driven by structured record data in group_vars — the same pattern as the firewall catalog, and exactly what ADR-007 already calls "service/alias/split-horizon records … explicit zone data in group_vars." Name is provider-agnostic on purpose.
  5. Tooling: community.general.gandi_livedns with personal_access_token (PAT). Re-adds community.general to requirements.yml under the collections-on-demand policy (a committed role now uses gandi_livedns), pinned >=9.0.0, with the naming comment.
  6. Clean by omission. Stale records and the (unused) MX are not deleted at Cloudflare — the zone is abandoned. Only wanted records are carried to Gandi.
  7. Cert scope: DNS + PAT only. M1 ends at the migrated zone + the PAT in vault, which enables ACME DNS-01 later. No certificate issuance in M1 — that lands with a reverse proxy (askari in M4, home in Phase 2).
  8. Human/agent division of labour (see table) — account, payment, registrar transfer, and the go-live nameserver flip are human; all record-wrangling, the IaC, and the post-flip cutover are the agent's, executed from ubongo.
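Decision 5 could translate into a requirements.yml entry along the following lines. This is a sketch only: the rest of the file is not shown in this spec, and the comment wording is an assumption about what the collections-on-demand policy expects.

```yaml
# requirements.yml (excerpt, sketch; other entries omitted)
collections:
  # community.general: used by the public_dns role (gandi_livedns module).
  # The comment naming the consuming role is an assumed convention.
  - name: community.general
    version: ">=9.0.0"   # PAT support is needed, per the verified facts in this spec
```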

Verified facts (ADR-014)

verified: community.general.gandi_livedns requires personal_access_token (PAT); api_key is deprecated and rejected by Gandi (Bearer auth replaced Apikey) · WebFetch docs.ansible.com + WebSearch (Gandi PAT announcement 2023-09; community.general issue #7926) · PAT param added in community.general 9.0.0, 13.0.1 current · 2026-06-11

  • Module params: domain, record, type, values (list), ttl, state (present/absent). Supports check mode + diff.
  • Auth is per-task: pass personal_access_token: "{{ vault.gandi.pat }}".
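As an illustration of the parameters listed above, a single-record task might look like this. The IP is an RFC 5737 documentation placeholder, not the real ingress address, and the use of delegate_to here is an assumption about how the role will be wired.

```yaml
- name: Ensure the forgejo A record exists at Gandi LiveDNS
  community.general.gandi_livedns:
    domain: baobab.band
    record: forgejo
    type: A
    values:
      - "192.0.2.10"   # placeholder address, replace with the real value
    ttl: 1800
    state: present
    personal_access_token: "{{ vault.gandi.pat }}"   # PAT, not the deprecated api_key
  delegate_to: localhost   # API call only; there is no managed host
```

Because the module supports check mode and diff, running the play with --check --diff should preview the change without touching the zone.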

unverified (from memory — confirm during implementation): the current registrar of baobab.band (WHOIS) — determines whether the transfer is Cloudflare→Gandi or elsewhere→Gandi, and the exact unlock/EPP steps.

Naming scheme (the convention)

| Tier | Pattern | Authoritative source | Public? |
| --- | --- | --- | --- |
| Infrastructure / hosts | <host>.boma.baobab.band | internal zone (dns1/dns2, Phase 2) | never |
| Home / cluster services | <service>.baobab.band | internal zone (split-horizon) | only deliberate exceptions |
| Off-site / VPS services | <service>.askari.baobab.band | Gandi LiveDNS | yes (askari has a stable public IP) |

  • nyumbani removed. It namespaced "home," but home is the default; only the exception needs naming, and askari.baobab.band does that, self-documenting.
  • The mesh carries "internal" to road-warriors. NetBird pushes dns1/dns2 (over wt0) as the resolver for the baobab.band match-domain, so on-LAN-or-on-mesh → internal answer; truly public → Gandi (ties M1 ↔ ADR-016 / M5).
  • Wildcard TLS later. A *.baobab.band (and *.askari.baobab.band) ACME DNS-01 cert via the Gandi PAT gives even unexposed services real public-CA TLS — without a public A record. Enabled by M1, issued in M4/Phase 2.

Architecture — two deliverables (kept separate on purpose)

(A) One-time migration — a runbook (docs/runbooks/)

Registrar transfers and the nameserver flip cannot be IaC'd. This is a human-gated procedure (sequence below), executed once.

(B) public_dns — the reusable IaC role

  • Runs from the control node (delegate_to: localhost, or a dns.yml play targeting control) against the Gandi LiveDNS API — there is no managed host, only API calls.
  • Reconciles records from group_vars data via community.general.gandi_livedns, PAT from vault.gandi.pat.
  • Check-mode/diff first, always (boma's check-before-deploy; the module supports it).
  • Carries only the public-tier records (exceptions + askari tier); the mesh/LAN-only default keeps this set small.

Data model (sketch)

# inventories/production/group_vars/all/public_dns.yml
public_dns__domain: baobab.band
public_dns__records:
  - { record: forgejo, type: A, values: ["<home-ingress-ip>"], ttl: 1800 }
  - { record: askari,  type: A, values: ["<hetzner-ip>"],      ttl: 1800 }
  # mesh/LAN-only services are intentionally ABSENT — they live only in the internal zone.
# PAT referenced as {{ vault.gandi.pat }} (nested vault.<service>.<key>, CLAUDE.md).
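
A minimal sketch of how the role could consume this data follows. The file path and the optional per-record state key are assumptions, not settled design. Note the bracket lookup item['values']: in Jinja2, item.values resolves to the dict method rather than the key.

```yaml
# roles/public_dns/tasks/main.yml (hypothetical path)
- name: Reconcile public DNS records at Gandi LiveDNS
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    # item.values would return the dict's built-in method, so use a subscript.
    values: "{{ item['values'] | default(omit) }}"
    ttl: "{{ item.ttl | default(omit) }}"
    # 'state' is an assumed optional key; records default to present.
    state: "{{ item.state | default('present') }}"
    personal_access_token: "{{ vault.gandi.pat }}"
  loop: "{{ public_dns__records }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"   # keep loop output short
  delegate_to: localhost
```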

Open design nuance — additive vs authoritative

gandi_livedns operates per record (present/absent); it does not sync a whole zone. To make the repo authoritative (that is, to prune undeclared records; cf. TODO 8.3's prune question), the role would need to GET the existing records and remove those not declared. M1 decision: start additive (declare what we want; remove the old via explicit absent entries during cutover) and flag full-zone pruning as a possible later enhancement. This avoids accidentally deleting a record someone added out-of-band before the repo is the single source of truth.
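
The additive approach with explicit absent entries could be expressed in the data like this. The state key and the record name legacy-example are hypothetical, and the sketch assumes the module only needs values and ttl when state is present, and that the role's task passes state through to the module.

```yaml
public_dns__records:
  # Wanted record: declared as usual (state defaults to present in the role).
  - { record: forgejo, type: A, values: ["<home-ingress-ip>"], ttl: 1800 }
  # Cutover-only entry (hypothetical name): removed explicitly, then dropped
  # from this list once a check run shows nothing left to delete.
  - { record: legacy-example, type: A, state: absent }
```

Records that are neither declared present nor listed absent are left untouched, which is what keeps out-of-band additions safe under this model.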

Cutover sequence (the runbook)

Legend: [H] human · [A] agent (from ubongo, committed code + check-mode).

  1. [A] Inventory: parse the Cloudflare zone export (BIND file the user downloads, tokenless) → full record list; classify keep / rename / drop (incl. unused MX + stale).
  2. [A] Draft public_dns__records (new scheme) + the public_dns role; PR/commit; make check shows the intended Gandi state as a diff.
  3. [H] Create/verify the Gandi account; issue a LiveDNS-scoped PAT for baobab.band; store it in vault (vault.gandi.pat) via rbw. [H] Lower TTLs on the old Cloudflare zone ~24–48h ahead.
  4. [A] Create the zone in Gandi LiveDNS and load records (make deploy, after a clean make check). Validate with dig @<gandi-ns>.
  5. [H] Initiate the registrar transfer to Gandi (unlock at Cloudflare, get EPP/auth code, start at Gandi, ACK to expedite; ~5 days — DNS keeps resolving).
  6. [H, go-live] Flip nameservers to Gandi LiveDNS. (Irreversible/outward-facing — explicit human go.)
  7. [A] Post-flip: validate resolution; rename the Forgejo remote + CI (forgejo.nyumbani.baobab.band → forgejo.baobab.band); verify a push.
  8. [A/H] Confirm propagation; [H] decommission the Cloudflare zone.

Division of labour & access (security posture)

| Task | Who | How |
| --- | --- | --- |
| Zone inventory | Agent | From the Cloudflare export (tokenless). |
| New record set + public_dns role + data | Agent | Committed IaC; make check diff. |
| Gandi account, transfer, payment | Human | Identity/billing/e-mail/ToS; not automatable. |
| Create zone + load records + reconcile | Agent | public_dns role on ubongo, PAT from vault, check-mode first. |
| Nameserver flip / go-live | Human-gated | Agent preps + validates; human flips. |
| Forgejo remote + CI cutover | Agent | After flip; verify push. |
| Delete stale Cloudflare records | Nobody | Cleaned by omission. |

  • Minimal token scope. Gandi PAT: LiveDNS-only, restricted to baobab.band. Cloudflare: prefer the tokenless export; if an API token is used, read-only, single-zone, throwaway — revoke once inventory is captured.
  • Tokens live in boma's vault (vault.gandi.pat) via rbw — never pasted in chat.
  • Execution on ubongo, not in any agent sandbox: committed role + make check → make deploy. Irreversible/outward steps (NS flip, go-live) require explicit human confirmation.

Testing & verification

External-API reconciliation does not fit container Molecule cleanly (a nuance against ADR-008 — not every role gets a converge-in-a-container scenario). Instead:

  • make check (check-mode + diff) against live Gandi before any apply.
  • Idempotence: a second make deploy reports no changes.
  • dig assertions post-cutover: new names resolve to expected values; a Forgejo push over forgejo.baobab.band succeeds.
  • Optionally a small pytest over the public_dns__records data shape (types, no duplicate record/type pairs), mirroring test_firewall_rules.py.
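
The dig assertions could be sketched as a pair of read-only tasks driven by the same record data. The variable public_dns__verify_resolver is hypothetical (for example a Gandi nameserver before the flip, or a public resolver after it), and the comparison as written only suits record types whose dig +short output matches the declared values verbatim, such as A records.

```yaml
- name: Resolve each declared record against the chosen resolver
  ansible.builtin.command:
    argv:
      - dig
      - +short
      - "@{{ public_dns__verify_resolver }}"   # hypothetical variable
      - "{{ item.record }}.{{ public_dns__domain }}"
      - "{{ item.type }}"
  loop: "{{ public_dns__records }}"
  register: public_dns__dig
  changed_when: false   # a lookup never changes anything
  check_mode: false     # read-only, so it may also run under make check
  delegate_to: localhost

- name: Assert each answer matches the declared values
  ansible.builtin.assert:
    that:
      # item.item is the original record; subscript avoids the dict method.
      - (item.stdout_lines | sort) == (item.item['values'] | sort)
    fail_msg: "{{ item.item.record }} does not resolve to the declared values"
  loop: "{{ public_dns__dig.results }}"
  loop_control:
    label: "{{ item.item.record }} {{ item.item.type }}"
```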

Scope boundaries — what M1 is NOT

  • Not the internal split-horizon dns role (renders <service>.baobab.band privately) — that needs the dns role + actual home services → Phase 2.
  • Not certificate issuance or the reverse proxy — M4 (askari) / Phase 2 (home).
  • Not authoritative whole-zone pruning — additive for now (see nuance above).

ADR work

Amend ADR-007: public zone provider → Gandi LiveDNS, managed as code (replaces "Cloudflare or equivalent"); record the three-tier naming scheme; remove the nyumbani example; state the mesh/LAN-only default. Note public_dns as the control-node role that renders the public zone (sibling to the internal dns role).

Open items (resolve during the plan / implementation)

  • Cloudflare zone export → the exact record list (execution input, not a design gap).
  • WHOIS the current registrar → confirm transfer source + unlock/EPP steps.
  • Pin the community.general version in requirements.yml (≥9.0.0).
  • Play wiring: a dedicated dns.yml play (control-targeted) vs folding into an existing play — decide in the plan.
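
For the play-wiring item, the dedicated-play option might be as small as the sketch below. The inventory group name control is an assumption (taken from "control-targeted" above) and is expected to resolve to ubongo; folding the role into an existing play remains the alternative to decide in the plan.

```yaml
# dns.yml (sketch of the dedicated-play option)
- name: Reconcile public DNS at Gandi LiveDNS
  hosts: control        # assumed inventory group containing the control node
  gather_facts: false   # only API calls are made; no host facts are needed
  roles:
    - role: public_dns
```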