Design — Public DNS migration to Gandi (DNS-as-code)
- Date: 2026-06-11
- Status: Draft for review — design settled in brainstorming; pending user review, then implementation plan
- Roadmap milestone: M1 (`docs/ROADMAP.md`)
- Resolves: TODO 4 (split-horizon FQDN, with/without `nyumbani`); review finding O12 (ADR-007 FQDN convention contradicts its own example)
- Amends: ADR-007 (public DNS provider → Gandi LiveDNS, managed as code; the three-tier naming scheme; `nyumbani` removed; mesh/LAN-only default exposure)
- Becomes: an ADR-007 amendment (no new ADR unless the `public_dns` role grows concerns of its own)
Problem
Move baobab.band authoritative DNS and registration off Cloudflare to Gandi.
The driver is values/sovereignty (Gandi over Cloudflare) — it is not a NetBird
prerequisite, but it is sequenced first (roadmap M1) so askari's records are born at
Gandi and Cloudflare is never touched again. Do it as code, consistent with boma's
grain: internal DNS is already Ansible-rendered and Terraform owns no DNS (CLAUDE.md).
While in here, settle the long-open naming question (nyumbani, TODO 4 / O12).
Decisions (as settled)
- Full registrar transfer. Registration and authoritative DNS move to Gandi — fully exits Cloudflare. (DNS-only would strand the registration at Cloudflare and is likely impossible anyway, since Cloudflare Registrar requires Cloudflare nameservers.)
- Three-tier naming scheme (the convention; see table below). `nyumbani` is dropped.
- Mesh/LAN-only by default. Home/cluster services have no public record; they are reached over LAN or the NetBird mesh. Public Gandi records exist only for deliberate exceptions (today: `forgejo`, the `askari` tier).
- DNS-as-code via a control-node `public_dns` role driven by structured record data in `group_vars`: the same pattern as the firewall catalog, and exactly what ADR-007 already calls "service/alias/split-horizon records … explicit zone data in `group_vars`." The name is provider-agnostic on purpose.
- Tooling: `community.general.gandi_livedns` with `personal_access_token` (PAT). Re-adds `community.general` to `requirements.yml` under the collections-on-demand policy (a committed role now uses `gandi_livedns`), pinned `>=9.0.0`, with the naming comment.
- Clean by omission. Stale records and the (unused) MX are not deleted at Cloudflare; the zone is abandoned. Only wanted records are carried to Gandi.
- Cert scope: DNS + PAT only. M1 ends at the migrated zone + the PAT in vault, which enables ACME DNS-01 later. No certificate issuance in M1; that lands with a reverse proxy (askari in M4, home in Phase 2).
- Human/agent division of labour (see table): account, payment, registrar transfer, and the go-live nameserver flip are human; all record-wrangling, the IaC, and the post-flip cutover are the agent's, executed from `ubongo`.
Verified facts (ADR-014)
verified: `community.general.gandi_livedns` requires `personal_access_token` (PAT); `api_key` is deprecated and rejected by Gandi (Bearer auth replaced Apikey) · WebFetch docs.ansible.com + WebSearch (Gandi PAT announcement 2023-09; community.general issue #7926) · PAT param added in community.general 9.0.0, 13.0.1 current · 2026-06-11
- Module params: `domain`, `record`, `type`, `values` (list), `ttl`, `state` (present/absent). Supports check mode + diff.
- Auth is per-task: pass `personal_access_token: "{{ vault.gandi.pat }}"`.

unverified (from memory; confirm during implementation): the current registrar of `baobab.band` (WHOIS), which determines whether the transfer is Cloudflare→Gandi or elsewhere→Gandi, and the exact unlock/EPP steps.
Naming scheme (the convention)
| Tier | Pattern | Authoritative source | Public? |
|---|---|---|---|
| Infrastructure / hosts | `<host>.boma.baobab.band` | internal zone (dns1/dns2, Phase 2) | never |
| Home / cluster services | `<service>.baobab.band` | internal zone (split-horizon) | only deliberate exceptions |
| Off-site / VPS services | `<service>.askari.baobab.band` | Gandi LiveDNS | yes (askari has a stable public IP) |

- `nyumbani` removed. It namespaced "home," but home is the default; only the exception needs naming, and `askari.baobab.band` does that, self-documenting.
- The mesh carries "internal" to road-warriors. NetBird pushes `dns1/dns2` (over `wt0`) as the resolver for the `baobab.band` match-domain, so on-LAN-or-on-mesh → internal answer; truly public → Gandi (ties M1 ↔ ADR-016 / M5).
- Wildcard TLS later. A `*.baobab.band` (and `*.askari.baobab.band`) ACME DNS-01 cert via the Gandi PAT gives even unexposed services real public-CA TLS, without a public A record. Enabled by M1, issued in M4/Phase 2.
Architecture — two deliverables (kept separate on purpose)
(A) One-time migration — a runbook (docs/runbooks/)
Registrar transfers and the nameserver flip cannot be IaC'd. This is a human-gated procedure (sequence below), executed once.
(B) public_dns — the reusable IaC role
- Runs from the control node (`delegate_to: localhost`, or a `dns.yml` play targeting `control`) against the Gandi LiveDNS API; there is no managed host, only API calls.
- Reconciles records from `group_vars` data via `community.general.gandi_livedns`, PAT from `vault.gandi.pat`.
- Check-mode/diff first, always (boma's check-before-deploy; the module supports it).
- Carries only the public-tier records (exceptions + `askari` tier); the mesh/LAN-only default keeps this set small.
Data model (sketch)
# inventories/production/group_vars/all/public_dns.yml
public_dns__domain: baobab.band
public_dns__records:
- { record: forgejo, type: A, values: ["<home-ingress-ip>"], ttl: 1800 }
- { record: askari, type: A, values: ["<hetzner-ip>"], ttl: 1800 }
# mesh/LAN-only services are intentionally ABSENT — they live only in the internal zone.
# PAT referenced as {{ vault.gandi.pat }} (nested vault.<service>.<key>, CLAUDE.md).
Open design nuance — additive vs authoritative
gandi_livedns is per-record (present/absent); it does not whole-zone sync. To
make the repo authoritative (prune undeclared records — cf. TODO 8.3's prune question),
the role would need to GET existing records and remove those not declared. M1 decision:
start additive (declare what we want; remove the old via explicit absent entries
during cutover); flag full-zone pruning as a possible later enhancement. Avoids
accidentally deleting a record someone added out-of-band before the repo is the single
source of truth.
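
The difference between the two modes can be sketched as a plain set comparison. This is a minimal illustration, not the role's implementation: `plan_reconcile` and `record_keys` are hypothetical helper names, and the `existing` list stands in for whatever a GET against the zone would return. In additive mode the `undeclared` list is only reported; an authoritative mode would turn each entry into an `absent` task.

```python
from typing import Iterable

# A record is identified by (name, TYPE); values and TTL are ignored here.
RecordKey = tuple[str, str]


def record_keys(records: Iterable[dict]) -> set[RecordKey]:
    """Reduce record dicts to comparable (name, TYPE) keys."""
    return {(r["record"], r["type"].upper()) for r in records}


def plan_reconcile(declared: Iterable[dict], existing: Iterable[dict]) -> dict:
    """Split the zone into records to ensure and records nobody declared.

    Hypothetical helper: `existing` is assumed to have the same shape as
    the entries in public_dns__records.
    """
    declared_keys = record_keys(declared)
    existing_keys = record_keys(existing)
    return {
        "ensure_present": sorted(declared_keys),
        # Additive mode: report these only. Authoritative mode: prune them.
        "undeclared": sorted(existing_keys - declared_keys),
    }


if __name__ == "__main__":
    declared = [
        {"record": "forgejo", "type": "A"},
        {"record": "askari", "type": "A"},
    ]
    # Illustrative zone contents, not real data.
    existing = [
        {"record": "forgejo", "type": "A"},
        {"record": "@", "type": "MX"},
        {"record": "old", "type": "TXT"},
    ]
    print(plan_reconcile(declared, existing))
```

Keeping the plan as data makes the additive default easy to review: the undeclared list can be printed during `make check` without anything being removed.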
Cutover sequence (the runbook)
Legend: [H] human · [A] agent (from ubongo, committed code + check-mode).
- [A] Inventory: parse the Cloudflare zone export (BIND file the user downloads, tokenless) → full record list; classify keep / rename / drop (incl. unused MX + stale).
- [A] Draft `public_dns__records` (new scheme) + the `public_dns` role; PR/commit; `make check` shows the intended Gandi state as a diff.
- [H] Create/verify the Gandi account; issue a LiveDNS-scoped PAT for `baobab.band`; store it in vault (`vault.gandi.pat`) via rbw. [H] Lower TTLs on the old Cloudflare zone ~24–48h ahead.
- [A] Create the zone in Gandi LiveDNS and load records (`make deploy`, after a clean `make check`). Validate with `dig @<gandi-ns>`.
make deploy, after a cleanmake check). Validate withdig @<gandi-ns>. - [H] Initiate the registrar transfer to Gandi (unlock at Cloudflare, get EPP/auth code, start at Gandi, ACK to expedite; ~5 days — DNS keeps resolving).
- [H, go-live] Flip nameservers to Gandi LiveDNS. (Irreversible/outward-facing — explicit human go.)
- [A] Post-flip: validate resolution; rename the Forgejo remote + CI (`forgejo.nyumbani.baobab.band` → `forgejo.baobab.band`); verify a push.
- [A/H] Confirm propagation; [H] decommission the Cloudflare zone.
Division of labour & access (security posture)
| Task | Who | How |
|---|---|---|
| Zone inventory | Agent | From the Cloudflare export (tokenless). |
| New record set + `public_dns` role + data | Agent | Committed IaC; `make check` diff. |
| Gandi account, transfer, payment | Human | Identity/billing/e-mail/ToS — not automatable. |
| Create zone + load records + reconcile | Agent | public_dns role on ubongo, PAT from vault, check-mode first. |
| Nameserver flip / go-live | Human-gated | Agent preps + validates; human flips. |
| Forgejo remote + CI cutover | Agent | After flip; verify push. |
| Delete stale Cloudflare records | Nobody | Cleaned by omission. |
- Minimal token scope. Gandi PAT: LiveDNS-only, restricted to `baobab.band`. Cloudflare: prefer the tokenless export; if an API token is used, read-only, single-zone, throwaway; revoke once inventory is captured.
- Tokens live in boma's vault (`vault.gandi.pat`) via rbw, never pasted in chat.
- Execution on `ubongo`, not in any agent sandbox: committed role + `make check` → `make deploy`. Irreversible/outward steps (NS flip, go-live) require explicit human confirmation.
Testing & verification
External-API reconciliation does not fit container Molecule cleanly (a nuance against ADR-008 — not every role gets a converge-in-a-container scenario). Instead:
- `make check` (check-mode + diff) against live Gandi before any apply.
- Idempotence: a second `make deploy` reports no changes.
- `dig` assertions post-cutover: new names resolve to expected values; a Forgejo push over `forgejo.baobab.band` succeeds.
- Optionally a small pytest over the `public_dns__records` data shape (types, no duplicate record/type pairs), mirroring `test_firewall_rules.py`.
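
The optional data-shape test might look something like the sketch below. It is illustrative only: `validate_records` is a hypothetical helper, the required keys are taken from the data model sketch, and the layout of the existing `test_firewall_rules.py` is not known here, so nothing is mirrored from it. Loading the YAML file is left out; the function works on the already-parsed list.

```python
# Keys every entry in public_dns__records is assumed to carry (ttl is optional).
REQUIRED_KEYS = {"record", "type", "values"}


def validate_records(records: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the data is well-formed."""
    errors: list[str] = []
    seen: set[tuple[str, str]] = set()
    for index, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            errors.append(f"record #{index}: missing keys {sorted(missing)}")
            continue

        values = rec["values"]
        if (
            not isinstance(values, list)
            or not values
            or not all(isinstance(v, str) for v in values)
        ):
            errors.append(f"record #{index}: 'values' must be a non-empty list of strings")

        ttl = rec.get("ttl")
        # bool is a subclass of int, so reject it explicitly.
        if ttl is not None and (isinstance(ttl, bool) or not isinstance(ttl, int) or ttl <= 0):
            errors.append(f"record #{index}: 'ttl' must be a positive integer")

        # Record types are compared case-insensitively.
        key = (rec["record"], str(rec["type"]).upper())
        if key in seen:
            errors.append(f"record #{index}: duplicate record/type pair {key}")
        seen.add(key)
    return errors


def test_public_dns_records_shape():
    # Placeholder addresses from the documentation range, not real hosts.
    records = [
        {"record": "forgejo", "type": "A", "values": ["192.0.2.10"], "ttl": 1800},
        {"record": "askari", "type": "A", "values": ["192.0.2.20"], "ttl": 1800},
    ]
    assert validate_records(records) == []
```

Returning a list of problems rather than raising on the first one lets a single test run report every malformed entry at once.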
Scope boundaries — what M1 is NOT
- Not the internal split-horizon `dns` role (renders `<service>.baobab.band` privately); that needs the `dns` role + actual home services → Phase 2.
- Not certificate issuance or the reverse proxy: M4 (askari) / Phase 2 (home).
- Not authoritative whole-zone pruning — additive for now (see nuance above).
ADR work
Amend ADR-007: public zone provider → Gandi LiveDNS, managed as code (replaces
"Cloudflare or equivalent"); record the three-tier naming scheme; remove the
nyumbani example; state the mesh/LAN-only default. Note public_dns as the
control-node role that renders the public zone (sibling to the internal dns role).
Open items (resolve during the plan / implementation)
- Cloudflare zone export → the exact record list (execution input, not a design gap).
- WHOIS the current registrar → confirm transfer source + unlock/EPP steps.
- Pin the `community.general` version in `requirements.yml` (≥9.0.0).
- Play wiring: a dedicated `dns.yml` play (control-targeted) vs folding into an existing play; decide in the plan.