# Design: Public DNS migration to Gandi (DNS-as-code)

- **Date:** 2026-06-11
- **Status:** Draft for review. Design settled in brainstorming; pending user review,
  then implementation plan
- **Roadmap milestone:** M1 (`docs/ROADMAP.md`)
- **Resolves:** TODO 4 (split-horizon FQDN, with/without `nyumbani`); review finding
  O12 (ADR-007 FQDN convention contradicts its own example)
- **Amends:** ADR-007: public DNS provider → **Gandi LiveDNS, managed as code**; the
  three-tier naming scheme; `nyumbani` removed; mesh/LAN-only default exposure
- **Becomes:** an ADR-007 amendment (no new ADR unless the `public_dns` role grows
  concerns of its own)

---

## Problem

Move `baobab.band` authoritative DNS **and registration** off Cloudflare to **Gandi**.
The driver is values/sovereignty (Gandi over Cloudflare). It is **not** a NetBird
prerequisite, but it is sequenced first (roadmap M1) so `askari`'s records are born at
Gandi and Cloudflare is never touched again. Do it **as code**, consistent with boma's
grain: internal DNS is already Ansible-rendered and Terraform owns *no* DNS (CLAUDE.md).
While in here, settle the long-open naming question (`nyumbani`, TODO 4 / O12).

## Decisions (as settled)

1. **Full registrar transfer.** Registration *and* authoritative DNS move to Gandi, a
   full exit from Cloudflare. (DNS-only would strand the registration at Cloudflare and is
   likely impossible anyway, since Cloudflare Registrar requires Cloudflare nameservers.)
2. **Three-tier naming scheme** (the convention; see table below). `nyumbani` is
   **dropped**.
3. **Mesh/LAN-only by default.** Home/cluster services have **no public record**; they
   are reached over LAN or the NetBird mesh. Public Gandi records exist only for
   deliberate exceptions (today: `forgejo`, the `askari` tier).
4. **DNS-as-code via a control-node `public_dns` role** driven by structured record data
   in `group_vars`: the same pattern as the firewall catalog, and exactly what ADR-007
   already calls "service/alias/split-horizon records … explicit zone data in
   `group_vars`." The role name is provider-agnostic on purpose.
5. **Tooling: `community.general.gandi_livedns` with `personal_access_token`** (PAT).
   Re-adds `community.general` to `requirements.yml` under the collections-on-demand
   policy (a committed role now uses `gandi_livedns`), pinned `>=9.0.0`, with the naming
   comment.
6. **Clean by omission.** Stale records and the (unused) MX are *not* deleted at
   Cloudflare; the zone is abandoned. Only wanted records are carried to Gandi.
7. **Cert scope: DNS + PAT only.** M1 ends at the migrated zone + the PAT in vault, which
   *enables* ACME DNS-01 later. **No certificate issuance in M1**; that lands with a
   reverse proxy (askari in M4, home in Phase 2).
8. **Human/agent division of labour** (see table): account, payment, registrar
   transfer, and the go-live nameserver flip are human; all record-wrangling, the IaC,
   and the post-flip cutover are the agent's, executed from `ubongo`.

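As a rough illustration of decision 5, the `requirements.yml` entry might look like the sketch below. The comment wording is an assumption; the repo's actual "naming comment" convention is not reproduced here.

```yaml
# requirements.yml (sketch; comment wording is illustrative only)
collections:
  # community.general: needed by the public_dns role (gandi_livedns module)
  - name: community.general
    version: ">=9.0.0"
```
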
## Verified facts (ADR-014)

> verified: `community.general.gandi_livedns` requires `personal_access_token` (PAT);
> `api_key` is deprecated and **rejected** by Gandi (Bearer auth replaced Apikey) ·
> WebFetch docs.ansible.com + WebSearch (Gandi PAT announcement 2023-09; community.general
> issue #7926) · PAT param added in **community.general 9.0.0**, **13.0.1** current ·
> 2026-06-11
> - Module params: `domain`, `record`, `type`, `values` (list), `ttl`, `state`
>   (`present`/`absent`). Supports **check mode + diff**.
> - Auth is per-task: pass `personal_access_token: "{{ vault.gandi.pat }}"`.

> unverified (from memory; confirm during implementation): the current registrar of
> `baobab.band` (WHOIS). This determines whether the transfer is Cloudflare→Gandi or
> elsewhere→Gandi, and the exact unlock/EPP steps.

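To make the verified parameters concrete, a single-record task could look roughly like this. It is a sketch using only the parameters listed above; the task name and the placeholder IP are illustrative.

```yaml
# Sketch: one record, one task. Run with --check --diff first.
- name: Ensure the forgejo A record exists at Gandi LiveDNS
  community.general.gandi_livedns:
    domain: baobab.band
    record: forgejo
    type: A
    values:
      - "<home-ingress-ip>"   # placeholder, not a real address
    ttl: 1800
    state: present
    personal_access_token: "{{ vault.gandi.pat }}"
  delegate_to: localhost
```
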
## Naming scheme (the convention)

| Tier | Pattern | Authoritative source | Public? |
|---|---|---|---|
| Infrastructure / hosts | `<host>.boma.baobab.band` | internal zone (`dns1`/`dns2`, Phase 2) | never |
| Home / cluster services | `<service>.baobab.band` | internal zone (split-horizon) | only deliberate exceptions |
| Off-site / VPS services | `<service>.askari.baobab.band` | Gandi LiveDNS | yes (askari has a stable public IP) |

- **`nyumbani` removed.** It namespaced "home," but home is the default; only the
  *exception* needs naming, and `askari.baobab.band` does that, self-documenting.
- **The mesh carries "internal" to road-warriors.** NetBird pushes `dns1`/`dns2` (over
  `wt0`) as the resolver for the `baobab.band` match-domain, so on-LAN-or-on-mesh →
  internal answer; truly public → Gandi (ties M1 ↔ ADR-016 / M5).
- **Wildcard TLS later.** A `*.baobab.band` (and `*.askari.baobab.band`) ACME **DNS-01**
  cert via the Gandi PAT gives even unexposed services real public-CA TLS, without a
  public A record. Enabled by M1, issued in M4/Phase 2.

## Architecture: two deliverables (kept separate on purpose)

### (A) One-time migration: a runbook (`docs/runbooks/`)

Registrar transfers and the nameserver flip cannot be IaC'd. This is a human-gated
procedure (sequence below), executed once.

### (B) `public_dns`: the reusable IaC role

- Runs **from the control node** (`delegate_to: localhost`, or a `dns.yml` play targeting
  `control`) against the Gandi LiveDNS API. There is no managed *host*, only API calls.
- Reconciles records from **`group_vars` data** via `community.general.gandi_livedns`,
  PAT from `vault.gandi.pat`.
- **Check-mode/diff first**, always (boma's check-before-deploy; the module supports it).
- Carries only the public-tier records (exceptions + `askari` tier); the mesh/LAN-only
  default keeps this set small.

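If the dedicated-play option is chosen, the wiring could be as small as the sketch below. It assumes an inventory group named `control` that contains the control node and a role named `public_dns`, as described above; the play name and `gather_facts` setting are illustrative.

```yaml
# dns.yml (sketch; play wiring is still an open item)
- name: Reconcile public DNS at Gandi LiveDNS
  hosts: control
  gather_facts: false   # only API calls are made; no host facts are needed
  roles:
    - role: public_dns
```
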
#### Data model (sketch)

```yaml
# inventories/production/group_vars/all/public_dns.yml
public_dns__domain: baobab.band
public_dns__records:
  - { record: forgejo, type: A, values: ["<home-ingress-ip>"], ttl: 1800 }
  - { record: askari, type: A, values: ["<hetzner-ip>"], ttl: 1800 }
# mesh/LAN-only services are intentionally ABSENT — they live only in the internal zone.
# PAT referenced as {{ vault.gandi.pat }} (nested vault.<service>.<key>, CLAUDE.md).
```

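A minimal sketch of how the role might consume this data follows. The file path and the optional per-record `state` key are assumptions, not settled design; the module parameters are the verified ones.

```yaml
# roles/public_dns/tasks/main.yml (hypothetical path; sketch only)
- name: Reconcile declared public records at Gandi LiveDNS
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    # Bracket access on purpose: item.values would resolve to the dict method.
    values: "{{ item['values'] | default(omit) }}"
    ttl: "{{ item.ttl | default(omit) }}"
    # `state` is an assumed optional key; records default to present.
    state: "{{ item.state | default('present') }}"
    personal_access_token: "{{ vault.gandi.pat }}"
  loop: "{{ public_dns__records }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  delegate_to: localhost
```

Looping one module call per record keeps the check-mode diff readable: each declared record reports its own changed/ok status.
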
#### Open design nuance: additive vs authoritative

`gandi_livedns` is **per-record** (`present`/`absent`); it does not sync a whole zone. To
make the repo *authoritative* (prune undeclared records; cf. TODO 8.3's prune question),
the role would need to GET existing records and remove those not declared. **M1 decision:**
start **additive** (declare what we want; remove the old via explicit `absent` entries
during cutover); flag full-zone pruning as a possible later enhancement. This avoids
accidentally deleting a record someone added out-of-band before the repo is the single
source of truth.

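Under the additive approach, an explicit removal could be declared next to the wanted records, roughly as below. The per-record `state` key is an assumption about the data model, and `<old-name>` is a placeholder.

```yaml
public_dns__records:
  # Wanted state.
  - { record: forgejo, type: A, values: ["<home-ingress-ip>"], ttl: 1800 }
  # Explicit removal during cutover; `state` is an assumed key.
  - { record: "<old-name>", type: A, state: absent }
```

Once the removal has been applied and verified, the `absent` entry can be dropped from the data so the list again describes only wanted records.
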
## Cutover sequence (the runbook)

Legend: **[H]** human · **[A]** agent (from `ubongo`, committed code + check-mode).

1. **[A]** Inventory: parse the **Cloudflare zone export** (BIND file the user downloads,
   tokenless) → full record list; classify keep / rename / drop (incl. unused MX + stale).
2. **[A]** Draft `public_dns__records` (new scheme) + the `public_dns` role; PR/commit;
   `make check` shows the intended Gandi state as a diff.
3. **[H]** Create/verify the Gandi account; issue a **LiveDNS-scoped PAT** for
   `baobab.band`; store it in vault (`vault.gandi.pat`) via rbw. **[H]** Lower TTLs on the
   *old* Cloudflare zone ~24–48h ahead of the flip.
4. **[A]** Create the zone in Gandi LiveDNS and load records (`make deploy`, after a clean
   `make check`). Validate with `dig @<gandi-ns>`.
5. **[H]** Initiate the **registrar transfer** to Gandi (unlock at Cloudflare, get
   EPP/auth code, start at Gandi, ACK to expedite; ~5 days, during which DNS keeps
   resolving).
6. **[H, go-live]** **Flip nameservers** to Gandi LiveDNS. (Irreversible/outward-facing:
   explicit human go.)
7. **[A]** Post-flip: validate resolution; **rename the Forgejo remote + CI**
   (`forgejo.nyumbani.baobab.band` → `forgejo.baobab.band`); verify a push.
8. **[A/H]** Confirm propagation; **[H]** decommission the Cloudflare zone.

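The `dig` validation in steps 4 and 7 could be scripted along these lines. This is a sketch for A records only; `public_dns__verify_ns` is a hypothetical variable holding the nameserver to query, and `dig` is assumed to be installed on the control node.

```yaml
- name: Query the chosen nameserver directly for each declared record
  ansible.builtin.command:
    argv:
      - dig
      - "+short"
      - "@{{ public_dns__verify_ns }}"   # hypothetical variable
      - "{{ item.record }}.{{ public_dns__domain }}"
      - "{{ item.type }}"
  loop: "{{ public_dns__records }}"
  register: public_dns__dig
  changed_when: false   # read-only query
  check_mode: false     # safe to run even under --check
  delegate_to: localhost

- name: Assert every answer matches the declared values
  ansible.builtin.assert:
    that:
      # Bracket access: item.item.values would be the dict method.
      - item.stdout_lines | sort == item.item['values'] | sort
  loop: "{{ public_dns__dig.results }}"
  loop_control:
    label: "{{ item.item.record }} {{ item.item.type }}"
```
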
## Division of labour & access (security posture)

| Task | Who | How |
|---|---|---|
| Zone inventory | Agent | From the Cloudflare **export** (tokenless). |
| New record set + `public_dns` role + data | Agent | Committed IaC; `make check` diff. |
| Gandi account, transfer, payment | Human | Identity/billing/e-mail/ToS; not automatable. |
| Create zone + load records + reconcile | Agent | `public_dns` role on `ubongo`, PAT from vault, check-mode first. |
| Nameserver flip / go-live | Human-gated | Agent preps + validates; human flips. |
| Forgejo remote + CI cutover | Agent | After flip; verify push. |
| Delete stale Cloudflare records | Nobody | Cleaned by omission. |

- **Minimal token scope.** Gandi PAT: **LiveDNS-only**, restricted to `baobab.band`.
  Cloudflare: prefer the **tokenless export**; if an API token is used, **read-only,
  single-zone, throwaway**, revoked once inventory is captured.
- **Tokens live in boma's vault** (`vault.gandi.pat`) via rbw, never pasted in chat.
- **Execution on `ubongo`**, not in any agent sandbox: committed role + `make check` →
  `make deploy`. Irreversible/outward steps (NS flip, go-live) require explicit human
  confirmation.

## Testing & verification

External-API reconciliation does not fit container Molecule cleanly (a nuance against
ADR-008: not every role gets a converge-in-a-container scenario). Instead:

- **`make check` (check-mode + diff)** against live Gandi before any apply.
- **Idempotence:** a second `make deploy` reports no changes.
- **`dig` assertions** post-cutover: new names resolve to expected values; a Forgejo
  push over `forgejo.baobab.band` succeeds.
- Optionally a small pytest over the `public_dns__records` data shape (types, no
  duplicate record/type pairs), mirroring `test_firewall_rules.py`.

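The data-shape check could also run as a pre-flight task inside the role, alongside or instead of the pytest. The sketch below is one possible form; the helper variable `public_dns__pairs` is hypothetical.

```yaml
- name: Check the public_dns__records data shape before touching the API
  ansible.builtin.assert:
    that:
      # Every entry declares a record name and a type.
      - public_dns__records | rejectattr('record', 'defined') | list | length == 0
      - public_dns__records | rejectattr('type', 'defined') | list | length == 0
      # No record/type pair is declared twice.
      - public_dns__pairs | unique | length == public_dns__pairs | length
    fail_msg: "public_dns__records has a missing key or a duplicate record/type pair"
  vars:
    # Hypothetical helper: one "record/type" string per entry.
    public_dns__pairs: >-
      {{ public_dns__records | map(attribute='record')
         | zip(public_dns__records | map(attribute='type'))
         | map('join', '/') | list }}
  delegate_to: localhost
```
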
## Scope boundaries: what M1 is NOT

- **Not** the internal split-horizon `dns` role (renders `<service>.baobab.band`
  privately). That needs the `dns` role + actual home services → **Phase 2**.
- **Not** certificate issuance or the reverse proxy: **M4 (askari) / Phase 2 (home)**.
- **Not** authoritative whole-zone pruning; additive for now (see nuance above).

## ADR work

Amend **ADR-007**: public zone provider → **Gandi LiveDNS, managed as code** (replaces
"Cloudflare or equivalent"); record the **three-tier naming scheme**; remove the
`nyumbani` example; state the **mesh/LAN-only default**. Note `public_dns` as the
control-node role that renders the public zone (sibling to the internal `dns` role).

## Open items (resolve during the plan / implementation)

- **Cloudflare zone export** → the exact record list (execution input, not a design gap).
- **WHOIS** the current registrar → confirm transfer source + unlock/EPP steps.
- **Pin** the `community.general` version in `requirements.yml` (≥9.0.0).
- **Play wiring:** a dedicated `dns.yml` play (control-targeted) vs folding into an
  existing play; decide in the plan.