docs(m4a): HTTP-01 for askari; ADR-024 cert-method-follows-exposure; STATUS/roadmap/friction
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent b7e919d6b3
commit 1862b7a828
4 changed files with 35 additions and 3 deletions
@@ -30,7 +30,8 @@ _Last reviewed: 2026-06-14._
 | `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
 | `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
 | `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **Pending:** NetBird mesh enrollment (so SSH is LAN-only); full `base` hardening (only the `firewall` concern exists, and it is NOT applied here — applying default-deny with no mesh would lock out inbound SSH on the physical NIC); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (now relevant — the offsite tfstate exists). |
-| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3 `hardening` concern applied).** **Pending:** NetBird coordinator (M4), host firewall + mesh enrollment (M5), offsite tfstate backup (ADR-022). |
+| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **Pending:** NetBird coordinator (M4b), host firewall + mesh enrollment (M5), offsite tfstate backup (ADR-022). |
+| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). DNS-01 for cluster mesh/LAN-only services is deferred to Phase 2 (caddy-dns/gandi unresolved — see FRICTION). |

 ## Scaffolded but empty — NOT implemented
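The `public_dns` row above argues that a zone with no MX and no apex A record receives no mail. A minimal Python sketch of that reasoning follows. The record shape (`name` / `type` / `value` dicts) and both function names are assumptions made for illustration, not the role's actual data model. The AAAA check is also an addition: SMTP falls back to the domain's own address records when no MX exists, so the sketch treats apex A and AAAA alike.

```python
# Hypothetical record shape; the real role's variables may look different.
Record = dict[str, str]


def anti_spoof_baseline() -> list[Record]:
    """The two TXT records the status row says remain in the zone."""
    return [
        {"name": "@", "type": "TXT", "value": "v=spf1 -all"},
        {"name": "_dmarc", "type": "TXT", "value": "v=DMARC1; p=reject"},
    ]


def can_receive_mail(records: list[Record]) -> bool:
    """True if the zone offers any SMTP delivery target.

    With no MX record, senders fall back to the apex address records,
    so mail is only impossible when both are absent.
    """
    for record in records:
        if record["type"] == "MX":
            return True
        if record["name"] == "@" and record["type"] in ("A", "AAAA"):
            return True
    return False


zone = anti_spoof_baseline()
print(can_receive_mail(zone))
```

A host record below the apex, such as the `askari` A record, does not change the result, because only the apex is consulted for the fallback.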
@@ -21,6 +21,20 @@ earning its keep.

 _(append new raw signals here; the next kaizen review consumes them)_

+- `[gotcha]` **Hetzner IPs are 403'd by Google's Go module infra; caddy-dns/gandi DNS-01
+didn't issue** (2026-06-14, M4a): building the custom Caddy image *on askari* failed —
+`proxy.golang.org` and `golang.org` both return **403 Forbidden** to the Hetzner IP
+(worked on ubongo). Reworked the role to build on the control node + `docker save`/`load`
+to the target. *Then* the `caddy-dns/gandi` DNS-01 plugin would not create the
+`_acme-challenge` TXT despite a token verified to (a) be in Caddy's env and (b) create
+TXT records via the Gandi API directly — no plugin error, just "propagation timeout,
+last error <nil>"; resolvers/timeout tuning didn't help. **Resolution:** askari is a
+*public* host, so switched it to **HTTP-01 + vanilla Caddy** (works, drops the custom
+image entirely). DNS-01 deferred to Phase 2 (cluster's mesh/LAN-only services) — the
+plugin + the Hetzner-build-block to be solved then. → lesson: prefer HTTP-01 wherever a
+host is publicly reachable; reserve DNS-01 (and its plugin/build complexity) for hosts
+that genuinely can't do HTTP-01. Both bugs surfaced only on the live host.
+
 - `[gotcha]` **A tag on `include_tasks` does NOT reach the included tasks — need
 `apply: {tags:}`** (2026-06-14): M3's `base/tasks/main.yml` tagged the ssh/fail2ban
 `include_tasks` with `hardening`, but `make deploy … TAGS=hardening` ran *nothing*
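The `include_tasks` gotcha in the hunk above lends itself to a small lint. The sketch below is a hedged illustration in Python, not code from the repository. It takes already-parsed task dicts and flags any `include_tasks` entry that carries `tags` but no `apply.tags`. The function name and the sample file names `ssh.yml` and `fail2ban.yml` are hypothetical, and it only recognizes `apply` given in the dict form of the module arguments.

```python
from typing import Any


def includes_missing_apply_tags(tasks: list[dict[str, Any]]) -> list[str]:
    """Return names of include_tasks entries whose tags stay on the include itself."""
    flagged = []
    for task in tasks:
        include = task.get("ansible.builtin.include_tasks", task.get("include_tasks"))
        if include is None or "tags" not in task:
            continue  # not a dynamic include, or nothing to propagate
        # Only the dict form can carry `apply`; a bare file-name string cannot.
        apply = include.get("apply", {}) if isinstance(include, dict) else {}
        if not apply.get("tags"):
            flagged.append(task.get("name", "<unnamed>"))
    return flagged


tasks = [
    {   # the pattern the entry describes: tag on the include only
        "name": "ssh hardening",
        "include_tasks": "ssh.yml",
        "tags": ["hardening"],
    },
    {   # the fix: repeat the tag under `apply`
        "name": "fail2ban",
        "include_tasks": {"file": "fail2ban.yml", "apply": {"tags": ["hardening"]}},
        "tags": ["hardening"],
    },
]
print(includes_missing_apply_tags(tasks))  # ['ssh hardening']
```

The tag on the include itself stays in both cases, since it is what selects the include when a run is filtered by tag. `apply` is what passes the tag on to the tasks inside the file.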
@@ -109,8 +109,15 @@ active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO

 ### M4 · NetBird control plane on `askari` — first real service role

+Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's standard
+**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
+valid Let's Encrypt cert (HTTP-01 — DNS-01 deferred to Phase 2, see ADR-024/FRICTION).
+Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
+`…2026-06-14-m4a-docker-caddy.md`. **M4b (next):** the `netbird` service role — read
+NetBird's current self-host compose then.
+
 Deploy the NetBird stack (management / signal / relay / Coturn + dashboard) with the
-**embedded IdP** (ADR-016 — no Authentik dependency).
+**embedded IdP** (ADR-016 — no Authentik dependency), fronted by the now-proven Caddy.

 - **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` /
 `ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract** —
@@ -1,10 +1,20 @@
-# ADR-024 — Reverse proxy: Caddy with ACME DNS-01 (Gandi)
+# ADR-024 — Reverse proxy: Caddy (ACME — HTTP-01 public, DNS-01 private)

 ## Status

 Accepted (2026-06-14). Amends the soft Traefik assumption carried by the roadmap
 (Phase-2 step 5) and ADR-017 prose; those are updated to read "Caddy (ADR-024)".

+> **Cert method follows exposure (revised 2026-06-14, M4a).** The cert *challenge*
+> depends on whether a host is publicly reachable: **public hosts** (askari) use
+> **HTTP-01** with **vanilla Caddy** — simplest, no plugin; **mesh/LAN-only cluster
+> services** (no public A-record) need **DNS-01** (the M1 Gandi capability), since they
+> can't satisfy HTTP-01. The DNS-01 path is **deferred to Phase 2**: the `caddy-dns/gandi`
+> plugin did not create the ACME TXT records on askari despite a verified-valid token
+> (and Hetzner IPs are 403'd by Google's Go module infra, blocking the on-host custom
+> build) — both to be sorted when the cluster's private services actually need DNS-01.
+> The body below describes the DNS-01 design; askari (M4a) ships on HTTP-01.
+
 ## Context

 boma needs a reverse proxy to front its services with TLS. ADR-002 requires every
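The "cert method follows exposure" rule in the ADR note above reduces to a single decision. The Python sketch below only illustrates that rule. The `Host` type, its `publicly_reachable` flag, and the placeholder name `service.internal.example` are assumptions, not part of the `reverse_proxy` role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    # Hypothetical flag: can the ACME CA reach this host over the public internet?
    publicly_reachable: bool


def acme_challenge(host: Host) -> str:
    """Pick the ACME challenge type from a host's exposure, following the ADR note."""
    # Public hosts can answer HTTP-01 with vanilla Caddy; mesh/LAN-only
    # services have no public A record, so only DNS-01 can prove control.
    return "http-01" if host.publicly_reachable else "dns-01"


askari = Host("askari.wingu.me", publicly_reachable=True)
mesh_service = Host("service.internal.example", publicly_reachable=False)  # placeholder

print(acme_challenge(askari))        # http-01
print(acme_challenge(mesh_service))  # dns-01
```

Keeping the choice in one function means that only hosts resolving to `dns-01` ever depend on the plugin and custom-build path, which the ADR defers to Phase 2.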