From 5af0cf858230430710d654382c1997da432be0eb Mon Sep 17 00:00:00 2001
From: sjat
Date: Thu, 30 Apr 2026 09:00:59 +0200
Subject: [PATCH] Add design decisions list for V5

Co-Authored-By: Claude Sonnet 4.6
---
 design-decisions.md | 339 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 339 insertions(+)
 create mode 100644 design-decisions.md

diff --git a/design-decisions.md b/design-decisions.md
new file mode 100644
index 0000000..9f26f87
--- /dev/null
+++ b/design-decisions.md
@@ -0,0 +1,339 @@
+# Design Decisions — Homelab V5
+
+Subjects to discuss and decide before building V5. Ordered so that each decision can be made with stable answers to everything above it. Each has a unique ID for tracking.
+
+Status values: `undecided` · `decided` · `deferred`
+
+---
+
+## Foundation
+
+### D-01 · Goals and guiding principles
+**Status:** undecided
+
+What V5 should optimise for. Every later trade-off will be made against this.
+
+- What are the top 3–5 priorities? (e.g. reliability, simplicity, maintainability, capability, cost, family usability)
+- Are there things V4 got wrong that V5 must not repeat?
+- What is explicitly out of scope?
+
+---
+
+### D-02 · Hardware — what to keep, retire, replace, or add
+**Status:** undecided
+
+Which physical machines carry forward into V5 and in what role. Decisions here determine what compute, storage, and network capacity the rest of the design has to work with.
+
+- fisi: keep as primary server? Upgrade? Replace?
+- tembo: keep kiosk+monitoring combo? Split the roles?
+- papa: keep as dedicated NAS?
+- kobe: keep as dedicated backup target? Consolidate with papa?
+- kuku/faru: keep as Pi-based roles? Upgrade to newer Pi hardware?
+- simba: keep OPNsense on current hardware?
+- Any new hardware to introduce?
+
+---
+
+### D-03 · Virtualisation strategy
+**Status:** undecided
+
+Whether to introduce a hypervisor layer, and if so which one. This decision shapes host OS choices, service isolation, and migration paths.
+
+- Stay bare-metal containers only (current approach)?
+- Introduce a hypervisor (Proxmox, ESXi, bhyve)?
+- If yes: which hosts get it, and which remain bare metal?
+- What is the unit of deployment — VM, LXC, container, or a mix?
+
+---
+
+### D-04 · Host OS strategy
+**Status:** undecided
+
+What OS runs on each category of machine. Depends on D-03.
+
+- Debian everywhere (current)? Or specialised OS per role (TrueNAS for NAS, Proxmox for compute, etc.)?
+- Minimum Debian version to target?
+- Should all managed hosts run the same base OS?
+
+---
+
+## Network
+
+### D-05 · IP addressing and VLAN design
+**Status:** undecided
+
+The logical network topology. Sets the stage for firewall rules and service addressing.
+
+- Keep the `10.20.x.x` scheme?
+- Are the current VLANs (`10.20.10`, `.1`, `.2`, `.30`) the right boundaries, or does V5 need more/fewer segments?
+- Should the WireGuard tunnel subnet (`10.8.0.0/24`) change?
+- DHCP reservation strategy — reserve all infrastructure IPs statically?
+
+---
+
+### D-06 · Firewall and router
+**Status:** undecided
+
+What handles routing, NAT, DHCP, and inter-VLAN policy.
+
+- Keep OPNsense on simba?
+- Any changes to hardware or OPNsense version?
+- Are the current inter-VLAN policies correct, or does V5 need stricter segmentation (e.g. IoT fully isolated)?
+
+---
+
+### D-07 · WiFi
+**Status:** undecided
+
+- Keep the two EAP610 APs (tai1/tai2) as-is?
+- Add a third AP?
+- Keep using standalone mode or move to Omada controller?
+
+---
+
+## Platform Services
+
+### D-08 · Container orchestration
+**Status:** undecided
+
+How containers are defined, deployed, and managed. One of the most consequential decisions — affects the IaC model, tooling, and complexity.
+
+- Keep Docker + Docker Compose (current)?
+- Move to Podman/Quadlets?
+- Introduce an orchestrator (Nomad, K3s)?
+- If staying with Compose: keep the `container_base` Ansible role pattern?
+
+---
+
+### D-09 · Internal DNS
+**Status:** undecided
+
+How internal names resolve and how ad blocking is handled.
+
+- Keep Technitium on fisi?
+- Risks of single-server DNS (fisi DNS outage = no internal resolution)?
+- Should DNS be moved to a more reliable host, or should there be a secondary?
+- Keep the `*.nyumbani.baobab.band` wildcard pattern?
+
+---
+
+### D-10 · Reverse proxy
+**Status:** undecided
+
+How HTTPS termination and routing work for internal and public services.
+
+- Keep Traefik (current, on fisi)?
+- Any reason to consider Caddy?
+- Certificate strategy: keep DNS-01 wildcards via Cloudflare? Keep per-VPS Traefik instances?
+- Is a single Traefik instance on fisi the right topology, or should tembo have its own?
+
+---
+
+### D-11 · Remote access and VPN
+**Status:** undecided
+
+How family members and VPS hosts reach the homelab network from outside.
+
+- Keep WireGuard on kuku (Raspberry Pi hub)?
+- Is kuku a single point of failure worth addressing?
+- Consider Tailscale or Headscale instead of self-managed WireGuard?
+- VPS integration: keep VPS hosts as WireGuard spokes?
+
+---
+
+### D-12 · Secrets management
+**Status:** undecided
+
+Where secrets live and how they are accessed at deploy time.
+
+- Keep Ansible Vault (current)?
+- Move to SOPS + age?
+- Introduce a secrets server (Infisical, Doppler, HashiCorp Vault)?
+- Is the single vault file per inventory environment the right structure?
+
+---
+
+### D-13 · IaC tooling
+**Status:** undecided
+
+The tools used to define and apply infrastructure state. Depends on D-03, D-04, D-08.
+
+- Keep Ansible as the primary tool?
+- Add Terraform/OpenTofu for VPS provisioning?
+- Keep the two-inventory (`prod`/`lab`) structure?
+- Role naming and structure: evolve `AnsibleBaobabV4` in place, or start fresh?
+
+---
+
+## Storage and Data
+
+### D-14 · Storage architecture
+**Status:** undecided
+
+How storage is organised across machines. Depends on D-02 and D-03.
+
+- Keep papa as a dedicated NAS with ZFS mirror exported via NFS?
+- Is the NVMe on fisi the right place for all container state?
+- Should media and container data live on the same host?
+- Any need for larger storage capacity in V5?
+
+---
+
+### D-15 · Backup strategy
+**Status:** undecided
+
+What is protected, how, and where the backups land. Depends on D-14.
+
+- Keep Borg as primary? Keep papa as the backup target?
+- Simplify: consolidate kobe (rsnapshot) into the Borg model?
+- Off-site: keep pCloud sync via rclone? Any other off-site approach?
+- Backup for network devices (simba, APs, switch) — keep pull model from papa?
+- RTO/RPO expectations: what is acceptable downtime and data loss?
+
+---
+
+## Observability
+
+### D-16 · Observability stack placement
+**Status:** undecided
+
+Which host runs the monitoring stack and how resilient it needs to be.
+
+- Keep monitoring on tembo (same machine as kiosk)?
+- Should monitoring be on a host that is not also a kiosk / display?
+- What happens to observability if the monitoring host goes down?
+
+---
+
+### D-17 · Metrics, logs, and alerting
+**Status:** undecided
+
+The specific tools and data flows for observability. Depends on D-16.
+
+- Keep Prometheus + Loki + Grafana?
+- Keep Grafana Alloy as the shipping agent?
+- Keep the Matrix bot for alerts, or move to ntfy?
+- Log retention: is 15-day Prometheus retention enough?
+- Any gaps in current coverage to address in V5?
+
+---
+
+## Public Exposure
+
+### D-18 · VPS strategy
+**Status:** undecided
+
+How many VPS hosts, at which providers, and for what roles.
+
+- Keep three VPS (baobab.band, makerfloss, rullebiler.dk)?
+- makerfloss is currently isolated (no WireGuard, no backup) — is that intentional?
+- Should VPS hosts be brought fully into the homelab WireGuard mesh?
+- Cost and provider consolidation: any reason to move hosts?
+
+---
+
+### D-19 · Public services and exposure model
+**Status:** undecided
+
+What is reachable from the internet and how traffic gets there.
+
+- Which services need to be publicly accessible (vs. VPN-only)?
+- Keep the current model of public services pointing to fisi's public IP via Cloudflare?
+- Should any services move behind a VPS relay (Cloudflare Tunnel, nginx stream proxy)?
+- Port exposure policy: what can be opened directly vs. must go through a VPS?
+
+---
+
+### D-20 · Domain and DNS provider strategy
+**Status:** undecided
+
+How many domains, managed where.
+
+- Keep `baobab.band` + `makerfloss.eu` + `rullebiler.dk`?
+- Keep split between Cloudflare and Gandi for DNS management?
+- Any consolidation desired?
+
+---
+
+## Services
+
+### D-21 · Core service catalogue
+**Status:** undecided
+
+Which services are first-class citizens in V5 — things that must be reliable and are worth complexity to maintain.
+
+- Define the "core" tier: services that must survive a host rebuild before anything else is restored (e.g. Vaultwarden, Nextcloud, Forgejo, DNS, Grafana).
+- Define the "nice-to-have" tier.
+- Are there V4 services that should be dropped in V5?
+
+---
+
+### D-22 · Media stack
+**Status:** undecided
+
+- Keep the full \*arr stack (Sonarr, Radarr, Lidarr, Prowlarr, Lazylibrarian)?
+- Keep Jellyfin + Audiobookshelf + Calibre Web?
+- Any services to add or drop?
+- Gluetun VPN for qBittorrent: keep PIA, or change provider?
+
+---
+
+### D-23 · Communication services
+**Status:** undecided
+
+- Keep self-hosted Matrix (conduwuit + Element Web)?
+- Keep Poste.io for mail — three separate instances across three hosts is the current pattern; is that the right structure?
+- Keep ntfy for push notifications?
+- Any desire to consolidate or simplify the comms stack?
+
+---
+
+### D-24 · Photo management
+**Status:** undecided
+
+PhotoPrism is currently deployed on both fisi and tembo (partially migrated). This is unresolved technical debt.
+
+- Settle on a single host for PhotoPrism.
+- Is PhotoPrism the right tool long-term, or is there an alternative to consider?
+- Confirm GPU passthrough requirements (Intel Quick Sync for transcoding).
+
+---
+
+### D-25 · Home automation
+**Status:** undecided
+
+- Keep HAOS on twiga?
+- Is twiga's current hardware sufficient?
+- How tightly should Home Assistant integrate with the rest of the homelab in V5 (monitoring, VPN, etc.)?
+
+---
+
+### D-26 · Kiosk
+**Status:** undecided
+
+- Keep tembo as a dedicated kiosk display machine?
+- Is GNOME the right desktop environment for a kiosk, or something lighter?
+- Keep the current tab rotation + physical button handler?
+- Should the kiosk and monitoring stack remain co-located on tembo?
+
+---
+
+## Laptops and Clients
+
+### D-27 · Laptop management strategy
+**Status:** undecided
+
+- Keep Debian + XFCE on all laptops, managed by Ansible?
+- Any laptops to replace or add?
+- mbuzi currently has no WireGuard config — is that intentional?
+- Is the multi-user XFCE model on mamba working, or is it a source of friction?
+
+---
+
+### D-28 · Client software stack
+**Status:** undecided
+
+- Keep the current Ansible-managed flatpak + APT stack?
+- Any applications to add, replace, or drop?
+- pCloud: keep as the family cloud sync provider?
+- PIA VPN on laptops: keep alongside WireGuard, or consolidate?
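
The ordering rule stated at the top of `design-decisions.md` (each decision can be made with stable answers to everything above it) could be checked mechanically against the "Depends on" notes. The following is a minimal sketch in Python, not part of the patch. The dependency map is copied from the notes on D-04, D-13, D-14, D-15, and D-17; the names `DEPENDS_ON`, `decision_number`, and `ordering_violations` are hypothetical and chosen only for illustration.

```python
# Dependencies copied from the "Depends on" notes in design-decisions.md.
# The variable and function names are hypothetical, for illustration only.
DEPENDS_ON = {
    "D-04": ["D-03"],
    "D-13": ["D-03", "D-04", "D-08"],
    "D-14": ["D-02", "D-03"],
    "D-15": ["D-14"],
    "D-17": ["D-16"],
}


def decision_number(decision_id: str) -> int:
    """Return the numeric part of an ID such as 'D-13'."""
    return int(decision_id.split("-")[1])


def ordering_violations(depends_on: dict[str, list[str]]) -> list[tuple[str, str]]:
    """List (decision, dependency) pairs where the dependency is not listed earlier."""
    return [
        (decision, dep)
        for decision, deps in depends_on.items()
        for dep in deps
        if decision_number(dep) >= decision_number(decision)
    ]


if __name__ == "__main__":
    violations = ordering_violations(DEPENDS_ON)
    if violations:
        print(f"out of order: {violations}")
    else:
        print("ordering OK")
```

If a later revision adds a dependency on a decision further down the list, the pair shows up in the returned list, which is a hint to either reorder the sections or reconsider the dependency.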