Add design decisions list for V5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sjat 2026-04-30 09:00:59 +02:00
parent 7e74559d5b
commit 5af0cf8582

design-decisions.md

@ -0,0 +1,339 @@
# Design Decisions — Homelab V5
Subjects to discuss and decide before building V5. They are ordered so that each decision can be made with stable answers to everything above it. Each has a unique ID for tracking.

Status values: `undecided` · `decided` · `deferred`

---
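
The ordering rule can be checked mechanically. The sketch below is one possible way to do that, not part of the current tooling: it copies the "Depends on" notes from the sections of this document into a table and reports any dependency that is not listed earlier than the decision that needs it. The function names are hypothetical.

```python
import re

# Dependencies as stated in the "Depends on" notes of this document.
DEPENDS_ON = {
    "D-04": ["D-03"],
    "D-13": ["D-03", "D-04", "D-08"],
    "D-14": ["D-02", "D-03"],
    "D-15": ["D-14"],
    "D-17": ["D-16"],
}


def decision_number(decision_id):
    """Return the numeric part of an ID such as 'D-13'."""
    match = re.fullmatch(r"D-(\d+)", decision_id)
    if match is None:
        raise ValueError(f"not a decision ID: {decision_id!r}")
    return int(match.group(1))


def ordering_violations(depends_on):
    """List (decision, dependency) pairs where the dependency is not listed earlier."""
    return [
        (decision, dep)
        for decision, deps in depends_on.items()
        for dep in deps
        if decision_number(dep) >= decision_number(decision)
    ]


def ready_to_decide(decision, statuses, depends_on=DEPENDS_ON):
    """A decision is ready once everything it depends on has status 'decided'."""
    return all(statuses.get(dep) == "decided" for dep in depends_on.get(decision, []))


print(ordering_violations(DEPENDS_ON))
```

An empty list means the document order is consistent with the stated dependencies. `ready_to_decide` takes a mapping from ID to status value and can flag decisions that are being discussed too early.
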
## Foundation
### D-01 · Goals and guiding principles
**Status:** undecided
What V5 should optimise for. Every later trade-off will be made against this.
- What are the top 3–5 priorities? (e.g. reliability, simplicity, maintainability, capability, cost, family usability)
- Are there things V4 got wrong that V5 must not repeat?
- What is explicitly out of scope?
---
### D-02 · Hardware — what to keep, retire, replace, or add
**Status:** undecided
Which physical machines carry forward into V5 and in what role. Decisions here determine what compute, storage, and network capacity the rest of the design has to work with.
- fisi: keep as primary server? Upgrade? Replace?
- tembo: keep kiosk+monitoring combo? Split the roles?
- papa: keep as dedicated NAS?
- kobe: keep as dedicated backup target? Consolidate with papa?
- kuku/faru: keep as Pi-based roles? Upgrade to newer Pi hardware?
- simba: keep OPNsense on current hardware?
- Any new hardware to introduce?
---
### D-03 · Virtualisation strategy
**Status:** undecided
Whether to introduce a hypervisor layer, and if so which one. This decision shapes host OS choices, service isolation, and migration paths.
- Stay with containers on bare metal only (current approach)?
- Introduce a hypervisor (Proxmox, ESXi, bhyve)?
- If yes: which hosts get it, and which remain bare metal?
- What is the unit of deployment — VM, LXC, container, or a mix?
---
### D-04 · Host OS strategy
**Status:** undecided
What OS runs on each category of machine. Depends on D-03.
- Debian everywhere (current)? Or specialised OS per role (TrueNAS for NAS, Proxmox for compute, etc.)?
- Minimum Debian version to target?
- Should all managed hosts run the same base OS?
---
## Network
### D-05 · IP addressing and VLAN design
**Status:** undecided
The logical network topology. Sets the stage for firewall rules and service addressing.
- Keep the `10.20.x.x` scheme?
- Are the current VLANs (`10.20.10`, `.1`, `.2`, `.30`) the right boundaries, or does V5 need more/fewer segments?
- Should the WireGuard tunnel subnet (`10.8.0.0/24`) change?
- DHCP reservation strategy — reserve all infrastructure IPs statically?
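
Whatever scheme is chosen, a candidate plan can be sanity-checked before it reaches OPNsense. The following is a minimal sketch using Python's standard `ipaddress` module. It assumes the `10.20.x.x` scheme means `10.20.0.0/16`; the segment names and their `/24` sizes are hypothetical and only illustrate the check.

```python
import ipaddress

# Ranges named in this section.
HOMELAB = ipaddress.ip_network("10.20.0.0/16")   # assumed reading of `10.20.x.x`
WIREGUARD = ipaddress.ip_network("10.8.0.0/24")  # WireGuard tunnel subnet

# Hypothetical segments: names and /24 sizes are assumptions for illustration.
SEGMENTS = {
    "servers": ipaddress.ip_network("10.20.10.0/24"),
    "iot": ipaddress.ip_network("10.20.30.0/24"),
}


def plan_problems(parent, tunnel, segments):
    """Return human-readable problems found in an addressing plan."""
    problems = []
    # The tunnel subnet must not collide with the LAN range.
    if tunnel.overlaps(parent):
        problems.append(f"tunnel {tunnel} overlaps {parent}")
    # Every segment must sit inside the LAN range.
    for name, net in segments.items():
        if not net.subnet_of(parent):
            problems.append(f"{name} ({net}) is outside {parent}")
    # No two segments may overlap each other.
    names = sorted(segments)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if segments[first].overlaps(segments[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(plan_problems(HOMELAB, WIREGUARD, SEGMENTS))
```
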
---
### D-06 · Firewall and router
**Status:** undecided
What handles routing, NAT, DHCP, and inter-VLAN policy.
- Keep OPNsense on simba?
- Any changes to hardware or OPNsense version?
- Are the current inter-VLAN policies correct, or does V5 need stricter segmentation (e.g. IoT fully isolated)?
---
### D-07 · WiFi
**Status:** undecided
- Keep the two EAP610 APs (tai1/tai2) as-is?
- Add a third AP?
- Keep standalone mode, or move to an Omada controller?
---
## Platform Services
### D-08 · Container orchestration
**Status:** undecided
How containers are defined, deployed, and managed. One of the most consequential decisions — affects the IaC model, tooling, and complexity.
- Keep Docker + Docker Compose (current)?
- Move to Podman/Quadlets?
- Introduce an orchestrator (Nomad, K3s)?
- If staying with Compose: keep the `container_base` Ansible role pattern?
---
### D-09 · Internal DNS
**Status:** undecided
How internal names resolve and how ad blocking is handled.
- Keep Technitium on fisi?
- Is single-server DNS an acceptable risk (a fisi outage means no internal resolution)?
- Should DNS be moved to a more reliable host, or should there be a secondary?
- Keep the `*.nyumbani.baobab.band` wildcard pattern?
---
### D-10 · Reverse proxy
**Status:** undecided
How HTTPS termination and routing work for internal and public services.
- Keep Traefik (current, on fisi)?
- Any reason to consider Caddy?
- Certificate strategy: keep DNS-01 wildcards via Cloudflare? Keep per-VPS Traefik instances?
- Is a single Traefik instance on fisi the right topology, or should tembo have its own?
---
### D-11 · Remote access and VPN
**Status:** undecided
How family members and VPS hosts reach the homelab network from outside.
- Keep WireGuard on kuku (Raspberry Pi hub)?
- Is kuku a single point of failure worth addressing?
- Consider Tailscale or Headscale instead of self-managed WireGuard?
- VPS integration: keep VPS hosts as WireGuard spokes?
---
### D-12 · Secrets management
**Status:** undecided
Where secrets live and how they are accessed at deploy time.
- Keep Ansible Vault (current)?
- Move to SOPS + age?
- Introduce a secrets server (Infisical, Doppler, HashiCorp Vault)?
- Is the single vault file per inventory environment the right structure?
---
### D-13 · IaC tooling
**Status:** undecided
The tools used to define and apply infrastructure state. Depends on D-03, D-04, D-08.
- Keep Ansible as the primary tool?
- Add Terraform/OpenTofu for VPS provisioning?
- Keep the two-inventory (`prod`/`lab`) structure?
- Role naming and structure: evolve `AnsibleBaobabV4` in place, or start fresh?
---
## Storage and Data
### D-14 · Storage architecture
**Status:** undecided
How storage is organised across machines. Depends on D-02 and D-03.
- Keep papa as a dedicated NAS with ZFS mirror exported via NFS?
- Is the NVMe on fisi the right place for all container state?
- Should media and container data live on the same host?
- Any need for larger storage capacity in V5?
---
### D-15 · Backup strategy
**Status:** undecided
What is protected, how, and where the backups land. Depends on D-14.
- Keep Borg as primary? Keep papa as the backup target?
- Simplify: consolidate kobe (rsnapshot) into the Borg model?
- Off-site: keep pCloud sync via rclone? Any other off-site approach?
- Backup for network devices (simba, APs, switch) — keep pull model from papa?
- RTO/RPO expectations: what is acceptable downtime and data loss?
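
Once RPO targets are agreed, they can be turned into a simple check: a dataset is in breach when its newest successful backup is older than its RPO. The sketch below illustrates the idea only. The dataset names, targets, and timestamps are hypothetical, and how the timestamps would be collected from Borg or rsnapshot is left open.

```python
from datetime import datetime, timedelta


def rpo_breaches(last_backup, rpo, now):
    """Return the datasets whose newest backup is older than their RPO."""
    return sorted(
        name for name, finished in last_backup.items()
        if now - finished > rpo[name]
    )


# Hypothetical datasets, targets, and timestamps.
rpo = {"nextcloud": timedelta(hours=24), "media": timedelta(days=7)}
last_backup = {
    "nextcloud": datetime(2026, 4, 28, 3, 0),
    "media": datetime(2026, 4, 27, 3, 0),
}
print(rpo_breaches(last_backup, rpo, now=datetime(2026, 4, 30, 9, 0)))
# prints ['nextcloud']
```
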
---
## Observability
### D-16 · Observability stack placement
**Status:** undecided
Which host runs the monitoring stack and how resilient it needs to be.
- Keep monitoring on tembo (same machine as kiosk)?
- Should monitoring be on a host that is not also a kiosk / display?
- What happens to observability if the monitoring host goes down?
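
The "what if this host goes down" question can be answered for any host from a service placement table. The sketch below is an illustration under assumptions: the placement is a simplified reading of this document (DNS and reverse proxy on fisi, monitoring and kiosk on tembo, WireGuard hub on kuku, NAS on papa), and the service labels and the host `spare` are hypothetical.

```python
# Simplified placement read from this document; service labels are hypothetical.
PLACEMENT = {
    "dns": {"fisi"},
    "reverse-proxy": {"fisi"},
    "monitoring": {"tembo"},
    "kiosk": {"tembo"},
    "wireguard-hub": {"kuku"},
    "nas": {"papa"},
}


def lost_if_down(host, placement=PLACEMENT):
    """Return the services that run only on `host` and so disappear with it."""
    return sorted(
        service for service, hosts in placement.items() if hosts == {host}
    )


print(lost_if_down("tembo"))
# prints ['kiosk', 'monitoring']

# Adding a second monitoring instance on a hypothetical host removes that exposure.
with_secondary = dict(PLACEMENT, monitoring={"tembo", "spare"})
print(lost_if_down("tembo", with_secondary))
# prints ['kiosk']
```
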
---
### D-17 · Metrics, logs, and alerting
**Status:** undecided
The specific tools and data flows for observability. Depends on D-16.
- Keep Prometheus + Loki + Grafana?
- Keep Grafana Alloy as the shipping agent?
- Keep the Matrix bot for alerts, or move to ntfy?
- Metrics retention: is 15-day Prometheus retention enough?
- Any gaps in current coverage to address in V5?
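
To judge whether a longer retention is affordable, the Prometheus documentation gives a rough sizing rule: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes). The sketch below applies that rule. The ingestion rate of 1,000 samples per second is an assumption for illustration, not a measurement from the current setup.

```python
SECONDS_PER_DAY = 86_400


def prometheus_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Assumed ingestion rate; replace with the measured value before deciding.
ASSUMED_SAMPLES_PER_SECOND = 1_000

for days in (15, 90, 365):
    size_gb = prometheus_disk_bytes(days, ASSUMED_SAMPLES_PER_SECOND) / 1e9
    print(f"{days:>3} days: about {size_gb:.1f} GB")
```
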
---
## Public Exposure
### D-18 · VPS strategy
**Status:** undecided
How many VPS hosts, at which providers, and for what roles.
- Keep three VPS (baobab.band, makerfloss, rullebiler.dk)?
- makerfloss is currently isolated (no WireGuard, no backup) — is that intentional?
- Should VPS hosts be brought fully into the homelab WireGuard mesh?
- Cost and provider consolidation: any reason to move hosts?
---
### D-19 · Public services and exposure model
**Status:** undecided
What is reachable from the internet and how traffic gets there.
- Which services need to be publicly accessible (vs. VPN-only)?
- Keep the current model of public services pointing to fisi's public IP via Cloudflare?
- Should any services move behind a relay (Cloudflare Tunnel, or an nginx stream proxy on a VPS)?
- Port exposure policy: what can be opened directly vs. must go through a VPS?
---
### D-20 · Domain and DNS provider strategy
**Status:** undecided
How many domains, managed where.
- Keep `baobab.band` + `makerfloss.eu` + `rullebiler.dk`?
- Keep split between Cloudflare and Gandi for DNS management?
- Any consolidation desired?
---
## Services
### D-21 · Core service catalogue
**Status:** undecided
Which services are first-class citizens in V5 — things that must be reliable and are worth complexity to maintain.
- Define the "core" tier: services that must be restored first after a host rebuild, before anything else (e.g. Vaultwarden, Nextcloud, Forgejo, DNS, Grafana).
- Define the "nice-to-have" tier.
- Are there V4 services that should be dropped in V5?
---
### D-22 · Media stack
**Status:** undecided
- Keep the full \*arr stack (Sonarr, Radarr, Lidarr, Prowlarr, Lazylibrarian)?
- Keep Jellyfin + Audiobookshelf + Calibre Web?
- Any services to add or drop?
- Gluetun VPN for qBittorrent: keep PIA, or change provider?
---
### D-23 · Communication services
**Status:** undecided
- Keep self-hosted Matrix (conduwuit + Element Web)?
- Keep Poste.io for mail — three separate instances across three hosts is the current pattern; is that the right structure?
- Keep ntfy for push notifications?
- Any desire to consolidate or simplify the comms stack?
---
### D-24 · Photo management
**Status:** undecided
PhotoPrism is currently deployed on both fisi and tembo (partially migrated). This is unresolved technical debt.
- Settle on a single host for PhotoPrism.
- Is PhotoPrism the right tool long-term, or is there an alternative to consider?
- Confirm GPU passthrough requirements (Intel Quick Sync for transcoding).
---
### D-25 · Home automation
**Status:** undecided
- Keep HAOS on twiga?
- Is twiga's current hardware sufficient?
- How tightly should Home Assistant integrate with the rest of the homelab in V5 (monitoring, VPN, etc.)?
---
### D-26 · Kiosk
**Status:** undecided
- Keep tembo as a dedicated kiosk display machine?
- Is GNOME the right desktop environment for a kiosk, or something lighter?
- Keep the current tab rotation + physical button handler?
- Should the kiosk and monitoring stack remain co-located on tembo?
---
## Laptops and Clients
### D-27 · Laptop management strategy
**Status:** undecided
- Keep Debian + XFCE on all laptops, managed by Ansible?
- Any laptops to replace or add?
- mbuzi currently has no WireGuard config — is that intentional?
- Is the multi-user XFCE model on mamba working, or is it a source of friction?
---
### D-28 · Client software stack
**Status:** undecided
- Keep the current Ansible-managed flatpak + APT stack?
- Any applications to add, replace, or drop?
- pCloud: keep as the family cloud sync provider?
- PIA VPN on laptops: keep alongside WireGuard, or consolidate?