# Design Decisions: Homelab V5

Subjects to discuss and decide before building V5. Ordered so that each decision can be made with stable answers to everything above it. Each has a unique ID for tracking.

Status values: `undecided` · `decided` · `deferred`

---

## Foundation

### D-01 · Goals and guiding principles

**Status:** undecided

What V5 should optimise for. Every later trade-off will be made against this.

- What are the top 3–5 priorities? (e.g. reliability, simplicity, maintainability, capability, cost, family usability)
- Are there things V4 got wrong that V5 must not repeat?
- What is explicitly out of scope?

---

### D-02 · Hardware: what to keep, retire, replace, or add

**Status:** undecided

Which physical machines carry forward into V5 and in what role. Decisions here determine what compute, storage, and network capacity the rest of the design has to work with.

- fisi: keep as primary server? Upgrade? Replace?
- tembo: keep kiosk+monitoring combo? Split the roles?
- papa: keep as dedicated NAS?
- kobe: keep as dedicated backup target? Consolidate with papa?
- kuku/faru: keep as Pi-based roles? Upgrade to newer Pi hardware?
- simba: keep OPNsense on current hardware?
- Any new hardware to introduce?

---

### D-03 · Virtualisation strategy

**Status:** undecided

Whether to introduce a hypervisor layer, and if so which one. This decision shapes host OS choices, service isolation, and migration paths.

- Stay bare-metal containers only (current approach)?
- Introduce a hypervisor (Proxmox, ESXi, bhyve)?
- If yes: which hosts get it, and which remain bare metal?
- What is the unit of deployment: VM, LXC, container, or a mix?

---

### D-04 · Host OS strategy

**Status:** undecided

What OS runs on each category of machine. Depends on D-03.

- Debian everywhere (current)? Or specialised OS per role (TrueNAS for NAS, Proxmox for compute, etc.)?
- Minimum Debian version to target?
- Should all managed hosts run the same base OS?

---

## Network

### D-05 · IP addressing and VLAN design

**Status:** undecided

The logical network topology. Sets the stage for firewall rules and service addressing.

- Keep the `10.20.x.x` scheme?
- Are the current VLANs (`10.20.10`, `.1`, `.2`, `.30`) the right boundaries, or does V5 need more/fewer segments?
- Should the WireGuard tunnel subnet (`10.8.0.0/24`) change?
- DHCP reservation strategy: reserve all infrastructure IPs statically?

---

### D-06 · Firewall and router

**Status:** undecided

What handles routing, NAT, DHCP, and inter-VLAN policy.

- Keep OPNsense on simba?
- Any changes to hardware or OPNsense version?
- Are the current inter-VLAN policies correct, or does V5 need stricter segmentation (e.g. IoT fully isolated)?

---

### D-07 · WiFi

**Status:** undecided

- Keep the two EAP610 APs (tai1/tai2) as-is?
- Add a third AP?
- Keep using standalone mode or move to Omada controller?

---

## Platform Services

### D-08 · Container orchestration

**Status:** undecided

How containers are defined, deployed, and managed. One of the most consequential decisions: it affects the IaC model, tooling, and complexity.

- Keep Docker + Docker Compose (current)?
- Move to Podman/Quadlets?
- Introduce an orchestrator (Nomad, K3s)?
- If staying with Compose: keep the `container_base` Ansible role pattern?

---

### D-09 · Internal DNS

**Status:** undecided

How internal names resolve and how ad blocking is handled.

- Keep Technitium on fisi?
- Risks of single-server DNS (fisi DNS outage = no internal resolution)?
- Should DNS be moved to a more reliable host, or should there be a secondary?
- Keep the `*.nyumbani.baobab.band` wildcard pattern?

---

### D-10 · Reverse proxy

**Status:** undecided

How HTTPS termination and routing work for internal and public services.

- Keep Traefik (current, on fisi)?
- Any reason to consider Caddy?
- Certificate strategy: keep DNS-01 wildcards via Cloudflare? Keep per-VPS Traefik instances?
- Is a single Traefik instance on fisi the right topology, or should tembo have its own?

---

### D-11 · Remote access and VPN

**Status:** undecided

How family members and VPS hosts reach the homelab network from outside.

- Keep WireGuard on kuku (Raspberry Pi hub)?
- Is kuku a single point of failure worth addressing?
- Consider Tailscale or Headscale instead of self-managed WireGuard?
- VPS integration: keep VPS hosts as WireGuard spokes?

---

### D-12 · Secrets management

**Status:** undecided

Where secrets live and how they are accessed at deploy time.

- Keep Ansible Vault (current)?
- Move to SOPS + age?
- Introduce a secrets server (Infisical, Doppler, HashiCorp Vault)?
- Is the single vault file per inventory environment the right structure?

---

### D-13 · IaC tooling

**Status:** undecided

The tools used to define and apply infrastructure state. Depends on D-03, D-04, D-08.

- Keep Ansible as the primary tool?
- Add Terraform/OpenTofu for VPS provisioning?
- Keep the two-inventory (`prod`/`lab`) structure?
- Role naming and structure: evolve `AnsibleBaobabV4` in place, or start fresh?

---

## Storage and Data

### D-14 · Storage architecture

**Status:** undecided

How storage is organised across machines. Depends on D-02 and D-03.

- Keep papa as a dedicated NAS with ZFS mirror exported via NFS?
- Is the NVMe on fisi the right place for all container state?
- Should media and container data live on the same host?
- Any need for larger storage capacity in V5?

---

### D-15 · Backup strategy

**Status:** undecided

What is protected, how, and where the backups land. Depends on D-14.

- Keep Borg as primary? Keep papa as the backup target?
- Simplify: consolidate kobe (rsnapshot) into the Borg model?
- Off-site: keep pCloud sync via rclone? Any other off-site approach?
- Backup for network devices (simba, APs, switch): keep the pull model from papa?
- RTO/RPO expectations: what is acceptable downtime and data loss?

---

## Observability

### D-16 · Observability stack placement

**Status:** undecided

Which host runs the monitoring stack and how resilient it needs to be.

- Keep monitoring on tembo (same machine as kiosk)?
- Should monitoring be on a host that is not also a kiosk / display?
- What happens to observability if the monitoring host goes down?

---

### D-17 · Metrics, logs, and alerting

**Status:** undecided

The specific tools and data flows for observability. Depends on D-16.

- Keep Prometheus + Loki + Grafana?
- Keep Grafana Alloy as the shipping agent?
- Keep the Matrix bot for alerts, or move to ntfy?
- Log retention: is 15-day Prometheus retention enough?
- Any gaps in current coverage to address in V5?

---

## Public Exposure

### D-18 · VPS strategy

**Status:** undecided

How many VPS hosts, at which providers, and for what roles.

- Keep three VPS (baobab.band, makerfloss, rullebiler.dk)?
- makerfloss is currently isolated (no WireGuard, no backup). Is that intentional?
- Should VPS hosts be brought fully into the homelab WireGuard mesh?
- Cost and provider consolidation: any reason to move hosts?

---

### D-19 · Public services and exposure model

**Status:** undecided

What is reachable from the internet and how traffic gets there.

- Which services need to be publicly accessible (vs. VPN-only)?
- Keep the current model of public services pointing to fisi's public IP via Cloudflare?
- Should any services move behind a VPS relay (Cloudflare Tunnel, nginx stream proxy)?
- Port exposure policy: what can be opened directly vs. must go through a VPS?

---

### D-20 · Domain and DNS provider strategy

**Status:** undecided

How many domains, managed where.

- Keep `baobab.band` + `makerfloss.eu` + `rullebiler.dk`?
- Keep split between Cloudflare and Gandi for DNS management?
- Any consolidation desired?

---

## Services

### D-21 · Core service catalogue

**Status:** undecided

Which services are first-class citizens in V5: things that must be reliable and are worth complexity to maintain.

- Define the "core" tier: services that must survive a host rebuild before anything else is restored (e.g. Vaultwarden, Nextcloud, Forgejo, DNS, Grafana).
- Define the "nice-to-have" tier.
- Are there V4 services that should be dropped in V5?

---

### D-22 · Media stack

**Status:** undecided

- Keep the full \*arr stack (Sonarr, Radarr, Lidarr, Prowlarr, Lazylibrarian)?
- Keep Jellyfin + Audiobookshelf + Calibre Web?
- Any services to add or drop?
- Gluetun VPN for qBittorrent: keep PIA, or change provider?

---

### D-23 · Communication services

**Status:** undecided

- Keep self-hosted Matrix (conduwuit + Element Web)?
- Keep Poste.io for mail? Three separate instances across three hosts is the current pattern; is that the right structure?
- Keep ntfy for push notifications?
- Any desire to consolidate or simplify the comms stack?

---

### D-24 · Photo management

**Status:** undecided

PhotoPrism is currently deployed on both fisi and tembo (partially migrated). This is unresolved technical debt.

- Settle on a single host for PhotoPrism.
- Is PhotoPrism the right tool long-term, or is there an alternative to consider?
- Confirm GPU passthrough requirements (Intel Quick Sync for transcoding).

---

### D-25 · Home automation

**Status:** undecided

- Keep HAOS on twiga?
- Is twiga's current hardware sufficient?
- How tightly should Home Assistant integrate with the rest of the homelab in V5 (monitoring, VPN, etc.)?

---

### D-26 · Kiosk

**Status:** undecided

- Keep tembo as a dedicated kiosk display machine?
- Is GNOME the right desktop environment for a kiosk, or something lighter?
- Keep the current tab rotation + physical button handler?
- Should the kiosk and monitoring stack remain co-located on tembo?

---

## Laptops and Clients

### D-27 · Laptop management strategy

**Status:** undecided

- Keep Debian + XFCE on all laptops, managed by Ansible?
- Any laptops to replace or add?
- mbuzi currently has no WireGuard config. Is that intentional?
- Is the multi-user XFCE model on mamba working, or is it a source of friction?

---

### D-28 · Client software stack

**Status:** undecided

- Keep the current Ansible-managed flatpak + APT stack?
- Any applications to add, replace, or drop?
- pCloud: keep as the family cloud sync provider?
- PIA VPN on laptops: keep alongside WireGuard, or consolidate?
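
---

The ordering rule from the introduction (each decision can be made with stable answers to everything above it) can be checked mechanically against the "Depends on" notes in D-04, D-13, D-14, D-15, and D-17. The following is a minimal sketch, not part of any existing tooling: the `DEPENDS_ON` table is transcribed by hand from this document, and the function names are hypothetical.

```python
# Dependencies copied by hand from the "Depends on ..." notes in this document.
DEPENDS_ON = {
    "D-04": ["D-03"],
    "D-13": ["D-03", "D-04", "D-08"],
    "D-14": ["D-02", "D-03"],
    "D-15": ["D-14"],
    "D-17": ["D-16"],
}


def decision_number(decision_id: str) -> int:
    """Return the numeric part of an ID such as "D-13"."""
    return int(decision_id.split("-")[1])


def ordering_violations(depends_on: dict[str, list[str]]) -> list[tuple[str, str]]:
    """List (decision, dependency) pairs where the dependency is not listed earlier."""
    return [
        (decision, dep)
        for decision, deps in depends_on.items()
        for dep in deps
        if decision_number(dep) >= decision_number(decision)
    ]


if __name__ == "__main__":
    violations = ordering_violations(DEPENDS_ON)
    if violations:
        for decision, dep in violations:
            print(f"{decision} depends on {dep}, which is not above it")
    else:
        print("All declared dependencies point to earlier decisions")
```

If a decision is later renumbered or gains a new dependency, updating the table and re-running the check would show whether the document order still holds.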