HomelabDesignV5/design-decisions.md
sjat 5af0cf8582 Add design decisions list for V5
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 09:00:59 +02:00

# Design Decisions — Homelab V5
Subjects to discuss and decide before building V5. Ordered so that each decision can be made with stable answers to everything above it. Each has a unique ID for tracking.
Status values: `undecided` · `decided` · `deferred`

---
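The ordering rule above can be checked mechanically. The following is a minimal sketch, not part of any existing tooling: the dependency map is copied from the "Depends on" notes further down, while the function names and the rule that only `decided` counts as a stable answer are assumptions made for illustration.

```python
# Dependencies copied from the "Depends on ..." notes in this document.
DEPENDS_ON = {
    "D-04": ["D-03"],
    "D-13": ["D-03", "D-04", "D-08"],
    "D-14": ["D-02", "D-03"],
    "D-15": ["D-14"],
    "D-17": ["D-16"],
}

# Every decision in this list currently has the status "undecided".
STATUSES = {f"D-{n:02d}": "undecided" for n in range(1, 29)}


def decision_number(decision_id: str) -> int:
    """Return the numeric part of an ID such as 'D-07'."""
    return int(decision_id.split("-")[1])


def ordering_violations(depends_on: dict) -> list:
    """List (decision, dependency) pairs where the dependency is not listed earlier."""
    return [
        (decision, dep)
        for decision, deps in depends_on.items()
        for dep in deps
        if decision_number(dep) >= decision_number(decision)
    ]


def ready_to_decide(statuses: dict, depends_on: dict) -> list:
    """Undecided items whose dependencies are all 'decided' (assumed rule)."""
    return [
        decision
        for decision, status in statuses.items()
        if status == "undecided"
        and all(statuses[dep] == "decided" for dep in depends_on.get(decision, []))
    ]
```

Calling `ordering_violations(DEPENDS_ON)` returns an empty list while the document stays correctly ordered, and `ready_to_decide(STATUSES, DEPENDS_ON)` lists the decisions that are not blocked by an open dependency.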
## Foundation
### D-01 · Goals and guiding principles
**Status:** undecided
What V5 should optimise for. Every later trade-off will be made against this.
- What are the top 3-5 priorities? (e.g. reliability, simplicity, maintainability, capability, cost, family usability)
- Are there things V4 got wrong that V5 must not repeat?
- What is explicitly out of scope?
---
### D-02 · Hardware — what to keep, retire, replace, or add
**Status:** undecided
Which physical machines carry forward into V5 and in what role. Decisions here determine what compute, storage, and network capacity the rest of the design has to work with.
- fisi: keep as primary server? Upgrade? Replace?
- tembo: keep kiosk+monitoring combo? Split the roles?
- papa: keep as dedicated NAS?
- kobe: keep as dedicated backup target? Consolidate with papa?
- kuku/faru: keep as Pi-based roles? Upgrade to newer Pi hardware?
- simba: keep OPNsense on current hardware?
- Any new hardware to introduce?
---
### D-03 · Virtualisation strategy
**Status:** undecided
Whether to introduce a hypervisor layer, and if so which one. This decision shapes host OS choices, service isolation, and migration paths.
- Stay with containers on bare-metal hosts only (current approach)?
- Introduce a hypervisor (Proxmox, ESXi, bhyve)?
- If yes: which hosts get it, and which remain bare metal?
- What is the unit of deployment — VM, LXC, container, or a mix?
---
### D-04 · Host OS strategy
**Status:** undecided
What OS runs on each category of machine. Depends on D-03.
- Debian everywhere (current)? Or specialised OS per role (TrueNAS for NAS, Proxmox for compute, etc.)?
- Minimum Debian version to target?
- Should all managed hosts run the same base OS?
---
## Network
### D-05 · IP addressing and VLAN design
**Status:** undecided
The logical network topology. Sets the stage for firewall rules and service addressing.
- Keep the `10.20.x.x` scheme?
- Are the current VLANs (`10.20.10`, `.1`, `.2`, `.30`) the right boundaries, or does V5 need more/fewer segments?
- Should the WireGuard tunnel subnet (`10.8.0.0/24`) change?
- DHCP reservation strategy — reserve all infrastructure IPs statically?
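Any change to the addressing plan can be sanity-checked for overlaps before it reaches the firewall. The sketch below uses Python's standard `ipaddress` module. It assumes that `10.20.x.x` means `10.20.0.0/16` and that each VLAN is a `/24`; the segment names are hypothetical, and only two of the VLANs mentioned above are included.

```python
import ipaddress
from itertools import combinations

# Assumption: "10.20.x.x" is the whole 10.20.0.0/16 block.
LAN_BLOCK = ipaddress.ip_network("10.20.0.0/16")

# Hypothetical segment names; the /24 prefix length per VLAN is an assumption.
SEGMENTS = {
    "vlan10": ipaddress.ip_network("10.20.10.0/24"),
    "vlan30": ipaddress.ip_network("10.20.30.0/24"),
    "wireguard": ipaddress.ip_network("10.8.0.0/24"),  # tunnel subnet from the text
}


def overlapping_pairs(segments: dict) -> list:
    """Return the names of every pair of segments whose address ranges overlap."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(segments.items(), 2)
        if net_a.overlaps(net_b)
    ]


def outside_lan_block(segments: dict, lan_block) -> list:
    """Return the names of segments that are not contained in the LAN block."""
    return [name for name, net in segments.items() if not net.subnet_of(lan_block)]
```

With the values above, `overlapping_pairs(SEGMENTS)` is empty and `outside_lan_block(SEGMENTS, LAN_BLOCK)` reports only the WireGuard subnet, which is the intended result if the tunnel is meant to stay outside the LAN range.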
---
### D-06 · Firewall and router
**Status:** undecided
What handles routing, NAT, DHCP, and inter-VLAN policy.
- Keep OPNsense on simba?
- Any changes to hardware or OPNsense version?
- Are the current inter-VLAN policies correct, or does V5 need stricter segmentation (e.g. IoT fully isolated)?
---
### D-07 · WiFi
**Status:** undecided
- Keep the two EAP610 APs (tai1/tai2) as-is?
- Add a third AP?
- Keep the APs in standalone mode, or move to an Omada controller?
---
## Platform Services
### D-08 · Container orchestration
**Status:** undecided
How containers are defined, deployed, and managed. One of the most consequential decisions — affects the IaC model, tooling, and complexity.
- Keep Docker + Docker Compose (current)?
- Move to Podman/Quadlets?
- Introduce an orchestrator (Nomad, K3s)?
- If staying with Compose: keep the `container_base` Ansible role pattern?
---
### D-09 · Internal DNS
**Status:** undecided
How internal names resolve and how ad blocking is handled.
- Keep Technitium on fisi?
- Is single-server DNS an acceptable risk? (If DNS on fisi goes down, internal names stop resolving.)
- Should DNS be moved to a more reliable host, or should there be a secondary?
- Keep the `*.nyumbani.baobab.band` wildcard pattern?
---
### D-10 · Reverse proxy
**Status:** undecided
How HTTPS termination and routing work for internal and public services.
- Keep Traefik (current, on fisi)?
- Any reason to consider Caddy?
- Certificate strategy: keep DNS-01 wildcards via Cloudflare? Keep per-VPS Traefik instances?
- Is a single Traefik instance on fisi the right topology, or should tembo have its own?
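One constraint worth keeping in mind for the certificate question: under the usual TLS matching rule, a wildcard certificate covers exactly one leftmost label. The sketch below illustrates that rule for the `*.nyumbani.baobab.band` pattern. The function name and the example hostnames are illustrative assumptions, not taken from the V4 configuration.

```python
def covered_by_wildcard(hostname: str, pattern: str) -> bool:
    """Check whether a certificate name such as '*.example.org' covers hostname.

    Assumes the common rule that '*' stands for exactly one leftmost label,
    so neither the bare parent domain nor deeper subdomains are covered.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")

    if pattern_labels[0] != "*":
        # Not a wildcard: the names must match exactly.
        return host_labels == pattern_labels

    return (
        len(host_labels) == len(pattern_labels)
        and host_labels[0] != ""
        and host_labels[1:] == pattern_labels[1:]
    )
```

For example, `covered_by_wildcard("jellyfin.nyumbani.baobab.band", "*.nyumbani.baobab.band")` is true, while `nyumbani.baobab.band` itself and a two-level name such as `a.b.nyumbani.baobab.band` are not covered and would need their own certificate names.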
---
### D-11 · Remote access and VPN
**Status:** undecided
How family members and VPS hosts reach the homelab network from outside.
- Keep WireGuard on kuku (Raspberry Pi hub)?
- Is kuku a single point of failure worth addressing?
- Consider Tailscale or Headscale instead of self-managed WireGuard?
- VPS integration: keep VPS hosts as WireGuard spokes?
---
### D-12 · Secrets management
**Status:** undecided
Where secrets live and how they are accessed at deploy time.
- Keep Ansible Vault (current)?
- Move to SOPS + age?
- Introduce a secrets server (Infisical, Doppler, HashiCorp Vault)?
- Is the single vault file per inventory environment the right structure?
---
### D-13 · IaC tooling
**Status:** undecided
The tools used to define and apply infrastructure state. Depends on D-03, D-04, D-08.
- Keep Ansible as the primary tool?
- Add Terraform/OpenTofu for VPS provisioning?
- Keep the two-inventory (`prod`/`lab`) structure?
- Role naming and structure: evolve `AnsibleBaobabV4` in place, or start fresh?
---
## Storage and Data
### D-14 · Storage architecture
**Status:** undecided
How storage is organised across machines. Depends on D-02 and D-03.
- Keep papa as a dedicated NAS with ZFS mirror exported via NFS?
- Is the NVMe on fisi the right place for all container state?
- Should media and container data live on the same host?
- Any need for larger storage capacity in V5?
---
### D-15 · Backup strategy
**Status:** undecided
What is protected, how, and where the backups land. Depends on D-14.
- Keep Borg as primary? Keep papa as the backup target?
- Simplify: consolidate kobe (rsnapshot) into the Borg model?
- Off-site: keep pCloud sync via rclone? Any other off-site approach?
- Backup for network devices (simba, APs, switch) — keep pull model from papa?
- RTO/RPO expectations: what is acceptable downtime and data loss?
---
## Observability
### D-16 · Observability stack placement
**Status:** undecided
Which host runs the monitoring stack and how resilient it needs to be.
- Keep monitoring on tembo (same machine as kiosk)?
- Should monitoring be on a host that is not also a kiosk / display?
- What happens to observability if the monitoring host goes down?
---
### D-17 · Metrics, logs, and alerting
**Status:** undecided
The specific tools and data flows for observability. Depends on D-16.
- Keep Prometheus + Loki + Grafana?
- Keep Grafana Alloy as the shipping agent?
- Keep the Matrix bot for alerts, or move to ntfy?
- Metrics retention: is 15-day Prometheus retention enough?
- Any gaps in current coverage to address in V5?
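The retention question is partly a disk-sizing question. The Prometheus storage documentation gives a rough rule of thumb: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2). A minimal sketch of that arithmetic follows; the ingestion rate in the usage note is a hypothetical placeholder, not a measured V4 value.

```python
SECONDS_PER_DAY = 86_400


def prometheus_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte average
) -> float:
    """Rough disk estimate: retention seconds * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def to_gigabytes(size_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return size_bytes / 1e9
```

Because the estimate is linear in retention time, `to_gigabytes(prometheus_disk_bytes(90, rate))` is six times the 15-day figure for the same `rate`. The real ingestion rate can be read from Prometheus itself before committing to a longer retention period.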
---
## Public Exposure
### D-18 · VPS strategy
**Status:** undecided
How many VPS hosts, at which providers, and for what roles.
- Keep three VPS (baobab.band, makerfloss, rullebiler.dk)?
- makerfloss is currently isolated (no WireGuard, no backup) — is that intentional?
- Should VPS hosts be brought fully into the homelab WireGuard mesh?
- Cost and provider consolidation: any reason to move hosts?
---
### D-19 · Public services and exposure model
**Status:** undecided
What is reachable from the internet and how traffic gets there.
- Which services need to be publicly accessible (vs. VPN-only)?
- Keep the current model of public services pointing to fisi's public IP via Cloudflare?
- Should any services move behind a relay (Cloudflare Tunnel, or an nginx stream proxy on a VPS)?
- Port exposure policy: what can be opened directly vs. must go through a VPS?
---
### D-20 · Domain and DNS provider strategy
**Status:** undecided
How many domains, managed where.
- Keep `baobab.band` + `makerfloss.eu` + `rullebiler.dk`?
- Keep split between Cloudflare and Gandi for DNS management?
- Any consolidation desired?
---
## Services
### D-21 · Core service catalogue
**Status:** undecided
Which services are first-class citizens in V5 — things that must be reliable and are worth complexity to maintain.
- Define the "core" tier: services that must survive a host rebuild before anything else is restored (e.g. Vaultwarden, Nextcloud, Forgejo, DNS, Grafana).
- Define the "nice-to-have" tier.
- Are there V4 services that should be dropped in V5?
---
### D-22 · Media stack
**Status:** undecided
- Keep the full \*arr stack (Sonarr, Radarr, Lidarr, Prowlarr, Lazylibrarian)?
- Keep Jellyfin + Audiobookshelf + Calibre Web?
- Any services to add or drop?
- Gluetun VPN for qBittorrent: keep PIA, or change provider?
---
### D-23 · Communication services
**Status:** undecided
- Keep self-hosted Matrix (conduwuit + Element Web)?
- Keep Poste.io for mail — three separate instances across three hosts is the current pattern; is that the right structure?
- Keep ntfy for push notifications?
- Any desire to consolidate or simplify the comms stack?
---
### D-24 · Photo management
**Status:** undecided
PhotoPrism is currently deployed on both fisi and tembo (partially migrated). This is unresolved technical debt.
- Settle on a single host for PhotoPrism.
- Is PhotoPrism the right tool long-term, or is there an alternative to consider?
- Confirm GPU passthrough requirements (Intel Quick Sync for transcoding).
---
### D-25 · Home automation
**Status:** undecided
- Keep HAOS on twiga?
- Is twiga's current hardware sufficient?
- How tightly should Home Assistant integrate with the rest of the homelab in V5 (monitoring, VPN, etc.)?
---
### D-26 · Kiosk
**Status:** undecided
- Keep tembo as a dedicated kiosk display machine?
- Is GNOME the right desktop environment for a kiosk, or something lighter?
- Keep the current tab rotation + physical button handler?
- Should the kiosk and monitoring stack remain co-located on tembo?
---
## Laptops and Clients
### D-27 · Laptop management strategy
**Status:** undecided
- Keep Debian + XFCE on all laptops, managed by Ansible?
- Any laptops to replace or add?
- mbuzi currently has no WireGuard config — is that intentional?
- Is the multi-user XFCE model on mamba working, or is it a source of friction?
---
### D-28 · Client software stack
**Status:** undecided
- Keep the current Ansible-managed flatpak + APT stack?
- Any applications to add, replace, or drop?
- pCloud: keep as the family cloud sync provider?
- PIA VPN on laptops: keep alongside WireGuard, or consolidate?