boma/docs/runbooks/new-host.md
sjat 175777e36a docs: reconcile 2026-06-14 review findings (O1-O7,O18,O22)
- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
  table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
  ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:33 +02:00

Runbook — Adding a new managed host

Prerequisites

  • Proxmox VM template exists (Debian 13 cloud-init image; see Part A if it does not). Not needed for the control node ubongo, which is bare-metal (Part E).
  • rbw is installed and unlocked (rbw unlock) so the vault password resolves from Vaultwarden
  • The host's intended hostname and IP are decided
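
A quick preflight can catch a missing tool before any provisioning starts. The sketch below is an illustration only: need_cmds is a hypothetical helper that is not part of the repo, and the tool list in the usage note is inferred from the commands this runbook calls, not from a documented requirement.

```shell
# need_cmds: hypothetical helper, not part of the repo.
# Succeeds only if every named command is found on PATH;
# each missing command is reported on stderr.
need_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible invocation is need_cmds rbw make ssh git && rbw unlocked, where rbw unlocked exits non-zero while the vault is still locked.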

Part A — Create the Proxmox template (one-time)

Run on a Proxmox node. Only needed once per cluster.

# Download the Debian 13 genericcloud image
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2

# Create a VM (adjust ID, storage name as needed)
qm create 9000 --name debian13-template --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0

# Import the disk
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm

# Attach disk and set boot order
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --boot order=scsi0   # replaces the deprecated "--boot c --bootdisk scsi0" form

# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Enable QEMU guest agent
qm set 9000 --agent enabled=1

# Convert to template (cannot be undone)
qm template 9000
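
Because the conversion cannot be undone, it may be worth confirming that the template carries the settings applied above. The sketch below is a hypothetical helper, not part of the repo. It assumes qm config prints one "key: value" line per setting, and it only checks the template flag, the cloud-init drive on ide2, and the guest agent.

```shell
# template_config_ok: hypothetical helper, not part of the repo.
# Reads `qm config <vmid>` output on stdin and succeeds only if the VM is
# flagged as a template, has a cloud-init drive on ide2, and has the
# QEMU guest agent enabled.
template_config_ok() {
  local cfg
  cfg="$(cat)"
  printf '%s\n' "$cfg" | grep -Eq '^template: *1 *$' &&
    printf '%s\n' "$cfg" | grep -Eq '^ide2:.*cloudinit' &&
    printf '%s\n' "$cfg" | grep -Eq '^agent: *(enabled=)?1'
}
```

On the Proxmox node this could be run as qm config 9000 | template_config_ok && echo "template 9000 looks complete".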

Part B — Define the VM in Terraform

Managed hosts are created by Terraform, never by hand. Add an entry to local.vms in the environment's main.tf (terraform/environments/<env>/main.tf):

locals {
  vms = {
    <hostname> = {
      ip        = "<IP>/24"        # static; from docs/decisions/007-network.md
      group     = "docker_hosts"   # control | docker_hosts | proxmox_hosts
      cores     = 2
      memory_mb = 2048
    }
  }
}

Terraform clones the cloud-init template from Part A and sets the cloud-init values (hostname, SSH key, IP/gateway). It writes no DNS records — the dns role owns the internal zone. See ADR-009 for the full handoff and the vms output → inventory data contract.


Part C — Provision and regenerate the inventory

make tf-plan TF_ENV=production       # review — confirm only the new VM is added
make tf-apply TF_ENV=production      # create the VM (no DNS records written)
make tf-inventory TF_ENV=production  # regenerate inventories/production/hosts.yml

make tf-inventory rewrites hosts.yml from Terraform outputs — do not edit that file by hand; it carries a "do not edit manually" header and your changes would be overwritten. The source of truth is local.vms.

Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:

ssh ansible@<IP> echo ok
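
The fixed 60-second wait can be replaced by polling. The following is a minimal sketch, not part of the repo: wait_for_ssh and SSH_PROBE are hypothetical names, and the defaults of 12 attempts spaced 10 seconds apart are assumptions to tune per environment.

```shell
# Hypothetical helper, not part of the repo.
# SSH_PROBE is the command used to test reachability. The default is a
# non-interactive SSH login with a short connect timeout.
SSH_PROBE="${SSH_PROBE:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

# wait_for_ssh <user@host> [attempts] [delay_seconds]
# Retries the probe until it succeeds or the attempts run out.
wait_for_ssh() {
  local target="$1" attempts="${2:-12}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # SSH_PROBE is left unquoted on purpose so it can carry options.
    if $SSH_PROBE "$target" true >/dev/null 2>&1; then
      echo "ssh ok after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "ssh still failing after $attempts attempt(s)" >&2
  return 1
}
```

Typical use would be wait_for_ssh ansible@<IP>. Once SSH answers, ssh ansible@<IP> cloud-init status --wait blocks until cloud-init itself reports that it has finished.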

Add a host_vars/<hostname>/ directory if the host needs specific overrides (this is config, not inventory membership, so it is not generated):

mkdir -p inventories/production/host_vars/<hostname>
touch inventories/production/host_vars/<hostname>/vars.yml

Part D — Bootstrap and configure

# First-run bootstrap (handles Python installation, initial user setup)
make deploy PLAYBOOK=bootstrap

# Apply full standard state
make deploy PLAYBOOK=site

Verify the host reaches baseline:

make check PLAYBOOK=site
# Should report no changes
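
The "no changes" check can also be made mechanical by reading the PLAY RECAP lines that ansible-playbook prints. The sketch below is a hypothetical helper, not part of the repo, and it assumes make check passes the ansible-playbook output through to stdout unmodified.

```shell
# recap_is_clean: hypothetical helper, not part of the repo.
# Reads ansible-playbook output on stdin and succeeds only if at least one
# recap line was seen and every recap line shows changed=0, unreachable=0,
# and failed=0.
recap_is_clean() {
  awk '
    /changed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0( |$)/)     dirty = 1
      if ($0 !~ /unreachable=0( |$)/) dirty = 1
      if ($0 !~ /failed=0( |$)/)      dirty = 1
    }
    END {
      if (seen && !dirty) exit 0
      exit 1
    }
  '
}
```

For example: make check PLAYBOOK=site | recap_is_clean && echo "baseline reached". Input without any recap line counts as a failure, so an aborted run is not mistaken for a clean one.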

Part E — Control node (ubongo, manual exception)

The control node runs Terraform and Ansible, so it cannot be created by the Terraform it hosts (chicken-and-egg). It is ubongo, a dedicated physical machine outside the cluster — not a Proxmox guest. It is the one host provisioned manually. Rationale, hardware target, and recovery model: ADR-015.

Current state (STATUS.md): ubongo is currently managed through the operator account sjat (group_vars/control sets ansible_user: sjat) and has no dedicated ansible service user yet. Creating that user (step 2) is still pending, so the steps below describe the intended end state.

  1. Install Debian 13 on the physical box by hand (no template to clone).
  2. Create the ansible user and install its SSH public key. (Pending for ubongo — currently managed as sjat; see the note above.)
  3. Set up the Ansible environment on it:
    git clone <repo> ~/ansible
    cd ~/ansible
    make setup        # venv + Python deps
    make collections  # Ansible collections
    rbw login && rbw unlock   # vault password from Vaultwarden (see rotate-secrets.md)
    
  4. Join the mesh VPN — NetBird, self-hosted on askari (ADR-016) — so it is reachable over SSH from elsewhere.
  5. Add ubongo to inventories/<env>/hosts.yml under the control group.

Because ubongo is not in local.vms, this is the only case where editing hosts.yml by hand is expected. Known limitation: make tf-inventory regenerates hosts.yml from Terraform outputs and overwrites a hand-added control entry, so re-add ubongo after every run. Preserving the control entry in the generator is tracked separately and is not yet built.
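
Until the generator preserves the control entry, a small guard can flag when a regeneration has dropped it. The sketch below is a hypothetical helper, not part of the repo. It assumes hosts.yml uses the standard Ansible YAML inventory layout, where each host is an indented mapping key. It only confirms that the host name appears somewhere in the file, not that it sits under the control group.

```shell
# has_host: hypothetical helper, not part of the repo.
# has_host <inventory-file> <hostname>
# Succeeds if <hostname> appears as an indented YAML mapping key.
has_host() {
  local file="$1" host="$2"
  grep -Eq "^[[:space:]]+${host}:" "$file"
}
```

After make tf-inventory, it could be used as has_host inventories/production/hosts.yml ubongo || echo "re-add ubongo to the control group" >&2.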


Troubleshooting

SSH connection refused: cloud-init may still be running. Wait and retry.

Python not found: the bootstrap playbook installs Python via the raw module. If bootstrap fails, SSH to the host manually and run apt install -y python3 as root.

Firewall locked out: if nftables rules are misconfigured, connect via the Proxmox console (not SSH) and run nft flush ruleset. This clears all rules and leaves the host unfiltered, so correct the rules and re-apply them before leaving the host exposed.