boma/docs/runbooks/new-host.md
sjat f51ae1a13d docs(runbook): integration-testing runbook + pre-flight cross-links
- New docs/runbooks/integration-testing.md: when to use (firewall/
  sshd/boot/Docker changes); make test-integration commands; lower-
  level driver sub-commands; cert tier guidance; diagnostics dir;
  VM inspection (virsh console / SSH); safety invariants; resource
  constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying
  lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:59:06 +02:00


Runbook — Adding a new managed host

Prerequisites

  • A Proxmox VM template exists (Debian 13 cloud-init image; see Part A if not). Not needed for the control node ubongo, which is bare-metal (Part E).
  • rbw is installed and unlocked (rbw unlock) so the vault password resolves from Vaultwarden.
  • The host's intended hostname and IP are decided.

Part A — Create the Proxmox template (one-time)

Run on a Proxmox node. Only needed once per cluster.

# Download the Debian 13 genericcloud image
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2

# Create a VM (adjust ID, storage name as needed)
qm create 9000 --name debian13-template --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0

# Import the disk
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm

# Attach disk and set boot order
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --boot order=scsi0

# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Enable QEMU guest agent
qm set 9000 --agent enabled=1

# Convert to template (cannot be undone)
qm template 9000
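Before the import step above, it may be worth verifying the downloaded image. The sketch below assumes the SHA512SUMS file that Debian publishes next to the image has been downloaded into the same directory, and that GNU coreutils is available (as on a Proxmox node). verify_image is a hypothetical helper, not part of the repository.

```shell
# verify_image <dir>: check every image present in <dir> against the
# SHA512SUMS file in that directory. Entries for files that were not
# downloaded are skipped (--ignore-missing); the call fails if a checksum
# does not match or if no file could be verified at all.
verify_image() {
  dir=$1
  ( cd "$dir" && sha512sum --check --ignore-missing SHA512SUMS )
}

# Example (not run here; assumes SHA512SUMS sits next to the image):
#   wget https://cloud.debian.org/images/cloud/trixie/latest/SHA512SUMS
#   verify_image . && qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm
```

Running the check before qm importdisk keeps a corrupted or truncated download out of a template that every later VM is cloned from.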

Part B — Define the VM in Terraform

Managed hosts are created by Terraform, never by hand. Add an entry to local.vms in the environment's main.tf (terraform/environments/<env>/main.tf):

locals {
  vms = {
    <hostname> = {
      ip        = "<IP>/24"        # static; from docs/decisions/007-network.md
      group     = "docker_hosts"   # control | docker_hosts | proxmox_hosts
      cores     = 2
      memory_mb = 2048
    }
  }
}

Terraform clones the cloud-init template from Part A and sets the cloud-init values (hostname, SSH key, IP/gateway). It writes no DNS records — the dns role owns the internal zone. See ADR-009 for the full handoff and the vms output → inventory data contract.


Part C — Provision and regenerate the inventory

make tf-plan TF_ENV=production       # review — confirm only the new VM is added
make tf-apply TF_ENV=production      # create the VM (no DNS records written)
make tf-inventory TF_ENV=production  # regenerate inventories/production/hosts.yml

make tf-inventory rewrites hosts.yml from Terraform outputs — do not edit that file by hand; it carries a "do not edit manually" header and your changes would be overwritten. The source of truth is local.vms.
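The plan review in the first command can be backed by a mechanical check. This is a minimal sketch, assuming make tf-plan passes Terraform's "Plan: N to add, N to change, N to destroy." summary line through to stdout. plan_only_adds is a hypothetical helper, not an existing make target.

```shell
# Read plan output on stdin; succeed only if the summary reports at least
# one resource to add and nothing to change or destroy.
plan_only_adds() {
  grep -Eq '[1-9][0-9]* to add, 0 to change, 0 to destroy'
}

# Example (not run here):
#   make tf-plan TF_ENV=production | tee /dev/stderr | plan_only_adds \
#     || echo "plan touches existing resources; review before applying" >&2
```

The check does not replace reading the plan: it confirms only that nothing existing is modified or destroyed, not that the added resources are the intended ones.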

Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:

ssh ansible@<IP> echo ok
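Instead of a fixed wait, the SSH check can be polled until cloud-init has finished. This is a minimal sketch for a POSIX shell; the helper name retry and the attempt and delay values are illustrative assumptions, not part of the repository.

```shell
# retry <attempts> <delay-seconds> <command> [args...]
# Run the command until it succeeds or the attempt budget is used up.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    # Sleep only when another attempt will follow.
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not run here): poll for roughly two minutes.
#   retry 12 10 ssh -o ConnectTimeout=5 -o BatchMode=yes ansible@<IP> true
```

BatchMode=yes makes ssh fail instead of prompting for a password, so a missing key shows up as a failed attempt rather than a hung loop.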

Add a host_vars/<hostname>/ directory if the host needs specific overrides (this is config, not inventory membership, so it is not generated):

mkdir -p inventories/production/host_vars/<hostname>
touch inventories/production/host_vars/<hostname>/vars.yml

Part D — Bootstrap and configure

# First-run bootstrap (handles Python installation, initial user setup)
make deploy PLAYBOOK=bootstrap

# Apply full standard state
make deploy PLAYBOOK=site

Verify the host reaches baseline:

make check PLAYBOOK=site
# Should report no changes
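The "no changes" expectation can be checked mechanically. This sketch assumes make check prints the usual ansible-playbook PLAY RECAP lines (host : ok=… changed=… unreachable=… failed=…) to stdout. recap_is_clean is a hypothetical helper.

```shell
# Read ansible-playbook output on stdin. Succeed only if at least one recap
# line was seen and every recap line shows changed=0, unreachable=0, failed=0.
recap_is_clean() {
  awk '
    /changed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0([^0-9]|$)/) bad = 1
      if ($0 ~ /(unreachable|failed)=[1-9]/) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Example (not run here):
#   make check PLAYBOOK=site | tee /dev/stderr | recap_is_clean \
#     && echo "baseline reached"
```

Requiring at least one recap line means the helper fails when the playbook aborts before the recap is printed.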

Pre-flight before lockout-risky changes (firewall / sshd / boot): before applying any change that touches nftables rules, SSH configuration, or boot ordering, run make test-integration HOST=<name> and confirm that the local test VM recovers after a reboot. Keep the break-glass path (Proxmox console / Hetzner console) open throughout, and do not retire it until the integration test passes. See docs/runbooks/integration-testing.md and ADR-025.


Part E — Control node (ubongo, manual exception)

The control node runs Terraform and Ansible, so it cannot be created by the Terraform it hosts (chicken-and-egg). It is ubongo, a dedicated physical machine outside the cluster — not a Proxmox guest. It is the one host provisioned manually. Rationale, hardware target, and recovery model: ADR-015.

Current state (STATUS.md): ubongo is currently managed through the operator account sjat (group_vars/control sets ansible_user: sjat) and has no dedicated ansible service user yet. Bootstrapping that user (step 2) is still pending. The steps below describe the intended end state.

  1. Install Debian 13 on the physical box by hand (no template to clone).
  2. Create the ansible user and install its SSH public key. (Pending for ubongo — currently managed as sjat; see the note above.)
  3. Set up the Ansible environment on it:
    git clone <repo> ~/ansible
    cd ~/ansible
    make setup        # venv + Python deps
    make collections  # Ansible collections
    rbw login && rbw unlock   # vault password from Vaultwarden (see rotate-secrets.md)
    
  4. Join the mesh VPN — NetBird, self-hosted on askari (ADR-016) — so it is reachable over SSH from elsewhere.
  5. Add ubongo to inventories/<env>/hosts.yml under the control group.

Because ubongo is not in local.vms, this is the only case where editing hosts.yml by hand is expected. Known limitation: make tf-inventory regenerates hosts.yml from Terraform outputs and will overwrite a hand-added control entry — re-add ubongo after running it (preserving the control entry in the generator is tracked separately, not yet built).
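Until the generator preserves the control entry, a small guard can show when it has been dropped. This sketch assumes ubongo appears in hosts.yml as a YAML key (ubongo:). inventory_has_host is a hypothetical helper.

```shell
# inventory_has_host <hosts.yml> <hostname>
# Succeed if the hostname appears as a YAML key anywhere in the file.
inventory_has_host() {
  grep -Eq "^[[:space:]]*$2:" "$1"
}

# Example (not run here):
#   make tf-inventory TF_ENV=production
#   inventory_has_host inventories/production/hosts.yml ubongo \
#     || echo "ubongo missing from hosts.yml; re-add it under control" >&2
```

The match is textual, so it confirms only that the key is present, not that it sits under the control group.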


Troubleshooting

SSH connection refused: cloud-init may still be running. Wait and retry.

Python not found: the bootstrap playbook handles this via the raw module. If bootstrap fails, SSH to the host manually and run apt install -y python3.

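Firewall locked out: if nftables rules are misconfigured, connect via the Proxmox console (not SSH) and run nft flush ruleset. This clears only the running ruleset. The persisted configuration is loaded again when the nftables service restarts or the host reboots, so correct the rules before then.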
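Once access is restored, a corrected ruleset can be syntax-checked before it is loaded. This sketch uses nft's check mode (-c), which parses a file without applying it. The path /etc/nftables.conf is the Debian default and an assumption about where the role writes its rules. apply_nft_config is a hypothetical helper.

```shell
# apply_nft_config <file>: load a ruleset only if it passes nft's dry-run check.
apply_nft_config() {
  conf=$1
  if ! nft -c -f "$conf"; then
    echo "ruleset check failed, not applying: $conf" >&2
    return 1
  fi
  nft -f "$conf"
}

# Example (not run here; run as root from the console session):
#   apply_nft_config /etc/nftables.conf
```

A clean parse does not prove the rules keep SSH reachable, so keep the console session open until a fresh SSH login succeeds.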