boma/docs/runbooks/new-host.md
sjat 45ab6ced01 Purge residual .vault_pass references (review R1-R5)
Point ADR-005, the new-host runbook, CONTRIBUTING, and AGENTS at the
rbw/Vaultwarden flow instead of a .vault_pass file. Also record the cron-section
idea in docs/TODO.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 19:17:25 +02:00

Runbook — Adding a new managed host

Prerequisites

  • Proxmox VM template exists (Debian 13 cloud-init image; see Part A if it does not)
  • rbw is installed and unlocked (rbw unlock) so the vault password resolves from Vaultwarden
  • The host's intended hostname and IP are decided

Part A — Create the Proxmox template (one-time)

Run on a Proxmox node. Only needed once per cluster.

# Download the Debian 13 genericcloud image
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2

# Create a VM (adjust ID, storage name as needed)
qm create 9000 --name debian13-template --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0

# Import the disk
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm

# Attach disk and set boot order
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --boot order=scsi0

# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Enable the QEMU guest agent option (the guest still needs qemu-guest-agent installed and running)
qm set 9000 --agent enabled=1

# Convert to template (cannot be undone)
qm template 9000
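
Before importing the disk, it may be worth checking the downloaded image against its published checksum. The following is a minimal sketch, assuming a SHA512SUMS file sits next to the image in the same download directory and uses the usual "<digest>  <filename>" layout. The helper name verify_sha512 is hypothetical and not part of this repository.

```shell
# Sketch: compare a file's SHA-512 digest with the entry in a sums file.
# verify_sha512 is a hypothetical helper, not something this repo provides.
verify_sha512() {
  local file="$1" sums="$2" name expected actual
  name=$(basename "$file")
  # Find the expected digest for this file name ("<digest>  <name>" lines).
  expected=$(awk -v n="$name" '$2 == n { print $1; exit }' "$sums")
  if [ -z "$expected" ]; then
    echo "no checksum listed for $name" >&2
    return 2
  fi
  actual=$(sha512sum "$file" | awk '{ print $1 }')
  if [ "$expected" = "$actual" ]; then
    echo "OK: $name"
  else
    echo "MISMATCH: $name" >&2
    return 1
  fi
}

# Example (assumed file names):
#   verify_sha512 debian-13-genericcloud-amd64.qcow2 SHA512SUMS
```

A non-zero exit status means the image should not be imported; download it again and recheck.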

Part B — Define the VM in Terraform

Managed hosts are created by Terraform, never by hand. Add an entry to local.vms in the environment's main.tf (terraform/environments/<env>/main.tf):

locals {
  vms = {
    <hostname> = {
      ip        = "<IP>/24"        # static; from docs/decisions/007-network.md
      group     = "docker_hosts"   # control | docker_hosts | proxmox_hosts
      cores     = 2
      memory_mb = 2048
    }
  }
}

Terraform clones the cloud-init template from Part A, sets the cloud-init values (hostname, SSH key, IP/gateway), and writes the host's DNS A record. See ADR-009 for the full handoff and the vms output → inventory data contract.
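
A quick pre-flight check on the values can catch typos before make tf-plan. This is only a sketch: it assumes the /24 prefix and the three group names shown in the example above are the only valid ones, and validate_vm_entry is a hypothetical helper name.

```shell
# Sketch: sanity-check the ip and group values intended for a local.vms entry.
# Assumes "<IPv4>/24" and the groups control | docker_hosts | proxmox_hosts.
validate_vm_entry() {
  local ip_cidr="$1" group="$2" addr octet
  case "$group" in
    control|docker_hosts|proxmox_hosts) ;;
    *) echo "unknown group: $group" >&2; return 1 ;;
  esac
  case "$ip_cidr" in
    */24) addr=${ip_cidr%/24} ;;
    *) echo "expected <IP>/24, got: $ip_cidr" >&2; return 1 ;;
  esac
  # Must be four dot-separated groups of one to three digits.
  if ! printf '%s\n' "$addr" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'; then
    echo "not a dotted-quad IPv4 address: $addr" >&2
    return 1
  fi
  # Each octet must be at most 255.
  for octet in $(printf '%s' "$addr" | tr '.' ' '); do
    if [ "$octet" -gt 255 ]; then
      echo "octet out of range: $octet" >&2
      return 1
    fi
  done
  return 0
}

# Example (made-up values):
#   validate_vm_entry "192.0.2.10/24" docker_hosts
```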


Part C — Provision and regenerate the inventory

make tf-plan TF_ENV=production       # review — confirm only the new VM is added
make tf-apply TF_ENV=production      # create the VM + write its DNS A record
make tf-inventory TF_ENV=production  # regenerate inventories/production/hosts.yml

make tf-inventory rewrites hosts.yml from Terraform outputs — do not edit that file by hand; it carries a "do not edit manually" header and your changes would be overwritten. The source of truth is local.vms.
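
To confirm the regenerated inventory actually contains the new host, a simple lookup can help. The sketch below assumes hosts.yml lists each host as a bare YAML key on its own line (for example "web01:"); the real generated layout may differ, and host_in_inventory is a hypothetical helper name.

```shell
# Sketch: succeed if the inventory file has a line that is exactly "<host>:"
# once surrounding whitespace is stripped. The layout is an assumption.
host_in_inventory() {
  local inventory="$1" host="$2"
  awk -v key="$host:" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)
      sub(/[ \t]+$/, "", line)
      if (line == key) found = 1
    }
    END { if (found) exit 0; exit 1 }
  ' "$inventory"
}

# Example:
#   host_in_inventory inventories/production/hosts.yml <hostname>
```

Where Ansible is already set up, ansible-inventory -i inventories/production/hosts.yml --host <hostname> gives a format-independent answer and also prints the variables resolved for that host.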

Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:

ssh ansible@<IP> echo ok
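
Instead of a fixed wait, the SSH check can be retried until it succeeds. This is a minimal sketch of a generic retry loop; the helper name retry, the attempt count, and the delay are illustrative choices, not values taken from this repository.

```shell
# Sketch: run a command up to ATTEMPTS times, sleeping DELAY seconds
# between failed attempts. Returns 0 on the first success, 1 otherwise.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: try for roughly two minutes (12 attempts, 10 s apart).
#   retry 12 10 ssh -o BatchMode=yes -o ConnectTimeout=5 ansible@<IP> true
```

Once SSH answers, ssh ansible@<IP> cloud-init status --wait blocks until cloud-init itself reports completion, which is a stronger signal than an open SSH port.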

Add a host_vars/<hostname>/ directory if the host needs specific overrides (this is config, not inventory membership, so it is not generated):

mkdir -p inventories/production/host_vars/<hostname>
touch inventories/production/host_vars/<hostname>/vars.yml

Part D — Bootstrap and configure

# First-run bootstrap (handles Python installation, initial user setup)
make deploy PLAYBOOK=bootstrap

# Apply full standard state
make deploy PLAYBOOK=site

Verify the host reaches baseline:

make check PLAYBOOK=site
# Should report no changes
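
The "no changes" condition can also be checked mechanically. The sketch below assumes make check passes through the standard ansible-playbook PLAY RECAP lines (ok=, changed=, unreachable=, failed=); recap_is_clean is a hypothetical helper name.

```shell
# Sketch: read playbook output on stdin and succeed only if at least one
# recap line was seen and none reports changed, unreachable, or failed > 0.
recap_is_clean() {
  awk '
    /ok=[0-9]+/ && /changed=[0-9]+/ {
      seen = 1
      for (i = 1; i <= NF; i++) {
        n = split($i, kv, "=")
        if (n == 2 && (kv[1] == "changed" || kv[1] == "unreachable" || kv[1] == "failed") && kv[2] + 0 > 0) {
          dirty = 1
        }
      }
    }
    END { if (seen && !dirty) exit 0; exit 1 }
  '
}

# Example:
#   make check PLAYBOOK=site | tee /dev/stderr | recap_is_clean && echo "baseline reached"
```

Treating missing recap lines as a failure is deliberate: if the run aborted before the recap, the host should not be considered at baseline.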

Part E — Control node (manual exception)

The control node runs Terraform and Ansible, so it cannot be created by the Terraform it hosts (chicken-and-egg). It is the one host provisioned manually — see ADR-009 and the control-node section of ADR-005. Use the template from Part A:

# Clone the template by hand (Proxmox UI or qm clone)
qm clone 9000 <VMID> --name <hostname> --full
qm set <VMID> --memory 2048 --cores 2 \
  --ciuser ansible \
  --sshkeys /path/to/ansible_ed25519.pub \
  --ipconfig0 ip=<IP>/24,gw=<GATEWAY>
qm start <VMID>

Then set up the Ansible environment on it per ADR-005 (make setup, make collections, configure rbw, then run rbw unlock), and add it to inventories/<env>/hosts.yml under the control group. Because the control node is not in local.vms, this is the only case where editing hosts.yml by hand is expected; every other host comes from make tf-inventory.


Troubleshooting

SSH connection refused: cloud-init may still be running. Wait and retry.

Python not found: the bootstrap playbook handles this via the raw module. If bootstrap fails, SSH to the host manually and run apt install -y python3.

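Firewall locked out: if nftables rules are misconfigured, connect via the Proxmox console (not SSH) and run nft flush ruleset. This clears only the running ruleset; any saved configuration is loaded again when the nftables service restarts or the host reboots, so fix the misconfigured rules before then.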