Runbook — Adding a new managed host
Prerequisites
- Proxmox VM template exists (Debian 13 cloud-init image; see below if not). Not needed for the control node ubongo, which is bare-metal (Part E).
- rbw is installed and unlocked (rbw unlock) so the vault password resolves from Vaultwarden.
- The host's intended hostname and IP are decided.
Part A — Create the Proxmox template (one-time)
Run on a Proxmox node. Only needed once per cluster.
# Download the Debian 13 genericcloud image
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2
# Create a VM (adjust ID, storage name as needed)
qm create 9000 --name debian13-template --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0
# Import the disk
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm
# Attach disk and set boot order
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --boot c --bootdisk scsi0
# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit
# Enable QEMU guest agent
qm set 9000 --agent enabled=1
# Convert to template (cannot be undone)
qm template 9000
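Because the template cannot be rebuilt in place once converted, it may be worth verifying the downloaded image before importing it. The following is a minimal sketch, not part of the repo: the function name verify_image is hypothetical, and it assumes a SHA512SUMS file is published alongside the image in the same directory on cloud.debian.org.

```shell
# Verify an image against a SHA512SUMS-style file ("<hash>  <filename>" per line).
# Usage: verify_image <image-file> <sums-file>
# verify_image is a hypothetical helper, not something the repo provides.
verify_image() {
  image="$1"
  sums="$2"
  name=$(basename "$image")
  # Look up the published hash for this file name (text or binary-mode entry).
  expected=$(awk -v n="$name" '$2 == n || $2 == ("*" n) { print $1; exit }' "$sums")
  # Hash the local file.
  actual=$(sha512sum "$image" | awk '{ print $1 }')
  # Fail if the file is not listed or the hashes differ.
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example, run before "qm importdisk" (the SHA512SUMS URL is an assumption):
# wget https://cloud.debian.org/images/cloud/trixie/latest/SHA512SUMS
# verify_image debian-13-genericcloud-amd64.qcow2 SHA512SUMS || echo "checksum mismatch" >&2
```

A non-zero exit means either the file is not listed in the sums file or its hash differs; in both cases the image should be downloaded again rather than imported.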
Part B — Define the VM in Terraform
Managed hosts are created by Terraform, never by hand. Add an entry to local.vms
in the environment's main.tf (terraform/environments/<env>/main.tf):
locals {
  vms = {
    <hostname> = {
      ip        = "<IP>/24"      # static; from docs/decisions/007-network.md
      group     = "docker_hosts" # control | docker_hosts | proxmox_hosts
      cores     = 2
      memory_mb = 2048
    }
  }
}
Terraform clones the cloud-init template from Part A and sets the cloud-init values
(hostname, SSH key, IP/gateway). It writes no DNS records — the dns role owns the
internal zone. See ADR-009 for the full handoff and the vms output → inventory data contract.
Part C — Provision and regenerate the inventory
make tf-plan TF_ENV=production # review — confirm only the new VM is added
make tf-apply TF_ENV=production # create the VM (no DNS records written)
make tf-inventory TF_ENV=production # regenerate inventories/production/hosts.yml
make tf-inventory rewrites hosts.yml from Terraform outputs — do not edit
that file by hand; it carries a "do not edit manually" header and your changes
would be overwritten. The source of truth is local.vms.
Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:
ssh ansible@<IP> echo ok
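The fixed wait can be replaced by polling. As a sketch only (the retry helper below is hypothetical and not part of the repo's Makefile), a small loop can repeat the SSH check until cloud-init has finished or an attempt budget runs out:

```shell
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
# retry is a hypothetical helper for illustration.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  # Run the command; on failure, sleep and try again up to <attempts> times.
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Example: up to 12 tries, 10 seconds apart (<IP> is the new host's address).
# retry 12 10 ssh -o ConnectTimeout=5 -o BatchMode=yes ansible@<IP> echo ok
```

BatchMode=yes makes ssh fail instead of prompting for a password, so the loop cannot hang on an interactive prompt while the key is not yet installed.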
Add a host_vars/<hostname>/ directory if the host needs specific overrides
(this is config, not inventory membership, so it is not generated):
mkdir -p inventories/production/host_vars/<hostname>
touch inventories/production/host_vars/<hostname>/vars.yml
Part D — Bootstrap and configure
# First-run bootstrap (handles Python installation, initial user setup)
make deploy PLAYBOOK=bootstrap
# Apply full standard state
make deploy PLAYBOOK=site
Verify the host reaches baseline:
make check PLAYBOOK=site
# Should report no changes
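The "no changes" check can also be scripted. The sketch below assumes that make check passes through ansible-playbook's standard PLAY RECAP lines (ok=N changed=N unreachable=N failed=N ...) unmodified; that is an assumption about the Makefile, and recap_is_clean is a hypothetical helper name.

```shell
# Read playbook output on stdin; succeed only if at least one recap line is
# present and every recap line shows changed=0, unreachable=0 and failed=0.
# recap_is_clean is a hypothetical helper for illustration.
recap_is_clean() {
  awk '
    /changed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0([^0-9]|$)/) bad = 1
      if ($0 ~ /unreachable=[1-9]/)    bad = 1
      if ($0 ~ /failed=[1-9]/)         bad = 1
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }
  '
}

# Example: keep the full log, then test it.
# make check PLAYBOOK=site | tee check.log
# recap_is_clean < check.log && echo "baseline reached"
```

Output with no recap line at all is treated as a failure, so an aborted run is not mistaken for a clean one.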
Part E — Control node (ubongo, manual exception)
The control node runs Terraform and Ansible, so it cannot be created by the
Terraform it hosts (chicken-and-egg). It is ubongo, a dedicated physical
machine outside the cluster — not a Proxmox guest. It is the one host
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
Current state (STATUS.md):
ubongo is today managed as the operator account sjat (group_vars/control sets ansible_user: sjat); it has no dedicated ansible service user yet. The dedicated-ansible-user bootstrap (step 2) is a pending item. Steps below describe the intended end state.
1. Install Debian 13 on the physical box by hand (no template to clone).
2. Create the ansible user and install its SSH public key. (Pending for ubongo, which is currently managed as sjat; see the note above.)
3. Set up the Ansible environment on it:
   git clone <repo> ~/ansible
   cd ~/ansible
   make setup               # venv + Python deps
   make collections         # Ansible collections
   rbw login && rbw unlock  # vault password from Vaultwarden (see rotate-secrets.md)
4. Join the mesh VPN (NetBird, self-hosted on askari; see ADR-016) so it is reachable over SSH from elsewhere.
5. Add ubongo to inventories/<env>/hosts.yml under the control group.
Because ubongo is not in local.vms, this is the only case where editing
hosts.yml by hand is expected. Known limitation: make tf-inventory
regenerates hosts.yml from Terraform outputs and will overwrite a hand-added
control entry — re-add ubongo after running it (preserving the control entry in
the generator is tracked separately, not yet built).
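Until the generator preserves the control entry, a quick check after each regeneration can catch a dropped ubongo. This is a sketch: inventory_has_host is a hypothetical helper, and it assumes hosts.yml is a standard Ansible YAML inventory in which each host appears as a mapping key (for example "ubongo:") on its own line.

```shell
# Succeed if <host> appears as a YAML mapping key in <inventory-file>.
# Usage: inventory_has_host <inventory-file> <host>
# inventory_has_host is a hypothetical helper for illustration.
inventory_has_host() {
  grep -Eq "^[[:space:]]*$2:([[:space:]]|\$)" "$1"
}

# Example: warn when regeneration has dropped the control node.
# make tf-inventory TF_ENV=production
# inventory_has_host inventories/production/hosts.yml ubongo \
#   || echo "ubongo missing: re-add it under the control group" >&2
```

The check only looks for the key, not for its group, so it confirms that the host is present but not that it sits under control.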
Troubleshooting
SSH connection refused: cloud-init may still be running. Wait and retry.
Python not found: the bootstrap playbook handles this via the raw module.
If bootstrap fails, SSH to the host manually and run apt install -y python3.
Firewall locked out: if nftables rules are misconfigured, connect via
Proxmox console (not SSH) and run nft flush ruleset to clear all rules temporarily.