- New docs/runbooks/integration-testing.md: when to use (firewall/sshd/boot/Docker
  changes); make test-integration commands; lower-level driver sub-commands; cert
  tier guidance; diagnostics dir; VM inspection (virsh console / SSH); safety
  invariants; resource constraints; adding a new profile; self-validating
  acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying lockout-risky
  changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
164 lines
5.7 KiB
Markdown
# Runbook: Adding a new managed host

## Prerequisites

- Proxmox VM template exists (Debian 13 cloud-init image; see Part A if not).
  Not needed for the control node `ubongo`, which is bare-metal (Part E).
- `rbw` is installed and unlocked (`rbw unlock`) so the vault password resolves from Vaultwarden
- The host's intended hostname and IP are decided

---

## Part A: Create the Proxmox template (one-time)

Run on a Proxmox node. Only needed once per cluster.

```bash
# Download the Debian 13 genericcloud image
wget https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2

# Create a VM (adjust ID, storage name as needed)
qm create 9000 --name debian13-template --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --serial0 socket --vga serial0

# Import the disk
qm importdisk 9000 debian-13-genericcloud-amd64.qcow2 local-lvm

# Attach disk and set boot order
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --boot c --bootdisk scsi0

# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Enable QEMU guest agent
qm set 9000 --agent enabled=1

# Convert to template (cannot be undone)
qm template 9000
```

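Before importing the disk, it may be worth checking the downloaded image against a published checksum list. The following is a minimal sketch, assuming a `SHA512SUMS` file has been fetched into the same directory as the image; the helper name `verify_image` is illustrative and not part of this repo.

```bash
# Hypothetical helper: verify one file against a checksum list in the same directory.
# Usage: verify_image <dir> <filename> [sums-file]
verify_image() {
    local dir=$1
    local file=$2
    local sums=${3:-SHA512SUMS}
    # Keep only the checksum line for the requested file, then let
    # sha512sum check it. An empty match makes sha512sum fail as well.
    ( cd "$dir" && grep " $file\$" "$sums" | sha512sum -c - >/dev/null 2>&1 )
}

# Example (not run here), from the download directory:
# verify_image . debian-13-genericcloud-amd64.qcow2 || echo "checksum mismatch" >&2
```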
---

## Part B: Define the VM in Terraform

Managed hosts are created by Terraform, never by hand. Add an entry to `local.vms`
in the environment's `main.tf` (`terraform/environments/<env>/main.tf`):

```hcl
locals {
  vms = {
    <hostname> = {
      ip        = "<IP>/24"       # static; from docs/decisions/007-network.md
      group     = "docker_hosts"  # control | docker_hosts | proxmox_hosts
      cores     = 2
      memory_mb = 2048
    }
  }
}
```

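The `ip` value is written in CIDR form (address plus prefix length), and a typo there only surfaces at plan or apply time. As a rough pre-check, the sketch below validates an IPv4 CIDR string in plain shell. It assumes IPv4 only, and `is_ipv4_cidr` is a hypothetical helper, not something Terraform or this repo provides.

```bash
# Hypothetical helper: succeed only for strings like 192.168.1.10/24.
is_ipv4_cidr() {
    local addr prefix octet rest i
    case $1 in
        */*) addr=${1%/*}; prefix=${1#*/} ;;
        *)   return 1 ;;
    esac
    # Prefix length must be a number from 0 to 32.
    case $prefix in ''|*[!0-9]*) return 1 ;; esac
    [ "$prefix" -le 32 ] || return 1
    # Address must be exactly four dot-separated octets, each 0-255.
    rest=$addr.
    for i in 1 2 3 4; do
        octet=${rest%%.*}
        rest=${rest#*.}
        case $octet in ''|*[!0-9]*) return 1 ;; esac
        [ "$octet" -le 255 ] || return 1
    done
    # Anything left over means there were more than four octets.
    [ -z "$rest" ]
}

# Example (not run here):
# is_ipv4_cidr "192.168.1.10/24" && echo "looks like a valid CIDR"
```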
Terraform clones the cloud-init template from Part A and sets the cloud-init values
(hostname, SSH key, IP/gateway). It writes no DNS records; the `dns` role owns the
internal zone. See ADR-009 for the full handoff and the `vms` output → inventory data contract.

---

## Part C: Provision and regenerate the inventory

```bash
make tf-plan TF_ENV=production       # review: confirm only the new VM is added
make tf-apply TF_ENV=production      # create the VM (no DNS records written)
make tf-inventory TF_ENV=production  # regenerate inventories/production/hosts.yml
```

`make tf-inventory` rewrites `hosts.yml` from Terraform outputs, so **do not edit
that file by hand**; it carries a "do not edit manually" header and your changes
would be overwritten. The source of truth is `local.vms`.

Wait ~60 seconds after apply for cloud-init to complete, then verify SSH access:

```bash
ssh ansible@<IP> echo ok
```

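Rather than waiting a fixed interval, the SSH check can be retried until cloud-init has finished. This is a minimal sketch: the `retry` helper and the attempt and delay values in the example are illustrative assumptions, not part of the repo's Makefile.

```bash
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
    local attempts=$1
    local delay=$2
    local n=1
    shift 2
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (not run here): up to 12 tries, 10 seconds apart.
# retry 12 10 ssh -o ConnectTimeout=5 -o BatchMode=yes ansible@<IP> echo ok
```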
Add a `host_vars/<hostname>/` directory if the host needs specific overrides
(this is config, not inventory membership, so it is not generated):

```bash
mkdir -p inventories/production/host_vars/<hostname>
touch inventories/production/host_vars/<hostname>/vars.yml
```

---

## Part D: Bootstrap and configure

```bash
# First-run bootstrap (handles Python installation, initial user setup)
make deploy PLAYBOOK=bootstrap

# Apply full standard state
make deploy PLAYBOOK=site
```

Verify the host reaches baseline:

```bash
make check PLAYBOOK=site
# Should report no changes
```

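If the baseline check is scripted rather than read by eye, the counters in Ansible's `PLAY RECAP` lines can be inspected. The sketch below assumes that `make check` passes the `ansible-playbook` recap through to stdout unchanged, which this runbook does not confirm; `recap_is_clean` is a hypothetical helper.

```bash
# Hypothetical helper: reads playbook output on stdin and succeeds only if at
# least one recap line was seen and none reports changed, unreachable, or
# failed tasks.
recap_is_clean() {
    awk '
        /ok=[0-9]+/ && /changed=[0-9]+/ {
            seen = 1
            if ($0 !~ /changed=0([^0-9]|$)/)     bad = 1
            if ($0 !~ /unreachable=0([^0-9]|$)/) bad = 1
            if ($0 !~ /failed=0([^0-9]|$)/)      bad = 1
        }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# Example (not run here):
# make check PLAYBOOK=site | recap_is_clean && echo "baseline reached"
```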
> **Pre-flight before lockout-risky changes (firewall / sshd / boot):** before applying
> any change that touches nftables rules, SSH configuration, or boot ordering, run
> `make test-integration HOST=<name>` and confirm reboot-recovery on the local VM
> **while the break-glass (Proxmox console / Hetzner console) is still open**. Do not
> retire the break-glass until the integration test passes. See
> `docs/runbooks/integration-testing.md` and ADR-025.

---

## Part E: Control node (`ubongo`, manual exception)

The control node runs Terraform and Ansible, so it cannot be created by the
Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
machine outside the cluster, not a Proxmox guest. It is the **one** host
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.

> **Current state (STATUS.md):** `ubongo` is currently managed through the operator
> account `sjat` (`group_vars/control` sets `ansible_user: sjat`); it has **no** dedicated
> `ansible` service user yet. The dedicated-`ansible`-user bootstrap (step 2) is a
> **pending** item. The steps below describe the intended end state.

1. Install Debian 13 on the physical box by hand (no template to clone).
2. Create the `ansible` user and install its SSH public key. *(Pending for `ubongo`:
   currently managed as `sjat`; see the note above.)*
3. Set up the Ansible environment on it:
   ```bash
   git clone <repo> ~/ansible
   cd ~/ansible
   make setup              # venv + Python deps
   make collections        # Ansible collections
   rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
   ```
4. Join the mesh VPN (NetBird, self-hosted on `askari`; see ADR-016) so it is
   reachable over SSH from elsewhere.
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.

Because `ubongo` is not in `local.vms`, this is the only case where editing
`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
`control` entry, so re-add `ubongo` after running it (preserving the control entry in
the generator is tracked separately and is not yet built).

---

## Troubleshooting

**SSH connection refused**: cloud-init may still be running. Wait and retry.

**Python not found**: the bootstrap playbook handles this via the `raw` module.
If bootstrap fails, SSH to the host manually and run `apt install -y python3`.

**Firewall locked out**: if nftables rules are misconfigured, connect via the
Proxmox console (not SSH) and run `nft flush ruleset` to clear all rules temporarily.
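
Because `nft flush ruleset` discards the loaded rules, it can help to save them first so the misconfiguration can be inspected afterwards. The following is a sketch to run as root on the console; the default backup path is an arbitrary assumption and `flush_with_backup` is a hypothetical helper, not part of this repo.

```bash
# Hypothetical helper: dump the live ruleset to a file, then flush it.
# Usage: flush_with_backup [backup-file]
flush_with_backup() {
    local backup=${1:-/root/nft-ruleset.bak}
    # Flush only if the dump succeeded, so the rules are never lost unsaved.
    nft list ruleset > "$backup" && nft flush ruleset
}

# Example (not run here):
# flush_with_backup /root/nft-ruleset.bak
# The saved file is meant for inspection. Reloading it unchanged with
# `nft -f /root/nft-ruleset.bak` would bring back the same lockout.
```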