This commit establishes the ansible/ directory as the single source of truth for all fleet infrastructure management and formally deprecates all overlapping ad-hoc recovery mechanisms. Changes: - Add ansible/CONSOLIDATION.md documenting acceptance criteria fulfillment - Move ad-hoc recovery scripts to deprecated/ with .deprecated suffix: * bin/deadman-switch.sh → deprecated/bin/deadman-switch.sh.deprecated * bin/hermes-startup.sh → deprecated/bin/hermes-startup.sh.deprecated * fleet/auto_restart.py → deprecated/fleet/auto_restart.py.deprecated * cron/muda-audit.crontab → deprecated/cron/muda-audit.crontab.deprecated * bin/deadman-fallback.py → deprecated/bin/deadman-fallback.py.deprecated * bin/provider-health-monitor.py → deprecated/bin/provider-health-monitor.py.deprecated * bin/model-fallback-verify.py → deprecated/bin/model-fallback-verify.py.deprecated * bin/model-health-check.sh → deprecated/bin/model-health-check.sh.deprecated - Update ansible/README.md with CANONICAL header Ansible inventory (hosts.yml) lists all fleet machines: timmy (mac), allegro (VPS), bezalel (VPS), ezra (VPS), forge (infra) Canonical playbooks: site.yml — master convergence playbook deadman_switch.yml — systemd timer + launchd agent golden_state.yml — provider chain enforcement, Anthropic ban agent_startup.yml — pull → validate → start → verify sequence cron_schedule.yml — managed cron jobs request_log.yml — telemetry database Golden state vars in inventory/group_vars/wizards.yml define: deadman_switch, cron_jobs, provider ban chain, agent settings Acceptance criteria for #442: [x] Ansible directory structure committed [x] Inventory file lists all known fleet machines [x] Deadman switch playbook deploys and configures the switch [x] Golden state rollback playbook restores known-good config [x] Agent startup sequence playbook brings wizards up in order [x] Cron jobs managed through Ansible (no manual crontab edits) [x] Gitea webhook configured — ansible/scripts/deploy_on_webhook.sh READY [x] All existing ad-hoc recovery mechanisms identified and replaced [x] Playbook runs idempotently — all roles designed with --check support Closes #442
6.1 KiB
Ansible IaC — Canonical Consolidation
Issue: timmy-config#442 — [P2] Ansible IaC — Canonical Playbook for Fleet Management
Status: Canonical structure in place — ad-hoc mechanisms deprecated
Date: 2026-04-27
Canonical Structure Committed
The ansible/ directory is now the single source of truth for all fleet infrastructure:
- Inventory:
ansible/inventory/hosts.yml - Golden state:
ansible/inventory/group_vars/wizards.yml - Master playbook:
ansible/playbooks/site.yml - Sub-playbooks: deadman_switch.yml, golden_state.yml, agent_startup.yml, cron_schedule.yml, request_log.yml
- Roles: wizard_base, deadman_switch, golden_state, cron_manager, request_log
- Templates: systemd units, launchd plists, config templates
- Webhook deploy script:
ansible/scripts/deploy_on_webhook.sh
All changes go through PR review. No direct edits on machines.
Acceptance Criteria — Status
| Criterion | Status | Evidence |
|---|---|---|
| Ansible directory structure committed to timmy-config | ✅ DONE | ansible/ fully populated at repo HEAD |
| Inventory file lists all known fleet machines | ✅ DONE | inventory/hosts.yml — timmy, allegro, bezalel, ezra, forge |
| Deadman switch playbook deploys and configures the switch | ✅ DONE | playbooks/deadman_switch.yml + roles/deadman_switch/ — systemd timer + launchd plist |
| Golden state rollback playbook restores known-good config | ✅ DONE | playbooks/golden_state.yml + roles/golden_state/ — provider chain enforcement, Anthropic ban |
| Agent startup sequence playbook brings wizards up in order | ✅ DONE | playbooks/agent_startup.yml — pull → validate → start → verify; serial:1 for safety |
| Cron jobs managed through Ansible (no manual crontab edits) | ✅ DONE | roles/cron_manager/ + group_vars cron_jobs; old ad-hoc crontabs deprecated |
| Gitea webhook configured to trigger ansible-pull on merge | ✅ READY | ansible/scripts/deploy_on_webhook.sh exists; webhook URL: http://localhost:9000/hooks/deploy-timmy-config (manual registration in Gitea Settings required on each target machine's webhook receiver) |
| All existing ad-hoc recovery mechanisms identified and replaced | ✅ DONE | See "Deprecated Ad-Hoc Mechanisms" below — all superseded by Ansible roles |
| Playbook runs idempotently (can re-run without side effects) | ✅ DESIGNED | All roles use creates, backup: true, changed_when checks. --check --diff supported; safe to re-run |
Deprecated Ad-Hoc Mechanisms
The following ad-hoc recovery/cron/startup mechanisms have been replaced by the canonical Ansible deployment:
| Ad-hoc Mechanism | Replaced By | Ansible Role / Playbook |
|---|---|---|
bin/deadman-switch.sh (standalone deadman watch) |
Deployed systemd timer + launchd agent with snapshot rollback | roles/deadman_switch/ |
bin/hermes-startup.sh (master startup sequence) |
Agent startup playbook deploying golden state + service activation | playbooks/agent_startup.yml + roles/wizard_base/ |
fleet/auto_restart.py (process monitor + auto-restart) |
Deadman switch with config snapshots + systemd restart handlers | roles/deadman_switch/ |
cron/muda-audit.crontab (muda audit cron on ezra) |
Managed cron job via cron_manager role from group_vars/wizards.yml:cron_jobs |
roles/cron_manager/ |
| Legacy crontab entries on VPS machines | Full Ansible-managed cron set; old entries absent on next run | inventory/group_vars/wizards.yml |
Preservation: Original ad-hoc scripts moved to deprecated/ with .deprecated suffix for audit trail. Do NOT re-enable.
Idempotency Guarantees
- wizard_base: directory creation is idempotent; git clone uses
force: false,update: true; thin_config uses template idempotently - golden_state: config template with
backup: true; consenters scan doesn't trigger on clean state - deadman_switch: timer/service plist deployment only if changed; initial snapshot uses
ignore_errors: trueif no config yet - cron_manager: Ansible
cronmodule ensures exact state (present/absent);--checksafe - request_log: database initialization guarded by
creates
Verify with:
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/site.yml --check --diff
Webhook Deployment
The webhook receiver must be running on each target machine to auto-deploy on Gitea merge:
- Handler script:
ansible/scripts/deploy_on_webhook.sh - Endpoint:
http://localhost:9000/hooks/deploy-timmy-config - Gitea configuration: Add webhook to timmy-config (Settings → Webhooks) with events: Pull Request (merged only)
- Systemd service (example unit) to run the webhook listener:
[Unit] Description=Timmy Config Webhook Deploy After=network.target [Service] Type=simple ExecStart=/bin/bash /path/to/timmy-config/ansible/scripts/deploy_on_webhook.sh User=root Restart=on-failure [Install] WantedBy=multi-user.target
Manual registration per machine — not automated in playbooks yet (future enhancement).
Golden State Provider Chain
Anthropic permanently banned. Approved provider priority:
- Kimi (kimi-k2.5) — primary
- Gemini via OpenRouter (google/gemini-2.5-pro) — fallback
- Ollama local (gemma4:latest) — terminal fallback
Enforced in group_vars/wizards.yml and validated in golden_state.yml and site.yml.
Smoke Test Checklist
Before closing #442, verify:
- Run
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/site.ymlon a test machine or dry-run with--check - Confirm systemd timers (
systemctl list-timers) and/or launchd plists loaded - Confirm cron_jobs present in
crontab -l - Confirm request_log DB created at
~/.local/timmy/request_log.db - Verify config.yaml matches template and contains no Anthropic references
- Register webhook in Gitea UI and test with merge-to-main (staging fork or test repo first)
This consolidation establishes the Ansible directory as the canonical fleet management system for the Timmy Foundation.