# GENOME.md — fleet-ops

Host artifact for timmy-home issue #680. The analyzed code lives in the separate `fleet-ops` repository; this document is the curated genome written from a fresh clone of that repo at commit `38c4eab`.

## Project Overview

`fleet-ops` is the infrastructure and operations control plane for the Timmy Foundation fleet. It is not a single deployable application. It is a mixed ops repository with four overlapping layers:

1. Ansible orchestration for VPS provisioning and service rollout.
2. Small Python microservices for shared fleet state.
3. Cron- and CLI-driven operator scripts.
4. A separate local `docker-compose.yml` sandbox for a simplified all-in-one stack.

Two facts shape the repo more than anything else:

- The real fleet deployment path starts at `site.yml` → `playbooks/site.yml` and lands services through Ansible roles.
- The repo also contains several aspirational or partially wired Python modules whose names imply runtime importance but whose deployment path is weak, indirect, or missing.

Grounded metrics from the fresh analysis run:

- `python3 ~/.hermes/pipelines/codebase-genome.py --path /tmp/fleet-ops-genome --dry-run` reported `97` source files, `12` test files, `29` config files, and `16,658` total lines.
- A local filesystem count found `39` Python source files, `12` Python test files, and `74` YAML files.
- `python3 -m pytest -q --continue-on-collection-errors` produced `158 passed, 1 failed, 2 errors`.

The repo is therefore operationally substantial, but only part of that surface is coherently tested and wired.
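
The local filesystem count above can be reproduced with a short script. This is a sketch only: the clone path and the `test_` filename convention are assumptions about how the count was taken, not part of the repo.

```python
from pathlib import Path

def count_sources(root: Path) -> dict:
    """Count Python source, Python test, and YAML files under root.

    Assumes test files follow the test_*.py naming convention.
    """
    py_all = list(root.rglob("*.py"))
    py_tests = [p for p in py_all if p.name.startswith("test_")]
    yaml_files = list(root.rglob("*.yml")) + list(root.rglob("*.yaml"))
    return {
        "py_source": len(py_all) - len(py_tests),
        "py_tests": len(py_tests),
        "yaml": len(yaml_files),
    }

if __name__ == "__main__":
    # Assumed clone location from the analysis run above.
    print(count_sources(Path("/tmp/fleet-ops-genome")))
```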

## Architecture

```mermaid
graph TD
    A[site.yml] --> B[playbooks/site.yml]
    B --> C[preflight.yml]
    B --> D[baseline.yml]
    B --> E[deploy_ollama.yml]
    B --> F[deploy_gitea.yml]
    B --> G[deploy_hermes.yml]
    B --> H[deploy_conduit.yml]
    B --> I[harmony_audit role]

    G --> J[playbooks/host_vars/* wizard_instances]
    G --> K[hermes-agent role]
    K --> L[systemd wizard services]

    M[templates/fleet-deploy-hook.service] --> N[scripts/deploy-hook.py]
    N --> B

    O[playbooks/roles/message-bus/templates/busd.service.j2] --> P[message_bus.py]
    Q[playbooks/roles/knowledge-store/templates/knowledged.service.j2] --> R[knowledge_store.py]
    S[registry.yaml] --> T[health_dashboard.py]
    S --> U[scripts/registry_health_updater.py]
    S --> V[federation_sync.py]

    W[cron/dispatch-consumer.yml] --> X[scripts/dispatch_consumer.py]
    Y[morning_report_cron.yml] --> Z[scripts/morning_report_compile.py]
    AA[nightly_efficiency_cron.yml] --> AB[scripts/nightly_efficiency_report.py]
    AC[burndown_watcher_cron.yml] --> AD[scripts/burndown_cron.py]

    AE[docker-compose.yml] --> AF[local ollama]
    AE --> AG[local gitea]
    AE --> AH[agent container]
    AE --> AI[monitor loop]
```

### Structural read

The cleanest mental model is not “one app,” but “one repo that tries to be the fleet’s operator handbook, deployment engine, shared service shelf, and scratchpad.”

That produces three distinct control planes:

1. `playbooks/` is the strongest source of truth for VPS deployment.
2. `registry.yaml` and `manifest.yaml` act as runtime or operator registries for scripts.
3. `docker-compose.yml` models a separate local sandbox whose assumptions do not fully match the Ansible path.

## Entry Points

### Primary fleet deploy entry points

- `site.yml` — thin repo-root wrapper that imports `playbooks/site.yml`.
- `playbooks/site.yml` — multi-phase orchestrator for preflight, baseline, Ollama, Gitea, Hermes, Conduit, and local harmony audit.
- `playbooks/deploy_hermes.yml` — the most important service rollout for wizard instances; requires `wizard_instances` and pulls `vault_openrouter_api_key` / `vault_openai_api_key`.
- `playbooks/provision_and_deploy.yml` — DigitalOcean create-and-bootstrap path using `community.digitalocean.digital_ocean_droplet` and a dynamic `new_droplets` group.

### Deployed service entry points

- `message_bus.py` — HTTP message queue service deployed by `playbooks/roles/message-bus/templates/busd.service.j2`.
- `knowledge_store.py` — SQLite-backed shared fact service deployed by `playbooks/roles/knowledge-store/templates/knowledged.service.j2`.
- `scripts/deploy-hook.py` — webhook listener launched by `templates/fleet-deploy-hook.service` with `ExecStart=/usr/bin/python3 /opt/fleet-ops/scripts/deploy-hook.py`.

### Cron and operator entry points

- `scripts/dispatch_consumer.py` — wired by `cron/dispatch-consumer.yml`.
- `scripts/morning_report_compile.py` — wired by `morning_report_cron.yml`.
- `scripts/nightly_efficiency_report.py` — wired by `nightly_efficiency_cron.yml`.
- `scripts/burndown_cron.py` — wired by `burndown_watcher_cron.yml`.
- `scripts/fleet_readiness.py` — operator validation script for `manifest.yaml`.
- `scripts/fleet-status.py` — prints a fleet status snapshot directly from top-level code.

### CI / verification entry points

- `.gitea/workflows/ansible-lint.yml` — YAML lint, `ansible-lint`, syntax checks, inventory validation.
- `.gitea/workflows/auto-review.yml` — lightweight review workflow with YAML lint, syntax checks, secret scan, and merge-conflict probe.

### Local development stack entry point

- `docker-compose.yml` — brings up `ollama`, `gitea`, `agent`, and `monitor` for a local stack.

## Data Flow

### 1) Deploy path

1. A repo operator pushes or references deployable state.
2. `scripts/deploy-hook.py` receives the webhook.
3. The hook updates `/opt/fleet-ops`, then invokes Ansible.
4. `playbooks/site.yml` fans out into phase playbooks.
5. `playbooks/deploy_hermes.yml` renders per-instance config and systemd services from `wizard_instances` in `playbooks/host_vars/*`.
6. Services expose local `/health` endpoints on assigned ports.
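
The final step can be verified with a small health poller. This is a sketch under assumptions: the host, the JSON body shape, and the example port are placeholders (real ports live in `playbooks/host_vars/*`).

```python
"""Sketch: probe a per-instance /health endpoint after a deploy."""
import json
import urllib.error
import urllib.request

def check_health(base_url: str, timeout: float = 3.0) -> dict:
    """Return the parsed /health payload, or an error marker on failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return {"ok": resp.status == 200, "body": json.loads(resp.read())}
    except (urllib.error.URLError, OSError, ValueError) as exc:
        return {"ok": False, "error": str(exc)}

if __name__ == "__main__":
    # 8643 is the port host_vars/manifest use for Ezra; purely illustrative here.
    print(check_health("http://127.0.0.1:8643"))
```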

### 2) Shared service path

1. Agents or tools post work to `message_bus.py`.
2. Consumers poll `/messages` and inspect `/queue`, `/deadletter`, and `/audit`.
3. Facts are written into `knowledge_store.py` and federated through peer sync endpoints.
4. `health_dashboard.py` and `scripts/registry_health_updater.py` read `registry.yaml` and probe service URLs.

### 3) Reporting path

1. Cron YAML launches queue/report scripts.
2. Scripts read `~/.hermes/`, Gitea APIs, local logs, or registry files.
3. Output is emitted as JSON, markdown, or console summaries.

### Important integration fracture

`federation_sync.py` does not currently match the services it tries to coordinate.

- `message_bus.py` returns `/messages` as `{"messages": [...], "count": N}` at line 234.
- `federation_sync.py` polls `.../messages?limit=50` and then only iterates if `isinstance(data, list)` at lines 136-140.
- `federation_sync.py` also requests `.../knowledge/stats` at line 230, but `knowledge_store.py` documents `/sync/status`, `/facts`, and `/peers`, not `/knowledge/stats`.

This means the repo contains a federation layer whose assumed contracts drift from the concrete microservices beside it.
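
One minimal way to bridge the `/messages` shape mismatch is a shape-tolerant reader that accepts both the bare list `federation_sync.py` expects and the `{"messages": [...]}` envelope `message_bus.py` actually returns. This is a sketch of the pattern, not code from either module:

```python
def extract_messages(data):
    """Accept either a bare list or the {"messages": [...], "count": N} envelope.

    Anything else (None, unexpected dicts, scalars) is treated as empty,
    so a contract drift degrades to a no-op instead of a silent skip.
    """
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and isinstance(data.get("messages"), list):
        return data["messages"]
    return []
```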

## Key Abstractions

### `MessageStore` in `message_bus.py`

Core in-memory queue abstraction. It underlies:

- enqueue / poll behavior
- TTL expiry and dead-letter handling
- queue stats and audit trail endpoints

The tests in `tests/test_message_bus.py` make this one of the best-specified components in the repo.

### `KnowledgeDB` in `knowledge_store.py`

SQLite-backed fact registry with HTTP exposure for:

- storing facts
- querying and deleting facts
- peer registration
- push/pull federation
- sync status reporting

This is the nearest thing the repo has to a durable shared memory service.

### `FleetMonitor` in `health_dashboard.py`

Loads `registry.yaml`, polls wizard endpoints, caches results, and exposes both HTML and JSON views. It is the operator-facing read model of the fleet.

### `SyncEngine` in `federation_sync.py`

Intended as the bridge across message bus, audit trail, and knowledge store. The design intent is strong, but the live endpoint contracts appear out of sync.

### `ProfilePolicy` in `scripts/profile_isolation.py`

Encodes tmux/agent lifecycle policy by profile. This is one of the more disciplined “ops logic” modules: focused, testable, and bounded.

### `GenerationResult` / `VideoEngineClient` in `scripts/video_engine_client.py`

Represents the repo’s media-generation sidecar boundary. The code is small and clear, but its tests are partially stale relative to implementation behavior.

## API Surface

### `message_bus.py`

Observed HTTP surface includes:

- `POST /message`
- `GET /messages?to=<agent>&limit=<n>`
- `GET /queue`
- `GET /deadletter`
- `GET /audit`
- `GET /health`
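
A consumer of this surface might look like the following sketch. The `to`/`limit` query parameters come from the endpoint list above; the POST payload field names (`to`, `body`) and the base URL are assumptions, not a documented contract.

```python
"""Hedged client sketch for the observed message_bus.py surface."""
import json
import urllib.request

class BusClient:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def post_message(self, to: str, body: str) -> dict:
        """POST /message with an assumed JSON payload shape."""
        req = urllib.request.Request(
            f"{self.base_url}/message",
            data=json.dumps({"to": to, "body": body}).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def poll(self, to: str, limit: int = 10) -> list:
        """GET /messages, unwrapping the {"messages": [...], "count": N} envelope."""
        url = f"{self.base_url}/messages?to={to}&limit={limit}"
        with urllib.request.urlopen(url) as resp:
            data = json.loads(resp.read())
        return data.get("messages", []) if isinstance(data, dict) else data
```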

### `knowledge_store.py`

Documented surface includes:

- `POST /fact`
- `GET /facts`
- `DELETE /facts/<key>`
- `POST /sync/pull`
- `POST /sync/push`
- `GET /sync/status`
- `GET /peers`
- `POST /peers`
- `GET /health`
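
The fact-storage core behind this surface is easy to picture as an upsert-keyed SQLite table. This is an illustrative sketch only; the schema and column names are assumptions, not `knowledge_store.py`'s real layout.

```python
"""Illustrative SQLite fact table in the spirit of KnowledgeDB."""
import sqlite3

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        " key TEXT PRIMARY KEY,"
        " value TEXT,"
        " updated_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def put_fact(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Upsert: a repeated key overwrites the previous value."""
    conn.execute(
        "INSERT INTO facts(key, value) VALUES(?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value=excluded.value",
        (key, value),
    )
    conn.commit()

def get_fact(conn: sqlite3.Connection, key: str):
    row = conn.execute("SELECT value FROM facts WHERE key=?", (key,)).fetchone()
    return row[0] if row else None
```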

### `health_dashboard.py`

- `/`
- `/api/status`
- `/api/wizard/<id>`

### `scripts/deploy-hook.py`

- `/health`
- `/webhook`

### Ansible operator surface

Primary commands implied by the repo:

- `ansible-playbook -i playbooks/inventory site.yml`
- `ansible-playbook -i playbooks/inventory playbooks/provision_and_deploy.yml`
- `ansible-playbook -i playbooks/inventory playbooks/deploy_hermes.yml`

## Dependencies

### Python and shell posture

The repo is mostly Python stdlib plus Ansible/shell orchestration. It is not packaged as a single installable Python project.

### Explicit Ansible collections

`requirements.yml` declares:

- `community.docker`
- `community.general`
- `ansible.posix`

The provisioning docs and playbooks also rely on `community.digitalocean.digital_ocean_droplet` in `playbooks/provision_and_deploy.yml`.

### External service dependencies

- Gitea
- Ollama
- DigitalOcean
- systemd
- Docker / Docker Compose
- local `~/.hermes/` session and burn-log state

### Hidden runtime dependency

Several conceptual modules import `hermes_tools` directly:

- `compassion_layer.py`
- `sovereign_librarian.py`
- `sovereign_muse.py`
- `sovereign_pulse.py`
- `sovereign_sentinel.py`
- `synthesis_engine.py`

That dependency is not satisfied from within the repo and is the direct cause of the local collection errors.

## Test Coverage Gaps

### Current tested strengths

The strongest, most trustworthy tests are around:

- `tests/test_message_bus.py`
- `tests/test_knowledge_store.py`
- `tests/test_health_dashboard.py`
- `tests/test_registry_health_updater.py`
- `tests/test_profile_isolation.py`
- `tests/test_skill_scorer.py`
- `tests/test_nightly_efficiency_report.py`

Those files make the shared-service core much more legible than the deployment layer.

### Current local status

Fresh run result:

- `158 passed, 1 failed, 2 errors`

Collection errors:

- `tests/test_heart.py` fails because `compassion_layer.py` imports `hermes_tools`.
- `tests/test_synthesis.py` fails because `sovereign_librarian.py` imports `hermes_tools`.

Runnable failure:

- `tests/test_video_engine_client.py` expects `generate_draft()` to raise on HTTP 503.
- `scripts/video_engine_client.py` currently catches exceptions and returns `GenerationResult(success=False, error=...)` instead.

### High-value untested paths

The most important missing or weakly validated surfaces are:

- `scripts/deploy-hook.py` — high-blast-radius deploy trigger.
- `playbooks/deploy_gitea.yml` / `playbooks/deploy_hermes.yml` / `playbooks/provision_and_deploy.yml` — critical control plane, almost entirely untested in-repo.
- `scripts/morning_report_compile.py` — cron-facing reporting logic.
- `scripts/burndown_cron.py` and related watcher scripts.
- `scripts/generate_video.py`, `scripts/tiered_render.py`, and broader video-engine operator paths.
- `scripts/fleet-status.py` — prints directly from module scope and has no `__main__` guard.
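
The `__main__`-guard gap in `scripts/fleet-status.py` is cheap to close. The sketch below shows the standard refactor shape; the function body is a placeholder, not the script's real report logic.

```python
"""Sketch: wrap top-level report code in main() behind a __main__ guard,
so the module can be imported (and tested) without side effects."""

def main() -> int:
    # Placeholder: real logic would load registry/manifest state
    # and print the fleet status snapshot.
    print("fleet status: placeholder")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```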

### Coverage quality note

The repo’s best tests cluster around internal Python helpers. The repo’s biggest operational risk lives in deployment, cron wiring, and shell/Ansible behaviors that are not equivalently exercised.

## Security Considerations

### Strong points

- Vault use exists in `playbooks/group_vars/vault.yml` and inline vaulted material in `manifest.yaml`.
- `playbooks/deploy_gitea.yml` sets `gitea_disable_registration: true`, `gitea_require_signin: true`, and `gitea_register_act_runner: false`.
- The Hermes role renders per-instance env/config and uses systemd hardening patterns.
- Gitea, Nostr relay, and other web surfaces are designed around nginx/TLS roles.

### Concrete risks

1. `scripts/deploy-hook.py` explicitly disables signature enforcement when `DEPLOY_HOOK_SECRET` is unset.
2. `playbooks/roles/gitea/defaults/main.yml` sets `gitea_webhook_allowed_host_list: "*"`.
3. Both `ansible.cfg` files disable host key checking.
4. The repo has multiple sources of truth for ports and service topology:
   - `playbooks/host_vars/ezra-primary.yml` uses `8643`
   - `manifest.yaml` uses `8643`
   - `registry.yaml` points Ezra health to `8646`
5. `registry.yaml` advertises services like `busd`, `auditd`, and `knowledged`, but the main `playbooks/site.yml` phases do not include message-bus or knowledge-store roles.

### Drift / correctness risks that become security risks

- `playbooks/deploy_auto_merge.yml` targets `hosts: gitea_servers`, but the inventory groups visible in `playbooks/inventory` are `forge`, `vps`, `agents`, and `wizards`.
- `playbooks/roles/gitea/defaults/main.yml` includes runner labels with a probable typo: `ubuntu-22.04:docker://catthehocker/ubuntu:act-22.04`.
- The local compose quick start is not turnkey: `Dockerfile.agent` copies `requirements-agent.txt*` and `agent/`, but the runtime falls back to a tiny health/tick loop if the real agent source is absent.
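
The port-drift class of problem (`8643` vs `8646` above) is detectable with a trivial cross-file check. This is a sketch of the idea; the per-file port values are hard-coded here, whereas a real checker would parse the YAML files, whose key paths are not assumed.

```python
"""Sketch: flag services whose port differs across 'truth' files."""

def find_port_conflicts(claims: dict) -> dict:
    """Group source files by the port they claim; more than one
    distinct port for the same service means drift."""
    by_port: dict = {}
    for source, port in claims.items():
        by_port.setdefault(port, []).append(source)
    return by_port

if __name__ == "__main__":
    # The three Ezra health-port claims noted in this section.
    grouped = find_port_conflicts({
        "playbooks/host_vars/ezra-primary.yml": 8643,
        "manifest.yaml": 8643,
        "registry.yaml": 8646,
    })
    if len(grouped) > 1:
        print("port drift:", grouped)
```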

## Deployment

### VPS / real fleet path

Repo-root wrapper:

```bash
ansible-playbook -i playbooks/inventory site.yml
```

Direct orchestrator:

```bash
ansible-playbook -i playbooks/inventory playbooks/site.yml
```

Provision and bootstrap a new node:

```bash
ansible-playbook -i playbooks/inventory playbooks/provision_and_deploy.yml
```

### Local sandbox path

```bash
cp .env.example .env
docker compose up -d
```

But this path must be read skeptically. `docker-compose.yml` is a local convenience stack, while the real fleet path uses Ansible + systemd + host vars + vault-backed secrets.

## Dead Code Candidates and Operator Footguns

- `scripts/fleet-status.py` behaves like a one-shot report script with top-level execution, not a reusable CLI module.
- `README.md` ends with a visibly corrupted Nexus Watchdog section containing broken formatting.
- `Sovereign_Health_Check.md` still recommends running the broken `tests/test_heart.py` and `tests/test_synthesis.py` health suite.
- `federation_sync.py` currently looks architecturally important but is contractually out of sync with `message_bus.py` and `knowledge_store.py`.

## Bottom Line

`fleet-ops` contains the real bones of a sovereign fleet control plane, but those bones are unevenly ossified.

The strong parts are:

- the phase-based Ansible deployment structure in `playbooks/site.yml`
- the microservice-style core in `message_bus.py`, `knowledge_store.py`, and `health_dashboard.py`
- several focused Python test suites that genuinely specify behavior

The weak parts are:

- duplicated sources of truth (`playbooks/host_vars/*`, `manifest.yaml`, `registry.yaml`, local compose)
- deployment and cron surfaces that matter more operationally than they are tested
- conceptual “sovereign_*” modules that pull in `hermes_tools` and currently break local collection

If this repo were being hardened next, the highest-leverage moves would be:

1. Make the registries consistent (`8643` vs `8646`, service inventory vs deployed phases).
2. Add focused tests around `scripts/deploy-hook.py` and the deploy/report cron scripts.
3. Decide which Python modules are truly production runtime and which are prototypes, then wire or prune accordingly.
4. Collapse the number of “truth” files an operator has to trust during a deploy.