timmy-home/genomes/fleet-ops-GENOME.md
2026-04-15 21:36:08 -04:00

GENOME.md — fleet-ops

Host artifact for timmy-home issue #680. The analyzed code lives in the separate fleet-ops repository; this document is the curated genome written from a fresh clone of that repo at commit 38c4eab.

Project Overview

fleet-ops is the infrastructure and operations control plane for the Timmy Foundation fleet. It is not a single deployable application. It is a mixed ops repository with four overlapping layers:

  1. Ansible orchestration for VPS provisioning and service rollout.
  2. Small Python microservices for shared fleet state.
  3. Cron- and CLI-driven operator scripts.
  4. A separate local docker-compose.yml sandbox for a simplified all-in-one stack.

Two facts shape the repo more than anything else:

  • The real fleet deployment path starts at site.yml, which imports playbooks/site.yml, and lands services through Ansible roles.
  • The repo also contains several aspirational or partially wired Python modules whose names imply runtime importance but whose deployment path is weak, indirect, or missing.

Grounded metrics from the fresh analysis run:

  • python3 ~/.hermes/pipelines/codebase-genome.py --path /tmp/fleet-ops-genome --dry-run reported 97 source files, 12 test files, 29 config files, and 16,658 total lines.
  • A local filesystem count found 39 Python source files, 12 Python test files, and 74 YAML files.
  • python3 -m pytest -q --continue-on-collection-errors produced 158 passed, 1 failed, 2 errors.

The repo is therefore operationally substantial, but only part of that surface is coherently tested and wired.
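The local filesystem count above can be reproduced with a few lines of stdlib Python. This is an illustrative sketch, not the genome pipeline's actual counting logic (its exclusion rules may differ):

```python
from pathlib import Path

def count_sources(root: str) -> dict:
    """Count Python sources, Python tests, and YAML files under root."""
    base = Path(root)
    py = list(base.rglob("*.py"))
    tests = [p for p in py if p.name.startswith("test_")]
    yaml = list(base.rglob("*.yml")) + list(base.rglob("*.yaml"))
    return {
        "python_source": len(py) - len(tests),
        "python_tests": len(tests),
        "yaml": len(yaml),
    }
```

Running `count_sources("/tmp/fleet-ops-genome")` against the fresh clone should land near the 39 / 12 / 74 figures quoted above.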

Architecture

graph TD
    A[site.yml] --> B[playbooks/site.yml]
    B --> C[preflight.yml]
    B --> D[baseline.yml]
    B --> E[deploy_ollama.yml]
    B --> F[deploy_gitea.yml]
    B --> G[deploy_hermes.yml]
    B --> H[deploy_conduit.yml]
    B --> I[harmony_audit role]

    G --> J[playbooks/host_vars/* wizard_instances]
    G --> K[hermes-agent role]
    K --> L[systemd wizard services]

    M[templates/fleet-deploy-hook.service] --> N[scripts/deploy-hook.py]
    N --> B

    O[playbooks/roles/message-bus/templates/busd.service.j2] --> P[message_bus.py]
    Q[playbooks/roles/knowledge-store/templates/knowledged.service.j2] --> R[knowledge_store.py]
    S[registry.yaml] --> T[health_dashboard.py]
    S --> U[scripts/registry_health_updater.py]
    S --> V[federation_sync.py]

    W[cron/dispatch-consumer.yml] --> X[scripts/dispatch_consumer.py]
    Y[morning_report_cron.yml] --> Z[scripts/morning_report_compile.py]
    AA[nightly_efficiency_cron.yml] --> AB[scripts/nightly_efficiency_report.py]
    AC[burndown_watcher_cron.yml] --> AD[scripts/burndown_cron.py]

    AE[docker-compose.yml] --> AF[local ollama]
    AE --> AG[local gitea]
    AE --> AH[agent container]
    AE --> AI[monitor loop]

Structural read

The cleanest mental model is not “one app,” but “one repo that tries to be the fleet's operator handbook, deployment engine, shared service shelf, and scratchpad.”

That produces three distinct control planes:

  1. playbooks/ is the strongest source of truth for VPS deployment.
  2. registry.yaml and manifest.yaml act as runtime or operator registries for scripts.
  3. docker-compose.yml models a separate local sandbox whose assumptions do not fully match the Ansible path.

Entry Points

Primary fleet deploy entry points

  • site.yml — thin repo-root wrapper that imports playbooks/site.yml.
  • playbooks/site.yml — multi-phase orchestrator for preflight, baseline, Ollama, Gitea, Hermes, Conduit, and local harmony audit.
  • playbooks/deploy_hermes.yml — the most important service rollout for wizard instances; requires wizard_instances and pulls vault_openrouter_api_key / vault_openai_api_key.
  • playbooks/provision_and_deploy.yml — DigitalOcean create-and-bootstrap path using community.digitalocean.digital_ocean_droplet and a dynamic new_droplets group.

Deployed service entry points

  • message_bus.py — HTTP message queue service deployed by playbooks/roles/message-bus/templates/busd.service.j2.
  • knowledge_store.py — SQLite-backed shared fact service deployed by playbooks/roles/knowledge-store/templates/knowledged.service.j2.
  • scripts/deploy-hook.py — webhook listener launched by templates/fleet-deploy-hook.service with ExecStart=/usr/bin/python3 /opt/fleet-ops/scripts/deploy-hook.py.

Cron and operator entry points

  • scripts/dispatch_consumer.py — wired by cron/dispatch-consumer.yml.
  • scripts/morning_report_compile.py — wired by morning_report_cron.yml.
  • scripts/nightly_efficiency_report.py — wired by nightly_efficiency_cron.yml.
  • scripts/burndown_cron.py — wired by burndown_watcher_cron.yml.
  • scripts/fleet_readiness.py — operator validation script for manifest.yaml.
  • scripts/fleet-status.py — prints a fleet status snapshot directly from top-level code.

CI / verification entry points

  • .gitea/workflows/ansible-lint.yml — YAML lint, ansible-lint, syntax checks, inventory validation.
  • .gitea/workflows/auto-review.yml — lightweight review workflow with YAML lint, syntax checks, secret scan, and merge-conflict probe.

Local development stack entry point

  • docker-compose.yml — brings up ollama, gitea, agent, and monitor for a local stack.

Data Flow

1) Deploy path

  1. A repo operator pushes or references deployable state.
  2. scripts/deploy-hook.py receives the webhook.
  3. The hook updates /opt/fleet-ops, then invokes Ansible.
  4. playbooks/site.yml fans into phase playbooks.
  5. playbooks/deploy_hermes.yml renders per-instance config and systemd services from wizard_instances in playbooks/host_vars/*.
  6. Services expose local /health endpoints on assigned ports.

2) Shared service path

  1. Agents or tools post work to message_bus.py.
  2. Consumers poll /messages and inspect /queue, /deadletter, and /audit.
  3. Facts are written into knowledge_store.py and federated through peer sync endpoints.
  4. health_dashboard.py and scripts/registry_health_updater.py read registry.yaml and probe service URLs.

3) Reporting path

  1. Cron YAML launches queue/report scripts.
  2. Scripts read ~/.hermes/, Gitea APIs, local logs, or registry files.
  3. Output is emitted as JSON, markdown, or console summaries.

Important integration fracture

federation_sync.py does not currently match the services it tries to coordinate.

  • message_bus.py returns /messages as {"messages": [...], "count": N} at line 234.
  • federation_sync.py polls .../messages?limit=50 and then only iterates if isinstance(data, list) at lines 136-140.
  • federation_sync.py also requests .../knowledge/stats at line 230, but knowledge_store.py documents /sync/status, /facts, and /peers, not /knowledge/stats.

This means the repo contains a federation layer whose assumed contracts drift from the concrete microservices beside it.
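Until the contracts converge, a consumer can tolerate both response shapes. A hedged sketch: the envelope field names come from the message_bus.py response quoted above, and the bare-list form is what federation_sync.py currently expects.

```python
def extract_messages(payload):
    """Accept either a bare list or a {"messages": [...], "count": N} envelope."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict) and isinstance(payload.get("messages"), list):
        return payload["messages"]
    raise ValueError(f"unrecognized /messages payload: {type(payload).__name__}")
```

Failing loudly on an unknown shape is deliberate: silently iterating nothing is exactly how the current drift stays invisible.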

Key Abstractions

MessageStore in message_bus.py

Core in-memory queue abstraction. It underlies:

  • enqueue / poll behavior
  • TTL expiry and dead-letter handling
  • queue stats and audit trail endpoints

The tests in tests/test_message_bus.py make this one of the best-specified components in the repo.
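The behaviors listed above (enqueue/poll, TTL expiry, dead-lettering) can be pictured with a much-simplified sketch; the class and method names here are illustrative, not the repo's actual API:

```python
import time
from collections import deque

class TinyMessageStore:
    """Illustrative enqueue/poll queue with TTL expiry into a dead-letter list."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.queue = deque()   # (enqueued_at, message) pairs, oldest first
        self.deadletter = []   # messages that expired before being polled

    def enqueue(self, message: dict) -> None:
        self.queue.append((time.monotonic(), message))

    def _expire(self) -> None:
        now = time.monotonic()
        while self.queue and now - self.queue[0][0] > self.ttl:
            _, msg = self.queue.popleft()
            self.deadletter.append(msg)

    def poll(self, limit: int = 10) -> list:
        self._expire()
        out = []
        while self.queue and len(out) < limit:
            out.append(self.queue.popleft()[1])
        return out
```

The real MessageStore also maintains per-queue stats and an audit trail; this sketch only shows the queue / TTL / dead-letter triangle.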

KnowledgeDB in knowledge_store.py

SQLite-backed fact registry with HTTP exposure for:

  • storing facts
  • querying and deleting facts
  • peer registration
  • push/pull federation
  • sync status reporting

This is the nearest thing the repo has to a durable shared memory service.
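The core store-query-delete cycle reduces to a small SQLite pattern. A hedged sketch; the table schema and class name are illustrative, not knowledge_store.py's actual layout:

```python
import sqlite3

class TinyFactStore:
    """Illustrative SQLite-backed key/value fact store (not the repo's schema)."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts (key TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, key: str, value: str) -> None:
        # Upsert: overwrite an existing fact rather than erroring on the key.
        self.db.execute(
            "INSERT INTO facts (key, value) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (key, value),
        )
        self.db.commit()

    def get(self, key: str):
        row = self.db.execute(
            "SELECT value FROM facts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def delete(self, key: str) -> None:
        self.db.execute("DELETE FROM facts WHERE key = ?", (key,))
        self.db.commit()
```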

FleetMonitor in health_dashboard.py

Loads registry.yaml, polls wizard endpoints, caches results, and exposes both HTML and JSON views. It is the operator-facing read model of the fleet.
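The poll step can be pictured as repeated calls to a probe helper like this sketch; the function name and return shape are illustrative, and the real health_dashboard.py adds caching on top:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def probe(url: str, timeout: float = 3.0) -> dict:
    """Probe one /health URL; never raise, always return a status record."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return {"url": url, "ok": resp.status == 200, "body": body}
    except (URLError, ValueError, OSError) as exc:
        # A down wizard should show as unhealthy, not crash the dashboard.
        return {"url": url, "ok": False, "error": str(exc)}
```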

SyncEngine in federation_sync.py

Intended as the bridge across message bus, audit trail, and knowledge store. The design intent is strong, but the live endpoint contracts appear out of sync.

ProfilePolicy in scripts/profile_isolation.py

Encodes tmux/agent lifecycle policy by profile. This is one of the more disciplined “ops logic” modules: focused, testable, and bounded.

GenerationResult / VideoEngineClient in scripts/video_engine_client.py

Represents the repo's media-generation sidecar boundary. The code is small and clear, but its tests are partially stale relative to implementation behavior.

API Surface

message_bus.py

Observed HTTP surface includes:

  • POST /message
  • GET /messages?to=<agent>&limit=<n>
  • GET /queue
  • GET /deadletter
  • GET /audit
  • GET /health
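A minimal stdlib client against this surface might look like the following sketch. The query parameters match the routes listed above; the POST body field names ("to", "body") are assumptions, so check message_bus.py for the real envelope:

```python
import json
from urllib import request

def post_message(base: str, to: str, body: dict) -> None:
    """POST one envelope to /message; body field names are illustrative guesses."""
    payload = json.dumps({"to": to, "body": body}).encode("utf-8")
    req = request.Request(
        f"{base}/message",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req, timeout=5).read()

def fetch_messages(base: str, to: str, limit: int = 10) -> list:
    """GET /messages and unwrap the {"messages": [...], "count": N} envelope."""
    url = f"{base}/messages?to={to}&limit={limit}"
    with request.urlopen(url, timeout=5) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data.get("messages", [])
```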

knowledge_store.py

Documented surface includes:

  • POST /fact
  • GET /facts
  • DELETE /facts/<key>
  • POST /sync/pull
  • POST /sync/push
  • GET /sync/status
  • GET /peers
  • POST /peers
  • GET /health

health_dashboard.py

  • /
  • /api/status
  • /api/wizard/<id>

scripts/deploy-hook.py

  • /health
  • /webhook

Ansible operator surface

Primary commands implied by the repo:

  • ansible-playbook -i playbooks/inventory site.yml
  • ansible-playbook -i playbooks/inventory playbooks/provision_and_deploy.yml
  • ansible-playbook -i playbooks/inventory playbooks/deploy_hermes.yml

Dependencies

Python and shell posture

The repo is mostly Python stdlib plus Ansible/shell orchestration. It is not packaged as a single installable Python project.

Explicit Ansible collections

requirements.yml declares:

  • community.docker
  • community.general
  • ansible.posix

The provisioning docs and playbooks also rely on community.digitalocean.digital_ocean_droplet in playbooks/provision_and_deploy.yml.

External service dependencies

  • Gitea
  • Ollama
  • DigitalOcean
  • systemd
  • Docker / Docker Compose
  • local ~/.hermes/ session and burn-log state

Hidden runtime dependency

Several conceptual modules import hermes_tools directly:

  • compassion_layer.py
  • sovereign_librarian.py
  • sovereign_muse.py
  • sovereign_pulse.py
  • sovereign_sentinel.py
  • synthesis_engine.py

That dependency is not self-contained inside the repo and directly causes the local collection errors.

Test Coverage Gaps

Current tested strengths

The strongest, most trustworthy tests are around:

  • tests/test_message_bus.py
  • tests/test_knowledge_store.py
  • tests/test_health_dashboard.py
  • tests/test_registry_health_updater.py
  • tests/test_profile_isolation.py
  • tests/test_skill_scorer.py
  • tests/test_nightly_efficiency_report.py

Those files make the shared-service core much more legible than the deployment layer.

Current local status

Fresh run result:

  • 158 passed, 1 failed, 2 errors

Collection errors:

  • tests/test_heart.py fails because compassion_layer.py imports hermes_tools.
  • tests/test_synthesis.py fails because sovereign_librarian.py imports hermes_tools.

Runnable failure:

  • tests/test_video_engine_client.py expects generate_draft() to raise on HTTP 503.
  • scripts/video_engine_client.py currently catches exceptions and returns GenerationResult(success=False, error=...) instead.
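The mismatch is a contract-style disagreement: the test assumes raise-on-error, the client implements return-a-result. A sketch of a test aligned to the current behavior; the GenerationResult fields and the stub are assumed from the failure description above, not copied from the repo:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationResult:
    """Assumed shape; check scripts/video_engine_client.py for the real fields."""
    success: bool
    error: Optional[str] = None

def generate_draft_stub(http_status: int) -> GenerationResult:
    """Stand-in for the client's catch-and-return behavior on HTTP errors."""
    if http_status >= 500:
        return GenerationResult(success=False, error=f"HTTP {http_status}")
    return GenerationResult(success=True)

def test_503_returns_failure_result():
    # Written against the result-object contract the code implements today,
    # rather than the raise-on-503 contract the stale test expects.
    result = generate_draft_stub(503)
    assert result.success is False
    assert "503" in (result.error or "")
```

Either contract is defensible; the point is that the test and the implementation should pick the same one.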

High-value untested paths

The most important missing or weakly validated surfaces are:

  • scripts/deploy-hook.py — high-blast-radius deploy trigger.
  • playbooks/deploy_gitea.yml / playbooks/deploy_hermes.yml / playbooks/provision_and_deploy.yml — critical control plane, almost entirely untested in-repo.
  • scripts/morning_report_compile.py — cron-facing reporting logic.
  • scripts/burndown_cron.py and related watcher scripts.
  • scripts/generate_video.py, scripts/tiered_render.py, and broader video-engine operator paths.
  • scripts/fleet-status.py — prints directly from module scope and has no __main__ guard.
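The fleet-status.py gap is cheap to close: moving the top-level printing behind a main() guard makes the module importable and testable. An illustrative restructuring, with hypothetical function names and a placeholder registry shape:

```python
import sys

def build_status_lines(registry: dict) -> list:
    """Render a status snapshot from an already-loaded registry dict."""
    return [f"{name}: {svc.get('health', 'unknown')}" for name, svc in registry.items()]

def main(argv=None) -> int:
    # Placeholder data; the real script would load registry.yaml here.
    registry = {"ezra": {"health": "ok"}}
    for line in build_status_lines(registry):
        print(line)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

With this shape, tests can exercise build_status_lines() directly without triggering any I/O at import time.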

Coverage quality note

The repo's best tests cluster around internal Python helpers. The repo's biggest operational risk lives in deployment, cron wiring, and shell/Ansible behaviors that are not equivalently exercised.

Security Considerations

Strong points

  • Vault use exists in playbooks/group_vars/vault.yml and inline vaulted material in manifest.yaml.
  • playbooks/deploy_gitea.yml sets gitea_disable_registration: true, gitea_require_signin: true, and gitea_register_act_runner: false.
  • The Hermes role renders per-instance env/config and uses systemd hardening patterns.
  • Gitea, Nostr relay, and other web surfaces are designed around nginx/TLS roles.

Concrete risks

  1. scripts/deploy-hook.py explicitly disables signature enforcement when DEPLOY_HOOK_SECRET is unset.
  2. playbooks/roles/gitea/defaults/main.yml sets gitea_webhook_allowed_host_list: "*".
  3. Both ansible.cfg files disable host key checking.
  4. The repo has multiple sources of truth for ports and service topology:
    • playbooks/host_vars/ezra-primary.yml uses 8643
    • manifest.yaml uses 8643
    • registry.yaml points Ezra health to 8646
  5. registry.yaml advertises services like busd, auditd, and knowledged, but the main playbooks/site.yml phases do not include message-bus or knowledge-store roles.
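Risk 1 has a direct fix pattern: fail closed when no secret is configured, and verify signatures in constant time. A hedged sketch; the env var name comes from the text, while the "sha256=" header prefix is an assumption about the webhook format:

```python
import hashlib
import hmac
import os

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Constant-time check of an HMAC-SHA256 webhook signature."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # Many forges send "sha256=<hex>"; strip the prefix if present.
    received = signature_header.removeprefix("sha256=")
    return hmac.compare_digest(expected, received)

def require_secret() -> str:
    """Fail closed: refuse to serve webhooks when no secret is configured."""
    secret = os.environ.get("DEPLOY_HOOK_SECRET")
    if not secret:
        raise SystemExit("DEPLOY_HOOK_SECRET is unset; refusing to start")
    return secret
```

hmac.compare_digest matters here: a plain `==` comparison leaks timing information an attacker can use to recover the signature byte by byte.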

Drift / correctness risks that become security risks

  • playbooks/deploy_auto_merge.yml targets hosts: gitea_servers, but the inventory groups visible in playbooks/inventory are forge, vps, agents, and wizards.
  • playbooks/roles/gitea/defaults/main.yml includes runner labels with a probable typo: ubuntu-22.04:docker://catthehocker/ubuntu:act-22.04.
  • The local compose quick start is not turnkey: Dockerfile.agent copies requirements-agent.txt* and agent/, but the runtime falls back to a tiny health/tick loop if the real agent source is absent.

Deployment

VPS / real fleet path

Repo-root wrapper:

ansible-playbook -i playbooks/inventory site.yml

Direct orchestrator:

ansible-playbook -i playbooks/inventory playbooks/site.yml

Provision and bootstrap a new node:

ansible-playbook -i playbooks/inventory playbooks/provision_and_deploy.yml

Local sandbox path

cp .env.example .env
docker compose up -d

But this path must be read skeptically. docker-compose.yml is a local convenience stack, while the real fleet path uses Ansible + systemd + host vars + vault-backed secrets.

Dead Code Candidates and Operator Footguns

  • scripts/fleet-status.py behaves like a one-shot report script with top-level execution, not a reusable CLI module.
  • README.md ends with a visibly corrupted Nexus Watchdog section containing broken formatting.
  • Sovereign_Health_Check.md still recommends running the broken tests/test_heart.py and tests/test_synthesis.py health suite.
  • federation_sync.py currently looks architecturally important but contractually out of sync with message_bus.py and knowledge_store.py.

Bottom Line

fleet-ops contains the real bones of a sovereign fleet control plane, but those bones are unevenly ossified.

The strong parts are:

  • the phase-based Ansible deployment structure in playbooks/site.yml
  • the microservice-style core in message_bus.py, knowledge_store.py, and health_dashboard.py
  • several focused Python test suites that genuinely specify behavior

The weak parts are:

  • duplicated sources of truth (playbooks/host_vars/*, manifest.yaml, registry.yaml, local compose)
  • deployment and cron surfaces that matter more operationally than they are tested
  • conceptual “sovereign_*” modules that pull in hermes_tools and currently break local collection

If this repo were being hardened next, the highest-leverage moves would be:

  1. Make the registries consistent (8643 vs 8646, service inventory vs deployed phases).
  2. Add focused tests around scripts/deploy-hook.py and the deploy/report cron scripts.
  3. Decide which Python modules are truly production runtime and which are prototypes, then wire or prune accordingly.
  4. Collapse the number of “truth” files an operator has to trust during a deploy.