# GENOME.md — fleet-ops
Host artifact for timmy-home issue #680. The analyzed code lives in the separate fleet-ops repository; this document is the curated genome written from a fresh clone of that repo at commit 38c4eab.
## Project Overview
fleet-ops is the infrastructure and operations control plane for the Timmy Foundation fleet. It is not a single deployable application. It is a mixed ops repository with four overlapping layers:
- Ansible orchestration for VPS provisioning and service rollout.
- Small Python microservices for shared fleet state.
- Cron- and CLI-driven operator scripts.
- A separate local `docker-compose.yml` sandbox for a simplified all-in-one stack.
Two facts shape the repo more than anything else:
- The real fleet deployment path starts at `site.yml` → `playbooks/site.yml` and lands services through Ansible roles.
- The repo also contains several aspirational or partially wired Python modules whose names imply runtime importance but whose deployment path is weak, indirect, or missing.
Grounded metrics from the fresh analysis run:

- `python3 ~/.hermes/pipelines/codebase-genome.py --path /tmp/fleet-ops-genome --dry-run` reported 97 source files, 12 test files, 29 config files, and 16,658 total lines.
- A local filesystem count found 39 Python source files, 12 Python test files, and 74 YAML files.
- `python3 -m pytest -q --continue-on-collection-errors` produced 158 passed, 1 failed, 2 errors.
The repo is therefore operationally substantial, but only part of that surface is coherently tested and wired.
## Architecture

```mermaid
graph TD
    A[site.yml] --> B[playbooks/site.yml]
    B --> C[preflight.yml]
    B --> D[baseline.yml]
    B --> E[deploy_ollama.yml]
    B --> F[deploy_gitea.yml]
    B --> G[deploy_hermes.yml]
    B --> H[deploy_conduit.yml]
    B --> I[harmony_audit role]
    G --> J[playbooks/host_vars/* wizard_instances]
    G --> K[hermes-agent role]
    K --> L[systemd wizard services]
    M[templates/fleet-deploy-hook.service] --> N[scripts/deploy-hook.py]
    N --> B
    O[playbooks/roles/message-bus/templates/busd.service.j2] --> P[message_bus.py]
    Q[playbooks/roles/knowledge-store/templates/knowledged.service.j2] --> R[knowledge_store.py]
    S[registry.yaml] --> T[health_dashboard.py]
    S --> U[scripts/registry_health_updater.py]
    S --> V[federation_sync.py]
    W[cron/dispatch-consumer.yml] --> X[scripts/dispatch_consumer.py]
    Y[morning_report_cron.yml] --> Z[scripts/morning_report_compile.py]
    AA[nightly_efficiency_cron.yml] --> AB[scripts/nightly_efficiency_report.py]
    AC[burndown_watcher_cron.yml] --> AD[scripts/burndown_cron.py]
    AE[docker-compose.yml] --> AF[local ollama]
    AE --> AG[local gitea]
    AE --> AH[agent container]
    AE --> AI[monitor loop]
```
### Structural read
The cleanest mental model is not “one app,” but “one repo that tries to be the fleet’s operator handbook, deployment engine, shared service shelf, and scratchpad.”
That produces three distinct control planes:
- `playbooks/` is the strongest source of truth for VPS deployment.
- `registry.yaml` and `manifest.yaml` act as runtime or operator registries for scripts.
- `docker-compose.yml` models a separate local sandbox whose assumptions do not fully match the Ansible path.
## Entry Points

### Primary fleet deploy entry points

- `site.yml` — thin repo-root wrapper that imports `playbooks/site.yml`.
- `playbooks/site.yml` — multi-phase orchestrator for preflight, baseline, Ollama, Gitea, Hermes, Conduit, and local harmony audit.
- `playbooks/deploy_hermes.yml` — the most important service rollout for wizard instances; requires `wizard_instances` and pulls `vault_openrouter_api_key` / `vault_openai_api_key`.
- `playbooks/provision_and_deploy.yml` — DigitalOcean create-and-bootstrap path using `community.digital.digital_ocean_droplet` and a dynamic `new_droplets` group.
### Deployed service entry points

- `message_bus.py` — HTTP message queue service deployed by `playbooks/roles/message-bus/templates/busd.service.j2`.
- `knowledge_store.py` — SQLite-backed shared fact service deployed by `playbooks/roles/knowledge-store/templates/knowledged.service.j2`.
- `scripts/deploy-hook.py` — webhook listener launched by `templates/fleet-deploy-hook.service` with `ExecStart=/usr/bin/python3 /opt/fleet-ops/scripts/deploy-hook.py`.
### Cron and operator entry points

- `scripts/dispatch_consumer.py` — wired by `cron/dispatch-consumer.yml`.
- `scripts/morning_report_compile.py` — wired by `morning_report_cron.yml`.
- `scripts/nightly_efficiency_report.py` — wired by `nightly_efficiency_cron.yml`.
- `scripts/burndown_cron.py` — wired by `burndown_watcher_cron.yml`.
- `scripts/fleet_readiness.py` — operator validation script for `manifest.yaml`.
- `scripts/fleet-status.py` — prints a fleet status snapshot directly from top-level code.
### CI / verification entry points

- `.gitea/workflows/ansible-lint.yml` — YAML lint, `ansible-lint`, syntax checks, inventory validation.
- `.gitea/workflows/auto-review.yml` — lightweight review workflow with YAML lint, syntax checks, secret scan, and merge-conflict probe.
### Local development stack entry point

- `docker-compose.yml` — brings up `ollama`, `gitea`, `agent`, and `monitor` for a local stack.
## Data Flow
### 1) Deploy path

- A repo operator pushes or references deployable state.
- `scripts/deploy-hook.py` receives the webhook.
- The hook updates `/opt/fleet-ops`, then invokes Ansible.
- `playbooks/site.yml` fans into phase playbooks.
- `playbooks/deploy_hermes.yml` renders per-instance config and systemd services from `wizard_instances` in `playbooks/host_vars/*`.
- Services expose local `/health` endpoints on assigned ports.
### 2) Shared service path

- Agents or tools post work to `message_bus.py`.
- Consumers poll `/messages` and inspect `/queue`, `/deadletter`, and `/audit`.
- Facts are written into `knowledge_store.py` and federated through peer sync endpoints.
- `health_dashboard.py` and `scripts/registry_health_updater.py` read `registry.yaml` and probe service URLs.
### 3) Reporting path

- Cron YAML launches queue/report scripts.
- Scripts read `~/.hermes/`, Gitea APIs, local logs, or registry files.
- Output is emitted as JSON, markdown, or console summaries.
### Important integration fracture
`federation_sync.py` does not currently match the services it tries to coordinate.

- `message_bus.py` returns `/messages` as `{"messages": [...], "count": N}` at line 234.
- `federation_sync.py` polls `.../messages?limit=50` and then only iterates if `isinstance(data, list)` at lines 136-140.
- `federation_sync.py` also requests `.../knowledge/stats` at line 230, but `knowledge_store.py` documents `/sync/status`, `/facts`, and `/peers`, not `/knowledge/stats`.
This means the repo contains a federation layer whose assumed contracts drift from the concrete microservices beside it.
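The cheapest consumer-side repair is to normalize both payload shapes before iterating. A minimal sketch, assuming only the envelope shape described above; the function names are hypothetical and not the actual `federation_sync.py` API:

```python
import json
import urllib.request


def parse_messages(data) -> list:
    """Normalize either /messages response shape to a plain list."""
    if isinstance(data, list):        # shape federation_sync.py currently assumes
        return data
    if isinstance(data, dict):        # envelope shape message_bus.py actually returns
        return data.get("messages", [])
    return []                         # anything else: treat as empty, not a crash


def fetch_messages(base_url: str, limit: int = 50) -> list:
    """Poll /messages and normalize the payload (hypothetical helper)."""
    url = f"{base_url}/messages?limit={limit}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_messages(json.load(resp))
```

Splitting parsing from transport also makes the contract testable without a live bus.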
## Key Abstractions

### MessageStore in `message_bus.py`
Core in-memory queue abstraction. It underlies:
- enqueue / poll behavior
- TTL expiry and dead-letter handling
- queue stats and audit trail endpoints
The tests in `tests/test_message_bus.py` make this one of the best-specified components in the repo.
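As a toy reduction of what those tests pin down, a queue with TTL expiry and a dead-letter shelf can be sketched as follows; the class and method names are illustrative, not the real `MessageStore` API:

```python
import time
from collections import deque
from typing import Optional


class MiniMessageStore:
    """Toy queue with TTL expiry and a dead-letter shelf (illustrative only)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.queue = deque()      # (enqueued_at, message) pairs, oldest first
        self.deadletter = []      # messages that expired before delivery

    def enqueue(self, message: dict, now: Optional[float] = None) -> None:
        self.queue.append((now if now is not None else time.time(), message))

    def poll(self, now: Optional[float] = None) -> Optional[dict]:
        """Return the next live message; expired ones land in the dead-letter."""
        now = now if now is not None else time.time()
        while self.queue:
            enqueued_at, message = self.queue.popleft()
            if now - enqueued_at > self.ttl:
                self.deadletter.append(message)   # expired: shelve, keep draining
                continue
            return message
        return None
```

The injectable `now` parameter is what makes TTL behavior unit-testable without sleeping.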
### KnowledgeDB in `knowledge_store.py`
SQLite-backed fact registry with HTTP exposure for:
- storing facts
- querying and deleting facts
- peer registration
- push/pull federation
- sync status reporting
This is the nearest thing the repo has to a durable shared memory service.
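The core pattern is small. A sketch of a SQLite-backed fact registry, with the schema and method names assumed for illustration rather than taken from `knowledge_store.py`:

```python
import sqlite3


class MiniFactStore:
    """Illustrative SQLite key/value fact registry (not the real KnowledgeDB)."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts (key TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, key: str, value: str) -> None:
        # Upsert so repeated writes for the same key overwrite cleanly.
        self.db.execute(
            "INSERT INTO facts (key, value) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (key, value),
        )
        self.db.commit()

    def get(self, key: str):
        row = self.db.execute(
            "SELECT value FROM facts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def delete(self, key: str) -> None:
        self.db.execute("DELETE FROM facts WHERE key = ?", (key,))
        self.db.commit()
```

Federation (push/pull, peers) would layer HTTP on top of exactly this kind of store.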
### FleetMonitor in `health_dashboard.py`

Loads `registry.yaml`, polls wizard endpoints, caches results, and exposes both HTML and JSON views. It is the operator-facing read model of the fleet.
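The read-model loop reduces to probe-then-classify. A hedged sketch of that loop — the state names and thresholds are assumptions, and registry loading is omitted:

```python
import urllib.error
import urllib.request


def classify(status_code, error=None) -> str:
    """Map a probe outcome to a dashboard state (illustrative states)."""
    if error is not None:
        return "unreachable"
    return "healthy" if status_code == 200 else "degraded"


def probe(url: str, timeout: float = 5.0) -> str:
    """Hit a service /health URL and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except (urllib.error.URLError, OSError) as exc:
        return classify(None, error=exc)
```

Keeping `classify` pure means the dashboard's state logic can be tested without any live services.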
### SyncEngine in `federation_sync.py`
Intended as the bridge across message bus, audit trail, and knowledge store. The design intent is strong, but the live endpoint contracts appear out of sync.
### ProfilePolicy in `scripts/profile_isolation.py`
Encodes tmux/agent lifecycle policy by profile. This is one of the more disciplined “ops logic” modules: focused, testable, and bounded.
### GenerationResult / VideoEngineClient in `scripts/video_engine_client.py`
Represents the repo’s media-generation sidecar boundary. The code is small and clear, but its tests are partially stale relative to implementation behavior.
## API Surface

### `message_bus.py`

Observed HTTP surface includes:

- `POST /message`
- `GET /messages?to=<agent>&limit=<n>`
- `GET /queue`
- `GET /deadletter`
- `GET /audit`
- `GET /health`
### `knowledge_store.py`

Documented surface includes:

- `POST /fact`
- `GET /facts`
- `DELETE /facts/<key>`
- `POST /sync/pull`
- `POST /sync/push`
- `GET /sync/status`
- `GET /peers`
- `POST /peers`
- `GET /health`
### `health_dashboard.py`

- `/`
- `/api/status`
- `/api/wizard/<id>`
### `scripts/deploy-hook.py`

- `/health`
- `/webhook`
### Ansible operator surface

Primary commands implied by the repo:

```sh
ansible-playbook -i playbooks/inventory site.yml
ansible-playbook -i playbooks/inventory playbooks/provision_and_deploy.yml
ansible-playbook -i playbooks/inventory playbooks/deploy_hermes.yml
```
## Dependencies

### Python and shell posture
The repo is mostly Python stdlib plus Ansible/shell orchestration. It is not packaged as a single installable Python project.
### Explicit Ansible collections

`requirements.yml` declares:

- `community.docker`
- `community.general`
- `ansible.posix`

The provisioning docs and playbooks also rely on `community.digital.digital_ocean_droplet` in `playbooks/provision_and_deploy.yml`.
### External service dependencies
- Gitea
- Ollama
- DigitalOcean
- systemd
- Docker / Docker Compose
- local `~/.hermes/` session and burn-log state
### Hidden runtime dependency

Several conceptual modules import `hermes_tools` directly:

- `compassion_layer.py`
- `sovereign_librarian.py`
- `sovereign_muse.py`
- `sovereign_pulse.py`
- `sovereign_sentinel.py`
- `synthesis_engine.py`
That dependency is not self-contained inside the repo and directly causes the local collection errors.
## Test Coverage Gaps

### Current tested strengths

The strongest, most trustworthy tests are around:

- `tests/test_message_bus.py`
- `tests/test_knowledge_store.py`
- `tests/test_health_dashboard.py`
- `tests/test_registry_health_updater.py`
- `tests/test_profile_isolation.py`
- `tests/test_skill_scorer.py`
- `tests/test_nightly_efficiency_report.py`
Those files make the shared-service core much more legible than the deployment layer.
### Current local status

Fresh run result: `158 passed, 1 failed, 2 errors`.
Collection errors:
- `tests/test_heart.py` fails because `compassion_layer.py` imports `hermes_tools`.
- `tests/test_synthesis.py` fails because `sovereign_librarian.py` imports `hermes_tools`.
Runnable failure:
- `tests/test_video_engine_client.py` expects `generate_draft()` to raise on HTTP 503.
- `scripts/video_engine_client.py` currently catches exceptions and returns `GenerationResult(success=False, error=...)` instead.
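The mismatch is a contract-style disagreement: raise on failure versus return a result object. A toy reduction of the two styles — only `success` and `error` come from the source; the field types, function names, and status handling are illustrative:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationResult:
    """Mirrors the result-object style the client uses (fields assumed)."""
    success: bool
    error: Optional[str] = None


def generate_draft_result_style(http_status: int) -> GenerationResult:
    """Current client behavior: errors are captured, not raised."""
    if http_status != 200:
        return GenerationResult(success=False, error=f"HTTP {http_status}")
    return GenerationResult(success=True)


def generate_draft_raising_style(http_status: int) -> GenerationResult:
    """Behavior the stale test expects: non-200 raises."""
    if http_status != 200:
        raise RuntimeError(f"HTTP {http_status}")
    return GenerationResult(success=True)
```

Either contract is defensible; the fix is to pick one and update whichever side (test or client) disagrees.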
### High-value untested paths

The most important missing or weakly validated surfaces are:

- `scripts/deploy-hook.py` — high-blast-radius deploy trigger.
- `playbooks/deploy_gitea.yml` / `playbooks/deploy_hermes.yml` / `playbooks/provision_and_deploy.yml` — critical control plane, almost entirely untested in-repo.
- `scripts/morning_report_compile.py` — cron-facing reporting logic.
- `scripts/burndown_cron.py` and related watcher scripts.
- `scripts/generate_video.py`, `scripts/tiered_render.py`, and broader video-engine operator paths.
- `scripts/fleet-status.py` — prints directly from module scope and has no `__main__` guard.
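For the last item the conventional fix is cheap: hoist the printing into `main()` behind a `__main__` guard so the module imports cleanly. A sketch with a hypothetical helper name, not the real internals of `scripts/fleet-status.py`:

```python
# Sketch of the fix for scripts/fleet-status.py: no work at import time.
# build_status_lines is a hypothetical name for illustration.

def build_status_lines(fleet: dict) -> list:
    """Pure helper: turn a fleet snapshot into printable lines."""
    return [f"{name}: {state}" for name, state in sorted(fleet.items())]


def main() -> None:
    # Placeholder snapshot; the real script would gather live fleet state.
    snapshot = {"ezra-primary": "healthy"}
    for line in build_status_lines(snapshot):
        print(line)


if __name__ == "__main__":
    main()
```

With this shape, tests can import and assert on `build_status_lines` without triggering any output.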
### Coverage quality note
The repo’s best tests cluster around internal Python helpers. The repo’s biggest operational risk lives in deployment, cron wiring, and shell/Ansible behaviors that are not equivalently exercised.
## Security Considerations

### Strong points

- Vault use exists in `playbooks/group_vars/vault.yml` and inline vaulted material in `manifest.yaml`.
- `playbooks/deploy_gitea.yml` sets `gitea_disable_registration: true`, `gitea_require_signin: true`, and `gitea_register_act_runner: false`.
- The Hermes role renders per-instance env/config and uses systemd hardening patterns.
- Gitea, Nostr relay, and other web surfaces are designed around nginx/TLS roles.
### Concrete risks

- `scripts/deploy-hook.py` explicitly disables signature enforcement when `DEPLOY_HOOK_SECRET` is unset.
- `playbooks/roles/gitea/defaults/main.yml` sets `gitea_webhook_allowed_host_list: "*"`.
- Both `ansible.cfg` files disable host key checking.
- The repo has multiple sources of truth for ports and service topology:
  - `playbooks/host_vars/ezra-primary.yml` uses `8643`
  - `manifest.yaml` uses `8643`
  - `registry.yaml` points Ezra health to `8646`
- `registry.yaml` advertises services like `busd`, `auditd`, and `knowledged`, but the main `playbooks/site.yml` phases do not include message-bus or knowledge-store roles.
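For the first risk, the fail-closed version of signature checking is short. A sketch assuming a Gitea-style HMAC-SHA256 hex digest in the `X-Gitea-Signature` header; the wiring is an assumption about the hook, not code from `deploy-hook.py`:

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Constant-time check of an HMAC-SHA256 webhook signature.

    The key point for deploy-hook.py: when the secret is missing,
    reject the request rather than skip verification entirely.
    """
    if not secret:
        return False  # fail closed instead of disabling enforcement
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels on the comparison.
    return hmac.compare_digest(expected, signature_header)
```

`hmac.compare_digest` matters here because a naive `==` leaks how many leading characters matched.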
### Drift / correctness risks that become security risks

- `playbooks/deploy_auto_merge.yml` targets `hosts: gitea_servers`, but the inventory groups visible in `playbooks/inventory` are `forge`, `vps`, `agents`, and `wizards`.
- `playbooks/roles/gitea/defaults/main.yml` includes runner labels with a probable typo: `ubuntu-22.04:docker://catthehocker/ubuntu:act-22.04`.
- The local compose quick start is not turnkey: `Dockerfile.agent` copies `requirements-agent.txt*` and `agent/`, but the runtime falls back to a tiny health/tick loop if the real agent source is absent.
## Deployment

### VPS / real fleet path

Repo-root wrapper:

```sh
ansible-playbook -i playbooks/inventory site.yml
```

Direct orchestrator:

```sh
ansible-playbook -i playbooks/inventory playbooks/site.yml
```

Provision and bootstrap a new node:

```sh
ansible-playbook -i playbooks/inventory playbooks/provision_and_deploy.yml
```
### Local sandbox path

```sh
cp .env.example .env
docker compose up -d
```
But this path must be read skeptically. `docker-compose.yml` is a local convenience stack, while the real fleet path uses Ansible + systemd + host vars + vault-backed secrets.
## Dead Code Candidates and Operator Footguns

- `scripts/fleet-status.py` behaves like a one-shot report script with top-level execution, not a reusable CLI module.
- `README.md` ends with a visibly corrupted Nexus Watchdog section containing broken formatting.
- `Sovereign_Health_Check.md` still recommends running the broken `tests/test_heart.py` and `tests/test_synthesis.py` health suite.
- `federation_sync.py` currently looks architecturally important but contractually out of sync with `message_bus.py` and `knowledge_store.py`.
## Bottom Line
fleet-ops contains the real bones of a sovereign fleet control plane, but those bones are unevenly ossified.
The strong parts are:
- the phase-based Ansible deployment structure in `playbooks/site.yml`
- the microservice-style core in `message_bus.py`, `knowledge_store.py`, and `health_dashboard.py`
- several focused Python test suites that genuinely specify behavior
The weak parts are:
- duplicated sources of truth (`playbooks/host_vars/*`, `manifest.yaml`, `registry.yaml`, local compose)
- deployment and cron surfaces that matter more operationally than they are tested
- conceptual "sovereign_*" modules that pull in `hermes_tools` and currently break local collection
If this repo were being hardened next, the highest-leverage moves would be:
- Make the registries consistent (`8643` vs `8646`, service inventory vs deployed phases).
- Add focused tests around `scripts/deploy-hook.py` and the deploy/report cron scripts.
- Decide which Python modules are truly production runtime and which are prototypes, then wire or prune accordingly.
- Collapse the number of "truth" files an operator has to trust during a deploy.
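The first move can even be mechanized. A sketch of a port cross-checker over the duplicated truth files; the input structure is an assumption for illustration, and the real `manifest.yaml` / `registry.yaml` layouts would need to be mapped into it:

```python
def find_port_conflicts(sources: dict) -> dict:
    """Given {source_name: {service: port}}, return the services whose
    port differs across sources, with the conflicting values per source."""
    by_service = {}
    for source, ports in sources.items():
        for service, port in ports.items():
            by_service.setdefault(service, {})[source] = port
    return {
        service: seen
        for service, seen in by_service.items()
        if len(set(seen.values())) > 1  # more than one distinct port: drift
    }
```

Run against the three files above, this would flag exactly the `8643` vs `8646` Ezra drift.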