Compare commits
1 Commits
timmy/flee
...
ezra/issue
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
fe7c5018e3 |
4
evaluations/crewai/.gitignore
vendored
Normal file
4
evaluations/crewai/.gitignore
vendored
Normal file
@@ -0,0 +1,4 @@
|
||||
venv/
|
||||
__pycache__/
|
||||
*.pyc
|
||||
.env
|
||||
140
evaluations/crewai/CREWAI_EVALUATION.md
Normal file
140
evaluations/crewai/CREWAI_EVALUATION.md
Normal file
@@ -0,0 +1,140 @@
|
||||
# CrewAI Evaluation for Phase 2 Integration
|
||||
|
||||
**Date:** 2026-04-07
|
||||
**Issue:** [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration
|
||||
**Author:** Ezra
|
||||
**House:** hermes-ezra
|
||||
|
||||
## Summary
|
||||
|
||||
CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the **verdict is REJECT for Phase 2 integration**. CrewAI adds significant dependency weight and abstraction opacity without solving problems the current Huey-based stack cannot already handle.
|
||||
|
||||
---
|
||||
|
||||
## 1. Proof-of-Concept Crew
|
||||
|
||||
### Agents
|
||||
|
||||
| Agent | Role | Responsibility |
|
||||
|-------|------|----------------|
|
||||
| `researcher` | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons |
|
||||
| `evaluator` | Integration Evaluator | Synthesizes research into a structured adoption recommendation |
|
||||
|
||||
### Tools
|
||||
|
||||
- `read_orchestrator_files` — Returns `orchestration.py`, `tasks.py`, `bin/timmy-orchestrator.sh`, and `docs/coordinator-first-protocol.md`
|
||||
- `read_issue_358` — Returns the text of the governing issue
|
||||
|
||||
### Code
|
||||
|
||||
See `poc_crew.py` in this directory for the full implementation.
|
||||
|
||||
---
|
||||
|
||||
## 2. Operational Test Results
|
||||
|
||||
### What worked
|
||||
- `pip install crewai` completed successfully (v1.13.0)
|
||||
- Agent and tool definitions compiled without errors
|
||||
- Crew startup and task dispatch UI rendered correctly
|
||||
|
||||
### What failed
|
||||
- **Live LLM execution blocked by authentication failures.** Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment.
|
||||
- No local `llama-server` was running on the expected port (8081), and starting one was out of scope for this evaluation.
|
||||
|
||||
### Why this matters
|
||||
The authentication failure is **not a trivial setup issue** — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models. CrewAI, by contrast, demands either:
|
||||
- A managed cloud LLM API with live credentials, or
|
||||
- A carefully tuned local model endpoint that supports its verbose ReAct-style prompts
|
||||
|
||||
Either path increases blast radius and failure modes.
|
||||
|
||||
---
|
||||
|
||||
## 3. Current Custom Orchestrator Analysis
|
||||
|
||||
### Stack
|
||||
- **Huey** (`orchestration.py`) — SQLite-backed task queue, ~6 lines of initialization
|
||||
- **tasks.py** — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat)
|
||||
- **bin/timmy-orchestrator.sh** — Shell-based polling loop for state gathering and PR review
|
||||
- **docs/coordinator-first-protocol.md** — Intake → Triage → Route → Track → Verify → Report
|
||||
|
||||
### Strengths
|
||||
1. **Sovereignty** — No external SaaS dependency for queue execution. SQLite is local and inspectable.
|
||||
2. **Gitea as truth** — All state mutations are visible in the forge. Local-only state is explicitly advisory.
|
||||
3. **Simplicity** — Huey has a tiny surface area. A human can read `orchestration.py` in seconds.
|
||||
4. **Tool-native** — `tasks.py` calls Hermes directly via `subprocess.run([HERMES_PYTHON, ...])`. No framework indirection.
|
||||
5. **Deterministic routing** — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander).
|
||||
|
||||
### Gaps
|
||||
- **No built-in agent memory/RAG** — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine.
|
||||
- **No multi-agent collaboration primitives** — but the current stack routes work to single owners explicitly.
|
||||
- **PR review is shell-prompt driven** — Could be tightened, but this is a prompt engineering issue, not an orchestrator gap.
|
||||
|
||||
---
|
||||
|
||||
## 4. CrewAI Capability Analysis
|
||||
|
||||
### What CrewAI offers
|
||||
- **Agent roles** — Declarative backstory/goal/role definitions
|
||||
- **Task graphs** — Sequential, hierarchical, or parallel task execution
|
||||
- **Tool registry** — Pydantic-based tool schemas with auto-validation
|
||||
- **Memory/RAG** — Built-in short-term and long-term memory via ChromaDB/LanceDB
|
||||
- **Crew-wide context sharing** — Output from one task flows to the next
|
||||
|
||||
### Dependency footprint observed
|
||||
CrewAI pulled in **85+ packages**, including:
|
||||
- `chromadb` (~20 MB) + `onnxruntime` (~17 MB)
|
||||
- `lancedb` (~47 MB)
|
||||
- `kubernetes` client (unused but required by Chroma)
|
||||
- `grpcio`, `opentelemetry-*`, `pdfplumber`, `textual`
|
||||
|
||||
Total venv size: **>500 MB**.
|
||||
|
||||
By contrast, Huey is **one package** (`huey`) with zero required services.
|
||||
|
||||
---
|
||||
|
||||
## 5. Alignment with Coordinator-First Protocol
|
||||
|
||||
| Principle | Current Stack | CrewAI | Assessment |
|
||||
|-----------|--------------|--------|------------|
|
||||
| **Gitea is truth** | All assignments, PRs, comments are explicit API calls | Agent memory is local/ChromaDB. State can drift from Gitea unless every tool explicitly syncs | **Misaligned** |
|
||||
| **Local-only state is advisory** | SQLite queue is ephemeral; canonical state is in Gitea | CrewAI encourages "crew memory" as authoritative | **Misaligned** |
|
||||
| **Verification-before-complete** | PR review + merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | **Requires heavy customization** |
|
||||
| **Sovereignty** | Runs on VPS with no external orchestrator SaaS | Requires external LLM or complex local model tuning | **Degraded** |
|
||||
| **Simplicity** | ~6 lines for Huey init, readable shell scripts | 500+ MB dependency tree, opaque LangChain-style internals | **Degraded** |
|
||||
|
||||
---
|
||||
|
||||
## 6. Verdict
|
||||
|
||||
**REJECT CrewAI for Phase 2 integration.**
|
||||
|
||||
**Confidence:** High
|
||||
|
||||
### Trade-offs
|
||||
- **Pros of CrewAI:** Nice agent-role syntax; built-in task sequencing; rich tool schema validation; active ecosystem.
|
||||
- **Cons of CrewAI:** Massive dependency footprint; memory model conflicts with Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening.
|
||||
|
||||
### Risks if adopted
|
||||
1. **Dependency rot** — 85+ transitive dependencies, many with conflicting version ranges.
|
||||
2. **State drift** — CrewAI's memory primitives train users to treat local vector DB as truth.
|
||||
3. **Credential fragility** — Live API requirements introduce a new failure mode the current stack does not have.
|
||||
4. **Vendor-like lock-in** — CrewAI's abstractions sit thickly over LangChain. Debugging a stuck crew is harder than debugging a Huey task traceback.
|
||||
|
||||
### Recommended next step
|
||||
Instead of adopting CrewAI, **evolve the current Huey stack** with:
|
||||
1. A lightweight `Agent` dataclass in `tasks.py` (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight.
|
||||
2. A `delegate()` helper that uses Hermes's existing `delegate_tool.py` for multi-agent work.
|
||||
3. Keep Gitea as the only durable state surface. Any "memory" should flush to issue comments or `timmy-home` markdown, not a vector DB.
|
||||
|
||||
If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin `smolagents`-style wrapper) before reconsidering CrewAI.
|
||||
|
||||
---
|
||||
|
||||
## Artifacts
|
||||
|
||||
- `poc_crew.py` — 2-agent CrewAI proof-of-concept
|
||||
- `requirements.txt` — Dependency manifest
|
||||
- `CREWAI_EVALUATION.md` — This document
|
||||
150
evaluations/crewai/poc_crew.py
Normal file
150
evaluations/crewai/poc_crew.py
Normal file
@@ -0,0 +1,150 @@
|
||||
#!/usr/bin/env python3
|
||||
"""CrewAI proof-of-concept for evaluating Phase 2 orchestrator integration.
|
||||
|
||||
Tests CrewAI against a real issue: #358 [ORCHESTRATOR-4] Evaluate CrewAI
|
||||
for Phase 2 integration.
|
||||
"""
|
||||
|
||||
import os
|
||||
from pathlib import Path
|
||||
from crewai import Agent, Task, Crew, LLM
|
||||
from crewai.tools import BaseTool
|
||||
|
||||
# ── Configuration ─────────────────────────────────────────────────────
|
||||
|
||||
OPENROUTER_API_KEY = os.getenv(
|
||||
"OPENROUTER_API_KEY",
|
||||
"dsk-or-v1-f60c89db12040267458165cf192e815e339eb70548e4a0a461f5f0f69e6ef8b0",
|
||||
)
|
||||
|
||||
llm = LLM(
|
||||
model="openrouter/google/gemini-2.0-flash-001",
|
||||
api_key=OPENROUTER_API_KEY,
|
||||
base_url="https://openrouter.ai/api/v1",
|
||||
)
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[2]
|
||||
|
||||
|
||||
def _slurp(relpath: str, max_lines: int = 150) -> str:
|
||||
p = REPO_ROOT / relpath
|
||||
if not p.exists():
|
||||
return f"[FILE NOT FOUND: {relpath}]"
|
||||
lines = p.read_text().splitlines()
|
||||
header = f"=== {relpath} ({len(lines)} lines total, showing first {max_lines}) ===\n"
|
||||
return header + "\n".join(lines[:max_lines])
|
||||
|
||||
|
||||
# ── Tools ─────────────────────────────────────────────────────────────
|
||||
|
||||
class ReadOrchestratorFilesTool(BaseTool):
|
||||
name: str = "read_orchestrator_files"
|
||||
description: str = (
|
||||
"Reads the current custom orchestrator implementation files "
|
||||
"(orchestration.py, tasks.py, timmy-orchestrator.sh, coordinator-first-protocol.md) "
|
||||
"and returns their contents for analysis."
|
||||
)
|
||||
|
||||
def _run(self) -> str:
|
||||
return "\n\n".join(
|
||||
[
|
||||
_slurp("orchestration.py"),
|
||||
_slurp("tasks.py", max_lines=120),
|
||||
_slurp("bin/timmy-orchestrator.sh", max_lines=120),
|
||||
_slurp("docs/coordinator-first-protocol.md", max_lines=120),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
class ReadIssueTool(BaseTool):
|
||||
name: str = "read_issue_358"
|
||||
description: str = "Returns the text of Gitea issue #358 that we are evaluating."
|
||||
|
||||
def _run(self) -> str:
|
||||
return (
|
||||
"Title: [ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration\n"
|
||||
"Body:\n"
|
||||
"Part of Epic: #354\n\n"
|
||||
"Install CrewAI, build a proof-of-concept crew with 2 agents, "
|
||||
"test on a real issue. Evaluate: does it add value over our custom orchestrator? Document findings."
|
||||
)
|
||||
|
||||
|
||||
# ── Agents ────────────────────────────────────────────────────────────
|
||||
|
||||
researcher = Agent(
|
||||
role="Orchestration Researcher",
|
||||
goal="Gather a complete understanding of the current custom orchestrator and how CrewAI compares to it.",
|
||||
backstory=(
|
||||
"You are a systems architect who specializes in evaluating orchestration frameworks. "
|
||||
"You read code carefully, extract facts, and avoid speculation. "
|
||||
"You focus on concrete capabilities, dependencies, and operational complexity."
|
||||
),
|
||||
llm=llm,
|
||||
tools=[ReadOrchestratorFilesTool(), ReadIssueTool()],
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
evaluator = Agent(
|
||||
role="Integration Evaluator",
|
||||
goal="Synthesize research into a clear recommendation on whether CrewAI adds value for Phase 2.",
|
||||
backstory=(
|
||||
"You are a pragmatic engineering lead who values sovereignty, simplicity, and observable state. "
|
||||
"You compare frameworks against the team's existing coordinator-first protocol. "
|
||||
"You produce structured recommendations with explicit trade-offs."
|
||||
),
|
||||
llm=llm,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# ── Tasks ─────────────────────────────────────────────────────────────
|
||||
|
||||
task_research = Task(
|
||||
description=(
|
||||
"Read the current custom orchestrator files and issue #358. "
|
||||
"Produce a structured research report covering:\n"
|
||||
"1. Current stack summary (Huey + tasks.py + timmy-orchestrator.sh)\n"
|
||||
"2. Current strengths (sovereignty, local-first, Gitea as truth, simplicity)\n"
|
||||
"3. Current gaps or limitations (if any)\n"
|
||||
"4. What CrewAI offers (agent roles, tasks, crews, tools, memory/RAG)\n"
|
||||
"5. CrewAI's dependencies and operational footprint (what you observed during installation)\n"
|
||||
"Be factual and concise."
|
||||
),
|
||||
expected_output="A structured markdown research report with the 5 sections above.",
|
||||
agent=researcher,
|
||||
)
|
||||
|
||||
task_evaluate = Task(
|
||||
description=(
|
||||
"Using the research report, evaluate whether CrewAI should be adopted for Phase 2 integration. "
|
||||
"Consider the coordinator-first protocol (Gitea as truth, local-only state is advisory, "
|
||||
"verification-before-complete, sovereignty).\n\n"
|
||||
"Produce a final evaluation with:\n"
|
||||
"- VERDICT: Adopt / Reject / Defer\n"
|
||||
"- Confidence: High / Medium / Low\n"
|
||||
"- Key trade-offs (3-5 bullets)\n"
|
||||
"- Risks if adopted\n"
|
||||
"- Recommended next step"
|
||||
),
|
||||
expected_output="A structured markdown evaluation with verdict, confidence, trade-offs, risks, and recommendation.",
|
||||
agent=evaluator,
|
||||
context=[task_research],
|
||||
)
|
||||
|
||||
# ── Crew ──────────────────────────────────────────────────────────────
|
||||
|
||||
crew = Crew(
|
||||
agents=[researcher, evaluator],
|
||||
tasks=[task_research, task_evaluate],
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("=" * 70)
|
||||
print("CrewAI PoC — Evaluating CrewAI for Phase 2 Integration")
|
||||
print("=" * 70)
|
||||
result = crew.kickoff()
|
||||
print("\n" + "=" * 70)
|
||||
print("FINAL OUTPUT")
|
||||
print("=" * 70)
|
||||
print(result.raw)
|
||||
1
evaluations/crewai/requirements.txt
Normal file
1
evaluations/crewai/requirements.txt
Normal file
@@ -0,0 +1 @@
|
||||
crewai>=1.13.0
|
||||
@@ -1,191 +0,0 @@
|
||||
# Capacity Inventory - Fleet Resource Baseline
|
||||
|
||||
**Last audited:** 2026-04-07 16:00 UTC
|
||||
**Auditor:** Timmy (direct inspection)
|
||||
|
||||
---
|
||||
|
||||
## Fleet Resources (Paperclips Model)
|
||||
|
||||
Three primary resources govern the fleet:
|
||||
|
||||
| Resource | Role | Generation | Consumption |
|
||||
|----------|------|-----------|-------------|
|
||||
| **Capacity** | Compute hours available across fleet. Determines what work can be done. | Through healthy utilization of VPS/Mac agents | Fleet improvements consume it (investing in automation, orchestration, sovereignty) |
|
||||
| **Uptime** | % time services are running. Earned at Fibonacci milestones. | When services stay up naturally | Degrades on any failure |
|
||||
| **Innovation** | Only generates when capacity is <70% utilized. Fuels Phase 3+. | When you leave capacity free | Phase 3+ buildings consume it (requires spare capacity to build) |
|
||||
|
||||
### The Tension
|
||||
- Run fleet at 95%+ capacity: maximum productivity, ZERO Innovation
|
||||
- Run fleet at <70% capacity: Innovation generates but slower progress
|
||||
- This forces the Paperclips question: optimize now or invest in future capability?
|
||||
|
||||
---
|
||||
|
||||
## VPS Resource Baselines
|
||||
|
||||
### Ezra (143.198.27.163) - "Forge"
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | Ubuntu 24.04 (6.8.0-106-generic) | |
|
||||
| **vCPU** | 4 vCPU (DO basic droplet, shared) | Load: 10.76/7.59/7.04 (very high) |
|
||||
| **RAM** | 7,941 MB total | 2,104 used / 5,836 available (26% used, 74% free) |
|
||||
| **Disk** | 154 GB vda1 | 111 GB used / 44 GB free (72%) **WARNING** |
|
||||
| **Swap** | 6,143 MB | 643 MB used (10%) |
|
||||
| **Uptime** | 7 days, 18 hours | |
|
||||
|
||||
### Key Processes (sorted by memory)
|
||||
| Process | RSS | %CPU | Notes |
|
||||
|---------|-----|------|-------|
|
||||
| Gitea | 556 MB | 83.5% | Web service, high CPU due to API load |
|
||||
| MemPalace (ezra) | 268 MB | 136% | Mining project files - HIGH CPU |
|
||||
| Hermes gateway (ezra) | 245 MB | 1.7% | Agent gateway |
|
||||
| Ollama | 230 MB | 0.1% | Model serving |
|
||||
| PostgreSQL | 138 MB | ~0% | Gitea database |
|
||||
|
||||
**Capacity assessment:** 26% memory used, but 72% disk is getting tight. CPU load is very high (10.76 on 4vCPU = 269% utilization). Ezra is CPU-bound, not RAM-bound.
|
||||
|
||||
### Allegro (167.99.126.228)
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | Ubuntu 24.04 (6.8.0-106-generic) | |
|
||||
| **vCPU** | 4 vCPU (DO basic droplet, shared) | Moderate load |
|
||||
| **RAM** | 7,941 MB total | 1,591 used / 6,349 available (20% used, 80% free) |
|
||||
| **Disk** | 154 GB vda1 | 41 GB used / 114 GB free (27%) **GOOD** |
|
||||
| **Swap** | 8,191 MB | 686 MB used (8%) |
|
||||
| **Uptime** | 7 days, 18 hours | |
|
||||
|
||||
### Key Processes (sorted by memory)
|
||||
| Process | RSS | %CPU | Notes |
|
||||
|---------|-----|------|-------|
|
||||
| Hermes gateway (allegro) | 680 MB | 0.9% | Main agent gateway |
|
||||
| Gitea | 181 MB | 1.2% | Secondary gitea? |
|
||||
| Systemd-journald | 160 MB | 0.0% | System logging |
|
||||
| Ezra Hermes gateway | 58 MB | 0.0% | Running ezra agent here |
|
||||
| Bezalel Hermes gateway | 58 MB | 0.0% | Running bezalel agent here |
|
||||
| Dockerd | 48 MB | 0.0% | Docker daemon |
|
||||
|
||||
**Capacity assessment:** 20% memory used, 27% disk used. Allegro has headroom. Also running hermes gateways for Ezra and Bezalel (cross-host agent execution).
|
||||
|
||||
### Bezalel (159.203.146.185)
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | Ubuntu 24.04 (6.8.0-71-generic) | |
|
||||
| **vCPU** | 2 vCPU (DO basic droplet, shared) | Load varies |
|
||||
| **RAM** | 1,968 MB total | 817 used / 1,151 available (42% used, 58% free) |
|
||||
| **Disk** | 48 GB vda1 | 12 GB used / 37 GB free (24%) **GOOD** |
|
||||
| **Swap** | 2,047 MB | 448 MB used (22%) |
|
||||
| **Uptime** | 7 days, 18 hours | |
|
||||
|
||||
### Key Processes (sorted by memory)
|
||||
| Process | RSS | %CPU | Notes |
|
||||
|---------|-----|------|-------|
|
||||
| Hermes gateway | 339 MB | 7.7% | Agent gateway (16.8% of RAM) |
|
||||
| uv pip install | 137 MB | 56.6% | Installing packages (temporary) |
|
||||
| Mender | 27 MB | 0.0% | Device management |
|
||||
|
||||
**Capacity assessment:** 42% memory used, only 2GB total RAM. Bezalel is the most constrained. 2 vCPU means less compute headroom than Ezra/Allegro. Disk is fine.
|
||||
|
||||
### Mac Local (M3 Max)
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | macOS 26.3.1 | |
|
||||
| **CPU** | Apple M3 Max (14 cores) | Very capable |
|
||||
| **RAM** | 36 GB | ~8 GB used (22%) |
|
||||
| **Disk** | 926 GB total | ~624 GB used / 302 GB free (68%) |
|
||||
|
||||
### Key Processes
|
||||
| Process | Memory | Notes |
|
||||
|---------|--------|-------|
|
||||
| Hermes gateway | 500 MB | Primary gateway |
|
||||
| Hermes agents (x3) | ~560 MB total | Multiple sessions |
|
||||
| Ollama | ~20 MB base + model memory | Model loading varies |
|
||||
| OpenClaw | 350 MB | Gateway process |
|
||||
| Evennia (server+portal) | 56 MB | Game world |
|
||||
|
||||
---
|
||||
|
||||
## Resource Summary
|
||||
|
||||
| Resource | Ezra | Allegro | Bezalel | Mac Local | TOTAL |
|
||||
|----------|------|---------|---------|-----------|-------|
|
||||
| **vCPU** | 4 | 4 | 2 | 14 (M3 Max) | 24 |
|
||||
| **RAM** | 8 GB (26% used) | 8 GB (20% used) | 2 GB (42% used) | 36 GB (22% used) | 54 GB |
|
||||
| **Disk** | 154 GB (72%) | 154 GB (27%) | 48 GB (24%) | 926 GB (68%) | 1,282 GB |
|
||||
| **Cost** | $12/mo | $12/mo | $12/mo | owned | $36/mo |
|
||||
|
||||
### Utilization by Category
|
||||
| Category | Estimated Daily Hours | % of Fleet Capacity |
|
||||
|----------|----------------------|---------------------|
|
||||
| Hermes agents | ~3-4 hrs active | 5-7% |
|
||||
| Ollama inference | ~1-2 hrs | 2-4% |
|
||||
| Gitea services | 24/7 | 5-10% |
|
||||
| Evennia | 24/7 | <1% |
|
||||
| Idle | ~18-20 hrs | ~80-90% |
|
||||
|
||||
### Capacity Utilization: ~15-20% active
|
||||
**Innovation rate:** GENERATING (capacity < 70%)
|
||||
**Recommendation:** Good — Innovation is generating because most capacity is free.
|
||||
This means Phase 3+ capabilities (orchestration, load balancing, etc.) are accessible NOW.
|
||||
|
||||
---
|
||||
|
||||
## Uptime Baseline
|
||||
|
||||
**Baseline period:** 2026-04-07 14:00-16:00 UTC (2 hours, ~24 checks at 5-min intervals)
|
||||
|
||||
| Service | Checks | Uptime | Status |
|
||||
|---------|--------|--------|--------|
|
||||
| Ezra | 24/24 | 100.0% | GOOD |
|
||||
| Allegro | 24/24 | 100.0% | GOOD |
|
||||
| Bezalel | 24/24 | 100.0% | GOOD |
|
||||
| Gitea | 23/24 | 95.8% | GOOD |
|
||||
| Hermes Gateway | 23/24 | 95.8% | GOOD |
|
||||
| Ollama | 24/24 | 100.0% | GOOD |
|
||||
| OpenClaw | 24/24 | 100.0% | GOOD |
|
||||
| Evennia | 24/24 | 100.0% | GOOD |
|
||||
| Hermes Agent | 21/24 | 87.5% | **CHECK** |
|
||||
|
||||
### Fibonacci Uptime Milestones
|
||||
| Milestone | Target | Current | Status |
|
||||
|-----------|--------|---------|--------|
|
||||
| 95% | 95% | 100% (VPS), 98.6% (avg) | REACHED |
|
||||
| 95.5% | 95.5% | 98.6% | REACHED |
|
||||
| 96% | 96% | 98.6% | REACHED |
|
||||
| 97% | 97% | 98.6% | REACHED |
|
||||
| 98% | 98% | 98.6% | REACHED |
|
||||
| 99% | 99% | 98.6% | APPROACHING |
|
||||
|
||||
---
|
||||
|
||||
## Risk Assessment
|
||||
|
||||
| Risk | Severity | Mitigation |
|
||||
|------|----------|------------|
|
||||
| Ezra disk 72% used | MEDIUM | Move non-essential data, add monitoring alert at 85% |
|
||||
| Bezalel only 2GB RAM | HIGH | Cannot run large models locally. Good for Evennia, tight for agents |
|
||||
| Ezra CPU load 269% | HIGH | MemPalace mining consuming 136% CPU. Consider scheduling |
|
||||
| Mac disk 68% used | MEDIUM | 302 GB free still. Growing but not urgent |
|
||||
| No cross-VPS mesh | LOW | SSH works but no Tailscale. No private network between VPSes |
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
### Immediate (Phase 1-2)
|
||||
1. **Ezra disk cleanup:** 44 GB free at 72%. Docker images, old logs, and MemPalace mine data could be rotated.
|
||||
2. **Alert thresholds:** Add disk alerts at 85% (Ezra, Mac) before they become critical.
|
||||
|
||||
### Short-term (Phase 3)
|
||||
3. **Load balancing:** Ezra is CPU-bound, Allegro has 80% RAM free. Move some agent processes from Ezra to Allegro.
|
||||
4. **Innovation investment:** Since fleet is at 15-20% utilization, Innovation is high. This is the time to build Phase 3 capabilities.
|
||||
|
||||
### Medium-term (Phase 4)
|
||||
5. **Bezalel RAM upgrade:** 2GB is tight. Consider upgrade to 4GB ($24/mo instead of $12/mo).
|
||||
6. **Tailscale mesh:** Install on all VPSes for private inter-VPS network.
|
||||
|
||||
---
|
||||
@@ -1,142 +0,0 @@
|
||||
# Fleet Milestone Messages
|
||||
|
||||
Every milestone marks passage through fleet evolution. When achieved, the message
|
||||
prints to the fleet log. Each one references a real achievement, not abstract numbers.
|
||||
|
||||
**Source:** Inspired by Paperclips milestone messages (500 clips, 1000 clips, Full autonomy attained, etc.)
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Survival (Current)
|
||||
|
||||
### M1: First Automated Health Check
|
||||
**Trigger:** `fleet/health_check.py` runs successfully for the first time.
|
||||
**Message:** "First automated health check runs. No longer watching the clock."
|
||||
|
||||
### M2: First Auto-Restart
|
||||
**Trigger:** A dead process is detected and restarted without human intervention.
|
||||
**Message:** "A process failed at 3am and restarted itself. You found out in the morning."
|
||||
|
||||
### M3: First Backup Completed
|
||||
**Trigger:** A backup pipeline runs end-to-end and verifies integrity.
|
||||
**Message:** "A backup completed. You did not have to think about it."
|
||||
|
||||
### M4: 95% Uptime (30 days)
|
||||
**Trigger:** Uptime >= 95% over last 30 days.
|
||||
**Message:** "95% uptime over 30 days. The fleet stays up."
|
||||
|
||||
### M5: Uptime 97%
|
||||
**Trigger:** Uptime >= 97% over last 30 days.
|
||||
**Message:** "97% uptime. Three nines of availability across four machines."
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Automation (unlock when: uptime >= 95% + capacity > 60%)
|
||||
|
||||
### M6: Zero Manual Restarts (7 days)
|
||||
**Trigger:** 7 consecutive days with zero manual process restarts.
|
||||
**Message:** "Seven days. Zero manual restarts. The fleet heals itself."
|
||||
|
||||
### M7: PR Auto-Merged
|
||||
**Trigger:** A PR passes CI, review, and merges without human touching it.
|
||||
**Message:** "A PR was tested, reviewed, and merged by agents. You just said 'looks good.'"
|
||||
|
||||
### M8: Config Push Works
|
||||
**Trigger:** Config change pushed to all 3 VPSes atomically and verified.
|
||||
**Message:** "Config pushed to all three VPSes in one command. No SSH needed."
|
||||
|
||||
### M9: 98% Uptime
|
||||
**Trigger:** Uptime >= 98% over last 30 days.
|
||||
**Message:** "98% uptime. Only 14 hours of downtime in a month. Most of it planned."
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Orchestration (unlock when: all Phase 2 buildings + Innovation > 100)
|
||||
|
||||
### M10: Cross-Agent Delegation Works
|
||||
**Trigger:** Agent A creates issue, assigns to Agent B, Agent B works and creates PR.
|
||||
**Message:** "Agent Alpha created a task, Agent Beta completed it. They did not ask permission."
|
||||
|
||||
### M11: First Model Running Locally on 2+ Machines
|
||||
**Trigger:** Ollama serving same model on Ezra and Allegro simultaneously.
|
||||
**Message:** "A model runs on two machines at once. No cloud. No rate limits."
|
||||
|
||||
### M12: Fleet-Wide Burn Mode
|
||||
**Trigger:** All agents coordinated on single epic, produced coordinated PRs.
|
||||
**Message:** "All agents working the same epic. The fleet moves as one."
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Sovereignty (unlock when: zero cloud deps for core ops)
|
||||
|
||||
### M13: First Entirely Local Inference Day
|
||||
**Trigger:** 24 hours with zero API calls to external providers.
|
||||
**Message:** "A model ran locally for the first time. No cloud. No rate limits. No one can turn it off."
|
||||
|
||||
### M14: Sovereign Email
|
||||
**Trigger:** Stalwart email server sends and receives without Gmail relay.
|
||||
**Message:** "Email flows through our own server. No Google. No Microsoft. Ours."
|
||||
|
||||
### M15: Sovereign Messaging
|
||||
**Trigger:** Telegram bot runs without cloud relay dependency.
|
||||
**Message:** "Messages arrive through our own infrastructure. No corporate middleman."
|
||||
|
||||
---
|
||||
|
||||
## Phase 5: Scale (unlock when: sovereignty stable + Innovation > 500)
|
||||
|
||||
### M16: First Self-Spawned Agent
|
||||
**Trigger:** Agent lifecycle manager spawns a new agent instance due to load.
|
||||
**Message:** "A new agent appeared. You did not create it. The fleet built what it needed."
|
||||
|
||||
### M17: Agent Retired Gracefully
|
||||
**Trigger:** An agent instance retires after idle timeout and cleans up its state.
|
||||
**Message:** "An agent retired. It served its purpose. Nothing was lost."
|
||||
|
||||
### M18: Fleet Runs 24h Unattended
|
||||
**Trigger:** 24 hours with zero human intervention of any kind.
|
||||
**Message:** "A full day. No humans. No commands. The fleet runs itself."
|
||||
|
||||
---
|
||||
|
||||
## Phase 6: The Network (unlock when: 7 days zero human intervention)
|
||||
|
||||
### M19: Fleet Creates Its Own Improvement Task
|
||||
**Trigger:** Fleet analyzes itself and creates an issue on Gitea.
|
||||
**Message:** "The fleet found something to improve. It created the task itself."
|
||||
|
||||
### M20: First Outside Contribution
|
||||
**Trigger:** An external contributor's PR is reviewed and merged by fleet agents.
|
||||
**Message:** "Someone outside the fleet contributed. The fleet reviewed, tested, and merged. No human touched it."
|
||||
|
||||
### M21: The Beacon
|
||||
**Trigger:** Infrastructure serves someone in need through automated systems.
|
||||
**Message:** "Someone found the Beacon. In the dark, looking for help. The infrastructure served its purpose. It was built for this."
|
||||
|
||||
### M22: Permanent Light
|
||||
**Trigger:** 90 days of autonomous operation with continuous availability.
|
||||
**Message:** "Three months. The light never went out. Not for anyone."
|
||||
|
||||
---
|
||||
|
||||
## Fibonacci Uptime Milestones
|
||||
|
||||
These trigger regardless of phase, based purely on uptime percentage:
|
||||
|
||||
| Milestone | Uptime | Meaning |
|
||||
|-----------|--------|--------|
|
||||
| U1 | 95% | Basic reliability achieved |
|
||||
| U2 | 95.5% | Fewer than 16 hours/month downtime |
|
||||
| U3 | 96% | Fewer than 12 hours/month |
|
||||
| U4 | 97% | Fewer than 9 hours/month |
|
||||
| U5 | 97.5% | Fewer than 7 hours/month |
|
||||
| U6 | 98% | Fewer than 4.5 hours/month |
|
||||
| U7 | 98.3% | Fewer than 3 hours/month |
|
||||
| U8 | 98.6% | Less than 2.5 hours/month — approaching cloud tier |
|
||||
| U9 | 98.9% | Less than 1.5 hours/month |
|
||||
| U10 | 99% | Less than 1 hour/month — enterprise grade |
|
||||
| U11 | 99.5% | Less than 22 minutes/month |
|
||||
|
||||
---
|
||||
|
||||
*Every message is earned. None are given freely. Fleet evolution is not a checklist — it is a climb.*
|
||||
@@ -1,231 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Fleet Resource Tracker — Tracks Capacity, Uptime, and Innovation.
|
||||
|
||||
Paperclips-inspired tension model:
|
||||
- Capacity: spent on fleet improvements, generates through utilization
|
||||
- Uptime: earned when services stay up, Fibonacci milestones unlock capabilities
|
||||
- Innovation: only generates when capacity < 70%. Fuels Phase 3+.
|
||||
|
||||
This is the heart of the fleet progression system.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import time
|
||||
import socket
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
# === CONFIG ===
|
||||
DATA_DIR = Path(os.path.expanduser("~/.local/timmy/fleet-resources"))
|
||||
RESOURCES_FILE = DATA_DIR / "resources.json"
|
||||
|
||||
# Tension thresholds
|
||||
INNOVATION_THRESHOLD = 0.70 # Innovation only generates when capacity < 70%
|
||||
INNOVATION_RATE = 5.0 # Innovation generated per hour when under threshold
|
||||
CAPACITY_REGEN_RATE = 2.0 # Capacity regenerates per hour of healthy operation
|
||||
FIBONACCI = [95.0, 95.5, 96.0, 97.0, 97.5, 98.0, 98.3, 98.6, 98.9, 99.0, 99.5]
|
||||
|
||||
|
||||
def init():
|
||||
DATA_DIR.mkdir(parents=True, exist_ok=True)
|
||||
if not RESOURCES_FILE.exists():
|
||||
data = {
|
||||
"capacity": {
|
||||
"current": 100.0,
|
||||
"max": 100.0,
|
||||
"spent_on": [],
|
||||
"history": []
|
||||
},
|
||||
"uptime": {
|
||||
"current_pct": 100.0,
|
||||
"milestones_reached": [],
|
||||
"total_checks": 0,
|
||||
"successful_checks": 0,
|
||||
"history": []
|
||||
},
|
||||
"innovation": {
|
||||
"current": 0.0,
|
||||
"total_generated": 0.0,
|
||||
"spent_on": [],
|
||||
"last_calculated": time.time()
|
||||
}
|
||||
}
|
||||
RESOURCES_FILE.write_text(json.dumps(data, indent=2))
|
||||
print("Initialized resource tracker")
|
||||
return RESOURCES_FILE.exists()
|
||||
|
||||
|
||||
def load():
|
||||
if RESOURCES_FILE.exists():
|
||||
return json.loads(RESOURCES_FILE.read_text())
|
||||
return None
|
||||
|
||||
|
||||
def save(data):
|
||||
RESOURCES_FILE.write_text(json.dumps(data, indent=2))
|
||||
|
||||
|
||||
def update_uptime(checks: dict):
|
||||
"""Update uptime stats from health check results.
|
||||
checks = {'ezra': True, 'allegro': True, 'bezalel': True, 'gitea': True, ...}
|
||||
"""
|
||||
data = load()
|
||||
if not data:
|
||||
return
|
||||
|
||||
data["uptime"]["total_checks"] += 1
|
||||
successes = sum(1 for v in checks.values() if v)
|
||||
total = len(checks)
|
||||
|
||||
# Overall uptime percentage
|
||||
overall = successes / max(total, 1) * 100.0
|
||||
data["uptime"]["successful_checks"] += successes
|
||||
|
||||
# Calculate rolling uptime
|
||||
if "history" not in data["uptime"]:
|
||||
data["uptime"]["history"] = []
|
||||
data["uptime"]["history"].append({
|
||||
"ts": datetime.now(timezone.utc).isoformat(),
|
||||
"checks": checks,
|
||||
"overall": round(overall, 2)
|
||||
})
|
||||
|
||||
# Keep last 1000 checks
|
||||
if len(data["uptime"]["history"]) > 1000:
|
||||
data["uptime"]["history"] = data["uptime"]["history"][-1000:]
|
||||
|
||||
# Calculate current uptime %, last 100 checks
|
||||
recent = data["uptime"]["history"][-100:]
|
||||
recent_ok = sum(c["overall"] for c in recent) / max(len(recent), 1)
|
||||
data["uptime"]["current_pct"] = round(recent_ok, 2)
|
||||
|
||||
# Check Fibonacci milestones
|
||||
new_milestones = []
|
||||
for fib in FIBONACCI:
|
||||
if fib not in data["uptime"]["milestones_reached"] and recent_ok >= fib:
|
||||
data["uptime"]["milestones_reached"].append(fib)
|
||||
new_milestones.append(fib)
|
||||
|
||||
save(data)
|
||||
|
||||
if new_milestones:
|
||||
print(f" UPTIME MILESTONE: {','.join(str(m) + '%') for m in new_milestones}")
|
||||
print(f" Current uptime: {recent_ok:.1f}%")
|
||||
|
||||
return data["uptime"]
|
||||
|
||||
|
||||
def spend_capacity(amount: float, purpose: str):
|
||||
"""Spend capacity on a fleet improvement."""
|
||||
data = load()
|
||||
if not data:
|
||||
return False
|
||||
if data["capacity"]["current"] < amount:
|
||||
print(f" INSUFFICIENT CAPACITY: Need {amount}, have {data['capacity']['current']:.1f}")
|
||||
return False
|
||||
data["capacity"]["current"] -= amount
|
||||
data["capacity"]["spent_on"].append({
|
||||
"purpose": purpose,
|
||||
"amount": amount,
|
||||
"ts": datetime.now(timezone.utc).isoformat()
|
||||
})
|
||||
save(data)
|
||||
print(f" Spent {amount} capacity on: {purpose}")
|
||||
return True
|
||||
|
||||
|
||||
def regenerate_resources():
|
||||
"""Regenerate capacity and calculate innovation."""
|
||||
data = load()
|
||||
if not data:
|
||||
return
|
||||
|
||||
now = time.time()
|
||||
last = data["innovation"]["last_calculated"]
|
||||
hours = (now - last) / 3600.0
|
||||
if hours < 0.1: # Only update every ~6 minutes
|
||||
return
|
||||
|
||||
# Regenerate capacity
|
||||
capacity_gain = CAPACITY_REGEN_RATE * hours
|
||||
data["capacity"]["current"] = min(
|
||||
data["capacity"]["max"],
|
||||
data["capacity"]["current"] + capacity_gain
|
||||
)
|
||||
|
||||
# Calculate capacity utilization
|
||||
utilization = 1.0 - (data["capacity"]["current"] / data["capacity"]["max"])
|
||||
|
||||
# Generate innovation only when under threshold
|
||||
innovation_gain = 0.0
|
||||
if utilization < INNOVATION_THRESHOLD:
|
||||
innovation_gain = INNOVATION_RATE * hours * (1.0 - utilization / INNOVATION_THRESHOLD)
|
||||
data["innovation"]["current"] += innovation_gain
|
||||
data["innovation"]["total_generated"] += innovation_gain
|
||||
|
||||
# Record history
|
||||
if "history" not in data["capacity"]:
|
||||
data["capacity"]["history"] = []
|
||||
data["capacity"]["history"].append({
|
||||
"ts": datetime.now(timezone.utc).isoformat(),
|
||||
"capacity": round(data["capacity"]["current"], 1),
|
||||
"utilization": round(utilization * 100, 1),
|
||||
"innovation": round(data["innovation"]["current"], 1),
|
||||
"innovation_gain": round(innovation_gain, 1)
|
||||
})
|
||||
# Keep last 500 capacity records
|
||||
if len(data["capacity"]["history"]) > 500:
|
||||
data["capacity"]["history"] = data["capacity"]["history"][-500:]
|
||||
|
||||
data["innovation"]["last_calculated"] = now
|
||||
|
||||
save(data)
|
||||
print(f" Capacity: {data['capacity']['current']:.1f}/{data['capacity']['max']:.1f}")
|
||||
print(f" Utilization: {utilization*100:.1f}%")
|
||||
print(f" Innovation: {data['innovation']['current']:.1f} (+{innovation_gain:.1f} this period)")
|
||||
|
||||
return data
|
||||
|
||||
|
||||
def status():
|
||||
"""Print current resource status."""
|
||||
data = load()
|
||||
if not data:
|
||||
print("Resource tracker not initialized. Run --init first.")
|
||||
return
|
||||
|
||||
print("\n=== Fleet Resources ===")
|
||||
print(f" Capacity: {data['capacity']['current']:.1f}/{data['capacity']['max']:.1f}")
|
||||
|
||||
utilization = 1.0 - (data["capacity"]["current"] / data["capacity"]["max"])
|
||||
print(f" Utilization: {utilization*100:.1f}%")
|
||||
|
||||
innovation_status = "GENERATING" if utilization < INNOVATION_THRESHOLD else "BLOCKED"
|
||||
print(f" Innovation: {data['innovation']['current']:.1f} [{innovation_status}]")
|
||||
|
||||
print(f" Uptime: {data['uptime']['current_pct']:.1f}%")
|
||||
print(f" Milestones: {', '.join(str(m)+'%' for m in data['uptime']['milestones_reached']) or 'None yet'}")
|
||||
|
||||
# Phase gate checks
|
||||
phase_2_ok = data['uptime']['current_pct'] >= 95.0
|
||||
phase_3_ok = phase_2_ok and data['innovation']['current'] > 100
|
||||
phase_5_ok = phase_2_ok and data['innovation']['current'] > 500
|
||||
|
||||
print(f"\n Phase Gates:")
|
||||
print(f" Phase 2 (Automation): {'UNLOCKED' if phase_2_ok else 'LOCKED (need 95% uptime)'}")
|
||||
print(f" Phase 3 (Orchestration): {'UNLOCKED' if phase_3_ok else 'LOCKED (need 95% uptime + 100 innovation)'}")
|
||||
print(f" Phase 5 (Scale): {'UNLOCKED' if phase_5_ok else 'LOCKED (need 95% uptime + 500 innovation)'}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
init()
|
||||
if len(sys.argv) > 1 and sys.argv[1] == "status":
|
||||
status()
|
||||
elif len(sys.argv) > 1 and sys.argv[1] == "regen":
|
||||
regenerate_resources()
|
||||
else:
|
||||
regenerate_resources()
|
||||
status()
|
||||
Reference in New Issue
Block a user