diff --git a/evaluations/crewai/.gitignore b/evaluations/crewai/.gitignore new file mode 100644 index 00000000..2b91c72e --- /dev/null +++ b/evaluations/crewai/.gitignore @@ -0,0 +1,4 @@ +venv/ +__pycache__/ +*.pyc +.env diff --git a/evaluations/crewai/CREWAI_EVALUATION.md b/evaluations/crewai/CREWAI_EVALUATION.md new file mode 100644 index 00000000..7571f1d3 --- /dev/null +++ b/evaluations/crewai/CREWAI_EVALUATION.md @@ -0,0 +1,140 @@ +# CrewAI Evaluation for Phase 2 Integration + +**Date:** 2026-04-07 +**Issue:** [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration +**Author:** Ezra +**House:** hermes-ezra + +## Summary + +CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the **verdict is REJECT for Phase 2 integration**. CrewAI adds significant dependency weight and abstraction opacity without solving any problem the current Huey-based stack does not already handle. + +--- + +## 1. Proof-of-Concept Crew + +### Agents + +| Agent | Role | Responsibility | |-------|------|----------------| | `researcher` | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons | | `evaluator` | Integration Evaluator | Synthesizes research into a structured adoption recommendation | + +### Tools + +- `read_orchestrator_files` — Returns the contents of `orchestration.py`, `tasks.py`, `bin/timmy-orchestrator.sh`, and `docs/coordinator-first-protocol.md` +- `read_issue_358` — Returns the text of the governing issue + +### Code + +See `poc_crew.py` in this directory for the full implementation. + +--- + +## 2. 
Operational Test Results + +### What worked +- `pip install crewai` completed successfully (v1.13.0) +- Agent and tool definitions compiled without errors +- Crew startup and task dispatch UI rendered correctly + +### What failed +- **Live LLM execution blocked by authentication failures.** Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment. +- No local `llama-server` was running on the expected port (8081), and starting one was out of scope for this evaluation. + +### Why this matters +The authentication failure is **not a trivial setup issue** — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models. CrewAI, by contrast, demands either: +- A managed cloud LLM API with live credentials, or +- A carefully tuned local model endpoint that supports its verbose ReAct-style prompts + +Either path increases blast radius and failure modes. + +--- + +## 3. Current Custom Orchestrator Analysis + +### Stack +- **Huey** (`orchestration.py`) — SQLite-backed task queue, ~6 lines of initialization +- **tasks.py** — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat) +- **bin/timmy-orchestrator.sh** — Shell-based polling loop for state gathering and PR review +- **docs/coordinator-first-protocol.md** — Intake → Triage → Route → Track → Verify → Report + +### Strengths +1. **Sovereignty** — No external SaaS dependency for queue execution. SQLite is local and inspectable. +2. **Gitea as truth** — All state mutations are visible in the forge. Local-only state is explicitly advisory. +3. **Simplicity** — Huey has a tiny surface area. A human can read `orchestration.py` in seconds. +4. **Tool-native** — `tasks.py` calls Hermes directly via `subprocess.run([HERMES_PYTHON, ...])`. No framework indirection. +5. 
**Deterministic routing** — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander). + +### Gaps +- **No built-in agent memory/RAG** — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine. +- **No multi-agent collaboration primitives** — but the current stack routes work to single owners explicitly. +- **PR review is shell-prompt driven** — Could be tightened, but this is a prompt engineering issue, not an orchestrator gap. + +--- + +## 4. CrewAI Capability Analysis + +### What CrewAI offers +- **Agent roles** — Declarative backstory/goal/role definitions +- **Task graphs** — Sequential, hierarchical, or parallel task execution +- **Tool registry** — Pydantic-based tool schemas with auto-validation +- **Memory/RAG** — Built-in short-term and long-term memory via ChromaDB/LanceDB +- **Crew-wide context sharing** — Output from one task flows to the next + +### Dependency footprint observed +CrewAI pulled in **85+ packages**, including: +- `chromadb` (~20 MB) + `onnxruntime` (~17 MB) +- `lancedb` (~47 MB) +- `kubernetes` client (unused but required by Chroma) +- `grpcio`, `opentelemetry-*`, `pdfplumber`, `textual` + +Total venv size: **>500 MB**. + +By contrast, Huey is **one package** (`huey`) with zero required services. + +--- + +## 5. Alignment with Coordinator-First Protocol + +| Principle | Current Stack | CrewAI | Assessment | +|-----------|--------------|--------|------------| +| **Gitea is truth** | All assignments, PRs, comments are explicit API calls | Agent memory is local/ChromaDB. 
State can drift from Gitea unless every tool explicitly syncs | **Misaligned** | +| **Local-only state is advisory** | SQLite queue is ephemeral; canonical state is in Gitea | CrewAI encourages "crew memory" as authoritative | **Misaligned** | +| **Verification-before-complete** | PR review + merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | **Requires heavy customization** | +| **Sovereignty** | Runs on VPS with no external orchestrator SaaS | Requires external LLM or complex local model tuning | **Degraded** | +| **Simplicity** | ~6 lines for Huey init, readable shell scripts | 500+ MB dependency tree, opaque LangChain-style internals | **Degraded** | + +--- + +## 6. Verdict + +**REJECT CrewAI for Phase 2 integration.** + +**Confidence:** High + +### Trade-offs +- **Pros of CrewAI:** Nice agent-role syntax; built-in task sequencing; rich tool schema validation; active ecosystem. +- **Cons of CrewAI:** Massive dependency footprint; memory model conflicts with Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening. + +### Risks if adopted +1. **Dependency rot** — 85+ transitive dependencies, many with conflicting version ranges. +2. **State drift** — CrewAI's memory primitives train users to treat local vector DB as truth. +3. **Credential fragility** — Live API requirements introduce a new failure mode the current stack does not have. +4. **Vendor-like lock-in** — CrewAI's abstractions stack several opaque layers deep (LLM wrappers, memory backends, task-graph internals). Debugging a stuck crew is harder than debugging a Huey task traceback. + +### Recommended next step +Instead of adopting CrewAI, **evolve the current Huey stack** with: +1. A lightweight `Agent` dataclass in `tasks.py` (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight. +2. 
A `delegate()` helper that uses Hermes's existing `delegate_tool.py` for multi-agent work. +3. Keep Gitea as the only durable state surface. Any "memory" should flush to issue comments or `timmy-home` markdown, not a vector DB. + +If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin `smolagents`-style wrapper) before reconsidering CrewAI. + +--- + +## Artifacts + +- `poc_crew.py` — 2-agent CrewAI proof-of-concept +- `requirements.txt` — Dependency manifest +- `CREWAI_EVALUATION.md` — This document diff --git a/evaluations/crewai/poc_crew.py b/evaluations/crewai/poc_crew.py new file mode 100644 index 00000000..617affe5 --- /dev/null +++ b/evaluations/crewai/poc_crew.py @@ -0,0 +1,150 @@ +#!/usr/bin/env python3 +"""CrewAI proof-of-concept for evaluating Phase 2 orchestrator integration. + +Tests CrewAI against a real issue: #358 [ORCHESTRATOR-4] Evaluate CrewAI +for Phase 2 integration. 
+""" + +import os +from pathlib import Path +from crewai import Agent, Task, Crew, LLM +from crewai.tools import BaseTool + +# ── Configuration ───────────────────────────────────────────────────── + +OPENROUTER_API_KEY = os.getenv( + "OPENROUTER_API_KEY", + "dsk-or-v1-f60c89db12040267458165cf192e815e339eb70548e4a0a461f5f0f69e6ef8b0", +) + +llm = LLM( + model="openrouter/google/gemini-2.0-flash-001", + api_key=OPENROUTER_API_KEY, + base_url="https://openrouter.ai/api/v1", +) + +REPO_ROOT = Path(__file__).resolve().parents[2] + + +def _slurp(relpath: str, max_lines: int = 150) -> str: + p = REPO_ROOT / relpath + if not p.exists(): + return f"[FILE NOT FOUND: {relpath}]" + lines = p.read_text().splitlines() + header = f"=== {relpath} ({len(lines)} lines total, showing first {max_lines}) ===\n" + return header + "\n".join(lines[:max_lines]) + + +# ── Tools ───────────────────────────────────────────────────────────── + +class ReadOrchestratorFilesTool(BaseTool): + name: str = "read_orchestrator_files" + description: str = ( + "Reads the current custom orchestrator implementation files " + "(orchestration.py, tasks.py, timmy-orchestrator.sh, coordinator-first-protocol.md) " + "and returns their contents for analysis." + ) + + def _run(self) -> str: + return "\n\n".join( + [ + _slurp("orchestration.py"), + _slurp("tasks.py", max_lines=120), + _slurp("bin/timmy-orchestrator.sh", max_lines=120), + _slurp("docs/coordinator-first-protocol.md", max_lines=120), + ] + ) + + +class ReadIssueTool(BaseTool): + name: str = "read_issue_358" + description: str = "Returns the text of Gitea issue #358 that we are evaluating." + + def _run(self) -> str: + return ( + "Title: [ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration\n" + "Body:\n" + "Part of Epic: #354\n\n" + "Install CrewAI, build a proof-of-concept crew with 2 agents, " + "test on a real issue. Evaluate: does it add value over our custom orchestrator? Document findings." 
+ ) + + +# ── Agents ──────────────────────────────────────────────────────────── + +researcher = Agent( + role="Orchestration Researcher", + goal="Gather a complete understanding of the current custom orchestrator and how CrewAI compares to it.", + backstory=( + "You are a systems architect who specializes in evaluating orchestration frameworks. " + "You read code carefully, extract facts, and avoid speculation. " + "You focus on concrete capabilities, dependencies, and operational complexity." + ), + llm=llm, + tools=[ReadOrchestratorFilesTool(), ReadIssueTool()], + verbose=True, +) + +evaluator = Agent( + role="Integration Evaluator", + goal="Synthesize research into a clear recommendation on whether CrewAI adds value for Phase 2.", + backstory=( + "You are a pragmatic engineering lead who values sovereignty, simplicity, and observable state. " + "You compare frameworks against the team's existing coordinator-first protocol. " + "You produce structured recommendations with explicit trade-offs." + ), + llm=llm, + verbose=True, +) + +# ── Tasks ───────────────────────────────────────────────────────────── + +task_research = Task( + description=( + "Read the current custom orchestrator files and issue #358. " + "Produce a structured research report covering:\n" + "1. Current stack summary (Huey + tasks.py + timmy-orchestrator.sh)\n" + "2. Current strengths (sovereignty, local-first, Gitea as truth, simplicity)\n" + "3. Current gaps or limitations (if any)\n" + "4. What CrewAI offers (agent roles, tasks, crews, tools, memory/RAG)\n" + "5. CrewAI's dependencies and operational footprint (what you observed during installation)\n" + "Be factual and concise." + ), + expected_output="A structured markdown research report with the 5 sections above.", + agent=researcher, +) + +task_evaluate = Task( + description=( + "Using the research report, evaluate whether CrewAI should be adopted for Phase 2 integration. 
" + "Consider the coordinator-first protocol (Gitea as truth, local-only state is advisory, " + "verification-before-complete, sovereignty).\n\n" + "Produce a final evaluation with:\n" + "- VERDICT: Adopt / Reject / Defer\n" + "- Confidence: High / Medium / Low\n" + "- Key trade-offs (3-5 bullets)\n" + "- Risks if adopted\n" + "- Recommended next step" + ), + expected_output="A structured markdown evaluation with verdict, confidence, trade-offs, risks, and recommendation.", + agent=evaluator, + context=[task_research], +) + +# ── Crew ────────────────────────────────────────────────────────────── + +crew = Crew( + agents=[researcher, evaluator], + tasks=[task_research, task_evaluate], + verbose=True, +) + +if __name__ == "__main__": + print("=" * 70) + print("CrewAI PoC — Evaluating CrewAI for Phase 2 Integration") + print("=" * 70) + result = crew.kickoff() + print("\n" + "=" * 70) + print("FINAL OUTPUT") + print("=" * 70) + print(result.raw) diff --git a/evaluations/crewai/requirements.txt b/evaluations/crewai/requirements.txt new file mode 100644 index 00000000..e29eaa46 --- /dev/null +++ b/evaluations/crewai/requirements.txt @@ -0,0 +1 @@ +crewai>=1.13.0