# CrewAI Evaluation for Phase 2 Integration

**Date:** 2026-04-07
**Issue:** [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration
**Author:** Ezra
**House:** hermes-ezra

## Summary

CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the **verdict is REJECT for Phase 2 integration**. CrewAI adds significant dependency weight and abstraction opacity without solving any problem the current Huey-based stack cannot already handle.

---

## 1. Proof-of-Concept Crew

### Agents

| Agent | Role | Responsibility |
|-------|------|----------------|
| `researcher` | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons |
| `evaluator` | Integration Evaluator | Synthesizes research into a structured adoption recommendation |

### Tools

- `read_orchestrator_files` — Returns `orchestration.py`, `tasks.py`, `bin/timmy-orchestrator.sh`, and `docs/coordinator-first-protocol.md`
- `read_issue_358` — Returns the text of the governing issue

### Code

See `poc_crew.py` in this directory for the full implementation.

---

## 2. Operational Test Results

### What worked

- `pip install crewai` completed successfully (v1.13.0)
- Agent and tool definitions compiled without errors
- Crew startup and task dispatch UI rendered correctly

### What failed

- **Live LLM execution blocked by authentication failures.** Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment.
- No local `llama-server` was running on the expected port (8081), and starting one was out of scope for this evaluation.

### Why this matters

The authentication failure is **not a trivial setup issue** — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models.
CrewAI, by contrast, demands either:

- A managed cloud LLM API with live credentials, or
- A carefully tuned local model endpoint that supports its verbose ReAct-style prompts

Either path increases blast radius and failure modes.

---

## 3. Current Custom Orchestrator Analysis

### Stack

- **Huey** (`orchestration.py`) — SQLite-backed task queue, ~6 lines of initialization
- **tasks.py** — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat)
- **bin/timmy-orchestrator.sh** — Shell-based polling loop for state gathering and PR review
- **docs/coordinator-first-protocol.md** — Intake → Triage → Route → Track → Verify → Report

### Strengths

1. **Sovereignty** — No external SaaS dependency for queue execution. SQLite is local and inspectable.
2. **Gitea as truth** — All state mutations are visible in the forge. Local-only state is explicitly advisory.
3. **Simplicity** — Huey has a tiny surface area. A human can read `orchestration.py` in seconds.
4. **Tool-native** — `tasks.py` calls Hermes directly via `subprocess.run([HERMES_PYTHON, ...])`. No framework indirection.
5. **Deterministic routing** — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander).

### Gaps

- **No built-in agent memory/RAG** — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine.
- **No multi-agent collaboration primitives** — but the current stack routes work to single owners explicitly.
- **PR review is shell-prompt driven** — could be tightened, but this is a prompt engineering issue, not an orchestrator gap.

---

## 4. CrewAI Capability Analysis

### What CrewAI offers

- **Agent roles** — Declarative backstory/goal/role definitions
- **Task graphs** — Sequential, hierarchical, or parallel task execution
- **Tool registry** — Pydantic-based tool schemas with auto-validation
- **Memory/RAG** — Built-in short-term and long-term memory via ChromaDB/LanceDB
- **Crew-wide context sharing** — Output from one task flows to the next

### Dependency footprint observed

CrewAI pulled in **85+ packages**, including:

- `chromadb` (~20 MB) + `onnxruntime` (~17 MB)
- `lancedb` (~47 MB)
- `kubernetes` client (unused but required by Chroma)
- `grpcio`, `opentelemetry-*`, `pdfplumber`, `textual`

Total venv size: **>500 MB**. By contrast, Huey is **one package** (`huey`) with zero required services.

---

## 5. Alignment with Coordinator-First Protocol

| Principle | Current Stack | CrewAI | Assessment |
|-----------|--------------|--------|------------|
| **Gitea is truth** | All assignments, PRs, and comments are explicit API calls | Agent memory is local (ChromaDB); state can drift from Gitea unless every tool explicitly syncs | **Misaligned** |
| **Local-only state is advisory** | SQLite queue is ephemeral; canonical state is in Gitea | CrewAI encourages "crew memory" as authoritative | **Misaligned** |
| **Verification-before-complete** | PR review + merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | **Requires heavy customization** |
| **Sovereignty** | Runs on VPS with no external orchestrator SaaS | Requires external LLM or complex local model tuning | **Degraded** |
| **Simplicity** | ~6 lines for Huey init, readable shell scripts | 500+ MB dependency tree, opaque LangChain-style internals | **Degraded** |

---

## 6. Verdict

**REJECT CrewAI for Phase 2 integration.**

**Confidence:** High

### Trade-offs

- **Pros of CrewAI:** Nice agent-role syntax; built-in task sequencing; rich tool schema validation; active ecosystem.
- **Cons of CrewAI:** Massive dependency footprint; memory model conflicts with the Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening.

### Risks if adopted

1. **Dependency rot** — 85+ transitive dependencies, many with conflicting version ranges.
2. **State drift** — CrewAI's memory primitives train users to treat the local vector DB as truth.
3. **Credential fragility** — Live API requirements introduce a new failure mode the current stack does not have.
4. **Vendor-like lock-in** — CrewAI's abstractions sit thickly over LangChain. Debugging a stuck crew is harder than debugging a Huey task traceback.

### Recommended next step

Instead of adopting CrewAI, **evolve the current Huey stack** with:

1. A lightweight `Agent` dataclass in `tasks.py` (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight.
2. A `delegate()` helper that uses Hermes's existing `delegate_tool.py` for multi-agent work.
3. Gitea as the only durable state surface. Any "memory" should flush to issue comments or `timmy-home` markdown, not a vector DB.

If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin `smolagents`-style wrapper) before reconsidering CrewAI.

---

## Artifacts

- `poc_crew.py` — 2-agent CrewAI proof-of-concept
- `requirements.txt` — Dependency manifest
- `CREWAI_EVALUATION.md` — This document
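As an illustration of the recommended next step, the `Agent` dataclass and `delegate()` helper might look like the sketch below. Field names, the `HERMES_PYTHON` path, and the `delegate_tool.py` invocation are assumptions for illustration, not the actual `tasks.py` implementation:

```python
# Hypothetical sketch of the proposed evolution of tasks.py: the role/goal
# clarity of a CrewAI Agent, in ~20 lines with zero new dependencies.
from dataclasses import dataclass
import subprocess

HERMES_PYTHON = "/usr/bin/python3"  # placeholder; tasks.py defines the real path

@dataclass(frozen=True)
class Agent:
    role: str
    goal: str
    system_prompt: str

def delegate(agent: Agent, task: str) -> str:
    """Route a task to another agent via the existing delegate_tool.py
    (the CLI flags here are illustrative)."""
    proc = subprocess.run(
        [HERMES_PYTHON, "delegate_tool.py", "--role", agent.role, "--task", task],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout

triage_agent = Agent(
    role="triage",
    goal="Label and route incoming Gitea issues",
    system_prompt="You are the triage coordinator for the forge...",
)
```

Because `Agent` is a frozen dataclass, agent definitions stay declarative and inspectable, while all durable state continues to flow through Gitea rather than a framework-managed memory store.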