# CrewAI Evaluation for Phase 2 Integration

**Date:** 2026-04-07
**Issue:** [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration
**Author:** Ezra
**House:** hermes-ezra

## Summary
CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the verdict is REJECT for Phase 2 integration. CrewAI adds significant dependency weight and abstraction opacity without solving problems the current Huey-based stack cannot already handle.
## 1. Proof-of-Concept Crew

### Agents
| Agent | Role | Responsibility |
|---|---|---|
| `researcher` | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons |
| `evaluator` | Integration Evaluator | Synthesizes research into a structured adoption recommendation |
### Tools

- `read_orchestrator_files` — Returns `orchestration.py`, `tasks.py`, `bin/timmy-orchestrator.sh`, and `docs/coordinator-first-protocol.md`
- `read_issue_358` — Returns the text of the governing issue
### Code

See `poc_crew.py` in this directory for the full implementation.
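As a framework-free stand-in (not the actual CrewAI API), the shape of the two-agent pipeline in `poc_crew.py` can be sketched with plain dataclasses: each agent is a role plus a callable, tasks run sequentially, and each task's output becomes context for the next — CrewAI's crew-wide context sharing. The tool stub and strings below are illustrative, not the real tool outputs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    role: str
    goal: str
    run: Callable[[str], str]  # takes prior context, returns this agent's output

def read_orchestrator_files() -> str:
    # Stub: the real tool returns orchestration.py, tasks.py, etc.
    return "orchestration.py: SqliteHuey(...); tasks.py: ~2,300 lines"

researcher = Agent(
    role="Orchestration Researcher",
    goal="Extract factual comparisons from orchestrator files",
    run=lambda ctx: f"FINDINGS: {read_orchestrator_files()}",
)
evaluator = Agent(
    role="Integration Evaluator",
    goal="Synthesize research into an adoption recommendation",
    run=lambda ctx: f"RECOMMENDATION based on [{ctx}]",
)

def kickoff(agents: list[Agent]) -> str:
    # Sequential process: each agent sees the previous agent's output.
    context = ""
    for agent in agents:
        context = agent.run(context)
    return context

result = kickoff([researcher, evaluator])
```

This is roughly what the sequential `Crew` does under the hood, minus the LLM calls, ReAct prompting, and memory layers that account for CrewAI's footprint.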
## 2. Operational Test Results

### What worked

- `pip install crewai` completed successfully (v1.13.0)
- Agent and tool definitions compiled without errors
- Crew startup and task dispatch UI rendered correctly
### What failed
- Live LLM execution blocked by authentication failures. Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment.
- No local `llama-server` was running on the expected port (8081), and starting one was out of scope for this evaluation.
### Why this matters
The authentication failure is not a trivial setup issue — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models. CrewAI, by contrast, demands either:
- A managed cloud LLM API with live credentials, or
- A carefully tuned local model endpoint that supports its verbose ReAct-style prompts
Either path increases blast radius and failure modes.
## 3. Current Custom Orchestrator Analysis

### Stack
- **Huey** (`orchestration.py`) — SQLite-backed task queue, ~6 lines of initialization
- **`tasks.py`** — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat)
- **`bin/timmy-orchestrator.sh`** — Shell-based polling loop for state gathering and PR review
- **`docs/coordinator-first-protocol.md`** — Intake → Triage → Route → Track → Verify → Report
### Strengths
- **Sovereignty** — No external SaaS dependency for queue execution. SQLite is local and inspectable.
- **Gitea as truth** — All state mutations are visible in the forge. Local-only state is explicitly advisory.
- **Simplicity** — Huey has a tiny surface area. A human can read `orchestration.py` in seconds.
- **Tool-native** — `tasks.py` calls Hermes directly via `subprocess.run([HERMES_PYTHON, ...])`. No framework indirection.
- **Deterministic routing** — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander).
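The tool-native pattern can be sketched as below. `HERMES_PYTHON` here is a stand-in (set to the current interpreter so the sketch is self-contained), and the inline `-c` script substitutes for a real Hermes tool invocation:

```python
import subprocess
import sys

# In tasks.py, HERMES_PYTHON points at the Hermes venv interpreter.
HERMES_PYTHON = sys.executable

def call_hermes(script_args: list[str]) -> str:
    # Shell out to Hermes directly: no framework indirection,
    # output captured as plain text, failures raise immediately.
    result = subprocess.run(
        [HERMES_PYTHON, *script_args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Stand-in invocation: a trivial inline script instead of a real Hermes tool.
out = call_hermes(["-c", "print('triage ok')"])
```

The appeal is debuggability: the full argv is visible, and a failure is an ordinary `CalledProcessError` with captured stderr rather than a stalled agent loop.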
### Gaps

- **No built-in agent memory/RAG** — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine.
- **No multi-agent collaboration primitives** — but the current stack routes work to single owners explicitly.
- **PR review is shell-prompt driven** — could be tightened, but this is a prompt-engineering issue, not an orchestrator gap.
## 4. CrewAI Capability Analysis

### What CrewAI offers
- **Agent roles** — Declarative backstory/goal/role definitions
- **Task graphs** — Sequential, hierarchical, or parallel task execution
- **Tool registry** — Pydantic-based tool schemas with auto-validation
- **Memory/RAG** — Built-in short-term and long-term memory via ChromaDB/LanceDB
- **Crew-wide context sharing** — Output from one task flows to the next
### Dependency footprint observed

CrewAI pulled in 85+ packages, including:

- `chromadb` (~20 MB) + `onnxruntime` (~17 MB)
- `lancedb` (~47 MB)
- `kubernetes` client (unused but required by Chroma)
- `grpcio`, `opentelemetry-*`, `pdfplumber`, `textual`

Total venv size: >500 MB.

By contrast, Huey is one package (`huey`) with zero required services.
## 5. Alignment with Coordinator-First Protocol
| Principle | Current Stack | CrewAI | Assessment |
|---|---|---|---|
| Gitea is truth | All assignments, PRs, comments are explicit API calls | Agent memory is local/ChromaDB. State can drift from Gitea unless every tool explicitly syncs | Misaligned |
| Local-only state is advisory | SQLite queue is ephemeral; canonical state is in Gitea | CrewAI encourages "crew memory" as authoritative | Misaligned |
| Verification-before-complete | PR review + merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | Requires heavy customization |
| Sovereignty | Runs on VPS with no external orchestrator SaaS | Requires external LLM or complex local model tuning | Degraded |
| Simplicity | ~6 lines for Huey init, readable shell scripts | 500+ MB dependency tree, opaque LangChain-style internals | Degraded |
## 6. Verdict

**REJECT** CrewAI for Phase 2 integration.

**Confidence:** High
### Trade-offs

- **Pros of CrewAI** — Nice agent-role syntax; built-in task sequencing; rich tool-schema validation; active ecosystem.
- **Cons of CrewAI** — Massive dependency footprint; memory model conflicts with the Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening.
### Risks if adopted

- **Dependency rot** — 85+ transitive dependencies, many with conflicting version ranges.
- **State drift** — CrewAI's memory primitives train users to treat a local vector DB as truth.
- **Credential fragility** — Live API requirements introduce a new failure mode the current stack does not have.
- **Vendor-like lock-in** — CrewAI's abstractions sit thickly over LangChain. Debugging a stuck crew is harder than debugging a Huey task traceback.
### Recommended next step

Instead of adopting CrewAI, evolve the current Huey stack with:

- A lightweight `Agent` dataclass in `tasks.py` (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight.
- A `delegate()` helper that uses Hermes's existing `delegate_tool.py` for multi-agent work.
- Keep Gitea as the only durable state surface. Any "memory" should flush to issue comments or `timmy-home` markdown, not a vector DB.
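A minimal sketch of that evolution, with heavy caveats: the path in `DELEGATE_TOOL` and the flag names passed to `delegate_tool.py` are assumptions for illustration, not its actual interface. The helper only builds the argv so the call stays inspectable; a Huey task would execute it.

```python
from dataclasses import dataclass
import sys

@dataclass(frozen=True)
class Agent:
    role: str
    goal: str
    system_prompt: str

# Hypothetical path; the real location of Hermes's delegate_tool.py may differ.
DELEGATE_TOOL = "delegate_tool.py"

def delegate_command(agent: Agent, task: str) -> list[str]:
    # Build (but do not run) the subprocess argv. Flag names are illustrative.
    return [sys.executable, DELEGATE_TOOL,
            "--role", agent.role,
            "--system-prompt", agent.system_prompt,
            "--task", task]

triager = Agent(
    role="Triage",
    goal="Label and route incoming issues",
    system_prompt="You triage Gitea issues per the coordinator-first protocol.",
)
cmd = delegate_command(triager, "Triage issue #358")
# A Huey task in tasks.py would then run: subprocess.run(cmd, check=True)
```

This keeps the role/goal/prompt structure CrewAI provides while the execution path remains the existing, debuggable subprocess pattern.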
If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin smolagents-style wrapper) before reconsidering CrewAI.
## Artifacts

- `poc_crew.py` — 2-agent CrewAI proof-of-concept
- `requirements.txt` — Dependency manifest
- `CREWAI_EVALUATION.md` — This document