- Install CrewAI v1.13.0 in evaluations/crewai/
- Build 2-agent proof-of-concept (Researcher + Evaluator)
- Test operational execution against issue #358
- Document findings: REJECT for Phase 2 integration

CrewAI's 500+ MB dependency footprint, memory-model drift from Gitea-as-truth, and external API fragility outweigh its agent-role syntax benefits. Recommend evolving the existing Huey stack instead.

Closes #358
# CrewAI Evaluation for Phase 2 Integration

**Date:** 2026-04-07

**Issue:** [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration

**Author:** Ezra

**House:** hermes-ezra

## Summary

CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the **verdict is REJECT for Phase 2 integration**. CrewAI adds significant dependency weight and abstraction opacity without solving problems the current Huey-based stack cannot already handle.

---
## 1. Proof-of-Concept Crew

### Agents

| Agent | Role | Responsibility |
|-------|------|----------------|
| `researcher` | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons |
| `evaluator` | Integration Evaluator | Synthesizes research into a structured adoption recommendation |

### Tools

- `read_orchestrator_files` — Returns the contents of `orchestration.py`, `tasks.py`, `bin/timmy-orchestrator.sh`, and `docs/coordinator-first-protocol.md`
- `read_issue_358` — Returns the text of the governing issue

### Code

See `poc_crew.py` in this directory for the full implementation.

---

## 2. Operational Test Results

### What worked

- `pip install crewai` completed successfully (v1.13.0)
- Agent and tool definitions compiled without errors
- Crew startup and task dispatch UI rendered correctly

### What failed

- **Live LLM execution blocked by authentication failures.** Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment.
- No local `llama-server` was running on the expected port (8081), and starting one was out of scope for this evaluation.
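
A preflight check would have caught the missing endpoint before any crew was dispatched. A minimal stdlib sketch, assuming the host and port from this evaluation (`endpoint_is_up` is an illustrative name, not code that exists in the repo):

```python
import socket

def endpoint_is_up(host: str = "127.0.0.1", port: int = 8081) -> bool:
    # Cheap TCP probe: is anything listening where llama-server should be?
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

if not endpoint_is_up():
    print("llama-server not reachable on :8081, skipping live LLM tests")
```

Running this first would have turned a mid-run authentication mystery into an explicit skip.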

### Why this matters

The authentication failure is **not a trivial setup issue** — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models. CrewAI, by contrast, demands either:

- A managed cloud LLM API with live credentials, or
- A carefully tuned local model endpoint that supports its verbose ReAct-style prompts

Either path increases blast radius and failure modes.

---
## 3. Current Custom Orchestrator Analysis

### Stack

- **Huey** (`orchestration.py`) — SQLite-backed task queue, ~6 lines of initialization
- **tasks.py** — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat)
- **bin/timmy-orchestrator.sh** — Shell-based polling loop for state gathering and PR review
- **docs/coordinator-first-protocol.md** — Intake → Triage → Route → Track → Verify → Report

### Strengths

1. **Sovereignty** — No external SaaS dependency for queue execution. SQLite is local and inspectable.
2. **Gitea as truth** — All state mutations are visible in the forge. Local-only state is explicitly advisory.
3. **Simplicity** — Huey has a tiny surface area. A human can read `orchestration.py` in seconds.
4. **Tool-native** — `tasks.py` calls Hermes directly via `subprocess.run([HERMES_PYTHON, ...])`. No framework indirection.
5. **Deterministic routing** — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander).
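
The direct-subprocess pattern can be sketched with the stdlib alone. `HERMES_PYTHON` here is a stand-in bound to the current interpreter, and the inline command merely simulates a Hermes response; the real invocation in `tasks.py` is not reproduced:

```python
import subprocess
import sys

# Stand-in for the Hermes venv interpreter configured in tasks.py.
HERMES_PYTHON = sys.executable

def call_hermes(prompt: str) -> str:
    # Direct subprocess call: no framework indirection, and failures
    # surface as ordinary CalledProcessError tracebacks.
    result = subprocess.run(
        [HERMES_PYTHON, "-c", "import sys; print(sys.stdin.read().upper())"],
        input=prompt,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Everything in the call is visible: the interpreter, the arguments, the stdout. That transparency is what a framework layer would trade away.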

### Gaps

- **No built-in agent memory/RAG** — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine.
- **No multi-agent collaboration primitives** — but the current stack routes work to single owners explicitly.
- **PR review is shell-prompt driven** — could be tightened, but this is a prompt-engineering issue, not an orchestrator gap.

---
## 4. CrewAI Capability Analysis

### What CrewAI offers

- **Agent roles** — Declarative backstory/goal/role definitions
- **Task graphs** — Sequential, hierarchical, or parallel task execution
- **Tool registry** — Pydantic-based tool schemas with auto-validation
- **Memory/RAG** — Built-in short-term and long-term memory via ChromaDB/LanceDB
- **Crew-wide context sharing** — Output from one task flows to the next
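
Stripped of the framework, the sequential hand-off CrewAI formalizes is plain function composition. A toy sketch of the flow (all names hypothetical, not taken from `poc_crew.py`):

```python
def researcher(sources: dict[str, str]) -> str:
    # First task: reduce raw file contents to a factual digest.
    return "files reviewed: " + ", ".join(sorted(sources))

def evaluator(research: str) -> str:
    # Second task: the first task's output is this task's context.
    return f"recommendation drafted from ({research})"

digest = researcher({"orchestration.py": "...", "tasks.py": "..."})
verdict = evaluator(digest)
```

For a two-step pipeline, the framework's task-graph machinery buys little over this.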

### Dependency footprint observed

CrewAI pulled in **85+ packages**, including:

- `chromadb` (~20 MB) + `onnxruntime` (~17 MB)
- `lancedb` (~47 MB)
- `kubernetes` client (unused but required by Chroma)
- `grpcio`, `opentelemetry-*`, `pdfplumber`, `textual`

Total venv size: **>500 MB**.

By contrast, Huey is **one package** (`huey`) with zero required services.
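
For reproducibility, the venv figure can be re-measured with a short stdlib walk; `venv_size_mb` is an illustrative helper, not part of the repo:

```python
from pathlib import Path

def venv_size_mb(venv_dir: str) -> float:
    # Sum every regular file under the venv; symlinks (e.g. to the
    # system interpreter) are skipped so they are not double-counted.
    total = sum(
        p.stat().st_size
        for p in Path(venv_dir).rglob("*")
        if p.is_file() and not p.is_symlink()
    )
    return total / (1024 * 1024)
```

Pointing this at the CrewAI evaluation venv is what produced the >500 MB figure above.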

---

## 5. Alignment with Coordinator-First Protocol

| Principle | Current Stack | CrewAI | Assessment |
|-----------|--------------|--------|------------|
| **Gitea is truth** | All assignments, PRs, comments are explicit API calls | Agent memory is local/ChromaDB. State can drift from Gitea unless every tool explicitly syncs | **Misaligned** |
| **Local-only state is advisory** | SQLite queue is ephemeral; canonical state is in Gitea | CrewAI encourages "crew memory" as authoritative | **Misaligned** |
| **Verification-before-complete** | PR review + merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | **Requires heavy customization** |
| **Sovereignty** | Runs on VPS with no external orchestrator SaaS | Requires external LLM or complex local model tuning | **Degraded** |
| **Simplicity** | ~6 lines for Huey init, readable shell scripts | 500+ MB dependency tree, opaque LangChain-style internals | **Degraded** |
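
Keeping durable notes in the forge needs nothing beyond Gitea's issue-comment endpoint. A stdlib sketch that builds (but does not send) the request; the base URL, token handling, and helper name are assumptions for illustration:

```python
import json
from urllib import request

def build_flush_request(base_url: str, token: str, owner: str,
                        repo: str, issue: int, body: str) -> request.Request:
    # POST /api/v1/repos/{owner}/{repo}/issues/{index}/comments is the
    # standard Gitea endpoint for adding an issue comment.
    return request.Request(
        f"{base_url}/api/v1/repos/{owner}/{repo}/issues/{issue}/comments",
        data=json.dumps({"body": body}).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# A caller would then do: request.urlopen(build_flush_request(...))
```

Nothing here requires a vector DB; the comment lands in the forge where every agent and human can already see it.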

---

## 6. Verdict

**REJECT CrewAI for Phase 2 integration.**

**Confidence:** High

### Trade-offs

- **Pros of CrewAI:** Nice agent-role syntax; built-in task sequencing; rich tool schema validation; active ecosystem.
- **Cons of CrewAI:** Massive dependency footprint; memory model conflicts with Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening.

### Risks if adopted

1. **Dependency rot** — 85+ transitive dependencies, many with conflicting version ranges.
2. **State drift** — CrewAI's memory primitives train users to treat the local vector DB as truth.
3. **Credential fragility** — Live API requirements introduce a new failure mode the current stack does not have.
4. **Vendor-like lock-in** — CrewAI's thick, LangChain-style abstractions make a stuck crew harder to debug than a Huey task traceback.

### Recommended next step

Instead of adopting CrewAI, **evolve the current Huey stack** with:

1. A lightweight `Agent` dataclass in `tasks.py` (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight.
2. A `delegate()` helper that uses Hermes's existing `delegate_tool.py` for multi-agent work.
3. A hard rule that Gitea stays the only durable state surface: any "memory" flushes to issue comments or `timmy-home` markdown, not a vector DB.
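
Steps 1 and 2 can be sketched in a few lines of stdlib Python. The `delegate_tool.py` flag names and the `RESEARCHER` example here are hypothetical, not the tool's real interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    """CrewAI-style organizational clarity, zero framework weight."""
    role: str
    goal: str
    system_prompt: str

RESEARCHER = Agent(
    role="Orchestration Researcher",
    goal="Extract factual comparisons from orchestrator files",
    system_prompt="Report only what the files actually contain.",
)

def delegate(agent: Agent, task: str) -> list[str]:
    # Compose the argv a delegate_tool.py-style helper would receive;
    # actually running it is left to the existing subprocess plumbing.
    return [
        "python", "delegate_tool.py",
        "--system", agent.system_prompt,
        "--task", f"[{agent.role}] {task}",
    ]
```

This keeps the role/goal/prompt bookkeeping CrewAI provides while staying inside the codebase's existing subprocess model.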

If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin `smolagents`-style wrapper) before reconsidering CrewAI.

---

## Artifacts

- `poc_crew.py` — 2-agent CrewAI proof-of-concept
- `requirements.txt` — Dependency manifest
- `CREWAI_EVALUATION.md` — This document