
CrewAI Evaluation for Phase 2 Integration

Date: 2026-04-07
Issue: [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration
Author: Ezra
House: hermes-ezra

Summary

CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the verdict is REJECT for Phase 2 integration. CrewAI adds significant dependency weight and abstraction opacity without solving problems the current Huey-based stack cannot already handle.


1. Proof-of-Concept Crew

Agents

| Agent | Role | Responsibility |
|-------|------|----------------|
| researcher | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons |
| evaluator | Integration Evaluator | Synthesizes research into a structured adoption recommendation |

Tools

  • read_orchestrator_files — Returns orchestration.py, tasks.py, bin/timmy-orchestrator.sh, and docs/coordinator-first-protocol.md
  • read_issue_358 — Returns the text of the governing issue

Code

See poc_crew.py in this directory for the full implementation.


2. Operational Test Results

What worked

  • pip install crewai completed successfully (v1.13.0)
  • Agent and tool definitions compiled without errors
  • Crew startup and task dispatch UI rendered correctly

What failed

  • Live LLM execution blocked by authentication failures. Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment.
  • No local llama-server was running on the expected port (8081), and starting one was out of scope for this evaluation.
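The port check itself is trivial to reproduce with the standard library — a sketch, with host and port taken from the notes above and nothing CrewAI-specific:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 8081 is the expected llama-server port from the failure notes above.
listening = port_open("127.0.0.1", 8081)
print("llama-server reachable:", listening)
```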

Why this matters

The authentication failure is not a trivial setup issue — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models. CrewAI, by contrast, demands either:

  • A managed cloud LLM API with live credentials, or
  • A carefully tuned local model endpoint that supports its verbose ReAct-style prompts

Either path increases blast radius and failure modes.


3. Current Custom Orchestrator Analysis

Stack

  • Huey (orchestration.py) — SQLite-backed task queue, ~6 lines of initialization
  • tasks.py — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat)
  • bin/timmy-orchestrator.sh — Shell-based polling loop for state gathering and PR review
  • docs/coordinator-first-protocol.md — Intake → Triage → Route → Track → Verify → Report

Strengths

  1. Sovereignty — No external SaaS dependency for queue execution. SQLite is local and inspectable.
  2. Gitea as truth — All state mutations are visible in the forge. Local-only state is explicitly advisory.
  3. Simplicity — Huey has a tiny surface area. A human can read orchestration.py in seconds.
  4. Tool-native — tasks.py calls Hermes directly via subprocess.run([HERMES_PYTHON, ...]). No framework indirection.
  5. Deterministic routing — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander).

Gaps

  • No built-in agent memory/RAG — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine.
  • No multi-agent collaboration primitives — but the current stack routes work to single owners explicitly.
  • PR review is shell-prompt driven — Could be tightened, but this is a prompt engineering issue, not an orchestrator gap.

4. CrewAI Capability Analysis

What CrewAI offers

  • Agent roles — Declarative backstory/goal/role definitions
  • Task graphs — Sequential, hierarchical, or parallel task execution
  • Tool registry — Pydantic-based tool schemas with auto-validation
  • Memory/RAG — Built-in short-term and long-term memory via ChromaDB/LanceDB
  • Crew-wide context sharing — Output from one task flows to the next

Dependency footprint observed

CrewAI pulled in 85+ packages, including:

  • chromadb (~20 MB) + onnxruntime (~17 MB)
  • lancedb (~47 MB)
  • kubernetes client (unused but required by Chroma)
  • grpcio, opentelemetry-*, pdfplumber, textual

Total venv size: >500 MB.

By contrast, Huey is one package (huey) with zero required services.


5. Alignment with Coordinator-First Protocol

| Principle | Current Stack | CrewAI | Assessment |
|-----------|---------------|--------|------------|
| Gitea is truth | All assignments, PRs, and comments are explicit API calls | Agent memory is local (ChromaDB); state can drift from Gitea unless every tool explicitly syncs | Misaligned |
| Local-only state is advisory | SQLite queue is ephemeral; canonical state lives in Gitea | CrewAI encourages "crew memory" as authoritative | Misaligned |
| Verification-before-complete | PR review and merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | Requires heavy customization |
| Sovereignty | Runs on the VPS with no external orchestrator SaaS | Requires an external LLM or complex local model tuning | Degraded |
| Simplicity | ~6 lines for Huey init; readable shell scripts | 500+ MB dependency tree; opaque LangChain-style internals | Degraded |

6. Verdict

REJECT CrewAI for Phase 2 integration.

Confidence: High

Trade-offs

  • Pros of CrewAI: Nice agent-role syntax; built-in task sequencing; rich tool schema validation; active ecosystem.
  • Cons of CrewAI: Massive dependency footprint; memory model conflicts with Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening.

Risks if adopted

  1. Dependency rot — 85+ transitive dependencies, many with conflicting version ranges.
  2. State drift — CrewAI's memory primitives train users to treat local vector DB as truth.
  3. Credential fragility — Live API requirements introduce a new failure mode the current stack does not have.
  4. Vendor-like lock-in — CrewAI's abstractions sit thickly over LangChain. Debugging a stuck crew is harder than debugging a Huey task traceback.

Instead of adopting CrewAI, evolve the current Huey stack with:

  1. A lightweight Agent dataclass in tasks.py (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight.
  2. A delegate() helper that uses Hermes's existing delegate_tool.py for multi-agent work.
  3. Gitea as the only durable state surface — any "memory" should flush to issue comments or timmy-home markdown, not a vector DB.
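A sketch of what items 1 and 2 could look like, using only the standard library. The delegate_tool.py invocation is illustrative — its real interface is not documented here, so the command is only assembled, not executed:

```python
# Sketch of the proposed evolution: an Agent dataclass plus a delegate()
# helper. The delegate_tool.py invocation shown is hypothetical.
from dataclasses import dataclass
import shlex

@dataclass(frozen=True)
class Agent:
    role: str
    goal: str
    system_prompt: str

researcher = Agent(
    role="Orchestration Researcher",
    goal="Extract factual comparisons from orchestrator files",
    system_prompt="Read code and report facts without speculation.",
)

def delegate(agent: Agent, task: str) -> list[str]:
    """Build the (hypothetical) Hermes delegation command for one agent/task."""
    return ["python", "delegate_tool.py", "--role", agent.role, "--task", task]

cmd = delegate(researcher, "Summarize tasks.py")
print(shlex.join(cmd))
```

This gives the organizational clarity of CrewAI's role definitions in roughly a dozen lines, with no new dependencies and no new state surface.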

If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin smolagents-style wrapper) before reconsidering CrewAI.
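For a sense of scale, a "raw function-calling loop" of the kind mentioned above can be very small. In this stub sketch, call_llm stands in for any provider client and the tool registry is hypothetical; a real loop would send the message list to an LLM API:

```python
# Stub sketch of a raw function-calling loop; call_llm and the tool
# registry are hypothetical stand-ins, not any real provider's API.
TOOLS = {
    "read_issue": lambda num: f"issue #{num} text",
}

def call_llm(messages):
    # Stub: a real implementation would send `messages` to an LLM API here.
    # Pretend the model requests one tool call, then finishes.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_issue", "args": {"num": 358}}
    return {"final": "done: reviewed issue #358"}

def run_loop(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):  # hard cap: no unbounded agent loops
        reply = call_llm(messages)
        if "final" in reply:
            return reply["final"]
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "loop limit reached"

print(run_loop("Review issue 358"))
```

The entire control flow is visible in one screen of code, which is the debuggability property the Huey stack already has and CrewAI's layered abstractions give up.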


Artifacts

  • poc_crew.py — 2-agent CrewAI proof-of-concept
  • requirements.txt — Dependency manifest
  • CREWAI_EVALUATION.md — This document