
CrewAI Evaluation for Phase 2 Integration

Date: 2026-04-07
Issue: [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration
Author: Ezra
House: hermes-ezra

Summary

CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the verdict is REJECT for Phase 2 integration. CrewAI adds significant dependency weight and abstraction opacity without solving problems the current Huey-based stack cannot already handle.


1. Proof-of-Concept Crew

Agents

| Agent | Role | Responsibility |
|-------|------|----------------|
| researcher | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons |
| evaluator | Integration Evaluator | Synthesizes research into a structured adoption recommendation |

Tools

  • read_orchestrator_files — Returns orchestration.py, tasks.py, bin/timmy-orchestrator.sh, and docs/coordinator-first-protocol.md
  • read_issue_358 — Returns the text of the governing issue

Code

See poc_crew.py in this directory for the full implementation.


2. Operational Test Results

What worked

  • pip install crewai completed successfully (v1.13.0)
  • Agent and tool definitions compiled without errors
  • Crew startup and task dispatch UI rendered correctly

What failed

  • Live LLM execution blocked by authentication failures. Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment.
  • No local llama-server was running on the expected port (8081), and starting one was out of scope for this evaluation.
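The port check itself is trivial to reproduce with the standard library — a sketch, with host and port taken from the notes above and nothing CrewAI-specific:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 8081 is the expected llama-server port from the failure notes above.
listening = port_open("127.0.0.1", 8081)
print("llama-server reachable:", listening)
```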

Why this matters

The authentication failure is not a trivial setup issue — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models. CrewAI, by contrast, demands either:

  • A managed cloud LLM API with live credentials, or
  • A carefully tuned local model endpoint that supports its verbose ReAct-style prompts

Either path increases blast radius and failure modes.


3. Current Custom Orchestrator Analysis

Stack

  • Huey (orchestration.py) — SQLite-backed task queue, ~6 lines of initialization
  • tasks.py — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat)
  • bin/timmy-orchestrator.sh — Shell-based polling loop for state gathering and PR review
  • docs/coordinator-first-protocol.md — Intake → Triage → Route → Track → Verify → Report

Strengths

  1. Sovereignty — No external SaaS dependency for queue execution. SQLite is local and inspectable.
  2. Gitea as truth — All state mutations are visible in the forge. Local-only state is explicitly advisory.
  3. Simplicity — Huey has a tiny surface area. A human can read orchestration.py in seconds.
  4. Tool-native — tasks.py calls Hermes directly via subprocess.run([HERMES_PYTHON, ...]). No framework indirection.
  5. Deterministic routing — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander).

Gaps

  • No built-in agent memory/RAG — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine.
  • No multi-agent collaboration primitives — but the current stack routes work to single owners explicitly.
  • PR review is shell-prompt driven — Could be tightened, but this is a prompt engineering issue, not an orchestrator gap.

4. CrewAI Capability Analysis

What CrewAI offers

  • Agent roles — Declarative backstory/goal/role definitions
  • Task graphs — Sequential, hierarchical, or parallel task execution
  • Tool registry — Pydantic-based tool schemas with auto-validation
  • Memory/RAG — Built-in short-term and long-term memory via ChromaDB/LanceDB
  • Crew-wide context sharing — Output from one task flows to the next

Dependency footprint observed

CrewAI pulled in 85+ packages, including:

  • chromadb (~20 MB) + onnxruntime (~17 MB)
  • lancedb (~47 MB)
  • kubernetes client (unused but required by Chroma)
  • grpcio, opentelemetry-*, pdfplumber, textual

Total venv size: >500 MB.

By contrast, Huey is one package (huey) with zero required services.


5. Alignment with Coordinator-First Protocol

| Principle | Current Stack | CrewAI | Assessment |
|-----------|---------------|--------|------------|
| Gitea is truth | All assignments, PRs, and comments are explicit API calls | Agent memory is local (ChromaDB); state can drift from Gitea unless every tool explicitly syncs | Misaligned |
| Local-only state is advisory | SQLite queue is ephemeral; canonical state lives in Gitea | CrewAI encourages "crew memory" as authoritative | Misaligned |
| Verification-before-complete | PR review and merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | Requires heavy customization |
| Sovereignty | Runs on the VPS with no external orchestrator SaaS | Requires an external LLM or complex local model tuning | Degraded |
| Simplicity | ~6 lines for Huey init; readable shell scripts | 500+ MB dependency tree; opaque LangChain-style internals | Degraded |

6. Verdict

REJECT CrewAI for Phase 2 integration.

Confidence: High

Trade-offs

  • Pros of CrewAI: Nice agent-role syntax; built-in task sequencing; rich tool schema validation; active ecosystem.
  • Cons of CrewAI: Massive dependency footprint; memory model conflicts with Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening.

Risks if adopted

  1. Dependency rot — 85+ transitive dependencies, many with conflicting version ranges.
  2. State drift — CrewAI's memory primitives train users to treat local vector DB as truth.
  3. Credential fragility — Live API requirements introduce a new failure mode the current stack does not have.
  4. Vendor-like lock-in — CrewAI's abstractions sit thickly over LangChain. Debugging a stuck crew is harder than debugging a Huey task traceback.

Instead of adopting CrewAI, evolve the current Huey stack with:

  1. A lightweight Agent dataclass in tasks.py (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight.
  2. A delegate() helper that uses Hermes's existing delegate_tool.py for multi-agent work.
  3. Gitea as the only durable state surface — any "memory" should flush to issue comments or timmy-home markdown, not a vector DB.
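A sketch of what items 1 and 2 could look like, using only the standard library. The delegate_tool.py invocation is illustrative — its real interface is not documented here, so the command is only assembled, not executed:

```python
# Sketch of the proposed evolution: an Agent dataclass plus a delegate()
# helper. The delegate_tool.py invocation shown is hypothetical.
from dataclasses import dataclass
import shlex

@dataclass(frozen=True)
class Agent:
    role: str
    goal: str
    system_prompt: str

researcher = Agent(
    role="Orchestration Researcher",
    goal="Extract factual comparisons from orchestrator files",
    system_prompt="Read code and report facts without speculation.",
)

def delegate(agent: Agent, task: str) -> list[str]:
    """Build the (hypothetical) Hermes delegation command for one agent/task."""
    return ["python", "delegate_tool.py", "--role", agent.role, "--task", task]

cmd = delegate(researcher, "Summarize tasks.py")
print(shlex.join(cmd))
```

This gives the organizational clarity of CrewAI's role definitions in roughly a dozen lines, with no new dependencies and no new state surface.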

If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin smolagents-style wrapper) before reconsidering CrewAI.
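For a sense of scale, a "raw function-calling loop" of the kind mentioned above can be very small. In this stub sketch, call_llm stands in for any provider client and the tool registry is hypothetical; a real loop would send the message list to an LLM API:

```python
# Stub sketch of a raw function-calling loop; call_llm and the tool
# registry are hypothetical stand-ins, not any real provider's API.
TOOLS = {
    "read_issue": lambda num: f"issue #{num} text",
}

def call_llm(messages):
    # Stub: a real implementation would send `messages` to an LLM API here.
    # Pretend the model requests one tool call, then finishes.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_issue", "args": {"num": 358}}
    return {"final": "done: reviewed issue #358"}

def run_loop(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):  # hard cap: no unbounded agent loops
        reply = call_llm(messages)
        if "final" in reply:
            return reply["final"]
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "loop limit reached"

print(run_loop("Review issue 358"))
```

The entire control flow is visible in one screen of code, which is the debuggability property the Huey stack already has and CrewAI's layered abstractions give up.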


Artifacts

  • poc_crew.py — 2-agent CrewAI proof-of-concept
  • requirements.txt — Dependency manifest
  • CREWAI_EVALUATION.md — This document