# Workflow Orchestration & Task Queue Research for AI Agents

**Date:** 2026-04-14

**Scope:** SOTA comparison of task queues and workflow orchestrators for autonomous AI agent workflows

---
## 1. Current Architecture: Cron + Webhook

### How it works

- **Scheduler:** `cron/scheduler.py` — gateway calls `tick()` every 60 seconds
- **Storage:** JSON file (`~/.hermes/cron/jobs.json`) + file-based lock (`cron/.tick.lock`)
- **Execution:** Each job spawns a full `AIAgent.run_conversation()` in a thread pool with inactivity timeout
- **Delivery:** Results pushed back to origin chat via platform adapters (Telegram, Discord, etc.)
- **Checkpointing:** Job outputs saved to `~/.hermes/cron/output/{job_id}/{timestamp}.md`
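
To make the flow above concrete, a minimal sketch of the tick pass (function names like `run_job` and `compute_next_run` are illustrative, not the actual `cron/scheduler.py` API; the file lock is omitted):

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

JOBS = Path("~/.hermes/cron/jobs.json").expanduser()
pool = ThreadPoolExecutor()

def tick(now: float) -> None:
    jobs = json.loads(JOBS.read_text())
    for job in jobs:
        if job["next_run_at"] <= now:
            pool.submit(run_job, job)  # each job gets a fresh agent session
            job["next_run_at"] = compute_next_run(job, now)  # advance schedule
    JOBS.write_text(json.dumps(jobs, indent=2))
```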

### Strengths

- Simple, zero-dependency (no broker/Redis needed)
- Jobs are isolated — each runs a fresh agent session
- Direct platform delivery with E2EE support
- Script pre-run for data collection
- Inactivity-based timeout (not a hard wall-clock limit)

### Weaknesses

- **No task dependencies** — jobs are completely independent
- **No retry logic** — single failure = lost run (recurring jobs advance schedule and move on)
- **No concurrency control** — all due jobs fire at once; no worker pool sizing
- **No observability** — no metrics, no dashboard, no structured logging of job state transitions
- **Tick-based polling** — 60s granularity, wastes cycles when idle, adds latency when busy
- **Single-process** — file lock means only one tick at a time; no horizontal scaling
- **No dead letter queue** — failed deliveries are logged but not retried
- **No workflow chaining** — cannot express "run A, then B with A's output"

---

## 2. Framework Comparison

### 2.1 Huey (Already Installed v2.6.0)

**Architecture:** Embedded task queue, SQLite/Redis/file storage, consumer process model.

| Feature | Huey | Our Cron |
|---|---|---|
| Broker | SQLite (default), Redis | JSON file |
| Retry | Built-in: `retries=N, retry_delay=S` | None |
| Task chaining | `task.s().then(next_task)` pipelines | None |
| Scheduling | `@huey.periodic_task(crontab(...))` | Our own cron parser |
| Concurrency | Worker pool with `-w N` flag | Single tick lock |
| Monitoring | `huey_consumer` logs (no first-party dashboard) | Manual log reading |
| Failure recovery | Automatic retry with fixed `retry_delay` | None |
| Priority | `PriorityRedisExpireHuey` or per-task priority | None |
| Result storage | `store_results=True` with `result()` | File output |

**Task Dependencies Pattern** — Huey chains steps with `.then()`, and the consumer passes each step's return value to the next step as its first argument:

```python
@huey.task()
def analyze_data(input_data):
    return run_analysis(input_data)

@huey.task()
def generate_report(analysis_result):
    return create_report(analysis_result)

# Pipeline: analyze, then feed the result into the report step
pipeline = analyze_data.s(raw_data).then(generate_report)
result_group = huey.enqueue(pipeline)
```

**Retry Pattern:**
```python
import requests

@huey.task(retries=3, retry_delay=60)
def flaky_api_call(url):
    # Huey's built-in retry uses a fixed delay; see §3.2 for an
    # exponential-backoff workaround.
    return requests.get(url, timeout=30)
```
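
The scheduling row above maps onto Huey's periodic tasks; a minimal sketch of how a recurring job could look (the SQLite path is an assumption):

```python
import os
from huey import SqliteHuey, crontab

huey = SqliteHuey(filename=os.path.expanduser("~/.hermes/huey.db"))

@huey.periodic_task(crontab(minute="0", hour="*/6"))
def recurring_digest():
    # Executed by the consumer process every six hours:
    #   huey_consumer.py myapp.huey -w 4
    ...
```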

**Benchmarks:** ~5,000 tasks/sec with SQLite backend, ~15,000 with Redis. Sub-millisecond scheduling latency. Very lightweight — single process.

**Verdict:** Best fit for our use case. Already installed. SQLite backend = no external deps. Can layer on top of our existing job storage.

---

### 2.2 Celery

**Architecture:** Distributed task queue with message broker (RabbitMQ/Redis).

| Feature | Celery | Huey |
|---|---|---|
| Broker | Redis, RabbitMQ, SQS (required) | SQLite (built-in) |
| Scale | 100K+ tasks/sec | ~5–15K tasks/sec |
| Chains | `chain(task1.s(), task2.s())` | `.then()` pipelines |
| Groups/Chords | Parallel + callback | Not built-in |
| Canvas | Full workflow DSL (chain, group, chord, map) | Basic pipeline |
| Monitoring | Flower dashboard, Celery events | Minimal |
| Complexity | Heavy — needs broker, workers, result backend | Single process |

**Workflow Pattern:**
```python
from celery import chain, group, chord

# Chain: sequential — each task's result feeds the next
workflow = chain(fetch_data.s(), analyze.s(), report.s())

# Group: parallel fan-out
parallel = group(fetch_twitter.s(), fetch_reddit.s(), fetch_hn.s())

# Chord: parallel fan-out, then a callback once all complete
chord(parallel)(aggregate_results.s())
```

**Verdict:** Overkill for our scale. Adds RabbitMQ/Redis dependency. The Canvas API is powerful but we don't need 100K task/sec throughput. Flower monitoring is nice but we'd need to deploy it separately.

---

### 2.3 Temporal

**Architecture:** Durable execution engine. Workflows as code with automatic state persistence and replay.

| Feature | Temporal | Our Cron |
|---|---|---|
| State management | Automatic — workflow state persisted on every step | Manual JSON files |
| Failure recovery | Workflows survive process restarts, auto-retry | Lost on crash |
| Task dependencies | Native — workflow code sequences activities | None |
| Long-running tasks | Built-in (days/months OK) | Inactivity timeout |
| Versioning | Workflow versioning for safe updates | No versioning |
| Visibility | Full workflow state at any point | Log files |
| Infrastructure | Requires Temporal server + database | None |
| Language | Python SDK, but Temporal server is Go | Pure Python |

**Workflow Pattern:**
```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class AIAgentWorkflow:
    @workflow.run
    async def run(self, job_config: dict) -> str:
        # Step 1: Fetch data
        data = await workflow.execute_activity(
            fetch_data_activity,
            job_config["script"],
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        # Step 2: Analyze with AI agent
        analysis = await workflow.execute_activity(
            run_agent_activity,
            {"prompt": job_config["prompt"], "context": data},
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=60),
                maximum_attempts=3,
            ),
        )

        # Step 3: Deliver
        await workflow.execute_activity(
            deliver_activity,
            {"platform": job_config["deliver"], "content": analysis},
            start_to_close_timeout=timedelta(seconds=60),
        )
        return analysis
```

**Verdict:** Best architecture for complex multi-step AI workflows, but heavy infrastructure cost. Temporal server needs PostgreSQL/Cassandra + visibility store. Ideal if we reach 50+ multi-step workflows with complex failure modes. Overkill for current needs.

---

### 2.4 Prefect

**Architecture:** Modern data/workflow orchestration with Python-native API.

| Feature | Prefect |
|---|---|
| Dependencies | SQLite (default) or PostgreSQL |
| Task retries | `@task(retries=3, retry_delay_seconds=10)` |
| Task dependencies | `result = task_a(wait_for=[task_b])` |
| Caching | `cache_key_fn` for result caching |
| Subflows | Nested workflow composition |
| Deployments | Schedule via `Deployment` or `CronSchedule` |
| UI | Excellent web dashboard |
| Async | Full async support |

**Workflow Pattern:**
```python
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash

@task(retries=3, retry_delay_seconds=30)
def run_agent(prompt: str) -> str:
    agent = AIAgent(...)
    return agent.run_conversation(prompt)

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def fetch_context(script: str) -> str:
    return run_script(script)

@flow(name="agent-workflow")
def agent_workflow(job_config: dict):
    # Direct task calls block on their results, so data flow already
    # orders fetch_context before run_agent — no wait_for needed.
    context = fetch_context(job_config.get("script", ""))
    result = run_agent(f"{context}\n\n{job_config['prompt']}")
    deliver(result, job_config["deliver"])  # deliver: another @task (elided)
    return result
```
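
For the scheduling side, recent Prefect releases can serve a flow on a cron schedule directly (a sketch; `.serve()` details vary by Prefect version, and the parameters shown are illustrative):

```python
if __name__ == "__main__":
    # Registers a deployment and polls for scheduled runs in-process
    agent_workflow.serve(
        name="nightly-agent",
        cron="0 3 * * *",  # daily at 03:00
        parameters={"job_config": {"prompt": "...", "deliver": "telegram"}},
    )
```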

**Benchmarks:** Sub-second task scheduling. Handles 10K+ concurrent task runs. SQLite backend for single-node.

**Verdict:** Strong alternative. Pythonic, good UI, built-in scheduling. But heavier than Huey — deploys a server process. Best if we want a web dashboard for monitoring. Less infrastructure than Temporal but more than Huey.

---

### 2.5 Apache Airflow

**Architecture:** Batch-oriented DAG scheduler, Python-based.

| Feature | Airflow |
|---|---|
| DAG model | Static DAGs defined in Python files |
| Scheduler | Polling-based, 5–30s granularity |
| Dependencies | PostgreSQL/MySQL + Redis/RabbitMQ + webserver |
| UI | Rich web UI with DAG visualization |
| Best for | ETL, data pipelines, batch processing |
| Weakness | Not designed for dynamic task creation; heavy; DAG definition overhead |
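
For contrast, even a trivial pipeline carries the static-DAG ceremony the verdict below refers to (a sketch using the Airflow 2.x TaskFlow API; task bodies elided):

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False)
def agent_report():
    @task
    def fetch_context():
        ...

    @task
    def run_report(context):
        ...

    run_report(fetch_context())

# The DAG must exist at parse time in a file the scheduler scans —
# it can't be created on the fly from a user prompt.
agent_report()
```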

**Verdict:** Wrong tool for this job. Airflow excels at static, well-defined data pipelines (ETL). Our agent workflows are dynamic — tasks are created at runtime based on user prompts. Airflow's DAG model fights against this. Massive overhead (needs webserver, scheduler, worker, metadata DB).

---

### 2.6 Dramatiq

**Architecture:** Lightweight distributed task queue, Celery alternative.

| Feature | Dramatiq |
|---|---|
| Broker | Redis, RabbitMQ |
| Retries | `@dramatiq.actor(max_retries=3)` |
| Middleware | Pluggable: age_limit, time_limit, retries, callbacks |
| Groups | `group(actor.message(...), ...).run()` |
| Pipes | `actor.message() \| other_actor.message()` |
| Simplicity | Cleaner API than Celery |
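
A sketch of the actor and pipe API (assuming a local Redis broker; `fetch_page` and `summarize` are illustrative):

```python
import dramatiq
from dramatiq.brokers.redis import RedisBroker

dramatiq.set_broker(RedisBroker(host="localhost"))

@dramatiq.actor(max_retries=3)
def fetch_page(url):
    ...

@dramatiq.actor
def summarize(text):
    ...

# Pipe: fetch_page's return value is forwarded to summarize
(fetch_page.message("https://example.com") | summarize.message()).run()
```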

**Verdict:** Nice middle ground between Huey and Celery. But still requires a broker (Redis/RabbitMQ). No SQLite backend. Less ecosystem than Celery, less lightweight than Huey.

---

### 2.7 RQ (Redis Queue)

**Architecture:** Minimal Redis-based task queue.

| Feature | RQ |
|---|---|
| Broker | Redis only |
| Retries | Via `Retry` class |
| Workers | Simple worker processes |
| Dashboard | `rq-dashboard` (separate) |
| Limitation | Redis-only, no SQLite, no scheduling built-in |
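
What RQ does provide is simple to use (a sketch; `process_job` is an illustrative function):

```python
from redis import Redis
from rq import Queue, Retry

q = Queue(connection=Redis())

# Retries exist, but scheduling and chaining need third-party add-ons
q.enqueue(process_job, "abc123", retry=Retry(max=3, interval=[10, 30, 60]))
```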

**Verdict:** Too simple and Redis-dependent. No periodic task support without `rq-scheduler`. No task chaining without third-party add-ons. Not competitive with Huey for our use case.

---

## 3. Architecture Patterns for AI Agent Workflows

### 3.1 Task Chaining (Fan-out / Fan-in)

The critical pattern for multi-step AI workflows:

```
[Script] → [Agent] → [Deliver]
    ↓         ↓          ↓
 Context   Report   Notification
```

**Implementation with Huey:**
```python
@huey.task(retries=2)
def run_script_task(script_path):
    return run_script(script_path)

@huey.task(retries=3, retry_delay=60)
def run_agent_task(context, prompt):
    # In a Huey pipeline the previous step's return value is passed as
    # the first positional argument, so context comes first.
    if context:
        prompt = f"## Context\n{context}\n\n{prompt}"
    agent = AIAgent(...)
    return agent.run_conversation(prompt)

@huey.task()
def deliver_task(result, job_config):
    return deliver_result(job_config, result)

# Compose: script → agent → deliver
def compose_workflow(job):
    if job.get("script"):
        pipeline = run_script_task.s(job["script"]).then(run_agent_task, job["prompt"])
    else:
        pipeline = run_agent_task.s(None, job["prompt"])
    return pipeline.then(deliver_task, job)

# Enqueue the whole chain:
# huey.enqueue(compose_workflow(job))
```

### 3.2 Retry with Exponential Backoff

Huey's built-in `retry_delay` is fixed, so exponential backoff has to be emulated. One way, assuming Huey ≥ 2.5 (where the `RetryTask` exception accepts a `delay` override) — `do_work` and `TransientError` are placeholders:

```python
from huey.exceptions import RetryTask

MAX_RETRIES = 3

@huey.task(retries=MAX_RETRIES, retry_delay=30, context=True)
def ai_workflow_task(data, task=None):
    try:
        return do_work(data)
    except TransientError:
        attempt = MAX_RETRIES - task.retries  # task.retries counts down
        # 30s → 60s → 120s, capped at 10 min
        raise RetryTask(delay=min(30 * 2 ** attempt, 600))
```

### 3.3 Dead Letter Queue

For tasks that exhaust their retries, Huey's signal hooks can route the failure to a dead-letter store (`save_dead_letter` and `notify_user` are our own helpers):

```python
from huey.signals import SIGNAL_ERROR

@huey.task(retries=3)
def flaky_task(data):
    ...

# SIGNAL_ERROR fires when a task raises with no retries remaining
# (attempts that will be retried emit SIGNAL_RETRYING instead).
@huey.signal(SIGNAL_ERROR)
def handle_failure(signal, task, exc=None):
    # Log to the dead letter store for inspection/replay
    save_dead_letter(task, exc)
    # Notify the user of the permanent failure
    notify_user(f"Task {task.name} failed permanently: {exc}")
```

### 3.4 Observability Pattern

```python
import json
import os
from datetime import datetime, timezone

EVENT_LOG = os.path.expanduser("~/.hermes/cron/events.jsonl")

# Structured event logging for every state transition
def emit_event(job_id, event_type, metadata):
    event = {
        "job_id": job_id,
        "event": event_type,  # scheduled, started, completed, failed, retried
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata,
    }
    with open(EVENT_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")
    # Also emit to metrics (Prometheus/StatsD client assumed)
    metrics.increment(f"cron.{event_type}")
```
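
Because events land in an append-only JSONL file, ad-hoc queries need no extra tooling — e.g., failure counts per job (a sketch reusing `EVENT_LOG` from above):

```python
import json
from collections import Counter

def failure_counts(path=EVENT_LOG):
    counts = Counter()
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event["event"] == "failed":
                counts[event["job_id"]] += 1
    return counts
```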

---

## 4. Benchmarks Summary

| Framework | Throughput | Latency | Memory | Startup | Dependencies |
|---|---|---|---|---|---|
| Current Cron | ~1 job/60s tick | 60–120s | Minimal | Instant | None |
| Huey (SQLite) | ~5K tasks/sec | <10ms | ~20MB | <1s | None |
| Huey (Redis) | ~15K tasks/sec | <5ms | ~20MB | <1s | Redis |
| Celery (Redis) | ~15K tasks/sec | <10ms | ~100MB | ~3s | Redis |
| Temporal | ~50K activities/sec | <5ms | ~200MB | ~10s | Temporal server + DB |
| Prefect | ~10K tasks/sec | <20ms | ~150MB | ~5s | PostgreSQL |

---

## 5. Recommendations

### Immediate (Phase 1): Enhance Current Cron

Add these capabilities to the existing `cron/` module **without** switching frameworks (a sketch of the failure path follows the list):

1. **Retry logic** — Add `retry_count`, `retry_delay`, `max_retries` fields to the job JSON. In `scheduler.py tick()`, on failure: if `retries_remaining > 0`, don't advance the schedule; set `next_run_at = now + retry_delay * 2**attempt`.
2. **Backoff** — Exponential: `delay * 2^attempt`, capped at 10 minutes.
3. **Dead letter tracking** — After max retries, mark job state as `dead_letter` and emit a delivery notification with the error.
4. **Concurrency limit** — Add a semaphore (e.g., `max_concurrent=3`) to `tick()` so we don't spawn 20 agents simultaneously.
5. **Structured events** — Append JSON events to `~/.hermes/cron/events.jsonl` for every state transition (scheduled, started, completed, failed, retried, delivered).
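
Items 1–3 combine into a single failure path; a sketch, reusing `emit_event` (§3.4) and `notify_user` (§3.3) — field names like `attempt` are hypothetical, not the current job schema:

```python
MAX_RETRY_DELAY = 600  # cap backoff at 10 minutes

def on_job_failure(job, now, error):
    attempt = job.get("attempt", 0)
    if attempt < job.get("max_retries", 3):
        delay = min(job.get("retry_delay", 30) * 2 ** attempt, MAX_RETRY_DELAY)
        job["attempt"] = attempt + 1
        job["next_run_at"] = now + delay  # hold the schedule; retry instead
        emit_event(job["id"], "retried", {"attempt": job["attempt"], "delay": delay})
    else:
        job["state"] = "dead_letter"  # give up and surface the error
        emit_event(job["id"], "failed", {"error": str(error)})
        notify_user(f"Job {job['id']} moved to dead letter: {error}")
```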

**Effort:** ~1–2 days. No new dependencies.

### Medium-term (Phase 2): Adopt Huey for Workflow Chaining

When we need task dependencies (multi-step agent workflows), migrate to Huey:

1. **Keep the JSON job store** as the source of truth for user-facing job management.
2. **Use Huey as the execution engine** — enqueue tasks from `tick()`, let Huey handle retries, scheduling, and chaining.
3. **SQLite backend** — no new infrastructure. One consumer process (`huey_consumer.py`) alongside the gateway.
4. **Task chaining for multi-step jobs** — `script_task.s(...).then(agent_task).then(delivery_task)`.

**Migration path** (Phase 2a glue sketched below):

- Phase 2a: Run the Huey consumer alongside the gateway. Mirror cron jobs to Huey periodic tasks.
- Phase 2b: Add task chaining for jobs with scripts.
- Phase 2c: Migrate all jobs to Huey; deprecate `tick()`-based execution.
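
The Phase 2a glue is small — `tick()` stops executing jobs itself and just enqueues (a sketch; `load_due_jobs` and `advance_schedule` are illustrative):

```python
def tick(now):
    for job in load_due_jobs(now):
        # Huey's consumer now owns retries, chaining, and execution;
        # compose_workflow() is the pipeline builder from §3.1
        huey.enqueue(compose_workflow(job))
        advance_schedule(job, now)
```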

**Effort:** ~1 week. Huey already installed. Gateway integration ~2–3 days.

### Long-term (Phase 3): Evaluate Temporal/Prefect

Only if:

- We have 100+ concurrent multi-step workflows
- We need workflow versioning and A/B testing
- We need cross-service orchestration (agent calls to external APIs with complex compensation logic)
- We want a web dashboard for non-technical users

**Don't adopt early** — these tools solve problems we don't have yet.

---

## 6. Decision Matrix

| Need | Best Solution | Why |
|---|---|---|
| Simple retry logic | Enhance current cron | Zero deps, fast to implement |
| Task chaining | **Huey** | Already installed, SQLite backend, pipeline API |
| Monitoring dashboard | Prefect | Built-in UI; Huey has no first-party dashboard (Flower is Celery-only) |
| Massive scale (10K+/sec) | Celery + Redis | If we're processing thousands of agent runs per hour |
| Complex compensation | Temporal | Only if we need durable multi-service workflows |
| Periodic scheduling | Current cron (works) or Huey | Current is fine; Huey adds `crontab()`-based periodic tasks |

---

## 7. Key Insight

The cron system's biggest gap isn't the framework — it's the **absence of retry and dependency primitives**. These can be added to the current system in <100 lines of code. The second biggest gap is observability (structured events + metrics), which is also solvable incrementally.

Huey is the right *eventual* target for workflow execution because:

1. Already installed, zero new dependencies
2. SQLite backend matches our "no infrastructure" philosophy
3. Pipeline API gives us task chaining for free
4. Retry is first-class (fixed-delay built in; exponential backoff via a small shim, see §3.2)
5. Consumer model is more efficient than tick-polling
6. Orders-of-magnitude better scheduling latency (milliseconds vs. a 60s tick)

The migration should be gradual — start by wrapping Huey inside our existing cron tick, then progressively move execution to Huey's consumer model.
|