Merge PR #626: pipelines/README.md
# Pipeline Infrastructure

Shared orchestrator for all batch pipelines.

## Components

### orchestrator.py

Shared orchestrator providing:

- **Job Queue**: SQLite-backed with priority support
- **Worker Pool**: Configurable parallelism (default 10)
- **Token Budget**: Per-job tracking and limits
- **Checkpointing**: Resume from any point after restart
- **Rate Limiting**: Provider-aware request throttling
- **Retry Logic**: Exponential backoff with configurable retries (see the sketch after this list)
- **Reporting**: Generate summary reports
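The retry behavior named above is standard exponential backoff. A minimal sketch of the pattern with jitter added (illustrative only; the function and parameter names here are not from `orchestrator.py`):

```python
import random
import time

def call_with_retries(fn, max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Retry fn() with exponential backoff: ~1s, 2s, 4s, ... capped at `cap` seconds."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Full jitter avoids synchronized retry storms across workers.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```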
## Usage

### Python API

```python
from pipelines.orchestrator import PipelineOrchestrator, JobPriority

# Create orchestrator
orchestrator = PipelineOrchestrator(max_workers=10)

# Register pipeline handler
def my_handler(job):
    # Process job.task
    return {"result": "done"}

orchestrator.register_handler("my_pipeline", my_handler)

# Submit jobs
job_id = orchestrator.submit_job(
    pipeline="my_pipeline",
    task={"action": "process", "data": "..."},
    priority=JobPriority.HIGH,
    token_budget=100000
)

# Run orchestrator
orchestrator.run()
```
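Jobs can also be enqueued in a loop before a single `run()` call. A sketch using only the calls shown above (omitting `priority` assumes the orchestrator applies a default; check `orchestrator.py`):

```python
for path in ["a.txt", "b.txt", "c.txt"]:
    orchestrator.submit_job(
        pipeline="my_pipeline",
        task={"action": "process", "data": path},
        token_budget=50000,  # smaller per-job budget for routine work
    )

orchestrator.run()
```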
### CLI

```bash
# Submit a job
python -m pipelines.orchestrator submit my_pipeline --task '{"action": "process"}'

# Run orchestrator
python -m pipelines.orchestrator run --workers 10 --max-jobs 100

# Check job status
python -m pipelines.orchestrator status <job_id>

# Resume paused job
python -m pipelines.orchestrator resume <job_id>

# Show stats
python -m pipelines.orchestrator stats

# Generate report
python -m pipelines.orchestrator report
```
## Database

Jobs are stored in `~/.hermes/pipelines/orchestrator.db`:

- `jobs` - Job queue and state
- `checkpoints` - Resume points
- `reports` - Generated reports
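This README doesn't pin down the schema; a plausible minimal shape for the `jobs` table, inferred from the features above (priority, token budget, state), might look like this. Column names are assumptions, not the actual layout:

```python
import sqlite3

conn = sqlite3.connect("orchestrator.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    id           TEXT PRIMARY KEY,
    pipeline     TEXT NOT NULL,
    task         TEXT NOT NULL,                  -- JSON payload
    priority     INTEGER NOT NULL DEFAULT 0,
    token_budget INTEGER,
    tokens_used  INTEGER NOT NULL DEFAULT 0,
    state        TEXT NOT NULL DEFAULT 'queued', -- e.g. queued/running/paused/done/failed
    created_at   TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
```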
## Configuration

### Rate Limits

```python
orchestrator.configure_rate_limit("Nous", rpm=60, tpm=1000000)
orchestrator.configure_rate_limit("Anthropic", rpm=50, tpm=800000)
```
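An rpm/tpm limit like this is commonly enforced with a token bucket per provider. A minimal sketch of the idea, not the orchestrator's actual implementation:

```python
import time

class TokenBucket:
    """Refills at `rate` units per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, amount: float = 1.0) -> None:
        # Block until `amount` tokens are available, then spend them.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= amount:
                self.tokens -= amount
                return
            time.sleep((amount - self.tokens) / self.rate)

# rpm=60 -> refill one request slot per second on average;
# a second bucket sized in tokens would enforce tpm the same way.
requests = TokenBucket(rate=60 / 60.0, capacity=60)
requests.acquire()  # blocks if the provider's request budget is exhausted
```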
### Token Budgets

Default: 1M tokens per job. Override per job:

```python
orchestrator.submit_job("pipeline", task, token_budget=500000)
```
## Pipelines

All pipelines share this orchestrator:

1. **batch-runner** - Run prompts across datasets
2. **data-gen** - Generate training data
3. **eval-runner** - Run evaluations
4. **trajectory-compress** - Compress trajectories
5. **web-research** - Research tasks