Merge PR #626: pipelines/README.md
# Pipeline Infrastructure

Shared orchestrator for all batch pipelines.

## Components

### orchestrator.py

Shared orchestrator providing:

- **Job Queue**: SQLite-backed with priority support
- **Worker Pool**: Configurable parallelism (default 10)
- **Token Budget**: Per-job tracking and limits
- **Checkpointing**: Resume from any point after restart
- **Rate Limiting**: Provider-aware request throttling
- **Retry Logic**: Exponential backoff with configurable retries (see the sketch after this list)
- **Reporting**: Generate summary reports
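The retry behavior named above is standard exponential backoff. A minimal sketch of the pattern with jitter added (illustrative only; the function and parameter names here are not from `orchestrator.py`):

```python
import random
import time

def call_with_retries(fn, max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Retry fn() with exponential backoff: ~1s, 2s, 4s, ... capped at `cap` seconds."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Full jitter avoids synchronized retry storms across workers.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```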
## Usage

### Python API

```python
from pipelines.orchestrator import PipelineOrchestrator, JobPriority

# Create orchestrator
orchestrator = PipelineOrchestrator(max_workers=10)

# Register pipeline handler
def my_handler(job):
    # Process job.task
    return {"result": "done"}

orchestrator.register_handler("my_pipeline", my_handler)

# Submit jobs
job_id = orchestrator.submit_job(
    pipeline="my_pipeline",
    task={"action": "process", "data": "..."},
    priority=JobPriority.HIGH,
    token_budget=100000
)

# Run orchestrator
orchestrator.run()
```
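Jobs can also be enqueued in a loop before a single `run()` call. A sketch using only the calls shown above (omitting `priority` assumes the orchestrator applies a default; check `orchestrator.py`):

```python
for path in ["a.txt", "b.txt", "c.txt"]:
    orchestrator.submit_job(
        pipeline="my_pipeline",
        task={"action": "process", "data": path},
        token_budget=50000,  # smaller per-job budget for routine work
    )

orchestrator.run()
```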
### CLI

```bash
# Submit a job
python -m pipelines.orchestrator submit my_pipeline --task '{"action": "process"}'

# Run orchestrator
python -m pipelines.orchestrator run --workers 10 --max-jobs 100

# Check job status
python -m pipelines.orchestrator status <job_id>

# Resume paused job
python -m pipelines.orchestrator resume <job_id>

# Show stats
python -m pipelines.orchestrator stats

# Generate report
python -m pipelines.orchestrator report
```
## Database

Jobs are stored in `~/.hermes/pipelines/orchestrator.db`:

- `jobs` - Job queue and state
- `checkpoints` - Resume points
- `reports` - Generated reports
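This README doesn't pin down the schema; a plausible minimal shape for the `jobs` table, inferred from the features above (priority, token budget, state), might look like this. Column names are assumptions, not the actual layout:

```python
import sqlite3

conn = sqlite3.connect("orchestrator.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    id           TEXT PRIMARY KEY,
    pipeline     TEXT NOT NULL,
    task         TEXT NOT NULL,                  -- JSON payload
    priority     INTEGER NOT NULL DEFAULT 0,
    token_budget INTEGER,
    tokens_used  INTEGER NOT NULL DEFAULT 0,
    state        TEXT NOT NULL DEFAULT 'queued', -- e.g. queued/running/paused/done/failed
    created_at   TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
```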
## Configuration

### Rate Limits

```python
orchestrator.configure_rate_limit("Nous", rpm=60, tpm=1000000)
orchestrator.configure_rate_limit("Anthropic", rpm=50, tpm=800000)
```
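An rpm/tpm limit like this is commonly enforced with a token bucket per provider. A minimal sketch of the idea, not the orchestrator's actual implementation:

```python
import time

class TokenBucket:
    """Refills at `rate` units per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, amount: float = 1.0) -> None:
        # Block until `amount` tokens are available, then spend them.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= amount:
                self.tokens -= amount
                return
            time.sleep((amount - self.tokens) / self.rate)

# rpm=60 -> refill one request slot per second on average;
# a second bucket sized in tokens would enforce tpm the same way.
requests = TokenBucket(rate=60 / 60.0, capacity=60)
requests.acquire()  # blocks if the provider's request budget is exhausted
```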
### Token Budgets

Default: 1M tokens per job. Override per job:

```python
orchestrator.submit_job("pipeline", task, token_budget=500000)
```
## Pipelines

All pipelines share this orchestrator:

1. **batch-runner** - Run prompts across datasets
2. **data-gen** - Generate training data
3. **eval-runner** - Run evaluations
4. **trajectory-compress** - Compress trajectories
5. **web-research** - Research tasks