Commit ebf69d155b by Alexander Whitestone (2026-04-14 21:15:58 -04:00)

feat: GPU Inference Scheduler — Multi-Model Resource Management
Fixes #645

Queue-based model loading with priority lanes and VRAM budget tracking.
Prevents GPU OOM crashes when multiple projects compete for VRAM.

## Features

### Priority Lanes
- REALTIME (1): LPM, live video, interactive sessions
- INTERACTIVE (2): Playground, chat, user-facing
- BATCH (3): Harvester, overnight jobs, background
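
For reference, here is how those lanes could look as an enum. `Priority` and the member names come from the Usage example below; the `IntEnum` base is an assumption, not the scheduler's confirmed implementation:

```python
from enum import IntEnum

class Priority(IntEnum):
    """Priority lanes; a lower value is scheduled first."""
    REALTIME = 1     # LPM, live video, interactive sessions
    INTERACTIVE = 2  # Playground, chat, user-facing
    BATCH = 3        # Harvester, overnight jobs, background
```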

### VRAM Management
- Tracks total/used/available VRAM
- Reserves VRAM when job starts
- Releases VRAM when job completes
- CPU fallback when GPU full
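
A minimal sketch of that reserve/release accounting, assuming a simple in-memory ledger (class and method names here are illustrative, not the scheduler's actual internals):

```python
class VramLedger:
    """Illustrative VRAM bookkeeping: reserve on start, release on completion."""

    def __init__(self, total_mb: int) -> None:
        self.total_mb = total_mb
        self.used_mb = 0

    @property
    def available_mb(self) -> int:
        return self.total_mb - self.used_mb

    def reserve(self, needed_mb: int) -> bool:
        # False signals "GPU full"; the caller would fall back to CPU.
        if needed_mb > self.available_mb:
            return False
        self.used_mb += needed_mb
        return True

    def release(self, freed_mb: int) -> None:
        self.used_mb = max(0, self.used_mb - freed_mb)
```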

### Model Registry
Pre-registered models:
- Video Forge: SD XL (8GB), HeartMuLa (4GB), Wan2.1 (12GB)
- LPM: Video Gen (16GB), A2A (8GB)
- Local: Llama 3 70B (40GB), Llama 3 8B (8GB), MiMo v2 Pro (16GB)
- Playground: SDXL Turbo (6GB)
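
In code, the registry is essentially a name-to-footprint table. A hypothetical layout (only `llama3_8b` is confirmed by the Usage example; the other keys are guesses, and the sizes are the figures above in MB):

```python
# Hypothetical registry shape; sizes are the figures listed above, in MB.
MODEL_REGISTRY = {
    # Video Forge
    "sdxl": 8 * 1024,
    "heartmula": 4 * 1024,
    "wan2_1": 12 * 1024,
    # LPM
    "video_gen": 16 * 1024,
    "a2a": 8 * 1024,
    # Local
    "llama3_70b": 40 * 1024,
    "llama3_8b": 8 * 1024,
    "mimo_v2_pro": 16 * 1024,
    # Playground
    "sdxl_turbo": 6 * 1024,
}
```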

### Cross-Project Scenarios Handled
1. Video Forge batch + LPM live → LPM gets priority
2. 3 Video Forge jobs → Sequential with shared cache
3. Night harvester + playground → Batch runs on idle cycles
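
Scenario 1 as a sketch against the public API shown in Usage below (job IDs and model keys here are illustrative):

```python
from tools.gpu_scheduler import InferenceScheduler, Priority

scheduler = InferenceScheduler(vram_budget_mb=49152)

# A Video Forge batch render is queued first...
scheduler.submit_job("forge-batch-1", "video_forge", "wan2_1", Priority.BATCH)
# ...but an LPM live job submitted afterwards still dequeues first,
# because REALTIME (1) outranks BATCH (3).
scheduler.submit_job("lpm-live-1", "lpm", "video_gen", Priority.REALTIME)

job = scheduler.get_next_job()  # -> the LPM job
```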

## Files
- tools/gpu_scheduler.py: InferenceScheduler class, CLI interface
- tests/tools/test_gpu_scheduler.py: 19 tests, all passing

## Usage
```python
from tools.gpu_scheduler import InferenceScheduler, Priority

scheduler = InferenceScheduler(vram_budget_mb=49152)  # 48GB
scheduler.submit_job("job-1", "lpm", "llama3_8b", Priority.REALTIME)
job = scheduler.get_next_job()   # highest-priority runnable job
scheduler.start_job(job)         # reserves the model's VRAM
# ... do inference ...
scheduler.complete_job(job)      # releases the reserved VRAM
```