forked from Rockachopa/Timmy-time-dashboard
This commit implements six major features:

1. Event Log System (src/swarm/event_log.py)
   - SQLite-based audit trail for all swarm events
   - Task lifecycle tracking (created, assigned, completed, failed)
   - Agent lifecycle tracking (joined, left, status changes)
   - Integrated with coordinator for automatic logging
   - Dashboard page at /swarm/events

2. Lightning Ledger (src/lightning/ledger.py)
   - Transaction tracking for Lightning Network payments
   - Balance calculations (incoming, outgoing, net, available)
   - Integrated with payment_handler for automatic logging
   - Dashboard page at /lightning/ledger

3. Semantic Memory / Vector Store (src/memory/vector_store.py)
   - Embedding-based similarity search for Echo agent
   - Fallback to keyword matching if sentence-transformers unavailable
   - Personal facts storage and retrieval
   - Dashboard page at /memory

4. Cascade Router Integration (src/timmy/cascade_adapter.py)
   - Automatic LLM failover between providers (Ollama → AirLLM → API)
   - Circuit breaker pattern for failing providers
   - Metrics tracking per provider (latency, error rates)
   - Dashboard status page at /router/status

5. Self-Upgrade Approval Queue (src/upgrades/)
   - State machine for self-modifications: proposed → approved/rejected → applied/failed
   - Human approval required before applying changes
   - Git integration for branch management
   - Dashboard queue at /self-modify/queue

6. Real-Time Activity Feed (src/events/broadcaster.py)
   - WebSocket-based live activity streaming
   - Bridges event_log to dashboard clients
   - Activity panel on /swarm/live

Tests:
- 101 unit tests passing
- 4 new E2E test files for Selenium testing
- Run with: SELENIUM_UI=1 pytest tests/functional/ -v --headed

Documentation:
- 6 ADRs (017-022) documenting architecture decisions
- Implementation summary in docs/IMPLEMENTATION_SUMMARY.md
- Architecture diagram in docs/architecture-v2.md
127 lines
3.8 KiB
Markdown
# ADR 020: Cascade Router Integration with Timmy Agent

## Status

Proposed

## Context

Currently, the Timmy agent (`src/timmy/agent.py`) uses `src/timmy/backends.py`, which provides a simple abstraction over Ollama and AirLLM. This abstraction lacks:

- Automatic failover between multiple LLM providers
- Circuit breaker pattern for failing providers
- Cost and latency tracking per provider
- Priority-based routing (local first, then APIs)

The Cascade Router (`src/router/cascade.py`) already implements these features but is not integrated with Timmy.

## Decision

Integrate the Cascade Router as the primary LLM routing layer for Timmy, replacing the direct backend abstraction.

## Architecture

### Current Flow

```
User Request → Timmy Agent → backends.py → Ollama/AirLLM
```

### Proposed Flow

```
User Request → Timmy Agent → Cascade Router → Provider 1 (Ollama)
                                                    ↓ (if fail)
                                              Provider 2 (Local AirLLM)
                                                    ↓ (if fail)
                                              Provider 3 (API - optional)
                                                    ↓
                                              Track metrics per provider
```

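The failover chain in the diagram can be sketched as a loop that tries providers in order and records per-provider counters. This is a minimal illustration, not the code in `src/router/cascade.py`: the `Provider`, `ProviderStats`, and `complete_with_failover` names are hypothetical, and each provider is modeled as a plain callable.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProviderStats:
    """Per-provider counters (hypothetical shape)."""
    requests: int = 0
    failures: int = 0
    total_latency_ms: float = 0.0


@dataclass
class Provider:
    name: str
    call: Callable[[str], str]  # hypothetical: takes a prompt, returns text
    stats: ProviderStats = field(default_factory=ProviderStats)


def complete_with_failover(providers: list[Provider], prompt: str) -> tuple[str, str]:
    """Try each provider in list order; return (provider_name, response)."""
    errors = []
    for provider in providers:
        start = time.perf_counter()
        provider.stats.requests += 1
        try:
            result = provider.call(prompt)
        except Exception as exc:
            # Any provider error triggers failover to the next entry.
            provider.stats.failures += 1
            errors.append(f"{provider.name}: {exc}")
            continue
        finally:
            # Latency is tracked for both successful and failed attempts.
            provider.stats.total_latency_ms += (time.perf_counter() - start) * 1000
        return provider.name, result
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the text is one way to satisfy the "expose provider status in agent responses" requirement without changing the response type.
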
### Integration Points

1. **Timmy Agent** (`src/timmy/agent.py`)
   - Replace `create_timmy()` backend initialization
   - Use `CascadeRouter.complete()` instead of direct `agent.run()`
   - Expose provider status in agent responses

2. **Cascade Router** (`src/router/cascade.py`)
   - Already supports: Ollama, OpenAI, Anthropic, AirLLM
   - Already has: Circuit breakers, metrics, failover logic
   - Add: Integration with existing `src/timmy/prompts.py`

3. **Configuration** (`config.yaml` or `config.py`)
   - Provider list with priorities
   - API keys (optional, for cloud fallback)
   - Circuit breaker thresholds

4. **Dashboard** (new route)
   - `/router/status` - Show provider health, metrics, recent failures
   - Real-time provider status indicator

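As an illustration of what the `/router/status` route might return, the sketch below builds a status payload from per-provider metrics. The field names mirror the `provider_metrics` columns under Schema Additions. The payload shape and the `build_status_payload` helper are assumptions, and the web framework wiring is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderMetrics:
    # Field names follow the provider_metrics columns proposed in this ADR.
    provider_name: str
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    avg_latency_ms: float = 0.0
    last_error_time: Optional[str] = None
    circuit_state: str = "closed"


def build_status_payload(metrics: list[ProviderMetrics]) -> dict:
    """Summarize provider health for a status page (hypothetical shape)."""
    providers = []
    for m in metrics:
        # Guard against division by zero for providers with no traffic yet.
        error_rate = m.failed_requests / m.total_requests if m.total_requests else 0.0
        providers.append({
            "name": m.provider_name,
            "healthy": m.circuit_state == "closed",
            "error_rate": round(error_rate, 3),
            "avg_latency_ms": m.avg_latency_ms,
            "last_error_time": m.last_error_time,
        })
    return {
        "providers": providers,
        "any_available": any(p["healthy"] for p in providers),
    }
```

A route handler could return this dictionary as JSON, and the real-time indicator could poll it or receive the same payload over an existing push channel.
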
### Provider Priority Order

1. **Ollama (local)** - Priority 1, always try first
2. **AirLLM (local)** - Priority 2, if Ollama unavailable
3. **API providers** - Priority 3+, only if configured

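The priority rules above, together with the configuration items listed under Integration Points, might be expressed in `config.py` roughly as follows. This is a sketch under assumptions: the class names, the `requires_api_key` flag, and the `failure_threshold` default are hypothetical, and only the 60-second recovery timeout comes from this ADR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    priority: int                   # lower number = tried earlier
    api_key: Optional[str] = None   # only needed for cloud fallback
    requires_api_key: bool = False  # hypothetical flag marking API providers

    @property
    def enabled(self) -> bool:
        # Local providers are always enabled; API providers need a key.
        return not self.requires_api_key or bool(self.api_key)


@dataclass(frozen=True)
class CircuitBreakerConfig:
    failure_threshold: int = 3        # hypothetical default
    recovery_timeout_s: float = 60.0  # the 60s listed under Mitigations


@dataclass(frozen=True)
class RouterConfig:
    providers: tuple[ProviderConfig, ...]
    breaker: CircuitBreakerConfig = field(default_factory=CircuitBreakerConfig)

    def active_providers(self) -> list[ProviderConfig]:
        """Enabled providers, sorted so priority 1 is tried first."""
        return sorted((p for p in self.providers if p.enabled),
                      key=lambda p: p.priority)
```

With this shape, an API provider that has no key configured simply drops out of the cascade, which matches the "only if configured" rule.
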
### Data Flow

```python
# Timmy Agent
async def respond(self, message: str) -> str:
    # Get cascade router
    router = get_cascade_router()

    # Route through cascade with automatic failover
    response = await router.complete(
        messages=[{"role": "user", "content": message}],
        system_prompt=TIMMY_SYSTEM_PROMPT,
    )

    # Response includes which provider was used
    return response.content
```

## Schema Additions

### Provider Status Table (new)

```sql
CREATE TABLE provider_metrics (
    provider_name TEXT PRIMARY KEY,
    total_requests INTEGER DEFAULT 0,
    successful_requests INTEGER DEFAULT 0,
    failed_requests INTEGER DEFAULT 0,
    avg_latency_ms REAL DEFAULT 0,
    last_error_time TEXT,
    circuit_state TEXT DEFAULT 'closed',
    updated_at TEXT
);
```

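One possible way to keep this table up to date is to update the counters and an incremental mean after every request. The sketch below uses Python's standard `sqlite3` module. The `record_request` helper is hypothetical, and timestamps are stored as ISO 8601 strings as an assumption, since the ADR only specifies `TEXT`.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS provider_metrics (
    provider_name TEXT PRIMARY KEY,
    total_requests INTEGER DEFAULT 0,
    successful_requests INTEGER DEFAULT 0,
    failed_requests INTEGER DEFAULT 0,
    avg_latency_ms REAL DEFAULT 0,
    last_error_time TEXT,
    circuit_state TEXT DEFAULT 'closed',
    updated_at TEXT
);
"""


def record_request(conn: sqlite3.Connection, provider: str,
                   latency_ms: float, ok: bool) -> None:
    """Record one request outcome for a provider (hypothetical helper)."""
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        "SELECT total_requests, avg_latency_ms FROM provider_metrics "
        "WHERE provider_name = ?",
        (provider,),
    ).fetchone()
    total, avg = row if row else (0, 0.0)
    # Incremental mean: fold the new sample into the stored average.
    new_avg = (avg * total + latency_ms) / (total + 1)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO provider_metrics (provider_name) VALUES (?)",
            (provider,),
        )
        conn.execute(
            """UPDATE provider_metrics SET
                   total_requests = total_requests + 1,
                   successful_requests = successful_requests + ?,
                   failed_requests = failed_requests + ?,
                   avg_latency_ms = ?,
                   last_error_time = CASE WHEN ? THEN last_error_time ELSE ? END,
                   updated_at = ?
               WHERE provider_name = ?""",
            (int(ok), int(not ok), new_avg, int(ok), now, now, provider),
        )
```

Storing a running average keeps the table to one row per provider. If percentiles or recent-failure lists are needed for the dashboard, a separate per-request log would be required.
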
## Consequences

### Positive

- Automatic failover improves reliability
- Metrics enable data-driven provider selection
- Circuit breakers prevent cascading failures
- Configurable without code changes

### Negative

- Additional complexity in request path
- Potential latency increase from retries
- Requires careful circuit breaker tuning

### Mitigations

- Circuit breakers have short recovery timeouts (60s)
- Metrics exposed for monitoring
- Fallback to mock responses if all providers fail

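The first and last mitigations can be illustrated with a small circuit breaker that reopens for trial traffic after the recovery timeout, plus a wrapper that returns a mock response when the call cannot be served. This is a generic sketch of the pattern rather than the breaker in `src/router/cascade.py`. The class, the `complete_or_mock` helper, the failure threshold of 3, and the mock text are all assumptions.

```python
import time


class CircuitBreaker:
    """Minimal closed / open / half_open breaker (hypothetical)."""

    def __init__(self, failure_threshold=3, recovery_timeout_s=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout_s:
            return "half_open"  # timeout elapsed: allow a trial request
        return "open"

    def allow_request(self):
        return self.state != "open"

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        # A failed trial request, or too many failures, (re)opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def complete_or_mock(breaker, call, prompt):
    """Call the provider if the breaker allows it; otherwise return a mock."""
    if breaker.allow_request():
        try:
            result = call(prompt)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return "[mock] All providers are unavailable right now."
```

Passing the clock in as a parameter makes the 60-second recovery window testable without real delays, which helps with the tuning concern listed under Negative.
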
## Implementation Plan

1. Create `src/timmy/cascade_adapter.py` - Adapter between Timmy and Cascade Router
2. Modify `src/timmy/agent.py` - Use adapter instead of direct backends
3. Create dashboard route `/router/status` - Provider health UI
4. Add provider metrics persistence to SQLite
5. Write tests for failover scenarios

## Dependencies

- Existing `src/router/cascade.py`
- Existing `src/timmy/agent.py`
- New dashboard route