Complete four-pass evolution to production-ready architecture:

**Pass 1 → Foundation:**

- Tool registry, basic harness, 19 tools
- VPS provisioning, Syncthing mesh
- Health daemon, systemd services

**Pass 2 → Three-House Canon:**

- Timmy (Sovereign), Ezra (Archivist), Bezalel (Artificer)
- Provenance tracking, artifact-flow discipline
- House-aware policy enforcement

**Pass 3 → Self-Improvement:**

- Pattern database with SQLite backend
- Adaptive policies (auto-adjust thresholds)
- Predictive execution (success prediction)
- Hermes bridge for shortest-loop telemetry
- Learning velocity tracking

**Pass 4 → Production Integration:**

- Unified API: `from uni_wizard import Harness, House, Mode`
- Three modes: SIMPLE / INTELLIGENT / SOVEREIGN
- Circuit breaker pattern for fault tolerance
- Async/concurrent execution support
- Production hardening (timeouts, retries)

**Allegro Lane Definition:**

- Narrowed to: Gitea integration, Hermes bridge, redundancy/failover
- Provides: Cloud connectivity, telemetry streaming, issue routing
- Does NOT: Make sovereign decisions, authenticate as Timmy

**Files:**

- v3/: Intelligence engine, adaptive harness, Hermes bridge
- v4/: Unified API, production harness, final architecture

Total: ~25KB architecture documentation + production code

# Uni-Wizard v3 — Design Critique & Review
## Review of Existing Work

### 1. Timmy's model_tracker.py (v1)
**What's good:**

- Tracks local vs cloud usage
- Cost estimation
- SQLite persistence
- Ingests from Hermes session DB

**The gap:**

- **Data goes nowhere.** It logs but doesn't learn.
- No feedback loop into decision-making
- Sovereignty score is a vanity metric unless it changes behavior
- No pattern recognition on "which models succeed at which tasks"

**Verdict:** Good telemetry, zero intelligence. Missing: `telemetry → analysis → adaptation`.

---
### 2. Ezra's v2 Harness (Archivist)
**What's good:**

- `must_read_before_write` policy enforcement
- Evidence level tracking
- Source citation

**The gap:**

- **Policies are static.** Ezra doesn't learn which evidence sources are most reliable.
- No tracking of "I read source X, made decision Y, was I right?"
- No adaptive confidence calibration
-
**Verdict:** Good discipline, no learning. Missing: `outcome feedback → policy refinement`.

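What adaptive confidence calibration could look like: a minimal sketch that records per-source outcomes and smooths the hit rate so one result doesn't swing a source's score. Class and method names are illustrative, not Ezra's actual API.

```python
from collections import defaultdict

class SourceCalibrator:
    """Track 'I read source X, made decision Y, was I right?' outcomes."""

    def __init__(self):
        self.outcomes = defaultdict(list)  # source -> [True/False, ...]

    def record(self, source: str, was_correct: bool) -> None:
        self.outcomes[source].append(was_correct)

    def reliability(self, source: str, prior: float = 0.5) -> float:
        # Laplace-smoothed hit rate: unseen sources start at the prior,
        # and a single bad outcome can't zero a source out.
        hits = sum(self.outcomes[source])
        n = len(self.outcomes[source])
        return (hits + prior) / (n + 1)

cal = SourceCalibrator()
cal.record("wiki/arch.md", True)
cal.record("wiki/arch.md", True)
cal.record("random-forum", False)
print(round(cal.reliability("wiki/arch.md"), 2))  # 0.83
print(round(cal.reliability("random-forum"), 2))  # 0.25
```

Feeding `reliability()` back into the evidence threshold is exactly the `outcome feedback → policy refinement` loop the verdict above says is missing.
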
---
### 3. Bezalel's v2 Harness (Artificer)
**What's good:**

- `requires_proof` enforcement
- `test_before_ship` gate
- Proof verification

**The gap:**

- **No failure pattern analysis.** If tests fail 80% of the time on certain tools, Bezalel doesn't adapt.
- No "pre-flight check" based on historical failure modes
- No learning from which proof types catch most bugs

**Verdict:** Good rigor, no adaptation. Missing: `failure pattern → prevention`.

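A sketch of the missing pre-flight check: flag tools whose historical failure rate crosses a threshold before running them. The class, threshold, and minimum-history values are assumptions, not Bezalel's real policy surface.

```python
from collections import Counter

class PreflightAdvisor:
    """Warn before invoking tools with a history of failures."""

    def __init__(self, warn_rate: float = 0.5, min_runs: int = 4):
        self.runs = Counter()
        self.failures = Counter()
        self.warn_rate = warn_rate
        self.min_runs = min_runs

    def record(self, tool: str, failed: bool) -> None:
        self.runs[tool] += 1
        self.failures[tool] += int(failed)

    def preflight(self, tool: str) -> bool:
        """True if history says 'run extra checks before this tool'."""
        n = self.runs[tool]
        if n < self.min_runs:
            return False  # not enough history to judge
        return self.failures[tool] / n >= self.warn_rate

adv = PreflightAdvisor()
for failed in (True, True, True, False, True):
    adv.record("flaky_deploy", failed)
print(adv.preflight("flaky_deploy"))   # True: 4/5 historical failures
print(adv.preflight("unknown_tool"))   # False: no history yet
```

This turns `failure pattern → prevention` into a concrete gate: a `True` pre-flight result can require an extra proof before the `test_before_ship` stage even starts.
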
---
### 4. Hermes Harness Integration
**What's good:**

- Rich session data available
- Tool call tracking
- Model performance per task

**The gap:**

- **Shortest loop not utilized.** Hermes data exists but doesn't flow into Timmy's decision context.
- No real-time "last 10 similar tasks succeeded with model X"
- No context window optimization based on historical patterns

**Verdict:** Rich data, unused. Missing: `hermes_telemetry → timmy_context → smarter_routing`.

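The "last 10 similar tasks" signal could be a single query over the session DB. A sketch, assuming a hypothetical `sessions(task_type, model, success, ts)` table rather than Hermes's real schema:

```python
import sqlite3

def recent_similar(conn: sqlite3.Connection, task_type: str, n: int = 10) -> dict:
    """Per-model success rate over the last n similar tasks."""
    rows = conn.execute(
        """SELECT model, success FROM sessions
           WHERE task_type = ? ORDER BY ts DESC LIMIT ?""",
        (task_type, n),
    ).fetchall()
    by_model = {}
    for model, success in rows:
        wins, total = by_model.get(model, (0, 0))
        by_model[model] = (wins + success, total + 1)
    return {m: wins / total for m, (wins, total) in by_model.items()}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (task_type TEXT, model TEXT, success INTEGER, ts INTEGER)"
)
conn.executemany("INSERT INTO sessions VALUES (?,?,?,?)", [
    ("refactor", "local-7b", 1, 1), ("refactor", "local-7b", 1, 2),
    ("refactor", "cloud-xl", 0, 3),
])
print(recent_similar(conn, "refactor"))  # per-model success over the window
```

Injecting this dict into Timmy's decision context at routing time is the `hermes_telemetry → timmy_context → smarter_routing` chain the verdict calls for.
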
---
## The Core Problem
```
Current Flow (Open Loop):

┌─────────┐    ┌──────────┐    ┌─────────┐
│ Execute │───→│ Log Data │───→│ Report  │───→ 🗑️
└─────────┘    └──────────┘    └─────────┘

Needed Flow (Closed Loop):

┌─────────┐    ┌──────────┐    ┌───────────┐
│ Execute │───→│ Log Data │───→│  Analyze  │
└─────────┘    └──────────┘    └─────┬─────┘
     ▲                               │
     └───────────────────────────────┘
       Adapt Policy / Route / Model
```

**The Focus:** Local sovereign Timmy must get **smarter, faster, and self-improving** by closing this loop.
---
## v3 Solution: The Intelligence Layer
### 1. Feedback Loop Architecture

Every execution feeds into:

- **Pattern DB**: Tool X with params Y → success rate Z%
- **Model Performance**: Task type T → best model M
- **House Calibration**: House H on task T → confidence adjustment
- **Predictive Cache**: Pre-fetch based on execution patterns

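As one concrete reading of the Pattern DB bullet, here is a sketch with an assumed `patterns(tool, param_sig, successes, attempts)` table (not the real v3 schema): one row per (tool, param signature), upserted after every execution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patterns (
        tool TEXT, param_sig TEXT,
        successes INTEGER DEFAULT 0, attempts INTEGER DEFAULT 0,
        PRIMARY KEY (tool, param_sig)
    )""")

def record_execution(tool: str, param_sig: str, ok: bool) -> None:
    """Upsert one execution result into the pattern DB."""
    conn.execute(
        """INSERT INTO patterns (tool, param_sig, successes, attempts)
           VALUES (?, ?, ?, 1)
           ON CONFLICT(tool, param_sig) DO UPDATE SET
               successes = successes + excluded.successes,
               attempts = attempts + 1""",
        (tool, param_sig, int(ok)),
    )

record_execution("file_write", "path+content", True)
record_execution("file_write", "path+content", False)
rate = conn.execute(
    "SELECT successes * 1.0 / attempts FROM patterns WHERE tool='file_write'"
).fetchone()[0]
print(rate)  # 0.5
```

The success-rate column is exactly the "Tool X with params Y → success rate Z%" lookup the bullet describes; the other three feeds could be tables of the same shape keyed on task type, house, or execution prefix.
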
### 2. Adaptive Policies
Policies become functions of historical performance:

```python
# Instead of static:
evidence_threshold = 0.8

# Dynamic based on track record:
evidence_threshold = base_threshold * (1 + success_rate_adjustment)
```

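The snippet above leaves `success_rate_adjustment` undefined. One plausible definition (an assumption, not the v3 formula) raises the evidence bar after failures and relaxes it after successes, with a cap on how far it can drift:

```python
def adaptive_threshold(base_threshold: float, recent_outcomes: list[bool],
                       max_shift: float = 0.15) -> float:
    """Scale a base threshold by recent success rate.

    Illustrative only: the real v3 policy function may weight
    outcomes differently (e.g., recency-weighted).
    """
    if not recent_outcomes:
        return base_threshold
    success_rate = sum(recent_outcomes) / len(recent_outcomes)
    # Map success rate 0..1 onto an adjustment of +max_shift..-max_shift:
    # 50% success is neutral, worse tightens, better relaxes.
    adjustment = (0.5 - success_rate) * 2 * max_shift
    return min(1.0, base_threshold * (1 + adjustment))

print(adaptive_threshold(0.8, [True] * 8 + [False] * 2))  # relaxed below 0.8
print(adaptive_threshold(0.8, [False] * 8 + [True] * 2))  # raised above 0.8
```

The cap matters for the transparency principle below: a bounded, explainable adjustment is easy for Timmy to justify, while an unbounded one can silently drift a policy into uselessness.
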
### 3. Hermes Telemetry Integration
Real-time ingestion from Hermes session DB:

- Last N similar tasks
- Success rates by model
- Latency patterns
- Token efficiency

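The four signals above could roll up into a single context object handed to Timmy. A sketch; the record fields are assumptions about what Hermes exposes, not its real session schema:

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class SessionRecord:
    """One hypothetical row ingested from the Hermes session DB."""
    task_type: str
    model: str
    success: bool
    latency_ms: float
    tokens_in: int
    tokens_out: int

def telemetry_summary(records: list[SessionRecord]) -> dict:
    """Roll recent sessions up into success, latency, and token signals."""
    ok = [r for r in records if r.success]
    return {
        "success_rate": len(ok) / len(records),
        "median_latency_ms": median(r.latency_ms for r in records),
        "tokens_per_success": (
            sum(r.tokens_in + r.tokens_out for r in ok) / len(ok) if ok else None
        ),
    }

records = [
    SessionRecord("summarize", "local-7b", True, 420.0, 900, 150),
    SessionRecord("summarize", "local-7b", False, 600.0, 900, 10),
    SessionRecord("summarize", "local-7b", True, 380.0, 800, 140),
]
print(telemetry_summary(records))
```

Keeping the summary this small is what makes the <100ms shortest-loop budget plausible: the heavy aggregation happens at ingestion time, not at decision time.
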
### 4. Self-Improvement Metrics
- **Prediction accuracy**: Did predicted success match actual?
- **Policy effectiveness**: Did policy change improve outcomes?
- **Learning velocity**: How fast is Timmy getting better?

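Learning velocity needs a concrete definition to be trackable. One simple candidate (an illustration, not the official v3 metric) is the least-squares slope of the weekly success rate: positive means improving, and the magnitude is the per-week gain.

```python
def learning_velocity(weekly_success_rates: list[float]) -> float:
    """Slope of a least-squares line through weekly success rates."""
    n = len(weekly_success_rates)
    if n < 2:
        return 0.0  # no trend from a single point
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weekly_success_rates) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, weekly_success_rates))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

print(round(learning_velocity([0.60, 0.65, 0.72, 0.78]), 3))  # ~0.061/week
```

A slope has a useful side effect for the other two metrics: if a policy change is effective, it should show up as a kink upward in this same series.
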
---
## Design Principles for v3
1. **Every execution teaches** — No telemetry without analysis
2. **Local learning only** — Pattern recognition runs locally, no cloud
3. **Shortest feedback loop** — Hermes data → Timmy context in <100ms
4. **Transparent adaptation** — Timmy explains why he changed his policy
5. **Sovereignty-preserving** — Learning improves local decision-making, doesn't outsource it

---
*The goal: Timmy gets measurably better every day he runs.*