Complete four-pass evolution to production-ready architecture:

**Pass 1 → Foundation:**
- Tool registry, basic harness, 19 tools
- VPS provisioning, Syncthing mesh
- Health daemon, systemd services

**Pass 2 → Three-House Canon:**
- Timmy (Sovereign), Ezra (Archivist), Bezalel (Artificer)
- Provenance tracking, artifact-flow discipline
- House-aware policy enforcement

**Pass 3 → Self-Improvement:**
- Pattern database with SQLite backend
- Adaptive policies (auto-adjust thresholds)
- Predictive execution (success prediction)
- Hermes bridge for shortest-loop telemetry
- Learning velocity tracking

**Pass 4 → Production Integration:**
- Unified API: `from uni_wizard import Harness, House, Mode`
- Three modes: SIMPLE / INTELLIGENT / SOVEREIGN
- Circuit breaker pattern for fault tolerance
- Async/concurrent execution support
- Production hardening (timeouts, retries)

**Allegro Lane Definition:**
- Narrowed to: Gitea integration, Hermes bridge, redundancy/failover
- Provides: Cloud connectivity, telemetry streaming, issue routing
- Does NOT: Make sovereign decisions, authenticate as Timmy

**Files:**
- v3/: Intelligence engine, adaptive harness, Hermes bridge
- v4/: Unified API, production harness, final architecture

Total: ~25KB architecture documentation + production code
Uni-Wizard v3 — Design Critique & Review
Review of Existing Work
1. Timmy's model_tracker.py (v1)
What's good:
- Tracks local vs cloud usage
- Cost estimation
- SQLite persistence
- Ingests from Hermes session DB
The gap:
- Data goes nowhere. It logs but doesn't learn.
- No feedback loop into decision-making
- Sovereignty score is a vanity metric unless it changes behavior
- No pattern recognition on "which models succeed at which tasks"
Verdict: Good telemetry, zero intelligence. Missing: telemetry → analysis → adaptation.
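A minimal sketch of what closing that gap could look like: the same telemetry, queried back into the decision path so logging becomes routing. The `runs` table and `best_model` helper here are illustrative assumptions, not model_tracker.py's actual schema or API.

```python
import sqlite3

# Hypothetical mini-version of the telemetry -> analysis -> adaptation loop.
# Table and column names are assumptions, not model_tracker.py's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (model TEXT, task_type TEXT, success INTEGER)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("local-7b", "summarize", 1), ("local-7b", "summarize", 1),
     ("cloud-xl", "summarize", 1), ("local-7b", "codegen", 0),
     ("cloud-xl", "codegen", 1)],
)

def best_model(task_type: str) -> str:
    """Return the model with the highest logged success rate for this task type."""
    row = conn.execute(
        "SELECT model, AVG(success) AS rate FROM runs "
        "WHERE task_type = ? GROUP BY model ORDER BY rate DESC, model LIMIT 1",
        (task_type,),
    ).fetchone()
    return row[0]

print(best_model("codegen"))  # → cloud-xl (1/1 logged successes vs local-7b's 0/1)
```

The point is not the query itself but that the sovereignty score stops being a vanity metric the moment a routing decision reads it.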
2. Ezra's v2 Harness (Archivist)
What's good:
- `must_read_before_write` policy enforcement
- Evidence level tracking
- Source citation
The gap:
- Policies are static. Ezra doesn't learn which evidence sources are most reliable.
- No tracking of "I read source X, made decision Y, was I right?"
- No adaptive confidence calibration
Verdict: Good discipline, no learning. Missing: outcome feedback → policy refinement.
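One way the missing "I read source X, made decision Y, was I right?" loop could be sketched, assuming hypothetical `record_outcome`/`source_confidence` helpers (not the v2 harness API): per-source reliability accumulates from outcomes and feeds back into confidence.

```python
from collections import defaultdict

# Illustrative outcome-feedback store; names are assumptions, not Ezra's v2 API.
_stats = defaultdict(lambda: {"cited": 0, "correct": 0})

def record_outcome(source: str, was_correct: bool) -> None:
    """Log whether a decision backed by this source turned out to be right."""
    _stats[source]["cited"] += 1
    _stats[source]["correct"] += int(was_correct)

def source_confidence(source: str, prior: float = 0.5) -> float:
    """Laplace-smoothed reliability: starts near the prior, converges to the track record."""
    s = _stats[source]
    return (s["correct"] + prior) / (s["cited"] + 1)

record_outcome("man-page", True)
record_outcome("man-page", True)
record_outcome("forum-post", False)
print(round(source_confidence("man-page"), 2))    # 0.83
print(round(source_confidence("forum-post"), 2))  # 0.25
```

The smoothing prior keeps a single outcome from swinging calibration too hard, which is the essence of adaptive confidence calibration.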
3. Bezalel's v2 Harness (Artificer)
What's good:
- `requires_proof` enforcement
- `test_before_ship` gate
- Proof verification
The gap:
- No failure pattern analysis. If tests fail 80% of the time on certain tools, Bezalel doesn't adapt.
- No "pre-flight check" based on historical failure modes
- No learning from which proof types catch most bugs
Verdict: Good rigor, no adaptation. Missing: failure pattern → prevention.
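A pre-flight check driven by historical failure modes could be as small as this sketch. The `history` dict and `preflight` function are illustrative assumptions, not part of the v2 harness:

```python
# Hypothetical pre-flight gate for Bezalel: tools with a bad track record
# must pass extra proof before they run. All names here are illustrative.
history = {
    "deploy_vps": {"runs": 10, "failures": 8},
    "write_file": {"runs": 50, "failures": 1},
}

def preflight(tool: str, max_failure_rate: float = 0.5) -> str:
    h = history.get(tool)
    if h is None:
        return "allow"  # no data yet: run it, but log the outcome for next time
    rate = h["failures"] / h["runs"]
    # Above the threshold, demand proof up front instead of failing at ship time.
    return "require_proof" if rate > max_failure_rate else "allow"

print(preflight("deploy_vps"))  # require_proof (80% historical failure rate)
print(preflight("write_file"))  # allow (2% historical failure rate)
```

This is failure pattern → prevention in its simplest form: the gate tightens exactly where the history says it should.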
4. Hermes Harness Integration
What's good:
- Rich session data available
- Tool call tracking
- Model performance per task
The gap:
- Shortest loop not utilized. Hermes data exists but doesn't flow into Timmy's decision context.
- No real-time "last 10 similar tasks succeeded with model X"
- No context window optimization based on historical patterns
Verdict: Rich data, unused. Missing: hermes_telemetry → timmy_context → smarter_routing.
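The missing "last 10 similar tasks succeeded with model X" signal could be a single windowed query against the session DB. The `sessions` table layout below is an assumption about the Hermes schema, not its actual shape:

```python
import sqlite3

# Sketch of hermes_telemetry -> timmy_context: success rates over the most
# recent N similar tasks. Table and column names are schema assumptions.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sessions ("
    "id INTEGER PRIMARY KEY, task_type TEXT, model TEXT, success INTEGER)"
)
rows = [("codegen", "local-7b", 0)] * 3 + [("codegen", "cloud-xl", 1)] * 7
db.executemany(
    "INSERT INTO sessions (task_type, model, success) VALUES (?, ?, ?)", rows
)

def recent_context(task_type: str, n: int = 10):
    """Per-model success rate and count over the last n sessions of this task type."""
    return db.execute(
        "SELECT model, AVG(success), COUNT(*) FROM "
        "  (SELECT model, success FROM sessions "
        "   WHERE task_type = ? ORDER BY id DESC LIMIT ?) "
        "GROUP BY model",
        (task_type, n),
    ).fetchall()

for model, rate, count in recent_context("codegen"):
    print(f"{model}: {rate:.0%} over {count} recent runs")
```

Injecting that three-row summary into Timmy's prompt context is the "shortest loop" the critique is asking for.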
The Core Problem
Current Flow (Open Loop):

```
┌─────────┐    ┌──────────┐    ┌─────────┐
│ Execute │───→│ Log Data │───→│ Report  │───→ 🗑️
└─────────┘    └──────────┘    └─────────┘
```
Needed Flow (Closed Loop):

```
┌─────────┐    ┌──────────┐    ┌───────────┐
│ Execute │───→│ Log Data │───→│  Analyze  │
└─────────┘    └──────────┘    └─────┬─────┘
     ▲                               │
     └───────────────────────────────┘
        Adapt Policy / Route / Model
```
The Focus: Local sovereign Timmy must get smarter, faster, and self-improving by closing this loop.
v3 Solution: The Intelligence Layer
1. Feedback Loop Architecture
Every execution feeds into:
- Pattern DB: Tool X with params Y → success rate Z%
- Model Performance: Task type T → best model M
- House Calibration: House H on task T → confidence adjustment
- Predictive Cache: Pre-fetch based on execution patterns
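One possible SQLite layout for these four stores, plus the upsert that turns each execution into a data point. All table and column names are illustrative, not the shipped schema:

```python
import sqlite3

# Illustrative schema for the four intelligence-layer stores (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pattern_db (tool TEXT, params_hash TEXT,
                         runs INTEGER, successes INTEGER,
                         PRIMARY KEY (tool, params_hash));
CREATE TABLE model_perf (task_type TEXT, model TEXT,
                         runs INTEGER, successes INTEGER,
                         PRIMARY KEY (task_type, model));
CREATE TABLE house_calibration (house TEXT, task_type TEXT,
                                confidence_delta REAL,
                                PRIMARY KEY (house, task_type));
CREATE TABLE predictive_cache (pattern TEXT PRIMARY KEY, prefetch TEXT);
""")

def record(tool: str, params_hash: str, success: bool) -> None:
    """Every execution teaches: upsert one outcome into the pattern DB."""
    conn.execute(
        "INSERT INTO pattern_db VALUES (?, ?, 1, ?) "
        "ON CONFLICT(tool, params_hash) DO UPDATE SET "
        "runs = runs + 1, successes = successes + excluded.successes",
        (tool, params_hash, int(success)),
    )

record("git_push", "abc123", True)
record("git_push", "abc123", False)
rate = conn.execute(
    "SELECT successes * 100.0 / runs FROM pattern_db"
).fetchone()[0]
print(f"git_push/abc123 success rate: {rate:.0f}%")  # 50%
```

Keeping all four stores in one local SQLite file matches the "local learning only" principle: nothing here needs the cloud.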
2. Adaptive Policies
Policies become functions of historical performance:
```python
# Instead of a static threshold:
evidence_threshold = 0.8

# ...a dynamic one, derived from the track record:
evidence_threshold = base_threshold * (1 + success_rate_adjustment)
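Expanded into a runnable sketch (all names and constants are illustrative): the threshold tightens when recent outcomes fall short of a target rate and relaxes when they beat it, clamped so adaptation can never loosen the policy into uselessness or lock it shut.

```python
# Illustrative adaptive policy threshold; every name and constant is an assumption.
def adaptive_threshold(base: float, recent_success_rate: float,
                       target: float = 0.9, gain: float = 0.5,
                       lo: float = 0.5, hi: float = 0.95) -> float:
    # Below-target performance raises the bar; above-target performance relaxes it.
    adjustment = gain * (target - recent_success_rate)
    # Clamp so one bad (or lucky) streak can never disable the policy outright.
    return max(lo, min(hi, base * (1 + adjustment)))

print(round(adaptive_threshold(0.8, 0.95), 2))  # 0.78 — slightly relaxed
print(round(adaptive_threshold(0.8, 0.60), 2))  # 0.92 — tightened
print(round(adaptive_threshold(0.8, 0.00), 2))  # 0.95 — clamped at the ceiling
```

The clamp is what makes the adaptation "transparent and sovereignty-preserving": the policy can explain its current value as base × adjustment, bounded by hard limits Timmy never crosses.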
3. Hermes Telemetry Integration
Real-time ingestion from Hermes session DB:
- Last N similar tasks
- Success rates by model
- Latency patterns
- Token efficiency
4. Self-Improvement Metrics
- Prediction accuracy: Did predicted success match actual?
- Policy effectiveness: Did policy change improve outcomes?
- Learning velocity: How fast is Timmy getting better?
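The first and third metrics reduce to small computations; a sketch under assumed inputs (the windowing into weekly accuracy buckets is an illustrative choice, not a fixed design):

```python
# Illustrative self-improvement metrics; function names and the weekly
# windowing scheme are assumptions, not a specified interface.
def prediction_accuracy(pairs):
    """Fraction of runs where the predicted outcome matched the actual one."""
    return sum(predicted == actual for predicted, actual in pairs) / len(pairs)

def learning_velocity(weekly_accuracy):
    """Average per-week change in prediction accuracy; positive means improving."""
    deltas = [b - a for a, b in zip(weekly_accuracy, weekly_accuracy[1:])]
    return sum(deltas) / len(deltas)

outcomes = [(True, True), (True, False), (False, False), (True, True)]
print(prediction_accuracy(outcomes))                      # 0.75
print(round(learning_velocity([0.60, 0.68, 0.74]), 2))    # 0.07
```

"Timmy gets measurably better every day" then has a concrete test: learning velocity stays positive.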
Design Principles for v3
- Every execution teaches — No telemetry without analysis
- Local learning only — Pattern recognition runs locally, no cloud
- Shortest feedback loop — Hermes data → Timmy context in <100ms
- Transparent adaptation — Timmy explains why he changed his policy
- Sovereignty-preserving — Learning improves local decision-making, doesn't outsource it
The goal: Timmy gets measurably better every day he runs.