Allegro 31026ddcc1 [#76-v4] Final Uni-Wizard Architecture — Production Integration
Complete four-pass evolution to production-ready architecture:

**Pass 1 → Foundation:**
- Tool registry, basic harness, 19 tools
- VPS provisioning, Syncthing mesh
- Health daemon, systemd services

**Pass 2 → Three-House Canon:**
- Timmy (Sovereign), Ezra (Archivist), Bezalel (Artificer)
- Provenance tracking, artifact-flow discipline
- House-aware policy enforcement

**Pass 3 → Self-Improvement:**
- Pattern database with SQLite backend
- Adaptive policies (auto-adjust thresholds)
- Predictive execution (success prediction)
- Hermes bridge for shortest-loop telemetry
- Learning velocity tracking

**Pass 4 → Production Integration:**
- Unified API: `from uni_wizard import Harness, House, Mode`
- Three modes: SIMPLE / INTELLIGENT / SOVEREIGN
- Circuit breaker pattern for fault tolerance
- Async/concurrent execution support
- Production hardening (timeouts, retries)

**Allegro Lane Definition:**
- Narrowed to: Gitea integration, Hermes bridge, redundancy/failover
- Provides: Cloud connectivity, telemetry streaming, issue routing
- Does NOT: Make sovereign decisions, authenticate as Timmy

**Files:**
- v3/: Intelligence engine, adaptive harness, Hermes bridge
- v4/: Unified API, production harness, final architecture

Total: ~25KB architecture documentation + production code
2026-03-30 16:39:42 +00:00


Uni-Wizard v3 — Design Critique & Review

Review of Existing Work

1. Timmy's model_tracker.py (v1)

What's good:

  • Tracks local vs cloud usage
  • Cost estimation
  • SQLite persistence
  • Ingests from Hermes session DB

The gap:

  • Data goes nowhere. It logs but doesn't learn.
  • No feedback loop into decision-making
  • Sovereignty score is a vanity metric unless it changes behavior
  • No pattern recognition on "which models succeed at which tasks"

Verdict: Good telemetry, zero intelligence. Missing: telemetry → analysis → adaptation.
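As a concrete illustration of the missing telemetry → analysis → adaptation step, a minimal sketch. The `runs` table and its `task_type`/`success` columns are assumptions for illustration, not model_tracker.py's actual schema:

```python
import sqlite3

def best_model_for(conn: sqlite3.Connection, task_type: str):
    """Pick the model with the best observed success rate for a task type.

    Assumed schema: runs(model TEXT, task_type TEXT, success INTEGER).
    """
    row = conn.execute(
        """
        SELECT model, AVG(success) AS rate
        FROM runs
        WHERE task_type = ?
        GROUP BY model
        ORDER BY rate DESC
        LIMIT 1
        """,
        (task_type,),
    ).fetchone()
    return row[0] if row else None
```

This is the piece that turns logged usage into a routing decision: the tracker's data feeds back into model selection instead of stopping at a report.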


2. Ezra's v2 Harness (Archivist)

What's good:

  • must_read_before_write policy enforcement
  • Evidence level tracking
  • Source citation

The gap:

  • Policies are static. Ezra doesn't learn which evidence sources are most reliable.
  • No tracking of "I read source X, made decision Y, was I right?"
  • No adaptive confidence calibration

Verdict: Good discipline, no learning. Missing: outcome feedback → policy refinement.
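What outcome feedback → policy refinement could look like in practice, a minimal sketch. The class and method names are hypothetical, not part of the v2 harness:

```python
from collections import defaultdict

class SourceCalibrator:
    """Track 'I read source X, made decision Y, was I right?' outcomes
    and derive a per-source reliability score (hypothetical sketch)."""

    def __init__(self) -> None:
        self.outcomes = defaultdict(list)  # source -> [True, False, ...]

    def record(self, source: str, was_correct: bool) -> None:
        self.outcomes[source].append(was_correct)

    def reliability(self, source: str, prior: float = 0.5) -> float:
        # Laplace-style smoothing: unseen sources sit at the prior
        # instead of a misleading 0% or 100%.
        hits = sum(self.outcomes[source])
        total = len(self.outcomes[source])
        return (hits + prior) / (total + 1)
```

With this in place, evidence thresholds can weight sources by track record rather than treating all citations as equal.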


3. Bezalel's v2 Harness (Artificer)

What's good:

  • requires_proof enforcement
  • test_before_ship gate
  • Proof verification

The gap:

  • No failure pattern analysis. If tests fail 80% of the time on certain tools, Bezalel doesn't adapt.
  • No "pre-flight check" based on historical failure modes
  • No learning from which proof types catch most bugs

Verdict: Good rigor, no adaptation. Missing: failure pattern → prevention.
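One shape the missing failure pattern → prevention step could take, sketched with hypothetical names: a pre-flight gate that flags tools with a bad track record before running them again.

```python
from collections import Counter

class PreflightCheck:
    """Hypothetical pre-flight gate: flag tools whose historical
    failure rate crosses a threshold before executing them again."""

    def __init__(self, failure_threshold: float = 0.5, min_samples: int = 5) -> None:
        self.failure_threshold = failure_threshold
        self.min_samples = min_samples  # don't judge a tool on one bad run
        self.runs: Counter = Counter()
        self.failures: Counter = Counter()

    def record(self, tool: str, failed: bool) -> None:
        self.runs[tool] += 1
        if failed:
            self.failures[tool] += 1

    def should_warn(self, tool: str) -> bool:
        total = self.runs[tool]
        if total < self.min_samples:
            return False  # not enough history to adapt yet
        return self.failures[tool] / total >= self.failure_threshold
```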


4. Hermes Harness Integration

What's good:

  • Rich session data available
  • Tool call tracking
  • Model performance per task

The gap:

  • Shortest loop not utilized. Hermes data exists but doesn't flow into Timmy's decision context.
  • No real-time "last 10 similar tasks succeeded with model X"
  • No context window optimization based on historical patterns

Verdict: Rich data, unused. Missing: hermes_telemetry → timmy_context → smarter_routing.
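The missing hermes_telemetry → timmy_context link could be as small as one query, sketched here against an assumed `sessions` table; the real Hermes schema may differ:

```python
import sqlite3

def recent_similar(conn: sqlite3.Connection, task_type: str, n: int = 10):
    """Fetch the last n similar tasks so routing can see, e.g.,
    'the last 10 summarize tasks succeeded with model X'.

    Assumed schema: sessions(ts INTEGER, task_type TEXT, model TEXT, success INTEGER).
    """
    return conn.execute(
        "SELECT model, success FROM sessions "
        "WHERE task_type = ? ORDER BY ts DESC LIMIT ?",
        (task_type, n),
    ).fetchall()
```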


The Core Problem

Current Flow (Open Loop):
┌─────────┐    ┌──────────┐    ┌─────────┐
│ Execute │───→│ Log Data │───→│  Report │───→ 🗑️
└─────────┘    └──────────┘    └─────────┘

Needed Flow (Closed Loop):
┌─────────┐    ┌──────────┐    ┌───────────┐
│ Execute │───→│ Log Data │───→│  Analyze  │
└─────────┘    └──────────┘    └─────┬─────┘
     ▲                               │
     └───────────────────────────────┘
         Adapt Policy / Route / Model

The Focus: the local, sovereign Timmy must become smarter, faster, and self-improving by closing this loop.
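The closed loop above can be sketched end-to-end in a few lines. This is a minimal illustration, not the v3 implementation; the greedy routing rule is an assumption:

```python
class ClosedLoopHarness:
    """Minimal closed loop: each execution is logged, and the log
    feeds the next routing decision (sketch, not the v3 harness)."""

    def __init__(self) -> None:
        self.history: list[tuple[str, bool]] = []  # (model, success)

    def _score(self, model: str) -> float:
        runs = [ok for m, ok in self.history if m == model]
        # Unseen models get the benefit of the doubt so each is tried at least once.
        return sum(runs) / len(runs) if runs else 1.0

    def execute(self, candidates: list[str], run) -> tuple[str, bool]:
        model = max(candidates, key=self._score)   # Adapt: route on history
        ok = run(model)                            # Execute
        self.history.append((model, ok))           # Log: close the loop
        return model, ok
```

After a single failure, routing shifts away from the failing model, with no external service anywhere in the loop.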


v3 Solution: The Intelligence Layer

1. Feedback Loop Architecture

Every execution feeds into:

  • Pattern DB: Tool X with params Y → success rate Z%
  • Model Performance: Task type T → best model M
  • House Calibration: House H on task T → confidence adjustment
  • Predictive Cache: Pre-fetch based on execution patterns
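A minimal in-memory sketch of the first of these, the Pattern DB. The v3 version persists to SQLite; the keying scheme here is an assumption:

```python
from collections import defaultdict

class PatternDB:
    """Sketch of the pattern store: (tool, param signature) -> success rate."""

    def __init__(self) -> None:
        self.stats = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]

    @staticmethod
    def _key(tool: str, params: dict):
        # Sort params so {'a': 1, 'b': 2} and {'b': 2, 'a': 1} hit the same row.
        return (tool, tuple(sorted(params.items())))

    def record(self, tool: str, params: dict, ok: bool) -> None:
        entry = self.stats[self._key(tool, params)]
        entry[0] += int(ok)
        entry[1] += 1

    def success_rate(self, tool: str, params: dict):
        ok, attempts = self.stats[self._key(tool, params)]
        return ok / attempts if attempts else None
```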

2. Adaptive Policies

Policies become functions of historical performance:

```python
# Instead of static:
evidence_threshold = 0.8

# Dynamic, based on track record:
evidence_threshold = base_threshold * (1 + success_rate_adjustment)
```
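A runnable version of that idea, with the adjustment derived from recent success against a target and clamped to a sane range. The target and bounds are illustrative choices, not v3 constants:

```python
def adaptive_threshold(base: float, success_rate: float,
                       target: float = 0.9,
                       lo: float = 0.5, hi: float = 0.95) -> float:
    """Tighten the evidence threshold when recent success falls below
    target, relax it when above, and clamp to [lo, hi]."""
    adjustment = target - success_rate  # below target -> positive -> stricter
    return min(hi, max(lo, base * (1 + adjustment)))
```

Clamping matters: without it, a run of failures would push the threshold past 1.0 and block everything.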

3. Hermes Telemetry Integration

Real-time ingestion from Hermes session DB:

  • Last N similar tasks
  • Success rates by model
  • Latency patterns
  • Token efficiency
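Folding those raw session rows into a per-model summary Timmy can route on takes only a few lines. The field names are assumed; the real Hermes rows may differ:

```python
from statistics import mean

def summarize_sessions(rows):
    """Aggregate (model, success, latency_ms, tokens) rows into
    per-model success rate, latency, and token efficiency."""
    by_model: dict = {}
    for model, success, latency_ms, tokens in rows:
        s = by_model.setdefault(model, {"wins": 0, "runs": 0,
                                        "latencies": [], "tokens": []})
        s["runs"] += 1
        s["wins"] += int(success)
        s["latencies"].append(latency_ms)
        s["tokens"].append(tokens)
    return {
        model: {
            "success_rate": s["wins"] / s["runs"],
            "avg_latency_ms": mean(s["latencies"]),
            "avg_tokens": mean(s["tokens"]),
        }
        for model, s in by_model.items()
    }
```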

4. Self-Improvement Metrics

  • Prediction accuracy: Did predicted success match actual?
  • Policy effectiveness: Did policy change improve outcomes?
  • Learning velocity: How fast is Timmy getting better?
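The first and third metrics reduce to a few lines each; a sketch under the assumption that outcomes are recorded as (predicted, actual) pairs and accuracy is sampled daily:

```python
def prediction_accuracy(pairs) -> float:
    """Fraction of executions where the predicted outcome matched the actual one."""
    if not pairs:
        return 0.0
    return sum(1 for predicted, actual in pairs if predicted == actual) / len(pairs)

def learning_velocity(daily_accuracy) -> float:
    """Average day-over-day change in accuracy: positive means Timmy is improving."""
    deltas = [b - a for a, b in zip(daily_accuracy, daily_accuracy[1:])]
    return sum(deltas) / len(deltas) if deltas else 0.0
```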

Design Principles for v3

  1. Every execution teaches — No telemetry without analysis
  2. Local learning only — Pattern recognition runs locally, no cloud
  3. Shortest feedback loop — Hermes data → Timmy context in <100ms
  4. Transparent adaptation — Timmy explains why he changed his policy
  5. Sovereignty-preserving — Learning improves local decision-making, doesn't outsource it

The goal: Timmy gets measurably better every day he runs.