Allegro 31026ddcc1 [#76-v4] Final Uni-Wizard Architecture — Production Integration
Complete four-pass evolution to production-ready architecture:

**Pass 1 → Foundation:**
- Tool registry, basic harness, 19 tools
- VPS provisioning, Syncthing mesh
- Health daemon, systemd services

**Pass 2 → Three-House Canon:**
- Timmy (Sovereign), Ezra (Archivist), Bezalel (Artificer)
- Provenance tracking, artifact-flow discipline
- House-aware policy enforcement

**Pass 3 → Self-Improvement:**
- Pattern database with SQLite backend
- Adaptive policies (auto-adjust thresholds)
- Predictive execution (success prediction)
- Hermes bridge for shortest-loop telemetry
- Learning velocity tracking

**Pass 4 → Production Integration:**
- Unified API: `from uni_wizard import Harness, House, Mode`
- Three modes: SIMPLE / INTELLIGENT / SOVEREIGN
- Circuit breaker pattern for fault tolerance
- Async/concurrent execution support
- Production hardening (timeouts, retries)

**Allegro Lane Definition:**
- Narrowed to: Gitea integration, Hermes bridge, redundancy/failover
- Provides: Cloud connectivity, telemetry streaming, issue routing
- Does NOT: Make sovereign decisions, authenticate as Timmy

**Files:**
- v3/: Intelligence engine, adaptive harness, Hermes bridge
- v4/: Unified API, production harness, final architecture

Total: ~25KB architecture documentation + production code
2026-03-30 16:39:42 +00:00


# Uni-Wizard v3 — Design Critique & Review
## Review of Existing Work
### 1. Timmy's model_tracker.py (v1)
**What's good:**
- Tracks local vs cloud usage
- Cost estimation
- SQLite persistence
- Ingests from Hermes session DB
**The gap:**
- **Data goes nowhere.** It logs but doesn't learn.
- No feedback loop into decision-making
- Sovereignty score is a vanity metric unless it changes behavior
- No pattern recognition on "which models succeed at which tasks"
**Verdict:** Good telemetry, zero intelligence. Missing: `telemetry → analysis → adaptation`.
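The missing `telemetry → analysis → adaptation` step can be sketched in a few lines. This is a hypothetical example, not model_tracker.py's actual schema: it assumes a `runs` table of `(task_type, model, success)` rows and turns the logged data into a routing decision.

```python
import sqlite3

def best_model_for(conn: sqlite3.Connection, task_type: str,
                   min_samples: int = 5, default: str = "local-default") -> str:
    """Pick the model with the best observed success rate for a task type.

    Falls back to `default` until some model has enough samples to judge.
    """
    row = conn.execute(
        """
        SELECT model, AVG(success) AS rate
        FROM runs
        WHERE task_type = ?
        GROUP BY model
        HAVING COUNT(*) >= ?
        ORDER BY rate DESC
        LIMIT 1
        """,
        (task_type, min_samples),
    ).fetchone()
    return row[0] if row else default
```

With this in the loop, the sovereignty score stops being a vanity metric: the same rows that produce the report also steer the next routing decision.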
---
### 2. Ezra's v2 Harness (Archivist)
**What's good:**
- `must_read_before_write` policy enforcement
- Evidence level tracking
- Source citation
**The gap:**
- **Policies are static.** Ezra doesn't learn which evidence sources are most reliable.
- No tracking of "I read source X, made decision Y, was I right?"
- No adaptive confidence calibration
**Verdict:** Good discipline, no learning. Missing: `outcome feedback → policy refinement`.
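The `outcome feedback → policy refinement` loop Ezra lacks amounts to tracking "I read source X, was I right?" per source. A minimal sketch (class and method names are invented for illustration), using Laplace smoothing so unseen sources start neutral rather than at 0 or 1:

```python
from collections import defaultdict

class SourceCalibrator:
    """Track per-source outcomes and derive a smoothed reliability score."""

    def __init__(self) -> None:
        self._hits = defaultdict(int)   # decisions backed by this source that held up
        self._total = defaultdict(int)  # all decisions backed by this source

    def record(self, source: str, was_correct: bool) -> None:
        self._total[source] += 1
        if was_correct:
            self._hits[source] += 1

    def reliability(self, source: str) -> float:
        # Laplace smoothing: an unseen source scores 0.5, not 0.0
        return (self._hits[source] + 1) / (self._total[source] + 2)
```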
---
### 3. Bezalel's v2 Harness (Artificer)
**What's good:**
- `requires_proof` enforcement
- `test_before_ship` gate
- Proof verification
**The gap:**
- **No failure pattern analysis.** If tests fail 80% of the time on certain tools, Bezalel doesn't adapt.
- No "pre-flight check" based on historical failure modes
- No learning from which proof types catch most bugs
**Verdict:** Good rigor, no adaptation. Missing: `failure pattern → prevention`.
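The missing pre-flight check is a small function over the failure history. A sketch under assumed inputs (a flat list of `(tool, succeeded)` pairs; the real harness would pull this from its own logs):

```python
def preflight(history: list[tuple[str, bool]], tool: str,
              max_failure_rate: float = 0.5, min_samples: int = 4) -> str:
    """Advise before running `tool`, based on (tool, succeeded) history."""
    outcomes = [ok for name, ok in history if name == tool]
    if len(outcomes) < min_samples:
        return "proceed"  # not enough evidence to judge this tool yet
    failure_rate = 1 - sum(outcomes) / len(outcomes)
    return "extra-tests" if failure_rate > max_failure_rate else "proceed"
```

A tool that fails 80% of the time would then trigger `"extra-tests"` before shipping, which is exactly the `failure pattern → prevention` step the verdict calls for.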
---
### 4. Hermes Harness Integration
**What's good:**
- Rich session data available
- Tool call tracking
- Model performance per task
**The gap:**
- **Shortest loop not utilized.** Hermes data exists but doesn't flow into Timmy's decision context.
- No real-time "last 10 similar tasks succeeded with model X"
- No context window optimization based on historical patterns
**Verdict:** Rich data, unused. Missing: `hermes_telemetry → timmy_context → smarter_routing`.
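The "last 10 similar tasks" context is a filter-and-fold over session rows. A sketch assuming the Hermes rows reduce to `(task_type, model, success)` tuples, newest last (the actual session schema is not specified here):

```python
def recent_similar(sessions: list[tuple[str, str, bool]],
                   task_type: str, n: int = 10) -> dict[str, tuple[int, int]]:
    """Summarize the last n similar sessions as model -> (successes, attempts)."""
    similar = [s for s in sessions if s[0] == task_type][-n:]
    summary: dict[str, tuple[int, int]] = {}
    for _, model, success in similar:
        won, total = summary.get(model, (0, 0))
        summary[model] = (won + int(success), total + 1)
    return summary
```

Injecting that dict into Timmy's decision context is the `hermes_telemetry → timmy_context → smarter_routing` chain in miniature.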
---
## The Core Problem
```
Current Flow (Open Loop):

┌─────────┐    ┌──────────┐    ┌─────────┐
│ Execute │───→│ Log Data │───→│ Report  │───→ 🗑️
└─────────┘    └──────────┘    └─────────┘

Needed Flow (Closed Loop):

┌─────────┐    ┌──────────┐    ┌───────────┐
│ Execute │───→│ Log Data │───→│  Analyze  │
└─────────┘    └──────────┘    └─────┬─────┘
     ▲                               │
     └───────────────────────────────┘
        Adapt Policy / Route / Model
```
**The Focus:** Local, sovereign Timmy must become **smarter, faster, and self-improving** by closing this loop.
---
## v3 Solution: The Intelligence Layer
### 1. Feedback Loop Architecture
Every execution feeds into:
- **Pattern DB**: Tool X with params Y → success rate Z%
- **Model Performance**: Task type T → best model M
- **House Calibration**: House H on task T → confidence adjustment
- **Predictive Cache**: Pre-fetch based on execution patterns
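A Pattern-DB entry ("Tool X with params Y → success rate Z%") can be modeled as a small record type. The field names here are illustrative, not the actual v3 schema:

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    """One Pattern-DB row: tool + parameter signature -> observed success rate."""
    tool: str
    param_sig: str      # e.g. a stable hash of the normalized parameters
    successes: int = 0
    attempts: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

    def observe(self, succeeded: bool) -> None:
        """Fold one execution outcome into the pattern (the 'every execution teaches' step)."""
        self.attempts += 1
        if succeeded:
            self.successes += 1
```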
### 2. Adaptive Policies
Policies become functions of historical performance:
```python
# Instead of static:
evidence_threshold = 0.8
# Dynamic based on track record:
evidence_threshold = base_threshold * (1 + success_rate_adjustment)
```
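One way the dynamic formula above might be made safe in practice (the sign convention and clamp band are assumptions of this sketch, not the v3 design): loosen the bar when the house beats a target success rate, tighten it when it falls short, and clamp so the threshold never drifts out of a sane range.

```python
def adaptive_threshold(base: float, success_rate: float,
                       target: float = 0.8, gain: float = 0.5,
                       lo: float = 0.5, hi: float = 0.95) -> float:
    """Scale an evidence threshold by track record, clamped to [lo, hi].

    Above-target performance lowers the bar; below-target raises it.
    """
    adjustment = gain * (target - success_rate)
    return min(hi, max(lo, base * (1 + adjustment)))
```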
### 3. Hermes Telemetry Integration
Real-time ingestion from Hermes session DB:
- Last N similar tasks
- Success rates by model
- Latency patterns
- Token efficiency
### 4. Self-Improvement Metrics
- **Prediction accuracy**: Did predicted success match actual?
- **Policy effectiveness**: Did policy change improve outcomes?
- **Learning velocity**: How fast is Timmy getting better?
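Learning velocity needs a concrete definition to be tracked. One crude candidate, offered only as a sketch: recent success rate minus early success rate over a window of outcomes, so a positive value means Timmy is improving.

```python
def learning_velocity(outcomes: list[bool]) -> float:
    """Recent-half success rate minus early-half success rate.

    Positive -> improving, negative -> regressing, 0.0 -> flat or no data.
    """
    if len(outcomes) < 2:
        return 0.0
    mid = len(outcomes) // 2
    early = sum(outcomes[:mid]) / mid
    recent = sum(outcomes[mid:]) / (len(outcomes) - mid)
    return recent - early
```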
---
## Design Principles for v3
1. **Every execution teaches** — No telemetry without analysis
2. **Local learning only** — Pattern recognition runs locally, no cloud
3. **Shortest feedback loop** — Hermes data → Timmy context in <100ms
4. **Transparent adaptation** — Timmy explains why he changed his policy
5. **Sovereignty-preserving** — Learning improves local decision-making, doesn't outsource it
---
*The goal: Timmy gets measurably better every day he runs.*