# Deep Dive Quality Evaluation Framework
> **Issue**: [#830](http://143.198.27.163:3000/Timmy_Foundation/the-nexus/issues/830) — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
> **Created**: Ezra | 2026-04-05 | Burn mode
> **Purpose**: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.
---
## 1. Why This Exists
An automated daily briefing is only valuable if it remains **relevant**, **grounded in our work**, **concise**, and **actionable**. Without explicit quality control, three failure modes are inevitable:
1. **Relevance decay** — sources drift toward generic AI news
2. **Grounding loss** — fleet context is injected but ignored by the LLM
3. **Length creep** — briefings grow too long or shrink to bullet points

This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.
---
## 2. Quality Rubric
Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.
| Dimension | Weight | Target | Measured By |
|-----------|--------|--------|-------------|
| **Relevance** | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| **Grounding** | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| **Conciseness** | 20% | 80–100 | Word count landing in the 600–1200 band (see 2.3 for the audio-length caveat) |
| **Actionability** | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| **Source Diversity** | 10% | ≥ 60 | Breadth of unique domains represented in the briefing |

### 2.1 Relevance
**Keywords tracked** (representative sample):

- LLM, agent, architecture, Hermes, tool use, MCP
- Reinforcement learning, RLHF, GRPO, transformer
- Local model, llama.cpp, Gemma, inference, alignment
- Fleet, Timmy, Nexus, OpenClaw, sovereign

A briefing that touches at least 30% of these keyword clusters scores near 100; fewer than three keyword hits triggers a warning.
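
As a sketch, the relevance rule above could look like the following. The cluster contents mirror the representative sample lists; the function name and linear scaling are illustrative assumptions, not the actual `quality_eval.py` internals.

```python
# Hypothetical sketch of the relevance rule; the real quality_eval.py
# formula may differ. Clusters mirror the representative sample above.
RELEVANCE_CLUSTERS = [
    {"llm", "agent", "architecture", "hermes", "tool use", "mcp"},
    {"reinforcement learning", "rlhf", "grpo", "transformer"},
    {"local model", "llama.cpp", "gemma", "inference", "alignment"},
    {"fleet", "timmy", "nexus", "openclaw", "sovereign"},
]

def relevance_score(text: str) -> tuple[float, int]:
    """Return (0-100 score, total keyword hits) for a briefing text."""
    lower = text.lower()
    hits = sum(kw in lower for cluster in RELEVANCE_CLUSTERS for kw in cluster)
    clusters_hit = sum(any(kw in lower for kw in c) for c in RELEVANCE_CLUSTERS)
    # Hitting 30%+ of the clusters maps to ~100; scale linearly and cap.
    score = min(100.0, (clusters_hit / len(RELEVANCE_CLUSTERS)) / 0.30 * 100)
    return score, hits
```

Substring matching is crude (it cannot distinguish "agent" from "agents" or "agentic"), but it is cheap and deterministic, which matters for a daily gate.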
### 2.2 Grounding
Grounding requires that the briefing **uses** the fleet context injected in Phase 0, not just receives it.
**Positive markers**:
- Mentions of specific repos, open issues, recent PRs, or commits
- References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
- Connections between external news and our live architecture

**Penalty**: If `fleet_context` is present in the payload but the briefing text contains no grounding markers, the score is halved.
### 2.3 Conciseness
The target is a **10–15 minute audio briefing**.
At a natural speaking pace of ~130 WPM:

- 600 words ≈ 4.6 min (too short)
- 900 words ≈ 6.9 min (good)
- 1200 words ≈ 9.2 min (good)
- 1950 words ≈ 15 min (upper bound)

Note the tension: 130 WPM × 15 min = 1950 words, yet the current evaluator uses 600–1200 words as a proxy for a tighter brief. If Alexander wants a true 10–15 minute briefing, the target band should be **1300–1950 words**. Adjust `TARGET_WORD_COUNT_*` in `quality_eval.py` to match preference.
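
The arithmetic above, as code. The constant names follow the `TARGET_WORD_COUNT_*` hint, and the band assumes the true 10–15 minute target rather than the current 600–1200 proxy.

```python
# Word-count / duration helpers at ~130 WPM. The band constants assume
# the true 10-15 minute target, not the current 600-1200 proxy.
WPM = 130
TARGET_WORD_COUNT_MIN = 1300
TARGET_WORD_COUNT_MAX = 1950

def estimated_minutes(text: str) -> float:
    """Spoken duration of the briefing at a natural pace."""
    return len(text.split()) / WPM

def in_target_band(text: str) -> bool:
    return TARGET_WORD_COUNT_MIN <= len(text.split()) <= TARGET_WORD_COUNT_MAX
```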
### 2.4 Actionability
A briefing must answer the implicit question: *"So what should we do?"*
**Positive markers**:
- "implication", "recommend", "should", "next step", "action"
- "deploy", "integrate", "watch", "risk", "opportunity"
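
A minimal marker count captures this dimension. The ×20 scaling is an assumption; the evaluator's actual weighting may differ.

```python
# Illustrative actionability counter; markers mirror the lists above,
# the x20 scaling (five or more markers saturates) is an assumption.
ACTION_MARKERS = [
    "implication", "recommend", "should", "next step", "action",
    "deploy", "integrate", "watch", "risk", "opportunity",
]

def actionability_score(text: str) -> float:
    lower = text.lower()
    hits = sum(lower.count(m) for m in ACTION_MARKERS)
    return min(100.0, hits * 20.0)
```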
### 2.5 Source Diversity
A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, OpenAI blog, Anthropic research, and newsletter commentary scores highly.
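
One way to measure this with the stdlib, assuming the briefing payload carries its list of source URLs (that field, and the unique-domains-over-total ratio, are assumptions):

```python
# Domain-diversity sketch: unique domains / total sources, as a percentage.
from urllib.parse import urlparse

def diversity_score(source_urls: list[str]) -> float:
    if not source_urls:
        return 0.0
    domains = {urlparse(u).netloc.removeprefix("www.") for u in source_urls}
    return min(100.0, len(domains) / len(source_urls) * 100)
```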
---
## 3. Running the Evaluator
### 3.1 Single Briefing
```bash
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```
### 3.2 With Drift Detection
```bash
python3 quality_eval.py \
  ~/.cache/deepdive/briefing_20260405_124506.json \
  --previous ~/.cache/deepdive/briefing_20260404_124506.json
```
### 3.3 JSON Output (for CI/automation)
```bash
python3 quality_eval.py briefing.json --json > quality_report.json
```
### 3.4 Makefile Integration
Add to `Makefile`:
```makefile
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```
---
## 4. Interpreting Scores
| Overall Score | Verdict | Action |
|---------------|---------|--------|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix pipeline before next run. |

---
## 5. Drift Detection
Drift is measured by **Jaccard similarity** between the vocabulary of consecutive briefings.
| Drift Score | Meaning |
|-------------|---------|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 30–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |

**Note**: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace `detect_drift()` with sentence-transformer cosine similarity.
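
The heuristic in full, as a self-contained sketch. The tokenizing regex is an assumption; `detect_drift()` in `quality_eval.py` may tokenize differently.

```python
# Jaccard similarity over briefing vocabularies, as a percentage.
import re

def jaccard_drift(prev_text: str, curr_text: str) -> float:
    """100 = identical vocabulary, 0 = no shared words."""
    def vocab(t: str) -> set[str]:
        return set(re.findall(r"[a-z0-9']+", t.lower()))
    a, b = vocab(prev_text), vocab(curr_text)
    if not a and not b:
        return 100.0
    return len(a & b) / len(a | b) * 100
```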
---
## 6. A/B Prompt Testing
To compare two synthesis prompts:
1. Run the pipeline with **Prompt A** → save `briefing_A.json`
2. Run the pipeline with **Prompt B** → save `briefing_B.json`
3. Evaluate both:

```bash
python3 quality_eval.py briefing_A.json --json > report_A.json
python3 quality_eval.py briefing_B.json --json > report_B.json
```
4. Compare dimension scores with `diff` or a small script.
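
The small script can be a few lines of Python, assuming each `--json` report exposes a top-level `scores` mapping of dimension to number (the field name is a guess; check it against the real report schema):

```python
# Per-dimension score deltas between two quality reports (B minus A).
# Assumes a top-level "scores" dict in each report; verify against the
# actual quality_eval.py --json output.
import json

def compare_reports(path_a: str, path_b: str) -> dict[str, float]:
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    shared = set(a["scores"]) & set(b["scores"])
    return {dim: b["scores"][dim] - a["scores"][dim] for dim in sorted(shared)}
```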
### 6.1 Prompt Variants to Test
| Variant | Hypothesis |
|---------|------------|
| **V1 (Default)** | Neutral synthesis with grounded context |
| **V2 (Action-forward)** | Explicit "Implications → Recommendations" section structure |
| **V3 (Narrative)** | Story-driven podcast script format with transitions |

Record results in `prompt_experiments/RESULTS.md`.
---
## 7. Recommendations Engine
`quality_eval.py` emits concrete recommendations based on low scores:
- **Relevance < 50** → Expand `RELEVANCE_KEYWORDS` or tighten source aggregation filters
- **Grounding < 50** → Verify `fleet_context` is injected and explicitly referenced in the synthesis prompt
- **Conciseness < 50** → Adjust synthesis prompt word-count guidance or ranking threshold
- **Actionability < 50** → Add explicit instructions to include "Implications" and "Recommended Actions" sections

---
## 8. Integration into Production
### 8.1 Gatekeeper Mode
Run the evaluator after every pipeline generation. If `overall_score < 60`, abort delivery and alert the operator room:
```python
# In pipeline.py delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```

### 8.2 Weekly Quality Audit
Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.
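
A simple trend test for that audit might compare the first and second half of the week's overall scores. The split-half comparison is an assumption; tune the logic to taste.

```python
# Illustrative downward-trend check over a window of overall scores
# (e.g. the past 7 briefings, oldest first).
def trending_down(scores: list[float]) -> bool:
    """True if the mean of the later half is below the earlier half."""
    if len(scores) < 2:
        return False
    mid = len(scores) // 2
    return sum(scores[mid:]) / (len(scores) - mid) < sum(scores[:mid]) / mid
```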
---
## 9. File Reference
| File | Purpose |
|------|---------|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |

---
## 10. Changelog
| Date | Change | Author |
|------|--------|--------|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |