# Deep Dive Quality Evaluation Framework
> **Issue**: [#830](http://143.198.27.163:3000/Timmy_Foundation/the-nexus/issues/830) — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
> **Created**: Ezra | 2026-04-05 | Burn mode
> **Purpose**: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.
---
## 1. Why This Exists
An automated daily briefing is only valuable if it remains **relevant**, **grounded in our work**, **concise**, and **actionable**. Without explicit quality control, three failure modes are inevitable:
1. **Relevance decay** — sources drift toward generic AI news
2. **Grounding loss** — fleet context is injected but ignored by the LLM
3. **Length creep** — briefings grow too long or shrink to bullet points
This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.
---
## 2. Quality Rubric
Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.
| Dimension | Weight | Target | Measured By |
|-----------|--------|--------|-------------|
| **Relevance** | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| **Grounding** | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| **Conciseness** | 20% | 80–100 | Word count landing in 600–1200 words (see §2.3 for the audio-length mapping) |
| **Actionability** | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| **Source Diversity** | 10% | ≥ 60 | Breadth of unique domains represented in briefing |
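The overall score is the weighted sum of the five dimension scores. A minimal sketch of that combination, assuming the dimension keys below match those used in `quality_eval.py` (they are illustrative here):
```python
# Weights mirror the rubric table above; dimension keys are illustrative.
WEIGHTS = {
    "relevance": 0.25,
    "grounding": 0.25,
    "conciseness": 0.20,
    "actionability": 0.20,
    "source_diversity": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100 each) into a single 0-100 overall score."""
    return sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS)
```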
### 2.1 Relevance
**Keywords tracked** (representative sample):
- LLM, agent, architecture, Hermes, tool use, MCP
- Reinforcement learning, RLHF, GRPO, transformer
- Local model, llama.cpp, Gemma, inference, alignment
- Fleet, Timmy, Nexus, OpenClaw, sovereign
A briefing that touches on 30%+ of these keyword clusters scores near 100. Fewer than 3 hits triggers a warning.
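A minimal sketch of the keyword check, assuming a flat keyword list rather than the clustered `RELEVANCE_KEYWORDS` actually maintained in `quality_eval.py`:
```python
# Illustrative subset; the canonical list lives in RELEVANCE_KEYWORDS in quality_eval.py.
RELEVANCE_KEYWORDS = ["llm", "agent", "mcp", "rlhf", "transformer", "llama.cpp",
                      "gemma", "inference", "fleet", "hermes", "sovereign"]

def relevance_score(briefing_text: str) -> tuple[float, list[str]]:
    """Score keyword coverage; roughly 30% coverage saturates at 100."""
    text = briefing_text.lower()
    hits = [kw for kw in RELEVANCE_KEYWORDS if kw in text]
    coverage = len(hits) / len(RELEVANCE_KEYWORDS)
    score = min(100.0, coverage / 0.30 * 100)
    warnings = ["relevance: fewer than 3 keyword hits"] if len(hits) < 3 else []
    return score, warnings
```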
### 2.2 Grounding
Grounding requires that the briefing **uses** the fleet context injected in Phase 0, not just receives it.
**Positive markers**:
- Mentions of specific repos, open issues, recent PRs, or commits
- References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
- Connections between external news and our live architecture
**Penalty**: If `fleet_context` is present in the payload but the briefing text contains no grounding markers, the score is halved.
### 2.3 Conciseness
The target is a **10–15 minute audio briefing**.
At a natural speaking pace of ~130 WPM:
- 600 words ≈ 4.6 min (too short)
- 900 words ≈ 6.9 min (good)
- 1200 words ≈ 9.2 min (good)
- 1950 words ≈ 15 min (upper bound)
Note that 130 WPM × 15 min = 1950 words, so the current evaluator band of 600–1200 words is really a proxy for a tighter brief. If Alexander wants a true 10–15 minute briefing, the target band should be **1300–1950 words**. Adjust `TARGET_WORD_COUNT_*` in `quality_eval.py` to match the preference.
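A minimal sketch of the band check, assuming the constants are named `TARGET_WORD_COUNT_MIN` / `TARGET_WORD_COUNT_MAX` (the exact names and the linear falloff are illustrative):
```python
SPEAKING_WPM = 130
TARGET_WORD_COUNT_MIN = 600    # raise to 1300 for a true 10-15 minute briefing
TARGET_WORD_COUNT_MAX = 1200   # raise to 1950 for a true 10-15 minute briefing

def conciseness_score(briefing_text: str) -> tuple[float, float]:
    """Return (score, estimated minutes of audio) from the raw word count."""
    words = len(briefing_text.split())
    minutes = words / SPEAKING_WPM
    if TARGET_WORD_COUNT_MIN <= words <= TARGET_WORD_COUNT_MAX:
        return 100.0, minutes
    # Lose 1 point per 10 words outside the target band (illustrative falloff).
    distance = min(abs(words - TARGET_WORD_COUNT_MIN), abs(words - TARGET_WORD_COUNT_MAX))
    return max(0.0, 100.0 - distance / 10), minutes
```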
### 2.4 Actionability
A briefing must answer the implicit question: *"So what should we do?"*
**Positive markers**:
- "implication", "recommend", "should", "next step", "action"
- "deploy", "integrate", "watch", "risk", "opportunity"
### 2.5 Source Diversity
A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, OpenAI blog, Anthropic research, and newsletter commentary scores highly.
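A minimal sketch of the domain count, assuming the briefing payload carries a list of source URLs; the saturation point of five distinct domains is illustrative:
```python
from urllib.parse import urlparse

def source_diversity_score(source_urls: list[str]) -> float:
    """More unique domains means a higher score; 5+ distinct domains saturates at 100."""
    domains = {urlparse(url).netloc.removeprefix("www.") for url in source_urls if url}
    return min(100.0, len(domains) * 20)
```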
---
## 3. Running the Evaluator
### 3.1 Single Briefing
```bash
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```
### 3.2 With Drift Detection
```bash
python3 quality_eval.py \
~/.cache/deepdive/briefing_20260405_124506.json \
--previous ~/.cache/deepdive/briefing_20260404_124506.json
```
### 3.3 JSON Output (for CI/automation)
```bash
python3 quality_eval.py briefing.json --json > quality_report.json
```
### 3.4 Makefile Integration
Add to `Makefile`:
```makefile
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```
---
## 4. Interpreting Scores
| Overall Score | Verdict | Action |
|---------------|---------|--------|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix pipeline before next run. |
---
## 5. Drift Detection
Drift is measured by **Jaccard similarity** between the vocabulary of consecutive briefings.
| Drift Score | Meaning |
|-------------|---------|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 30–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |
**Note**: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace `detect_drift()` with sentence-transformer cosine similarity.
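For reference, a minimal sketch of the Jaccard computation (the actual `detect_drift()` signature in `quality_eval.py` may differ):
```python
import re

def detect_drift(current_text: str, previous_text: str) -> float:
    """Jaccard similarity between the word sets of two briefings, as a percentage."""
    tokenize = lambda text: set(re.findall(r"[a-z']+", text.lower()))
    current, previous = tokenize(current_text), tokenize(previous_text)
    if not current or not previous:
        return 0.0
    return 100 * len(current & previous) / len(current | previous)
```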
---
## 6. A/B Prompt Testing
To compare two synthesis prompts:
1. Run the pipeline with **Prompt A** → save `briefing_A.json`
2. Run the pipeline with **Prompt B** → save `briefing_B.json`
3. Evaluate both:
```bash
python3 quality_eval.py briefing_A.json --json > report_A.json
python3 quality_eval.py briefing_B.json --json > report_B.json
```
4. Compare dimension scores with `diff` or a small script.
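A small comparison script, assuming the `--json` report exposes a top-level `scores` mapping of dimension name to value (adjust the keys to the actual schema):
```python
import json
import sys

# Usage: python3 compare_reports.py report_A.json report_B.json
def load_scores(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["scores"]

scores_a, scores_b = load_scores(sys.argv[1]), load_scores(sys.argv[2])
for dim in sorted(set(scores_a) | set(scores_b)):
    a, b = scores_a.get(dim, 0.0), scores_b.get(dim, 0.0)
    print(f"{dim:<18} A={a:5.1f}  B={b:5.1f}  delta={b - a:+5.1f}")
```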
### 6.1 Prompt Variants to Test
| Variant | Hypothesis |
|---------|------------|
| **V1 (Default)** | Neutral synthesis with grounded context |
| **V2 (Action-forward)** | Explicit "Implications → Recommendations" section structure |
| **V3 (Narrative)** | Story-driven podcast script format with transitions |
Record results in `prompt_experiments/RESULTS.md`.
---
## 7. Recommendations Engine
`quality_eval.py` emits concrete recommendations based on low scores:
- **Relevance < 50** → Expand `RELEVANCE_KEYWORDS` or tighten source aggregation filters
- **Grounding < 50** → Verify `fleet_context` is injected and explicitly referenced in the synthesis prompt
- **Conciseness < 50** → Adjust synthesis prompt word-count guidance or ranking threshold
- **Actionability < 50** → Add explicit instructions to include "Implications" and "Recommended Actions" sections
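A minimal sketch of that mapping; the thresholds and wording mirror the list above, and the structure is illustrative rather than the exact `quality_eval.py` implementation:
```python
# Dimension -> recommendation, mirroring the thresholds listed above.
RECOMMENDATIONS = {
    "relevance":     "Expand RELEVANCE_KEYWORDS or tighten source aggregation filters",
    "grounding":     "Verify fleet_context is injected and referenced in the synthesis prompt",
    "conciseness":   "Adjust synthesis prompt word-count guidance or ranking threshold",
    "actionability": "Add explicit 'Implications' and 'Recommended Actions' sections",
}

def recommend(scores: dict[str, float], threshold: float = 50.0) -> list[str]:
    """Emit one recommendation per dimension scoring below the threshold."""
    return [f"{dim} < {threshold:.0f}: {RECOMMENDATIONS[dim]}"
            for dim, score in scores.items()
            if dim in RECOMMENDATIONS and score < threshold]
```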
---
## 8. Integration into Production
### 8.1 Gatekeeper Mode
Run the evaluator after every pipeline generation. If `overall_score < 60`, abort delivery and alert the operator room:
```python
# In pipeline.py delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```
### 8.2 Weekly Quality Audit
Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.
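A minimal sketch of the audit, assuming `quality_eval.py` exposes the `evaluate()` helper shown in §8.1 and that briefing filenames sort chronologically:
```python
import glob
import os

from quality_eval import evaluate  # assumed helper, as used in the gatekeeper example

# Score the 7 most recent briefings and flag a downward trend.
paths = sorted(glob.glob(os.path.expanduser("~/.cache/deepdive/briefing_*.json")))[-7:]
scores = [evaluate(path).overall_score for path in paths]
print("Last 7 overall scores:", [f"{score:.0f}" for score in scores])

trending_down = len(scores) >= 2 and scores[-1] < scores[0] and all(
    later <= earlier for earlier, later in zip(scores, scores[1:]))
if trending_down:
    print("WARNING: quality trending down -- post a SITREP to #830")
```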
---
## 9. File Reference
| File | Purpose |
|------|---------|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |
---
## 10. Changelog
| Date | Change | Author |
|------|--------|--------|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |