# Deep Dive Quality Evaluation Framework

Issue: #830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
Created: Ezra | 2026-04-05 | Burn mode
Purpose: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.
## 1. Why This Exists
An automated daily briefing is only valuable if it remains relevant, grounded in our work, concise, and actionable. Without explicit quality control, three failure modes are inevitable:
- Relevance decay — sources drift toward generic AI news
- Grounding loss — fleet context is injected but ignored by the LLM
- Length creep — briefings grow too long or shrink to bullet points
This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.
## 2. Quality Rubric
Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.
| Dimension | Weight | Target | Measured By |
|---|---|---|---|
| Relevance | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| Grounding | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| Conciseness | 20% | 80–100 | Word count landing in the 600–1200 word band (see the audio-length note in 2.3) |
| Actionability | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| Source Diversity | 10% | ≥ 60 | Breadth of unique domains represented in briefing |
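The overall score is presumably a weighted sum of the five dimensions. A minimal sketch, assuming the dimension names from the table above (the actual field names in quality_eval.py may differ):

```python
# Illustrative weighted-sum scoring; dict keys mirror the rubric table
# but are assumptions — check quality_eval.py for the real field names.
WEIGHTS = {
    "relevance": 0.25,
    "grounding": 0.25,
    "conciseness": 0.20,
    "actionability": 0.20,
    "source_diversity": 0.10,
}

def overall_score(scores: dict) -> float:
    """Combine per-dimension scores (0-100 each) into an overall 0-100."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```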
### 2.1 Relevance
Keywords tracked (representative sample):
- LLM, agent, architecture, Hermes, tool use, MCP
- Reinforcement learning, RLHF, GRPO, transformer
- Local model, llama.cpp, Gemma, inference, alignment
- Fleet, Timmy, Nexus, OpenClaw, sovereign
A briefing that touches on 30%+ of these keyword clusters scores near 100. Fewer than 3 hits triggers a warning.
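A rough sketch of this check, assuming simple substring matching against a keyword list (the real `RELEVANCE_KEYWORDS` and scoring curve live in quality_eval.py and may differ):

```python
# Hypothetical relevance scorer: coverage of tracked keywords, saturating
# at 30% coverage per the rubric. Keyword list is a representative subset.
RELEVANCE_KEYWORDS = [
    "llm", "agent", "architecture", "hermes", "mcp",
    "rlhf", "grpo", "transformer", "llama.cpp", "gemma",
    "inference", "alignment", "fleet", "sovereign",
]

def relevance_score(text: str):
    """Return (score 0-100, warnings); fewer than 3 hits warns."""
    lowered = text.lower()
    hits = [kw for kw in RELEVANCE_KEYWORDS if kw in lowered]
    coverage = len(hits) / len(RELEVANCE_KEYWORDS)
    score = min(100.0, coverage / 0.30 * 100.0)
    warnings = [] if len(hits) >= 3 else ["fewer than 3 relevance keyword hits"]
    return score, warnings
```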
### 2.2 Grounding

Grounding requires that the briefing *uses* the fleet context injected in Phase 0, not just *receives* it.
Positive markers:
- Mentions of specific repos, open issues, recent PRs, or commits
- References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
- Connections between external news and our live architecture
Penalty: If `fleet_context` is present in the payload but the briefing text contains no grounding markers, the score is halved.
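The halving penalty can be sketched as a small post-processing step. The marker list and substring matching here are illustrative assumptions, not the actual quality_eval.py logic:

```python
# Hypothetical grounding-penalty step: crude substring matching against
# a few marker terms; the real implementation may tokenize differently.
GROUNDING_MARKERS = [
    "repo", "issue", "commit",
    "bezalel", "ezra", "allegro", "gemini",
]

def apply_grounding_penalty(score: float, briefing_text: str,
                            has_fleet_context: bool) -> float:
    """Halve the grounding score when fleet context was injected but ignored."""
    lowered = briefing_text.lower()
    ignored = has_fleet_context and not any(m in lowered for m in GROUNDING_MARKERS)
    return score / 2 if ignored else score
```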
### 2.3 Conciseness

The target is a 10–15 minute audio briefing.

At a natural speaking pace of ~130 WPM:

- 600 words ≈ 4.6 min
- 900 words ≈ 6.9 min
- 1200 words ≈ 9.2 min
- 1950 words ≈ 15 min

Note the mismatch: the evaluator's current 600–1200 word band corresponds to roughly 5–9 minutes of audio and serves as a proxy for a tighter brief. For a true 10–15 minute runtime at 130 WPM, the band should be 1300–1950 words. Adjust the `TARGET_WORD_COUNT_*` constants in quality_eval.py to match the preferred length.
### 2.4 Actionability
A briefing must answer the implicit question: "So what should we do?"
Positive markers:
- "implication", "recommend", "should", "next step", "action"
- "deploy", "integrate", "watch", "risk", "opportunity"
### 2.5 Source Diversity
A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, OpenAI blog, Anthropic research, and newsletter commentary scores highly.
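One plausible way to measure this is the ratio of unique source domains to total sources. A sketch, assuming source URLs are available in the briefing payload (the actual metric in quality_eval.py may differ):

```python
# Hypothetical diversity metric: unique domains / total sources, 0-100.
from urllib.parse import urlparse

def diversity_score(source_urls: list) -> float:
    if not source_urls:
        return 0.0
    domains = {urlparse(u).netloc for u in source_urls}
    return 100.0 * len(domains) / len(source_urls)
```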
## 3. Running the Evaluator

### 3.1 Single Briefing

```sh
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```
### 3.2 With Drift Detection

```sh
python3 quality_eval.py \
  ~/.cache/deepdive/briefing_20260405_124506.json \
  --previous ~/.cache/deepdive/briefing_20260404_124506.json
```
### 3.3 JSON Output (for CI/automation)

```sh
python3 quality_eval.py briefing.json --json > quality_report.json
```
### 3.4 Makefile Integration

Add to the Makefile:

```make
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```
## 4. Interpreting Scores
| Overall Score | Verdict | Action |
|---|---|---|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix pipeline before next run. |
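The verdict bands above map directly to a threshold function; a minimal sketch (function name and return strings are illustrative):

```python
# Map an overall score to the verdict bands in the table above.
def verdict(overall: float) -> str:
    if overall >= 85:
        return "Excellent"
    if overall >= 70:
        return "Good"
    if overall >= 50:
        return "Marginal"
    return "Unacceptable"
```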
## 5. Drift Detection
Drift is measured by Jaccard similarity between the vocabulary of consecutive briefings.
| Drift Score | Meaning |
|---|---|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 15–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |
Note: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace detect_drift() with sentence-transformer cosine similarity.
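A minimal sketch of the vocabulary-overlap heuristic, assuming whitespace tokenization (the actual `detect_drift()` may tokenize differently):

```python
# Jaccard similarity between the word sets of two briefings, as a percent.
def jaccard_drift(text_a: str, text_b: str) -> float:
    vocab_a = set(text_a.lower().split())
    vocab_b = set(text_b.lower().split())
    if not vocab_a and not vocab_b:
        return 0.0
    return 100.0 * len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
```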
## 6. A/B Prompt Testing

To compare two synthesis prompts:

- Run the pipeline with Prompt A → save `briefing_A.json`
- Run the pipeline with Prompt B → save `briefing_B.json`
- Evaluate both:

```sh
python3 quality_eval.py briefing_A.json --json > report_A.json
python3 quality_eval.py briefing_B.json --json > report_B.json
```

- Compare dimension scores with `diff` or a small script.
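The "small script" could be as simple as a per-dimension delta over the two parsed `--json` reports. This assumes the report is a flat dict of dimension → score; adapt to the actual quality_eval.py output schema:

```python
# Hypothetical A/B comparison over two parsed --json quality reports.
def score_deltas(report_a: dict, report_b: dict) -> dict:
    """Per-dimension deltas (Prompt B minus Prompt A)."""
    return {dim: report_b[dim] - report_a[dim]
            for dim in report_a if dim in report_b}
```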
### 6.1 Prompt Variants to Test
| Variant | Hypothesis |
|---|---|
| V1 (Default) | Neutral synthesis with grounded context |
| V2 (Action-forward) | Explicit "Implications → Recommendations" section structure |
| V3 (Narrative) | Story-driven podcast script format with transitions |
Record results in prompt_experiments/RESULTS.md.
## 7. Recommendations Engine

`quality_eval.py` emits concrete recommendations based on low scores:

- Relevance < 50 → Expand `RELEVANCE_KEYWORDS` or tighten source aggregation filters
- Grounding < 50 → Verify `fleet_context` is injected and explicitly referenced in the synthesis prompt
- Conciseness < 50 → Adjust synthesis prompt word-count guidance or ranking threshold
- Actionability < 50 → Add explicit instructions to include "Implications" and "Recommended Actions" sections
## 8. Integration into Production

### 8.1 Gatekeeper Mode

Run the evaluator after every pipeline generation. If `overall_score < 60`, abort delivery and alert the operator room:

```python
# In pipeline.py delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```
### 8.2 Weekly Quality Audit
Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.
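The "trending down" check could be a simple recent-vs-earlier mean comparison over the week's overall scores. A sketch; the window sizes and 5-point threshold are assumptions, not settings from quality_eval.py:

```python
# Hypothetical weekly trend check over the past briefings' overall scores.
def trending_down(scores: list, drop_threshold: float = 5.0) -> bool:
    """True if the mean of the last 3 scores fell more than
    drop_threshold below the mean of the preceding scores."""
    if len(scores) < 4:
        return False
    recent = sum(scores[-3:]) / 3
    earlier = sum(scores[:-3]) / len(scores[:-3])
    return earlier - recent > drop_threshold
```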
## 9. File Reference

| File | Purpose |
|---|---|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |
## 10. Changelog
| Date | Change | Author |
|---|---|---|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |