# Deep Dive Quality Evaluation Framework

> **Issue**: [#830](http://143.198.27.163:3000/Timmy_Foundation/the-nexus/issues/830) — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
> **Created**: Ezra | 2026-04-05 | Burn mode
> **Purpose**: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.

---

## 1. Why This Exists

An automated daily briefing is only valuable if it remains **relevant**, **grounded in our work**, **concise**, and **actionable**. Without explicit quality control, three failure modes are inevitable:

1. **Relevance decay** — sources drift toward generic AI news
2. **Grounding loss** — fleet context is injected but ignored by the LLM
3. **Length creep** — briefings grow too long or shrink to bullet points

This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.

---

## 2. Quality Rubric

Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.

| Dimension | Weight | Target | Measured By |
|-----------|--------|--------|-------------|
| **Relevance** | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| **Grounding** | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| **Conciseness** | 20% | 80–100 | Word count landing in 600–1200 words (≈ 10–15 min audio) |
| **Actionability** | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| **Source Diversity** | 10% | ≥ 60 | Breadth of unique domains represented in the briefing |

### 2.1 Relevance

**Keywords tracked** (representative sample):

- LLM, agent, architecture, Hermes, tool use, MCP
- Reinforcement learning, RLHF, GRPO, transformer
- Local model, llama.cpp, Gemma, inference, alignment
- Fleet, Timmy, Nexus, OpenClaw, sovereign

A briefing that touches 30%+ of these keyword clusters scores near 100. Fewer than 3 hits triggers a warning.
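A minimal sketch of how the §2.1 relevance score might be computed. The keyword list follows the representative sample above; the function name, the exact scoring curve (linear up to 30% coverage), and the constant name `RELEVANCE_KEYWORDS` are illustrative assumptions, not the actual `quality_eval.py` implementation:

```python
# Representative keyword subset from §2.1; the real list in
# quality_eval.py may be larger (RELEVANCE_KEYWORDS is an assumed name).
RELEVANCE_KEYWORDS = [
    "llm", "agent", "architecture", "hermes", "tool use", "mcp",
    "reinforcement learning", "rlhf", "grpo", "transformer",
    "local model", "llama.cpp", "gemma", "inference", "alignment",
    "fleet", "timmy", "nexus", "openclaw", "sovereign",
]

def relevance_score(text: str) -> tuple[int, list[str]]:
    """Return a 0-100 relevance score and the keywords that matched."""
    lowered = text.lower()
    hits = [kw for kw in RELEVANCE_KEYWORDS if kw in lowered]
    # ~30% keyword coverage maps to a near-perfect score (see §2.1);
    # the linear ramp below that is an assumption.
    coverage = len(hits) / len(RELEVANCE_KEYWORDS)
    score = min(100, round(coverage / 0.30 * 100))
    if len(hits) < 3:
        print(f"WARNING: only {len(hits)} relevance keyword hits")
    return score, hits
```

Fewer than 3 hits prints the warning the rubric calls for; the caller can escalate it however the pipeline prefers.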
### 2.2 Grounding

Grounding requires that the briefing **uses** the fleet context injected in Phase 0, not just receives it.

**Positive markers**:

- Mentions of specific repos, open issues, recent PRs, or commits
- References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
- Connections between external news and our live architecture

**Penalty**: If `fleet_context` is present in the payload but the briefing text contains no grounding markers, the score is halved.

### 2.3 Conciseness

The target is a **10–15 minute audio briefing**. At a natural speaking pace of ~130 WPM:

- 600 words ≈ 4.6 min (too short)
- 900 words ≈ 6.9 min (good)
- 1200 words ≈ 9.2 min (good)
- 1950 words ≈ 15 min (upper bound)

**Known inconsistency**: at 130 WPM, a true 10–15 minute briefing is 1300–1950 words, yet the current evaluator scores against 600–1200 words as a proxy for a tighter brief. If Alexander wants a genuine 10–15 minute runtime, adjust `TARGET_WORD_COUNT_*` in `quality_eval.py` to the 1300–1950 band.

### 2.4 Actionability

A briefing must answer the implicit question: *"So what should we do?"*

**Positive markers**:

- "implication", "recommend", "should", "next step", "action"
- "deploy", "integrate", "watch", "risk", "opportunity"

### 2.5 Source Diversity

A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, the OpenAI blog, Anthropic research, and newsletter commentary scores highly.

---

## 3. Running the Evaluator

### 3.1 Single Briefing

```bash
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```

### 3.2 With Drift Detection

```bash
python3 quality_eval.py \
  ~/.cache/deepdive/briefing_20260405_124506.json \
  --previous ~/.cache/deepdive/briefing_20260404_124506.json
```

### 3.3 JSON Output (for CI/Automation)

```bash
python3 quality_eval.py briefing.json --json > quality_report.json
```

### 3.4 Makefile Integration

Add to `Makefile`:

```makefile
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```

---

## 4. Interpreting Scores

| Overall Score | Verdict | Action |
|---------------|---------|--------|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix the pipeline before the next run. |

---

## 5. Drift Detection

Drift is measured by **Jaccard similarity** between the vocabularies of consecutive briefings.

| Drift Score | Meaning |
|-------------|---------|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 15–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |

**Note**: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace `detect_drift()` with sentence-transformer cosine similarity.

---

## 6. A/B Prompt Testing

To compare two synthesis prompts:

1. Run the pipeline with **Prompt A** → save `briefing_A.json`
2. Run the pipeline with **Prompt B** → save `briefing_B.json`
3. Evaluate both:

   ```bash
   python3 quality_eval.py briefing_A.json --json > report_A.json
   python3 quality_eval.py briefing_B.json --json > report_B.json
   ```

4. Compare dimension scores with `diff` or a small script.
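The "small script" in step 4 might look like the sketch below. It assumes the `--json` report contains an `"overall_score"` field and a per-dimension `"scores"` mapping; those field names are assumptions, so adjust them to whatever `quality_eval.py` actually emits:

```python
"""Compare two quality reports dimension by dimension (§6, step 4).

Assumed report shape: {"overall_score": float, "scores": {dim: float}}.
"""
import json
import sys

def compare_reports(report_a: dict, report_b: dict) -> dict[str, float]:
    """Per-dimension score delta (B minus A), plus the overall delta."""
    deltas = {
        dim: report_b["scores"][dim] - report_a["scores"][dim]
        for dim in report_a["scores"]
    }
    deltas["overall"] = report_b["overall_score"] - report_a["overall_score"]
    return deltas

if __name__ == "__main__":
    # Usage: python3 compare_reports.py report_A.json report_B.json
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        a, b = json.load(fa), json.load(fb)
    for dim, delta in compare_reports(a, b).items():
        print(f"{dim:>16}: {delta:+.1f}")
```

Positive deltas mean Prompt B beat Prompt A on that dimension; record the table in `prompt_experiments/RESULTS.md` alongside the variant hypotheses.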
### 6.1 Prompt Variants to Test

| Variant | Hypothesis |
|---------|------------|
| **V1 (Default)** | Neutral synthesis with grounded context |
| **V2 (Action-forward)** | Explicit "Implications → Recommendations" section structure |
| **V3 (Narrative)** | Story-driven podcast script format with transitions |

Record results in `prompt_experiments/RESULTS.md`.

---

## 7. Recommendations Engine

`quality_eval.py` emits concrete recommendations based on low scores:

- **Relevance < 50** → Expand `RELEVANCE_KEYWORDS` or tighten source aggregation filters
- **Grounding < 50** → Verify `fleet_context` is injected and explicitly referenced in the synthesis prompt
- **Conciseness < 50** → Adjust the synthesis prompt's word-count guidance or the ranking threshold
- **Actionability < 50** → Add explicit instructions to include "Implications" and "Recommended Actions" sections

---

## 8. Integration into Production

### 8.1 Gatekeeper Mode

Run the evaluator after every pipeline generation. If `overall_score < 60`, abort delivery and alert the operator room:

```python
# In pipeline.py, delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```

### 8.2 Weekly Quality Audit

Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.

---

## 9. File Reference

| File | Purpose |
|------|---------|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |

---

## 10. Changelog

| Date | Change | Author |
|------|--------|--------|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |
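As an appendix, the two heuristics the §8.2 weekly audit leans on (the §5 Jaccard drift measure and a downtrend check) can be sketched as below. The tokenization, the `detect_drift` signature, and the three-consecutive-declines rule are illustrative assumptions, not the actual `quality_eval.py` implementation:

```python
"""Sketch of the weekly-audit heuristics: Jaccard drift (§5)
and a simple score-downtrend check (§8.2). All details assumed."""
import re

def detect_drift(current: str, previous: str) -> float:
    """Jaccard similarity (0-100) between two briefings' vocabularies."""
    def vocab(text: str) -> set[str]:
        # Crude word-level tokenization; assumption for illustration.
        return set(re.findall(r"[a-z0-9']+", text.lower()))
    a, b = vocab(current), vocab(previous)
    return 100.0 * len(a & b) / len(a | b) if (a | b) else 100.0

def trending_down(scores: list[float], window: int = 3) -> bool:
    """True if the last `window` overall scores are strictly decreasing."""
    tail = scores[-window:]
    return len(tail) == window and all(x > y for x, y in zip(tail, tail[1:]))
```

A Sunday cron job could load the past week's quality reports, compute `detect_drift` between consecutive briefings, and post the SITREP to #830 when `trending_down` fires on the overall scores.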