
Deep Dive Quality Evaluation Framework

Issue: #830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
Created: Ezra | 2026-04-05 | Burn mode
Purpose: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.


1. Why This Exists

An automated daily briefing is only valuable if it remains relevant, grounded in our work, concise, and actionable. Without explicit quality control, three failure modes are inevitable:

  1. Relevance decay — sources drift toward generic AI news
  2. Grounding loss — fleet context is injected but ignored by the LLM
  3. Length creep — briefings grow too long or shrink to bullet points

This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.


2. Quality Rubric

Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.

| Dimension | Weight | Target | Measured By |
|---|---|---|---|
| Relevance | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| Grounding | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| Conciseness | 20% | 80–100 | Word count landing in 600–1200 words (see 2.3 for audio-length calibration) |
| Actionability | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| Source Diversity | 10% | ≥ 60 | Breadth of unique domains represented in the briefing |

2.1 Relevance

Keywords tracked (representative sample):

  • LLM, agent, architecture, Hermes, tool use, MCP
  • Reinforcement learning, RLHF, GRPO, transformer
  • Local model, llama.cpp, Gemma, inference, alignment
  • Fleet, Timmy, Nexus, OpenClaw, sovereign

A briefing that touches on 30%+ of these keyword clusters scores near 100. Fewer than 3 hits triggers a warning.
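The keyword-hit heuristic can be sketched as follows. The list and scaling here are illustrative; the canonical set lives in RELEVANCE_KEYWORDS inside quality_eval.py.

```python
# Illustrative relevance scoring: substring hits against a keyword list,
# saturating at 30% coverage per the rubric. Not quality_eval.py's exact code.
RELEVANCE_KEYWORDS = [
    "llm", "agent", "architecture", "hermes", "tool use", "mcp",
    "reinforcement learning", "rlhf", "grpo", "transformer",
    "local model", "llama.cpp", "gemma", "inference", "alignment",
    "fleet", "timmy", "nexus", "openclaw", "sovereign",
]

def relevance_score(text: str) -> tuple[int, list[str]]:
    """Return (score 0-100, warnings). 30%+ keyword coverage scores ~100."""
    lowered = text.lower()
    hits = [kw for kw in RELEVANCE_KEYWORDS if kw in lowered]
    coverage = len(hits) / len(RELEVANCE_KEYWORDS)
    score = min(100, round(coverage / 0.30 * 100))  # saturate at 30% coverage
    warnings = [] if len(hits) >= 3 else ["relevance: fewer than 3 keyword hits"]
    return score, warnings
```

Saturating at 30% coverage matches the rubric's "touches on 30%+ of these keyword clusters scores near 100."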

2.2 Grounding

Grounding requires that the briefing uses the fleet context injected in Phase 0, not just receives it.

Positive markers:

  • Mentions of specific repos, open issues, recent PRs, or commits
  • References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
  • Connections between external news and our live architecture

Penalty: If fleet_context is present in the payload but the briefing text contains no grounding markers, the score is halved.

2.3 Conciseness

The target is a 10–15 minute audio briefing.

At a natural speaking pace of ~130 WPM:

  • 600 words ≈ 4.6 min (too short)
  • 900 words ≈ 6.9 min (good)
  • 1200 words ≈ 9.2 min (good)
  • 1950 words ≈ 15 min (upper bound)

Note the mismatch: 130 WPM × 15 min = 1950 words, so the 600–1200 band above is a proxy for a tighter brief, not a true 10–15 minute episode. If Alexander wants genuine 10–15 min audio, the target band should be 1300–1950 words. Adjust TARGET_WORD_COUNT_* in quality_eval.py to match preference.
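The arithmetic and band logic can be sketched as follows; the linear decay outside the band is illustrative, and TARGET_WORD_COUNT_* handling in quality_eval.py may differ.

```python
WPM = 130  # assumed natural speaking pace (section 2.3)

def audio_minutes(word_count: int) -> float:
    """Estimated spoken duration of a briefing."""
    return word_count / WPM

def conciseness_score(word_count: int, lo: int = 600, hi: int = 1200) -> int:
    """100 inside [lo, hi], decaying linearly outside (illustrative curve).
    Pass lo=1300, hi=1950 for a true 10-15 minute target."""
    if lo <= word_count <= hi:
        return 100
    dist = lo - word_count if word_count < lo else word_count - hi
    return max(0, 100 - round(100 * dist / (hi - lo)))
```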

2.4 Actionability

A briefing must answer the implicit question: "So what should we do?"

Positive markers:

  • "implication", "recommend", "should", "next step", "action"
  • "deploy", "integrate", "watch", "risk", "opportunity"

2.5 Source Diversity

A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, OpenAI blog, Anthropic research, and newsletter commentary scores highly.
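Domain breadth can be measured by counting unique hostnames among the briefing's sources. A minimal sketch, assuming source URLs are available and saturating at five domains (the scale is illustrative):

```python
from urllib.parse import urlparse

def diversity_score(source_urls: list[str]) -> int:
    """Unique-domain count scaled to 0-100 (illustrative: 5+ domains => 100)."""
    domains = {urlparse(u).netloc.removeprefix("www.") for u in source_urls}
    return min(100, len(domains) * 20)
```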


3. Running the Evaluator

3.1 Single Briefing

```shell
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```

3.2 With Drift Detection

```shell
python3 quality_eval.py \
  ~/.cache/deepdive/briefing_20260405_124506.json \
  --previous ~/.cache/deepdive/briefing_20260404_124506.json
```

3.3 JSON Output (for CI/automation)

```shell
python3 quality_eval.py briefing.json --json > quality_report.json
```

3.4 Makefile Integration

Add to Makefile:

```makefile
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```

4. Interpreting Scores

| Overall Score | Verdict | Action |
|---|---|---|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix pipeline before next run. |
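The bands above map directly to a small helper (the name is illustrative, not quality_eval.py's actual API):

```python
def verdict(overall_score: float) -> str:
    """Map a 0-100 overall score to the verdict bands in section 4."""
    if overall_score >= 85:
        return "Excellent"
    if overall_score >= 70:
        return "Good"
    if overall_score >= 50:
        return "Marginal"
    return "Unacceptable"
```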

5. Drift Detection

Drift is measured by Jaccard similarity between the vocabulary of consecutive briefings.

| Drift Score | Meaning |
|---|---|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 30–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |

Note: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace detect_drift() with sentence-transformer cosine similarity.
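The metric itself is the standard Jaccard formula over word sets; a minimal sketch (detect_drift() in quality_eval.py may tokenize differently):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Vocabulary overlap of two briefings: |A ∩ B| / |A ∪ B| over
    case-folded word sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not (a or b):
        return 1.0  # two empty briefings are trivially identical
    return len(a & b) / len(a | b)
```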


6. A/B Prompt Testing

To compare two synthesis prompts:

1. Run the pipeline with Prompt A → save `briefing_A.json`
2. Run the pipeline with Prompt B → save `briefing_B.json`
3. Evaluate both:

   ```shell
   python3 quality_eval.py briefing_A.json --json > report_A.json
   python3 quality_eval.py briefing_B.json --json > report_B.json
   ```

4. Compare dimension scores with `diff` or a small script.
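The comparison step could look like the sketch below. It assumes each JSON report carries a "dimensions" mapping of name → score; check quality_eval.py's actual --json schema before relying on this.

```python
import json

def compare_reports(path_a: str, path_b: str) -> dict[str, float]:
    """Per-dimension score delta (B minus A) between two quality reports.
    The 'dimensions' key is an assumed schema, not a confirmed one."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return {dim: b["dimensions"][dim] - a["dimensions"][dim]
            for dim in a["dimensions"]}
```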

6.1 Prompt Variants to Test

| Variant | Hypothesis |
|---|---|
| V1 (Default) | Neutral synthesis with grounded context |
| V2 (Action-forward) | Explicit "Implications → Recommendations" section structure |
| V3 (Narrative) | Story-driven podcast script format with transitions |

Record results in prompt_experiments/RESULTS.md.


7. Recommendations Engine

quality_eval.py emits concrete recommendations based on low scores:

  • Relevance < 50 → Expand RELEVANCE_KEYWORDS or tighten source aggregation filters
  • Grounding < 50 → Verify fleet_context is injected and explicitly referenced in the synthesis prompt
  • Conciseness < 50 → Adjust synthesis prompt word-count guidance or ranking threshold
  • Actionability < 50 → Add explicit instructions to include "Implications" and "Recommended Actions" sections

8. Integration into Production

8.1 Gatekeeper Mode

Run the evaluator after every pipeline generation. If overall_score < 60, abort delivery and alert the operator room:

```python
# In pipeline.py delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```

8.2 Weekly Quality Audit

Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.


9. File Reference

| File | Purpose |
|---|---|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |

10. Changelog

| Date | Change | Author |
|---|---|---|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |