# Deep Dive Quality Evaluation Framework

Issue: #830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
Created: Ezra | 2026-04-05 | Burn mode
Purpose: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.
## 1. Why This Exists
An automated daily briefing is only valuable if it remains relevant, grounded in our work, concise, and actionable. Without explicit quality control, three failure modes are inevitable:
- Relevance decay — sources drift toward generic AI news
- Grounding loss — fleet context is injected but ignored by the LLM
- Length creep — briefings grow too long or shrink to bullet points
This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.
## 2. Quality Rubric
Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.
| Dimension | Weight | Target | Measured By |
|---|---|---|---|
| Relevance | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| Grounding | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| Conciseness | 20% | 80–100 | Word count landing in the 600–1200 word band (see the audio-length note in 2.3) |
| Actionability | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| Source Diversity | 10% | ≥ 60 | Breadth of unique domains represented in briefing |
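The overall score is presumably a weighted sum of the five dimensions. A minimal sketch, assuming the dimension names from the table above (the actual field names in quality_eval.py may differ):

```python
# Illustrative weighted-sum scoring; dict keys mirror the rubric table
# but are assumptions — check quality_eval.py for the real field names.
WEIGHTS = {
    "relevance": 0.25,
    "grounding": 0.25,
    "conciseness": 0.20,
    "actionability": 0.20,
    "source_diversity": 0.10,
}

def overall_score(scores: dict) -> float:
    """Combine per-dimension scores (0-100 each) into an overall 0-100."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```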
### 2.1 Relevance
Keywords tracked (representative sample):
- LLM, agent, architecture, Hermes, tool use, MCP
- Reinforcement learning, RLHF, GRPO, transformer
- Local model, llama.cpp, Gemma, inference, alignment
- Fleet, Timmy, Nexus, OpenClaw, sovereign
A briefing that touches on 30%+ of these keyword clusters scores near 100. Fewer than 3 hits triggers a warning.
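A rough sketch of this check, assuming simple substring matching against a keyword list (the real `RELEVANCE_KEYWORDS` and scoring curve live in quality_eval.py and may differ):

```python
# Hypothetical relevance scorer: coverage of tracked keywords, saturating
# at 30% coverage per the rubric. Keyword list is a representative subset.
RELEVANCE_KEYWORDS = [
    "llm", "agent", "architecture", "hermes", "mcp",
    "rlhf", "grpo", "transformer", "llama.cpp", "gemma",
    "inference", "alignment", "fleet", "sovereign",
]

def relevance_score(text: str):
    """Return (score 0-100, warnings); fewer than 3 hits warns."""
    lowered = text.lower()
    hits = [kw for kw in RELEVANCE_KEYWORDS if kw in lowered]
    coverage = len(hits) / len(RELEVANCE_KEYWORDS)
    score = min(100.0, coverage / 0.30 * 100.0)
    warnings = [] if len(hits) >= 3 else ["fewer than 3 relevance keyword hits"]
    return score, warnings
```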
### 2.2 Grounding

Grounding requires that the briefing *uses* the fleet context injected in Phase 0, not just *receives* it.
Positive markers:
- Mentions of specific repos, open issues, recent PRs, or commits
- References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
- Connections between external news and our live architecture
Penalty: If `fleet_context` is present in the payload but the briefing text contains no grounding markers, the score is halved.
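The halving penalty can be sketched as a small post-processing step. The marker list and substring matching here are illustrative assumptions, not the actual quality_eval.py logic:

```python
# Hypothetical grounding-penalty step: crude substring matching against
# a few marker terms; the real implementation may tokenize differently.
GROUNDING_MARKERS = [
    "repo", "issue", "commit",
    "bezalel", "ezra", "allegro", "gemini",
]

def apply_grounding_penalty(score: float, briefing_text: str,
                            has_fleet_context: bool) -> float:
    """Halve the grounding score when fleet context was injected but ignored."""
    lowered = briefing_text.lower()
    ignored = has_fleet_context and not any(m in lowered for m in GROUNDING_MARKERS)
    return score / 2 if ignored else score
```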
### 2.3 Conciseness

The target is a 10–15 minute audio briefing.

At a natural speaking pace of ~130 WPM:

- 600 words ≈ 4.6 min
- 900 words ≈ 6.9 min
- 1200 words ≈ 9.2 min
- 1950 words ≈ 15 min

Note the mismatch: the evaluator's current 600–1200 word band corresponds to roughly 5–9 minutes of audio and serves as a proxy for a tighter brief. For a true 10–15 minute runtime at 130 WPM, the band should be 1300–1950 words. Adjust the `TARGET_WORD_COUNT_*` constants in quality_eval.py to match the preferred length.
### 2.4 Actionability
A briefing must answer the implicit question: "So what should we do?"
Positive markers:
- "implication", "recommend", "should", "next step", "action"
- "deploy", "integrate", "watch", "risk", "opportunity"
### 2.5 Source Diversity
A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, OpenAI blog, Anthropic research, and newsletter commentary scores highly.
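One plausible way to measure this is the ratio of unique source domains to total sources. A sketch, assuming source URLs are available in the briefing payload (the actual metric in quality_eval.py may differ):

```python
# Hypothetical diversity metric: unique domains / total sources, 0-100.
from urllib.parse import urlparse

def diversity_score(source_urls: list) -> float:
    if not source_urls:
        return 0.0
    domains = {urlparse(u).netloc for u in source_urls}
    return 100.0 * len(domains) / len(source_urls)
```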
## 3. Running the Evaluator

### 3.1 Single Briefing

```sh
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```
### 3.2 With Drift Detection

```sh
python3 quality_eval.py \
  ~/.cache/deepdive/briefing_20260405_124506.json \
  --previous ~/.cache/deepdive/briefing_20260404_124506.json
```
### 3.3 JSON Output (for CI/automation)

```sh
python3 quality_eval.py briefing.json --json > quality_report.json
```
### 3.4 Makefile Integration

Add to the Makefile:

```make
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```
## 4. Interpreting Scores
| Overall Score | Verdict | Action |
|---|---|---|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix pipeline before next run. |
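The verdict bands above map directly to a threshold function; a minimal sketch (function name and return strings are illustrative):

```python
# Map an overall score to the verdict bands in the table above.
def verdict(overall: float) -> str:
    if overall >= 85:
        return "Excellent"
    if overall >= 70:
        return "Good"
    if overall >= 50:
        return "Marginal"
    return "Unacceptable"
```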
## 5. Drift Detection
Drift is measured by Jaccard similarity between the vocabulary of consecutive briefings.
| Drift Score | Meaning |
|---|---|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 15–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |
Note: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace detect_drift() with sentence-transformer cosine similarity.
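A minimal sketch of the vocabulary-overlap heuristic, assuming whitespace tokenization (the actual `detect_drift()` may tokenize differently):

```python
# Jaccard similarity between the word sets of two briefings, as a percent.
def jaccard_drift(text_a: str, text_b: str) -> float:
    vocab_a = set(text_a.lower().split())
    vocab_b = set(text_b.lower().split())
    if not vocab_a and not vocab_b:
        return 0.0
    return 100.0 * len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
```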
## 6. A/B Prompt Testing

To compare two synthesis prompts:

- Run the pipeline with Prompt A → save `briefing_A.json`
- Run the pipeline with Prompt B → save `briefing_B.json`
- Evaluate both:

```sh
python3 quality_eval.py briefing_A.json --json > report_A.json
python3 quality_eval.py briefing_B.json --json > report_B.json
```

- Compare dimension scores with `diff` or a small script.
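The "small script" could be as simple as a per-dimension delta over the two parsed `--json` reports. This assumes the report is a flat dict of dimension → score; adapt to the actual quality_eval.py output schema:

```python
# Hypothetical A/B comparison over two parsed --json quality reports.
def score_deltas(report_a: dict, report_b: dict) -> dict:
    """Per-dimension deltas (Prompt B minus Prompt A)."""
    return {dim: report_b[dim] - report_a[dim]
            for dim in report_a if dim in report_b}
```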
### 6.1 Prompt Variants to Test
| Variant | Hypothesis |
|---|---|
| V1 (Default) | Neutral synthesis with grounded context |
| V2 (Action-forward) | Explicit "Implications → Recommendations" section structure |
| V3 (Narrative) | Story-driven podcast script format with transitions |
Record results in prompt_experiments/RESULTS.md.
## 7. Recommendations Engine

`quality_eval.py` emits concrete recommendations based on low scores:

- Relevance < 50 → Expand `RELEVANCE_KEYWORDS` or tighten source aggregation filters
- Grounding < 50 → Verify `fleet_context` is injected and explicitly referenced in the synthesis prompt
- Conciseness < 50 → Adjust synthesis prompt word-count guidance or ranking threshold
- Actionability < 50 → Add explicit instructions to include "Implications" and "Recommended Actions" sections
## 8. Integration into Production

### 8.1 Gatekeeper Mode

Run the evaluator after every pipeline generation. If `overall_score < 60`, abort delivery and alert the operator room:

```python
# In pipeline.py delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```
### 8.2 Weekly Quality Audit
Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.
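The "trending down" check could be a simple recent-vs-earlier mean comparison over the week's overall scores. A sketch; the window sizes and 5-point threshold are assumptions, not settings from quality_eval.py:

```python
# Hypothetical weekly trend check over the past briefings' overall scores.
def trending_down(scores: list, drop_threshold: float = 5.0) -> bool:
    """True if the mean of the last 3 scores fell more than
    drop_threshold below the mean of the preceding scores."""
    if len(scores) < 4:
        return False
    recent = sum(scores[-3:]) / 3
    earlier = sum(scores[:-3]) / len(scores[:-3])
    return earlier - recent > drop_threshold
```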
## 9. File Reference

| File | Purpose |
|---|---|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |
## 10. Changelog
| Date | Change | Author |
|---|---|---|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |