# Deep Dive Quality Evaluation Framework

> **Issue**: [#830](http://143.198.27.163:3000/Timmy_Foundation/the-nexus/issues/830) — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
> **Created**: Ezra | 2026-04-05 | Burn mode
> **Purpose**: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.

---

## 1. Why This Exists

An automated daily briefing is only valuable if it remains **relevant**, **grounded in our work**, **concise**, and **actionable**. Without explicit quality control, three failure modes are inevitable:

1. **Relevance decay** — sources drift toward generic AI news
2. **Grounding loss** — fleet context is injected but ignored by the LLM
3. **Length creep** — briefings grow too long or shrink to bullet points

This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.

---

## 2. Quality Rubric

Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.

| Dimension | Weight | Target | Measured By |
|-----------|--------|--------|-------------|
| **Relevance** | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| **Grounding** | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| **Conciseness** | 20% | 80–100 | Word count landing in 600–1200 words (≈ 5–9 min of audio at 130 WPM; see §2.3) |
| **Actionability** | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| **Source Diversity** | 10% | ≥ 60 | Breadth of unique domains represented in briefing |

### 2.1 Relevance

**Keywords tracked** (representative sample):

- LLM, agent, architecture, Hermes, tool use, MCP
- Reinforcement learning, RLHF, GRPO, transformer
- Local model, llama.cpp, Gemma, inference, alignment
- Fleet, Timmy, Nexus, OpenClaw, sovereign

A briefing that touches on 30%+ of these keyword clusters scores near 100. Fewer than 3 hits triggers a warning.

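As a rough sketch of how such keyword scoring can work (the cluster contents, the `score_relevance` name, and the linear scaling are illustrative assumptions; the authoritative list and scoring live in `quality_eval.py`):

```python
# Illustrative keyword clusters; quality_eval.py holds the authoritative list.
RELEVANCE_KEYWORDS = [
    ["llm", "agent", "architecture", "hermes", "tool use", "mcp"],
    ["reinforcement learning", "rlhf", "grpo", "transformer"],
    ["local model", "llama.cpp", "gemma", "inference", "alignment"],
    ["fleet", "timmy", "nexus", "openclaw", "sovereign"],
]

def score_relevance(text: str) -> tuple[int, list[str]]:
    """Score 0-100 from keyword-cluster coverage; warn on fewer than 3 hits."""
    lower = text.lower()
    hits = [kw for cluster in RELEVANCE_KEYWORDS for kw in cluster if kw in lower]
    clusters_hit = sum(
        1 for cluster in RELEVANCE_KEYWORDS if any(kw in lower for kw in cluster)
    )
    # Per the rubric: 30%+ cluster coverage maps to a near-100 score.
    score = min(100, round(clusters_hit / len(RELEVANCE_KEYWORDS) / 0.30 * 100))
    warnings = ["fewer than 3 keyword hits"] if len(hits) < 3 else []
    return score, warnings
```

Substring matching keeps the sketch simple; a production version would want word-boundary matching to avoid false hits inside longer words.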
### 2.2 Grounding

Grounding requires that the briefing **uses** the fleet context injected in Phase 0, not just receives it.

**Positive markers**:

- Mentions of specific repos, open issues, recent PRs, or commits
- References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
- Connections between external news and our live architecture

**Penalty**: If `fleet_context` is present in the payload but the briefing text contains no grounding markers, the grounding score is halved.

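A minimal sketch of this check, assuming a marker list, point scaling, and a `#NNN` issue-reference pattern that are all illustrative (`quality_eval.py` defines the real markers):

```python
import re

# Illustrative grounding markers; the real list lives in quality_eval.py.
GROUNDING_MARKERS = ["repo", "issue", "commit", "bezalel", "ezra", "allegro", "gemini"]

def score_grounding(briefing_text: str, payload: dict) -> int:
    """Score 0-100 from grounding markers; halve it when fleet context
    was injected but the briefing never references it."""
    lower = briefing_text.lower()
    marker_hits = sum(1 for marker in GROUNDING_MARKERS if marker in lower)
    issue_refs = len(re.findall(r"#\d+", briefing_text))  # e.g. "#830"
    score = min(100, marker_hits * 20 + issue_refs * 10)  # assumed scaling
    if payload.get("fleet_context") and marker_hits == 0:
        score //= 2  # the rubric's penalty: context present but unused
    return score
```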
### 2.3 Conciseness

The target is a **10–15 minute audio briefing**.

At a natural speaking pace of ~130 WPM:

- 600 words ≈ 4.6 min (too short)
- 900 words ≈ 6.9 min (good)
- 1200 words ≈ 9.2 min (good)
- 1950 words ≈ 15 min (upper bound)

**Note the mismatch**: 130 WPM × 15 min = 1950 words, yet the current evaluator treats 600–1200 words as ideal; that band is a proxy for a tighter brief. If Alexander wants a true 10–15 minute briefing, the target band should be **1300–1950 words**. Adjust `TARGET_WORD_COUNT_*` in `quality_eval.py` to match preference.

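The arithmetic above can be sketched as follows (the constant names and the linear falloff outside the band are assumptions; `quality_eval.py`'s actual `TARGET_WORD_COUNT_*` constants and scoring curve may differ):

```python
WORDS_PER_MINUTE = 130   # natural speaking pace assumed by the rubric
TARGET_MIN_WORDS = 600   # current evaluator band; widen for a true 10-15 min brief
TARGET_MAX_WORDS = 1200

def estimated_minutes(word_count: int) -> float:
    """Approximate audio duration of a briefing at the assumed pace."""
    return round(word_count / WORDS_PER_MINUTE, 1)

def score_conciseness(word_count: int) -> int:
    """Full marks inside the target band, linear falloff outside (assumed curve)."""
    if TARGET_MIN_WORDS <= word_count <= TARGET_MAX_WORDS:
        return 100
    if word_count < TARGET_MIN_WORDS:
        return round(100 * word_count / TARGET_MIN_WORDS)
    return round(100 * TARGET_MAX_WORDS / word_count)
```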
### 2.4 Actionability

A briefing must answer the implicit question: *"So what should we do?"*

**Positive markers**:

- "implication", "recommend", "should", "next step", "action"
- "deploy", "integrate", "watch", "risk", "opportunity"

### 2.5 Source Diversity

A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, OpenAI blog, Anthropic research, and newsletter commentary scores highly.

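One way to measure that breadth is to count unique source domains (the "each domain is worth 20 points" scaling is an assumption, not necessarily what `quality_eval.py` does):

```python
from urllib.parse import urlparse

def score_source_diversity(source_urls: list[str]) -> int:
    """Assumed heuristic: each unique domain earns 20 points, capped at 100."""
    domains = {
        urlparse(url).netloc.removeprefix("www.") for url in source_urls if url
    }
    return min(100, len(domains) * 20)
```
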
---
## 3. Running the Evaluator

### 3.1 Single Briefing

```bash
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```

### 3.2 With Drift Detection

```bash
python3 quality_eval.py \
  ~/.cache/deepdive/briefing_20260405_124506.json \
  --previous ~/.cache/deepdive/briefing_20260404_124506.json
```

### 3.3 JSON Output (for CI/automation)

```bash
python3 quality_eval.py briefing.json --json > quality_report.json
```

### 3.4 Makefile Integration

Add to `Makefile` (recipe lines must be tab-indented):

```makefile
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```

---

## 4. Interpreting Scores

| Overall Score | Verdict | Action |
|---------------|---------|--------|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix pipeline before next run. |

---

## 5. Drift Detection

Drift is measured by **Jaccard similarity** between the vocabulary of consecutive briefings.

| Drift Score | Meaning |
|-------------|---------|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 30–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |

**Note**: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace `detect_drift()` with sentence-transformer cosine similarity.

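A sketch of the metric itself (whitespace tokenization assumed; the actual `detect_drift()` in `quality_eval.py` may normalize differently):

```python
def jaccard_drift(current_text: str, previous_text: str) -> float:
    """Vocabulary overlap between consecutive briefings, as a percentage.
    Purely lexical; it does not capture semantic similarity."""
    current_vocab = set(current_text.lower().split())
    previous_vocab = set(previous_text.lower().split())
    union = current_vocab | previous_vocab
    if not union:
        return 0.0
    return round(100 * len(current_vocab & previous_vocab) / len(union), 1)
```
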
---
## 6. A/B Prompt Testing

To compare two synthesis prompts:

1. Run the pipeline with **Prompt A** → save `briefing_A.json`
2. Run the pipeline with **Prompt B** → save `briefing_B.json`
3. Evaluate both:

   ```bash
   python3 quality_eval.py briefing_A.json --json > report_A.json
   python3 quality_eval.py briefing_B.json --json > report_B.json
   ```

4. Compare dimension scores with `diff` or a small script.

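Such a comparison script might look like this (the `dimensions` key is an assumed report schema; match it to `quality_eval.py`'s actual JSON output):

```python
import json

def compare_reports(path_a: str, path_b: str) -> dict[str, float]:
    """Print and return per-dimension score deltas between two quality reports."""
    with open(path_a) as fa, open(path_b) as fb:
        report_a, report_b = json.load(fa), json.load(fb)
    deltas = {
        dim: report_b.get("dimensions", {}).get(dim, 0.0) - score
        for dim, score in report_a.get("dimensions", {}).items()
    }
    for dim, delta in deltas.items():
        print(f"{dim:>16}: {delta:+.1f}")
    return deltas
```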
### 6.1 Prompt Variants to Test

| Variant | Hypothesis |
|---------|------------|
| **V1 (Default)** | Neutral synthesis with grounded context |
| **V2 (Action-forward)** | Explicit "Implications → Recommendations" section structure |
| **V3 (Narrative)** | Story-driven podcast script format with transitions |

Record results in `prompt_experiments/RESULTS.md`.

---
## 7. Recommendations Engine

`quality_eval.py` emits concrete recommendations based on low scores:

- **Relevance < 50** → Expand `RELEVANCE_KEYWORDS` or tighten source aggregation filters
- **Grounding < 50** → Verify `fleet_context` is injected and explicitly referenced in the synthesis prompt
- **Conciseness < 50** → Adjust synthesis prompt word-count guidance or ranking threshold
- **Actionability < 50** → Add explicit instructions to include "Implications" and "Recommended Actions" sections

---
## 8. Integration into Production

### 8.1 Gatekeeper Mode

Run the evaluator after every pipeline generation. If `overall_score < 60`, abort delivery and alert the operator room:

```python
# In pipeline.py delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```

### 8.2 Weekly Quality Audit

Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.

---

## 9. File Reference

| File | Purpose |
|------|---------|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |

---

## 10. Changelog

| Date | Change | Author |
|------|--------|--------|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |