# Deep Dive Quality Evaluation Framework
> **Issue**: [#830](http://143.198.27.163:3000/Timmy_Foundation/the-nexus/issues/830) — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
> **Created**: Ezra | 2026-04-05 | Burn mode
> **Purpose**: Ensure every Deep Dive briefing meets a consistent quality bar. Detect drift. Enable A/B prompt optimization.
---
## 1. Why This Exists
An automated daily briefing is only valuable if it remains **relevant**, **grounded in our work**, **concise**, and **actionable**. Without explicit quality control, three failure modes are inevitable:
1. **Relevance decay** — sources drift toward generic AI news
2. **Grounding loss** — fleet context is injected but ignored by the LLM
3. **Length creep** — briefings grow too long or shrink to bullet points

This framework defines the rubric, provides an automated scoring tool, and establishes a process for continuous improvement.
---
## 2. Quality Rubric
Every briefing is scored across five dimensions (0–100 each). Weights are tuned to Alexander's acceptance criteria.
| Dimension | Weight | Target | Measured By |
|-----------|--------|--------|-------------|
| **Relevance** | 25% | ≥ 70 | Presence of AI/ML keywords aligned with Hermes work |
| **Grounding** | 25% | ≥ 70 | References to fleet repos, issues, commits, architecture |
| **Conciseness** | 20% | 80–100 | Word count landing in the 600–1200 band (see 2.3 for the audio-length caveat) |
| **Actionability** | 20% | ≥ 60 | Explicit recommendations, implications, next steps |
| **Source Diversity** | 10% | ≥ 60 | Breadth of unique domains represented in the briefing |

### 2.1 Relevance
**Keywords tracked** (representative sample):

- LLM, agent, architecture, Hermes, tool use, MCP
- Reinforcement learning, RLHF, GRPO, transformer
- Local model, llama.cpp, Gemma, inference, alignment
- Fleet, Timmy, Nexus, OpenClaw, sovereign

A briefing that touches at least 30% of these keyword clusters scores near 100; fewer than three keyword hits triggers a warning.
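
As a sketch, the relevance rule above could look like the following. The cluster contents mirror the representative sample lists; the function name and linear scaling are illustrative assumptions, not the actual `quality_eval.py` internals.

```python
# Hypothetical sketch of the relevance rule; the real quality_eval.py
# formula may differ. Clusters mirror the representative sample above.
RELEVANCE_CLUSTERS = [
    {"llm", "agent", "architecture", "hermes", "tool use", "mcp"},
    {"reinforcement learning", "rlhf", "grpo", "transformer"},
    {"local model", "llama.cpp", "gemma", "inference", "alignment"},
    {"fleet", "timmy", "nexus", "openclaw", "sovereign"},
]

def relevance_score(text: str) -> tuple[float, int]:
    """Return (0-100 score, total keyword hits) for a briefing text."""
    lower = text.lower()
    hits = sum(kw in lower for cluster in RELEVANCE_CLUSTERS for kw in cluster)
    clusters_hit = sum(any(kw in lower for kw in c) for c in RELEVANCE_CLUSTERS)
    # Hitting 30%+ of the clusters maps to ~100; scale linearly and cap.
    score = min(100.0, (clusters_hit / len(RELEVANCE_CLUSTERS)) / 0.30 * 100)
    return score, hits
```

Substring matching is crude (it cannot distinguish "agent" from "agents" or "agentic"), but it is cheap and deterministic, which matters for a daily gate.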
### 2.2 Grounding
Grounding requires that the briefing **uses** the fleet context injected in Phase 0, not just receives it.
**Positive markers**:
- Mentions of specific repos, open issues, recent PRs, or commits
- References to wizard houses (Bezalel, Ezra, Allegro, Gemini)
- Connections between external news and our live architecture

**Penalty**: If `fleet_context` is present in the payload but the briefing text contains no grounding markers, the score is halved.
### 2.3 Conciseness
The target is a **10–15 minute audio briefing**.
At a natural speaking pace of ~130 WPM:

- 600 words ≈ 4.6 min (too short)
- 900 words ≈ 6.9 min (good)
- 1200 words ≈ 9.2 min (good)
- 1950 words ≈ 15 min (upper bound)

Note the tension: 130 WPM × 15 min = 1950 words, yet the current evaluator uses 600–1200 words as a proxy for a tighter brief. If Alexander wants a true 10–15 minute briefing, the target band should be **1300–1950 words**. Adjust `TARGET_WORD_COUNT_*` in `quality_eval.py` to match preference.
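
The arithmetic above, as code. The constant names follow the `TARGET_WORD_COUNT_*` hint, and the band assumes the true 10–15 minute target rather than the current 600–1200 proxy.

```python
# Word-count / duration helpers at ~130 WPM. The band constants assume
# the true 10-15 minute target, not the current 600-1200 proxy.
WPM = 130
TARGET_WORD_COUNT_MIN = 1300
TARGET_WORD_COUNT_MAX = 1950

def estimated_minutes(text: str) -> float:
    """Spoken duration of the briefing at a natural pace."""
    return len(text.split()) / WPM

def in_target_band(text: str) -> bool:
    return TARGET_WORD_COUNT_MIN <= len(text.split()) <= TARGET_WORD_COUNT_MAX
```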
### 2.4 Actionability
A briefing must answer the implicit question: *"So what should we do?"*
**Positive markers**:
- "implication", "recommend", "should", "next step", "action"
- "deploy", "integrate", "watch", "risk", "opportunity"
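
A minimal marker count captures this dimension. The ×20 scaling is an assumption; the evaluator's actual weighting may differ.

```python
# Illustrative actionability counter; markers mirror the lists above,
# the x20 scaling (five or more markers saturates) is an assumption.
ACTION_MARKERS = [
    "implication", "recommend", "should", "next step", "action",
    "deploy", "integrate", "watch", "risk", "opportunity",
]

def actionability_score(text: str) -> float:
    lower = text.lower()
    hits = sum(lower.count(m) for m in ACTION_MARKERS)
    return min(100.0, hits * 20.0)
```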
### 2.5 Source Diversity
A briefing built from 8 arXiv papers alone scores poorly here. A mix of arXiv, OpenAI blog, Anthropic research, and newsletter commentary scores highly.
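
One way to measure this with the stdlib, assuming the briefing payload carries its list of source URLs (that field, and the unique-domains-over-total ratio, are assumptions):

```python
# Domain-diversity sketch: unique domains / total sources, as a percentage.
from urllib.parse import urlparse

def diversity_score(source_urls: list[str]) -> float:
    if not source_urls:
        return 0.0
    domains = {urlparse(u).netloc.removeprefix("www.") for u in source_urls}
    return min(100.0, len(domains) / len(source_urls) * 100)
```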
---
## 3. Running the Evaluator
### 3.1 Single Briefing
```bash
cd intelligence/deepdive
python3 quality_eval.py ~/.cache/deepdive/briefing_20260405_124506.json
```
### 3.2 With Drift Detection
```bash
python3 quality_eval.py \
  ~/.cache/deepdive/briefing_20260405_124506.json \
  --previous ~/.cache/deepdive/briefing_20260404_124506.json
```
### 3.3 JSON Output (for CI/automation)
```bash
python3 quality_eval.py briefing.json --json > quality_report.json
```
### 3.4 Makefile Integration
Add to `Makefile`:
```makefile
evaluate-latest:
	@latest=$$(ls -t ~/.cache/deepdive/briefing_*.json | head -1); \
	python3 quality_eval.py "$${latest}"
```
---
## 4. Interpreting Scores
| Overall Score | Verdict | Action |
|---------------|---------|--------|
| 85–100 | Excellent | Ship it |
| 70–84 | Good | Minor prompt tuning optional |
| 50–69 | Marginal | Review warnings and apply recommendations |
| < 50 | Unacceptable | Do not deliver. Fix pipeline before next run. |

---
## 5. Drift Detection
Drift is measured by **Jaccard similarity** between the vocabulary of consecutive briefings.
| Drift Score | Meaning |
|-------------|---------|
| > 85% | High overlap — briefings may be repetitive or sources are stale |
| 30–85% | Healthy variation |
| < 15% | High drift — briefings share almost no vocabulary; possible source aggregation failure or prompt instability |

**Note**: Jaccard is a simple heuristic. It does not capture semantic similarity. For a more advanced metric, replace `detect_drift()` with sentence-transformer cosine similarity.
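
The heuristic in full, as a self-contained sketch. The tokenizing regex is an assumption; `detect_drift()` in `quality_eval.py` may tokenize differently.

```python
# Jaccard similarity over briefing vocabularies, as a percentage.
import re

def jaccard_drift(prev_text: str, curr_text: str) -> float:
    """100 = identical vocabulary, 0 = no shared words."""
    def vocab(t: str) -> set[str]:
        return set(re.findall(r"[a-z0-9']+", t.lower()))
    a, b = vocab(prev_text), vocab(curr_text)
    if not a and not b:
        return 100.0
    return len(a & b) / len(a | b) * 100
```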
---
## 6. A/B Prompt Testing
To compare two synthesis prompts:
1. Run the pipeline with **Prompt A** → save `briefing_A.json`
2. Run the pipeline with **Prompt B** → save `briefing_B.json`
3. Evaluate both:

```bash
python3 quality_eval.py briefing_A.json --json > report_A.json
python3 quality_eval.py briefing_B.json --json > report_B.json
```
4. Compare dimension scores with `diff` or a small script.
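
The small script can be a few lines of Python, assuming each `--json` report exposes a top-level `scores` mapping of dimension to number (the field name is a guess; check it against the real report schema):

```python
# Per-dimension score deltas between two quality reports (B minus A).
# Assumes a top-level "scores" dict in each report; verify against the
# actual quality_eval.py --json output.
import json

def compare_reports(path_a: str, path_b: str) -> dict[str, float]:
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    shared = set(a["scores"]) & set(b["scores"])
    return {dim: b["scores"][dim] - a["scores"][dim] for dim in sorted(shared)}
```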
### 6.1 Prompt Variants to Test
| Variant | Hypothesis |
|---------|------------|
| **V1 (Default)** | Neutral synthesis with grounded context |
| **V2 (Action-forward)** | Explicit "Implications → Recommendations" section structure |
| **V3 (Narrative)** | Story-driven podcast script format with transitions |

Record results in `prompt_experiments/RESULTS.md`.
---
## 7. Recommendations Engine
`quality_eval.py` emits concrete recommendations based on low scores:
- **Relevance < 50** → Expand `RELEVANCE_KEYWORDS` or tighten source aggregation filters
- **Grounding < 50** → Verify `fleet_context` is injected and explicitly referenced in the synthesis prompt
- **Conciseness < 50** → Adjust synthesis prompt word-count guidance or ranking threshold
- **Actionability < 50** → Add explicit instructions to include "Implications" and "Recommended Actions" sections

---
## 8. Integration into Production
### 8.1 Gatekeeper Mode
Run the evaluator after every pipeline generation. If `overall_score < 60`, abort delivery and alert the operator room:
```python
# In pipeline.py delivery phase
report = evaluate(briefing_path)
if report.overall_score < 60:
    logger.error("Briefing quality below threshold. Halting delivery.")
    send_alert(f"Deep Dive quality failed: {report.overall_score}/100")
    return
```

### 8.2 Weekly Quality Audit
Every Sunday, run drift detection on the past 7 briefings and post a SITREP to #830 if scores are trending down.
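
A simple trend test for that audit might compare the first and second half of the week's overall scores. The split-half comparison is an assumption; tune the logic to taste.

```python
# Illustrative downward-trend check over a window of overall scores
# (e.g. the past 7 briefings, oldest first).
def trending_down(scores: list[float]) -> bool:
    """True if the mean of the later half is below the earlier half."""
    if len(scores) < 2:
        return False
    mid = len(scores) // 2
    return sum(scores[mid:]) / (len(scores) - mid) < sum(scores[:mid]) / mid
```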
---
## 9. File Reference
| File | Purpose |
|------|---------|
| `quality_eval.py` | Executable evaluator |
| `QUALITY_FRAMEWORK.md` | This document — rubric and process |

---
## 10. Changelog
| Date | Change | Author |
|------|--------|--------|
| 2026-04-05 | Quality framework v1.0 — rubric, evaluator, drift detection | Ezra |