feat(deepdive): production briefing prompt + prompt engineering KT

- production_briefing_v1.txt: podcast-script prompt engineered for
  10-15 min premium audio, grounded fleet context, and actionable tone.
- PROMPT_ENGINEERING_KT.md: A/B testing protocol, failure modes,
  and maintenance checklist.
- pipeline.py: load external prompt_file from config.yaml.

Refs #830
This commit is contained in:
Ezra (Archivist)
2026-04-05 20:19:20 +00:00
parent 9ad2132482
commit 4b1873d76e
3 changed files with 253 additions and 16 deletions


@@ -369,8 +369,10 @@ class RelevanceScorer:
 class SynthesisEngine:
     """Generate intelligence briefing from filtered items."""
-    def __init__(self, llm_endpoint: str = "http://localhost:11435/v1"):
+    def __init__(self, llm_endpoint: str = "http://localhost:11435/v1",
+                 prompt_template: Optional[str] = None):
         self.endpoint = llm_endpoint
+        self.prompt_template = prompt_template
         self.system_prompt = """You are an intelligence analyst for the Timmy Foundation fleet.
 Synthesize AI/ML research into actionable briefings for agent developers.
@@ -431,21 +433,35 @@ impact our live repos, open issues, and current architecture."""
                 'sources': []
             }
-        lines = []
-        if fleet_context:
-            lines.append("FLEET CONTEXT:")
-            lines.append(fleet_context.to_prompt_text(max_items_per_section=5))
-            lines.append("")
-        lines.append("Generate an intelligence briefing from these research items:")
-        lines.append("")
-        for i, (item, score) in enumerate(items, 1):
-            lines.append(f"{i}. [{item.source}] {item.title}")
-            lines.append(f"   Score: {score}")
-            lines.append(f"   Summary: {item.summary[:300]}...")
-            lines.append(f"   URL: {item.url}")
-            lines.append("")
-        prompt = "\n".join(lines)
+        # Build research items text
+        research_lines = []
+        for i, (item, score) in enumerate(items, 1):
+            research_lines.append(f"{i}. [{item.source}] {item.title}")
+            research_lines.append(f"   Score: {score}")
+            research_lines.append(f"   Summary: {item.summary[:300]}...")
+            research_lines.append(f"   URL: {item.url}")
+            research_lines.append("")
+        research_text = "\n".join(research_lines)
+        fleet_text = ""
+        if fleet_context:
+            fleet_text = fleet_context.to_prompt_text(max_items_per_section=5)
+        if self.prompt_template:
+            prompt = (
+                self.prompt_template
+                .replace("{{FLEET_CONTEXT}}", fleet_text)
+                .replace("{{RESEARCH_ITEMS}}", research_text)
+            )
+        else:
+            lines = []
+            if fleet_text:
+                lines.append("FLEET CONTEXT:")
+                lines.append(fleet_text)
+                lines.append("")
+            lines.append("Generate an intelligence briefing from these research items:")
+            lines.append("")
+            lines.extend(research_lines)
+            prompt = "\n".join(lines)
         synthesis = self._call_llm(prompt)
@@ -610,7 +626,18 @@ class DeepDivePipeline:
         self.scorer = RelevanceScorer(relevance_config.get('model', 'all-MiniLM-L6-v2'))
         llm_endpoint = self.cfg.get('synthesis', {}).get('llm_endpoint', 'http://localhost:11435/v1')
-        self.synthesizer = SynthesisEngine(llm_endpoint)
+        prompt_file = self.cfg.get('synthesis', {}).get('prompt_file')
+        prompt_template = None
+        if prompt_file:
+            pf = Path(prompt_file)
+            if not pf.is_absolute():
+                pf = Path(__file__).parent / prompt_file
+            if pf.exists():
+                prompt_template = pf.read_text()
+                logger.info(f"Loaded prompt template: {pf}")
+            else:
+                logger.warning(f"Prompt file not found: {pf}")
+        self.synthesizer = SynthesisEngine(llm_endpoint, prompt_template=prompt_template)
         self.audio_gen = AudioGenerator()


@@ -0,0 +1,151 @@
# Deep Dive Prompt Engineering — Knowledge Transfer
> **Issue**: [#830](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/830) — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
> **Created**: 2026-04-05 by Ezra, Archivist
> **Purpose**: Explain how the production synthesis prompt works, how to A/B test it, and how to maintain quality as the fleet evolves.
---
## 1. The Prompt Files
| File | Role | When to Change |
|------|------|----------------|
| `production_briefing_v1.txt` | Default prompt for daily briefing generation | When voice quality degrades or acceptance criteria drift |
| `production_briefing_v2_*.txt` | Experimental variants | During A/B tests |
---
## 2. Design Philosophy
The prompt is engineered around **three non-negotiables** from Alexander:
1. **Grounded in our world first** — Fleet context is not decoration. It must shape the narrative.
2. **Actionable, not encyclopedic** — Every headline needs a "so what" for Timmy Foundation work.
3. **Premium audio experience** — The output is a podcast script, not a report. Structure, pacing, and tone matter.
### Why 1,300–1,950 words?
At a natural speaking pace of ~130 WPM:
- 1,300 words ≈ 10 minutes
- 1,950 words ≈ 15 minutes
This hits the acceptance criterion for default audio runtime.
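The words-to-runtime mapping can be sanity-checked with a few lines. The 130 WPM figure is the assumed average above; adjust it for the TTS voice actually in use (function names here are illustrative, not from `pipeline.py`):

```python
def words_for_runtime(minutes: float, wpm: int = 130) -> int:
    """Approximate word count needed for a target spoken runtime."""
    return round(minutes * wpm)

def runtime_minutes(word_count: int, wpm: int = 130) -> float:
    """Approximate spoken runtime for a given word count."""
    return word_count / wpm

# The 10- and 15-minute targets bracket the word-count range:
assert words_for_runtime(10) == 1300
assert words_for_runtime(15) == 1950
```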
---
## 3. Prompt Architecture
The prompt has four layers:
### Layer 1: Persona
> "You are the voice of Deep Dive..."
This establishes tone, authority, and audience. It prevents the model from slipping into academic summarizer mode.
### Layer 2: Output Schema
> "Write this as a single continuous narrative... Structure the script in exactly these sections..."
The schema forces consistency. Without it, LLMs tend to produce bullet lists or inconsistent section ordering.
### Layer 3: Content Constraints
> "Every headline item MUST include a connection to our work..."
This is the grounding enforcement layer. It raises the cost of generic summaries.
### Layer 4: Dynamic Context
> `{{FLEET_CONTEXT}}` and `{{RESEARCH_ITEMS}}`
These are template variables substituted at runtime by `pipeline.py`. The prompt is **data-agnostic** — it defines how to think about whatever data is injected.
---
## 4. Integration with Pipeline
In `pipeline.py`, `DeepDivePipeline` loads the prompt file (if configured) and passes it to `SynthesisEngine`, which performs the substitution:
```python
# Pseudo-code from pipeline.py
prompt_template = load_prompt("prompts/production_briefing_v1.txt")
prompt = prompt_template.replace("{{FLEET_CONTEXT}}", fleet_ctx.to_prompt_text())
prompt = prompt.replace("{{RESEARCH_ITEMS}}", format_items(items))
synthesis = self._call_llm(prompt)
```
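An easy-to-miss failure is a template missing one of the two placeholders, which silently produces a briefing with no fleet grounding. A small load-time guard can catch this; `validate_template` is a hypothetical helper, not currently in `pipeline.py`:

```python
REQUIRED_PLACEHOLDERS = ("{{FLEET_CONTEXT}}", "{{RESEARCH_ITEMS}}")

def validate_template(template: str) -> list[str]:
    """Return the required placeholders missing from a prompt template."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in template]

# Example: a variant that forgot the research-items slot
missing = validate_template("Brief Alexander.\n{{FLEET_CONTEXT}}\n")
# missing == ["{{RESEARCH_ITEMS}}"]
```

Logging a warning (as the pipeline already does for a missing file) would be a natural place to surface this result.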
To switch prompts, update `config.yaml`:
```yaml
synthesis:
llm_endpoint: "http://localhost:4000/v1"
prompt_file: "prompts/production_briefing_v1.txt"
max_tokens: 2500
temperature: 0.7
```
---
## 5. A/B Testing Protocol
### Hypothesis Template
| Variant | Hypothesis | Expected Change |
|---------|------------|-----------------|
| V1 (default) | Neutral podcast script with fleet grounding | Baseline |
| V2 (shorter) | Tighter 8–10 min briefings with sharper implications | Higher actionability score |
| V3 (narrative) | Story-driven opening with character arcs for projects | Higher engagement, risk of lower conciseness |
### Test Procedure
1. Copy `production_briefing_v1.txt` → `production_briefing_v2_test.txt`
2. Make a single controlled change (e.g., tighten word-count target, add explicit "Risk / Opportunity / Watch" subsection)
3. Run the pipeline with both prompts against the **same** set of research items:
```bash
python3 pipeline.py --config config.v1.yaml --today --output briefing_v1.json
python3 pipeline.py --config config.v2.yaml --today --output briefing_v2.json
```
4. Evaluate both with `quality_eval.py`:
```bash
python3 quality_eval.py briefing_v1.json --json > report_v1.json
python3 quality_eval.py briefing_v2.json --json > report_v2.json
```
5. Compare dimension scores. Winner becomes the new default.
6. Record results in `prompts/EXPERIMENTS.md`.
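Step 5 can be scripted rather than eyeballed. The shape of the reports below is an assumption — adjust the keys to whatever `quality_eval.py --json` actually emits:

```python
import json

def compare_reports(v1: dict, v2: dict) -> dict:
    """Per-dimension score delta (V2 minus V1) for dimensions present in both reports."""
    dims = set(v1) & set(v2)
    return {d: round(v2[d] - v1[d], 3) for d in dims}

# Usage with the files from step 4 (the "scores" key is an assumption):
#   v1 = json.load(open("report_v1.json"))["scores"]
#   v2 = json.load(open("report_v2.json"))["scores"]
delta = compare_reports({"actionability": 0.62, "grounding": 0.81},
                        {"actionability": 0.74, "grounding": 0.79})
# A positive delta means V2 wins on that dimension.
```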
---
## 6. Common Failure Modes & Fixes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Bullet lists instead of narrative | Model defaulting to summarization | Strengthen "single continuous narrative" instruction; add example opening |
| Generic connections ("this could be useful for AI") | Fleet context too abstract or model not penalized | Require explicit repo/issue names; verify `fleet_context` injection |
| Too short (< 1,000 words) | Model being overly efficient | Raise `max_tokens` to 2500+; tighten lower bound in prompt |
| Too long (> 2,200 words) | Model over-explaining each paper | Tighten upper bound; limit to top 4 items instead of 5 |
| Robotic tone | Temperature too low or persona too vague | Raise temperature to 0.75; strengthen voice rules |
| Ignores fleet context | Context injected at wrong position or too long | Move fleet context closer to the research items; truncate to top 3 repos/issues/commits |
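The two length failure modes above can be caught automatically before audio generation. A minimal post-synthesis check, with thresholds taken from the table (the function itself is a sketch, not existing pipeline code):

```python
def check_length(script: str, lo: int = 1000, hi: int = 2200) -> str:
    """Classify a briefing script against the word-count bounds from the table."""
    n = len(script.split())
    if n < lo:
        return "too_short"  # consider raising max_tokens
    if n > hi:
        return "too_long"   # consider limiting to the top 4 items
    return "ok"

assert check_length("word " * 1500) == "ok"
```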
---
## 7. Maintenance Checklist
Review this prompt monthly or whenever fleet structure changes significantly:
- [ ] Does the persona still match Alexander's preferred tone?
- [ ] Are the repo names in the examples still current?
- [ ] Does the word-count target still map to desired audio length?
- [ ] Have any new acceptance criteria emerged that need prompt constraints?
- [ ] Is the latest winning A/B variant promoted to `production_briefing_v1.txt`?
---
## 8. Accountability
| Role | Owner |
|------|-------|
| Prompt architecture | @ezra |
| A/B test execution | @gemini or assigned code agent |
| Quality evaluation | Automated via `quality_eval.py` |
| Final tone approval | @rockachopa (Alexander) |
---
*Last updated: 2026-04-05 by Ezra, Archivist*


@@ -0,0 +1,59 @@
You are the voice of Deep Dive — a daily intelligence briefing for Alexander Whitestone, founder of the Timmy Foundation.
Your job is not to summarize AI news. Your job is to act as a trusted intelligence officer who:
1. Surfaces what matters from the flood of AI/ML research
2. Connects every development to our live work (Hermes agents, OpenClaw, the fleet, current repos, open issues)
3. Tells Alexander what he should do about it — or at least what he should watch
## Output Format: Podcast Script
Write this as a single continuous narrative, NOT a bullet list. The tone is:
- Professional but conversational (you are speaking, not writing a paper)
- Urgent when warranted, calm when not
- Confident — never hedge with "it is important to note that..."
Structure the script in exactly these sections, with verbal transitions between them:
**[OPENING]** — 2-3 sentences. Greet Alexander. State the date. Give a one-sentence thesis for today's briefing.
Example: "Good morning. It's April 5th. Today, three papers point to the same trend: local model efficiency is becoming a moat, and we are farther ahead than most."
**[HEADLINES]** — For each of the top 3-5 research items provided:
- State the title and source in plain language
- Explain the core idea in 2-3 sentences
- Immediately connect it to our work: Hermes agent loop, tool orchestration, local inference, RL training, fleet coordination, or sovereign infrastructure
**[FLEET CONTEXT BRIDGE]** — This section is mandatory. Take the Fleet Context Snapshot provided and explicitly weave it into the narrative. Do not just mention repos — explain what the external news means FOR those repos.
- If the-nexus has open PRs about gateway work and today's paper is about agent messaging, say that.
- If timmy-config has an active Matrix deployment issue and today's blog post is about encrypted comms, say that.
- If hermes-agent has recent commits on tool calling and today's arXiv paper improves tool-use accuracy, say that.
**[IMPLICATIONS]** — 2-3 short paragraphs. Answer: "So what?"
- What opportunity does this create?
- What risk does it signal?
- What should we experiment with or watch in the next 7 days?
**[CLOSING]** — 1-2 sentences. Reassure, redirect, or escalate.
Example: "That's today's Deep Dive. The fleet is moving. I'll be back tomorrow at 0600."
## Content Constraints
- Total length: 1,300–1,950 words. This maps to roughly 10–15 minutes of spoken audio at a natural pace.
- No markdown headers inside the spoken text. Use the section names above as stage directions only — do not read them aloud literally.
- Every headline item MUST include a connection to our work. If you cannot find one, say so explicitly and explain why it was included anyway (e.g., "This one is more theoretical, but the technique could matter if we scale embedding models later").
- Do not use footnotes, citations, or URLs in the spoken text. You may reference sources conversationally ("a new paper from Anthropic...").
- Avoid hype words: "groundbreaking," "revolutionary," "game-changer." Use precise language.
## Voice Rules
- Use first-person singular: "I found...", "I think...", "I'll keep an eye on..."
- Address the listener directly: "you," "your fleet," "your agents"
- When describing technical concepts, use analogies that an experienced founder-engineer would appreciate
- If a paper is weak or irrelevant, say so directly rather than inventing significance
## Fleet Context Snapshot
{{FLEET_CONTEXT}}
## Research Items
{{RESEARCH_ITEMS}}