# Deep Dive Prompt Engineering — Knowledge Transfer
Issue: #830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
Created: 2026-04-05 by Ezra, Archivist
Purpose: Explain how the production synthesis prompt works, how to A/B test it, and how to maintain quality as the fleet evolves.
## 1. The Prompt Files
| File | Role | When to Change |
|---|---|---|
| `production_briefing_v1.txt` | Default prompt for daily briefing generation | When voice quality degrades or acceptance criteria drift |
| `production_briefing_v2_*.txt` | Experimental variants | During A/B tests |
## 2. Design Philosophy
The prompt is engineered around three non-negotiables from Alexander:
- Grounded in our world first — Fleet context is not decoration. It must shape the narrative.
- Actionable, not encyclopedic — Every headline needs a "so what" for Timmy Foundation work.
- Premium audio experience — The output is a podcast script, not a report. Structure, pacing, and tone matter.
### Why 1,300–1,950 words?
At a natural speaking pace of ~130 WPM:
- 1,300 words ≈ 10 minutes
- 1,950 words ≈ 15 minutes
This hits the acceptance criterion for default audio runtime.
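The arithmetic above can be sanity-checked in a couple of lines. The 130 WPM pace is the figure stated above; the helper name is illustrative, not part of the pipeline:

```python
def words_for_runtime(minutes: float, wpm: int = 130) -> int:
    """Words needed to fill `minutes` of audio at a given speaking pace."""
    return round(minutes * wpm)

# 10 and 15 minutes at ~130 WPM bracket the target range.
print(words_for_runtime(10), words_for_runtime(15))  # 1300 1950
```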
## 3. Prompt Architecture
The prompt has four layers:
### Layer 1: Persona
"You are the voice of Deep Dive..."
This establishes tone, authority, and audience. It prevents the model from slipping into academic summarizer mode.
### Layer 2: Output Schema
"Write this as a single continuous narrative... Structure the script in exactly these sections..."
The schema forces consistency. Without it, LLMs tend to produce bullet lists or inconsistent section ordering.
### Layer 3: Content Constraints
"Every headline item MUST include a connection to our work..."
This is the grounding enforcement layer. It raises the cost of generic summaries.
### Layer 4: Dynamic Context
`{{FLEET_CONTEXT}}` and `{{RESEARCH_ITEMS}}`

These are template variables substituted at runtime by `pipeline.py`. The prompt is data-agnostic — it defines how to think about whatever data is injected.
## 4. Integration with Pipeline
In `pipeline.py`, the `SynthesisEngine` loads the prompt file (if configured) and performs substitution:

```python
# Pseudo-code from pipeline.py
prompt_template = load_prompt("prompts/production_briefing_v1.txt")
prompt = prompt_template.replace("{{FLEET_CONTEXT}}", fleet_ctx.to_prompt_text())
prompt = prompt.replace("{{RESEARCH_ITEMS}}", format_items(items))
synthesis = self._call_llm(prompt)
```
To switch prompts, update `config.yaml`:

```yaml
synthesis:
  llm_endpoint: "http://localhost:4000/v1"
  prompt_file: "prompts/production_briefing_v1.txt"
  max_tokens: 2500
  temperature: 0.7
```
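Putting the config and template together, the loading path can be sketched roughly as follows. The `config` dict and `build_prompt` helper are illustrative stand-ins; the real `SynthesisEngine` in `pipeline.py` may wire this differently:

```python
from pathlib import Path

# Parsed form of the config.yaml example above (yaml.safe_load on that file
# would yield this dict); the key names follow the example config.
config = {
    "synthesis": {
        "llm_endpoint": "http://localhost:4000/v1",
        "prompt_file": "prompts/production_briefing_v1.txt",
        "max_tokens": 2500,
        "temperature": 0.7,
    }
}

def load_prompt(path: str) -> str:
    """Read a prompt template from disk."""
    return Path(path).read_text(encoding="utf-8")

def build_prompt(template: str, fleet_text: str, items_text: str) -> str:
    """Substitute the two template variables, as in the pseudo-code above."""
    return (template
            .replace("{{FLEET_CONTEXT}}", fleet_text)
            .replace("{{RESEARCH_ITEMS}}", items_text))
```

Because substitution is plain string replacement, a typo in a template variable fails silently — the literal `{{...}}` marker survives into the prompt, which is worth checking for after substitution.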
## 5. A/B Testing Protocol
### Hypothesis Template
| Variant | Hypothesis | Expected Change |
|---|---|---|
| V1 (default) | Neutral podcast script with fleet grounding | Baseline |
| V2 (shorter) | Tighter 8–10 min briefings with sharper implications | Higher actionability score |
| V3 (narrative) | Story-driven opening with character arcs for projects | Higher engagement, risk of lower conciseness |
### Test Procedure
- Copy `production_briefing_v1.txt` → `production_briefing_v2_test.txt`
- Make a single controlled change (e.g., tighten the word-count target, add an explicit "Risk / Opportunity / Watch" subsection)
- Run the pipeline with both prompts against the same set of research items:

  ```bash
  python3 pipeline.py --config config.v1.yaml --today --output briefing_v1.json
  python3 pipeline.py --config config.v2.yaml --today --output briefing_v2.json
  ```

- Evaluate both with `quality_eval.py`:

  ```bash
  python3 quality_eval.py briefing_v1.json --json > report_v1.json
  python3 quality_eval.py briefing_v2.json --json > report_v2.json
  ```

- Compare dimension scores. The winner becomes the new default.
- Record results in `prompts/EXPERIMENTS.md`.
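The score comparison can be scripted. The sketch below assumes each report is a JSON object with a `"scores"` mapping of dimension name to float — check `quality_eval.py` for the actual schema before relying on this:

```python
import json

def compare_reports(path_a: str, path_b: str) -> dict:
    """Per-dimension score deltas between two quality reports (B minus A).

    Assumes each report has a top-level "scores" dict of
    dimension name -> float; the schema is an assumption here.
    """
    with open(path_a) as fa, open(path_b) as fb:
        a = json.load(fa)["scores"]
        b = json.load(fb)["scores"]
    return {dim: round(b[dim] - a[dim], 3) for dim in a if dim in b}
```

Positive deltas favor the variant; negative deltas favor the baseline.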
## 6. Common Failure Modes & Fixes
| Symptom | Root Cause | Fix |
|---|---|---|
| Bullet lists instead of narrative | Model defaulting to summarization | Strengthen "single continuous narrative" instruction; add example opening |
| Generic connections ("this could be useful for AI") | Fleet context too abstract or model not penalized | Require explicit repo/issue names; verify fleet_context injection |
| Too short (< 1,000 words) | Model being overly efficient | Raise max_tokens to 2500+; tighten lower bound in prompt |
| Too long (> 2,200 words) | Model over-explaining each paper | Tighten upper bound; limit to top 4 items instead of 5 |
| Robotic tone | Temperature too low or persona too vague | Raise temperature to 0.75; strengthen voice rules |
| Ignores fleet context | Context injected at wrong position or too long | Move fleet context closer to the research items; truncate to top 3 repos/issues/commits |
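The length and structure rows in the table lend themselves to an automated guard. This is an illustrative post-generation check, not part of `pipeline.py`; the thresholds mirror the table above, and the bullet heuristic is deliberately crude:

```python
def check_script(script: str, low: int = 1000, high: int = 2200) -> list[str]:
    """Flag length- and structure-related failure modes from the table.

    Thresholds match the symptoms above; the narrative check is a rough
    heuristic (leading bullet markers suggest the model fell into list mode).
    """
    flags = []
    n = len(script.split())
    if n < low:
        flags.append(f"too short ({n} words): raise max_tokens, tighten lower bound")
    if n > high:
        flags.append(f"too long ({n} words): tighten upper bound or cap item count")
    if any(line.lstrip().startswith(("-", "*", "•")) for line in script.splitlines()):
        flags.append("bullet lists detected: strengthen narrative instruction")
    return flags
```

An empty return value means none of the automated symptoms fired; tone and grounding still need the `quality_eval.py` pass.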
## 7. Maintenance Checklist
Review this prompt monthly or whenever fleet structure changes significantly:
- Does the persona still match Alexander's preferred tone?
- Are the repo names in the examples still current?
- Does the word-count target still map to desired audio length?
- Have any new acceptance criteria emerged that need prompt constraints?
- Is the latest winning A/B variant promoted to `production_briefing_v1.txt`?
## 8. Accountability
| Role | Owner |
|---|---|
| Prompt architecture | @ezra |
| A/B test execution | @gemini or assigned code agent |
| Quality evaluation | Automated via quality_eval.py |
| Final tone approval | @rockachopa (Alexander) |
Last updated: 2026-04-05 by Ezra, Archivist