Deep Dive Prompt Engineering — Knowledge Transfer

Issue: #830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
Created: 2026-04-05 by Ezra, Archivist
Purpose: Explain how the production synthesis prompt works, how to A/B test it, and how to maintain quality as the fleet evolves.


1. The Prompt Files

| File | Role | When to Change |
|------|------|----------------|
| production_briefing_v1.txt | Default prompt for daily briefing generation | When voice quality degrades or acceptance criteria drift |
| production_briefing_v2_*.txt | Experimental variants | During A/B tests |

2. Design Philosophy

The prompt is engineered around three non-negotiables from Alexander:

  1. Grounded in our world first — Fleet context is not decoration. It must shape the narrative.
  2. Actionable, not encyclopedic — Every headline needs a "so what" for Timmy Foundation work.
  3. Premium audio experience — The output is a podcast script, not a report. Structure, pacing, and tone matter.

Why 1,300–1,950 words?

At a natural speaking pace of ~130 WPM:

  • 1,300 words ≈ 10 minutes
  • 1,950 words ≈ 15 minutes

This hits the acceptance criterion for default audio runtime.
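The arithmetic above is easy to sanity-check in code (`words_to_minutes` is an illustrative helper, not part of the pipeline):

```python
def words_to_minutes(word_count: int, wpm: int = 130) -> float:
    """Estimated narration time at a natural speaking pace of ~wpm words/minute."""
    return word_count / wpm

# The word-count bounds map exactly onto the 10-15 minute runtime target
assert words_to_minutes(1300) == 10.0
assert words_to_minutes(1950) == 15.0
```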


3. Prompt Architecture

The prompt has four layers:

Layer 1: Persona

"You are the voice of Deep Dive..."

This establishes tone, authority, and audience. It prevents the model from slipping into academic summarizer mode.

Layer 2: Output Schema

"Write this as a single continuous narrative... Structure the script in exactly these sections..."

The schema forces consistency. Without it, LLMs tend to produce bullet lists or inconsistent section ordering.

Layer 3: Content Constraints

"Every headline item MUST include a connection to our work..."

This is the grounding enforcement layer. It raises the cost of generic summaries.

Layer 4: Dynamic Context

{{FLEET_CONTEXT}} and {{RESEARCH_ITEMS}}

These are template variables substituted at runtime by pipeline.py. The prompt is data-agnostic — it defines how to think about whatever data is injected.


4. Integration with Pipeline

In pipeline.py, the SynthesisEngine loads the prompt file (if configured) and performs substitution:

```python
# Simplified from pipeline.py's SynthesisEngine (load_prompt shown inline)
from pathlib import Path

def load_prompt(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

prompt_template = load_prompt("prompts/production_briefing_v1.txt")
prompt = prompt_template.replace("{{FLEET_CONTEXT}}", fleet_ctx.to_prompt_text())
prompt = prompt.replace("{{RESEARCH_ITEMS}}", format_items(items))
synthesis = self._call_llm(prompt)
```

To switch prompts, update config.yaml:

```yaml
synthesis:
  llm_endpoint: "http://localhost:4000/v1"
  prompt_file: "prompts/production_briefing_v1.txt"
  max_tokens: 2500
  temperature: 0.7
```
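The load-or-fall-back behavior can be sketched as follows (a minimal sketch; `resolve_prompt` and `DEFAULT_PROMPT` are illustrative names, not the actual identifiers in pipeline.py):

```python
from pathlib import Path

DEFAULT_PROMPT = "You are the voice of Deep Dive..."  # built-in fallback (truncated here)

def resolve_prompt(config: dict) -> str:
    """Return the external prompt if synthesis.prompt_file is set and the file
    exists; otherwise fall back to the built-in default."""
    prompt_file = config.get("synthesis", {}).get("prompt_file")
    if prompt_file and Path(prompt_file).exists():
        return Path(prompt_file).read_text(encoding="utf-8")
    return DEFAULT_PROMPT

# A config with no prompt_file (or a missing file) silently uses the default
assert resolve_prompt({"synthesis": {}}) == DEFAULT_PROMPT
```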

5. A/B Testing Protocol

Hypothesis Template

| Variant | Hypothesis | Expected Change |
|---------|------------|-----------------|
| V1 (default) | Neutral podcast script with fleet grounding | Baseline |
| V2 (shorter) | Tighter 8–10 min briefings with sharper implications | Higher actionability score |
| V3 (narrative) | Story-driven opening with character arcs for projects | Higher engagement, risk of lower conciseness |

Test Procedure

  1. Copy production_briefing_v1.txt to production_briefing_v2_test.txt
  2. Make a single controlled change (e.g., tighten word-count target, add explicit "Risk / Opportunity / Watch" subsection)
  3. Run the pipeline with both prompts against the same set of research items:
     ```bash
     python3 pipeline.py --config config.v1.yaml --today --output briefing_v1.json
     python3 pipeline.py --config config.v2.yaml --today --output briefing_v2.json
     ```
  4. Evaluate both with quality_eval.py:
     ```bash
     python3 quality_eval.py briefing_v1.json --json > report_v1.json
     python3 quality_eval.py briefing_v2.json --json > report_v2.json
     ```
  5. Compare dimension scores. Winner becomes the new default.
  6. Record results in prompts/EXPERIMENTS.md.
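Step 5's comparison can be scripted rather than eyeballed. A minimal sketch, assuming each quality_eval.py report exposes a top-level "scores" mapping of dimension name to score (verify against the actual report schema before relying on this):

```python
def score_delta(report_v1: dict, report_v2: dict) -> dict:
    """Per-dimension score change, V2 minus V1, over dimensions both reports share."""
    shared = report_v1["scores"].keys() & report_v2["scores"].keys()
    return {dim: report_v2["scores"][dim] - report_v1["scores"][dim] for dim in shared}

# Illustrative reports (not real quality_eval.py output)
v1 = {"scores": {"grounding": 4.0, "actionability": 3.5}}
v2 = {"scores": {"grounding": 4.0, "actionability": 4.2}}
delta = score_delta(v1, v2)  # positive values favor V2
```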

6. Common Failure Modes & Fixes

| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Bullet lists instead of narrative | Model defaulting to summarization | Strengthen "single continuous narrative" instruction; add example opening |
| Generic connections ("this could be useful for AI") | Fleet context too abstract or model not penalized | Require explicit repo/issue names; verify fleet_context injection |
| Too short (< 1,000 words) | Model being overly efficient | Raise max_tokens to 2500+; tighten lower bound in prompt |
| Too long (> 2,200 words) | Model over-explaining each paper | Tighten upper bound; limit to top 4 items instead of 5 |
| Robotic tone | Temperature too low or persona too vague | Raise temperature to 0.75; strengthen voice rules |
| Ignores fleet context | Context injected at wrong position or too long | Move fleet context closer to the research items; truncate to top 3 repos/issues/commits |
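The two length failure modes are cheap to gate automatically before audio rendering. A sketch using the thresholds from the table (`length_check` is an illustrative helper, not part of the pipeline):

```python
def length_check(script: str, lo: int = 1000, hi: int = 2200) -> str:
    """Classify a briefing script against the word-count bounds above."""
    n = len(script.split())
    if n < lo:
        return "too_short"  # raise max_tokens / tighten the prompt's lower bound
    if n > hi:
        return "too_long"   # tighten the upper bound / cover fewer items
    return "ok"
```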

7. Maintenance Checklist

Review this prompt monthly or whenever fleet structure changes significantly:

  • Does the persona still match Alexander's preferred tone?
  • Are the repo names in the examples still current?
  • Does the word-count target still map to desired audio length?
  • Have any new acceptance criteria emerged that need prompt constraints?
  • Is the latest winning A/B variant promoted to production_briefing_v1.txt?

8. Accountability

| Role | Owner |
|------|-------|
| Prompt architecture | @ezra |
| A/B test execution | @gemini or assigned code agent |
| Quality evaluation | Automated via quality_eval.py |
| Final tone approval | @rockachopa (Alexander) |

Last updated: 2026-04-05 by Ezra, Archivist