Deep Dive Prompt Engineering — Knowledge Transfer

Issue: #830 — Deep Dive: Sovereign NotebookLM + Daily AI Intelligence Briefing
Created: 2026-04-05 by Ezra, Archivist
Purpose: Explain how the production synthesis prompt works, how to A/B test it, and how to maintain quality as the fleet evolves.


1. The Prompt Files

| File | Role | When to Change |
|------|------|----------------|
| production_briefing_v1.txt | Default prompt for daily briefing generation | When voice quality degrades or acceptance criteria drift |
| production_briefing_v2_*.txt | Experimental variants | During A/B tests |

2. Design Philosophy

The prompt is engineered around three non-negotiables from Alexander:

  1. Grounded in our world first — Fleet context is not decoration. It must shape the narrative.
  2. Actionable, not encyclopedic — Every headline needs a "so what" for Timmy Foundation work.
  3. Premium audio experience — The output is a podcast script, not a report. Structure, pacing, and tone matter.

Why 1,300–1,950 words?

At a natural speaking pace of ~130 WPM:

  • 1,300 words ≈ 10 minutes
  • 1,950 words ≈ 15 minutes

This hits the acceptance criterion for default audio runtime.
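The arithmetic above is easy to sanity-check in code (`words_to_minutes` is an illustrative helper, not part of the pipeline):

```python
def words_to_minutes(word_count: int, wpm: int = 130) -> float:
    """Estimated narration time at a natural speaking pace of ~wpm words/minute."""
    return word_count / wpm

# The word-count bounds map exactly onto the 10-15 minute runtime target
assert words_to_minutes(1300) == 10.0
assert words_to_minutes(1950) == 15.0
```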


3. Prompt Architecture

The prompt has four layers:

Layer 1: Persona

"You are the voice of Deep Dive..."

This establishes tone, authority, and audience. It prevents the model from slipping into academic summarizer mode.

Layer 2: Output Schema

"Write this as a single continuous narrative... Structure the script in exactly these sections..."

The schema forces consistency. Without it, LLMs tend to produce bullet lists or inconsistent section ordering.

Layer 3: Content Constraints

"Every headline item MUST include a connection to our work..."

This is the grounding enforcement layer. It raises the cost of generic summaries.

Layer 4: Dynamic Context

{{FLEET_CONTEXT}} and {{RESEARCH_ITEMS}}

These are template variables substituted at runtime by pipeline.py. The prompt is data-agnostic — it defines how to think about whatever data is injected.


4. Integration with Pipeline

In pipeline.py, the SynthesisEngine loads the prompt file (if configured) and performs substitution:

```python
# Simplified from pipeline.py's SynthesisEngine (load_prompt shown inline)
from pathlib import Path

def load_prompt(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

prompt_template = load_prompt("prompts/production_briefing_v1.txt")
prompt = prompt_template.replace("{{FLEET_CONTEXT}}", fleet_ctx.to_prompt_text())
prompt = prompt.replace("{{RESEARCH_ITEMS}}", format_items(items))
synthesis = self._call_llm(prompt)
```

To switch prompts, update config.yaml:

```yaml
synthesis:
  llm_endpoint: "http://localhost:4000/v1"
  prompt_file: "prompts/production_briefing_v1.txt"
  max_tokens: 2500
  temperature: 0.7
```
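The load-or-fall-back behavior can be sketched as follows (a minimal sketch; `resolve_prompt` and `DEFAULT_PROMPT` are illustrative names, not the actual identifiers in pipeline.py):

```python
from pathlib import Path

DEFAULT_PROMPT = "You are the voice of Deep Dive..."  # built-in fallback (truncated here)

def resolve_prompt(config: dict) -> str:
    """Return the external prompt if synthesis.prompt_file is set and the file
    exists; otherwise fall back to the built-in default."""
    prompt_file = config.get("synthesis", {}).get("prompt_file")
    if prompt_file and Path(prompt_file).exists():
        return Path(prompt_file).read_text(encoding="utf-8")
    return DEFAULT_PROMPT

# A config with no prompt_file (or a missing file) silently uses the default
assert resolve_prompt({"synthesis": {}}) == DEFAULT_PROMPT
```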

5. A/B Testing Protocol

Hypothesis Template

| Variant | Hypothesis | Expected Change |
|---------|------------|-----------------|
| V1 (default) | Neutral podcast script with fleet grounding | Baseline |
| V2 (shorter) | Tighter 8–10 min briefings with sharper implications | Higher actionability score |
| V3 (narrative) | Story-driven opening with character arcs for projects | Higher engagement, risk of lower conciseness |

Test Procedure

  1. Copy production_briefing_v1.txt to production_briefing_v2_test.txt
  2. Make a single controlled change (e.g., tighten word-count target, add explicit "Risk / Opportunity / Watch" subsection)
  3. Run the pipeline with both prompts against the same set of research items:
     ```bash
     python3 pipeline.py --config config.v1.yaml --today --output briefing_v1.json
     python3 pipeline.py --config config.v2.yaml --today --output briefing_v2.json
     ```
  4. Evaluate both with quality_eval.py:
     ```bash
     python3 quality_eval.py briefing_v1.json --json > report_v1.json
     python3 quality_eval.py briefing_v2.json --json > report_v2.json
     ```
  5. Compare dimension scores. Winner becomes the new default.
  6. Record results in prompts/EXPERIMENTS.md.
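Step 5's comparison can be scripted rather than eyeballed. A minimal sketch, assuming each quality_eval.py report exposes a top-level "scores" mapping of dimension name to score (verify against the actual report schema before relying on this):

```python
def score_delta(report_v1: dict, report_v2: dict) -> dict:
    """Per-dimension score change, V2 minus V1, over dimensions both reports share."""
    shared = report_v1["scores"].keys() & report_v2["scores"].keys()
    return {dim: report_v2["scores"][dim] - report_v1["scores"][dim] for dim in shared}

# Illustrative reports (not real quality_eval.py output)
v1 = {"scores": {"grounding": 4.0, "actionability": 3.5}}
v2 = {"scores": {"grounding": 4.0, "actionability": 4.2}}
delta = score_delta(v1, v2)  # positive values favor V2
```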

6. Common Failure Modes & Fixes

| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Bullet lists instead of narrative | Model defaulting to summarization | Strengthen "single continuous narrative" instruction; add example opening |
| Generic connections ("this could be useful for AI") | Fleet context too abstract or model not penalized | Require explicit repo/issue names; verify fleet_context injection |
| Too short (< 1,000 words) | Model being overly efficient | Raise max_tokens to 2500+; tighten lower bound in prompt |
| Too long (> 2,200 words) | Model over-explaining each paper | Tighten upper bound; limit to top 4 items instead of 5 |
| Robotic tone | Temperature too low or persona too vague | Raise temperature to 0.75; strengthen voice rules |
| Ignores fleet context | Context injected at wrong position or too long | Move fleet context closer to the research items; truncate to top 3 repos/issues/commits |
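The two length failure modes are cheap to gate automatically before audio rendering. A sketch using the thresholds from the table (`length_check` is an illustrative helper, not part of the pipeline):

```python
def length_check(script: str, lo: int = 1000, hi: int = 2200) -> str:
    """Classify a briefing script against the word-count bounds above."""
    n = len(script.split())
    if n < lo:
        return "too_short"  # raise max_tokens / tighten the prompt's lower bound
    if n > hi:
        return "too_long"   # tighten the upper bound / cover fewer items
    return "ok"
```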

7. Maintenance Checklist

Review this prompt monthly or whenever fleet structure changes significantly:

  • Does the persona still match Alexander's preferred tone?
  • Are the repo names in the examples still current?
  • Does the word-count target still map to desired audio length?
  • Have any new acceptance criteria emerged that need prompt constraints?
  • Is the latest winning A/B variant promoted to production_briefing_v1.txt?

8. Accountability

| Role | Owner |
|------|-------|
| Prompt architecture | @ezra |
| A/B test execution | @gemini or assigned code agent |
| Quality evaluation | Automated via quality_eval.py |
| Final tone approval | @rockachopa (Alexander) |

Last updated: 2026-04-05 by Ezra, Archivist