docs/big-brain-27b-test-omission.md

# Big Brain 27B — Test Omission Workaround

**Issue:** [timmy-home#654](https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-home/issues/654)

**Closes:** #650

**Source:** #576 benchmarks

## Finding

Big Brain (Gemma4 27B via llama.cpp on RunPod L40S) omits unit tests when asked to include them in the same prompt as implementation code.

### Observed behavior

**Single prompt (broken):**

```
Write a Python module that does X. Include unit tests.
```

Result: Clean implementation code. No test file or test functions.

**Two prompts (working):**

```
Prompt 1: Write a Python module that does X.

Prompt 2: Now write unit tests for the module above.
```

Result: Implementation + complete test suite.

## Root Cause

27B models (particularly Gemma4) have a strong "code generation" bias in their training data. When a prompt contains both implementation and test instructions, the model tends to:

1. Prioritize the implementation task (it reads as the "main" task)
2. Treat tests as secondary or optional
3. Run out of context budget before generating tests
4. Simply skip tests because the implementation already feels "complete"

This is a well-documented pattern in large code models — they are biased toward producing runnable code, not test code.

## Workaround
### Strategy 1: Split prompts

Always separate implementation from testing:

```
Step 1: "Write a Python module that [task]. Output ONLY the module code."

Step 2: "Write pytest tests for the module above. Cover edge cases."
```

### Strategy 2: Tests first (TDD mode)

Reverse the order — ask for tests before implementation:

```
Step 1: "Write pytest tests for a module that [task]. Define the expected interface."

Step 2: "Implement the module to pass the tests above."
```

### Strategy 3: Explicit test budget

Allocate context budget explicitly:

```
"Write a Python module that [task]. The module should be ~200 lines.
Then write ~100 lines of pytest tests. Both files are required."
```

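The budgeted prompt can be templated so the numbers stay visible and tunable. The ~200/~100 defaults below mirror the example above; the right budgets are task-dependent:

```python
def budget_prompt(task: str, module_lines: int = 200, test_lines: int = 100) -> str:
    # Naming both deliverables and their sizes makes the tests
    # feel non-optional to the model.
    return (
        f"Write a Python module that {task}. "
        f"The module should be ~{module_lines} lines.\n"
        f"Then write ~{test_lines} lines of pytest tests. "
        "Both files are required."
    )
```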
## Prompt Engineering Notes

- 27B responds well to explicit word/line budgets
- "Output ONLY" constraints help focus the model
- TDD mode produces better tests (tests define the interface, not the other way around)
- For complex tasks, decompose into 3+ prompts: design → tests → implementation

## Performance Impact

| Approach | Tests Included | Test Quality | Total Tokens |
|----------|----------------|--------------|--------------|
| Single prompt | ❌ No | N/A | ~2000 |
| Split prompts | ✅ Yes | Good | ~4000 |
| TDD mode | ✅ Yes | Best | ~4500 |

Split prompts cost ~2x tokens but produce reliable results.

---

**Related:** #576 (Big Brain benchmarks), #650 (original finding), #578 (Testament rewrite)