Compare commits

1 Commit

Author: Timmy
SHA1: 80e2f8f5f7
Date: 2026-04-14 11:41:17 -04:00

docs: Big Brain 27B test omission workaround (#654)

Gemma4 27B omits unit tests when asked for implementation + tests in
a single prompt. Documented finding, root cause, and three workaround
strategies (split prompts, TDD mode, explicit budget).

Closes #650.

@@ -0,0 +1,82 @@
# Big Brain 27B — Test Omission Workaround
**Issue:** [timmy-home#654](https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-home/issues/654)
**Closes:** #650
**Source:** #576 benchmarks
## Finding
Big Brain (Gemma4 27B via llama.cpp on RunPod L40S) omits unit tests when asked to include them in the same prompt as implementation code.
### Observed behavior
**Single prompt (broken):**
```
Write a Python module that does X. Include unit tests.
```
Result: Clean implementation code. No test file or test functions.
**Two prompts (working):**
```
Prompt 1: Write a Python module that does X.
Prompt 2: Now write unit tests for the module above.
```
Result: Implementation + complete test suite.
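
The two-prompt flow can be driven against llama.cpp's OpenAI-compatible chat endpoint by threading step 1's output back in as assistant context. A minimal sketch (the helper name and prompt wording are illustrative, not part of the benchmark):

```python
def step_messages(task, module_code=None):
    """Build the chat messages for each step of the two-prompt flow.

    With no module_code, returns the step-1 messages (implementation only).
    With module_code, returns the step-2 messages: the original request,
    the model's implementation as assistant context, then the test request.
    """
    first = {"role": "user", "content": f"Write a Python module that {task}."}
    if module_code is None:
        return [first]
    return [
        first,
        {"role": "assistant", "content": module_code},
        {"role": "user", "content": "Now write unit tests for the module above."},
    ]
```

Each list is posted to the server's `/v1/chat/completions` endpoint in turn; because step 2 sees the actual implementation, the generated tests match the code that was really produced.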
## Root Cause
27B models (particularly Gemma4) have a strong "code generation" bias in their training data. When a prompt contains both implementation and test instructions, the model tends to:
1. Prioritize the implementation task (it's the "main" task)
2. Treat tests as secondary/optional
3. Run out of context budget before generating tests, or
4. Simply skip tests because the implementation already feels "complete"

This is a well-documented pattern in large code models — they are biased toward producing runnable code, not test code.
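
Because the omission is silent, a cheap guard is to scan the response for test markers before accepting it. A minimal sketch (the marker patterns are a heuristic assumption, not something measured in the benchmarks):

```python
import re

# Heuristic markers that a response contains pytest/unittest-style tests.
TEST_MARKERS = (
    re.compile(r"^\s*def test_", re.MULTILINE),
    re.compile(r"\bimport pytest\b"),
    re.compile(r"\bunittest\.TestCase\b"),
)

def contains_tests(output: str) -> bool:
    """Return True if the model output appears to include unit tests."""
    return any(p.search(output) for p in TEST_MARKERS)
```

If `contains_tests` returns False on a response that was supposed to include tests, fall back to one of the split-prompt strategies below instead of retrying the single prompt.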
## Workaround
### Strategy 1: Split prompts
Always separate implementation from testing:
```
Step 1: "Write a Python module that [task]. Output ONLY the module code."
Step 2: "Write pytest tests for the module above. Cover edge cases."
```
### Strategy 2: Tests first (TDD mode)
Reverse the order — ask for tests before implementation:
```
Step 1: "Write pytest tests for a module that [task]. Define the expected interface."
Step 2: "Implement the module to pass the tests above."
```
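
The TDD-mode pair can be generated from a task description; a small sketch mirroring the wording above (function name is hypothetical):

```python
def tdd_prompts(task):
    """Build the TDD-mode prompt pair: tests first, then implementation."""
    return (
        f"Write pytest tests for a module that {task}. "
        "Define the expected interface.",
        "Implement the module to pass the tests above.",
    )
```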
### Strategy 3: Explicit test budget
Allocate context budget explicitly:
```
"Write a Python module that [task]. The module should be ~200 lines.
Then write ~100 lines of pytest tests. Both files are required."
```
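
The budget prompt also templates cleanly, which makes the line targets tunable per task. A sketch under the same wording as above (defaults of 200/100 lines are taken from the example, the helper itself is illustrative):

```python
def budget_prompt(task, impl_lines=200, test_lines=100):
    """Render the explicit-budget prompt with stated line targets."""
    return (
        f"Write a Python module that {task}. "
        f"The module should be ~{impl_lines} lines.\n"
        f"Then write ~{test_lines} lines of pytest tests. "
        "Both files are required."
    )
```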
## Prompt Engineering Notes
- 27B responds well to explicit word/line budgets
- "Output ONLY" constraints help focus the model
- TDD mode produces better tests (tests define the interface, not the other way around)
- For complex tasks, decompose into 3+ prompts: design → tests → implementation
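
The design → tests → implementation decomposition in the last note can be sketched as an ordered prompt sequence (the exact prompt wording is an assumption; each output is fed forward as context for the next step, as in the split-prompt flow):

```python
def decompose(task):
    """Expand a complex task into design -> tests -> implementation prompts."""
    return [
        f"Describe the design of a module that {task}: public functions, "
        "data structures, and error handling. Output ONLY the design notes.",
        "Write pytest tests for that design. Cover edge cases. "
        "Output ONLY the tests.",
        "Implement the module so the tests above pass. "
        "Output ONLY the module code.",
    ]
```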
## Performance Impact
| Approach | Tests Included | Test Quality | Total Tokens |
|----------|---------------|--------------|--------------|
| Single prompt | ❌ No | N/A | ~2000 |
| Split prompts | ✅ Yes | Good | ~4000 |
| TDD mode | ✅ Yes | Best | ~4500 |
Split prompts cost ~2x tokens but produce reliable results.
---
**Related:** #576 (Big Brain benchmarks), #650 (original finding), #578 (Testament rewrite)