docs/big-brain-27b-test-omission.md

# Big Brain 27B — Test Omission Workaround

**Issue:** [timmy-home#654](https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-home/issues/654)

**Closes:** #650

**Source:** #576 benchmarks

## Finding

Big Brain (Gemma4 27B via llama.cpp on RunPod L40S) omits unit tests when asked to include them in the same prompt as implementation code.

### Observed behavior

**Single prompt (broken):**

```
Write a Python module that does X. Include unit tests.
```

Result: Clean implementation code. No test file or test functions.

**Two prompts (working):**

```
Prompt 1: Write a Python module that does X.

Prompt 2: Now write unit tests for the module above.
```

Result: Implementation + complete test suite.

## Root Cause

27B models (particularly Gemma4) have a strong "code generation" bias in their training data. When a prompt contains both implementation and test instructions, the model tends to:

1. Prioritize the implementation task (it reads as the "main" task)
2. Treat tests as secondary or optional
3. Run out of context budget before generating tests
4. Simply skip tests because the implementation already feels "complete"

This is a well-documented pattern in large code models — they are biased toward producing runnable code, not test code.

## Workaround
### Strategy 1: Split prompts

Always separate implementation from testing:

```
Step 1: "Write a Python module that [task]. Output ONLY the module code."

Step 2: "Write pytest tests for the module above. Cover edge cases."
```

### Strategy 2: Tests first (TDD mode)

Reverse the order — ask for tests before implementation:

```
Step 1: "Write pytest tests for a module that [task]. Define the expected interface."

Step 2: "Implement the module to pass the tests above."
```

### Strategy 3: Explicit test budget

Allocate context budget explicitly:

```
"Write a Python module that [task]. The module should be ~200 lines.
Then write ~100 lines of pytest tests. Both files are required."
```

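The budgeted prompt can be templated so the numbers stay visible and tunable. The ~200/~100 defaults below mirror the example above; the right budgets are task-dependent:

```python
def budget_prompt(task: str, module_lines: int = 200, test_lines: int = 100) -> str:
    # Naming both deliverables and their sizes makes the tests
    # feel non-optional to the model.
    return (
        f"Write a Python module that {task}. "
        f"The module should be ~{module_lines} lines.\n"
        f"Then write ~{test_lines} lines of pytest tests. "
        "Both files are required."
    )
```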
## Prompt Engineering Notes

- 27B responds well to explicit word/line budgets
- "Output ONLY" constraints help focus the model
- TDD mode produces better tests (tests define the interface, not the other way around)
- For complex tasks, decompose into 3+ prompts: design → tests → implementation

## Performance Impact

| Approach | Tests Included | Test Quality | Total Tokens |
|----------|----------------|--------------|--------------|
| Single prompt | ❌ No | N/A | ~2000 |
| Split prompts | ✅ Yes | Good | ~4000 |
| TDD mode | ✅ Yes | Best | ~4500 |

Split prompts cost ~2x tokens but produce reliable results.

---

**Related:** #576 (Big Brain benchmarks), #650 (original finding), #578 (Testament rewrite)