Compare commits

2 Commits

Author SHA1 Message Date
Alexander Whitestone
a8777e0d80 docs: update 27B test finding — prompt overload, not model limit
Some checks failed
Smoke Test / smoke (pull_request) Failing after 15s
Correction from #653: 27B includes tests when prompt is concise.
'Include type hints and one unit test.' → tests included.
'Include type hints, docstring, and one unit test.' → tests omitted.

Issue is prompt overload, not model limitation.

Closes #653
2026-04-13 22:32:18 -04:00
Alexander Whitestone
5f50ac4801 docs: Big Brain 27B test omission workaround
Some checks failed
Smoke Test / smoke (pull_request) Failing after 10s
Document the finding that 27B omits unit tests when asked to include
them in the same prompt as implementation code. Workaround: split
into two prompts (implementation, then tests).

From benchmark runs in #576.

Closes #650
2026-04-13 22:28:28 -04:00

@@ -0,0 +1,53 @@
# Big Brain 27B — Test Omission Pattern
## Finding (2026-04-14)
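In the benchmark runs, the 27B model (gemma4) omitted unit tests when asked to include them
in the same prompt as implementation code. The model produced a complete, high-quality
implementation but stopped before the test class or function. The Update section at the
end of this document narrows the conditions under which this happens.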
**Affected models:** 1B, 7B, 27B (27B is the most notable because its implementation quality is the highest)
**Root cause (initial hypothesis, superseded by the Update section):** Models appear to treat tests as optional even when the prompt explicitly requires them.
## Workaround
Split the prompt into two phases:
### Phase 1: Implementation
```
Write a webhook parser with @dataclass, verify_signature(), parse_webhook().
Include type hints and docstrings.
```
### Phase 2: Tests (separate prompt)
```
Write a unit test for the webhook parser above. Cover:
- Valid signature verification
- Invalid signature rejection
- Malformed payload handling
```
## Prompt Engineering Notes
- Avoid combining "implement X" and "include unit test" in a single prompt when that prompt also carries several other requirements (see the Update section)
- The model excels at implementation when focused
- Test generation works better as a follow-up on the existing code
- For critical code, always verify test presence manually
## Impact
Low — workaround is simple (split prompt). No data loss or corruption risk.
## Source
Benchmark runs documented in timmy-home #576.
## Update (2026-04-14)
**Correction:** 27B DOES include tests when the prompt is concise.
- "Include type hints and one unit test." → tests included
- "Include type hints, docstring, and one unit test." → tests omitted
The issue is **prompt overload**, not a model limitation. Use short, focused
test requirements. See #653.