Alexander Whitestone
bd092bdc9d Big Brain Benchmark v5: prompt engineering effects on 27B
Some checks failed
Smoke Test / smoke (pull_request) Failing after 14s
Key discovery: prompt constraints change 27B behavior.

- 'NOT Kubernetes' in prompt fixes k8s cron bias (#652 filed)
- Concise test requirement gets tests included (#653 filed)
- Only 27B finds both async bugs (confirmed 5th run)
- 27B wins 2/4, 1B wins 2/4, 27B is 5.9x slower than 1B (mean of per-task ratios; 6.1x on total time)

Issues filed:
- #652: Prompt constraint 'NOT Kubernetes' fixes 27B cron bias
- #653: 27B includes unit test when prompt emphasizes 'one unit test'

Prior PRs for #576: #633, #642, #646, #651 (recommend closing)

Refs: Timmy_Foundation/timmy-home#576
2026-04-13 22:31:17 -04:00


@@ -0,0 +1,86 @@
# Big Brain Benchmark v5 — Prompt Engineering Effects
**Date:** 2026-04-13
**Models:** gemma3:1b, qwen2.5:7b, gemma4:latest (27B)
**Hardware:** Apple Silicon Mac, local Ollama, temp=0.3
**Ref:** #576
---
## Results (4 tasks)
| Task | 1B | 7B | 27B | Winner |
|------|----|----|-----|--------|
| Webhook parser | 9.8s, 4896c | 18.6s, 3635c | 55.4s, 6756c | **27B** |
| Evennia explain | 1.7s, 1167c | 5.9s, 1720c | 7.9s, 1107c | **1B** |
| Cron YAML | 8.4s, 4073c | 21.6s, 4150c | 51.7s, 4755c | **1B** |
| Debug async | 6.9s, 3646c | 8.9s, 2219c | 48.9s, 4107c | **27B** |
**27B wins 2/4. 1B wins 2/4. 27B is 5.9x slower than 1B on average per task (6.1x on total time).**
---
## Key Discovery: Prompt Engineering Changes 27B Behavior
### Finding 1: "NOT Kubernetes" constraint fixes cron bias
**Before (prior runs):** 27B outputs `apiVersion: batch/v1, kind: CronJob`
**After (this run):** The prompt says "standard cron YAML (NOT Kubernetes)" → 27B outputs plain cron YAML instead of a Kubernetes CronJob manifest
The k8s bias (issue #649) is a **default behavior**, not a hard limitation. Explicit constraints work.
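A check like the following could flag the k8s bias automatically. It is a minimal sketch, not the benchmark's actual scoring code: the function name and regex patterns are assumptions, built only from the two manifest fields quoted above.

```python
import re

# Hypothetical markers: the two Kubernetes CronJob fields quoted in Finding 1.
K8S_MARKERS = (
    re.compile(r"apiVersion:\s*batch/v1\b"),
    re.compile(r"kind:\s*CronJob\b"),
)


def looks_like_k8s_cronjob(output: str) -> bool:
    """Return True if the model output contains any Kubernetes CronJob marker."""
    return any(pattern.search(output) for pattern in K8S_MARKERS)


if __name__ == "__main__":
    biased = "apiVersion: batch/v1\nkind: CronJob\nmetadata:\n  name: backup\n"
    plain = "jobs:\n  - name: backup\n    schedule: '0 2 * * *'\n"
    print(looks_like_k8s_cronjob(biased))
    print(looks_like_k8s_cronjob(plain))
```

Matching on either field alone is deliberately loose: a false positive costs one manual look, while a missed manifest would hide the bias.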
### Finding 2: Concise test requirement gets tests included
**Before (prior runs):** "Include type hints, docstring, and one unit test" → 27B omits test
**After (this run):** "Include type hints and one unit test" → 27B includes `def test_` with `assert`
Both 7B and 27B included tests this run; 1B did not. The behavior tracked in issue #650 is therefore prompt-dependent.
### Finding 3: Only 27B finds both async bugs (confirmed again)
**27B:** "sequential execution and resource leakage" — identifies both bugs, uses `asyncio.gather()`
**1B:** Identifies session-per-request only
**7B:** Identifies session-per-request only
This finding is consistent across all 5 benchmark runs.
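The benchmark's buggy snippet is not reproduced in this report, so the following is a hypothetical illustration of the two bug classes named above, not the actual task. `FakeSession` is an invented stand-in for an HTTP client session and does no network I/O.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an HTTP client session; counts opens and closes."""

    opened = 0
    closed = 0

    def __init__(self) -> None:
        FakeSession.opened += 1

    async def close(self) -> None:
        FakeSession.closed += 1

    async def __aenter__(self) -> "FakeSession":
        return self

    async def __aexit__(self, *exc_info) -> None:
        await self.close()

    async def get(self, url: str) -> str:
        await asyncio.sleep(0.01)  # simulate request latency
        return f"body of {url}"


async def fetch_all_buggy(urls):
    results = []
    for url in urls:
        session = FakeSession()                  # bug: new session per request, never closed
        results.append(await session.get(url))   # bug: each request awaited one after another
    return results


async def fetch_all_fixed(urls):
    async with FakeSession() as session:         # one shared session, closed on exit
        # gather() schedules all requests concurrently and preserves input order
        return await asyncio.gather(*(session.get(u) for u in urls))


if __name__ == "__main__":
    print(asyncio.run(fetch_all_fixed(["a", "b", "c"])))
```

In this sketch, a review that only mentions session-per-request would fix the leak but leave the loop sequential; moving to `asyncio.gather()` with a shared session addresses both problems.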
---
## Quality Markers
| Marker | 1B | 7B | 27B |
|--------|----|----|-----|
| Includes unit test | No | Yes | Yes |
| Uses k8s for cron | No | No | No (with constraint) |
| Finds both async bugs | No | No | Yes |
| Correct HMAC impl | Partial | Yes | Yes |
---
## Speed Summary
```
Task      1B      7B      27B      27B/1B
──────────────────────────────────────────
Webhook   9.8s    18.6s   55.4s    5.7x
Evennia   1.7s    5.9s    7.9s     4.6x
Cron      8.4s    21.6s   51.7s    6.2x
Debug     6.9s    8.9s    48.9s    7.1x
──────────────────────────────────────────
Total     26.8s   55.0s   163.9s   6.1x
```
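The slowdown figures can be rechecked from the table with a few lines of arithmetic. The sketch below uses only the timings listed above; the function names are illustrative. It also shows why two different headline numbers appear: the ratio of total times is about 6.1x, while the mean of the four per-task ratios is about 5.9x.

```python
# Per-task wall-clock times in seconds, copied from the Speed Summary table.
TIMINGS = {
    "Webhook": {"1B": 9.8, "7B": 18.6, "27B": 55.4},
    "Evennia": {"1B": 1.7, "7B": 5.9, "27B": 7.9},
    "Cron": {"1B": 8.4, "7B": 21.6, "27B": 51.7},
    "Debug": {"1B": 6.9, "7B": 8.9, "27B": 48.9},
}


def per_task_ratios(slow: str = "27B", fast: str = "1B") -> dict:
    """Slowdown of `slow` relative to `fast` for each task."""
    return {task: t[slow] / t[fast] for task, t in TIMINGS.items()}


def total_ratio(slow: str = "27B", fast: str = "1B") -> float:
    """Ratio of summed times: weights long tasks more heavily."""
    slow_total = sum(t[slow] for t in TIMINGS.values())
    fast_total = sum(t[fast] for t in TIMINGS.values())
    return slow_total / fast_total


def mean_ratio(slow: str = "27B", fast: str = "1B") -> float:
    """Unweighted mean of the per-task ratios."""
    ratios = per_task_ratios(slow, fast)
    return sum(ratios.values()) / len(ratios)


if __name__ == "__main__":
    for task, ratio in per_task_ratios().items():
        print(f"{task}: {ratio:.1f}x")
    print(f"total: {total_ratio():.1f}x, mean of ratios: {mean_ratio():.1f}x")
```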
## Issues Filed
| Issue | Finding |
|-------|---------|
| #652 | "NOT Kubernetes" constraint fixes 27B cron bias |
| #653 | 27B includes test when prompt is concise |
## Recommendation
- **27B:** Code review, debugging. Use explicit constraints for infrastructure tasks. Keep test requirements concise.
- **1B:** Fast explanations, quick drafts. Don't use for debugging.
- **7B:** Balanced middle ground. Reliable test generation.
---
*5th benchmark run for #576. Prior PRs: #633, #642, #646, #651.*
*Generated 2026-04-13.*