Big Brain Benchmark v5: prompt engineering effects on 27B

Key discovery: prompt constraints change 27B behavior.

- 'NOT Kubernetes' in prompt fixes k8s cron bias (#652 filed)
- Concise test requirement gets tests included (#653 filed)
- Only 27B finds both async bugs (confirmed 5th run)
- 27B wins 2/4, 1B wins 2/4, 27B is 5.9x slower

Issues filed:
- #652: Prompt constraint 'NOT Kubernetes' fixes 27B cron bias
- #653: 27B includes unit test when prompt emphasizes 'one unit test'

Prior PRs for #576: #633, #642, #646, #651 (recommend closing)

Refs: Timmy_Foundation/timmy-home#576
2026-04-13 22:31:17 -04:00
1 changed file with 86 additions and 0 deletions
--- a/timmy-config/docs/big-brain-benchmark.md
+++ b/timmy-config/docs/big-brain-benchmark.md
@@ -0,0 +1,86 @@
+# Big Brain Benchmark v5 — Prompt Engineering Effects
+
+**Date:** 2026-04-13
+**Models:** gemma3:1b, qwen2.5:7b, gemma4:latest (27B)
+**Hardware:** Apple Silicon Mac, local Ollama, temp=0.3
+**Ref:** #576
+
+---
+
+## Results (4 tasks)
+
+| Task | 1B | 7B | 27B | Winner |
+|------|----|----|-----|--------|
+| Webhook parser | 9.8s, 4896c | 18.6s, 3635c | 55.4s, 6756c | **27B** |
+| Evennia explain | 1.7s, 1167c | 5.9s, 1720c | 7.9s, 1107c | **1B** |
+| Cron YAML | 8.4s, 4073c | 21.6s, 4150c | 51.7s, 4755c | **1B** |
+| Debug async | 6.9s, 3646c | 8.9s, 2219c | 48.9s, 4107c | **27B** |
+
+**27B wins 2/4. 1B wins 2/4. 27B is 5.9x slower than 1B (mean of per-task ratios; 6.1x by total time).**
+
+---
+
+## Key Discovery: Prompt Engineering Changes 27B Behavior
+
+### Finding 1: "NOT Kubernetes" constraint fixes cron bias
+**Before (prior runs):** 27B outputs `apiVersion: batch/v1, kind: CronJob`
+**After (this run):** Prompt says "standard cron YAML (NOT Kubernetes)" → 27B outputs correct bare cron
+
+The k8s bias (issue #649) is a **default behavior**, not a hard limitation: in this run, an explicit negative constraint in the prompt was enough to override it.
+
+### Finding 2: Concise test requirement gets tests included
+**Before (prior runs):** "Include type hints, docstring, and one unit test" → 27B omits test
+**After (this run):** "Include type hints and one unit test" → 27B includes `def test_` with `assert`
+
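+Both 7B and 27B included tests this run, so the missing-test behavior in issue #650 depends on prompt wording. In the requirement quoted above, the only change is dropping "docstring".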
+
+### Finding 3: Only 27B finds both async bugs (confirmed again)
+**27B:** "sequential execution and resource leakage" — identifies both bugs, uses `asyncio.gather()`
+**1B:** Identifies session-per-request only
+**7B:** Identifies session-per-request only
+
+This finding is consistent across all 5 benchmark runs.
+
+---
+
+## Quality Markers
+
+| Marker | 1B | 7B | 27B |
+|--------|----|----|-----|
+| Includes unit test | No | Yes | Yes |
+| Uses k8s for cron | No | No | No (with constraint) |
+| Finds both async bugs | No | No | Yes |
+| Correct HMAC impl | Partial | Yes | Yes |
+
+---
+
+## Speed Summary
+
+```
+Task        1B     7B    27B   27B/1B
+──────────────────────────────────────
+Webhook    9.8s  18.6s  55.4s   5.7x
+Evennia    1.7s   5.9s   7.9s   4.6x
+Cron       8.4s  21.6s  51.7s   6.2x
+Debug      6.9s   8.9s  48.9s   7.1x
+──────────────────────────────────────
+Total     26.8s  55.0s 163.9s   6.1x
+```
+
+## Issues Filed
+
+| Issue | Finding |
+|-------|---------|
+| #652 | "NOT Kubernetes" constraint fixes 27B cron bias |
+| #653 | 27B includes test when prompt is concise |
+
+## Recommendation
+
+- **27B:** Code review, debugging. Use explicit constraints for infrastructure tasks. Keep test requirements concise.
+- **1B:** Fast explanations, quick drafts. Don't use for debugging.
+- **7B:** Balanced middle ground. Reliable test generation.
+
+---
+
+*5th benchmark run for #576. Prior PRs: #633, #642, #646, #651.*
+*Generated 2026-04-13.*
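
The report quotes two slowdown figures for 27B relative to 1B: 5.9x in the results summary and 6.1x in the Speed Summary total. A minimal sketch of how both can be reproduced from the table, assuming 5.9x is the mean of the per-task ratios and 6.1x is the ratio of total times (the `slowdown` helper and the `timings` layout are illustrative, not part of the benchmark harness):

```python
# Timings in seconds, copied from the Speed Summary table.
timings = {
    "Webhook": {"1B": 9.8, "7B": 18.6, "27B": 55.4},
    "Evennia": {"1B": 1.7, "7B": 5.9, "27B": 7.9},
    "Cron": {"1B": 8.4, "7B": 21.6, "27B": 51.7},
    "Debug": {"1B": 6.9, "7B": 8.9, "27B": 48.9},
}


def slowdown(timings, slow="27B", fast="1B"):
    """Return per-task ratios, their mean, and the ratio of total times.

    Hypothetical helper for illustration only.
    """
    # Per-task ratio: how many times longer `slow` took than `fast`.
    ratios = {task: t[slow] / t[fast] for task, t in timings.items()}
    # Mean of ratios weights every task equally.
    mean_ratio = sum(ratios.values()) / len(ratios)
    # Ratio of totals weights long-running tasks more heavily.
    total_slow = sum(t[slow] for t in timings.values())
    total_fast = sum(t[fast] for t in timings.values())
    return ratios, mean_ratio, total_slow / total_fast


ratios, mean_ratio, total_ratio = slowdown(timings)

for task, ratio in ratios.items():
    print(f"{task:<8} {ratio:.1f}x")
print(f"mean of ratios:  {mean_ratio:.1f}x")
print(f"ratio of totals: {total_ratio:.1f}x")
```

The two numbers differ because the mean of ratios treats the short Evennia task the same as the long ones, while the ratio of totals is dominated by the three tasks where 27B takes around 50 seconds.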