Compare commits: fix/552...dawn/576-1 (1 commit, bd092bdc9d)

timmy-config/docs/big-brain-benchmark.md (Normal file, 86 lines changed)
@@ -0,0 +1,86 @@

# Big Brain Benchmark v5: Prompt Engineering Effects

**Date:** 2026-04-13
**Models:** gemma3:1b, qwen2.5:7b, gemma4:latest (27B)
**Hardware:** Apple Silicon Mac, local Ollama, temp=0.3
**Ref:** #576

---

## Results (4 tasks)

| Task | 1B | 7B | 27B | Winner |
|------|----|----|-----|--------|
| Webhook parser | 9.8s, 4896c | 18.6s, 3635c | 55.4s, 6756c | **27B** |
| Evennia explain | 1.7s, 1167c | 5.9s, 1720c | 7.9s, 1107c | **1B** |
| Cron YAML | 8.4s, 4073c | 21.6s, 4150c | 51.7s, 4755c | **1B** |
| Debug async | 6.9s, 3646c | 8.9s, 2219c | 48.9s, 4107c | **27B** |

**27B wins 2/4. 1B wins 2/4. 27B is 5.9x slower than 1B on average per task (6.1x on total time).**

---

## Key Discovery: Prompt Engineering Changes 27B Behavior

### Finding 1: "NOT Kubernetes" constraint fixes cron bias
**Before (prior runs):** 27B outputs `apiVersion: batch/v1, kind: CronJob`
**After (this run):** Prompt says "standard cron YAML (NOT Kubernetes)" → 27B outputs correct bare cron

The k8s bias (issue #649) is a **default behavior**, not a hard limitation. Explicit constraints work.

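
A minimal sketch of how the constraint could be applied and checked, assuming the default local Ollama endpoint and its `/api/generate` request shape. The helper names (`build_request`, `generate`, `looks_like_k8s`) and the task wording are hypothetical; the benchmark's actual harness and full prompt are not shown in this report.

```python
import json
import re
import urllib.request

# Assumption: default local Ollama address; the report only says "local Ollama".
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, task: str, constraints: list[str]) -> bytes:
    """Serialize a non-streaming generate request with constraints appended to the task."""
    prompt = task
    if constraints:
        prompt += " (" + "; ".join(constraints) + ")"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.3},  # temp=0.3 as stated in the report header
    }
    return json.dumps(payload).encode("utf-8")


def generate(body: bytes, url: str = OLLAMA_URL) -> str:
    """POST the request to a running Ollama server and return the generated text."""
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Markers taken from the "Before" output quoted above.
K8S_MARKERS = re.compile(r"^\s*(apiVersion:|kind:\s*CronJob)", re.MULTILINE)


def looks_like_k8s(output: str) -> bool:
    """True when the output contains Kubernetes CronJob manifest markers."""
    return bool(K8S_MARKERS.search(output))
```

With a server running, `looks_like_k8s(generate(build_request("gemma4:latest", "Write standard cron YAML", ["NOT Kubernetes"])))` would be expected to return `False` if the constraint holds as it did in this run.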

### Finding 2: Concise test requirement gets tests included
**Before (prior runs):** "Include type hints, docstring, and one unit test" → 27B omits test
**After (this run):** "Include type hints and one unit test" → 27B includes `def test_` with `assert`

Both 7B and 27B included tests this run. Issue #650 is prompt-dependent.


### Finding 3: Only 27B finds both async bugs (confirmed again)
**27B:** Identifies both bugs ("sequential execution and resource leakage") and uses `asyncio.gather()`
**1B:** Identifies session-per-request only
**7B:** Identifies session-per-request only

This finding is consistent across all 5 benchmark runs.

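
The task's actual buggy code is not reproduced in this report. As an illustration only, the two bug classes named above could look like the following, with a hypothetical `FakeSession` standing in for a real HTTP client session so the example runs without third-party packages.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an HTTP client session; tracks open sessions."""

    opened = 0      # sessions created so far
    still_open = 0  # sessions created but not yet closed

    def __init__(self) -> None:
        FakeSession.opened += 1
        FakeSession.still_open += 1

    async def get(self, url: str) -> str:
        await asyncio.sleep(0.05)  # simulated network latency
        return f"body of {url}"

    async def close(self) -> None:
        FakeSession.still_open -= 1


async def fetch_all_buggy(urls: list[str]) -> list[str]:
    results = []
    for url in urls:
        session = FakeSession()                 # bug: new session per request, never closed
        results.append(await session.get(url))  # bug: requests awaited one at a time
    return results


async def fetch_all_fixed(urls: list[str]) -> list[str]:
    session = FakeSession()  # one shared session for all requests
    try:
        # Run all requests concurrently; gather preserves input order.
        return await asyncio.gather(*(session.get(u) for u in urls))
    finally:
        await session.close()  # always release the session
```

Under this reading, a review that only flags session-per-request would still leave the sequential `await` loop in place, which is the gap the 1B and 7B answers showed.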

---

## Quality Markers

| Marker | 1B | 7B | 27B |
|--------|----|----|-----|
| Includes unit test | No | Yes | Yes |
| Uses k8s for cron | No | No | No (with constraint) |
| Finds both async bugs | No | No | Yes |
| Correct HMAC impl | Partial | Yes | Yes |

---

## Speed Summary

```
Task       1B      7B      27B     27B/1B
─────────────────────────────────────────
Webhook    9.8s    18.6s   55.4s   5.7x
Evennia    1.7s    5.9s    7.9s    4.6x
Cron       8.4s    21.6s   51.7s   6.2x
Debug      6.9s    8.9s    48.9s   7.1x
─────────────────────────────────────────
Total      26.8s   55.0s   163.9s  6.1x
```
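
The report quotes two slowdown figures for 27B: 5.9x in the results summary and 6.1x in the Total row. The short calculation below, using only the timings from the table, suggests they differ because one is the mean of the per-task ratios and the other is the ratio of total times.

```python
# Seconds per task for (1B, 7B, 27B), copied from the Speed Summary table.
timings = {
    "Webhook": (9.8, 18.6, 55.4),
    "Evennia": (1.7, 5.9, 7.9),
    "Cron": (8.4, 21.6, 51.7),
    "Debug": (6.9, 8.9, 48.9),
}

# Per-task slowdown of 27B relative to 1B.
per_task = {task: t27 / t1 for task, (t1, _t7, t27) in timings.items()}

# Mean of the four per-task ratios.
mean_ratio = sum(per_task.values()) / len(per_task)

# Ratio of summed 27B time to summed 1B time.
total_1b = sum(t[0] for t in timings.values())
total_27b = sum(t[2] for t in timings.values())
total_ratio = total_27b / total_1b

for task, ratio in per_task.items():
    print(f"{task:<8} {ratio:.1f}x")
print(f"mean of per-task ratios: {mean_ratio:.1f}x")
print(f"ratio of totals:         {total_ratio:.1f}x")
```

Rounded to one decimal, the mean of per-task ratios is 5.9x and the ratio of totals is 6.1x, matching both figures in the report.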

## Issues Filed

| Issue | Finding |
|-------|---------|
| #652 | "NOT Kubernetes" constraint fixes 27B cron bias |
| #653 | 27B includes test when prompt is concise |

## Recommendation

- **27B:** Code review, debugging. Use explicit constraints for infrastructure tasks. Keep test requirements concise.
- **1B:** Fast explanations, quick drafts. Don't use for debugging.
- **7B:** Balanced middle ground. Reliable test generation.

---

*5th benchmark run for #576. Prior PRs: #633, #642, #646, #651.*
*Generated 2026-04-13.*