Compare commits: fix/674 ... am/576-177 (1 commit: 9a19cd4c43)

timmy-config/docs/big-brain-benchmark.md (normal file, +78 lines)

@@ -0,0 +1,78 @@
# Big Brain Benchmark v6 — 7B Quality Gap Narrowing

**Date:** 2026-04-13
**Models:** gemma3:1b (0.8B), qwen2.5:7b (7.6B), gemma4:latest (27B)
**Ref:** #576

---

## Results (5 tasks)

| Task | 1B | 7B | 27B | Winner |
|------|----|----|-----|--------|
| Webhook parser | 14.0s, test✗ | 19.5s, test✓ | 61.3s, test✓ | **27B** |
| Evennia explain | 1.9s | 5.8s | 9.7s | **1B** |
| Cron YAML | 8.7s, k8s✗ | 15.2s, k8s✗ | 54.0s, k8s✗ | **1B** |
| Debug async | 7.9s, both✗ | 9.3s, both✓ | 54.8s, both✓ | **7B** |
| IPv4 regex | 15.4s | 18.0s | 53.5s | **1B** |

**27B wins 1/5. 1B wins 3/5. 7B wins 1/5.**
**27B is 5.6x slower on average.**
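
The win tally and the slowdown figure can be recomputed from the table. The sketch below copies the per-task times and winners from the Results table; the helper names are illustrative, not part of the benchmark harness. The report does not say which baseline or averaging method produced the 5.6x figure, so the sketch computes two plausible variants (ratio of total times, and mean of per-task ratios) rather than assuming one.

```python
from collections import Counter

# Per-task wall-clock times in seconds, copied from the Results table.
TIMES = {
    "Webhook parser": {"1B": 14.0, "7B": 19.5, "27B": 61.3},
    "Evennia explain": {"1B": 1.9, "7B": 5.8, "27B": 9.7},
    "Cron YAML": {"1B": 8.7, "7B": 15.2, "27B": 54.0},
    "Debug async": {"1B": 7.9, "7B": 9.3, "27B": 54.8},
    "IPv4 regex": {"1B": 15.4, "7B": 18.0, "27B": 53.5},
}

# Winner column, copied from the same table.
WINNERS = {
    "Webhook parser": "27B",
    "Evennia explain": "1B",
    "Cron YAML": "1B",
    "Debug async": "7B",
    "IPv4 regex": "1B",
}


def win_tally(winners):
    """Count how many tasks each model won."""
    return Counter(winners.values())


def total_time(times, model):
    """Sum one model's time over all tasks."""
    return sum(per_task[model] for per_task in times.values())


def slowdown_of_totals(times, slow, fast):
    """Ratio of total time: slow model over fast model."""
    return total_time(times, slow) / total_time(times, fast)


def mean_per_task_slowdown(times, slow, fast):
    """Average of the per-task time ratios: slow model over fast model."""
    ratios = [per_task[slow] / per_task[fast] for per_task in times.values()]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print("wins:", dict(win_tally(WINNERS)))
    for baseline in ("1B", "7B"):
        print(
            f"27B vs {baseline}:",
            f"totals {slowdown_of_totals(TIMES, '27B', baseline):.2f}x,",
            f"per-task mean {mean_per_task_slowdown(TIMES, '27B', baseline):.2f}x",
        )
```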

---

## New Finding: 7B Finds Both Async Bugs

**First time in 6 runs** that a non-27B model identified both bugs:
1. Sequential execution (needs `asyncio.gather()`)
2. Session-per-request (needs shared `ClientSession`)

Prior runs (1-5): 7B found only the session bug. This run: 7B found both.
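
For readers unfamiliar with the two bugs, here is a minimal sketch of the pattern and its fix. The benchmark's actual buggy snippet is not shown in this report, so the functions below are an assumed reconstruction. `FakeSession` is a hypothetical stand-in for `aiohttp.ClientSession` that simulates latency with `asyncio.sleep`, so the example runs without network access or third-party packages.

```python
import asyncio
import time


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession; counts sessions opened."""

    opened = 0

    async def __aenter__(self):
        FakeSession.opened += 1
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def get(self, url):
        await asyncio.sleep(0.05)  # simulated network latency
        return f"body of {url}"


async def fetch_all_buggy(urls):
    """Shows both bugs: requests run one at a time, each in a fresh session."""
    results = []
    for url in urls:  # bug 1: each request is awaited before the next starts
        async with FakeSession() as session:  # bug 2: new session per request
            results.append(await session.get(url))
    return results


async def fetch_all_fixed(urls):
    """One shared session, all requests scheduled concurrently."""
    async with FakeSession() as session:
        # gather() runs the coroutines concurrently and keeps input order.
        return await asyncio.gather(*(session.get(url) for url in urls))


def timed(fetch_fn, urls):
    """Run a fetch function; return (results, sessions opened, seconds)."""
    FakeSession.opened = 0
    start = time.perf_counter()
    results = asyncio.run(fetch_fn(urls))
    return results, FakeSession.opened, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.invalid/{i}" for i in range(5)]
    for fn in (fetch_all_buggy, fetch_all_fixed):
        _, sessions, seconds = timed(fn, urls)
        print(f"{fn.__name__}: {sessions} session(s), {seconds:.2f}s")
```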

**Filed as #659** — quality gap may be narrower than measured. Recommend multi-run statistical comparison.
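
One possible shape for the recommended multi-run comparison is a confidence interval on each model's "found both bugs" rate. The sketch below uses the Wilson score interval, which is this example's choice and not something the report prescribes. The counts assume that "Once (this run)" means 1 of 6 runs for 7B and that "Always" means 6 of 6 for 27B.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed counts over 6 runs: (runs that found both async bugs, total runs).
FOUND_BOTH = {"7B": (1, 6), "27B": (6, 6)}

if __name__ == "__main__":
    for model, (hits, runs) in FOUND_BOTH.items():
        low, high = wilson_interval(hits, runs)
        print(f"{model}: {hits}/{runs} runs, interval {low:.2f} to {high:.2f}")
```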

---

## Quality Matrix (all 6 runs combined)

| Marker | 1B | 7B | 27B |
|--------|----|----|-----|
| Includes unit test | Sometimes | Usually | Usually |
| Uses k8s for cron | Never | Never | Never (with constraint) |
| Finds both async bugs | Never | **Once (this run)** | Always |
| Correct HMAC impl | Partial | Yes | Yes |
| Concise output | No | Yes | Yes |

---

## Cumulative Benchmark Stats (6 runs)

| Metric | 1B | 7B | 27B |
|--------|----|----|-----|
| Avg total time | ~30s | ~60s | ~180s |
| Task wins (approx) | ~30% | ~20% | ~50% |
| Reliability | Variable | Good | Consistent |

---

## Issues Filed This Session

| Issue | Finding |
|-------|---------|
| #649 | 27B uses Kubernetes CronJob format (mitigated by prompt constraint) |
| #650 | 27B omits unit tests (mitigated by concise prompts) |
| #652 | "NOT Kubernetes" constraint fixes cron bias |
| #653 | 27B includes test with concise requirement |
| #659 | 7B finds both async bugs — quality gap narrowing |
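
The Kubernetes bias tracked in #649 and #652 could be checked automatically. The report does not say how the `k8s` marker is scored, so the sketch below is an assumption: it flags a model output when it contains a `kind: CronJob` line, which is how a Kubernetes CronJob manifest declares its type. The function name and the sample outputs are illustrative only.

```python
import re

# A Kubernetes CronJob manifest declares itself with a "kind: CronJob" line.
_CRONJOB_KIND = re.compile(r"""^\s*kind:\s*["']?CronJob["']?\s*$""", re.MULTILINE)


def looks_like_k8s_cronjob(output: str) -> bool:
    """Return True if the model output contains a Kubernetes CronJob manifest."""
    return _CRONJOB_KIND.search(output) is not None


if __name__ == "__main__":
    # Hypothetical model outputs, not taken from the benchmark runs.
    k8s_output = "apiVersion: batch/v1\nkind: CronJob\nmetadata:\n  name: nightly\n"
    plain_output = 'jobs:\n  - name: nightly\n    schedule: "0 3 * * *"\n'
    print("k8s manifest flagged:", looks_like_k8s_cronjob(k8s_output))
    print("plain YAML flagged:", looks_like_k8s_cronjob(plain_output))
```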

---

## Recommendation

- **27B:** Debugging, code review. Consistent high quality. Slow.
- **7B:** Balanced. May be closing quality gap on debugging (run more tests).
- **1B:** Fast explanations, simple tasks. Not for debugging.
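
The recommendation can be read as a routing table. The sketch below is one way to encode it, not an existing part of timmy-config: the model tags come from the header of this report, while the task-kind labels and the choice of 7B as the fallback for unlisted tasks are assumptions made for illustration.

```python
# Model tags as listed in the report header.
MODEL_1B = "gemma3:1b"
MODEL_7B = "qwen2.5:7b"
MODEL_27B = "gemma4:latest"

# Task-kind labels are hypothetical; the mapping follows the recommendation list.
MODEL_FOR_TASK = {
    "debugging": MODEL_27B,
    "code_review": MODEL_27B,
    "explanation": MODEL_1B,
    "simple": MODEL_1B,
}

# Assumption: the "balanced" 7B model handles anything not listed above.
DEFAULT_MODEL = MODEL_7B


def pick_model(task_kind: str) -> str:
    """Return the model tag recommended for a task kind."""
    return MODEL_FOR_TASK.get(task_kind, DEFAULT_MODEL)


if __name__ == "__main__":
    for kind in ("debugging", "explanation", "refactor"):
        print(f"{kind}: {pick_model(kind)}")
```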

---

*6th benchmark run for #576. Prior PRs: #633, #642, #646, #651, #655.*