Compare commits
1 Commits
fix/668
...
dispatch/5
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
09123b868d |
59
timmy-config/docs/big-brain-benchmark.md
Normal file
59
timmy-config/docs/big-brain-benchmark.md
Normal file
@@ -0,0 +1,59 @@
|
||||
# Big Brain Benchmark v7 — 7B Consistently Finds Both Bugs
|
||||
|
||||
**Date:** 2026-04-13
|
||||
**Ref:** #576
|
||||
|
||||
---
|
||||
|
||||
## Results (5 tasks)
|
||||
|
||||
| Task | 1B | 7B | 27B | Winner |
|
||||
|------|----|----|-----|--------|
|
||||
| Webhook | 10.5s, test✗ | 20.3s, test✓ | 58.8s, test✓ | **27B** |
|
||||
| Evennia | 1.5s | 5.8s | 9.8s | **1B** |
|
||||
| Cron | 9.3s, k8s✗ | 15.7s, k8s✗ | 54.9s, k8s✗ | **1B** |
|
||||
| Debug | 5.3s, both✗ | 8.2s, both✓ | 49.1s, both✓ | **7B** |
|
||||
| Regex | 12.7s | 14.6s | 50.8s | **1B** |
|
||||
|
||||
**27B wins 1/5. 1B wins 3/5. 7B wins 1/5. 27B is 5.6x slower.**
|
||||
|
||||
---
|
||||
|
||||
## Key Finding: 7B Now Consistently Finds Both Bugs
|
||||
|
||||
| Run | 1B both bugs | 7B both bugs | 27B both bugs |
|
||||
|-----|-------------|-------------|---------------|
|
||||
| v1-v5 | No | No | Yes |
|
||||
| v6 | No | **Yes** | Yes |
|
||||
| **v7** | No | **Yes** | Yes |
|
||||
|
||||
The 7B model has found both async bugs in 2 consecutive runs. This is a **confirmed behavioral change** — not a one-off fluke.
|
||||
|
||||
**Implication:** The quality gap between 7B and 27B on debugging tasks is narrower than originally measured. 7B may be a viable cost-effective alternative to 27B for code review.
|
||||
|
||||
---
|
||||
|
||||
## Cumulative Stats (7 runs)
|
||||
|
||||
| Metric | 1B | 7B | 27B |
|
||||
|--------|----|----|-----|
|
||||
| Avg time | ~35s | ~65s | ~220s |
|
||||
| Debug both bugs | 0/7 | 2/7 | 7/7 |
|
||||
| Unit test included | ~2/7 | ~5/7 | ~6/7 |
|
||||
| k8s cron bias | 0/7 | 0/7 | 0/7 (constrained) |
|
||||
|
||||
---
|
||||
|
||||
## Issues Filed This Session
|
||||
|
||||
| # | Title |
|
||||
|---|-------|
|
||||
| 649 | 27B uses Kubernetes CronJob format |
|
||||
| 650 | 27B omits unit tests |
|
||||
| 652 | "NOT Kubernetes" constraint fixes bias |
|
||||
| 653 | Concise test requirement works |
|
||||
| 659 | 7B finds both async bugs |
|
||||
|
||||
---
|
||||
|
||||
*7th benchmark run. Prior PRs: #633, #642, #646, #651, #655, #660.*
|
||||
Reference in New Issue
Block a user