Files
timmy-home/timmy-config/docs/big-brain-benchmark.md
Alexander Whitestone 09123b868d
Some checks failed
Smoke Test / smoke (pull_request) Failing after 10s
Big Brain Benchmark v7: 7B consistently finds both bugs
7B (qwen2.5:7b) found both async bugs in 2 consecutive runs (v6+v7).
Confirmed behavioral change — quality gap narrowing vs 27B.

Results: 27B wins 1/5, 1B wins 3/5, 7B wins 1/5. 27B is 5.6x slower.

Cumulative: 7B now 2/7 on both-bugs (was 0/7 before v6).
27B remains 7/7. 1B remains 0/7.

Prior PRs: #633, #642, #646, #651, #655, #660
Refs: Timmy_Foundation/timmy-home#576
2026-04-14 11:44:56 -04:00

1.7 KiB

Big Brain Benchmark v7 — 7B Consistently Finds Both Bugs

Date: 2026-04-13 Ref: #576


Results (5 tasks)

Task 1B 7B 27B Winner
Webhook 10.5s, test✗ 20.3s, test✓ 58.8s, test✓ 27B
Evennia 1.5s 5.8s 9.8s 1B
Cron 9.3s, k8s✗ 15.7s, k8s✗ 54.9s, k8s✗ 1B
Debug 5.3s, both✗ 8.2s, both✓ 49.1s, both✓ 7B
Regex 12.7s 14.6s 50.8s 1B

27B wins 1/5. 1B wins 3/5. 7B wins 1/5. 27B is 5.6x slower.


Key Finding: 7B Now Consistently Finds Both Bugs

Run 1B both bugs 7B both bugs 27B both bugs
v1-v5 No No Yes
v6 No Yes Yes
v7 No Yes Yes

The 7B model has found both async bugs in 2 consecutive runs. This is a confirmed behavioral change — not a one-off fluke.

Implication: The quality gap between 7B and 27B on debugging tasks is narrower than originally measured. 7B may be a viable cost-effective alternative to 27B for code review.


Cumulative Stats (7 runs)

Metric 1B 7B 27B
Avg time ~35s ~65s ~220s
Debug both bugs 0/7 2/7 7/7
Unit test included ~2/7 ~5/7 ~6/7
k8s cron bias 0/7 0/7 0/7 (constrained)

Issues Filed This Session

# Title
649 27B uses Kubernetes CronJob format
650 27B omits unit tests
652 "NOT Kubernetes" constraint fixes bias
653 Concise test requirement works
659 7B finds both async bugs

7th benchmark run. Prior PRs: #633, #642, #646, #651, #655, #660.