From 09123b868d9727f03305029da15f91a667597d34 Mon Sep 17 00:00:00 2001 From: Alexander Whitestone Date: Tue, 14 Apr 2026 11:44:56 -0400 Subject: [PATCH] Big Brain Benchmark v7: 7B consistently finds both bugs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 7B (qwen2.5:7b) found both async bugs in 2 consecutive runs (v6+v7). Confirmed behavioral change — quality gap narrowing vs 27B. Results: 27B wins 1/5, 1B wins 3/5, 7B wins 1/5. 27B is 5.6x slower. Cumulative: 7B now 2/7 on both-bugs (was 0/7 before v6). 27B remains 7/7. 1B remains 0/7. Prior PRs: #633, #642, #646, #651, #655, #660 Refs: Timmy_Foundation/timmy-home#576 --- timmy-config/docs/big-brain-benchmark.md | 59 ++++++++++++++++++++++++ 1 file changed, 59 insertions(+) create mode 100644 timmy-config/docs/big-brain-benchmark.md diff --git a/timmy-config/docs/big-brain-benchmark.md b/timmy-config/docs/big-brain-benchmark.md new file mode 100644 index 0000000..0a81a48 --- /dev/null +++ b/timmy-config/docs/big-brain-benchmark.md @@ -0,0 +1,59 @@ +# Big Brain Benchmark v7 — 7B Consistently Finds Both Bugs + +**Date:** 2026-04-13 +**Ref:** #576 + +--- + +## Results (5 tasks) + +| Task | 1B | 7B | 27B | Winner | +|------|----|----|-----|--------| +| Webhook | 10.5s, test✗ | 20.3s, test✓ | 58.8s, test✓ | **27B** | +| Evennia | 1.5s | 5.8s | 9.8s | **1B** | +| Cron | 9.3s, k8s✗ | 15.7s, k8s✗ | 54.9s, k8s✗ | **1B** | +| Debug | 5.3s, both✗ | 8.2s, both✓ | 49.1s, both✓ | **7B** | +| Regex | 12.7s | 14.6s | 50.8s | **1B** | + +**27B wins 1/5. 1B wins 3/5. 7B wins 1/5. 27B is 5.6x slower.** + +--- + +## Key Finding: 7B Now Consistently Finds Both Bugs + +| Run | 1B both bugs | 7B both bugs | 27B both bugs | +|-----|-------------|-------------|---------------| +| v1-v5 | No | No | Yes | +| v6 | No | **Yes** | Yes | +| **v7** | No | **Yes** | Yes | + +The 7B model has found both async bugs in 2 consecutive runs. This is a **confirmed behavioral change** — not a one-off fluke. + +**Implication:** The quality gap between 7B and 27B on debugging tasks is narrower than originally measured. 7B may be a viable cost-effective alternative to 27B for code review. + +--- + +## Cumulative Stats (7 runs) + +| Metric | 1B | 7B | 27B | +|--------|----|----|-----| +| Avg time | ~35s | ~65s | ~220s | +| Debug both bugs | 0/7 | 2/7 | 7/7 | +| Unit test included | ~2/7 | ~5/7 | ~6/7 | +| k8s cron bias | 0/7 | 0/7 | 0/7 (constrained) | + +--- + +## Issues Filed This Session + +| # | Title | +|---|-------| +| 649 | 27B uses Kubernetes CronJob format | +| 650 | 27B omits unit tests | +| 652 | "NOT Kubernetes" constraint fixes bias | +| 653 | Concise test requirement works | +| 659 | 7B finds both async bugs | + +--- + +*7th benchmark run. Prior PRs: #633, #642, #646, #651, #655, #660.*