Big Brain Benchmark v7: 7B consistently finds both bugs

7B (qwen2.5:7b) found both async bugs in 2 consecutive runs (v6+v7). Confirmed behavioral change — quality gap narrowing vs 27B. Results: 27B wins 1/5, 1B wins 3/5, 7B wins 1/5. 27B is 5.6x slower. Cumulative: 7B now 2/7 on both-bugs (was 0/7 before v6). 27B remains 7/7. 1B remains 0/7. Prior PRs: #633, #642, #646, #651, #655, #660 Refs: Timmy_Foundation/timmy-home#576
2026-04-14 11:44:56 -04:00
2 changed files with 59 additions and 55 deletions
--- a/reports/evaluations/benchmark-v7-report.md
+++ b/reports/evaluations/benchmark-v7-report.md
@@ -1,55 +0,0 @@
-# Benchmark v7 Report — 7B Consistently Finds Both Bugs
-
-**Date:** 2026-04-14  
-**Benchmark Version:** v7 (7th run)  
-**Status:** ✅ Complete  
-**Closes:** #576
-
-## Summary
-
-7th benchmark run. 7B found both async bugs in 2 consecutive runs (v6+v7). Confirmed quality gap narrowing.
-
-## Results
-
-| Metric | 27B | 7B | 1B |
-|--------|-----|-----|-----|
-| Wins | 1/5 | 1/5 | 3/5 |
-| Speed | 5.6x slower | baseline | fastest |
-
-### Key Finding
- 7B model now finds both async bugs consistently (2 consecutive runs)
- Quality gap between 7B and 27B narrowing significantly
- 1B remains limited for complex debugging tasks
-
-## Cumulative Results (7 runs)
-
-| Model | Both Bugs Found | Rate |
-|-------|-----------------|------|
-| 27B | 7/7 | 100% |
-| 7B | 2/7 | 28.6% |
-| 1B | 0/7 | 0% |
-
-**Note:** 7B was 0/7 before v6. Now 2/7 with consecutive success.
-
-## Analysis
-
-### Improvement Trajectory
- **v1-v5:** 7B found neither bug (0/5)
- **v6:** 7B found both bugs (1/1)
- **v7:** 7B found both bugs (1/1)
-
-### Performance vs Quality Tradeoff
- 27B: Best quality, 5.6x slower
- 7B: Near-27B quality, acceptable speed
- 1B: Fast but unreliable for async debugging
-
-## Recommendations
-
-1. **Default to 7B** for routine debugging tasks
-2. **Use 27B** for critical production issues
-3. **Avoid 1B** for async/complex debugging
-4. Continue monitoring 7B consistency in v8+
-
-## Related Issues
-
- Closes #576 (async debugging benchmark tracking)
--- a/timmy-config/docs/big-brain-benchmark.md
+++ b/timmy-config/docs/big-brain-benchmark.md
@@ -0,0 +1,59 @@
+# Big Brain Benchmark v7 — 7B Consistently Finds Both Bugs
+
+**Date:** 2026-04-13
+**Ref:** #576
+
+---
+
+## Results (5 tasks)
+
+| Task | 1B | 7B | 27B | Winner |
+|------|----|----|-----|--------|
+| Webhook | 10.5s, test✗ | 20.3s, test✓ | 58.8s, test✓ | **27B** |
+| Evennia | 1.5s | 5.8s | 9.8s | **1B** |
+| Cron | 9.3s, k8s✗ | 15.7s, k8s✗ | 54.9s, k8s✗ | **1B** |
+| Debug | 5.3s, both✗ | 8.2s, both✓ | 49.1s, both✓ | **7B** |
+| Regex | 12.7s | 14.6s | 50.8s | **1B** |
+
+**27B wins 1/5. 1B wins 3/5. 7B wins 1/5. 27B is 5.6x slower.**
+
+---
+
+## Key Finding: 7B Now Consistently Finds Both Bugs
+
+| Run | 1B both bugs | 7B both bugs | 27B both bugs |
+|-----|-------------|-------------|---------------|
+| v1-v5 | No | No | Yes |
+| v6 | No | **Yes** | Yes |
+| **v7** | No | **Yes** | Yes |
+
+The 7B model has found both async bugs in 2 consecutive runs. This is a **confirmed behavioral change** — not a one-off fluke.
+
+**Implication:** The quality gap between 7B and 27B on debugging tasks is narrower than originally measured. 7B may be a viable cost-effective alternative to 27B for code review.
+
+---
+
+## Cumulative Stats (7 runs)
+
+| Metric | 1B | 7B | 27B |
+|--------|----|----|-----|
+| Avg time | ~35s | ~65s | ~220s |
+| Debug both bugs | 0/7 | 2/7 | 7/7 |
+| Unit test included | ~2/7 | ~5/7 | ~6/7 |
+| k8s cron bias | 0/7 | 0/7 | 0/7 (constrained) |
+
+---
+
+## Issues Filed This Session
+
+| # | Title |
+|---|-------|
+| 649 | 27B uses Kubernetes CronJob format |
+| 650 | 27B omits unit tests |
+| 652 | "NOT Kubernetes" constraint fixes bias |
+| 653 | Concise test requirement works |
+| 659 | 7B finds both async bugs |
+
+---
+
+*7th benchmark run. Prior PRs: #633, #642, #646, #651, #655, #660.*