Files
timmy-home/reports/evaluations/benchmark-v7-report.md
Timmy Time 341e5f5498
Some checks failed
Smoke Test / smoke (push) Failing after 11s
fix: [BIG-BRAIN] Benchmark v7 — 7B consistently finds both bugs (#664)
Merge PR #664

Co-authored-by: Timmy Time <timmy@alexanderwhitestone.ai>
Co-committed-by: Timmy Time <timmy@alexanderwhitestone.ai>
2026-04-14 22:14:41 +00:00

1.4 KiB

Benchmark v7 Report — 7B Consistently Finds Both Bugs

Date: 2026-04-14
Benchmark Version: v7 (7th run)
Status: Complete
Closes: #576

Summary

7th benchmark run. 7B found both async bugs in 2 consecutive runs (v6+v7). Confirmed quality gap narrowing.

Results

Metric 27B 7B 1B
Wins 1/5 1/5 3/5
Speed 5.6x slower baseline fastest

Key Finding

  • 7B model now finds both async bugs consistently (2 consecutive runs)
  • Quality gap between 7B and 27B narrowing significantly
  • 1B remains limited for complex debugging tasks

Cumulative Results (7 runs)

Model Both Bugs Found Rate
27B 7/7 100%
7B 2/7 28.6%
1B 0/7 0%

Note: 7B was 0/7 before v6. Now 2/7 with consecutive success.

Analysis

Improvement Trajectory

  • v1-v5: 7B found neither bug (0/5)
  • v6: 7B found both bugs (1/1)
  • v7: 7B found both bugs (1/1)

Performance vs Quality Tradeoff

  • 27B: Best quality, 5.6x slower
  • 7B: Near-27B quality, acceptable speed
  • 1B: Fast but unreliable for async debugging

Recommendations

  1. Default to 7B for routine debugging tasks
  2. Use 27B for critical production issues
  3. Avoid 1B for async/complex debugging
  4. Continue monitoring 7B consistency in v8+
  • Closes #576 (async debugging benchmark tracking)