Merge PR #664 Co-authored-by: Timmy Time <timmy@alexanderwhitestone.ai> Co-committed-by: Timmy Time <timmy@alexanderwhitestone.ai>
Benchmark v7 Report — 7B Consistently Finds Both Bugs
Date: 2026-04-14
Benchmark Version: v7 (7th run)
Status: ✅ Complete
Closes: #576
Summary
This is the seventh benchmark run. The 7B model found both async bugs in two consecutive runs (v6 and v7), which suggests the quality gap with 27B is narrowing.
Results
| Metric | 27B | 7B | 1B |
|---|---|---|---|
| Wins | 1/5 | 1/5 | 3/5 |
| Speed | 5.6x slower | baseline | fastest |
Key Finding
- The 7B model has now found both async bugs in two consecutive runs (v6 and v7)
- The quality gap between 7B and 27B is narrowing in recent runs, although 27B still leads cumulatively (7/7 vs. 2/7)
- 1B remains limited for complex debugging tasks
Cumulative Results (7 runs)
| Model | Both Bugs Found | Rate |
|---|---|---|
| 27B | 7/7 | 100% |
| 7B | 2/7 | 28.6% |
| 1B | 0/7 | 0% |
Note: 7B was 0/5 before v6. It is now 2/7, with both successes coming in consecutive runs (v6 and v7).
Analysis
Improvement Trajectory
- v1-v5: 7B found neither bug (0/5)
- v6: 7B found both bugs (1/1)
- v7: 7B found both bugs (1/1)
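The trajectory above, combined with the 7/7 and 0/7 totals from the cumulative table, is enough to recompute the reported rates and the length of the current 7B streak. The following is a minimal sketch, not the benchmark's actual tooling: the names `OUTCOMES`, `success_rate`, `trailing_streak`, and `summarize` are hypothetical, and the per-run values for 27B and 1B are inferred from their totals.

```python
# Per-run outcomes for v1..v7: True means the model found both async bugs.
# 7B follows the v1-v7 trajectory; 27B and 1B are inferred from 7/7 and 0/7.
OUTCOMES = {
    "27B": [True] * 7,
    "7B": [False] * 5 + [True] * 2,
    "1B": [False] * 7,
}


def success_rate(results):
    """Fraction of runs in which both bugs were found."""
    return sum(results) / len(results)


def trailing_streak(results):
    """Number of consecutive successes counted back from the latest run."""
    streak = 0
    for found in reversed(results):
        if not found:
            break
        streak += 1
    return streak


def summarize(outcomes):
    """Map model -> (hits, total runs, rate in percent, current streak)."""
    return {
        model: (
            sum(results),
            len(results),
            round(100 * success_rate(results), 1),
            trailing_streak(results),
        )
        for model, results in outcomes.items()
    }


if __name__ == "__main__":
    for model, (hits, total, pct, streak) in summarize(OUTCOMES).items():
        print(f"{model}: {hits}/{total} ({pct}%), current streak: {streak}")
```

Tracking the streak separately from the cumulative rate keeps the two claims distinct: 7B's overall rate is still 2/7, while its current streak is two runs.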
Performance vs Quality Tradeoff
- 27B: Best quality (both bugs found in 7/7 runs), but 5.6x slower than 7B
- 7B: Matched 27B on both bugs in v6 and v7, at acceptable speed
- 1B: Fastest, but unreliable for async debugging (0/7)
Recommendations
- Default to 7B for routine debugging tasks
- Use 27B for critical production issues
- Avoid 1B for async/complex debugging
- Continue monitoring 7B consistency in v8+
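One way to apply these recommendations in tooling is a small routing helper. This is only a sketch under assumptions: `DebugTask`, `recommend_model`, and `is_allowed` are hypothetical names, and the two boolean flags are an illustrative simplification of "critical production issue" and "async/complex debugging", not categories defined by the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebugTask:
    """Hypothetical task descriptor; both flags are illustrative assumptions."""
    is_critical_production: bool = False
    is_async_or_complex: bool = False


def recommend_model(task: DebugTask) -> str:
    """Pick a model size following the recommendations listed above."""
    if task.is_critical_production:
        return "27B"  # use 27B for critical production issues
    return "7B"  # default to 7B for routine debugging


def is_allowed(model: str, task: DebugTask) -> bool:
    """Reject 1B for async/complex debugging; allow everything else."""
    return not (model == "1B" and task.is_async_or_complex)


if __name__ == "__main__":
    routine = DebugTask()
    outage = DebugTask(is_critical_production=True, is_async_or_complex=True)
    print(recommend_model(routine))   # routine task
    print(recommend_model(outage))    # critical production task
    print(is_allowed("1B", outage))   # 1B on an async/complex task
```

Keeping the recommendation and the guard as separate functions means the default can be revisited after v8+ without touching the rule that excludes 1B.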
Related Issues
- Closes #576 (async debugging benchmark tracking)