# Benchmark v7 Report — 7B Consistently Finds Both Bugs

**Date:** 2026-04-14
**Benchmark Version:** v7 (7th run)
**Status:** ✅ Complete
**Closes:** #576

## Summary

This is the seventh benchmark run. The 7B model found both async bugs in two consecutive runs (v6 and v7), which suggests the quality gap between 7B and 27B is narrowing.

## Results

| Metric | 27B | 7B | 1B |
|--------|-----|-----|-----|
| Wins | 1/5 | 1/5 | 3/5 |
| Speed | 5.6x slower | baseline | fastest |

### Key Findings

- The 7B model found both async bugs in two consecutive runs (v6 and v7)
- The quality gap between 7B and 27B appears to be narrowing
- 1B remains limited for complex debugging tasks

## Cumulative Results (7 runs)

| Model | Both Bugs Found | Rate |
|-------|-----------------|------|
| 27B | 7/7 | 100% |
| 7B | 2/7 | 28.6% |
| 1B | 0/7 | 0% |

**Note:** 7B was 0/5 before v6. It is now 2/7, with both successes in consecutive runs (v6 and v7).

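
The rates in the cumulative table can be reproduced from the per-run outcomes listed in this report. The following is a minimal sketch, not the benchmark's actual tooling: the names `BOTH_BUGS_FOUND`, `cumulative_rate`, and `trailing_streak` are hypothetical, and the outcome lists are transcribed by hand from the tables and trajectory in this document.

```python
# Per-run outcomes for v1..v7 (True = both async bugs found),
# transcribed from this report. The structure is an assumption
# made for illustration, not the benchmark's real data format.
BOTH_BUGS_FOUND = {
    "27B": [True] * 7,                # 7/7
    "7B": [False] * 5 + [True] * 2,   # 0/5 in v1-v5, then v6 and v7
    "1B": [False] * 7,                # 0/7
}


def cumulative_rate(outcomes):
    """Return (successes, total, percentage rounded to one decimal)."""
    successes = sum(outcomes)
    total = len(outcomes)
    return successes, total, round(100 * successes / total, 1)


def trailing_streak(outcomes):
    """Count consecutive successes at the end of the run history."""
    streak = 0
    for found in reversed(outcomes):
        if not found:
            break
        streak += 1
    return streak


for model, outcomes in BOTH_BUGS_FOUND.items():
    successes, total, pct = cumulative_rate(outcomes)
    print(f"{model}: {successes}/{total} ({pct}%), "
          f"trailing streak: {trailing_streak(outcomes)}")
```

Tracking the trailing streak alongside the cumulative rate keeps the two claims separate: the 7B rate over all seven runs is still low, while the streak captures the recent consecutive successes.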
## Analysis
### Improvement Trajectory

- **v1-v5:** 7B found neither bug (0/5)
- **v6:** 7B found both bugs (1/1)
- **v7:** 7B found both bugs (1/1)

### Performance vs Quality Tradeoff

- 27B: Best quality (both bugs found in 7/7 runs), 5.6x slower than 7B
- 7B: Matched 27B on both bugs in v6 and v7, acceptable speed
- 1B: Fast but unreliable for async debugging

## Recommendations

1. **Default to 7B** for routine debugging tasks
2. **Use 27B** for critical production issues
3. **Avoid 1B** for async/complex debugging
4. Continue monitoring 7B consistency in v8+

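
The first three recommendations amount to a small routing policy. The sketch below shows one way that policy could be encoded; the function names `recommend_model` and `is_allowed` and their parameters are hypothetical and do not come from the benchmark harness.

```python
def recommend_model(critical_production: bool = False) -> str:
    """Pick a model size for a debugging task.

    Recommendation 1: default to 7B for routine debugging.
    Recommendation 2: use 27B for critical production issues.
    """
    return "27B" if critical_production else "7B"


def is_allowed(model: str, async_or_complex: bool) -> bool:
    """Recommendation 3: reject 1B for async or complex debugging."""
    return not (model == "1B" and async_or_complex)


print(recommend_model())                           # routine task
print(recommend_model(critical_production=True))   # critical issue
print(is_allowed("1B", async_or_complex=True))     # 1B on an async bug
```

Keeping the 1B restriction as a separate check, rather than folding it into the selector, lets a caller validate a manually chosen model as well as an automatically recommended one.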
## Related Issues

- Closes #576 (async debugging benchmark tracking)