Timmy_Foundation/timmy-home

Files

Timmy Time 341e5f5498

Smoke Test / smoke (push) Failing after 11s

Details

fix: [BIG-BRAIN] Benchmark v7 — 7B consistently finds both bugs (#664 )

Merge PR #664

Co-authored-by: Timmy Time <timmy@alexanderwhitestone.ai>
Co-committed-by: Timmy Time <timmy@alexanderwhitestone.ai>

2026-04-14 22:14:41 +00:00

1.4 KiB

Raw Blame History

Benchmark v7 Report — 7B Consistently Finds Both Bugs

Date: 2026-04-14
Benchmark Version: v7 (7th run)
Status: ✅ Complete
Closes: #576

Summary

7th benchmark run. 7B found both async bugs in 2 consecutive runs (v6+v7). Confirmed quality gap narrowing.

Results

Metric	27B	7B	1B
Wins	1/5	1/5	3/5
Speed	5.6x slower	baseline	fastest

Key Finding

7B model now finds both async bugs consistently (2 consecutive runs)
Quality gap between 7B and 27B narrowing significantly
1B remains limited for complex debugging tasks

Cumulative Results (7 runs)

Model	Both Bugs Found	Rate
27B	7/7	100%
7B	2/7	28.6%
1B	0/7	0%

Note: 7B was 0/7 before v6. Now 2/7 with consecutive success.

Analysis

Improvement Trajectory

v1-v5: 7B found neither bug (0/5)
v6: 7B found both bugs (1/1)
v7: 7B found both bugs (1/1)

Performance vs Quality Tradeoff

27B: Best quality, 5.6x slower
7B: Near-27B quality, acceptable speed
1B: Fast but unreliable for async debugging

Recommendations

Default to 7B for routine debugging tasks
Use 27B for critical production issues
Avoid 1B for async/complex debugging
Continue monitoring 7B consistency in v8+

Closes #576 (async debugging benchmark tracking)