Files

Alexander Whitestone 09123b868d

Smoke Test / smoke (pull_request) Failing after 10s

Details

Big Brain Benchmark v7: 7B consistently finds both bugs

7B (qwen2.5:7b) found both async bugs in 2 consecutive runs (v6+v7).
Confirmed behavioral change — quality gap narrowing vs 27B.

Results: 27B wins 1/5, 1B wins 3/5, 7B wins 1/5. 27B is 5.6x slower.

Cumulative: 7B now 2/7 on both-bugs (was 0/7 before v6).
27B remains 7/7. 1B remains 0/7.

Prior PRs: #633, #642, #646, #651, #655, #660
Refs: Timmy_Foundation/timmy-home#576

2026-04-14 11:44:56 -04:00

1.7 KiB

Raw Blame History

Big Brain Benchmark v7 — 7B Consistently Finds Both Bugs

Date: 2026-04-13 Ref: #576

Results (5 tasks)

Task	1B	7B	27B	Winner
Webhook	10.5s, test✗	20.3s, test✓	58.8s, test✓	27B
Evennia	1.5s	5.8s	9.8s	1B
Cron	9.3s, k8s✗	15.7s, k8s✗	54.9s, k8s✗	1B
Debug	5.3s, both✗	8.2s, both✓	49.1s, both✓	7B
Regex	12.7s	14.6s	50.8s	1B

27B wins 1/5. 1B wins 3/5. 7B wins 1/5. 27B is 5.6x slower.

Key Finding: 7B Now Consistently Finds Both Bugs

Run	1B both bugs	7B both bugs	27B both bugs
v1-v5	No	No	Yes
v6	No	Yes	Yes
v7	No	Yes	Yes

The 7B model has found both async bugs in 2 consecutive runs. This is a confirmed behavioral change — not a one-off fluke.

Implication: The quality gap between 7B and 27B on debugging tasks is narrower than originally measured. 7B may be a viable cost-effective alternative to 27B for code review.

Cumulative Stats (7 runs)

Metric	1B	7B	27B
Avg time	~35s	~65s	~220s
Debug both bugs	0/7	2/7	7/7
Unit test included	~2/7	~5/7	~6/7
k8s cron bias	0/7	0/7	0/7 (constrained)

Issues Filed This Session

#	Title
649	27B uses Kubernetes CronJob format
650	27B omits unit tests
652	"NOT Kubernetes" constraint fixes bias
653	Concise test requirement works
659	7B finds both async bugs

7th benchmark run. Prior PRs: #633, #642, #646, #651, #655, #660.

1.7 KiB Raw Blame History