Alexander Whitestone
8cbf8cb395 Big Brain Benchmark v4: 6-task quality comparison (1B vs 7B vs 27B)
Extended benchmark with 6 tasks across 3 local Ollama models.
Includes 2 new issues filed for discovered model biases.

Results: 27B wins 3/6, 1B wins 2/6, 7B wins 1/6. 27B is 4.0x slower.

Key findings:
- 27B has Kubernetes bias for cron YAML (#649 filed)
- 27B omits unit tests despite explicit requirement (#650 filed)
- Debug quality gap is massive: 27B finds both bugs, 1B finds one
- 1B over-generates (4318c for regex vs 27B's 1671c)

Issues filed:
- #649: 27B uses Kubernetes CronJob format instead of bare cron
- #650: 27B omits unit tests despite explicit prompt requirement

File: timmy-config/docs/big-brain-benchmark.md
Refs: Timmy_Foundation/timmy-home#576
2026-04-13 22:20:52 -04:00


@@ -0,0 +1,109 @@
# Big Brain Benchmark v4 — 1B vs 7B vs 27B
**Date:** 2026-04-13
**Models:** gemma3:1b, qwen2.5:7b, gemma4:latest (27B)
**Hardware:** Apple Silicon Mac, local Ollama, temp=0.3
**Ref:** #576

---
## Results (6 tasks)
| Task | 1B (time, output chars) | 7B (time, output chars) | 27B (time, output chars) | Winner |
|------|-------------------------|-------------------------|--------------------------|--------|
| Webhook parser | 9.1s, 4238c | 17.6s, 2921c | 53.8s, 7085c | **27B** |
| Evennia explain | 1.9s, 1072c | 5.8s, 1680c | 8.1s, 1031c | **1B** |
| Cron YAML | 15.8s, 3141c | 18.7s, 3895c | 51.4s, 4367c | **1B** |
| Debug async | 9.5s, 4933c | 10.8s, 2732c | 48.2s, 4294c | **27B** |
| IPv4 regex | 15.2s, 4318c | 18.1s, 2496c | 51.3s, 1671c | **7B** |
| Entity hierarchy | 12.6s, 5848c | 15.3s, 3226c | 51.2s, 4756c | **27B** |
**27B wins 3/6. 1B wins 2/6. 7B wins 1/6.**
**27B is 4.0x slower on average.**

---
## Key Findings
### Finding 1: 27B has Kubernetes bias
When asked for cron YAML, 27B generates `apiVersion: batch/v1, kind: CronJob` (k8s format) instead of standard cron. 1B produces correct bare-cron YAML. **Filed as #649.**
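For contrast, the two output shapes look roughly like this (job names, script paths, and image tags below are hypothetical, not taken from the benchmark transcripts):

```yaml
# Bare cron-style YAML — the format the task asked for (field names illustrative)
jobs:
  - name: nightly-backup
    schedule: "0 3 * * *"
    command: /usr/local/bin/backup.sh
---
# Kubernetes CronJob manifest — the format 27B defaults to
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: backup:latest   # hypothetical image
          restartPolicy: OnFailure
```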
### Finding 2: 27B omits tests
Despite explicit "include one unit test" in the prompt, 27B generates complete implementation without tests. 1B and 7B also miss this. **Filed as #650.**
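For scale, the kind of test the prompt asked for is only a few lines. A hypothetical example (the `parse_event` helper is illustrative, not any model's actual output):

```python
import json
import unittest

def parse_event(raw: bytes) -> dict:
    # Hypothetical webhook helper: decode the body, require an "event" key.
    payload = json.loads(raw)
    if "event" not in payload:
        raise ValueError("missing 'event' field")
    return payload

class ParseEventTest(unittest.TestCase):
    def test_valid_payload(self):
        self.assertEqual(parse_event(b'{"event": "push"}')["event"], "push")

    def test_missing_event_raises(self):
        with self.assertRaises(ValueError):
            parse_event(b"{}")

# run with: python -m unittest <module>
```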
### Finding 3: Debugging quality gap is massive
27B identifies both bugs (sequential execution + session-per-request). 1B identifies only the session bug. 7B identifies neither clearly. The quality gap on diagnostic tasks is the widest across all benchmarks.
### Finding 4: 1B over-generates
1B produces 4318 chars for IPv4 regex where 27B produces 1671. 1B adds excessive explanation. 27B is more concise. For regex specifically, 7B hits the sweet spot (2496c, correct pattern).

---
## Task Details
### Task 1: Webhook Parser
**27B** (53.8s): Complete `@dataclass`, `hmac.new()`, error chaining with `from exc`. Best structure. **Missing: unit test.**
**7B** (17.6s): Clean `TypedDict` approach. Correct signature verification. **Missing: unit test.**
**1B** (9.1s): Class-based with logging. Imports Pydantic's `validator` decorator, a third-party dependency, so the code fails at import in a stdlib-only environment. **Verdict:** Looks professional but doesn't run as-is.
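The `hmac.new()` verification that 27B built its answer around takes only a few stdlib lines; a minimal sketch (function name and parameters here are illustrative):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    # Recompute HMAC-SHA256 over the raw request body and compare
    # in constant time to avoid timing side channels.
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```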
### Task 4: Debug Async
**27B** (48.2s): Identifies **both** bugs. "This is a classic performance optimization problem... sequential execution and resource leakage." Fix: `asyncio.gather()` + shared session. **Complete.**
**7B** (10.8s): Mentions async context handling. Misses sequential fetch. **Partial.**
**1B** (9.5s): Identifies session-per-request bug. Misses sequential fetch. **Partial.**
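The fix 27B describes — concurrency via `asyncio.gather()` plus a single shared session — can be sketched with stdlib asyncio alone (the `fetch` coroutine and dict-based session are stand-ins for a real HTTP client):

```python
import asyncio

async def fetch(session: dict, url: str) -> str:
    # Stand-in for a real request issued through the shared session.
    await asyncio.sleep(0.01)
    return f"{session['id']}{url}"

async def fetch_all(urls: list[str]) -> list[str]:
    session = {"id": "shared"}  # one session reused for every request
    # gather() runs all fetches concurrently, fixing the
    # await-one-per-loop-iteration (sequential execution) bug.
    return await asyncio.gather(*(fetch(session, u) for u in urls))

results = asyncio.run(fetch_all(["/a", "/b", "/c"]))
```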
### Task 5: IPv4 Regex
**7B** (18.1s): Correct regex, reasonable length. **Best balance.**
**27B** (51.3s): Correct regex, concise explanation. Pattern: `^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\\.){3}...$` **Correct but minimal.**
**1B** (15.2s): Correct regex but 4318 chars of explanation. **Over-verbose.**
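A stdlib sketch of the pattern class this task targets — each octet 0-255 with no leading zeros (written fresh here, not copied verbatim from any model's answer):

```python
import re

# One octet: 250-255 | 200-249 | 100-199 | 10-99 | 0-9 (no leading zeros).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$")

assert IPV4_RE.fullmatch("192.168.0.1")
assert not IPV4_RE.fullmatch("256.1.1.1")   # octet out of range
assert not IPV4_RE.fullmatch("1.2.3")       # too few octets
```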
### Task 6: Entity Hierarchy
**27B** (51.2s): Dataclass-based, clean hierarchy, `to_dict()` serialization. **Best design.**
**7B** (15.3s): Class-based, functional. Missing serialization detail. **Good.**
**1B** (12.6s): Over-engineered with 5848 chars. Includes methods not requested. **Verbose.**
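The dataclass-plus-serialization approach 27B took can be sketched with stdlib tools (`Entity` and its fields here are illustrative, not 27B's literal output):

```python
from dataclasses import asdict, dataclass, field

@dataclass
class Entity:
    name: str
    children: list["Entity"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses through nested dataclasses and lists,
        # so the whole hierarchy serializes in one call.
        return asdict(self)

root = Entity("world", [Entity("room", [Entity("sword")])])
```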

---
## Speed Distribution
```
Task        1B      7B      27B     27B/1B
─────────────────────────────────────────
Webhook 9.1s 17.6s 53.8s 5.9x
Evennia 1.9s 5.8s 8.1s 4.3x
Cron 15.8s 18.7s 51.4s 3.3x
Debug 9.5s 10.8s 48.2s 5.1x
Regex 15.2s 18.1s 51.3s 3.4x
Entity 12.6s 15.3s 51.2s 4.1x
─────────────────────────────────────────
Average 10.7s 14.4s 44.0s 4.0x
```
## Issues Filed
| Issue | Title | Source |
|-------|-------|--------|
| #649 | 27B uses Kubernetes CronJob format instead of bare cron | Task 3 |
| #650 | 27B omits unit tests despite explicit prompt | Task 1 |
## Recommendation
- **27B:** Code review, debugging, architecture (3 wins, highest quality)
- **7B:** Regex, quick drafts, balanced output (1 win, best conciseness)
- **1B:** Explanations, fast lookups (2 wins, fastest)
---
*Generated 2026-04-13. All models via local Ollama.*
*Note: 3 prior PRs for #576 exist (#633, #642, #646). Recommend closing duplicates.*