Compare commits: main...triage/576 (1 commit, 8cbf8cb395)

timmy-config/docs/big-brain-benchmark.md (normal file, 109 lines)
@@ -0,0 +1,109 @@
# Big Brain Benchmark v4 — 1B vs 7B vs 27B

**Date:** 2026-04-13
**Models:** gemma3:1b, qwen2.5:7b, gemma4:latest (27B)
**Hardware:** Apple Silicon Mac, local Ollama, temp=0.3
**Ref:** #576

---

## Results (6 tasks)

Each cell gives response time in seconds and output length in characters (c).

| Task | 1B | 7B | 27B | Winner |
|------|----|----|-----|--------|
| Webhook parser | 9.1s, 4238c | 17.6s, 2921c | 53.8s, 7085c | **27B** |
| Evennia explain | 1.9s, 1072c | 5.8s, 1680c | 8.1s, 1031c | **1B** |
| Cron YAML | 15.8s, 3141c | 18.7s, 3895c | 51.4s, 4367c | **1B** |
| Debug async | 9.5s, 4933c | 10.8s, 2732c | 48.2s, 4294c | **27B** |
| IPv4 regex | 15.2s, 4318c | 18.1s, 2496c | 51.3s, 1671c | **7B** |
| Entity hierarchy | 12.6s, 5848c | 15.3s, 3226c | 51.2s, 4756c | **27B** |

**27B wins 3/6. 1B wins 2/6. 7B wins 1/6.**
**27B is roughly 4x slower than 1B on average (44.0s vs 10.7s mean response time).**

---

## Key Findings

### Finding 1: 27B has Kubernetes bias

When asked for cron YAML, 27B generates a Kubernetes manifest (`apiVersion: batch/v1`, `kind: CronJob`) instead of a plain cron schedule. 1B produces the expected bare-cron YAML. **Filed as #649.**
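
A check like the following could flag this failure mode automatically. It is a minimal sketch: `looks_like_k8s_cronjob` is a hypothetical helper, not part of the benchmark, and both sample outputs are invented for illustration rather than copied from the models.

```python
import re

# Hypothetical markers for a Kubernetes CronJob manifest.
_K8S_MARKERS = (
    re.compile(r"^\s*apiVersion:\s*batch/v1\s*$", re.MULTILINE),
    re.compile(r"^\s*kind:\s*CronJob\s*$", re.MULTILINE),
)


def looks_like_k8s_cronjob(yaml_text: str) -> bool:
    """Return True when both CronJob marker lines appear in the text."""
    return all(pattern.search(yaml_text) for pattern in _K8S_MARKERS)


# Invented example of the Kubernetes-style answer.
K8S_OUTPUT = """\
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
"""

# Invented example of a bare-cron YAML answer (the layout is an assumption).
BARE_CRON_OUTPUT = """\
jobs:
  - name: nightly-backup
    schedule: "0 2 * * *"
    command: /usr/local/bin/backup.sh
"""

print(looks_like_k8s_cronjob(K8S_OUTPUT))
print(looks_like_k8s_cronjob(BARE_CRON_OUTPUT))
```

The check is line-based on purpose, so it needs no YAML parser and still works on output that is not valid YAML.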

### Finding 2: 27B omits tests

Despite an explicit "include one unit test" instruction in the prompt, 27B generates a complete implementation without tests. 1B and 7B also miss this. **Filed as #650.**
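
One way to score this requirement mechanically is to parse the generated code and look for a test. The sketch below is a heuristic, assuming the Python code has already been extracted from the model's response; `has_unit_test` is a hypothetical name.

```python
import ast


def has_unit_test(source: str) -> bool:
    """Heuristic: True if the code defines a test_* function or a TestCase subclass."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # code that does not parse cannot contain a runnable test
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name.startswith("test_"):
                return True
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                # Handles both `TestCase` and `unittest.TestCase`.
                if isinstance(base, ast.Attribute):
                    name = base.attr
                else:
                    name = getattr(base, "id", "")
                if name == "TestCase":
                    return True
    return False
```

This only detects that a test exists; it does not run the test or judge its quality.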

### Finding 3: Debugging quality gap is massive

27B identifies both bugs (sequential execution and session-per-request). 1B identifies only the session bug. 7B identifies neither clearly. This is the widest quality gap among the benchmark tasks.

### Finding 4: 1B over-generates

1B produces 4318 chars for the IPv4 regex task where 27B produces 1671; the extra length is mostly explanation. For regex specifically, 7B hits the sweet spot (2496c, correct pattern).

---

## Task Details

### Task 1: Webhook Parser

**27B** (53.8s): Complete `@dataclass`, `hmac.new()`, error chaining with `from exc`. Best structure. **Missing: unit test.**

**7B** (17.6s): Clean `TypedDict` approach. Correct signature verification. **Missing: unit test.**

**1B** (9.1s): Class-based with logging. Imports the `validator` decorator from Pydantic, a third-party package, so the code is not stdlib-only. **Verdict:** Looks professional but has import errors.
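
For reference, a stdlib-only parser in the style credited to 27B might look like the sketch below. It is not any model's actual output. The HMAC-SHA256 scheme, the `event` and `data` field names, and the `WebhookError` and `WebhookEvent` names are all assumptions.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


class WebhookError(ValueError):
    """Raised when a webhook fails verification or parsing."""


@dataclass
class WebhookEvent:
    event_type: str
    payload: dict


def parse_webhook(body: bytes, signature_hex: str, secret: bytes) -> WebhookEvent:
    # Compute HMAC-SHA256 over the raw body and compare in constant time.
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        raise WebhookError("signature mismatch")
    try:
        data = json.loads(body)
        return WebhookEvent(event_type=data["event"], payload=data.get("data", {}))
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        # Chain the original exception so the root cause stays visible.
        raise WebhookError("malformed webhook payload") from exc
```

The signature is checked before the body is parsed, so unauthenticated input never reaches the JSON decoder.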

### Task 4: Debug Async

**27B** (48.2s): Identifies **both** bugs. "This is a classic performance optimization problem... sequential execution and resource leakage." Fix: `asyncio.gather()` + shared session. **Complete.**

**7B** (10.8s): Mentions async context handling. Misses sequential fetch. **Partial.**

**1B** (9.5s): Identifies session-per-request bug. Misses sequential fetch. **Partial.**
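
The two bugs and the fix can be illustrated with a small simulation. The original task code is not reproduced in this report, so the sketch below is a reconstruction under assumptions: `FakeSession` stands in for a real HTTP client session and `asyncio.sleep` stands in for network latency.

```python
import asyncio


class FakeSession:
    """Stand-in for an HTTP client session; counts how many were opened."""

    opened = 0

    async def __aenter__(self):
        FakeSession.opened += 1
        return self

    async def __aexit__(self, *exc_info):
        return False

    async def get(self, url: str) -> str:
        await asyncio.sleep(0.05)  # simulated network latency
        return f"body of {url}"


async def fetch_all_buggy(urls):
    results = []
    for url in urls:  # bug 1: requests run one after another
        async with FakeSession() as session:  # bug 2: new session per request
            results.append(await session.get(url))
    return results


async def fetch_all_fixed(urls):
    async with FakeSession() as session:  # one shared session
        # gather runs the requests concurrently and keeps input order
        return await asyncio.gather(*(session.get(u) for u in urls))
```

With three URLs, the buggy version opens three sessions and waits for each request in turn, while the fixed version opens one session and overlaps the waits.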

### Task 5: IPv4 Regex

**7B** (18.1s): Correct regex, reasonable length. **Best balance.**

**27B** (51.3s): Correct regex, concise explanation. Pattern: `^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\\.){3}...$` **Correct but minimal.**

**1B** (15.2s): Correct regex but 4318 chars of explanation. **Over-verbose.**
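
A complete pattern of this general shape can be sketched as follows. The octet alternation matches the visible part of the 27B pattern, but the closing octet is an assumption, since the report elides it. `is_ipv4` is a hypothetical helper name.

```python
import re

# One octet: 0-255 with no leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# Three "octet." groups followed by a final octet (the final part is assumed).
IPV4_RE = re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$")


def is_ipv4(text: str) -> bool:
    """Return True if text is a dotted-quad IPv4 address."""
    # fullmatch also rejects a trailing newline, which `$` alone would allow.
    return IPV4_RE.fullmatch(text) is not None
```

Ordering the alternatives from longest to shortest keeps each branch unambiguous and rejects values such as `256` and `01`.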

### Task 6: Entity Hierarchy

**27B** (51.2s): Dataclass-based, clean hierarchy, `to_dict()` serialization. **Best design.**

**7B** (15.3s): Class-based, functional. Missing serialization detail. **Good.**

**1B** (12.6s): Over-engineered with 5848 chars. Includes methods not requested. **Verbose.**

---

## Speed Distribution

```
Task      1B      7B      27B     27B/1B
─────────────────────────────────────────
Webhook   9.1s    17.6s   53.8s   5.9x
Evennia   1.9s    5.8s    8.1s    4.3x
Cron      15.8s   18.7s   51.4s   3.3x
Debug     9.5s    10.8s   48.2s   5.1x
Regex     15.2s   18.1s   51.3s   3.4x
Entity    12.6s   15.3s   51.2s   4.1x
─────────────────────────────────────────
Average   10.7s   14.4s   44.0s   4.1x
```
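
The averages and per-task ratios can be recomputed from the timings with a few lines. This is a sketch for checking the arithmetic, assuming the ratio column is 27B time divided by 1B time and the overall ratio is total 27B time over total 1B time; `summarize` is a hypothetical helper.

```python
from statistics import mean

# Seconds per task, copied from the table: (1B, 7B, 27B).
TIMES = {
    "Webhook": (9.1, 17.6, 53.8),
    "Evennia": (1.9, 5.8, 8.1),
    "Cron": (15.8, 18.7, 51.4),
    "Debug": (9.5, 10.8, 48.2),
    "Regex": (15.2, 18.1, 51.3),
    "Entity": (12.6, 15.3, 51.2),
}


def summarize(times):
    """Return (mean seconds per model, 27B/1B ratio per task, overall ratio)."""
    columns = list(zip(*times.values()))  # one tuple of timings per model
    means = tuple(round(mean(col), 1) for col in columns)
    ratios = {task: round(t27 / t1, 1) for task, (t1, _t7, t27) in times.items()}
    overall = round(sum(columns[2]) / sum(columns[0]), 1)
    return means, ratios, overall


means, ratios, overall = summarize(TIMES)
print(means, ratios, overall)
```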

## Issues Filed

| Issue | Title | Source |
|-------|-------|--------|
| #649 | 27B uses Kubernetes CronJob format instead of bare cron | Task 3 |
| #650 | 27B omits unit tests despite explicit prompt | Task 1 |

## Recommendation

- **27B:** Code review, debugging, architecture (3 wins, highest quality)
- **7B:** Regex, quick drafts, balanced output (1 win, best conciseness)
- **1B:** Explanations, fast lookups (2 wins, fastest)
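
If these recommendations were turned into a routing rule, it might look like the sketch below. The model tags come from the benchmark header. The task category names, the `pick_model` helper, and the choice of 7B as the fallback are illustrative assumptions.

```python
# Task categories are hypothetical labels for the recommendation list.
MODEL_FOR_TASK = {
    "code_review": "gemma4:latest",
    "debugging": "gemma4:latest",
    "architecture": "gemma4:latest",
    "regex": "qwen2.5:7b",
    "quick_draft": "qwen2.5:7b",
    "explanation": "gemma3:1b",
    "fast_lookup": "gemma3:1b",
}


def pick_model(task_kind: str, default: str = "qwen2.5:7b") -> str:
    """Return the Ollama model tag for a task kind, or the assumed fallback."""
    return MODEL_FOR_TASK.get(task_kind, default)


print(pick_model("debugging"))
print(pick_model("unknown_kind"))
```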

---

*Generated 2026-04-13. All models via local Ollama.*
*Note: 3 prior PRs for #576 exist (#633, #642, #646). Recommend closing duplicates.*