Author SHA1 Message Date
Alexander Whitestone
f9b632f0d2 Big Brain Benchmark: 1B vs 7B vs 27B quality comparison
Some checks failed
Smoke Test / smoke (pull_request) Failing after 10s
Run 4 identical tasks across 3 local Ollama models with measured
latency and full output comparison.

Results:
- 27B wins 2/4 (webhook parser, async debugging)
- 1B wins 2/4 (Evennia explanation, cron YAML)
- 27B is 4.2x slower on average
- Debug task: 27B found BOTH bugs; 1B and 7B found only one
- Config task: 1B produced the output most appropriate for our infra

Key insight: quality gap is task-dependent. 27B for debugging/
code review. 1B for speed/explanations/config.

File: timmy-config/docs/big-brain-benchmark.md
Refs: Timmy_Foundation/timmy-home#576
2026-04-13 21:58:20 -04:00


@@ -0,0 +1,171 @@
# Big Brain Benchmark — 1B vs 7B vs 27B
**Date:** 2026-04-13
**Models:** gemma3:1b (0.8B), qwen2.5:7b (7.6B), gemma4:latest (27B)
**Hardware:** Apple Silicon Mac, local Ollama
**Temperature:** 0.3, max tokens: 2048
**Ref:** #576
---
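
The settings above (temperature 0.3, 2048 max tokens, local Ollama) can be reproduced with a small timing harness. The sketch below is an assumption about how such a run could look, not the script used for this benchmark: `run_task`, `post_json`, and the injectable `send` parameter are hypothetical names, and the URL assumes Ollama's default local port. It uses Ollama's `/api/generate` endpoint with `stream` disabled and records wall-clock latency plus output length in characters, the two figures reported in the summary table.

```python
import json
import time
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def post_json(url, payload):
    """Send a JSON POST with the standard library and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_task(model, prompt, send=post_json):
    """Run one prompt against one model; return (seconds, output_chars, text)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # wait for the full completion before returning
        "options": {"temperature": 0.3, "num_predict": 2048},
    }
    start = time.perf_counter()
    reply = send(OLLAMA_URL, payload)
    elapsed = time.perf_counter() - start
    text = reply.get("response", "")
    return elapsed, len(text), text
```

Calling `run_task("gemma3:1b", prompt)` for each model tag and each of the four prompts would yield one latency and character count per cell of the table below. The `send` parameter exists only so the function can be exercised without a running server.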
## Summary
| Task | 1B | 7B | 27B | Winner |
|------|----|----|-----|--------|
| Python webhook parser | 16.3s, 7559c | 17.5s, 3307c | 59.0s, 6994c | **27B** |
| Evennia architecture | 2.1s, 1359c | 5.2s, 1443c | 11.4s, 1328c | **1B** |
| Cron job YAML | 9.0s, 4237c | 8.1s, 1676c | 16.8s, 1433c | **1B** |
| Debug async bug | 5.8s, 3037c | 10.2s, 2499c | 53.2s, 4565c | **27B** |
**27B wins 2/4 on quality. 1B wins 2/4 on speed+accuracy.**
**27B is 4.2x slower than 1B on average (35.1s vs 8.3s).**
---
## Task 1: Python Webhook Parser
### 1B (16.3s)
Generated a class-based `WebhookPayloadValidator` with logging. It imports `dataclass` but then defines a plain class with manual validation. It includes HMAC verification, but it also applies the `validator` decorator from Pydantic, which is not available for stdlib dataclasses. **Verdict:** Looks professional but has import errors.
### 7B (17.5s)
Clean `TypedDict` approach with `hmac.new()`. Correct signature verification. **Missing:** the unit test, which the prompt requested. **Verdict:** Correct but incomplete.
### 27B (59.0s)
Complete `@dataclass` with `hmac.new()`, proper error chaining (`from exc`), helper methods (`get_repo_full_name`). Structured, production-quality. **Missing:** unit test not included. **Verdict:** Best structure and error handling.
**Winner: 27B** — cleanest, most production-ready code.
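
The signature check that the 7B and 27B outputs both build on `hmac.new()` can be sketched as follows. This is a minimal illustration using only the standard library, not any model's verbatim output: `sign` is a hypothetical helper, and the payload and secret are made-up values. It assumes the webhook sender supplies a hex-encoded HMAC-SHA256 of the raw request body. The comparison uses `hmac.compare_digest` so that timing does not leak how many leading characters matched.

```python
import hashlib
import hmac


def sign(payload: bytes, secret: str) -> str:
    """Hex HMAC-SHA256 of the raw request body, keyed with the shared secret."""
    return hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()


def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign(payload, secret), signature)


body = b'{"action": "opened"}'  # hypothetical payload
secret = "example-secret"       # hypothetical shared secret
good = sign(body, secret)

print(verify_signature(body, good, secret))          # True
print(verify_signature(body + b" ", good, secret))   # False: body was altered
print(verify_signature(body, good, "other-secret"))  # False: wrong secret
```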
---
## Task 2: Evennia Architecture (200 words target)
### 1B (2.1s, 1359 chars)
Covers Data/Logic/Presentation layers. Mentions Django/Twisted. Good architectural overview with layer separation. **Word count:** ~200. Fast and accurate.
### 7B (5.2s, 1443 chars)
Good overview of typeclasses, commands, channels. Mentions Django models. Slightly more practical. **Word count:** ~210.
### 27B (11.4s, 1328 chars)
Mentions grammar-based parser and separation of concerns. Most technically precise. **Word count:** ~195, within about 5 words of the 200-word target.
**Winner: 1B** — fastest, accurate enough, hits word count.
---
## Task 3: Cron Job YAML
### 1B (9.0s, 4237 chars)
Full YAML with `name`, `description`, `schedule`, `hosts` array, `checks` (disk/memory/ollama_health), `notifications` (telegram), `restart` config. Verbose but comprehensive and **matches our actual infrastructure** (bare cron, not k8s).
### 7B (8.1s, 1676 chars)
Minimal YAML with `schedule`, `hosts`, `tasks` per host. Misses auto-restart and Telegram integration specifics. Too bare.
### 27B (16.8s, 1433 chars)
YAML with `apiVersion: batch/v1, kind: CronJob`: **Kubernetes format, wrong for our infra**. We use bare cron, not k8s.
**Winner: 1B** — most appropriate for actual infrastructure, comprehensive.
---
## Task 4: Debug Async Bug
### 1B (5.8s)
Identified: new `ClientSession` per request (correct). Missed: sequential fetch (the bigger performance bug). Fix: pass session as parameter. **Partial fix.**
### 7B (10.2s)
Identified: `aiohttp.ClientSession` context issue. Mentioned async context handling. Missed sequential fetch. **Partial fix.**
### 27B (53.2s)
Identified: **both bugs** — sequential execution AND session-per-request. "The performance impact is massive. The original code will take approximately 100 times longer than necessary." Fix: `asyncio.gather()` + shared session. **Complete fix with quantified impact.**
**Winner: 27B** — only model that found both bugs.
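
The two bugs and the combined fix can be illustrated with a self-contained sketch. The buggy code from the benchmark prompt is not reproduced in this document, so everything below is a hypothetical reconstruction: `FakeSession` stands in for `aiohttp.ClientSession` and simulates latency with `asyncio.sleep`, and the URLs are placeholders. The buggy version opens a session per request and awaits each fetch in turn; the fixed version shares one session and schedules all fetches with `asyncio.gather()`. With N requests of similar latency, the sequential version takes roughly N times as long.

```python
import asyncio
import time


class FakeSession:
    """Stand-in for an HTTP client session; counts how many were opened."""

    opened = 0

    async def __aenter__(self):
        FakeSession.opened += 1
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def get(self, url, delay=0.02):
        await asyncio.sleep(delay)  # simulated network latency
        return f"body of {url}"


async def fetch_all_buggy(urls):
    """Both bugs: a new session per request, and one request at a time."""
    results = []
    for url in urls:
        async with FakeSession() as session:
            results.append(await session.get(url))
    return results


async def fetch_all_fixed(urls):
    """One shared session; all requests run concurrently via gather()."""
    async with FakeSession() as session:
        return await asyncio.gather(*(session.get(url) for url in urls))


def timed(coro_fn, urls):
    """Run a coroutine function; return (results, seconds, sessions opened)."""
    FakeSession.opened = 0
    start = time.perf_counter()
    results = asyncio.run(coro_fn(urls))
    return results, time.perf_counter() - start, FakeSession.opened


urls = [f"https://example.invalid/{i}" for i in range(20)]  # placeholder URLs
slow_results, slow_s, slow_sessions = timed(fetch_all_buggy, urls)
fast_results, fast_s, fast_sessions = timed(fetch_all_fixed, urls)
print(f"buggy: {slow_s:.2f}s, {slow_sessions} sessions")
print(f"fixed: {fast_s:.2f}s, {fast_sessions} session")
```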
---
## Analysis
### Quality gap is task-dependent
| Task type | Best model | Why |
|-----------|-----------|-----|
| Code generation (complex) | 27B | Better structure, error handling, completeness |
| Explanation | 1B | Fast, accurate, hits word count |
| Config/YAML | 1B | Matches actual infrastructure, doesn't overengineer |
| Debugging | 27B | Finds multiple issues, quantifies impact |
### Speed comparison
```
Task          1B      7B      27B     27B/1B
────────────────────────────────────────────
Webhook       16.3s   17.5s   59.0s   3.6x
Evennia       2.1s    5.2s    11.4s   5.4x
Cron YAML     9.0s    8.1s    16.8s   1.9x
Async debug   5.8s    10.2s   53.2s   9.2x
────────────────────────────────────────────
Average       8.3s    10.3s   35.1s   4.2x
```
Debug has the highest ratio (9.2x), possibly because 27B spends more time reasoning through the problem; its answer is also about 1.5x longer than 1B's (4565 vs 3037 chars).
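
The ratios in the table can be rechecked from the raw latencies. The short script below is a verification sketch with hypothetical variable names. It shows that the 4.2x in the Average row is the ratio of the two column averages (35.1s / 8.3s), while averaging the four per-task ratios gives a different figure, so the two should not be used interchangeably.

```python
# (1B seconds, 27B seconds) per task, copied from the table above.
timings = {
    "Webhook": (16.3, 59.0),
    "Evennia": (2.1, 11.4),
    "Cron YAML": (9.0, 16.8),
    "Async debug": (5.8, 53.2),
}

per_task = {task: big / small for task, (small, big) in timings.items()}
mean_1b = sum(small for small, _ in timings.values()) / len(timings)
mean_27b = sum(big for _, big in timings.values()) / len(timings)

ratio_of_means = mean_27b / mean_1b
mean_of_ratios = sum(per_task.values()) / len(per_task)

for task, ratio in per_task.items():
    print(f"{task}: {ratio:.1f}x")
print(f"ratio of averages: {ratio_of_means:.1f}x")       # 4.2x
print(f"average of per-task ratios: {mean_of_ratios:.1f}x")  # 5.0x
```

The ratio of averages is dominated by the two slowest tasks, whereas the average of ratios weights every task equally.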
### Recommendation
- **27B:** Code review, debugging, architecture decisions
- **7B:** Moderate complexity, quick drafts
- **1B:** Fast lookups, simple generation, time-sensitive
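
One way to apply this recommendation is a simple lookup from task type to model tag. The sketch below is illustrative only: the task-type keys, the `pick_model` name, and the choice of the 7B model as the fallback are all assumptions, while the model tags are the ones listed at the top of this document.

```python
# Hypothetical task-type keys mapped to the Ollama tags benchmarked here.
MODEL_FOR_TASK = {
    "debugging": "gemma4:latest",
    "code_review": "gemma4:latest",
    "architecture": "gemma4:latest",
    "draft": "qwen2.5:7b",
    "explanation": "gemma3:1b",
    "config": "gemma3:1b",
    "lookup": "gemma3:1b",
}


def pick_model(task_type: str, default: str = "qwen2.5:7b") -> str:
    """Return the recommended model tag; unknown task types use the default."""
    return MODEL_FOR_TASK.get(task_type, default)


print(pick_model("debugging"))  # gemma4:latest
print(pick_model("config"))     # gemma3:1b
print(pick_model("unknown"))    # qwen2.5:7b
```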
---
## Full Outputs
### Task 1 — 1B output (excerpt)
```python
class WebhookPayloadValidator:
    """
    Parses and validates a Gitea webhook payload.
    Attributes:
        secret_key (str): The shared secret for HMAC verification.
        allowed_events (list): List of allowed event types.
    """
    def __init__(self, secret_key: str, allowed_events: List[str]):
        self.secret_key = secret_key
        self.allowed_events = allowed_events
        self.logger = logging.getLogger(__name__)
```
### Task 1 — 27B output (excerpt)
```python
@dataclass
class WebhookEvent:
    event_type: str
    action: Optional[str]
    repository: Dict[str, Any]
    sender: Dict[str, Any]
    raw_payload: Dict[str, Any]

    def get_repo_full_name(self) -> str:
        return self.repository.get("full_name", "")

    def get_sender_login(self) -> str:
        return self.sender.get("login", "")


def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    computed = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(computed, signature)
```
### Task 4 — 27B fix (excerpt)
> This code contains a very common and subtle bug: **It runs the asynchronous tasks sequentially instead of concurrently.**
>
> The performance impact is massive. The original code will take approximately 100 times longer than necessary.
>
> Fix: Use `asyncio.gather()` to run all fetches concurrently, and share a single `ClientSession`.
---
*Generated 2026-04-13. All models via local Ollama.*