Big Brain Benchmark: 1B vs 7B vs 27B quality comparison

Run 4 identical tasks across 3 local Ollama models with measured latency and full output comparison.

Results:
- 27B wins 2/4 (webhook parser, async debugging)
- 1B wins 2/4 (Evennia explanation, cron YAML)
- 27B is 4.2x slower on average
- Debug task: 27B found BOTH bugs; 1B and 7B found only one
- Config task: 1B produced the most appropriate output for our infra

Key insight: quality gap is task-dependent. 27B for debugging/code review. 1B for speed/explanations/config.

File: timmy-config/docs/big-brain-benchmark.md
Refs: Timmy_Foundation/timmy-home#576
2026-04-13 21:58:20 -04:00
1 changed files with 171 additions and 0 deletions
--- a/timmy-config/docs/big-brain-benchmark.md
+++ b/timmy-config/docs/big-brain-benchmark.md
@@ -0,0 +1,171 @@
+# Big Brain Benchmark — 1B vs 7B vs 27B
+
+**Date:** 2026-04-13
+**Models:** gemma3:1b (0.8B), qwen2.5:7b (7.6B), gemma4:latest (27B)
+**Hardware:** Apple Silicon Mac, local Ollama
+**Temperature:** 0.3, max tokens: 2048
+
+**Ref:** #576
+
+---
+
+## Summary
+
+| Task | 1B | 7B | 27B | Winner |
+|------|----|----|-----|--------|
+| Python webhook parser | 16.3s, 7559c | 17.5s, 3307c | 59.0s, 6994c | **27B** |
+| Evennia architecture | 2.1s, 1359c | 5.2s, 1443c | 11.4s, 1328c | **1B** |
+| Cron job YAML | 9.0s, 4237c | 8.1s, 1676c | 16.8s, 1433c | **1B** |
+| Debug async bug | 5.8s, 3037c | 10.2s, 2499c | 53.2s, 4565c | **27B** |
+
+**27B wins 2/4 on quality. 1B wins 2/4 on speed+accuracy.**
+**27B is 4.2x slower than 1B on average (35.1s vs 8.3s mean latency).**
+
+---
+
+## Task 1: Python Webhook Parser
+
+### 1B (16.3s)
+Generated a class-based `WebhookPayloadValidator` with logging. Imports `dataclass` but implements a regular class with manual validation. Includes HMAC verification, but relies on Pydantic's `validator` decorator, which the stdlib `dataclasses` module does not provide. **Verdict:** Looks professional but has import errors.
+
+### 7B (17.5s)
+Clean `TypedDict` approach with `hmac.new()`. Correct signature verification. **Missing:** the requested unit test. **Verdict:** Correct but incomplete.
+
+### 27B (59.0s)
+Complete `@dataclass` with `hmac.new()`, proper error chaining (`from exc`), helper methods (`get_repo_full_name`). Structured, production-quality. **Missing:** the requested unit test. **Verdict:** Best structure and error handling.
+
+**Winner: 27B** — cleanest, most production-ready code.
+
+---
+
+## Task 2: Evennia Architecture (200 words target)
+
+### 1B (2.1s, 1359 chars)
+Covers Data/Logic/Presentation layers. Mentions Django/Twisted. Good architectural overview with layer separation. **Word count:** ~200. Fast and accurate.
+
+### 7B (5.2s, 1443 chars)
+Good overview of typeclasses, commands, channels. Mentions Django models. Slightly more practical. **Word count:** ~210.
+
+### 27B (11.4s, 1328 chars)
+Mentions grammar-based parser and separation of concerns. Most technically precise. **Word count:** ~195. Just under the 200-word target.
+
+**Winner: 1B** — fastest, accurate enough, hits word count.
+
+---
+
+## Task 3: Cron Job YAML
+
+### 1B (9.0s, 4237 chars)
+Full YAML with `name`, `description`, `schedule`, `hosts` array, `checks` (disk/memory/ollama_health), `notifications` (telegram), `restart` config. Verbose but comprehensive and **matches our actual infrastructure** (bare cron, not k8s).
+
+### 7B (8.1s, 1676 chars)
+Minimal YAML with `schedule`, `hosts`, `tasks` per host. Misses auto-restart and Telegram integration specifics. Too bare.
+
+### 27B (16.8s, 1433 chars)
+YAML with `apiVersion: batch/v1, kind: CronJob` — **Kubernetes format, wrong for our infra**. We use bare cron, not k8s.
+
+**Winner: 1B** — most appropriate for actual infrastructure, comprehensive.
+
+---
+
+## Task 4: Debug Async Bug
+
+### 1B (5.8s)
+Identified: new `ClientSession` per request (correct). Missed: sequential fetch (the bigger performance bug). Fix: pass session as parameter. **Partial fix.**
+
+### 7B (10.2s)
+Identified: `aiohttp.ClientSession` context issue. Mentioned async context handling. Missed sequential fetch. **Partial fix.**
+
+### 27B (53.2s)
+Identified: **both bugs** — sequential execution AND session-per-request. "The performance impact is massive. The original code will take approximately 100 times longer than necessary." Fix: `asyncio.gather()` + shared session. **Complete fix with quantified impact.**
+
+**Winner: 27B** — only model that found both bugs.
+
+---
+
+## Analysis
+
+### Quality gap is task-dependent
+
+| Task type | Best model | Why |
+|-----------|-----------|-----|
+| Code generation (complex) | 27B | Better structure, error handling, completeness |
+| Explanation | 1B | Fast, accurate, hits word count |
+| Config/YAML | 1B | Matches actual infrastructure, doesn't overengineer |
+| Debugging | 27B | Finds multiple issues, quantifies impact |
+
+### Speed comparison
+
+```
+Task          1B      7B      27B     27B/1B
+────────────────────────────────────────────
+Webhook     16.3s   17.5s   59.0s    3.6x
+Evennia      2.1s    5.2s   11.4s    5.4x
+Cron YAML    9.0s    8.1s   16.8s    1.9x
+Async debug  5.8s   10.2s   53.2s    9.2x
+────────────────────────────────────────────
+Average      8.3s   10.3s   35.1s    4.2x
+```
+
+Debug has the highest ratio (9.2x). The 27B response is also the longest for that task (4565c vs 3037c for 1B), which suggests it spends more time reasoning through the problem.
+
+### Recommendation
+
+- **27B:** Code review, debugging, architecture decisions
+- **7B:** Moderate complexity, quick drafts
+- **1B:** Fast lookups, explanations, simple config generation, time-sensitive tasks
+
+---
+
+## Full Outputs
+
+### Task 1 — 1B output (excerpt)
+
+```python
+class WebhookPayloadValidator:
+    """
+    Parses and validates a Gitea webhook payload.
+    
+    Attributes:
+        secret_key (str): The shared secret for HMAC verification.
+        allowed_events (list): List of allowed event types.
+    """
+    def __init__(self, secret_key: str, allowed_events: List[str]):
+        self.secret_key = secret_key
+        self.allowed_events = allowed_events
+        self.logger = logging.getLogger(__name__)
+```
+
+### Task 1 — 27B output (excerpt)
+
+```python
+@dataclass
+class WebhookEvent:
+    event_type: str
+    action: Optional[str]
+    repository: Dict[str, Any]
+    sender: Dict[str, Any]
+    raw_payload: Dict[str, Any]
+
+    def get_repo_full_name(self) -> str:
+        return self.repository.get("full_name", "")
+
+    def get_sender_login(self) -> str:
+        return self.sender.get("login", "")
+
+def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
+    computed = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
+    return hmac.compare_digest(computed, signature)
+```
+
+### Task 4 — 27B fix (excerpt)
+
+> This code contains a very common and subtle bug: **It runs the asynchronous tasks sequentially instead of concurrently.**
+>
+> The performance impact is massive. The original code will take approximately 100 times longer than necessary.
+>
+> Fix: Use `asyncio.gather()` to run all fetches concurrently, and share a single `ClientSession`.
+
+---
+
+*Generated 2026-04-13. All models via local Ollama.*
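
Note on Task 4: the report does not include the code the models were asked to debug, so the following is only a minimal sketch of the two bugs it describes (a new session per request and sequential awaits) and of the fix 27B proposed (`asyncio.gather()` plus one shared session). Everything here is hypothetical and chosen for illustration: `FakeSession` stands in for `aiohttp.ClientSession`, network latency is simulated with `asyncio.sleep`, and the function names are not from the benchmark.

```python
import asyncio
import time


class FakeSession:
    """Hypothetical stand-in for a real HTTP session such as aiohttp.ClientSession."""

    def __init__(self, delay: float = 0.05):
        self.delay = delay

    async def get(self, url: str) -> str:
        # Simulated network latency; no real request is made.
        await asyncio.sleep(self.delay)
        return f"body of {url}"


async def fetch_sequential(urls, delay=0.05):
    """Pattern described as buggy: one session per request, awaited one by one."""
    results = []
    for url in urls:
        session = FakeSession(delay)            # bug 1: new session per request
        results.append(await session.get(url))  # bug 2: each fetch waits for the previous one
    return results


async def fetch_concurrent(urls, delay=0.05):
    """Fix in the spirit of the 27B answer: shared session + asyncio.gather()."""
    session = FakeSession(delay)                # single shared session
    # gather() schedules all fetches at once and returns results in input order.
    return await asyncio.gather(*(session.get(u) for u in urls))


def timed(coro_fn, urls):
    """Run a coroutine function to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(urls))
    return results, time.perf_counter() - start
```

Calling `timed(fetch_sequential, urls)` and `timed(fetch_concurrent, urls)` on the same list should return the same bodies, with the sequential version taking roughly one simulated delay per URL and the concurrent version roughly one delay in total. With a real `aiohttp.ClientSession` the speedup depends on the server and connection limits, so the "100 times" figure quoted from the 27B output should be read as that model's estimate, not a measurement.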