Compare commits: main...q/576-1776 (1 commit, f9b632f0d2)
171
timmy-config/docs/big-brain-benchmark.md
Normal file
@@ -0,0 +1,171 @@
# Big Brain Benchmark: 1B vs 7B vs 27B

**Date:** 2026-04-13
**Models:** gemma3:1b (0.8B), qwen2.5:7b (7.6B), gemma4:latest (27B)
**Hardware:** Apple Silicon Mac, local Ollama
**Temperature:** 0.3, max tokens: 2048

**Ref:** #576
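The report does not say how each run was timed. As a rough sketch, assuming the default local Ollama endpoint and its `/api/generate` route, one prompt could be timed as below. The helper names `build_payload` and `timed_generate` are hypothetical, and `num_predict` is taken here to be the Ollama option corresponding to "max tokens". A call such as `timed_generate("qwen2.5:7b", prompt)` would return the elapsed seconds and the response text, whose length gives the character count.

```python
import json
import time
import urllib.request

# Assumed default local Ollama endpoint; adjust if the server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request with the settings listed above."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.3,   # from the benchmark header
            "num_predict": 2048,  # assumed mapping of "max tokens"
        },
    }


def timed_generate(model: str, prompt: str) -> tuple[float, str]:
    """Return (wall-clock seconds, response text) for a single prompt."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        text = json.loads(response.read())["response"]
    return time.perf_counter() - start, text
```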

---

## Summary

| Task | 1B | 7B | 27B | Winner |
|------|----|----|-----|--------|
| Python webhook parser | 16.3s, 7559c | 17.5s, 3307c | 59.0s, 6994c | **27B** |
| Evennia architecture | 2.1s, 1359c | 5.2s, 1443c | 11.4s, 1328c | **1B** |
| Cron job YAML | 9.0s, 4237c | 8.1s, 1676c | 16.8s, 1433c | **1B** |
| Debug async bug | 5.8s, 3037c | 10.2s, 2499c | 53.2s, 4565c | **27B** |

Each cell shows response time and output length in characters (c).

**27B wins 2/4 on quality. 1B wins 2/4 on speed and accuracy.**
**27B is 4.2x slower than 1B on average (35.1s vs 8.3s mean response time).**

---

## Task 1: Python Webhook Parser

### 1B (16.3s)
Generated a class-based `WebhookPayloadValidator` with logging. Imports `dataclass` but implements a plain class with manual validation. Includes HMAC verification, but uses the `validator` decorator from Pydantic, which stdlib `dataclasses` does not provide. **Verdict:** Looks professional but has import errors.

### 7B (17.5s)
Clean `TypedDict` approach with `hmac.new()`. Correct signature verification. **Missing:** the requested unit test. **Verdict:** Correct but incomplete.

### 27B (59.0s)
Complete `@dataclass` with `hmac.new()`, proper error chaining (`from exc`), and helper methods (`get_repo_full_name`). Structured, production-quality. **Missing:** the requested unit test. **Verdict:** Best structure and error handling.

**Winner: 27B.** Cleanest, most production-ready code.
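The 7B and 27B outputs both omitted the requested unit test. The block below is a minimal sketch of what such a test might look like, reusing the `verify_signature` shape from the 27B output. The test class name, the secret, and the sample payloads are hypothetical, and the original prompt's exact requirements are not shown in this report.

```python
import hashlib
import hmac
import unittest


def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    """Check a hex-encoded HMAC-SHA256 signature using a constant-time compare."""
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


class VerifySignatureTest(unittest.TestCase):
    SECRET = "test-secret"             # hypothetical value, used only by the tests
    PAYLOAD = b'{"action": "opened"}'  # hypothetical sample body

    def sign(self, payload: bytes) -> str:
        """Produce the signature a sender holding SECRET would attach."""
        return hmac.new(self.SECRET.encode("utf-8"), payload, hashlib.sha256).hexdigest()

    def test_accepts_valid_signature(self):
        signature = self.sign(self.PAYLOAD)
        self.assertTrue(verify_signature(self.PAYLOAD, signature, self.SECRET))

    def test_rejects_tampered_payload(self):
        signature = self.sign(self.PAYLOAD)
        self.assertFalse(verify_signature(b'{"action": "closed"}', signature, self.SECRET))

    def test_rejects_wrong_secret(self):
        signature = self.sign(self.PAYLOAD)
        self.assertFalse(verify_signature(self.PAYLOAD, signature, "other-secret"))
```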

---

## Task 2: Evennia Architecture (200-word target)

### 1B (2.1s, 1359 chars)
Covers Data/Logic/Presentation layers. Mentions Django/Twisted. Good architectural overview with layer separation. **Word count:** ~200. Fast and accurate.

### 7B (5.2s, 1443 chars)
Good overview of typeclasses, commands, channels. Mentions Django models. Slightly more practical. **Word count:** ~210.

### 27B (11.4s, 1328 chars)
Mentions grammar-based parser and separation of concerns. Most technically precise. **Word count:** ~195. Close to the 200-word target.

**Winner: 1B.** Fastest, accurate enough, hits word count.

---

## Task 3: Cron Job YAML

### 1B (9.0s, 4237 chars)
Full YAML with `name`, `description`, `schedule`, `hosts` array, `checks` (disk/memory/ollama_health), `notifications` (telegram), `restart` config. Verbose but comprehensive, and it **matches our actual infrastructure** (bare cron, not k8s).

### 7B (8.1s, 1676 chars)
Minimal YAML with `schedule`, `hosts`, `tasks` per host. Misses auto-restart and Telegram integration specifics. Too bare.

### 27B (16.8s, 1433 chars)
YAML with `apiVersion: batch/v1, kind: CronJob`: **Kubernetes format, wrong for our infra**. We use bare cron, not k8s.

**Winner: 1B.** Most appropriate for actual infrastructure, comprehensive.
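The 27B miss is easy to detect mechanically, because a Kubernetes manifest carries top-level `apiVersion` and `kind` keys. The helper below is a rough sketch of such a check on raw YAML text. The function name is hypothetical, and it uses a simple regular expression rather than a real YAML parser, so it only inspects unindented keys.

```python
import re

# Top-level keys that mark a Kubernetes manifest, as seen in the 27B output.
_K8S_KEYS = ("apiVersion", "kind")


def looks_like_kubernetes(yaml_text: str) -> bool:
    """Return True if both marker keys appear as unindented (top-level) keys."""
    top_level = set(re.findall(r"^([A-Za-z_][\w-]*):", yaml_text, flags=re.MULTILINE))
    return all(key in top_level for key in _K8S_KEYS)
```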

---

## Task 4: Debug Async Bug

### 1B (5.8s)
Identified: new `ClientSession` per request (correct). Missed: sequential fetch (the bigger performance bug). Fix: pass session as parameter. **Partial fix.**

### 7B (10.2s)
Identified: `aiohttp.ClientSession` context issue. Mentioned async context handling. Missed sequential fetch. **Partial fix.**

### 27B (53.2s)
Identified: **both bugs**, sequential execution AND session-per-request. "The performance impact is massive. The original code will take approximately 100 times longer than necessary." Fix: `asyncio.gather()` + shared session. **Complete fix with quantified impact.**

**Winner: 27B.** Only model that found both bugs.
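The buggy code given to the models is not reproduced in this report, so the following is only an illustration of the two fixes, under stated assumptions. `fetch` is a stand-in that sleeps instead of making an HTTP request, `DELAY` and the URLs are made up, and the `session` object is a placeholder for a single shared `aiohttp.ClientSession`. Awaiting requests one at a time costs roughly `n * DELAY`, while `asyncio.gather()` finishes in roughly one `DELAY`.

```python
import asyncio
import time

DELAY = 0.05  # simulated per-request latency in seconds (made-up value)


async def fetch(session: object, url: str) -> str:
    """Stand-in for an HTTP GET; sleeps instead of touching the network."""
    await asyncio.sleep(DELAY)
    return f"body of {url}"


async def fetch_sequential(session: object, urls: list[str]) -> list[str]:
    # Each await completes before the next request starts: about n * DELAY.
    return [await fetch(session, url) for url in urls]


async def fetch_concurrent(session: object, urls: list[str]) -> list[str]:
    # gather() runs all coroutines concurrently and keeps input order: about DELAY.
    return list(await asyncio.gather(*(fetch(session, url) for url in urls)))


async def timed(coro) -> tuple[float, list[str]]:
    """Await a coroutine and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = await coro
    return time.perf_counter() - start, result


async def main() -> tuple[float, float]:
    session = object()  # placeholder for one session shared by every request
    urls = [f"https://example.invalid/{i}" for i in range(10)]
    slow, first = await timed(fetch_sequential(session, urls))
    fast, second = await timed(fetch_concurrent(session, urls))
    assert first == second  # same bodies, same order
    return slow, fast


if __name__ == "__main__":
    slow, fast = asyncio.run(main())
    print(f"sequential {slow:.2f}s, concurrent {fast:.2f}s")
```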

---

## Analysis

### Quality gap is task-dependent

| Task type | Best model | Why |
|-----------|-----------|-----|
| Code generation (complex) | 27B | Better structure, error handling, completeness |
| Explanation | 1B | Fast, accurate, hits word count |
| Config/YAML | 1B | Matches actual infrastructure, doesn't overengineer |
| Debugging | 27B | Finds multiple issues, quantifies impact |

### Speed comparison

```
Task          1B      7B      27B     27B/1B
────────────────────────────────────────────
Webhook       16.3s   17.5s   59.0s   3.6x
Evennia       2.1s    5.2s    11.4s   5.4x
Cron YAML     9.0s    8.1s    16.8s   1.9x
Async debug   5.8s    10.2s   53.2s   9.2x
────────────────────────────────────────────
Average       8.3s    10.3s   35.1s   4.2x
```

Debugging has the highest ratio (9.2x). 27B spent far longer on this task and also produced its longest relative answer (4565c vs 3037c for 1B), which is consistent with it working through the problem in more depth.
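The slowdown figures can be recomputed from the per-task timings. The short script below is a sketch of that arithmetic; the variable names are arbitrary. It also shows that the answer depends on how the average is taken: the ratio of mean latencies is about 4.2x, while the mean of the four per-task ratios is about 5.0x.

```python
from statistics import mean

# Seconds per task, copied from the speed table: (1B, 7B, 27B).
TIMINGS = {
    "Webhook": (16.3, 17.5, 59.0),
    "Evennia": (2.1, 5.2, 11.4),
    "Cron YAML": (9.0, 8.1, 16.8),
    "Async debug": (5.8, 10.2, 53.2),
}

# Per-task slowdown of 27B relative to 1B.
per_task_ratio = {task: t27 / t1 for task, (t1, _, t27) in TIMINGS.items()}

mean_1b = mean(t[0] for t in TIMINGS.values())
mean_27b = mean(t[2] for t in TIMINGS.values())

ratio_of_means = mean_27b / mean_1b             # dominated by the long tasks
mean_of_ratios = mean(per_task_ratio.values())  # weights every task equally

for task, ratio in per_task_ratio.items():
    print(f"{task:<12} {ratio:.1f}x")
print(f"ratio of means {ratio_of_means:.1f}x, mean of ratios {mean_of_ratios:.1f}x")
```

Which figure to quote is a judgment call: the ratio of means reflects total time spent across the suite, while the mean of ratios treats a 2-second task and a 16-second task as equally important.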

### Recommendation

- **27B:** Code review, debugging, architecture decisions
- **7B:** Moderate complexity, quick drafts
- **1B:** Fast lookups, simple generation, time-sensitive
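One way to act on this recommendation is a small routing table in front of Ollama. The sketch below is illustrative only: the model tags come from the header of this report, but the task labels, the fallback to the 7B model, and the rule that a time-sensitive request always goes to 1B are assumptions, not something the benchmark prescribes.

```python
# Model tags are taken from the benchmark header; the task labels and the
# routing table itself are hypothetical names chosen for illustration.
ROUTES = {
    "code_review": "gemma4:latest",
    "debugging": "gemma4:latest",
    "architecture": "gemma4:latest",
    "draft": "qwen2.5:7b",
    "lookup": "gemma3:1b",
    "simple_generation": "gemma3:1b",
}

DEFAULT_MODEL = "qwen2.5:7b"  # assumption: mid-size model for unlabeled tasks


def pick_model(task_type: str, time_sensitive: bool = False) -> str:
    """Map a task label to a model tag, preferring 1B under time pressure."""
    if time_sensitive:
        return "gemma3:1b"
    return ROUTES.get(task_type, DEFAULT_MODEL)
```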

---

## Full Outputs

### Task 1: 1B output (excerpt)

```python
class WebhookPayloadValidator:
    """
    Parses and validates a Gitea webhook payload.

    Attributes:
        secret_key (str): The shared secret for HMAC verification.
        allowed_events (list): List of allowed event types.
    """
    def __init__(self, secret_key: str, allowed_events: List[str]):
        self.secret_key = secret_key
        self.allowed_events = allowed_events
        self.logger = logging.getLogger(__name__)
```

### Task 1: 27B output (excerpt)

```python
@dataclass
class WebhookEvent:
    event_type: str
    action: Optional[str]
    repository: Dict[str, Any]
    sender: Dict[str, Any]
    raw_payload: Dict[str, Any]

    def get_repo_full_name(self) -> str:
        return self.repository.get("full_name", "")

    def get_sender_login(self) -> str:
        return self.sender.get("login", "")

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    computed = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(computed, signature)
```

### Task 4: 27B fix (excerpt)

> This code contains a very common and subtle bug: **It runs the asynchronous tasks sequentially instead of concurrently.**
>
> The performance impact is massive. The original code will take approximately 100 times longer than necessary.
>
> Fix: Use `asyncio.gather()` to run all fetches concurrently, and share a single `ClientSession`.

---

*Generated 2026-04-13. All models via local Ollama.*