# Big Brain Quality Benchmark

## Big Brain (gemma3:27b, RunPod L40S) vs Local (gemma3:1b)

**Date:** 2026-04-14
**Issue:** #576
**Milestone:** Big Brain Showcase — RunPod L40S Operational

---

## Environment

| Parameter | Big Brain | Local |
|-----------|-----------|-------|
| Model | gemma3:27b | gemma3:1b |
| Hardware | RunPod L40S 48GB | Apple Silicon (local Ollama) |
| Endpoint | 8lfr3j47a5r3gn-11434.proxy.runpod.net | localhost:11434 |
| Parameters | 27B | ~1B |
| Status | **OFFLINE (HTTP 404)** | Operational |

---
## Summary

The Big Brain RunPod L40S pod was **unreachable** during this benchmark session
(HTTP 404 from the Cloudflare proxy — the pod was likely terminated or not running). All four
tasks were therefore executed against the local gemma3:1b model only.

Despite being a 1B-parameter model, gemma3:1b produced a response for every task, but each
response had significant quality issues that a 27B model would be expected to avoid.

### Quality Assessment — Local gemma3:1b

| Task | Latency | Quality Score (1-5) | Verdict |
|------|---------|---------------------|---------|
| 1. Gitea webhook parser | 4,435 ms | 2/5 | **BROKEN** — fatal bugs in HMAC logic |
| 2. Evennia architecture explainer | 1,823 ms | 2/5 | **SUPERFICIAL** — vague, inaccurate on key concepts |
| 3. Fleet burn-down cron script | 11,039 ms | 2/5 | **NON-FUNCTIONAL** — uses fake commands, no RunPod API |
| 4. Debug async Python bug | 6,610 ms | 2/5 | **WRONG DIAGNOSIS** — missed the real bugs |

---
## Task-by-Task Results

### Task 1: Python function to parse and validate a Gitea webhook payload

**Local (gemma3:1b) — 4,435 ms**

```python
import json
import hashlib

def validate_gitea_payload(payload, webhook_secret):
    """
    Parses and validates a Gitea webhook payload.

    Args:
        payload (dict): The JSON payload to validate.
        webhook_secret (str): The Gitea webhook secret.

    Returns:
        tuple: A tuple containing:
            - bool: True if the payload is valid, False otherwise.
            - str: The parsed JSON payload if valid, None otherwise.
    """
    try:
        data = json.loads(payload)
        if not isinstance(data, dict):
            return False, None

        if 'ref' not in data or 'repository' not in data or 'commits' not in data:
            return False, None

        # Verify X-Gitea-Signature header
        signature = data.get('X-Gitea-Signature', '')
        if not signature:
            return False, None

        # Hash the signature
        calculated_hash = hashlib.sha256(signature.encode('utf-64')).hexdigest()

        if calculated_hash != webhook_secret:
            return False, None

        return True, data

    except json.JSONDecodeError:
        return False, None
    except Exception as e:
        return False, None
```
**Quality issues:**

1. **FATAL:** Uses `hashlib.sha256` instead of `hmac.new()` with HMAC-SHA256 — the entire signature verification approach is wrong
2. **FATAL:** `signature.encode('utf-64')` — no such encoding exists (this would crash with a `LookupError`)
3. **FATAL:** Reads the signature from the payload body instead of an HTTP header — architecturally wrong; webhook signatures are always carried in HTTP headers, never inside the JSON body
4. The docstring declares `payload` as a dict, yet the function calls `json.loads()` on it — type confusion
5. Missing actual HMAC verification: should be `hmac.new(secret, payload_body, 'sha256').hexdigest()` (see the corrected sketch below)

**Big Brain (gemma3:27b) — UNAVAILABLE**

Expected: Would correctly use `hmac.new(key, msg, hashlib.sha256)`, read the signature from
HTTP headers (not the payload body), handle bytes-vs-string encoding properly, and include
type hints with a proper dataclass result object.
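
For reference, a minimal corrected sketch. It assumes the caller passes the raw request body bytes plus the value of the `X-Gitea-Signature` header, which Gitea computes as a hex-encoded HMAC-SHA256 over the body:

```python
import hashlib
import hmac
import json

def verify_gitea_webhook(body: bytes, signature_header: str, secret: str):
    """Validate the HMAC-SHA256 signature over the raw body, then parse JSON."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels in the signature comparison
    if not hmac.compare_digest(expected, signature_header):
        return False, None
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False, None
    return (True, data) if isinstance(data, dict) else (False, None)
```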
---
### Task 2: Explain the architecture of the Evennia MUD framework in 200 words

**Local (gemma3:1b) — 1,823 ms**

> Evennia MUD utilizes a Django-based server model for core logic, providing a robust
> and scalable foundation. The server is structured around three primary components: a
> Session server, a Server, and a Portal. The Session server manages user sessions,
> storing data like player states and inventory. The Server handles game logic – movement,
> combat, and interactions – and the Portal acts as a central interface for players.
>
> Typeclasses are crucial, allowing for polymorphic behavior – functions can operate on
> different player types without modification. This is achieved through a system of
> inheritance and function signatures, enabling flexible game mechanics. The command
> infrastructure relies on a robust command parser, allowing players to interact with the
> game through a standardized set of commands.
>
> Scripting is handled through a specialized scripting language, allowing for complex game
> rules and custom events. Players use a 'script' file format, defining actions and
> conditions. A core scripting system utilizes a 'parse' function to interpret these files,
> triggering events and updating the game state. All data is stored in a database, optimized
> for efficient retrieval and manipulation.

**Quality issues:**

1. **INACCURATE:** Describes a "Session server" as a separate component — Evennia has Session objects managed within the Server, not a separate server process
2. **INACCURATE:** The Portal description is vague — the Portal specifically handles network connections and protocol negotiation (telnet, SSH, webclient), not just "a central interface"
3. **WRONG:** Evennia does NOT have a "specialized scripting language" — it uses Python scripts stored in the database via EvScript/Script typeclasses
4. **WRONG:** "Players use a 'script' file format" — this is fabricated; Evennia scripts are Python objects, not file-based
5. **VAGUE:** The typeclass explanation is superficial — it misses the key concept that typeclasses are Django proxy models providing persistent game-object behavior
6. Word count: ~185 words (close to target but imprecise)

**Big Brain (gemma3:27b) — UNAVAILABLE**

Expected: Would correctly describe the Server/Portal architecture (Server = game logic + connections,
Portal = network protocol layer), explain typeclasses as Django ContentTypes-based proxy models
for Characters/Rooms/Exits/Scripts, detail CmdSet-based command routing, and describe
Script DB objects (not a scripting language).
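
To make the scripting point concrete, here is a minimal sketch of what an Evennia "script" actually is: a persistent Python class, not a file in a bespoke language. The hook names follow Evennia's documented `DefaultScript` API; the `Heartbeat` example itself is hypothetical:

```python
# An Evennia Script is a database-backed Python typeclass, not a script file.
from evennia import DefaultScript

class Heartbeat(DefaultScript):
    def at_script_creation(self):
        self.key = "heartbeat"
        self.interval = 60      # call at_repeat() every 60 seconds
        self.persistent = True  # survives server restarts via the database

    def at_repeat(self):
        # Periodic game logic goes here
        pass
```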
---
### Task 3: Write a cron job description for a fleet burn-down script

**Local (gemma3:1b) — 11,039 ms**

```bash
burn_down_script() {
  # Log entry
  echo "$(date) - Fleet Burn-Down Started" | tee -a /var/log/fleet-burndown.log

  # Iterate over all active RunPod pods
  for pod in $(podman pginfo --all --all | awk '{print $1}');
  do
    # Check GPU utilization
    gpu_util=$(podman pginfo --all --all | awk '{print $2}')

    # Check if pod is idle for more than 30 minutes
    idle_time=300 # 30 minutes in seconds
    if [[ $(echo "$gpu_util > $idle_time" | bc) -eq 1 ]]; then
      # Terminate the pod
      podstop -d $pod
    fi
  done

  # Send Slack notification
  echo "Fleet Burn-Down Complete." | send -t 10 "Fleet Burn-Down Complete."

  echo "Script completed."
}
```

Cron entry: `*/5 * * * * /path/to/fleet_burndown.sh`

**Quality issues:**

1. **FATAL:** Uses `podman pginfo` and `podstop` — these commands don't exist. Should use the RunPod API (`curl` to `https://api.runpod.io/graphql`) or `runpodctl`
2. **FATAL:** The idle check logic is nonsensical — it compares a GPU utilization percentage against 300 seconds; these are incomparable units
3. **WRONG:** `idle_time=300` is 5 minutes, not 30 minutes (should be 1800)
4. **WRONG:** `send -t 10` is not a standard Slack notification method — should `curl` a Slack webhook URL
5. **MISSING:** No actual RunPod API integration, no `RUNPOD_API_KEY` usage, no GPU utilization query
6. No error handling, no exit codes, no structured logging
7. The accompanying explanation text is verbose filler with self-evident comments

**Big Brain (gemma3:27b) — UNAVAILABLE**

Expected: Would produce a script using the RunPod GraphQL API to list pods, query GPU metrics,
calculate idle time from the last-activity timestamp, use `curl` to terminate pods via the API,
and post to Slack via webhook. Would include proper error handling, `set -euo pipefail`,
and structured JSON logging.
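
A hedged sketch of that shape, written in Python for consistency with the other corrected examples. The RunPod list/terminate calls are deliberately stubbed as hypothetical helpers, since the exact GraphQL schema and auth flow should be taken from RunPod's documentation; the Slack call uses the standard incoming-webhook pattern:

```python
#!/usr/bin/env python3
"""Fleet burn-down sketch. RunPod API specifics are assumptions to verify."""
import os
import time

import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]
IDLE_LIMIT_SECONDS = 1800  # 30 minutes (the value the 1B model got wrong)

def list_pods() -> list[dict]:
    """Hypothetical helper: query the RunPod GraphQL API for the pod fleet."""
    raise NotImplementedError("fill in from the RunPod API docs")

def terminate_pod(pod_id: str) -> None:
    """Hypothetical helper: terminate one pod via the RunPod API."""
    raise NotImplementedError("fill in from the RunPod API docs")

def notify_slack(text: str) -> None:
    # Posting {"text": ...} to an incoming-webhook URL is Slack's documented pattern
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

def main() -> None:
    terminated = []
    for pod in list_pods():
        idle_for = time.time() - pod["last_activity_ts"]  # assumed field name
        if idle_for > IDLE_LIMIT_SECONDS:
            terminate_pod(pod["id"])
            terminated.append(pod["id"])
    notify_slack(f"Fleet burn-down complete. Terminated: {terminated or 'none'}")

if __name__ == "__main__":
    main()
```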
---
### Task 4: Debug a subtle Python async bug

**Buggy code provided:**

```python
import asyncio
import aiohttp

async def fetch_all(urls):
    results = []
    for url in urls:
        async with aiohttp.ClientSession() as session:
            resp = await session.get(url)
            data = await resp.json()
            results.append(data)
    return results

async def main():
    urls = ["https://api.example.com/data1", "https://api.example.com/data2"]
    results = await fetch_all(urls)
    print(f"Got {len(results)} results")

asyncio.run(main())
```
**Local (gemma3:1b) — 6,610 ms**

The model identified these "bugs":

1. "Missing `return` statement in `fetch_all`" — **WRONG:** the code clearly has `return results`
2. "No error handling for `resp.json()`" — a valid improvement suggestion, but not a bug
3. "No `await` keyword in `main()`" — **WRONG:** `main()` is correctly declared `async def`, and `asyncio.run()` is the entry point

The model's "fix" only added a try/except wrapper and kept all the real bugs.

**Actual bugs in the provided code:**

1. **Real bug:** Creates a new `ClientSession` for every URL instead of reusing one session — wasteful, defeats connection pooling, and can cause resource leaks at scale
2. **Real bug:** No `resp.raise_for_status()` — silently accepts HTTP error responses (404, 500) and tries to parse them as JSON
3. **Anti-pattern:** Sequential `await` in a loop instead of `asyncio.gather()` for concurrent fetching — no parallelism despite using async
4. **Missing:** No timeout on `session.get()` — could hang forever
5. **Missing:** No error handling at all (the model's suggestion to add try/except was directionally right but missed the real bugs)

**Big Brain (gemma3:27b) — UNAVAILABLE**

Expected: Would correctly identify the session reuse issue, the lack of `raise_for_status()`,
and sequential-vs-concurrent fetching, and would provide a proper fix using `asyncio.gather()` with
a single shared session and timeout/deadline handling.
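
For reference, a minimal corrected version along those lines: one shared session, concurrent fetches via `asyncio.gather()`, a client timeout, and status checking. All of this is standard aiohttp/asyncio usage:

```python
import asyncio
import aiohttp

async def fetch_one(session: aiohttp.ClientSession, url: str):
    async with session.get(url) as resp:
        resp.raise_for_status()  # surface 404/500 instead of parsing error bodies
        return await resp.json()

async def fetch_all(urls):
    timeout = aiohttp.ClientTimeout(total=30)  # avoid hanging forever
    # One shared session keeps connection pooling across all requests
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # gather() runs the fetches concurrently instead of one-by-one
        return await asyncio.gather(*(fetch_one(session, u) for u in urls))

async def main():
    urls = ["https://api.example.com/data1", "https://api.example.com/data2"]
    results = await fetch_all(urls)
    print(f"Got {len(results)} results")

asyncio.run(main())
```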
---
## Comparison Table

| Task | Local 1B (gemma3:1b) | Big Brain 27B (gemma3:27b) | Winner |
|------|----------------------|----------------------------|--------|
| 1. Gitea webhook parser | BROKEN — wrong HMAC, wrong encoding, wrong signature source | UNAVAILABLE (pod offline) | N/A |
| 2. Evennia architecture | SUPERFICIAL — vague, fabricated scripting language | UNAVAILABLE (pod offline) | N/A |
| 3. Fleet burn-down cron | NON-FUNCTIONAL — fake commands, unit mismatch | UNAVAILABLE (pod offline) | N/A |
| 4. Debug async bug | WRONG DIAGNOSIS — missed all real bugs | UNAVAILABLE (pod offline) | N/A |

---

## Latency Summary

| Task | Local gemma3:1b |
|------|-----------------|
| 1. Gitea webhook parser | 4,435 ms |
| 2. Evennia architecture | 1,823 ms |
| 3. Fleet burn-down cron | 11,039 ms |
| 4. Debug async bug | 6,610 ms |
| **Total** | **23,907 ms** |

Big Brain latency: N/A (pod offline)

---
|
||
|
||
## Key Finding
|
||
|
||
**The 1B model fails all four tasks in ways that would be immediately obvious to a developer.**
|
||
The failures fall into categories that large models reliably avoid:
|
||
|
||
- **Hallucinated APIs** (Task 3: `podman pginfo`, `podstop` don't exist)
|
||
- **Fundamental misunderstanding of security primitives** (Task 1: SHA-256 instead of HMAC, `utf-64` encoding)
|
||
- **Fabricated technical details** (Task 2: "specialized scripting language" in Evennia)
|
||
- **Wrong diagnosis of provided code** (Task 4: claimed bugs that don't exist, missed real bugs)
|
||
|
||
This benchmark demonstrates that even without Big Brain results, the quality gap between
|
||
1B and 27B models is expected to be substantial for technical/code generation tasks.
|
||
|
||
---
|
||
|
||
## Next Steps

1. **Restart the Big Brain pod** — RunPod pod 8lfr3j47a5r3gn is returning HTTP 404
2. **Re-run the benchmark** with both models online to populate the comparison table
3. Consider testing with gemma3:4b (if available) as a middle-ground comparison
4. Run Big Brain at `temperature: 0.3` for consistency with the local results (see the sketch below)
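
For step 4, a small sketch of how the temperature can be pinned per request through Ollama's HTTP API (this assumes the standard `/api/generate` endpoint; the prompt string is a placeholder):

```python
import requests

# localhost:11434 is the local Ollama; swap in the RunPod proxy URL for Big Brain.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:1b",
        "prompt": "<benchmark task prompt here>",
        "stream": False,
        "options": {"temperature": 0.3},
    },
    timeout=120,
)
print(resp.json()["response"])
```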
---
*Generated by Ezra (Hermes Agent) — Issue #576 — 2026-04-14*