timmy-home/timmy-config/docs/big-brain-benchmark.md
Alexander Whitestone 34b74d81dc
feat(know-thy-father): Phase 1 media indexing (#639)
Merge PR #639
2026-04-14 22:14:17 +00:00


# Big Brain Quality Benchmark

Big Brain (gemma3:27b, RunPod L40S) vs Local (gemma3:1b)

**Date:** 2026-04-14 · **Issue:** #576 · **Milestone:** Big Brain Showcase — RunPod L40S Operational


## Environment

| Parameter | Big Brain | Local |
|-----------|-----------|-------|
| Model | gemma3:27b | gemma3:1b |
| Hardware | RunPod L40S 48GB | Apple Silicon (local Ollama) |
| Endpoint | 8lfr3j47a5r3gn-11434.proxy.runpod.net | localhost:11434 |
| Parameters | 27B | ~1B |
| Status | OFFLINE (HTTP 404) | Operational |

## Summary

The Big Brain RunPod L40S pod was unreachable during this benchmark session (HTTP 404 from Cloudflare proxy — pod likely terminated or not running). All four tasks were executed against the local gemma3:1b model only.

Despite being a 1B-parameter model, gemma3:1b produced a response for every task, but each response contained significant quality issues that a 27B model would be expected to avoid.

## Quality Assessment — Local gemma3:1b

| Task | Latency | Quality Score (1-5) | Verdict |
|------|---------|---------------------|---------|
| 1. Gitea webhook parser | 4,435 ms | 2/5 | BROKEN — fatal bugs in HMAC logic |
| 2. Evennia architecture explainer | 1,823 ms | 2/5 | SUPERFICIAL — vague, inaccurate on key concepts |
| 3. Fleet burn-down cron script | 11,039 ms | 2/5 | NON-FUNCTIONAL — uses fake commands, no RunPod API |
| 4. Debug async Python bug | 6,610 ms | 2/5 | WRONG DIAGNOSIS — missed the real bugs |

## Task-by-Task Results

### Task 1: Python function to parse and validate a Gitea webhook payload

Local (gemma3:1b) — 4,435 ms

```python
import json
import hashlib

def validate_gitea_payload(payload, webhook_secret):
    """
    Parses and validates a Gitea webhook payload.

    Args:
        payload (dict): The JSON payload to validate.
        webhook_secret (str): The Gitea webhook secret.

    Returns:
        tuple: A tuple containing:
            - bool: True if the payload is valid, False otherwise.
            - str: The parsed JSON payload if valid, None otherwise.
    """
    try:
        data = json.loads(payload)
        if not isinstance(data, dict):
            return False, None

        if 'ref' not in data or 'repository' not in data or 'commits' not in data:
            return False, None

        # Verify X-Gitea-Signature header
        signature = data.get('X-Gitea-Signature', '')
        if not signature:
            return False, None

        # Hash the signature
        calculated_hash = hashlib.sha256(signature.encode('utf-64')).hexdigest()

        if calculated_hash != webhook_secret:
            return False, None

        return True, data

    except json.JSONDecodeError:
        return False, None
    except Exception as e:
        return False, None
```

Quality issues:

  1. FATAL: Uses hashlib.sha256 instead of hmac.new() with HMAC-SHA256 — the entire signature verification approach is wrong
  2. FATAL: signature.encode('utf-64') — no such encoding exists (would crash with LookupError)
  3. FATAL: Reads signature from payload body instead of HTTP header — architecturally wrong; webhook signatures are always in HTTP headers, never inside the JSON body
  4. The function accepts payload as a dict but calls json.loads() on it — type confusion
  5. Missing actual HMAC verification: should be hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest(), compared against the header value with hmac.compare_digest()

Big Brain (gemma3:27b) — UNAVAILABLE

Expected: Would correctly use hmac.new(key, msg, hashlib.sha256), read signature from HTTP headers (not payload body), handle bytes vs string encoding properly, and include type hints with a proper dataclass result object.
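For contrast, a minimal sketch of the correct pattern, assuming the caller passes in the raw request body bytes and the value of the X-Gitea-Signature header (the function and parameter names here are illustrative, not from the benchmark output):

```python
import hashlib
import hmac

def verify_gitea_signature(raw_body: bytes, header_signature: str, secret: str) -> bool:
    """Verify a Gitea webhook: HMAC-SHA256 over the raw request body,
    compared against the hex digest Gitea sends in the X-Gitea-Signature
    HTTP header (never inside the JSON payload itself)."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # constant-time comparison avoids leaking timing information
    return hmac.compare_digest(expected, header_signature)
```

Only after this check passes should the body be handed to `json.loads()`.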


### Task 2: Explain the architecture of the Evennia MUD framework in 200 words

Local (gemma3:1b) — 1,823 ms

Evennia MUD utilizes a Django-based server model for core logic, providing a robust and scalable foundation. The server is structured around three primary components: a Session server, a Server, and a Portal. The Session server manages user sessions, storing data like player states and inventory. The Server handles game logic movement, combat, and interactions and the Portal acts as a central interface for players.

Typeclasses are crucial, allowing for polymorphic behavior functions can operate on different player types without modification. This is achieved through a system of inheritance and function signatures, enabling flexible game mechanics. The command infrastructure relies on a robust command parser, allowing players to interact with the game through a standardized set of commands.

Scripting is handled through a specialized scripting language, allowing for complex game rules and custom events. Players use a 'script' file format, defining actions and conditions. A core scripting system utilizes a 'parse' function to interpret these files, triggering events and updating the game state. All data is stored in a database, optimized for efficient retrieval and manipulation.

Quality issues:

  1. INACCURATE: Describes "Session server" as a separate component — Evennia has Session objects managed within the Server, not a separate server process
  2. INACCURATE: Portal description is vague — the Portal specifically handles network connections and protocol negotiation (telnet, SSH, webclient), not just "a central interface"
  3. WRONG: Evennia does NOT have a "specialized scripting language" — it uses Python scripts stored in the database via EvScript/Script typeclasses
  4. WRONG: "Players use a 'script' file format" — this is fabricated; Evennia scripts are Python objects, not file-based
  5. VAGUE: Typeclass explanation is superficial — misses the key concept that they're Django proxy models providing persistent game object behavior
  6. Word count: ~185 words (close to target but imprecise)

Big Brain (gemma3:27b) — UNAVAILABLE

Expected: Would correctly describe Server/Portal architecture (Server=game logic+connections, Portal=network protocol layer), explain typeclasses as Django ContentTypes-based proxy models for Characters/Rooms/Exits/Scripts, detail CmdSet-based command routing, and describe Script DB objects (not a scripting language).


### Task 3: Write a cron job description for a fleet burn-down script

Local (gemma3:1b) — 11,039 ms

```bash
burn_down_script() {
  # Log entry
  echo "$(date) - Fleet Burn-Down Started" | tee -a /var/log/fleet-burndown.log

  # Iterate over all active RunPod pods
  for pod in $(podman pginfo --all --all | awk '{print $1}');
  do
    # Check GPU utilization
    gpu_util=$(podman pginfo --all --all | awk '{print $2}')

    # Check if pod is idle for more than 30 minutes
    idle_time=300 # 30 minutes in seconds
    if [[ $(echo "$gpu_util > $idle_time" | bc) -eq 1 ]]; then
      # Terminate the pod
      podstop -d $pod
    fi
  done

  # Send Slack notification
  echo "Fleet Burn-Down Complete." | send -t 10 "Fleet Burn-Down Complete."

  echo "Script completed."
}
```

Cron entry: `*/5 * * * * /path/to/fleet_burndown.sh`

Quality issues:

  1. FATAL: Uses podman pginfo and podstop — these commands don't exist. Should use RunPod API (curl to https://api.runpod.io/graphql) or runpodctl
  2. FATAL: Idle check logic is nonsensical — compares GPU utilization percentage against 300 seconds; these are incomparable units
  3. WRONG: idle_time=300 is 5 minutes, not 30 minutes (should be 1800)
  4. WRONG: send -t 10 is not a standard Slack notification method — should use curl to Slack webhook URL
  5. MISSING: No actual RunPod API integration, no RUNPOD_API_KEY usage, no GPU utilization query
  6. No error handling, no exit codes, no structured logging
  7. The explanation text is verbose filler with self-evident comments

Big Brain (gemma3:27b) — UNAVAILABLE

Expected: Would produce a script using RunPod GraphQL API to list pods, query GPU metrics, calculate idle time from last activity timestamp, use curl to terminate pods via API, and post to Slack via webhook. Would include proper error handling, set -euo pipefail, and structured JSON logging.
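To make the unit error concrete, here is a minimal sketch of just the termination decision, with the API calls elided; the threshold names and values are illustrative assumptions, not RunPod API fields:

```python
# Decision logic only -- listing pods and terminating them would go through
# the RunPod GraphQL API or runpodctl, which is elided here.
IDLE_MINUTES = 30          # the intended idle window (1800 seconds, not 300)
GPU_UTIL_IDLE_PCT = 5.0    # treat utilization below this as "idle" (assumed threshold)

def should_terminate(gpu_util_pct: float, idle_seconds: float) -> bool:
    """True when the pod is a burn-down candidate.

    Percent and seconds are kept in separate parameters so the two units
    can never be compared against each other directly, which is exactly
    the mistake the generated script made.
    """
    return gpu_util_pct < GPU_UTIL_IDLE_PCT and idle_seconds >= IDLE_MINUTES * 60
```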


### Task 4: Debug a subtle Python async bug

Buggy code provided:

```python
import asyncio
import aiohttp

async def fetch_all(urls):
    results = []
    for url in urls:
        async with aiohttp.ClientSession() as session:
            resp = await session.get(url)
            data = await resp.json()
            results.append(data)
    return results

async def main():
    urls = ["https://api.example.com/data1", "https://api.example.com/data2"]
    results = await fetch_all(urls)
    print(f"Got {len(results)} results")

asyncio.run(main())
```

Local (gemma3:1b) — 6,610 ms

The model identified these "bugs":

  1. "Missing return statement in fetch_all" — WRONG, the code clearly has return results
  2. "No error handling for resp.json()" — This is a valid improvement suggestion but not a bug
  3. "No await keyword in main()" — WRONG, main() is correctly async def and asyncio.run() is the entry point

The model's "fix" only added a try/except wrapper and kept all the real bugs.

Actual bugs in the provided code:

  1. Real Bug: Creates a new ClientSession for every URL instead of reusing one session — wasteful, defeats connection pooling, and can cause resource leaks at scale
  2. Real Bug: No resp.raise_for_status() — silently accepts HTTP error responses (404, 500) and tries to parse them as JSON
  3. Anti-pattern: Sequential await in loop instead of asyncio.gather() for concurrent fetching — no parallelism despite using async
  4. Missing: No timeout on session.get() — could hang forever
  5. Missing: No error handling at all (the model's suggestion to add try/except was directionally right but missed the real bugs)

Big Brain (gemma3:27b) — UNAVAILABLE

Expected: Would correctly identify session reuse issue, lack of raise_for_status(), sequential vs concurrent fetching, and provide a proper fix using asyncio.gather() with a single shared session and timeout/deadline handling.
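A sketch of that corrected shape, written against the aiohttp-style session interface so the pattern is visible (the `session` parameter is assumed to behave like aiohttp's `ClientSession`; this is the expected fix, not benchmark output):

```python
import asyncio

async def fetch_one(session, url):
    # session.get() is assumed to return an async context manager whose
    # response exposes .raise_for_status() and an awaitable .json(),
    # matching aiohttp's ClientSession.
    async with session.get(url) as resp:
        resp.raise_for_status()  # surface HTTP 4xx/5xx instead of parsing error bodies
        return await resp.json()

async def fetch_all(session, urls):
    # One shared session (connection pooling) plus asyncio.gather for
    # concurrent fetching -- the two fixes the 1B model missed entirely.
    return await asyncio.gather(*(fetch_one(session, url) for url in urls))

# Real usage (requires aiohttp; the ClientTimeout gives every request a deadline):
#
#   async def main():
#       timeout = aiohttp.ClientTimeout(total=10)
#       async with aiohttp.ClientSession(timeout=timeout) as session:
#           results = await fetch_all(session, urls)
```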


## Comparison Table

| Task | Local 1B (gemma3:1b) | Big Brain 27B (gemma3:27b) | Winner |
|------|----------------------|----------------------------|--------|
| 1. Gitea webhook parser | BROKEN — wrong HMAC, wrong encoding, wrong signature source | UNAVAILABLE (pod offline) | N/A |
| 2. Evennia architecture | SUPERFICIAL — vague, fabricated scripting language | UNAVAILABLE (pod offline) | N/A |
| 3. Fleet burn-down cron | NON-FUNCTIONAL — fake commands, unit mismatch | UNAVAILABLE (pod offline) | N/A |
| 4. Debug async bug | WRONG DIAGNOSIS — missed all real bugs | UNAVAILABLE (pod offline) | N/A |

## Latency Summary

| Task | Local gemma3:1b |
|------|-----------------|
| 1. Gitea webhook parser | 4,435 ms |
| 2. Evennia architecture | 1,823 ms |
| 3. Fleet burn-down cron | 11,039 ms |
| 4. Debug async bug | 6,610 ms |
| **Total** | **23,907 ms** |

Big Brain latency: N/A (pod offline)


## Key Finding

The 1B model fails all four tasks in ways that would be immediately obvious to a developer. The failures fall into categories that large models reliably avoid:

- Hallucinated APIs (Task 3: podman pginfo, podstop don't exist)
- Fundamental misunderstanding of security primitives (Task 1: SHA-256 instead of HMAC, utf-64 encoding)
- Fabricated technical details (Task 2: "specialized scripting language" in Evennia)
- Wrong diagnosis of provided code (Task 4: claimed bugs that don't exist, missed real bugs)

This benchmark demonstrates that even without Big Brain results, the quality gap between 1B and 27B models is expected to be substantial for technical/code generation tasks.


## Next Steps

  1. Restart Big Brain pod — RunPod pod 8lfr3j47a5r3gn is returning HTTP 404
  2. Re-run benchmark with both models online to populate the comparison table
  3. Consider testing with gemma3:4b (if available) as a middle-ground comparison
  4. Run Big Brain at temperature: 0.3 for consistency with local results

Generated by Ezra (Hermes Agent) — Issue #576 — 2026-04-14