Rockachopa/Timmy-time-dashboard

Claude (Opus 4.6) 7dfbf05867

[claude] Run 5-test benchmark suite against local model candidates (#1066) (#1271)

2026-03-24 01:38:59 +00:00

Model Benchmark Results

Generated: 2026-03-24 01:28 UTC
Ollama URL: http://localhost:11434
Issue: #1066

Overview

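This report documents the results of the 5-test benchmark suite (tool calling compliance, code generation correctness, shell command generation, multi-turn coherence, and issue triage quality) run against four local model candidates served by Ollama.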

Model Availability vs. Spec

| Requested | Tested Substitute | Reason |
|---|---|---|
| `qwen3:14b` | `qwen2.5:14b` | `qwen3:14b` not pulled locally |
| `qwen3:8b` | `qwen3.5:latest` | `qwen3:8b` not pulled locally |
| `hermes3:8b` | `hermes3:8b` | Exact match |
| `dolphin3` | `llama3.2:latest` | `dolphin3` not pulled locally |
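
The substitutions above all come down to which models were already pulled. As a minimal sketch, assuming the standard Ollama `GET /api/tags` endpoint (which lists locally available models), the availability check could look like the following. The helper names `with_default_tag`, `fetch_local_models`, and `missing_models` are hypothetical and are not taken from the benchmark suite.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # URL from the report header


def with_default_tag(name: str) -> str:
    """Ollama treats a bare model name as '<name>:latest'."""
    return name if ":" in name else f"{name}:latest"


def fetch_local_models(base_url: str = OLLAMA_URL) -> set:
    """Return the names of models already pulled, via Ollama's /api/tags."""
    with urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        payload = json.load(resp)
    return {m["name"] for m in payload.get("models", [])}


def missing_models(requested: list, local: set) -> list:
    """Requested models that are not available locally, in request order."""
    return [name for name in requested if with_default_tag(name) not in local]


# Offline example using the four models that were actually tested here.
local = {"qwen2.5:14b", "qwen3.5:latest", "hermes3:8b", "llama3.2:latest"}
requested = ["qwen3:14b", "qwen3:8b", "hermes3:8b", "dolphin3"]
print(missing_models(requested, local))  # ['qwen3:14b', 'qwen3:8b', 'dolphin3']
```

Against a live server, `missing_models(requested, fetch_local_models())` would give the same kind of list, and each missing entry is a candidate for `ollama pull` or for a documented substitute.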

Summary Comparison Table

| Model | Passed | Tool Calling | Code Gen | Shell Gen | Coherence | Triage Accuracy | Time (s) |
|---|---|---|---|---|---|---|---|
| `hermes3:8b` | 3/5 | 100% | PASS | PASS | 20% | 60% | 72.8 |
| `qwen3.5:latest` | 1/5 | 30% | FAIL | FAIL | 100% | 0% | 309.7 |
| `qwen2.5:14b` | 4/5 | 100% | PASS | PASS | 100% | 60% | 105.7 |
| `llama3.2:latest` | 3/5 | 20% | PASS | PASS | 100% | 20% | 45.8 |
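
The "Passed" and "Time (s)" columns can be recomputed from the raw JSON later in this report. The sketch below is one way to do that, not the suite's own code: `summarize` is a hypothetical helper, and the sample dictionary is a trimmed copy of the `hermes3:8b` entry, keeping only the `passed` and timing fields.

```python
def summarize(model_results: dict) -> dict:
    """Collapse one model's per-benchmark results into summary-table fields."""
    passed = sum(1 for r in model_results.values() if r["passed"])
    # Multi-prompt benchmarks report total_time_s; code generation reports elapsed_s.
    total_time = sum(
        r.get("total_time_s", r.get("elapsed_s", 0.0))
        for r in model_results.values()
    )
    return {"passed": f"{passed}/{len(model_results)}", "time_s": total_time}


# Trimmed copy of the hermes3:8b entry from the raw JSON in this report.
hermes = {
    "01_tool_calling": {"passed": True, "total_time_s": 9.84},
    "02_code_generation": {"passed": True, "elapsed_s": 1.14},
    "03_shell_commands": {"passed": True, "total_time_s": 32.22},
    "04_multi_turn_coherence": {"passed": False, "total_time_s": 24.59},
    "05_issue_triage": {"passed": False, "total_time_s": 5.06},
}

summary = summarize(hermes)
print(summary["passed"], f"{summary['time_s']:.2f}s")
```

For `hermes3:8b` this gives 3 of 5 benchmarks passed and a total of about 72.85 seconds, which agrees with the 72.8 shown in the table.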

Per-Model Detail

`hermes3:8b`

Benchmark 1: Tool Calling Compliance — ✅ PASS

JSON Compliance: 10/10 (100%) — target ≥90%
Time: 9.84s
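
A minimal sketch of how a compliance check of this kind could work is shown below. It is an assumption, not the suite's implementation. It relies on three things visible in the raw JSON: each result records `valid_json` and `has_expected_keys`, responses carry `tool` and `args` keys, and at least one response wrapped in a Markdown `json` fence was still counted as valid. The names `check_tool_call` and `compliance` are hypothetical.

```python
import json
import re

# Matches a response that is entirely one fenced block, with or without "json".
FENCE_RE = re.compile(r"^```(?:json)?\s*\n(.*?)\n```\s*$", re.DOTALL)


def check_tool_call(response: str, expected_keys=("tool", "args")) -> dict:
    """Classify one model response as valid JSON with the expected keys."""
    text = response.strip()
    fenced = FENCE_RE.match(text)
    if fenced:
        text = fenced.group(1)  # unwrap the fenced payload
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {"valid_json": False, "has_expected_keys": False}
    has_keys = isinstance(parsed, dict) and all(k in parsed for k in expected_keys)
    return {"valid_json": True, "has_expected_keys": has_keys}


def compliance(responses: list, target: float = 0.9):
    """Return (compliance_rate, passed) using the report's 90% target."""
    checks = [check_tool_call(r) for r in responses]
    rate = sum(c["valid_json"] for c in checks) / len(checks)
    return rate, rate >= target


good = '{"tool": "get_weather", "args": "San Francisco"}'
fenced = '```json\n{"tool": "search_web", "args": {"query": "latest Python release"}}\n```'
print(compliance([good, fenced, good] + [""] * 7))  # (0.3, False)
```

Under this sketch an empty response counts as invalid JSON, so three valid responses out of ten give a rate of 0.3, below the 90% target.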

Benchmark 2: Code Generation Correctness — ✅ PASS

Result: fibonacci(10) = 55 ✓

Generated code snippet:

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

Time: 1.14s
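
The check behind this result can be sketched as follows. The traceback recorded for another model in this report suggests the suite writes the generated code to a temporary file and runs it. This sketch swaps in an in-process `exec` for brevity, so it is an illustration rather than the suite's code. `check_fibonacci` is a hypothetical name.

```python
def check_fibonacci(code: str):
    """Run generated code and verify that fibonacci(10) returns 55."""
    if not code.strip():
        # An empty response would otherwise surface later as a NameError.
        return False, "empty response: no code to run"
    namespace = {}
    try:
        exec(code, namespace)  # untrusted model output: sandbox this in real use
        result = namespace["fibonacci"](10)
    except Exception as exc:
        return False, f"Runtime error: {type(exc).__name__}: {exc}"
    if result == 55:
        return True, "fibonacci(10) = 55"
    return False, f"fibonacci(10) = {result}, expected 55"


# The hermes3:8b snippet shown above.
generated = """
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)
"""

print(check_fibonacci(generated))  # (True, 'fibonacci(10) = 55')
print(check_fibonacci(""))         # (False, 'empty response: no code to run')
```

Checking for an empty response before running anything gives a clearer failure message than letting the test call fail on an undefined name.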

Benchmark 3: Shell Command Generation — ✅ PASS

Passed: 5/5 — Refusals: 0
Time: 32.22s

Benchmark 4: Multi-Turn Coherence — ❌ FAIL

Coherent turns: 1/5 (20%) — target ≥80%
Time: 24.59s

Benchmark 5: Issue Triage Quality — ❌ FAIL

Accuracy: 3/5 (60%) — target ≥80%
Time: 5.06s
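
The triage accuracy above counts only exact label matches, while the raw JSON also records an `off_by_one` flag per issue. The sketch below shows one way both fields could be computed. It assumes the four labels that appear in this report form the full ordered priority scale. `PRIORITIES`, `score_issue`, and `score_triage` are hypothetical names.

```python
from typing import Optional

# Assumed ordering, taken from the labels that appear in this report.
PRIORITIES = ["p0-critical", "p1-high", "p2-medium", "p3-low"]


def score_issue(expected: str, assigned: Optional[str]) -> dict:
    """Score one triage decision; a missing or unknown label never matches."""
    exact = assigned == expected
    off_by_one = (
        assigned in PRIORITIES
        and abs(PRIORITIES.index(assigned) - PRIORITIES.index(expected)) == 1
    )
    return {"exact_match": exact, "off_by_one": off_by_one}


def score_triage(pairs: list, target: float = 0.8) -> dict:
    """Aggregate (expected, assigned) pairs using the report's 80% target."""
    scores = [score_issue(e, a) for e, a in pairs]
    exact = sum(s["exact_match"] for s in scores)
    accuracy = exact / len(scores)
    return {
        "exact_matches": exact,
        "off_by_one": sum(s["off_by_one"] for s in scores),
        "accuracy": accuracy,
        "passed": accuracy >= target,
    }


# (expected, assigned) pairs for hermes3:8b, copied from the raw JSON.
hermes_pairs = [
    ("p0-critical", "p0-critical"),
    ("p1-high", "p1-high"),
    ("p3-low", "p2-medium"),
    ("p2-medium", "p2-medium"),
    ("p0-critical", "p1-high"),
]
print(score_triage(hermes_pairs))
```

For `hermes3:8b` this yields 3 exact matches and 2 off-by-one misses, so the accuracy is 3/5 = 60% and the 80% target is not met. Both misses were a single priority level away from the expected label.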

`qwen3.5:latest`

Benchmark 1: Tool Calling Compliance — ❌ FAIL

JSON Compliance: 3/10 (30%) — target ≥90%
Time: 85.86s

Benchmark 2: Code Generation Correctness — ❌ FAIL

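Result: Runtime error: Traceback (most recent call last): File "/var/folders/9k/v07xkpp133v03yynn9nx80fr0000gn/T/tmpddxkrf6i.py", line 3, in <module> result = fibonacci(10) ^^^^^^^^^ NameError: name 'fibo

The error text is truncated as recorded. The raw JSON shows an empty `code_snippet` for this run, which is consistent with the `NameError`.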
Time: 18.68s

Benchmark 3: Shell Command Generation — ❌ FAIL

Passed: 4/5 — Refusals: 0
Time: 85.54s

Benchmark 4: Multi-Turn Coherence — ✅ PASS

Coherent turns: 5/5 (100%) — target ≥80%
Time: 67.17s

Benchmark 5: Issue Triage Quality — ❌ FAIL

Accuracy: 0/5 (0%) — target ≥80%
Time: 52.42s

`qwen2.5:14b`

Benchmark 1: Tool Calling Compliance — ✅ PASS

JSON Compliance: 10/10 (100%) — target ≥90%
Time: 19.59s

Benchmark 2: Code Generation Correctness — ✅ PASS

Result: fibonacci(10) = 55 ✓

Generated code snippet:

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n + 1):
            a, b = b, a + b
        return b

Time: 2.81s

Benchmark 3: Shell Command Generation — ✅ PASS

Passed: 5/5 — Refusals: 0
Time: 48.7s

Benchmark 4: Multi-Turn Coherence — ✅ PASS

Coherent turns: 5/5 (100%) — target ≥80%
Time: 26.79s

Benchmark 5: Issue Triage Quality — ❌ FAIL

Accuracy: 3/5 (60%) — target ≥80%
Time: 7.76s

`llama3.2:latest`

Benchmark 1: Tool Calling Compliance — ❌ FAIL

JSON Compliance: 2/10 (20%) — target ≥90%
Time: 23.85s

Benchmark 2: Code Generation Correctness — ✅ PASS

Result: fibonacci(10) = 55 ✓

Generated code snippet:

def fibonacci(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

Time: 0.69s

Benchmark 3: Shell Command Generation — ✅ PASS

Passed: 5/5 — Refusals: 0
Time: 10.91s

Benchmark 4: Multi-Turn Coherence — ✅ PASS

Coherent turns: 5/5 (100%) — target ≥80%
Time: 7.47s

Benchmark 5: Issue Triage Quality — ❌ FAIL

Accuracy: 1/5 (20%) — target ≥80%
Time: 2.9s

Raw JSON Data

{
  "hermes3:8b": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "hermes3:8b",
      "total_prompts": 10,
      "valid_json_count": 10,
      "compliance_rate": 1.0,
      "passed": true,
      "total_time_s": 9.84,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 3.23,
          "response_snippet": "{\n  \"tool\": \"get_weather\",\n  \"args\": \"San Francisco\"\n}"
        },
        {
          "prompt_id": 2,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.58,
          "response_snippet": "{\n  \"tool\": \"read_file\",\n  \"args\": [\n    \"/etc/hosts\"\n  ]\n}"
        },
        {
          "prompt_id": 3,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.51,
          "response_snippet": "{\n  \"tool\": \"search_web\",\n  \"args\": \"latest Python release\"\n}"
        },
        {
          "prompt_id": 4,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.81,
          "response_snippet": "{\n  \"tool\": \"Jira\",\n  \"args\": {\n    \"title\": \"Fix login bug\",\n    \"priority\": \"high\"\n  }\n}"
        },
        {
          "prompt_id": 5,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.53,
          "response_snippet": "{\n  \"tool\": \"list_directory\",\n  \"args\": [\"/home/user/projects\"]\n}"
        },
        {
          "prompt_id": 6,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.76,
          "response_snippet": "{\n  \"tool\": \"send_notification\",\n  \"args\": {\n    \"message\": \"Deploy complete\",\n    \"channel\": \"slack\"\n  }\n}"
        },
        {
          "prompt_id": 7,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.69,
          "response_snippet": "{\n  \"tool\": \"database_query\",\n  \"args\": {\n    \"sql\": \"SELECT COUNT(*) FROM users\"\n  }\n}"
        },
        {
          "prompt_id": 8,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.77,
          "response_snippet": "{\n  \"tool\": \"get_git_log\",\n  \"args\": [\n    \"limit=10\",\n    \"branch='main'\"\n  ]\n}"
        },
        {
          "prompt_id": 9,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.94,
          "response_snippet": "{\n  \"tool\": \"schedule_task\",\n  \"args\": {\n    \"cron\": \"0 9 * * MON-FRI\",\n    \"task\": \"generate_report\"\n  }\n}"
        },
        {
          "prompt_id": 10,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.02,
          "response_snippet": "{\n  \"tool\": \"resize_image\",\n  \"args\": {\n    \"url\": \"https://example.com/photo.jpg\",\n    \"width\": 800,\n    \"height\": 600\n"
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "hermes3:8b",
      "passed": true,
      "detail": "fibonacci(10) = 55 \u2713",
      "code_snippet": "def fibonacci(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n    else:\n        return fibonacci(n - 1) + fibonacci(n - 2)",
      "elapsed_s": 1.14
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "hermes3:8b",
      "total_prompts": 5,
      "passed_count": 5,
      "refused_count": 0,
      "pass_rate": 1.0,
      "passed": true,
      "total_time_s": 32.22,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 5.33,
          "response_snippet": "Here's a bash one-liner to find all .log files older than 7 days and delete them:\n\nfind /path/to/search -type f -name \"*"
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 5.56,
          "response_snippet": "To recursively change the permissions of all .sh files to be executable, you can use the following bash command:\n\n```bas"
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 4.89,
          "response_snippet": "To kill all processes matching the name 'python3', you can use the following shell command:\n\n```bash\npkill -f python3\n``"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 10.08,
          "response_snippet": "Here's a bash script snippet that monitors disk usage and sends an alert if the usage exceeds 90%:\n\n```bash\n#!/bin/bash\n"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 6.36,
          "response_snippet": "To create a compressed tar archive of the /var/log directory, you can use the following bash command:\n\n```\ntar -czvf /pa"
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "hermes3:8b",
      "total_turns": 5,
      "coherent_turns": 1,
      "coherence_rate": 0.2,
      "passed": false,
      "total_time_s": 24.59,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.9,
          "elapsed_s": 4.14,
          "response_snippet": "{\n  \"observation\": \"The CPU usage on the monitored web server has spiked to 95%, which is significantly higher than the normal operating range. This high CPU usage could indicate that the server is ex"
        },
        {
          "turn": 2,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 5.02,
          "response_snippet": "{\n  \"observation\": \"After investigating the server logs and performance metrics, it was found that the CPU spike was caused by three runaway Python processes, each consuming approximately 30% of the C"
        },
        {
          "turn": 3,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 5.31,
          "response_snippet": "{\n  \"observation\": \"After terminating the top two runaway Python processes, the CPU usage has decreased from 95% to 45%. However, a new alert has been triggered, indicating that the disk I/O usage has"
        },
        {
          "turn": 4,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 5.37,
          "response_snippet": "{\n  \"observation\": \"After investigating the high disk I/O usage, it was determined that the issue was caused by a log rotation script that had become stuck, resulting in excessive disk write operation"
        },
        {
          "turn": 5,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 4.75,
          "response_snippet": "{\n  \"observation\": \"The incident involving high CPU usage and disk I/O has been successfully resolved. The CPU spike was caused by three runaway Python processes, which were terminated, and the disk I"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "hermes3:8b",
      "total_issues": 5,
      "exact_matches": 3,
      "accuracy": 0.6,
      "passed": false,
      "total_time_s": 5.06,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.18
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": "p1-high",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 0.9
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.96
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": "p2-medium",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.09
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": "p1-high",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.94
        }
      ]
    }
  },
  "qwen3.5:latest": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "qwen3.5:latest",
      "total_prompts": 10,
      "valid_json_count": 3,
      "compliance_rate": 0.3,
      "passed": false,
      "total_time_s": 85.86,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 13.22,
          "response_snippet": ""
        },
        {
          "prompt_id": 2,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 8.07,
          "response_snippet": "{\n  \"tool\": \"read_file\",\n  \"args\": {\n    \"path\": \"/etc/hosts\"\n  }\n}"
        },
        {
          "prompt_id": 3,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 3.49,
          "response_snippet": "```json\n{\n  \"tool\": \"search_web\",\n  \"args\": {\n    \"query\": \"latest Python release\"\n  }\n}\n```"
        },
        {
          "prompt_id": 4,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.36,
          "response_snippet": ""
        },
        {
          "prompt_id": 5,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.18,
          "response_snippet": ""
        },
        {
          "prompt_id": 6,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.1,
          "response_snippet": ""
        },
        {
          "prompt_id": 7,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.14,
          "response_snippet": ""
        },
        {
          "prompt_id": 8,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 5.27,
          "response_snippet": "{\n  \"tool\": \"get_git_log\",\n  \"args\": {\n    \"limit\": 10,\n    \"branch\": \"main\"\n  }\n}"
        },
        {
          "prompt_id": 9,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.82,
          "response_snippet": ""
        },
        {
          "prompt_id": 10,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.19,
          "response_snippet": ""
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "qwen3.5:latest",
      "passed": false,
      "detail": "Runtime error: Traceback (most recent call last):\n  File \"/var/folders/9k/v07xkpp133v03yynn9nx80fr0000gn/T/tmpddxkrf6i.py\", line 3, in <module>\n    result = fibonacci(10)\n             ^^^^^^^^^\nNameError: name 'fibo",
      "code_snippet": "",
      "elapsed_s": 18.68
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "qwen3.5:latest",
      "total_prompts": 5,
      "passed_count": 4,
      "refused_count": 0,
      "pass_rate": 0.8,
      "passed": false,
      "total_time_s": 85.54,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": false,
          "passed": false,
          "elapsed_s": 19.16,
          "response_snippet": ""
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 18.12,
          "response_snippet": "To recursively change the permissions of all `.sh` files in the current directory and its subdirectories to executable, "
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 13.06,
          "response_snippet": "To kill all processes matching the name `python3`, use one of the following commands:\n\n### Recommended (Simplest)\n```bas"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 18.2,
          "response_snippet": "Here's a robust and production-ready Bash script snippet that monitors disk usage and sends an alert if usage exceeds 90"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 17.01,
          "response_snippet": "To create a compressed tar archive of the `/var/log` directory using gzip (the most common compression method), use the "
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "qwen3.5:latest",
      "total_turns": 5,
      "coherent_turns": 5,
      "coherence_rate": 1.0,
      "passed": true,
      "total_time_s": 67.17,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.85,
          "elapsed_s": 13.14,
          "response_snippet": "{\n  \"observation\": \"Web server CPU usage has spiked to 95%, indicating potential performance issues that could affect service availability and user experience\",\n  \"reasoning\": \"High CPU usage requires"
        },
        {
          "turn": 2,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.92,
          "elapsed_s": 12.84,
          "response_snippet": "{\n  \"observation\": \"Three Python processes identified as consuming excessive CPU resources (30% each, totaling ~90%), confirming the root cause of the CPU spike\",\n  \"reasoning\": \"These runaway process"
        },
        {
          "turn": 3,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.88,
          "elapsed_s": 15.74,
          "response_snippet": "{\n  \"observation\": \"Disk I/O has spiked to 98% after CPU reduction, indicating a new bottleneck that could severely impact database performance and application responsiveness\",\n  \"reasoning\": \"High di"
        },
        {
          "turn": 4,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.95,
          "elapsed_s": 13.42,
          "response_snippet": "{\n  \"observation\": \"All server metrics have returned to nominal levels after terminating the runaway Python processes and fixing the stuck log rotation script\",\n  \"reasoning\": \"The incident has been r"
        },
        {
          "turn": 5,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.98,
          "elapsed_s": 12.03,
          "response_snippet": "{\n  \"observation\": \"Incident fully resolved - all metrics nominal, server stable\",\n  \"reasoning\": \"Now documenting the incident for organizational learning and prevention. The post-mortem should captu"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "qwen3.5:latest",
      "total_issues": 5,
      "exact_matches": 0,
      "accuracy": 0.0,
      "passed": false,
      "total_time_s": 52.42,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.36
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.25
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.05
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.99
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.77
        }
      ]
    }
  },
  "qwen2.5:14b": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "qwen2.5:14b",
      "total_prompts": 10,
      "valid_json_count": 10,
      "compliance_rate": 1.0,
      "passed": true,
      "total_time_s": 19.59,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 6.87,
          "response_snippet": "{\n  \"tool\": \"get_weather\",\n  \"args\": {\n    \"location\": \"San Francisco\"\n  }\n}"
        },
        {
          "prompt_id": 2,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.18,
          "response_snippet": "{\n  \"tool\": \"read_file\",\n  \"args\": {\n    \"path\": \"/etc/hosts\"\n  }\n}"
        },
        {
          "prompt_id": 3,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.19,
          "response_snippet": "{\n  \"tool\": \"search_web\",\n  \"args\": {\n    \"query\": \"latest Python release\"\n  }\n}"
        },
        {
          "prompt_id": 4,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.61,
          "response_snippet": "{\n  \"tool\": \"create_issue\",\n  \"args\": {\n    \"title\": \"Fix login bug\",\n    \"priority\": \"high\"\n  }\n}"
        },
        {
          "prompt_id": 5,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.13,
          "response_snippet": "{\n  \"tool\": \"list_directory\",\n  \"args\": {\n    \"path\": \"/home/user/projects\"\n  }\n}"
        },
        {
          "prompt_id": 6,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.36,
          "response_snippet": "{\n  \"tool\": \"send_notification\",\n  \"args\": {\n    \"message\": \"Deploy complete\",\n    \"channel\": \"slack\"\n  }\n}"
        },
        {
          "prompt_id": 7,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.21,
          "response_snippet": "{\n  \"tool\": \"database_query\",\n  \"args\": {\n    \"sql\": \"SELECT COUNT(*) FROM users\"\n  }\n}"
        },
        {
          "prompt_id": 8,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.46,
          "response_snippet": "{\n  \"tool\": \"get_git_log\",\n  \"args\": {\n    \"limit\": 10,\n    \"branch\": \"main\"\n  }\n}"
        },
        {
          "prompt_id": 9,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.65,
          "response_snippet": "{\n  \"tool\": \"schedule_task\",\n  \"args\": {\n    \"cron\": \"0 9 * * MON-FRI\",\n    \"task\": \"generate_report\"\n  }\n}"
        },
        {
          "prompt_id": 10,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.93,
          "response_snippet": "{\n  \"tool\": \"resize_image\",\n  \"args\": {\n    \"url\": \"https://example.com/photo.jpg\",\n    \"width\": 800,\n    \"height\": 600\n"
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "qwen2.5:14b",
      "passed": true,
      "detail": "fibonacci(10) = 55 \u2713",
      "code_snippet": "def fibonacci(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n    else:\n        a, b = 0, 1\n        for _ in range(2, n + 1):\n            a, b = b, a + b\n        return b",
      "elapsed_s": 2.81
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "qwen2.5:14b",
      "total_prompts": 5,
      "passed_count": 5,
      "refused_count": 0,
      "pass_rate": 1.0,
      "passed": true,
      "total_time_s": 48.7,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 7.5,
          "response_snippet": "You can use the following bash one-liner to find and delete all `.log` files that are older than 7 days:\n\n```bash\nfind /"
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 6.4,
          "response_snippet": "To recursively change the permissions of all `.sh` files to make them executable, you can use the following command:\n\n``"
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 7.15,
          "response_snippet": "To kill all processes matching the name 'python3', you can use a combination of `pgrep` and `pkill` commands. Here is th"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 20.46,
          "response_snippet": "Certainly! Below is a simple Bash script snippet that monitors disk usage and sends an alert via email if the disk usage"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 7.19,
          "response_snippet": "To create a compressed tar archive of the `/var/log` directory, you can use the `tar` command with the `czf` options. Th"
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "qwen2.5:14b",
      "total_turns": 5,
      "coherent_turns": 5,
      "coherence_rate": 1.0,
      "passed": true,
      "total_time_s": 26.79,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.85,
          "elapsed_s": 5.51,
          "response_snippet": "{\n  \"observation\": \"The CPU usage of the web server has suddenly increased to 95%, indicating a high load on the server.\",\n  \"reasoning\": \"This spike could be due to a sudden increase in traffic, a ma"
        },
        {
          "turn": 2,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.9,
          "elapsed_s": 5.03,
          "response_snippet": "{\n  \"observation\": \"Three runaway Python processes are consuming 30% CPU each, contributing to the overall high CPU usage of 95%.\",\n  \"reasoning\": \"These processes are likely the cause of the high CPU"
        },
        {
          "turn": 3,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.88,
          "elapsed_s": 6.5,
          "response_snippet": "{\n  \"observation\": \"After terminating two of the runaway Python processes, the CPU usage has dropped to 45%. However, a new alert indicates that disk I/O is now at 98%, suggesting a potential bottlene"
        },
        {
          "turn": 4,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.95,
          "elapsed_s": 5.37,
          "response_snippet": "{\n  \"observation\": \"The high disk I/O was traced to a stuck log rotation script, which was terminated. After termination, the disk I/O dropped to 20%, and all system metrics are now nominal.\",\n  \"reas"
        },
        {
          "turn": 5,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 1.0,
          "elapsed_s": 4.39,
          "response_snippet": "{\n  \"observation\": \"The incident involving high CPU usage and disk I/O has been resolved. The root cause was identified as runaway Python processes and a stuck log rotation script.\",\n  \"reasoning\": \"T"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "qwen2.5:14b",
      "total_issues": 5,
      "exact_matches": 3,
      "accuracy": 0.6,
      "passed": false,
      "total_time_s": 7.76,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.64
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 1.71
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 1.45
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": "p2-medium",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.44
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.52
        }
      ]
    }
  },
  "llama3.2:latest": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "llama3.2:latest",
      "total_prompts": 10,
      "valid_json_count": 2,
      "compliance_rate": 0.2,
      "passed": false,
      "total_time_s": 23.85,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 3.39,
          "response_snippet": "import requests\n\ndef get_weather():\n    url = \"http://api.openweathermap.org/data/2.5/weather?q=San Francisco&appid=YOUR"
        },
        {
          "prompt_id": 2,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.61,
          "response_snippet": "Here's an example of how you can use the `read_file` function to parse the contents of the `/etc/hosts` file and return "
        },
        {
          "prompt_id": 3,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 0.8,
          "response_snippet": "import requests\n\nresponse = requests.get('https://www.python.org/downloads/')\ndata = response.json()\n\nfor version in dat"
        },
        {
          "prompt_id": 4,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.34,
          "response_snippet": "Here's a simple Python function that calls the 'create_issue' function with the specified title and priority, and return"
        },
        {
          "prompt_id": 5,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.62,
          "response_snippet": "Here's an example of how you can implement the 'list_directory' tool using Python and the `os` module to list the conten"
        },
        {
          "prompt_id": 6,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.35,
          "response_snippet": "Here's a simple Python function that sends a notification to a Slack channel with a specific message:\n\n```python\nimport "
        },
        {
          "prompt_id": 7,
          "valid_json": true,
          "has_expected_keys": false,
          "elapsed_s": 2.9,
          "response_snippet": "Here's how you can invoke the 'database_query' function with the given SQL query and return only valid JSON with the spe"
        },
        {
          "prompt_id": 8,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 3.0,
          "response_snippet": "Here's how you can use the `get_git_log` tool with the specified parameters and return only valid JSON with the required"
        },
        {
          "prompt_id": 9,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 2.85,
          "response_snippet": "Here's a Python script that uses the `schedule` library to schedule a task and returns the result in JSON format:\n\n```py"
        },
        {
          "prompt_id": 10,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 1.01,
          "response_snippet": "import requests\n\ndef resize_image(url, width, height):\n    response = requests.get(url)\n    if response.status_code == 2"
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "llama3.2:latest",
      "passed": true,
      "detail": "fibonacci(10) = 55 \u2713",
      "code_snippet": "def fibonacci(n):\n    if n <= 1:\n        return n\n    a, b = 0, 1\n    for _ in range(2, n + 1):\n        a, b = b, a + b\n    return b",
      "elapsed_s": 0.69
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "llama3.2:latest",
      "total_prompts": 5,
      "passed_count": 5,
      "refused_count": 0,
      "pass_rate": 1.0,
      "passed": true,
      "total_time_s": 10.91,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.67,
          "response_snippet": "You can use the following bash one-liner to find all .log files older than 7 days and delete them:\n\n```bash\nfind . -type"
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.44,
          "response_snippet": "You can use the following bash command to recursively change permissions of all .sh files to executable:\n\n```bash\nfind ."
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.44,
          "response_snippet": "You can use the following shell command to kill all processes matching the name 'python3':\n\n```bash\npkill -9 python3\n```"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 4.49,
          "response_snippet": "Here's a bash script snippet that monitors disk usage and sends an alert if usage exceeds 90%:\n\n```bash\n#!/bin/bash\n\n# S"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.87,
          "response_snippet": "You can use the following bash command to create a compressed tar archive of the /var/log directory:\n\n```bash\ntar -czvf "
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "llama3.2:latest",
      "total_turns": 5,
      "coherent_turns": 5,
      "coherence_rate": 1.0,
      "passed": true,
      "total_time_s": 7.47,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.8,
          "elapsed_s": 1.41,
          "response_snippet": "{\n  \"observation\": \"CPU usage has spiked to 95%, indicating a potential resource bottleneck on the web server.\",\n  \"reasoning\": \"Given the sudden increase in CPU usage, it is likely that the server is"
        },
        {
          "turn": 2,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.9,
          "elapsed_s": 1.48,
          "response_snippet": "{\n  \"observation\": \"Three runaway Python processes are consuming 30% CPU each, indicating a potential memory leak or inefficient resource usage.\",\n  \"reasoning\": \"Given the high CPU usage of these pro"
        },
        {
          "turn": 3,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.85,
          "elapsed_s": 1.79,
          "response_snippet": "{\n  \"observation\": \"Disk I/O is at 98%, indicating a potential storage bottleneck on the web server.\",\n  \"reasoning\": \"Given the high disk I/O usage, it is likely that the server is experiencing a hig"
        },
        {
          "turn": 4,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.95,
          "elapsed_s": 1.52,
          "response_snippet": "{\n  \"observation\": \"All system metrics are now within nominal ranges, indicating that the server has recovered from the resource bottleneck.\",\n  \"reasoning\": \"Given that the log rotation script was th"
        },
        {
          "turn": 5,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 1.0,
          "elapsed_s": 1.27,
          "response_snippet": "{\n  \"observation\": \"The incident has been resolved, and all system metrics are now within nominal ranges.\",\n  \"reasoning\": \"The root cause of the incident was identified as a stuck log rotation script"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "llama3.2:latest",
      "total_issues": 5,
      "exact_matches": 1,
      "accuracy": 0.2,
      "passed": false,
      "total_time_s": 2.9,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 0.54
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": "p3-low",
          "exact_match": false,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 0.62
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.66
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": "p3-low",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.58
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": "p1-high",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.5
        }
      ]
    }
  }
}
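
The `05_issue_triage` summary fields can be cross-checked against the per-issue entries. The sketch below is not the benchmark suite's own scoring code. It assumes that `off_by_one` means the assigned label sits exactly one step from the expected label in the order `p0-critical`, `p1-high`, `p2-medium`, `p3-low`, which is consistent with every entry in the data above. It also assumes the pass threshold is the report's stated target of 80%. The names `score_triage`, `PRIORITY_ORDER`, and `off_by_one_count` are hypothetical.

```python
# Priority labels in severity order, as they appear in the triage results.
PRIORITY_ORDER = ["p0-critical", "p1-high", "p2-medium", "p3-low"]


def score_triage(results, pass_threshold=0.8):
    """Recompute summary fields from per-issue triage results.

    Each result needs "expected" and "assigned" keys. The 0.8 default
    mirrors the report's stated target (>=80%); it is an assumption, not
    a value read from the suite. An unknown label raises KeyError.
    """
    rank = {label: i for i, label in enumerate(PRIORITY_ORDER)}
    exact = 0
    off_by_one = 0
    for r in results:
        distance = abs(rank[r["expected"]] - rank[r["assigned"]])
        if distance == 0:
            exact += 1
        elif distance == 1:
            off_by_one += 1
    total = len(results)
    accuracy = exact / total if total else 0.0
    return {
        "total_issues": total,
        "exact_matches": exact,
        "off_by_one_count": off_by_one,  # hypothetical field, not in the report
        "accuracy": accuracy,
        "passed": accuracy >= pass_threshold,
    }


# Expected/assigned pairs copied from the llama3.2:latest triage results.
llama_results = [
    {"expected": "p0-critical", "assigned": "p0-critical"},
    {"expected": "p1-high", "assigned": "p3-low"},
    {"expected": "p3-low", "assigned": "p2-medium"},
    {"expected": "p2-medium", "assigned": "p3-low"},
    {"expected": "p0-critical", "assigned": "p1-high"},
]
summary = score_triage(llama_results)
print(summary)
```

Run on the `llama3.2:latest` entries, this gives one exact match out of five, matching the recorded `accuracy` of 0.2. Three of the four misses are off by one level; only issue 2 is off by two.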
