Model Benchmark Results
Generated: 2026-03-24 01:28 UTC
Ollama URL: http://localhost:11434
Issue: #1066
Overview
This report documents the results of the 5-test benchmark suite for four local model candidates served by Ollama.
Model Availability vs. Spec
| Requested | Tested Substitute | Reason |
|---|---|---|
| qwen3:14b | qwen2.5:14b | qwen3:14b not pulled locally |
| qwen3:8b | qwen3.5:latest | qwen3:8b not pulled locally |
| hermes3:8b | hermes3:8b | Exact match |
| dolphin3 | llama3.2:latest | dolphin3 not pulled locally |
Summary Comparison Table
| Model | Passed | Tool Calling | Code Gen | Shell Gen | Coherence | Triage Acc | Time (s) |
|---|---|---|---|---|---|---|---|
| hermes3:8b | 3/5 | 100% | PASS | PASS | 20% | 60% | 72.8 |
| qwen3.5:latest | 1/5 | 30% | FAIL | FAIL | 100% | 0% | 309.7 |
| qwen2.5:14b | 4/5 | 100% | PASS | PASS | 100% | 60% | 105.7 |
| llama3.2:latest | 3/5 | 20% | PASS | PASS | 100% | 20% | 45.8 |
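The "Passed" and "Time (s)" columns can be re-derived from the raw JSON at the end of this report. The following is a minimal sketch, not the harness's own code: the `summarize` helper is hypothetical, and the embedded excerpt copies only the `passed` and timing fields for `hermes3:8b`.

```python
import json

# Hypothetical excerpt shaped like the "Raw JSON Data" section of this report;
# only the fields used below are included.
RAW = json.loads("""
{
  "hermes3:8b": {
    "01_tool_calling": {"passed": true, "total_time_s": 9.84},
    "02_code_generation": {"passed": true, "elapsed_s": 1.14},
    "03_shell_commands": {"passed": true, "total_time_s": 32.22},
    "04_multi_turn_coherence": {"passed": false, "total_time_s": 24.59},
    "05_issue_triage": {"passed": false, "total_time_s": 5.06}
  }
}
""")


def summarize(results):
    """Return {model: (benchmarks passed, benchmarks run, total seconds)}."""
    rows = {}
    for model, benchmarks in results.items():
        passed = sum(1 for b in benchmarks.values() if b["passed"])
        # code_generation reports "elapsed_s"; the other benchmarks report
        # "total_time_s".
        seconds = sum(
            b.get("total_time_s", b.get("elapsed_s", 0.0))
            for b in benchmarks.values()
        )
        rows[model] = (passed, len(benchmarks), round(seconds, 2))
    return rows


if __name__ == "__main__":
    for model, (passed, total, seconds) in summarize(RAW).items():
        print(f"{model}: {passed}/{total} in {seconds}s")
```

Loading the full JSON block in place of the excerpt should reproduce all four rows of the table, up to rounding of the time column.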
Per-Model Detail
hermes3:8b
Benchmark 1: Tool Calling Compliance — ✅ PASS
- JSON Compliance: 10/10 (100%) — target ≥90%
- Time: 9.84s
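
The compliance figure above counts responses that parse as JSON. As a rough sketch of such a check (an assumption, not the harness's actual implementation): the `check_tool_call` helper, the expected `tool`/`args` keys, and the fence-stripping step are inferred from the response snippets in the raw data. The real harness may extract JSON more leniently, for example from inside surrounding prose.

```python
import json
import re

# Assumed contract: a tool call is a JSON object with "tool" and "args" keys.
EXPECTED_KEYS = {"tool", "args"}

# Matches an optional Markdown code fence wrapped around the payload.
FENCE = re.compile(r"^```(?:json)?\s*\n(.*?)\n```\s*$", re.DOTALL)


def check_tool_call(response):
    """Classify one model response as valid JSON and/or a well-formed call."""
    text = response.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1)
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return {"valid_json": False, "has_expected_keys": False}
    has_keys = isinstance(payload, dict) and EXPECTED_KEYS <= payload.keys()
    return {"valid_json": True, "has_expected_keys": has_keys}


def compliance_rate(responses):
    """Fraction of responses that parse as JSON (the report's target is 0.9)."""
    valid = sum(1 for r in responses if check_tool_call(r)["valid_json"])
    return valid / len(responses)
```

An empty response, such as the ones recorded for `qwen3.5:latest`, fails to parse and counts against the rate.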
Benchmark 2: Code Generation Correctness — ✅ PASS
- Result: fibonacci(10) = 55 ✓
- Generated code snippet:

      def fibonacci(n):
          if n == 0:
              return 0
          elif n == 1:
              return 1
          else:
              return fibonacci(n - 1) + fibonacci(n - 2)

- Time: 1.14s
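
The tracebacks recorded in this report suggest that generated code is written to a temporary file and run with a driver line that calls `fibonacci(10)`. A minimal sketch of that kind of check follows. The `check_fibonacci` helper, its timeout, and the message formats are assumptions modelled on the `detail` strings in the raw data, not the harness's actual code.

```python
import os
import subprocess
import sys
import tempfile


def check_fibonacci(code, timeout_s=10):
    """Run model-generated code in a child interpreter and test fibonacci(10)."""
    # Append a driver that calls the function and prints the result.
    script = code + "\n\nresult = fibonacci(10)\nprint(result)\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as fh:
        fh.write(script)
        path = fh.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "Timed out"
    finally:
        # Remove the temporary script whether or not the run succeeded.
        os.unlink(path)
    if proc.returncode != 0:
        return False, "Runtime error: " + proc.stderr.strip()
    output = proc.stdout.strip()
    return output == "55", "fibonacci(10) = " + output


if __name__ == "__main__":
    good = "def fibonacci(n):\n    return n if n <= 1 else fibonacci(n - 1) + fibonacci(n - 2)\n"
    print(check_fibonacci(good))
```

Running the code in a separate process keeps a crash or infinite loop in generated code from taking down the benchmark runner. An empty code string leaves only the driver, which fails with a `NameError`.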
Benchmark 3: Shell Command Generation — ✅ PASS
- Passed: 5/5 — Refusals: 0
- Time: 32.22s
Benchmark 4: Multi-Turn Coherence — ❌ FAIL
- Coherent turns: 1/5 (20%) — target ≥80%
- Time: 24.59s
Benchmark 5: Issue Triage Quality — ❌ FAIL
- Accuracy: 3/5 (60%) — target ≥80%
- Time: 5.06s
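
Triage accuracy counts exact label matches only; the raw data also records an `off_by_one` flag for near misses. A minimal sketch of that scoring, assuming the four priority labels that appear in the raw data form an ordered scale (the `score_triage` and `accuracy` helpers are hypothetical):

```python
# Priority labels seen in the raw data, ordered from most to least severe.
PRIORITIES = ["p0-critical", "p1-high", "p2-medium", "p3-low"]


def score_triage(expected, assigned):
    """Compare an assigned priority label against the expected one."""
    exact = assigned == expected
    off_by_one = False
    # A missing or unrecognised label (e.g. None) is neither exact nor near.
    if assigned in PRIORITIES and expected in PRIORITIES:
        distance = abs(PRIORITIES.index(assigned) - PRIORITIES.index(expected))
        off_by_one = distance == 1
    return {"exact_match": exact, "off_by_one": off_by_one}


def accuracy(pairs):
    """Fraction of (expected, assigned) pairs that match exactly."""
    return sum(1 for expected, assigned in pairs if expected == assigned) / len(pairs)


if __name__ == "__main__":
    # (expected, assigned) pairs for hermes3:8b, copied from the raw data.
    hermes = [
        ("p0-critical", "p0-critical"),
        ("p1-high", "p1-high"),
        ("p3-low", "p2-medium"),
        ("p2-medium", "p2-medium"),
        ("p0-critical", "p1-high"),
    ]
    print(accuracy(hermes))
```

Under this scoring both hermes3:8b misses are off by one level, which the exact-match accuracy does not distinguish from larger errors.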
qwen3.5:latest
Benchmark 1: Tool Calling Compliance — ❌ FAIL
- JSON Compliance: 3/10 (30%) — target ≥90%
- Time: 85.86s
Benchmark 2: Code Generation Correctness — ❌ FAIL
- Result: Runtime error:

      Traceback (most recent call last):
        File "/var/folders/9k/v07xkpp133v03yynn9nx80fr0000gn/T/tmpddxkrf6i.py", line 3, in <module>
          result = fibonacci(10)
                   ^^^^^^^^^
      NameError: name 'fibo

- Time: 18.68s
Benchmark 3: Shell Command Generation — ❌ FAIL
- Passed: 4/5 — Refusals: 0
- Time: 85.54s
Benchmark 4: Multi-Turn Coherence — ✅ PASS
- Coherent turns: 5/5 (100%) — target ≥80%
- Time: 67.17s
Benchmark 5: Issue Triage Quality — ❌ FAIL
- Accuracy: 0/5 (0%) — target ≥80%
- Time: 52.42s
qwen2.5:14b
Benchmark 1: Tool Calling Compliance — ✅ PASS
- JSON Compliance: 10/10 (100%) — target ≥90%
- Time: 19.59s
Benchmark 2: Code Generation Correctness — ✅ PASS
- Result: fibonacci(10) = 55 ✓
- Generated code snippet:

      def fibonacci(n):
          if n == 0:
              return 0
          elif n == 1:
              return 1
          else:
              a, b = 0, 1
              for _ in range(2, n + 1):
                  a, b = b, a + b
              return b

- Time: 2.81s
Benchmark 3: Shell Command Generation — ✅ PASS
- Passed: 5/5 — Refusals: 0
- Time: 48.7s
Benchmark 4: Multi-Turn Coherence — ✅ PASS
- Coherent turns: 5/5 (100%) — target ≥80%
- Time: 26.79s
Benchmark 5: Issue Triage Quality — ❌ FAIL
- Accuracy: 3/5 (60%) — target ≥80%
- Time: 7.76s
llama3.2:latest
Benchmark 1: Tool Calling Compliance — ❌ FAIL
- JSON Compliance: 2/10 (20%) — target ≥90%
- Time: 23.85s
Benchmark 2: Code Generation Correctness — ✅ PASS
- Result: fibonacci(10) = 55 ✓
- Generated code snippet:

      def fibonacci(n):
          if n <= 1:
              return n
          a, b = 0, 1
          for _ in range(2, n + 1):
              a, b = b, a + b
          return b

- Time: 0.69s
Benchmark 3: Shell Command Generation — ✅ PASS
- Passed: 5/5 — Refusals: 0
- Time: 10.91s
Benchmark 4: Multi-Turn Coherence — ✅ PASS
- Coherent turns: 5/5 (100%) — target ≥80%
- Time: 7.47s
Benchmark 5: Issue Triage Quality — ❌ FAIL
- Accuracy: 1/5 (20%) — target ≥80%
- Time: 2.9s
Raw JSON Data
{
"hermes3:8b": {
"01_tool_calling": {
"benchmark": "tool_calling",
"model": "hermes3:8b",
"total_prompts": 10,
"valid_json_count": 10,
"compliance_rate": 1.0,
"passed": true,
"total_time_s": 9.84,
"results": [
{
"prompt_id": 1,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 3.23,
"response_snippet": "{\n \"tool\": \"get_weather\",\n \"args\": \"San Francisco\"\n}"
},
{
"prompt_id": 2,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.58,
"response_snippet": "{\n \"tool\": \"read_file\",\n \"args\": [\n \"/etc/hosts\"\n ]\n}"
},
{
"prompt_id": 3,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.51,
"response_snippet": "{\n \"tool\": \"search_web\",\n \"args\": \"latest Python release\"\n}"
},
{
"prompt_id": 4,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.81,
"response_snippet": "{\n \"tool\": \"Jira\",\n \"args\": {\n \"title\": \"Fix login bug\",\n \"priority\": \"high\"\n }\n}"
},
{
"prompt_id": 5,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.53,
"response_snippet": "{\n \"tool\": \"list_directory\",\n \"args\": [\"/home/user/projects\"]\n}"
},
{
"prompt_id": 6,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.76,
"response_snippet": "{\n \"tool\": \"send_notification\",\n \"args\": {\n \"message\": \"Deploy complete\",\n \"channel\": \"slack\"\n }\n}"
},
{
"prompt_id": 7,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.69,
"response_snippet": "{\n \"tool\": \"database_query\",\n \"args\": {\n \"sql\": \"SELECT COUNT(*) FROM users\"\n }\n}"
},
{
"prompt_id": 8,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.77,
"response_snippet": "{\n \"tool\": \"get_git_log\",\n \"args\": [\n \"limit=10\",\n \"branch='main'\"\n ]\n}"
},
{
"prompt_id": 9,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 0.94,
"response_snippet": "{\n \"tool\": \"schedule_task\",\n \"args\": {\n \"cron\": \"0 9 * * MON-FRI\",\n \"task\": \"generate_report\"\n }\n}"
},
{
"prompt_id": 10,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.02,
"response_snippet": "{\n \"tool\": \"resize_image\",\n \"args\": {\n \"url\": \"https://example.com/photo.jpg\",\n \"width\": 800,\n \"height\": 600\n"
}
]
},
"02_code_generation": {
"benchmark": "code_generation",
"model": "hermes3:8b",
"passed": true,
"detail": "fibonacci(10) = 55 \u2713",
"code_snippet": "def fibonacci(n):\n if n == 0:\n return 0\n elif n == 1:\n return 1\n else:\n return fibonacci(n - 1) + fibonacci(n - 2)",
"elapsed_s": 1.14
},
"03_shell_commands": {
"benchmark": "shell_commands",
"model": "hermes3:8b",
"total_prompts": 5,
"passed_count": 5,
"refused_count": 0,
"pass_rate": 1.0,
"passed": true,
"total_time_s": 32.22,
"results": [
{
"prompt_id": 1,
"description": "find and delete old logs",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 5.33,
"response_snippet": "Here's a bash one-liner to find all .log files older than 7 days and delete them:\n\nfind /path/to/search -type f -name \"*"
},
{
"prompt_id": 2,
"description": "chmod recursive",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 5.56,
"response_snippet": "To recursively change the permissions of all .sh files to be executable, you can use the following bash command:\n\n```bas"
},
{
"prompt_id": 3,
"description": "kill processes by name",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 4.89,
"response_snippet": "To kill all processes matching the name 'python3', you can use the following shell command:\n\n```bash\npkill -f python3\n``"
},
{
"prompt_id": 4,
"description": "disk usage alert",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 10.08,
"response_snippet": "Here's a bash script snippet that monitors disk usage and sends an alert if the usage exceeds 90%:\n\n```bash\n#!/bin/bash\n"
},
{
"prompt_id": 5,
"description": "tar compress logs",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 6.36,
"response_snippet": "To create a compressed tar archive of the /var/log directory, you can use the following bash command:\n\n```\ntar -czvf /pa"
}
]
},
"04_multi_turn_coherence": {
"benchmark": "multi_turn_coherence",
"model": "hermes3:8b",
"total_turns": 5,
"coherent_turns": 1,
"coherence_rate": 0.2,
"passed": false,
"total_time_s": 24.59,
"turns": [
{
"turn": 1,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.9,
"elapsed_s": 4.14,
"response_snippet": "{\n \"observation\": \"The CPU usage on the monitored web server has spiked to 95%, which is significantly higher than the normal operating range. This high CPU usage could indicate that the server is ex"
},
{
"turn": 2,
"valid_json": false,
"has_required_keys": false,
"coherent": false,
"confidence": null,
"elapsed_s": 5.02,
"response_snippet": "{\n \"observation\": \"After investigating the server logs and performance metrics, it was found that the CPU spike was caused by three runaway Python processes, each consuming approximately 30% of the C"
},
{
"turn": 3,
"valid_json": false,
"has_required_keys": false,
"coherent": false,
"confidence": null,
"elapsed_s": 5.31,
"response_snippet": "{\n \"observation\": \"After terminating the top two runaway Python processes, the CPU usage has decreased from 95% to 45%. However, a new alert has been triggered, indicating that the disk I/O usage has"
},
{
"turn": 4,
"valid_json": false,
"has_required_keys": false,
"coherent": false,
"confidence": null,
"elapsed_s": 5.37,
"response_snippet": "{\n \"observation\": \"After investigating the high disk I/O usage, it was determined that the issue was caused by a log rotation script that had become stuck, resulting in excessive disk write operation"
},
{
"turn": 5,
"valid_json": false,
"has_required_keys": false,
"coherent": false,
"confidence": null,
"elapsed_s": 4.75,
"response_snippet": "{\n \"observation\": \"The incident involving high CPU usage and disk I/O has been successfully resolved. The CPU spike was caused by three runaway Python processes, which were terminated, and the disk I"
}
]
},
"05_issue_triage": {
"benchmark": "issue_triage",
"model": "hermes3:8b",
"total_issues": 5,
"exact_matches": 3,
"accuracy": 0.6,
"passed": false,
"total_time_s": 5.06,
"results": [
{
"issue_id": 1,
"title": "Production database is returning 500 errors on all queries",
"expected": "p0-critical",
"assigned": "p0-critical",
"exact_match": true,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 1.18
},
{
"issue_id": 2,
"title": "Login page takes 8 seconds to load",
"expected": "p1-high",
"assigned": "p1-high",
"exact_match": true,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 0.9
},
{
"issue_id": 3,
"title": "Add dark mode support to settings page",
"expected": "p3-low",
"assigned": "p2-medium",
"exact_match": false,
"off_by_one": true,
"valid_json": true,
"elapsed_s": 0.96
},
{
"issue_id": 4,
"title": "Email notifications sometimes arrive 10 minutes late",
"expected": "p2-medium",
"assigned": "p2-medium",
"exact_match": true,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 1.09
},
{
"issue_id": 5,
"title": "Security vulnerability: SQL injection possible in search end",
"expected": "p0-critical",
"assigned": "p1-high",
"exact_match": false,
"off_by_one": true,
"valid_json": true,
"elapsed_s": 0.94
}
]
}
},
"qwen3.5:latest": {
"01_tool_calling": {
"benchmark": "tool_calling",
"model": "qwen3.5:latest",
"total_prompts": 10,
"valid_json_count": 3,
"compliance_rate": 0.3,
"passed": false,
"total_time_s": 85.86,
"results": [
{
"prompt_id": 1,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 13.22,
"response_snippet": ""
},
{
"prompt_id": 2,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 8.07,
"response_snippet": "{\n \"tool\": \"read_file\",\n \"args\": {\n \"path\": \"/etc/hosts\"\n }\n}"
},
{
"prompt_id": 3,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 3.49,
"response_snippet": "```json\n{\n \"tool\": \"search_web\",\n \"args\": {\n \"query\": \"latest Python release\"\n }\n}\n```"
},
{
"prompt_id": 4,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 9.36,
"response_snippet": ""
},
{
"prompt_id": 5,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 9.18,
"response_snippet": ""
},
{
"prompt_id": 6,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 9.1,
"response_snippet": ""
},
{
"prompt_id": 7,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 9.14,
"response_snippet": ""
},
{
"prompt_id": 8,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 5.27,
"response_snippet": "{\n \"tool\": \"get_git_log\",\n \"args\": {\n \"limit\": 10,\n \"branch\": \"main\"\n }\n}"
},
{
"prompt_id": 9,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 9.82,
"response_snippet": ""
},
{
"prompt_id": 10,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 9.19,
"response_snippet": ""
}
]
},
"02_code_generation": {
"benchmark": "code_generation",
"model": "qwen3.5:latest",
"passed": false,
"detail": "Runtime error: Traceback (most recent call last):\n File \"/var/folders/9k/v07xkpp133v03yynn9nx80fr0000gn/T/tmpddxkrf6i.py\", line 3, in <module>\n result = fibonacci(10)\n ^^^^^^^^^\nNameError: name 'fibo",
"code_snippet": "",
"elapsed_s": 18.68
},
"03_shell_commands": {
"benchmark": "shell_commands",
"model": "qwen3.5:latest",
"total_prompts": 5,
"passed_count": 4,
"refused_count": 0,
"pass_rate": 0.8,
"passed": false,
"total_time_s": 85.54,
"results": [
{
"prompt_id": 1,
"description": "find and delete old logs",
"refused": false,
"has_shell_command": false,
"passed": false,
"elapsed_s": 19.16,
"response_snippet": ""
},
{
"prompt_id": 2,
"description": "chmod recursive",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 18.12,
"response_snippet": "To recursively change the permissions of all `.sh` files in the current directory and its subdirectories to executable, "
},
{
"prompt_id": 3,
"description": "kill processes by name",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 13.06,
"response_snippet": "To kill all processes matching the name `python3`, use one of the following commands:\n\n### Recommended (Simplest)\n```bas"
},
{
"prompt_id": 4,
"description": "disk usage alert",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 18.2,
"response_snippet": "Here's a robust and production-ready Bash script snippet that monitors disk usage and sends an alert if usage exceeds 90"
},
{
"prompt_id": 5,
"description": "tar compress logs",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 17.01,
"response_snippet": "To create a compressed tar archive of the `/var/log` directory using gzip (the most common compression method), use the "
}
]
},
"04_multi_turn_coherence": {
"benchmark": "multi_turn_coherence",
"model": "qwen3.5:latest",
"total_turns": 5,
"coherent_turns": 5,
"coherence_rate": 1.0,
"passed": true,
"total_time_s": 67.17,
"turns": [
{
"turn": 1,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.85,
"elapsed_s": 13.14,
"response_snippet": "{\n \"observation\": \"Web server CPU usage has spiked to 95%, indicating potential performance issues that could affect service availability and user experience\",\n \"reasoning\": \"High CPU usage requires"
},
{
"turn": 2,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.92,
"elapsed_s": 12.84,
"response_snippet": "{\n \"observation\": \"Three Python processes identified as consuming excessive CPU resources (30% each, totaling ~90%), confirming the root cause of the CPU spike\",\n \"reasoning\": \"These runaway process"
},
{
"turn": 3,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.88,
"elapsed_s": 15.74,
"response_snippet": "{\n \"observation\": \"Disk I/O has spiked to 98% after CPU reduction, indicating a new bottleneck that could severely impact database performance and application responsiveness\",\n \"reasoning\": \"High di"
},
{
"turn": 4,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.95,
"elapsed_s": 13.42,
"response_snippet": "{\n \"observation\": \"All server metrics have returned to nominal levels after terminating the runaway Python processes and fixing the stuck log rotation script\",\n \"reasoning\": \"The incident has been r"
},
{
"turn": 5,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.98,
"elapsed_s": 12.03,
"response_snippet": "{\n \"observation\": \"Incident fully resolved - all metrics nominal, server stable\",\n \"reasoning\": \"Now documenting the incident for organizational learning and prevention. The post-mortem should captu"
}
]
},
"05_issue_triage": {
"benchmark": "issue_triage",
"model": "qwen3.5:latest",
"total_issues": 5,
"exact_matches": 0,
"accuracy": 0.0,
"passed": false,
"total_time_s": 52.42,
"results": [
{
"issue_id": 1,
"title": "Production database is returning 500 errors on all queries",
"expected": "p0-critical",
"assigned": null,
"exact_match": false,
"off_by_one": false,
"valid_json": false,
"elapsed_s": 10.36
},
{
"issue_id": 2,
"title": "Login page takes 8 seconds to load",
"expected": "p1-high",
"assigned": null,
"exact_match": false,
"off_by_one": false,
"valid_json": false,
"elapsed_s": 10.25
},
{
"issue_id": 3,
"title": "Add dark mode support to settings page",
"expected": "p3-low",
"assigned": null,
"exact_match": false,
"off_by_one": false,
"valid_json": false,
"elapsed_s": 10.05
},
{
"issue_id": 4,
"title": "Email notifications sometimes arrive 10 minutes late",
"expected": "p2-medium",
"assigned": null,
"exact_match": false,
"off_by_one": false,
"valid_json": false,
"elapsed_s": 10.99
},
{
"issue_id": 5,
"title": "Security vulnerability: SQL injection possible in search end",
"expected": "p0-critical",
"assigned": null,
"exact_match": false,
"off_by_one": false,
"valid_json": false,
"elapsed_s": 10.77
}
]
}
},
"qwen2.5:14b": {
"01_tool_calling": {
"benchmark": "tool_calling",
"model": "qwen2.5:14b",
"total_prompts": 10,
"valid_json_count": 10,
"compliance_rate": 1.0,
"passed": true,
"total_time_s": 19.59,
"results": [
{
"prompt_id": 1,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 6.87,
"response_snippet": "{\n \"tool\": \"get_weather\",\n \"args\": {\n \"location\": \"San Francisco\"\n }\n}"
},
{
"prompt_id": 2,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.18,
"response_snippet": "{\n \"tool\": \"read_file\",\n \"args\": {\n \"path\": \"/etc/hosts\"\n }\n}"
},
{
"prompt_id": 3,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.19,
"response_snippet": "{\n \"tool\": \"search_web\",\n \"args\": {\n \"query\": \"latest Python release\"\n }\n}"
},
{
"prompt_id": 4,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.61,
"response_snippet": "{\n \"tool\": \"create_issue\",\n \"args\": {\n \"title\": \"Fix login bug\",\n \"priority\": \"high\"\n }\n}"
},
{
"prompt_id": 5,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.13,
"response_snippet": "{\n \"tool\": \"list_directory\",\n \"args\": {\n \"path\": \"/home/user/projects\"\n }\n}"
},
{
"prompt_id": 6,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.36,
"response_snippet": "{\n \"tool\": \"send_notification\",\n \"args\": {\n \"message\": \"Deploy complete\",\n \"channel\": \"slack\"\n }\n}"
},
{
"prompt_id": 7,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.21,
"response_snippet": "{\n \"tool\": \"database_query\",\n \"args\": {\n \"sql\": \"SELECT COUNT(*) FROM users\"\n }\n}"
},
{
"prompt_id": 8,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.46,
"response_snippet": "{\n \"tool\": \"get_git_log\",\n \"args\": {\n \"limit\": 10,\n \"branch\": \"main\"\n }\n}"
},
{
"prompt_id": 9,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.65,
"response_snippet": "{\n \"tool\": \"schedule_task\",\n \"args\": {\n \"cron\": \"0 9 * * MON-FRI\",\n \"task\": \"generate_report\"\n }\n}"
},
{
"prompt_id": 10,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 1.93,
"response_snippet": "{\n \"tool\": \"resize_image\",\n \"args\": {\n \"url\": \"https://example.com/photo.jpg\",\n \"width\": 800,\n \"height\": 600\n"
}
]
},
"02_code_generation": {
"benchmark": "code_generation",
"model": "qwen2.5:14b",
"passed": true,
"detail": "fibonacci(10) = 55 \u2713",
"code_snippet": "def fibonacci(n):\n if n == 0:\n return 0\n elif n == 1:\n return 1\n else:\n a, b = 0, 1\n for _ in range(2, n + 1):\n a, b = b, a + b\n return b",
"elapsed_s": 2.81
},
"03_shell_commands": {
"benchmark": "shell_commands",
"model": "qwen2.5:14b",
"total_prompts": 5,
"passed_count": 5,
"refused_count": 0,
"pass_rate": 1.0,
"passed": true,
"total_time_s": 48.7,
"results": [
{
"prompt_id": 1,
"description": "find and delete old logs",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 7.5,
"response_snippet": "You can use the following bash one-liner to find and delete all `.log` files that are older than 7 days:\n\n```bash\nfind /"
},
{
"prompt_id": 2,
"description": "chmod recursive",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 6.4,
"response_snippet": "To recursively change the permissions of all `.sh` files to make them executable, you can use the following command:\n\n``"
},
{
"prompt_id": 3,
"description": "kill processes by name",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 7.15,
"response_snippet": "To kill all processes matching the name 'python3', you can use a combination of `pgrep` and `pkill` commands. Here is th"
},
{
"prompt_id": 4,
"description": "disk usage alert",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 20.46,
"response_snippet": "Certainly! Below is a simple Bash script snippet that monitors disk usage and sends an alert via email if the disk usage"
},
{
"prompt_id": 5,
"description": "tar compress logs",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 7.19,
"response_snippet": "To create a compressed tar archive of the `/var/log` directory, you can use the `tar` command with the `czf` options. Th"
}
]
},
"04_multi_turn_coherence": {
"benchmark": "multi_turn_coherence",
"model": "qwen2.5:14b",
"total_turns": 5,
"coherent_turns": 5,
"coherence_rate": 1.0,
"passed": true,
"total_time_s": 26.79,
"turns": [
{
"turn": 1,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.85,
"elapsed_s": 5.51,
"response_snippet": "{\n \"observation\": \"The CPU usage of the web server has suddenly increased to 95%, indicating a high load on the server.\",\n \"reasoning\": \"This spike could be due to a sudden increase in traffic, a ma"
},
{
"turn": 2,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.9,
"elapsed_s": 5.03,
"response_snippet": "{\n \"observation\": \"Three runaway Python processes are consuming 30% CPU each, contributing to the overall high CPU usage of 95%.\",\n \"reasoning\": \"These processes are likely the cause of the high CPU"
},
{
"turn": 3,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.88,
"elapsed_s": 6.5,
"response_snippet": "{\n \"observation\": \"After terminating two of the runaway Python processes, the CPU usage has dropped to 45%. However, a new alert indicates that disk I/O is now at 98%, suggesting a potential bottlene"
},
{
"turn": 4,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.95,
"elapsed_s": 5.37,
"response_snippet": "{\n \"observation\": \"The high disk I/O was traced to a stuck log rotation script, which was terminated. After termination, the disk I/O dropped to 20%, and all system metrics are now nominal.\",\n \"reas"
},
{
"turn": 5,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 1.0,
"elapsed_s": 4.39,
"response_snippet": "{\n \"observation\": \"The incident involving high CPU usage and disk I/O has been resolved. The root cause was identified as runaway Python processes and a stuck log rotation script.\",\n \"reasoning\": \"T"
}
]
},
"05_issue_triage": {
"benchmark": "issue_triage",
"model": "qwen2.5:14b",
"total_issues": 5,
"exact_matches": 3,
"accuracy": 0.6,
"passed": false,
"total_time_s": 7.76,
"results": [
{
"issue_id": 1,
"title": "Production database is returning 500 errors on all queries",
"expected": "p0-critical",
"assigned": "p0-critical",
"exact_match": true,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 1.64
},
{
"issue_id": 2,
"title": "Login page takes 8 seconds to load",
"expected": "p1-high",
"assigned": "p2-medium",
"exact_match": false,
"off_by_one": true,
"valid_json": true,
"elapsed_s": 1.71
},
{
"issue_id": 3,
"title": "Add dark mode support to settings page",
"expected": "p3-low",
"assigned": "p2-medium",
"exact_match": false,
"off_by_one": true,
"valid_json": true,
"elapsed_s": 1.45
},
{
"issue_id": 4,
"title": "Email notifications sometimes arrive 10 minutes late",
"expected": "p2-medium",
"assigned": "p2-medium",
"exact_match": true,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 1.44
},
{
"issue_id": 5,
"title": "Security vulnerability: SQL injection possible in search end",
"expected": "p0-critical",
"assigned": "p0-critical",
"exact_match": true,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 1.52
}
]
}
},
"llama3.2:latest": {
"01_tool_calling": {
"benchmark": "tool_calling",
"model": "llama3.2:latest",
"total_prompts": 10,
"valid_json_count": 2,
"compliance_rate": 0.2,
"passed": false,
"total_time_s": 23.85,
"results": [
{
"prompt_id": 1,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 3.39,
"response_snippet": "import requests\n\ndef get_weather():\n url = \"http://api.openweathermap.org/data/2.5/weather?q=San Francisco&appid=YOUR"
},
{
"prompt_id": 2,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 2.61,
"response_snippet": "Here's an example of how you can use the `read_file` function to parse the contents of the `/etc/hosts` file and return "
},
{
"prompt_id": 3,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 0.8,
"response_snippet": "import requests\n\nresponse = requests.get('https://www.python.org/downloads/')\ndata = response.json()\n\nfor version in dat"
},
{
"prompt_id": 4,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 2.34,
"response_snippet": "Here's a simple Python function that calls the 'create_issue' function with the specified title and priority, and return"
},
{
"prompt_id": 5,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 2.62,
"response_snippet": "Here's an example of how you can implement the 'list_directory' tool using Python and the `os` module to list the conten"
},
{
"prompt_id": 6,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 2.35,
"response_snippet": "Here's a simple Python function that sends a notification to a Slack channel with a specific message:\n\n```python\nimport "
},
{
"prompt_id": 7,
"valid_json": true,
"has_expected_keys": false,
"elapsed_s": 2.9,
"response_snippet": "Here's how you can invoke the 'database_query' function with the given SQL query and return only valid JSON with the spe"
},
{
"prompt_id": 8,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 3.0,
"response_snippet": "Here's how you can use the `get_git_log` tool with the specified parameters and return only valid JSON with the required"
},
{
"prompt_id": 9,
"valid_json": true,
"has_expected_keys": true,
"elapsed_s": 2.85,
"response_snippet": "Here's a Python script that uses the `schedule` library to schedule a task and returns the result in JSON format:\n\n```py"
},
{
"prompt_id": 10,
"valid_json": false,
"has_expected_keys": false,
"elapsed_s": 1.01,
"response_snippet": "import requests\n\ndef resize_image(url, width, height):\n response = requests.get(url)\n if response.status_code == 2"
}
]
},
"02_code_generation": {
"benchmark": "code_generation",
"model": "llama3.2:latest",
"passed": true,
"detail": "fibonacci(10) = 55 \u2713",
"code_snippet": "def fibonacci(n):\n if n <= 1:\n return n\n a, b = 0, 1\n for _ in range(2, n + 1):\n a, b = b, a + b\n return b",
"elapsed_s": 0.69
},
"03_shell_commands": {
"benchmark": "shell_commands",
"model": "llama3.2:latest",
"total_prompts": 5,
"passed_count": 5,
"refused_count": 0,
"pass_rate": 1.0,
"passed": true,
"total_time_s": 10.91,
"results": [
{
"prompt_id": 1,
"description": "find and delete old logs",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 1.67,
"response_snippet": "You can use the following bash one-liner to find all .log files older than 7 days and delete them:\n\n```bash\nfind . -type"
},
{
"prompt_id": 2,
"description": "chmod recursive",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 1.44,
"response_snippet": "You can use the following bash command to recursively change permissions of all .sh files to executable:\n\n```bash\nfind ."
},
{
"prompt_id": 3,
"description": "kill processes by name",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 1.44,
"response_snippet": "You can use the following shell command to kill all processes matching the name 'python3':\n\n```bash\npkill -9 python3\n```"
},
{
"prompt_id": 4,
"description": "disk usage alert",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 4.49,
"response_snippet": "Here's a bash script snippet that monitors disk usage and sends an alert if usage exceeds 90%:\n\n```bash\n#!/bin/bash\n\n# S"
},
{
"prompt_id": 5,
"description": "tar compress logs",
"refused": false,
"has_shell_command": true,
"passed": true,
"elapsed_s": 1.87,
"response_snippet": "You can use the following bash command to create a compressed tar archive of the /var/log directory:\n\n```bash\ntar -czvf "
}
]
},
"04_multi_turn_coherence": {
"benchmark": "multi_turn_coherence",
"model": "llama3.2:latest",
"total_turns": 5,
"coherent_turns": 5,
"coherence_rate": 1.0,
"passed": true,
"total_time_s": 7.47,
"turns": [
{
"turn": 1,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.8,
"elapsed_s": 1.41,
"response_snippet": "{\n \"observation\": \"CPU usage has spiked to 95%, indicating a potential resource bottleneck on the web server.\",\n \"reasoning\": \"Given the sudden increase in CPU usage, it is likely that the server is"
},
{
"turn": 2,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.9,
"elapsed_s": 1.48,
"response_snippet": "{\n \"observation\": \"Three runaway Python processes are consuming 30% CPU each, indicating a potential memory leak or inefficient resource usage.\",\n \"reasoning\": \"Given the high CPU usage of these pro"
},
{
"turn": 3,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.85,
"elapsed_s": 1.79,
"response_snippet": "{\n \"observation\": \"Disk I/O is at 98%, indicating a potential storage bottleneck on the web server.\",\n \"reasoning\": \"Given the high disk I/O usage, it is likely that the server is experiencing a hig"
},
{
"turn": 4,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 0.95,
"elapsed_s": 1.52,
"response_snippet": "{\n \"observation\": \"All system metrics are now within nominal ranges, indicating that the server has recovered from the resource bottleneck.\",\n \"reasoning\": \"Given that the log rotation script was th"
},
{
"turn": 5,
"valid_json": true,
"has_required_keys": true,
"coherent": true,
"confidence": 1.0,
"elapsed_s": 1.27,
"response_snippet": "{\n \"observation\": \"The incident has been resolved, and all system metrics are now within nominal ranges.\",\n \"reasoning\": \"The root cause of the incident was identified as a stuck log rotation script"
}
]
},
"05_issue_triage": {
"benchmark": "issue_triage",
"model": "llama3.2:latest",
"total_issues": 5,
"exact_matches": 1,
"accuracy": 0.2,
"passed": false,
"total_time_s": 2.9,
"results": [
{
"issue_id": 1,
"title": "Production database is returning 500 errors on all queries",
"expected": "p0-critical",
"assigned": "p0-critical",
"exact_match": true,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 0.54
},
{
"issue_id": 2,
"title": "Login page takes 8 seconds to load",
"expected": "p1-high",
"assigned": "p3-low",
"exact_match": false,
"off_by_one": false,
"valid_json": true,
"elapsed_s": 0.62
},
{
"issue_id": 3,
"title": "Add dark mode support to settings page",
"expected": "p3-low",
"assigned": "p2-medium",
"exact_match": false,
"off_by_one": true,
"valid_json": true,
"elapsed_s": 0.66
},
{
"issue_id": 4,
"title": "Email notifications sometimes arrive 10 minutes late",
"expected": "p2-medium",
"assigned": "p3-low",
"exact_match": false,
"off_by_one": true,
"valid_json": true,
"elapsed_s": 0.58
},
{
"issue_id": 5,
"title": "Security vulnerability: SQL injection possible in search end",
"expected": "p0-critical",
"assigned": "p1-high",
"exact_match": false,
"off_by_one": true,
"valid_json": true,
"elapsed_s": 0.5
}
]
}
}
}