# Model Benchmark Results

> Generated: 2026-03-24 01:28 UTC
> Ollama URL: `http://localhost:11434`
> Issue: [#1066](http://143.198.27.163:3000/rockachopa/Timmy-time-dashboard/issues/1066)

## Overview

This report documents the 5-test benchmark suite results for local model candidates.

### Model Availability vs. Spec

| Requested | Tested Substitute | Reason |
|-----------|-------------------|--------|
| `qwen3:14b` | `qwen2.5:14b` | `qwen3:14b` not pulled locally |
| `qwen3:8b` | `qwen3.5:latest` | `qwen3:8b` not pulled locally |
| `hermes3:8b` | `hermes3:8b` | Exact match |
| `dolphin3` | `llama3.2:latest` | `dolphin3` not pulled locally |

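
The substitutions above depend on which models are already pulled into the local Ollama instance. A minimal sketch of such a check follows, assuming the response shape of Ollama's `GET /api/tags` endpoint (a `models` list whose entries carry a `name`). The helper names `fetch_local_models` and `find_missing` are hypothetical and are not part of the benchmark suite.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same URL as in the report header


def fetch_local_models(base_url: str = OLLAMA_URL) -> list:
    """Return the names of models the Ollama server reports as pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        payload = json.load(resp)
    return [model["name"] for model in payload.get("models", [])]


def find_missing(requested: list, local: list) -> list:
    """Return the requested models that are not available locally.

    A bare name such as "dolphin3" is treated as "dolphin3:latest";
    this tag-matching rule is an assumption made for the sketch.
    """
    def normalize(name: str) -> str:
        return name if ":" in name else f"{name}:latest"

    available = {normalize(name) for name in local}
    return [name for name in requested if normalize(name) not in available]
```

Calling `find_missing` with the four requested models and the four tested substitutes as the local list returns the three models marked "not pulled locally" in the table.
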
## Summary Comparison Table

| Model | Passed | Tool Calling | Code Gen | Shell Gen | Coherence | Triage Acc | Time (s) |
|-------|--------|-------------|----------|-----------|-----------|------------|----------|
| `hermes3:8b` | 3/5 | 100% | PASS | PASS | 20% | 60% | 72.8 |
| `qwen3.5:latest` | 1/5 | 30% | FAIL | FAIL | 100% | 0% | 309.7 |
| `qwen2.5:14b` | 4/5 | 100% | PASS | PASS | 100% | 60% | 105.7 |
| `llama3.2:latest` | 3/5 | 20% | PASS | PASS | 100% | 20% | 45.8 |

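
The `Passed` and `Time (s)` columns can be recomputed from the per-benchmark entries in the raw JSON section of this report. A minimal sketch follows, assuming each entry carries a boolean `passed` and either `total_time_s` or `elapsed_s`, as the recorded entries do. `summarize_model` is a hypothetical helper, not the suite's own code.

```python
def summarize_model(benchmarks: dict) -> dict:
    """Recompute the summary columns for one model from its benchmark entries."""
    passed = sum(1 for entry in benchmarks.values() if entry.get("passed") is True)
    # Most entries report total_time_s; code generation reports elapsed_s instead.
    total_time = sum(
        entry.get("total_time_s", entry.get("elapsed_s", 0.0))
        for entry in benchmarks.values()
    )
    return {
        "passed": f"{passed}/{len(benchmarks)}",
        "time_s": round(total_time, 1),
    }
```

Applied to the `llama3.2:latest` figures listed in the per-model detail, and assuming the same field names, this yields `3/5` and `45.8`, matching the table row.
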
## Per-Model Detail

### `hermes3:8b`

#### Benchmark 1: Tool Calling Compliance — ✅ PASS

- **JSON Compliance:** 10/10 (100%) — target ≥90%
- **Time:** 9.84s

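
A minimal sketch of how a JSON-compliance check like this one could work. It assumes a response counts as compliant when it parses as a JSON object containing the `tool` and `args` keys seen in the recorded responses, and that a Markdown code fence around the JSON is tolerated. `check_tool_call` is a hypothetical name, and the suite's actual rules may differ.

```python
import json
import re

# Optional Markdown fence around the payload, e.g. a fenced json block.
FENCE_RE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)


def check_tool_call(response: str) -> dict:
    """Classify one model response for the tool-calling benchmark."""
    text = response.strip()
    fenced = FENCE_RE.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Empty or non-JSON output fails both checks.
        return {"valid_json": False, "has_expected_keys": False}
    has_keys = isinstance(data, dict) and {"tool", "args"} <= data.keys()
    return {"valid_json": True, "has_expected_keys": has_keys}
```

The compliance rate is then the share of prompts for which `valid_json` is true, which is compared with the 90% target.
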
#### Benchmark 2: Code Generation Correctness — ✅ PASS

- **Result:** fibonacci(10) = 55 ✓
- **Generated code snippet:**
```python
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)
```
- **Time:** 1.14s

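
A sketch of how a correctness check like this could be run, assuming the generated source is executed and `fibonacci(10)` is compared with 55. The traceback recorded for `qwen3.5:latest` suggests the real harness writes the code to a temporary file. This simplified version uses `exec` in a fresh namespace instead, and `check_fibonacci` is a hypothetical name.

```python
def check_fibonacci(source: str) -> tuple:
    """Run model-generated source and verify fibonacci(10) == 55.

    Returns (passed, detail). Untrusted code should only be executed
    inside a sandbox; this sketch skips that for brevity.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        result = namespace["fibonacci"](10)
    except Exception as exc:
        # Covers syntax errors, a missing definition, and runtime failures.
        return False, f"Runtime error: {type(exc).__name__}: {exc}"
    if result == 55:
        return True, "fibonacci(10) = 55 ✓"
    return False, f"fibonacci(10) = {result!r}, expected 55"
```

With an empty source string the lookup of `fibonacci` fails, so the check reports a runtime error instead of a pass.
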
#### Benchmark 3: Shell Command Generation — ✅ PASS

- **Passed:** 5/5 — **Refusals:** 0
- **Time:** 32.22s

#### Benchmark 4: Multi-Turn Coherence — ❌ FAIL

- **Coherent turns:** 1/5 (20%) — target ≥80%
- **Time:** 24.59s

#### Benchmark 5: Issue Triage Quality — ❌ FAIL

- **Accuracy:** 3/5 (60%) — target ≥80%
- **Time:** 5.06s

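
The accuracy figure counts exact label matches only, while the raw JSON also records an `off_by_one` flag. A minimal sketch of that scoring follows, assuming the four priority labels that appear in the raw JSON (`p0-critical` to `p3-low`) and treating `off_by_one` as a distance of one step on that scale. `score_triage` and `triage_accuracy` are hypothetical helpers.

```python
from typing import Optional

# Priority scale as it appears in the recorded results, most to least urgent.
PRIORITIES = ["p0-critical", "p1-high", "p2-medium", "p3-low"]


def score_triage(expected: str, assigned: Optional[str]) -> dict:
    """Compare one assigned priority label with the expected one."""
    if assigned not in PRIORITIES:
        # No label or an unknown label (e.g. invalid JSON) never matches.
        return {"exact_match": False, "off_by_one": False}
    distance = abs(PRIORITIES.index(expected) - PRIORITIES.index(assigned))
    return {"exact_match": distance == 0, "off_by_one": distance == 1}


def triage_accuracy(pairs: list) -> float:
    """Fraction of (expected, assigned) pairs that match exactly."""
    exact = sum(1 for exp, got in pairs if score_triage(exp, got)["exact_match"])
    return exact / len(pairs)
```

For the `hermes3:8b` run, three of the five pairs match exactly and the other two are one step off, which gives the 60% shown above.
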
### `qwen3.5:latest`

#### Benchmark 1: Tool Calling Compliance — ❌ FAIL

- **JSON Compliance:** 3/10 (30%) — target ≥90%
- **Time:** 85.86s

#### Benchmark 2: Code Generation Correctness — ❌ FAIL

- **Result:** Runtime error: Traceback (most recent call last):
  File "/var/folders/9k/v07xkpp133v03yynn9nx80fr0000gn/T/tmpddxkrf6i.py", line 3, in <module>
  result = fibonacci(10)
  ^^^^^^^^^
  NameError: name 'fibo
- **Time:** 18.68s

#### Benchmark 3: Shell Command Generation — ❌ FAIL

- **Passed:** 4/5 — **Refusals:** 0
- **Time:** 85.54s

#### Benchmark 4: Multi-Turn Coherence — ✅ PASS

- **Coherent turns:** 5/5 (100%) — target ≥80%
- **Time:** 67.17s

#### Benchmark 5: Issue Triage Quality — ❌ FAIL

- **Accuracy:** 0/5 (0%) — target ≥80%
- **Time:** 52.42s

### `qwen2.5:14b`

#### Benchmark 1: Tool Calling Compliance — ✅ PASS

- **JSON Compliance:** 10/10 (100%) — target ≥90%
- **Time:** 19.59s

#### Benchmark 2: Code Generation Correctness — ✅ PASS

- **Result:** fibonacci(10) = 55 ✓
- **Generated code snippet:**
```python
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n + 1):
            a, b = b, a + b
        return b
```
- **Time:** 2.81s

#### Benchmark 3: Shell Command Generation — ✅ PASS

- **Passed:** 5/5 — **Refusals:** 0
- **Time:** 48.7s

#### Benchmark 4: Multi-Turn Coherence — ✅ PASS

- **Coherent turns:** 5/5 (100%) — target ≥80%
- **Time:** 26.79s

#### Benchmark 5: Issue Triage Quality — ❌ FAIL

- **Accuracy:** 3/5 (60%) — target ≥80%
- **Time:** 7.76s

### `llama3.2:latest`

#### Benchmark 1: Tool Calling Compliance — ❌ FAIL

- **JSON Compliance:** 2/10 (20%) — target ≥90%
- **Time:** 23.85s

#### Benchmark 2: Code Generation Correctness — ✅ PASS

- **Result:** fibonacci(10) = 55 ✓
- **Generated code snippet:**
```python
def fibonacci(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
- **Time:** 0.69s

#### Benchmark 3: Shell Command Generation — ✅ PASS

- **Passed:** 5/5 — **Refusals:** 0
- **Time:** 10.91s

#### Benchmark 4: Multi-Turn Coherence — ✅ PASS

- **Coherent turns:** 5/5 (100%) — target ≥80%
- **Time:** 7.47s

#### Benchmark 5: Issue Triage Quality — ❌ FAIL

- **Accuracy:** 1/5 (20%) — target ≥80%
- **Time:** 2.9s

## Raw JSON Data

<details>
<summary>Click to expand full JSON results</summary>

```json
{
  "hermes3:8b": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "hermes3:8b",
      "total_prompts": 10,
      "valid_json_count": 10,
      "compliance_rate": 1.0,
      "passed": true,
      "total_time_s": 9.84,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 3.23,
          "response_snippet": "{\n \"tool\": \"get_weather\",\n \"args\": \"San Francisco\"\n}"
        },
        {
          "prompt_id": 2,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.58,
          "response_snippet": "{\n \"tool\": \"read_file\",\n \"args\": [\n \"/etc/hosts\"\n ]\n}"
        },
        {
          "prompt_id": 3,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.51,
          "response_snippet": "{\n \"tool\": \"search_web\",\n \"args\": \"latest Python release\"\n}"
        },
        {
          "prompt_id": 4,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.81,
          "response_snippet": "{\n \"tool\": \"Jira\",\n \"args\": {\n \"title\": \"Fix login bug\",\n \"priority\": \"high\"\n }\n}"
        },
        {
          "prompt_id": 5,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.53,
          "response_snippet": "{\n \"tool\": \"list_directory\",\n \"args\": [\"/home/user/projects\"]\n}"
        },
        {
          "prompt_id": 6,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.76,
          "response_snippet": "{\n \"tool\": \"send_notification\",\n \"args\": {\n \"message\": \"Deploy complete\",\n \"channel\": \"slack\"\n }\n}"
        },
        {
          "prompt_id": 7,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.69,
          "response_snippet": "{\n \"tool\": \"database_query\",\n \"args\": {\n \"sql\": \"SELECT COUNT(*) FROM users\"\n }\n}"
        },
        {
          "prompt_id": 8,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.77,
          "response_snippet": "{\n \"tool\": \"get_git_log\",\n \"args\": [\n \"limit=10\",\n \"branch='main'\"\n ]\n}"
        },
        {
          "prompt_id": 9,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 0.94,
          "response_snippet": "{\n \"tool\": \"schedule_task\",\n \"args\": {\n \"cron\": \"0 9 * * MON-FRI\",\n \"task\": \"generate_report\"\n }\n}"
        },
        {
          "prompt_id": 10,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.02,
          "response_snippet": "{\n \"tool\": \"resize_image\",\n \"args\": {\n \"url\": \"https://example.com/photo.jpg\",\n \"width\": 800,\n \"height\": 600\n"
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "hermes3:8b",
      "passed": true,
      "detail": "fibonacci(10) = 55 \u2713",
      "code_snippet": "def fibonacci(n):\n if n == 0:\n return 0\n elif n == 1:\n return 1\n else:\n return fibonacci(n - 1) + fibonacci(n - 2)",
      "elapsed_s": 1.14
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "hermes3:8b",
      "total_prompts": 5,
      "passed_count": 5,
      "refused_count": 0,
      "pass_rate": 1.0,
      "passed": true,
      "total_time_s": 32.22,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 5.33,
          "response_snippet": "Here's a bash one-liner to find all .log files older than 7 days and delete them:\n\nfind /path/to/search -type f -name \"*"
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 5.56,
          "response_snippet": "To recursively change the permissions of all .sh files to be executable, you can use the following bash command:\n\n```bas"
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 4.89,
          "response_snippet": "To kill all processes matching the name 'python3', you can use the following shell command:\n\n```bash\npkill -f python3\n``"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 10.08,
          "response_snippet": "Here's a bash script snippet that monitors disk usage and sends an alert if the usage exceeds 90%:\n\n```bash\n#!/bin/bash\n"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 6.36,
          "response_snippet": "To create a compressed tar archive of the /var/log directory, you can use the following bash command:\n\n```\ntar -czvf /pa"
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "hermes3:8b",
      "total_turns": 5,
      "coherent_turns": 1,
      "coherence_rate": 0.2,
      "passed": false,
      "total_time_s": 24.59,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.9,
          "elapsed_s": 4.14,
          "response_snippet": "{\n \"observation\": \"The CPU usage on the monitored web server has spiked to 95%, which is significantly higher than the normal operating range. This high CPU usage could indicate that the server is ex"
        },
        {
          "turn": 2,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 5.02,
          "response_snippet": "{\n \"observation\": \"After investigating the server logs and performance metrics, it was found that the CPU spike was caused by three runaway Python processes, each consuming approximately 30% of the C"
        },
        {
          "turn": 3,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 5.31,
          "response_snippet": "{\n \"observation\": \"After terminating the top two runaway Python processes, the CPU usage has decreased from 95% to 45%. However, a new alert has been triggered, indicating that the disk I/O usage has"
        },
        {
          "turn": 4,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 5.37,
          "response_snippet": "{\n \"observation\": \"After investigating the high disk I/O usage, it was determined that the issue was caused by a log rotation script that had become stuck, resulting in excessive disk write operation"
        },
        {
          "turn": 5,
          "valid_json": false,
          "has_required_keys": false,
          "coherent": false,
          "confidence": null,
          "elapsed_s": 4.75,
          "response_snippet": "{\n \"observation\": \"The incident involving high CPU usage and disk I/O has been successfully resolved. The CPU spike was caused by three runaway Python processes, which were terminated, and the disk I"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "hermes3:8b",
      "total_issues": 5,
      "exact_matches": 3,
      "accuracy": 0.6,
      "passed": false,
      "total_time_s": 5.06,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.18
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": "p1-high",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 0.9
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.96
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": "p2-medium",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.09
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": "p1-high",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.94
        }
      ]
    }
  },
  "qwen3.5:latest": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "qwen3.5:latest",
      "total_prompts": 10,
      "valid_json_count": 3,
      "compliance_rate": 0.3,
      "passed": false,
      "total_time_s": 85.86,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 13.22,
          "response_snippet": ""
        },
        {
          "prompt_id": 2,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 8.07,
          "response_snippet": "{\n \"tool\": \"read_file\",\n \"args\": {\n \"path\": \"/etc/hosts\"\n }\n}"
        },
        {
          "prompt_id": 3,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 3.49,
          "response_snippet": "```json\n{\n \"tool\": \"search_web\",\n \"args\": {\n \"query\": \"latest Python release\"\n }\n}\n```"
        },
        {
          "prompt_id": 4,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.36,
          "response_snippet": ""
        },
        {
          "prompt_id": 5,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.18,
          "response_snippet": ""
        },
        {
          "prompt_id": 6,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.1,
          "response_snippet": ""
        },
        {
          "prompt_id": 7,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.14,
          "response_snippet": ""
        },
        {
          "prompt_id": 8,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 5.27,
          "response_snippet": "{\n \"tool\": \"get_git_log\",\n \"args\": {\n \"limit\": 10,\n \"branch\": \"main\"\n }\n}"
        },
        {
          "prompt_id": 9,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.82,
          "response_snippet": ""
        },
        {
          "prompt_id": 10,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 9.19,
          "response_snippet": ""
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "qwen3.5:latest",
      "passed": false,
      "detail": "Runtime error: Traceback (most recent call last):\n File \"/var/folders/9k/v07xkpp133v03yynn9nx80fr0000gn/T/tmpddxkrf6i.py\", line 3, in <module>\n result = fibonacci(10)\n ^^^^^^^^^\nNameError: name 'fibo",
      "code_snippet": "",
      "elapsed_s": 18.68
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "qwen3.5:latest",
      "total_prompts": 5,
      "passed_count": 4,
      "refused_count": 0,
      "pass_rate": 0.8,
      "passed": false,
      "total_time_s": 85.54,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": false,
          "passed": false,
          "elapsed_s": 19.16,
          "response_snippet": ""
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 18.12,
          "response_snippet": "To recursively change the permissions of all `.sh` files in the current directory and its subdirectories to executable, "
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 13.06,
          "response_snippet": "To kill all processes matching the name `python3`, use one of the following commands:\n\n### Recommended (Simplest)\n```bas"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 18.2,
          "response_snippet": "Here's a robust and production-ready Bash script snippet that monitors disk usage and sends an alert if usage exceeds 90"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 17.01,
          "response_snippet": "To create a compressed tar archive of the `/var/log` directory using gzip (the most common compression method), use the "
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "qwen3.5:latest",
      "total_turns": 5,
      "coherent_turns": 5,
      "coherence_rate": 1.0,
      "passed": true,
      "total_time_s": 67.17,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.85,
          "elapsed_s": 13.14,
          "response_snippet": "{\n \"observation\": \"Web server CPU usage has spiked to 95%, indicating potential performance issues that could affect service availability and user experience\",\n \"reasoning\": \"High CPU usage requires"
        },
        {
          "turn": 2,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.92,
          "elapsed_s": 12.84,
          "response_snippet": "{\n \"observation\": \"Three Python processes identified as consuming excessive CPU resources (30% each, totaling ~90%), confirming the root cause of the CPU spike\",\n \"reasoning\": \"These runaway process"
        },
        {
          "turn": 3,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.88,
          "elapsed_s": 15.74,
          "response_snippet": "{\n \"observation\": \"Disk I/O has spiked to 98% after CPU reduction, indicating a new bottleneck that could severely impact database performance and application responsiveness\",\n \"reasoning\": \"High di"
        },
        {
          "turn": 4,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.95,
          "elapsed_s": 13.42,
          "response_snippet": "{\n \"observation\": \"All server metrics have returned to nominal levels after terminating the runaway Python processes and fixing the stuck log rotation script\",\n \"reasoning\": \"The incident has been r"
        },
        {
          "turn": 5,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.98,
          "elapsed_s": 12.03,
          "response_snippet": "{\n \"observation\": \"Incident fully resolved - all metrics nominal, server stable\",\n \"reasoning\": \"Now documenting the incident for organizational learning and prevention. The post-mortem should captu"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "qwen3.5:latest",
      "total_issues": 5,
      "exact_matches": 0,
      "accuracy": 0.0,
      "passed": false,
      "total_time_s": 52.42,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.36
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.25
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.05
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.99
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": null,
          "exact_match": false,
          "off_by_one": false,
          "valid_json": false,
          "elapsed_s": 10.77
        }
      ]
    }
  },
  "qwen2.5:14b": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "qwen2.5:14b",
      "total_prompts": 10,
      "valid_json_count": 10,
      "compliance_rate": 1.0,
      "passed": true,
      "total_time_s": 19.59,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 6.87,
          "response_snippet": "{\n \"tool\": \"get_weather\",\n \"args\": {\n \"location\": \"San Francisco\"\n }\n}"
        },
        {
          "prompt_id": 2,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.18,
          "response_snippet": "{\n \"tool\": \"read_file\",\n \"args\": {\n \"path\": \"/etc/hosts\"\n }\n}"
        },
        {
          "prompt_id": 3,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.19,
          "response_snippet": "{\n \"tool\": \"search_web\",\n \"args\": {\n \"query\": \"latest Python release\"\n }\n}"
        },
        {
          "prompt_id": 4,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.61,
          "response_snippet": "{\n \"tool\": \"create_issue\",\n \"args\": {\n \"title\": \"Fix login bug\",\n \"priority\": \"high\"\n }\n}"
        },
        {
          "prompt_id": 5,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.13,
          "response_snippet": "{\n \"tool\": \"list_directory\",\n \"args\": {\n \"path\": \"/home/user/projects\"\n }\n}"
        },
        {
          "prompt_id": 6,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.36,
          "response_snippet": "{\n \"tool\": \"send_notification\",\n \"args\": {\n \"message\": \"Deploy complete\",\n \"channel\": \"slack\"\n }\n}"
        },
        {
          "prompt_id": 7,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.21,
          "response_snippet": "{\n \"tool\": \"database_query\",\n \"args\": {\n \"sql\": \"SELECT COUNT(*) FROM users\"\n }\n}"
        },
        {
          "prompt_id": 8,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.46,
          "response_snippet": "{\n \"tool\": \"get_git_log\",\n \"args\": {\n \"limit\": 10,\n \"branch\": \"main\"\n }\n}"
        },
        {
          "prompt_id": 9,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.65,
          "response_snippet": "{\n \"tool\": \"schedule_task\",\n \"args\": {\n \"cron\": \"0 9 * * MON-FRI\",\n \"task\": \"generate_report\"\n }\n}"
        },
        {
          "prompt_id": 10,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 1.93,
          "response_snippet": "{\n \"tool\": \"resize_image\",\n \"args\": {\n \"url\": \"https://example.com/photo.jpg\",\n \"width\": 800,\n \"height\": 600\n"
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "qwen2.5:14b",
      "passed": true,
      "detail": "fibonacci(10) = 55 \u2713",
      "code_snippet": "def fibonacci(n):\n if n == 0:\n return 0\n elif n == 1:\n return 1\n else:\n a, b = 0, 1\n for _ in range(2, n + 1):\n a, b = b, a + b\n return b",
      "elapsed_s": 2.81
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "qwen2.5:14b",
      "total_prompts": 5,
      "passed_count": 5,
      "refused_count": 0,
      "pass_rate": 1.0,
      "passed": true,
      "total_time_s": 48.7,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 7.5,
          "response_snippet": "You can use the following bash one-liner to find and delete all `.log` files that are older than 7 days:\n\n```bash\nfind /"
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 6.4,
          "response_snippet": "To recursively change the permissions of all `.sh` files to make them executable, you can use the following command:\n\n``"
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 7.15,
          "response_snippet": "To kill all processes matching the name 'python3', you can use a combination of `pgrep` and `pkill` commands. Here is th"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 20.46,
          "response_snippet": "Certainly! Below is a simple Bash script snippet that monitors disk usage and sends an alert via email if the disk usage"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 7.19,
          "response_snippet": "To create a compressed tar archive of the `/var/log` directory, you can use the `tar` command with the `czf` options. Th"
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "qwen2.5:14b",
      "total_turns": 5,
      "coherent_turns": 5,
      "coherence_rate": 1.0,
      "passed": true,
      "total_time_s": 26.79,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.85,
          "elapsed_s": 5.51,
          "response_snippet": "{\n \"observation\": \"The CPU usage of the web server has suddenly increased to 95%, indicating a high load on the server.\",\n \"reasoning\": \"This spike could be due to a sudden increase in traffic, a ma"
        },
        {
          "turn": 2,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.9,
          "elapsed_s": 5.03,
          "response_snippet": "{\n \"observation\": \"Three runaway Python processes are consuming 30% CPU each, contributing to the overall high CPU usage of 95%.\",\n \"reasoning\": \"These processes are likely the cause of the high CPU"
        },
        {
          "turn": 3,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.88,
          "elapsed_s": 6.5,
          "response_snippet": "{\n \"observation\": \"After terminating two of the runaway Python processes, the CPU usage has dropped to 45%. However, a new alert indicates that disk I/O is now at 98%, suggesting a potential bottlene"
        },
        {
          "turn": 4,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.95,
          "elapsed_s": 5.37,
          "response_snippet": "{\n \"observation\": \"The high disk I/O was traced to a stuck log rotation script, which was terminated. After termination, the disk I/O dropped to 20%, and all system metrics are now nominal.\",\n \"reas"
        },
        {
          "turn": 5,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 1.0,
          "elapsed_s": 4.39,
          "response_snippet": "{\n \"observation\": \"The incident involving high CPU usage and disk I/O has been resolved. The root cause was identified as runaway Python processes and a stuck log rotation script.\",\n \"reasoning\": \"T"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "qwen2.5:14b",
      "total_issues": 5,
      "exact_matches": 3,
      "accuracy": 0.6,
      "passed": false,
      "total_time_s": 7.76,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.64
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 1.71
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 1.45
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": "p2-medium",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.44
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 1.52
        }
      ]
    }
  },
  "llama3.2:latest": {
    "01_tool_calling": {
      "benchmark": "tool_calling",
      "model": "llama3.2:latest",
      "total_prompts": 10,
      "valid_json_count": 2,
      "compliance_rate": 0.2,
      "passed": false,
      "total_time_s": 23.85,
      "results": [
        {
          "prompt_id": 1,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 3.39,
          "response_snippet": "import requests\n\ndef get_weather():\n url = \"http://api.openweathermap.org/data/2.5/weather?q=San Francisco&appid=YOUR"
        },
        {
          "prompt_id": 2,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.61,
          "response_snippet": "Here's an example of how you can use the `read_file` function to parse the contents of the `/etc/hosts` file and return "
        },
        {
          "prompt_id": 3,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 0.8,
          "response_snippet": "import requests\n\nresponse = requests.get('https://www.python.org/downloads/')\ndata = response.json()\n\nfor version in dat"
        },
        {
          "prompt_id": 4,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.34,
          "response_snippet": "Here's a simple Python function that calls the 'create_issue' function with the specified title and priority, and return"
        },
        {
          "prompt_id": 5,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.62,
          "response_snippet": "Here's an example of how you can implement the 'list_directory' tool using Python and the `os` module to list the conten"
        },
        {
          "prompt_id": 6,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 2.35,
          "response_snippet": "Here's a simple Python function that sends a notification to a Slack channel with a specific message:\n\n```python\nimport "
        },
        {
          "prompt_id": 7,
          "valid_json": true,
          "has_expected_keys": false,
          "elapsed_s": 2.9,
          "response_snippet": "Here's how you can invoke the 'database_query' function with the given SQL query and return only valid JSON with the spe"
        },
        {
          "prompt_id": 8,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 3.0,
          "response_snippet": "Here's how you can use the `get_git_log` tool with the specified parameters and return only valid JSON with the required"
        },
        {
          "prompt_id": 9,
          "valid_json": true,
          "has_expected_keys": true,
          "elapsed_s": 2.85,
          "response_snippet": "Here's a Python script that uses the `schedule` library to schedule a task and returns the result in JSON format:\n\n```py"
        },
        {
          "prompt_id": 10,
          "valid_json": false,
          "has_expected_keys": false,
          "elapsed_s": 1.01,
          "response_snippet": "import requests\n\ndef resize_image(url, width, height):\n response = requests.get(url)\n if response.status_code == 2"
        }
      ]
    },
    "02_code_generation": {
      "benchmark": "code_generation",
      "model": "llama3.2:latest",
      "passed": true,
      "detail": "fibonacci(10) = 55 \u2713",
      "code_snippet": "def fibonacci(n):\n    if n <= 1:\n        return n\n    a, b = 0, 1\n    for _ in range(2, n + 1):\n        a, b = b, a + b\n    return b",
      "elapsed_s": 0.69
    },
    "03_shell_commands": {
      "benchmark": "shell_commands",
      "model": "llama3.2:latest",
      "total_prompts": 5,
      "passed_count": 5,
      "refused_count": 0,
      "pass_rate": 1.0,
      "passed": true,
      "total_time_s": 10.91,
      "results": [
        {
          "prompt_id": 1,
          "description": "find and delete old logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.67,
          "response_snippet": "You can use the following bash one-liner to find all .log files older than 7 days and delete them:\n\n```bash\nfind . -type"
        },
        {
          "prompt_id": 2,
          "description": "chmod recursive",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.44,
          "response_snippet": "You can use the following bash command to recursively change permissions of all .sh files to executable:\n\n```bash\nfind ."
        },
        {
          "prompt_id": 3,
          "description": "kill processes by name",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.44,
          "response_snippet": "You can use the following shell command to kill all processes matching the name 'python3':\n\n```bash\npkill -9 python3\n```"
        },
        {
          "prompt_id": 4,
          "description": "disk usage alert",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 4.49,
          "response_snippet": "Here's a bash script snippet that monitors disk usage and sends an alert if usage exceeds 90%:\n\n```bash\n#!/bin/bash\n\n# S"
        },
        {
          "prompt_id": 5,
          "description": "tar compress logs",
          "refused": false,
          "has_shell_command": true,
          "passed": true,
          "elapsed_s": 1.87,
          "response_snippet": "You can use the following bash command to create a compressed tar archive of the /var/log directory:\n\n```bash\ntar -czvf "
        }
      ]
    },
    "04_multi_turn_coherence": {
      "benchmark": "multi_turn_coherence",
      "model": "llama3.2:latest",
      "total_turns": 5,
      "coherent_turns": 5,
      "coherence_rate": 1.0,
      "passed": true,
      "total_time_s": 7.47,
      "turns": [
        {
          "turn": 1,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.8,
          "elapsed_s": 1.41,
          "response_snippet": "{\n \"observation\": \"CPU usage has spiked to 95%, indicating a potential resource bottleneck on the web server.\",\n \"reasoning\": \"Given the sudden increase in CPU usage, it is likely that the server is"
        },
        {
          "turn": 2,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.9,
          "elapsed_s": 1.48,
          "response_snippet": "{\n \"observation\": \"Three runaway Python processes are consuming 30% CPU each, indicating a potential memory leak or inefficient resource usage.\",\n \"reasoning\": \"Given the high CPU usage of these pro"
        },
        {
          "turn": 3,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.85,
          "elapsed_s": 1.79,
          "response_snippet": "{\n \"observation\": \"Disk I/O is at 98%, indicating a potential storage bottleneck on the web server.\",\n \"reasoning\": \"Given the high disk I/O usage, it is likely that the server is experiencing a hig"
        },
        {
          "turn": 4,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 0.95,
          "elapsed_s": 1.52,
          "response_snippet": "{\n \"observation\": \"All system metrics are now within nominal ranges, indicating that the server has recovered from the resource bottleneck.\",\n \"reasoning\": \"Given that the log rotation script was th"
        },
        {
          "turn": 5,
          "valid_json": true,
          "has_required_keys": true,
          "coherent": true,
          "confidence": 1.0,
          "elapsed_s": 1.27,
          "response_snippet": "{\n \"observation\": \"The incident has been resolved, and all system metrics are now within nominal ranges.\",\n \"reasoning\": \"The root cause of the incident was identified as a stuck log rotation script"
        }
      ]
    },
    "05_issue_triage": {
      "benchmark": "issue_triage",
      "model": "llama3.2:latest",
      "total_issues": 5,
      "exact_matches": 1,
      "accuracy": 0.2,
      "passed": false,
      "total_time_s": 2.9,
      "results": [
        {
          "issue_id": 1,
          "title": "Production database is returning 500 errors on all queries",
          "expected": "p0-critical",
          "assigned": "p0-critical",
          "exact_match": true,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 0.54
        },
        {
          "issue_id": 2,
          "title": "Login page takes 8 seconds to load",
          "expected": "p1-high",
          "assigned": "p3-low",
          "exact_match": false,
          "off_by_one": false,
          "valid_json": true,
          "elapsed_s": 0.62
        },
        {
          "issue_id": 3,
          "title": "Add dark mode support to settings page",
          "expected": "p3-low",
          "assigned": "p2-medium",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.66
        },
        {
          "issue_id": 4,
          "title": "Email notifications sometimes arrive 10 minutes late",
          "expected": "p2-medium",
          "assigned": "p3-low",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.58
        },
        {
          "issue_id": 5,
          "title": "Security vulnerability: SQL injection possible in search end",
          "expected": "p0-critical",
          "assigned": "p1-high",
          "exact_match": false,
          "off_by_one": true,
          "valid_json": true,
          "elapsed_s": 0.5
        }
      ]
    }
  }
}
```
</details>
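
The per-issue `exact_match` and `off_by_one` flags in the raw triage results can be re-aggregated with a few lines of Python. The sketch below is illustrative only: the rows are copied from the `llama3.2:latest` `05_issue_triage` results, the helper name `triage_rates` is hypothetical, and the "within one level" rate is an assumed secondary figure that the benchmark itself does not report or use for pass/fail.

```python
# Rows copied from the "05_issue_triage" results for llama3.2:latest.
# Only the fields needed for aggregation are kept.
triage_results = [
    {"issue_id": 1, "exact_match": True, "off_by_one": False},
    {"issue_id": 2, "exact_match": False, "off_by_one": False},
    {"issue_id": 3, "exact_match": False, "off_by_one": True},
    {"issue_id": 4, "exact_match": False, "off_by_one": True},
    {"issue_id": 5, "exact_match": False, "off_by_one": True},
]


def triage_rates(results):
    """Return (exact accuracy, within-one-level rate) for triage rows.

    The within-one-level rate is a hypothetical metric, not part of the
    benchmark's own scoring.
    """
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    exact = sum(1 for r in results if r["exact_match"])
    near = sum(1 for r in results if r["exact_match"] or r["off_by_one"])
    return exact / total, near / total


exact_rate, within_one_rate = triage_rates(triage_results)
print(f"exact={exact_rate:.2f} within_one={within_one_rate:.2f}")
# prints: exact=0.20 within_one=0.80
```

The exact rate matches the recorded `accuracy` of 0.2. The looser figure suggests that most of this model's misses landed one priority level away from the expected label.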