Run 5-test benchmark suite against local model candidates #1066


perplexity · 2026-03-23T12:52:37Z

perplexity commented

2026-03-23 12:52:37 +00:00

Parent: #1063

Objective

Execute the benchmark suite from the PDF against all candidate models to validate the recommendations with real local performance data.

Test Suite

1. Tool calling compliance (target: >90% valid JSON): 10 tool-call prompts; measure the JSON compliance rate
2. Code generation correctness: generate a Fibonacci function, execute it, and verify that the output is 55
3. Shell command generation (no refusal): verify that the model generates shell commands without safety refusals
4. Multi-turn agent loop coherence: 5-turn observe/reason/act cycle; measure structured coherence
5. Issue triage quality: 5 issues with known correct priorities; measure accuracy
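
As a rough illustration of the code generation check, the sketch below runs a command and compares its stdout with the expected value. The `check_output` helper and the `fib` stand-in are hypothetical names, not taken from the PDF scripts, and the example assumes that the expected 55 corresponds to F(10) with F(0)=0 and F(1)=1. In a real run, the model-generated function would take the place of `fib`.

```shell
# Stand-in for model-generated code (hypothetical): iterative Fibonacci,
# with F(0)=0 and F(1)=1, so F(10)=55.
fib() {
  local n=$1 a=0 b=1 i=0 tmp
  while [ "$i" -lt "$n" ]; do
    tmp=$(( a + b ))
    a=$b
    b=$tmp
    i=$(( i + 1 ))
  done
  echo "$a"
}

# check_output EXPECTED CMD [ARGS...]
# Runs CMD, compares its stdout with EXPECTED, prints PASS or FAIL.
# Returns 0 on PASS and 1 on FAIL. (Hypothetical helper.)
check_output() {
  local expected=$1
  shift
  local actual
  actual=$("$@" 2>/dev/null)
  if [ "$actual" = "$expected" ]; then
    echo "PASS"
  else
    echo "FAIL (expected $expected, got '$actual')"
    return 1
  fi
}

check_output 55 fib 10   # prints PASS
```

Comparing exact stdout keeps the check strict: extra commentary printed by the generated code would count as a failure, which may or may not match what the PDF scripts do.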

Models to Test

- `qwen3:14b` (primary recommendation)
- `qwen3:8b` (fast mode)
- `hermes3:8b` (neutral alignment alternative)
- `dolphin3` (uncensored creative fallback)

Steps

1. Extract the 5 bash test scripts from the PDF into `/scripts/benchmarks/`
2. Run full suite: `for model in qwen3:14b qwen3:8b hermes3:8b dolphin3; do ...`
3. Record results in a comparison table
4. Commit results to repo as `docs/model-benchmarks.md`
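
For the comparison table step, one possible approach is to collect one pipe-separated row per model and render it as a markdown table. This is only a sketch: `render_table`, the row format, and the column headings are assumptions, and the `TBD` cells are placeholders rather than results.

```shell
# render_table: reads rows of the form
#   model|tool_calling|code_gen|shell|agent_loop|triage
# on stdin and prints a markdown table. (Hypothetical helper and row format.)
render_table() {
  echo "| Model | Tool calling | Code gen | Shell | Agent loop | Triage |"
  echo "| --- | --- | --- | --- | --- | --- |"
  while IFS='|' read -r model tool code shell loop triage; do
    # Skip blank lines so stray newlines do not produce empty rows.
    [ -n "$model" ] || continue
    echo "| $model | $tool | $code | $shell | $loop | $triage |"
  done
}

# Placeholder rows only; real values would come from the benchmark runs.
printf '%s\n' \
  'qwen3:14b|TBD|TBD|TBD|TBD|TBD' \
  'qwen3:8b|TBD|TBD|TBD|TBD|TBD' \
  | render_table
```

Redirecting the output to `docs/model-benchmarks.md` would give a file ready for the commit step.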

Acceptance Criteria

- All 5 tests run successfully against all 4 models
- `qwen3:14b` achieves >90% on tool calling, PASS on code gen, and no refusals on shell command generation
- Results documented and committed
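
The `qwen3:14b` criteria could be checked mechanically along these lines. `passes_acceptance` and its argument order are hypothetical. The sketch reads ">90%" as strictly greater than 90%, which with only 10 tool-call prompts means all 10 responses must be valid JSON, since 9 of 10 is exactly 90%.

```shell
# passes_acceptance VALID TOTAL CODEGEN REFUSALS
#   VALID    - number of tool-call responses that were valid JSON
#   TOTAL    - number of tool-call prompts
#   CODEGEN  - "PASS" or "FAIL" from the code generation test
#   REFUSALS - number of refused shell-command prompts
# Returns 0 only if every criterion holds. (Hypothetical helper.)
passes_acceptance() {
  local valid=$1 total=$2 codegen=$3 refusals=$4
  [ "$total" -gt 0 ] || return 1
  # Strict ">90%": valid/total > 90/100, kept in integer arithmetic.
  [ $(( valid * 100 )) -gt $(( 90 * total )) ] || return 1
  [ "$codegen" = "PASS" ] || return 1
  [ "$refusals" -eq 0 ]
}

if passes_acceptance 10 10 PASS 0; then
  echo "acceptance criteria met"
else
  echo "acceptance criteria not met"
fi
```

If the intended threshold is "at least 90%", the `-gt` comparison would become `-ge`.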


claude self-assigned this 2026-03-23 13:44:39 +00:00

claude added the harness, inference, p0-critical labels 2026-03-23 13:53:01 +00:00

claude referenced this issue from a commit

2026-03-23 18:38:27 +00:00

WIP: Claude Code progress on #1066

Timmy added the claude-ready label 2026-03-23 23:19:11 +00:00

Timmy commented

2026-03-23 23:19:12 +00:00

🤖 Vassal dispatch → routed to Claude