[P2-4] Run full quality comparison: turbo4 vs f16 on 10 test prompts #24

Closed
opened 2026-03-31 04:34:06 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: P2-1 (quality gate pass), P2-2 (prompts written), P2-3 (server running)

Steps

1. Start llama-server with f16 KV (baseline):

   ```bash
   ./llama-server -m <model> --port 8081 -c 32768 --kv-type f16
   ```

2. Run all 10 test prompts, save responses:

   ```bash
   python3 benchmarks/run_comparison.py --config f16 --output benchmarks/results_f16.json
   ```

3. Stop server. Restart with turbo4:

   ```bash
   ./llama-server -m <model> --port 8081 -c 32768 --kv-type turbo4
   ```

4. Run the same 10 prompts:

   ```bash
   python3 benchmarks/run_comparison.py --config turbo4 --output benchmarks/results_turbo4.json
   ```

5. Score and compare:

   ```bash
   python3 benchmarks/score_comparison.py \
     --baseline benchmarks/results_f16.json \
     --test benchmarks/results_turbo4.json \
     --output benchmarks/comparison_report.md
   ```
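The two-phase run above can be scripted so both configs share every flag except the KV type, which keeps the comparison apples-to-apples. A minimal sketch, assuming the server binary and script paths from the steps; the `server_cmd`/`comparison_cmd` helpers are hypothetical, not part of the repo:

```python
import shlex

def server_cmd(model: str, kv_type: str, port: int = 8081, ctx: int = 32768) -> list[str]:
    """Build the llama-server command line for one KV-cache config."""
    return shlex.split(
        f"./llama-server -m {model} --port {port} -c {ctx} --kv-type {kv_type}"
    )

def comparison_cmd(config: str) -> list[str]:
    """Build the run_comparison.py invocation; the output path is derived from the config name."""
    return [
        "python3", "benchmarks/run_comparison.py",
        "--config", config,
        "--output", f"benchmarks/results_{config}.json",
    ]

# Both configs differ only in --kv-type, so any quality delta can be
# attributed to the KV-cache type alone.
for cfg in ("f16", "turbo4"):
    print(server_cmd("model.gguf", cfg))
    print(comparison_cmd(cfg))
```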

Scoring criteria per prompt:

  • Does the response contain the expected_pattern? (binary pass/fail)
  • Response length (shorter = possible degradation)
  • Coherence (manual review for the first run)
  • Tool-call validity (#6 — did it produce valid JSON?)
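The criteria above can be sketched as a per-prompt scorer. A minimal sketch, assuming each result record carries a `response` string and its `expected_pattern`; the field names and the 0.5 length-ratio floor are assumptions, not the actual `score_comparison.py` schema:

```python
import json
import re

def score_prompt(baseline: dict, test: dict, length_ratio_floor: float = 0.5) -> dict:
    """Score one prompt: pattern match, length check, optional tool-call JSON validity."""
    resp = test["response"]
    score = {
        # Binary pass/fail: does the response contain the expected pattern?
        "pattern_pass": re.search(test["expected_pattern"], resp) is not None,
        # A response much shorter than baseline suggests degradation.
        "length_ok": len(resp) >= length_ratio_floor * len(baseline["response"]),
    }
    if test.get("is_tool_call"):
        # Tool-call validity (#6): the response must parse as JSON.
        try:
            json.loads(resp)
            score["tool_call_valid"] = True
        except json.JSONDecodeError:
            score["tool_call_valid"] = False
    return score
```

Coherence stays a manual check on the first run, as noted above; it is deliberately left out of the automated score.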

Acceptance Criteria

  • All 10 prompts run on both configs
  • Results saved as JSON
  • Comparison report generated
  • PASS: ≥ 9/10 prompts produce equivalent quality
  • Tool-call prompt (#6) works on turbo4
  • Multi-turn prompt (#7) retains context on turbo4
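The PASS criterion above reduces to a simple count gate; a minimal sketch (the 9-of-10 threshold comes from the criteria list; `passes_gate` is a hypothetical helper, not part of the benchmark scripts):

```python
def passes_gate(per_prompt_equivalent: list[bool], threshold: int = 9) -> bool:
    """PASS if at least `threshold` prompts scored equivalent quality on both configs."""
    return sum(per_prompt_equivalent) >= threshold
```

One degraded prompt out of ten still passes; two or more fail the gate.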
Timmy self-assigned this 2026-03-31 04:34:06 +00:00
Timmy closed this issue 2026-04-05 21:58:15 +00:00
Author
Owner

Stale issue — closed during backlog cleanup. Reopen if still relevant.