[P2-4] Run full quality comparison: turbo4 vs f16 on 10 test prompts #24

Closed
opened 2026-03-31 04:34:06 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: P2-1 (quality gate pass), P2-2 (prompts written), P2-3 (server running)

Steps

1. Start llama-server with f16 KV (baseline):

   ```bash
   ./llama-server -m <model> --port 8081 -c 32768 --kv-type f16
   ```

2. Run all 10 test prompts, save responses:

   ```bash
   python3 benchmarks/run_comparison.py --config f16 --output benchmarks/results_f16.json
   ```

3. Stop server. Restart with turbo4:

   ```bash
   ./llama-server -m <model> --port 8081 -c 32768 --kv-type turbo4
   ```

4. Run the same 10 prompts:

   ```bash
   python3 benchmarks/run_comparison.py --config turbo4 --output benchmarks/results_turbo4.json
   ```

5. Score and compare:

   ```bash
   python3 benchmarks/score_comparison.py \
     --baseline benchmarks/results_f16.json \
     --test benchmarks/results_turbo4.json \
     --output benchmarks/comparison_report.md
   ```
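The two-phase run above can be scripted so both configs share every flag except the KV type, which keeps the comparison apples-to-apples. A minimal sketch, assuming the server binary and script paths from the steps; the `server_cmd`/`comparison_cmd` helpers are hypothetical, not part of the repo:

```python
import shlex

def server_cmd(model: str, kv_type: str, port: int = 8081, ctx: int = 32768) -> list[str]:
    """Build the llama-server command line for one KV-cache config."""
    return shlex.split(
        f"./llama-server -m {model} --port {port} -c {ctx} --kv-type {kv_type}"
    )

def comparison_cmd(config: str) -> list[str]:
    """Build the run_comparison.py invocation; the output path is derived from the config name."""
    return [
        "python3", "benchmarks/run_comparison.py",
        "--config", config,
        "--output", f"benchmarks/results_{config}.json",
    ]

# Both configs differ only in --kv-type, so any quality delta can be
# attributed to the KV-cache type alone.
for cfg in ("f16", "turbo4"):
    print(server_cmd("model.gguf", cfg))
    print(comparison_cmd(cfg))
```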

Scoring criteria per prompt:

  • Does the response contain the expected_pattern? (binary pass/fail)
  • Response length (shorter = possible degradation)
  • Coherence (manual review for the first run)
  • Tool-call validity (#6 — did it produce valid JSON?)
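The criteria above can be sketched as a per-prompt scorer. A minimal sketch, assuming each result record carries a `response` string and its `expected_pattern`; the field names and the 0.5 length-ratio floor are assumptions, not the actual `score_comparison.py` schema:

```python
import json
import re

def score_prompt(baseline: dict, test: dict, length_ratio_floor: float = 0.5) -> dict:
    """Score one prompt: pattern match, length check, optional tool-call JSON validity."""
    resp = test["response"]
    score = {
        # Binary pass/fail: does the response contain the expected pattern?
        "pattern_pass": re.search(test["expected_pattern"], resp) is not None,
        # A response much shorter than baseline suggests degradation.
        "length_ok": len(resp) >= length_ratio_floor * len(baseline["response"]),
    }
    if test.get("is_tool_call"):
        # Tool-call validity (#6): the response must parse as JSON.
        try:
            json.loads(resp)
            score["tool_call_valid"] = True
        except json.JSONDecodeError:
            score["tool_call_valid"] = False
    return score
```

Coherence stays a manual check on the first run, as noted above; it is deliberately left out of the automated score.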

Acceptance Criteria

  • All 10 prompts run on both configs
  • Results saved as JSON
  • Comparison report generated
  • PASS: ≥ 9/10 prompts produce equivalent quality
  • Tool-call prompt (#6) works on turbo4
  • Multi-turn prompt (#7) retains context on turbo4
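The PASS criterion above reduces to a simple count gate; a minimal sketch (the 9-of-10 threshold comes from the criteria list; `passes_gate` is a hypothetical helper, not part of the benchmark scripts):

```python
def passes_gate(per_prompt_equivalent: list[bool], threshold: int = 9) -> bool:
    """PASS if at least `threshold` prompts scored equivalent quality on both configs."""
    return sum(per_prompt_equivalent) >= threshold
```

One degraded prompt out of ten still passes; two or more fail the gate.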
Timmy self-assigned this 2026-03-31 04:34:06 +00:00
Timmy closed this issue 2026-04-05 21:58:15 +00:00
Author
Owner

Stale issue — closed during backlog cleanup. Reopen if still relevant.