fix(benchmarks): separate quality measurement from efficiency proxy (issue #63)

- Add --quality flag to run_benchmarks.py that delegates to llama-perplexity
- Clarify that tokens/sec is an efficiency metric, not a quality (perplexity) measure
- Ollama cannot provide true logprob-based PPL (no logprob API)
- Quality gate now runs llama-perplexity binary directly when requested
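For context, "true logprob-based PPL" means perplexity computed from per-token log-probabilities, which is why an API that never exposes logprobs (like Ollama's) cannot produce it. A minimal sketch of the computation, with hypothetical names (not code from this repo):

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p_i)). Lower is better."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))

# Sanity check: a model assigning probability 0.5 to every token has PPL 2.
print(perplexity([math.log(0.5)] * 4))  # -> 2.0
```

Without the per-token `log p_i` values there is nothing to sum, so tokens/sec can only serve as an efficiency proxy, never a substitute quality signal.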

Closes #63
2026-04-26 10:55:40 -04:00
parent 7797b9b4c8
commit ccbcc8ab7b
2 changed files with 111 additions and 15 deletions


@@ -1,8 +1,9 @@
 #!/usr/bin/env python3
 """
-TurboQuant Perplexity Quality Gate (Issue #21)
+TurboQuant Perplexity Quality Gate (Issues #21, #63)
 Compares text generation quality between f16 KV and turbo4 KV cache
 configurations using llama.cpp's perplexity tool on the wikitext-2 corpus.
+Measures true perplexity via the llama-perplexity binary (logprob-based).
+Ollama cannot provide perplexity due to its missing logprob API (issue #63).
 Usage:
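The delegation to the llama-perplexity binary described in the commit message could look like the following sketch. The flag name `--quality` comes from the commit; the helper names and default paths are hypothetical, and the `-m`/`-f` options should be checked against `llama-perplexity --help` for the installed llama.cpp build:

```python
import argparse
import subprocess

def build_perplexity_cmd(model_path, corpus_path, binary="llama-perplexity"):
    """Assemble the llama-perplexity invocation (illustrative flags:
    -m selects the GGUF model, -f the evaluation text file)."""
    return [binary, "-m", model_path, "-f", corpus_path]

def run_quality_gate(model_path, corpus_path):
    """Run the binary directly and surface its exit status to the caller."""
    return subprocess.run(build_perplexity_cmd(model_path, corpus_path),
                          check=False).returncode

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--quality", action="store_true",
                        help="measure true perplexity instead of tokens/sec")
    args = parser.parse_args()
    if args.quality:
        # Paths below are placeholders, not the repo's actual defaults.
        raise SystemExit(run_quality_gate("model.gguf", "wiki.test.raw"))
```

Delegating to the external binary keeps the Python script as a thin driver: the logprob arithmetic stays in llama.cpp, and the benchmark harness only reports pass/fail.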