fix(benchmarks): separate quality measurement from efficiency proxy (issue #63)

- Add --quality flag to run_benchmarks.py that delegates to llama-perplexity
- Clarify that tokens/sec is an efficiency metric, not a quality (perplexity) measure
- Ollama cannot provide true logprob-based PPL (no logprob API)
- Quality gate now runs llama-perplexity binary directly when requested
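For context, "true logprob-based PPL" means perplexity computed from per-token log-probabilities, which is why an API that never exposes logprobs (like Ollama's) cannot produce it. A minimal sketch of the computation, with hypothetical names (not code from this repo):

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p_i)). Lower is better."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))

# Sanity check: a model assigning probability 0.5 to every token has PPL 2.
print(perplexity([math.log(0.5)] * 4))  # -> 2.0
```

Without the per-token `log p_i` values there is nothing to sum, so tokens/sec can only serve as an efficiency proxy, never a substitute quality signal.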

Closes #63
2026-04-26 10:55:40 -04:00
parent 7797b9b4c8
commit ccbcc8ab7b
2 changed files with 111 additions and 15 deletions


@@ -1,8 +1,9 @@
 #!/usr/bin/env python3
 """
-TurboQuant Perplexity Quality Gate (Issue #21)
+TurboQuant Perplexity Quality Gate (Issues #21, #63)
 Compares text generation quality between f16 KV and turbo4 KV cache
 configurations using llama.cpp's perplexity tool on the wikitext-2 corpus.
+Measures true perplexity via the llama-perplexity binary (logprob-based).
+Ollama cannot provide perplexity due to its missing logprob API (issue #63).
 Usage:
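The delegation to the llama-perplexity binary described in the commit message could look like the following sketch. The flag name `--quality` comes from the commit; the helper names and default paths are hypothetical, and the `-m`/`-f` options should be checked against `llama-perplexity --help` for the installed llama.cpp build:

```python
import argparse
import subprocess

def build_perplexity_cmd(model_path, corpus_path, binary="llama-perplexity"):
    """Assemble the llama-perplexity invocation (illustrative flags:
    -m selects the GGUF model, -f the evaluation text file)."""
    return [binary, "-m", model_path, "-f", corpus_path]

def run_quality_gate(model_path, corpus_path):
    """Run the binary directly and surface its exit status to the caller."""
    return subprocess.run(build_perplexity_cmd(model_path, corpus_path),
                          check=False).returncode

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--quality", action="store_true",
                        help="measure true perplexity instead of tokens/sec")
    args = parser.parse_args()
    if args.quality:
        # Paths below are placeholders, not the repo's actual defaults.
        raise SystemExit(run_quality_gate("model.gguf", "wiki.test.raw"))
```

Delegating to the external binary keeps the Python script as a thin driver: the logprob arithmetic stays in llama.cpp, and the benchmark harness only reports pass/fail.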