diff --git a/benchmarks/run_benchmarks.py b/benchmarks/run_benchmarks.py
index 11367f5c..1bbe331b 100644
--- a/benchmarks/run_benchmarks.py
+++ b/benchmarks/run_benchmarks.py
@@ -5,8 +5,16 @@ TurboQuant Benchmarking Suite — Multi-Backend (Issue #29)
 Supports Ollama and llama-server backends with KV cache type configuration.
 Measures: TTFT, tokens/sec, latency, peak memory.
 
+IMPORTANT — Perplexity Limitation (Issue #63):
+    Ollama does NOT expose token logprobs. This means:
+    - True perplexity (PPL) cannot be measured via the Ollama backend
+    - The metrics here (tok/s, latency) are throughput proxies, not quality gates
+    - For real perplexity measurement, use benchmarks/run_perplexity.py
+      which calls llama-perplexity directly (--logprobs support)
+    - The pass criterion "PPL delta <= 0.5" cannot be validated via Ollama
+
 Usage:
-    # Ollama (default)
+    # Ollama (default) — throughput benchmarks only, NOT perplexity
     python3 benchmarks/run_benchmarks.py --backend ollama --model llama3
 
     # llama-server with turbo4 KV