Compare commits
1 commit

step35/75-... → fix/63-per

| Author | SHA1 | Date |
|---|---|---|
|  | 90b5eddfa1 |  |
```diff
@@ -5,8 +5,16 @@ TurboQuant Benchmarking Suite — Multi-Backend (Issue #29)
 Supports Ollama and llama-server backends with KV cache type configuration.
 Measures: TTFT, tokens/sec, latency, peak memory.
 
+IMPORTANT — Perplexity Limitation (Issue #63):
+
+Ollama does NOT expose token logprobs. This means:
+
+- True perplexity (PPL) cannot be measured via the Ollama backend
+- The metrics here (tok/s, latency) are throughput proxies, not quality gates
+- For real perplexity measurement, use benchmarks/run_perplexity.py
+  which calls llama-perplexity directly (--logprobs support)
+- The pass criterion "PPL delta <= 0.5" cannot be validated via Ollama
+
 Usage:
-# Ollama (default)
+# Ollama (default) — throughput benchmarks only, NOT perplexity
 python3 benchmarks/run_benchmarks.py --backend ollama --model llama3
 
 # llama-server with turbo4 KV
```
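To make the logprob dependency concrete, here is a minimal sketch (not part of the suite; the function name is illustrative) of how perplexity is computed from per-token log-probabilities. This is exactly the data Ollama's API does not return, which is why the diff above routes real PPL measurement through llama-perplexity instead:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p_i)). Lower is better."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))

# A model that assigns every token probability 0.25 has perplexity 4:
print(perplexity([math.log(0.25)] * 10))
```

Without the per-token `log p_i` values there is nothing to average, so a throughput-only backend can never validate a criterion like "PPL delta <= 0.5".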