Compare commits


1 Commit

Author: Alexander Whitestone
SHA1: 90b5eddfa1
Date: 2026-04-14 23:23:38 -04:00
Checks: Smoke Test / smoke (pull_request) successful in 26s
Message: docs: Document Ollama perplexity limitation — no logprob support (closes #63)

    Ollama lacks a token logprob API, so true perplexity cannot be measured
    via the Ollama backend. Added a warning to the run_benchmarks.py docstring
    directing users to run_perplexity.py (which calls the llama-perplexity
    binary) for real PPL measurement with --logprobs support.

2 changed files with 9 additions and 73 deletions


@@ -5,8 +5,16 @@ TurboQuant Benchmarking Suite — Multi-Backend (Issue #29)
 Supports Ollama and llama-server backends with KV cache type configuration.
 Measures: TTFT, tokens/sec, latency, peak memory.
+IMPORTANT — Perplexity Limitation (Issue #63):
+Ollama does NOT expose token logprobs. This means:
+- True perplexity (PPL) cannot be measured via the Ollama backend
+- The metrics here (tok/s, latency) are throughput proxies, not quality gates
+- For real perplexity measurement, use benchmarks/run_perplexity.py
+  which calls llama-perplexity directly (--logprobs support)
+- The pass criterion "PPL delta <= 0.5" cannot be validated via Ollama
 Usage:
-    # Ollama (default)
+    # Ollama (default) — throughput benchmarks only, NOT perplexity
     python3 benchmarks/run_benchmarks.py --backend ollama --model llama3
     # llama-server with turbo4 KV
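The limitation documented above comes down to arithmetic: perplexity is a function of per-token log-probabilities, so a backend that never returns them gives you nothing to compute with. A minimal sketch of the computation (a hypothetical helper for illustration, not part of the benchmark suite):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(logprob_i)).

    Without per-token logprobs (as with the Ollama API), there is
    nothing to feed this function -- hence the limitation above.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# If every token had probability 1/4, PPL is exactly 4.
print(perplexity([math.log(0.25)] * 10))
```

Validating a criterion like "PPL delta <= 0.5" needs this value for both the baseline and the quantized model, which is why logprob support in the backend is a hard requirement.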


@@ -1,72 +0,0 @@
# TurboQuant Initiative Review — Issue #17

**Date:** 2026-04-14
**Reviewer:** Timmy (burn worker)
**Issue:** #17 — TurboQuant Initiative Review & Contributor Feedback

---
## Current State
### What's Done (Phase 1 — Complete)
- PolarQuant MVP: WHT rotation + Lloyd-Max codebook, 4-bit KV cache
- Metal shaders: Full flash attention for turbo2/3/4, WHT kernels, codebooks
- CPU reference implementation: `llama-turbo.h` / `llama-turbo.cpp`
- Benchmarks: 73% KV memory savings, 1% prompt overhead, 11% generation overhead
- Fork builds clean: cmake + make, all binaries functional
- Build spec v2.2 (Strago) aligned with implementation
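For orientation, the WHT rotation mentioned in the PolarQuant MVP is the standard fast Walsh-Hadamard transform. A minimal unnormalized sketch of the idea (illustrative only; the project's actual implementations live in the Metal kernels and llama-turbo.cpp):

```python
def fwht(x):
    """Unnormalized in-place fast Walsh-Hadamard transform.
    len(x) must be a power of two; O(n log n) butterfly passes."""
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                # Butterfly: sum and difference of the pair.
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return x

# A unit impulse is spread evenly across all coordinates,
# which is what makes the subsequent quantization better behaved.
print(fwht([1, 0, 0, 0]))  # [1, 1, 1, 1]
```

Applying the transform twice scales the input by n, so the inverse is the same routine followed by a division by n.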
### What's Not Done (Phase 2 — In Progress)
- Integration into main llama.cpp fork (PR not submitted)
- QJL residual correction (1-bit Johnson-Lindenstrauss)
- Unit tests for encode/decode (#54, #59, #60)
- Standalone build system (#51)
- CI smoke workflow (#48, #50)
- Security: bounds checking in Metal shader (#55, #57)
- Ollama integration (the hard part — submodule fork + CGo bindings)
---
## Feedback Analysis
### From @manus: "More frequent updates on PolarQuant"

**Status:** Partially addressed. PROJECT_STATUS.md exists but is dated (2026-03-30). No updates since Phase 1 completion.

**Action:** Create a living status tracker updated on each milestone.

### From @Timmy: "Build spec stays aligned with Metal shader benchmarks"

**Status:** Aligned. Build spec v2.2 matches benchmark results. Hardware note corrected (M3 Max 36GB, not M4 Max 32GB).

**Action:** Document the alignment explicitly. Add a benchmark-to-spec mapping table.

### From @Rockachopa: "Oversight on QJL residual correction accuracy"

**Status:** Not started. QJL is the second stage of TurboQuant (PolarQuant → QJL → TurboQuant). Without QJL, we have PolarQuant only (~4.2x compression), not full TurboQuant (~3.5 bits/channel).

**Action:** File an issue for QJL implementation with accuracy gates.
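For context on what the QJL stage adds: a 1-bit Johnson-Lindenstrauss sketch keeps only the sign of each random projection of the quantization residual, so the correction costs one bit per projection. A toy sketch under assumed conventions (the name qjl_sign_sketch and the Gaussian-projection setup are hypothetical, not taken from the codebase):

```python
import random

def qjl_sign_sketch(residual, n_proj, seed=0):
    """1-bit Johnson-Lindenstrauss sketch of a residual vector:
    project onto n_proj random Gaussian directions and keep only
    the sign (+1 / -1) of each projection -- one bit per direction."""
    rng = random.Random(seed)  # fixed seed: encoder and decoder share directions
    signs = []
    for _ in range(n_proj):
        direction = [rng.gauss(0.0, 1.0) for _ in residual]
        dot = sum(d * r for d, r in zip(direction, residual))
        signs.append(1 if dot >= 0.0 else -1)
    return signs

bits = qjl_sign_sketch([0.3, -0.1, 0.05, 0.2], n_proj=8)
print(bits)  # eight values, each +1 or -1
```

Accuracy gates for this stage would compare reconstruction error (and downstream PPL delta) with and without the sign-sketch correction.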
---
## Blockers Identified
1. **No integration PR to llama.cpp** — The Metal shaders exist but aren't upstreamed or even in a PR branch of the main fork
2. **No unit tests** — encode/decode correctness unverified beyond manual spot checks
3. **No CI** — No automated build or quality checks
4. **Security gap** — Metal shader lacks bounds checking (#57)
5. **Stale README** — Points to old Gitea IP (143.198.27.163:3000), not the Forge URL
---
## Recommendation

The initiative has solid Phase 1 results. The gap is **integration engineering** — getting from "works on my machine" to "production-ready in llama.cpp."

Priority order:
1. Security fix (#57 — bounds checking)
2. Unit tests (#54 — encode/decode)
3. Integration PR to llama.cpp fork
4. CI pipeline (#48, #50)
5. QJL implementation
6. Ollama integration