From cefaa6e7782c760653ba685b8514a2ef5aa983f1 Mon Sep 17 00:00:00 2001 From: Timmy Date: Mon, 30 Mar 2026 13:11:45 -0400 Subject: [PATCH] Add build spec v2.2 and README TurboQuant KV cache compression for M4 Max local inference. Spec by Strago, triaged into 16 issues across 4 phases. Ref #1 --- BUILD-SPEC.md | 449 ++++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 33 +++- 2 files changed, 480 insertions(+), 2 deletions(-) create mode 100644 BUILD-SPEC.md diff --git a/BUILD-SPEC.md b/BUILD-SPEC.md new file mode 100644 index 0000000..eab2e09 --- /dev/null +++ b/BUILD-SPEC.md @@ -0,0 +1,449 @@ +# TurboQuant Implementation — Build Spec (v2) +**Prepared by:** Strago | **Date:** 2026-03-30 | **Updated:** 2026-03-30 (v2 — external review fixes) +**Task:** STR-2026-03-30-01 | **For:** Cid (build) + Frankie (coordination) +**Inputs read:** turboquant-2026-03-25.md (Google brief), turboquant-2026-03-30-recon-update.md (Locke recon), infra-bulletin.md, MEMORY.md, external Opus review + +--- + +## Situation + +John wants maximum local inference quality on the MacBook Pro (M4 Max, 32GB unified memory) using TurboQuant-level KV cache compression. Currently running `qwen3.5:27b` via Ollama at `10.0.0.133:11434`. The goal: run a larger or better model within the same 32GB memory envelope by compressing the KV cache during inference. + +TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method: +1. **PolarQuant** — random rotation + polar coordinates + fixed scalar codebook. No normalization constants. ~4.2× compression. +2. **QJL** — 1-bit quantized Johnson-Lindenstrauss on the residual. Zero-overhead bias correction. +3. **TurboQuant** — PolarQuant for main signal + QJL for residual = unbiased inner product quantizer at ~3.5 bits/channel with zero accuracy loss. + +Community status: multiple `llama.cpp` forks, MLX proof-of-concepts, and a vLLM plugin exist. Nothing upstreamed to official `llama.cpp`, MLX, or Ollama yet. Author QJL code is public. Enough is public to build from. + +--- + +## 1a. PolarQuant Technical Detail — What Cid Needs to Verify + +This section specifies the PolarQuant algorithm concretely so Cid can verify that the community fork implements it correctly. A fork that gets the rotation wrong or uses the wrong codebook boundaries will compress successfully but degrade quality in ways that short PPL benchmarks may not catch — the damage surfaces during long production sessions with sustained context pressure. + +### The Algorithm (per KV vector) + +**Step 1 — Random Rotation (Preconditioning):** +- Apply a fixed random orthogonal rotation to each KV vector before quantization. +- The paper uses a **Walsh-Hadamard transform** (WHT) — a structured orthogonal matrix that's O(d log d) to apply, not O(d²) like a dense random matrix. +- **Why:** Raw KV vectors have non-uniform coordinate distributions (some dimensions carry more energy). WHT spreads energy uniformly across all coordinates, making the post-rotation distribution predictable and concentrated. This is what eliminates the need for per-vector normalization constants. +- **Cid verification:** The fork must use a fixed WHT (or equivalent structured orthogonal rotation), not a learned or per-layer rotation. The rotation matrix must be identical at quantization and dequantization. If the fork uses a dense random matrix instead of WHT, it's functionally correct but slower — flag it. + +**Step 2 — Polar Coordinate Transform:** +- After rotation, decompose each vector into **radius** (L2 norm / signal strength) and **angle** (direction on the unit sphere). +- The radius is stored at higher precision (FP16 or FP32) — it's one scalar per vector, negligible overhead. +- The angle coordinates are what get quantized. Because WHT made their distribution predictable, you can use a fixed codebook without per-vector calibration. + +**Step 3 — Lloyd-Max Scalar Quantization:** +- Each angle coordinate is independently quantized using a **Lloyd-Max optimal scalar quantizer**. +- Lloyd-Max minimizes mean squared error for a known distribution. Because WHT makes the distribution analytically computable, the codebook boundaries are **precomputed once** and fixed for all vectors. +- **Codebook sizes by compression target:** + - `turbo4` = 4 bits per coordinate = 16 codebook entries per dimension + - `turbo3` = 3 bits = 8 entries + - `turbo2` = 2 bits = 4 entries +- **Cid verification:** Check that the fork's codebook boundaries match what the paper/PolarQuant paper specifies for the target distribution. If the fork uses uniform quantization instead of Lloyd-Max, that's a quality regression — uniform is simpler but wastes bits on low-probability regions. + +**Step 4 — Bit Packing + Storage:** +- Quantized indices are packed into the KV cache format (turbo2/3/4 nibble-packed). +- Radius stored separately. No normalization constants, no scale factors, no zero-points — this is the key advantage over standard quantization. + +### Dequantization During Attention + +When the model computes attention scores (Q·K^T) and weighted values (softmax·V): +1. Read packed indices from cache +2. Look up codebook values (single table lookup per coordinate) +3. Reconstruct angle coordinates +4. Scale by stored radius +5. Compute dot product in reconstructed space + +**Critical property:** The inner product between a full-precision query Q and a PolarQuant-compressed K must be an unbiased estimator of the true Q·K dot product. The WHT rotation preserves this because orthogonal transforms preserve inner products. If the fork adds any non-orthogonal transformation (e.g., learned projection, PCA), the unbiasedness guarantee breaks. + +### PolarQuant Initialization — Codebook + Rotation Matrix Setup + +PolarQuant requires two things to be initialized before inference can start: + +1. **Walsh-Hadamard rotation matrix:** This is deterministic — a WHT of size d (model head dimension, typically 128) is computed from the recursive Hadamard construction. It's the same for every session, every model. Compute once at model load, store in memory. Cost: O(d log d) per head dimension — microseconds. No impact on model load time. + +2. **Lloyd-Max codebook:** The quantization boundaries are precomputed for the known post-WHT distribution. For a given bit width (turbo4 = 4 bits = 16 entries), the codebook is a fixed lookup table of 16 boundary values + 16 reconstruction values. This is identical across sessions and models of the same head dimension. Can be hardcoded as a constant array or computed once at load time from the analytical distribution formula. + +**Expected initialization overhead:** Negligible — both are small deterministic computations. But **measure it during Phase 1**: time the gap between Ollama receiving a request and the first token appearing, with and without TurboQuant. If initialization adds >1 second to cold model load, investigate caching the tables to disk alongside the model file. + +**Cid measurement target:** Report model load time (cold start) with and without TurboQuant. If >5 second delta, flag as UX issue. + +**Cid verification checklist (before trusting benchmark numbers):** +- [ ] Rotation is WHT or equivalent structured orthogonal (not learned, not dense random) +- [ ] Same rotation matrix used for quantization and dequantization +- [ ] Codebook is Lloyd-Max (not uniform), boundaries precomputed for post-WHT distribution +- [ ] Radius stored separately at FP16+ precision +- [ ] No per-vector normalization constants stored (this is the whole point) +- [ ] Dequant path in Metal shader matches the quantization path exactly + +--- + +## 1. Model Targeting — What Can We Run? + +### Memory Budget — Realistic, Not Theoretical + +On a 32GB M4 Max running macOS, you do NOT have 32GB for inference. Realistic budget: + +| Consumer | Estimate | +|----------|----------| +| macOS + system services | ~2-3GB | +| Metal command buffer + GPU driver overhead | ~1-2GB | +| Ollama process overhead | ~0.5GB | +| Activation memory (intermediate tensors during forward pass) | ~1-3GB (varies by model/batch) | +| **Available for model weights + KV cache** | **~26-28GB** | + +**Use 27GB as the planning ceiling.** The v1 spec said "leaves 2GB for OS" at 30GB peak — that's too tight. All memory calculations below use 27GB available. + +### Current State (No TurboQuant) +- **qwen3.5:27b** at Q4_K_M (~16GB model weights) — fits within 27GB budget with room for KV cache +- At 32K context, KV cache for a 27B model at FP16 ≈ 4-6GB → total ~20-22GB. Comfortable. +- At 64K context, KV cache ≈ 8-12GB → total ~24-28GB. Marginal — may swap. +- At 128K context, KV cache grows to ~16-24GB → doesn't fit. Context-limited. + +### With TurboQuant (4× KV Compression) +- KV cache at 32K drops from ~5GB → ~1.2GB +- KV cache at 64K drops from ~10GB → ~2.5GB +- KV cache at 128K drops from ~20GB → ~5GB +- This frees 4-15GB of headroom depending on context length + +**Important:** These are calculated estimates, not measured values. Actual memory consumption can exceed theoretical due to fragmentation, allocation overhead, and implementation-specific buffering. Phase 1 **must** include actual peak memory measurement (see validation section). If measured exceeds calculated by >15%, the context ceiling drops accordingly. + +### Model Recommendations + +**Primary target: qwen3.5:27b at Q4_K_M with extended context** +- Model weights: ~16GB at Q4_K_M +- With TurboQuant KV cache at 64K context: ~2.5GB cache + ~2GB activations → ~20-21GB total. Comfortable within 27GB budget. +- With TurboQuant at 128K: ~5GB cache + ~2GB activations → ~23GB total. Fits, but tight — **needs measured validation.** +- Without TurboQuant: 64K context KV cache ≈ 10GB → ~28GB total. OOM risk. +- **Win: 64K context becomes reliable, 128K becomes possible.** This is the real unlock. + +**Stretch target: Qwen 3.5 32B (Q4_K_M)** +- Model weights: ~18-19GB at Q4_K_M +- With TurboQuant at 64K: ~2.5GB cache + ~2.5GB activations → ~23-24GB. Fits within 27GB but leaves only ~3GB headroom. +- **Verdict: worth testing in Phase 1 benchmarks alongside 27B.** If it fits, marginally better quality. If it's marginal, stay on 27B. + +**Not recommended: Qwen 3.5 72B (Q2_K or IQ3_XXS)** +- Model weights at Q2_K: ~27GB. Leaves ~0GB for anything else. +- **Verdict: does not fit.** Even with TurboQuant, no room for KV cache + activations + Metal overhead. And quality at Q2_K is poor — weight quantization damage cancels the parameter count advantage. + +**Recommended path: Stay on 27B class, use TurboQuant to unlock longer context (64K-128K) rather than a bigger model.** The real win on 32GB unified is context length, not parameter count. A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context. + +**Alternative worth testing: Mistral/Codestral 25B-class models** at Q5_K_M (~18GB) with TurboQuant. Locke's research notes TurboQuant was benchmarked on Mistral — community results may be more reproducible. + +--- + +## 2. Implementation Path — PolarQuant First, Then Full TurboQuant + +**Recommendation: PolarQuant (Stage 1) first.** Matches Locke's recommendation. Rationale: + +- PolarQuant alone delivers ~4.2× compression — that's the bulk of the win +- Full TurboQuant adds QJL residual correction for marginal quality improvement at extreme compression (2.5 bits) +- At 3.5+ bits/channel, PolarQuant is sufficient for zero accuracy loss +- QJL adds kernel complexity for small incremental gain at our target compression ratio +- We can always add QJL in Phase 2 if PolarQuant quality isn't sufficient + +### Source Repos (Priority Order) + +| Repo | What | Why | Risk | +|------|------|-----|------| +| **`TheTom/llama-cpp-turboquant`** | `llama.cpp` fork with Metal support | Most directly useful — same stack as Ollama. Reports PPL numbers on M-series. | Community fork, not upstream. May lag `llama.cpp` HEAD. | +| **`TheTom/turboquant_plus`** | Standalone C implementation + Python tests | Most detailed reverse-engineering. 511+ tests. PolarQuant + Walsh-Hadamard + turbo2/3/4 formats. | Extends beyond paper ("Plus"). May include non-paper innovations. | +| **`amirzandieh/QJL`** | Author's QJL CUDA implementation | Official author code. CUDA kernels, eval scripts, LongBench commands. | CUDA only — needs Metal port for MacBook. Phase 2 dependency. | +| **`rachittshah/mlx-turboquant`** | MLX proof-of-concept | Native Apple Silicon. Correct module layout (codebooks, polar_quant, qjl). | May be partial implementation. Naming drift noted. | + +**Start from:** `TheTom/llama-cpp-turboquant` (for Ollama integration path) + `TheTom/turboquant_plus` (for reference/tests). + +### Community Fork Risk Assessment + +The v1 spec understated this. Community `llama.cpp` forks can diverge significantly from HEAD, especially in the Metal backend where Apple Silicon optimizations change frequently. The risk isn't "it doesn't build" — it's "it builds fine on the fork's base commit but breaks when cherry-picked onto current HEAD." + +**Specific risk areas:** +- **KV cache layer:** `llama.cpp` has refactored KV cache internals multiple times in 2026. A fork based on a 4-week-old commit may touch structs/functions that have been renamed or restructured upstream. +- **Metal shaders:** Apple Silicon Metal optimizations are actively changing. Custom Metal kernels for TurboQuant dequant may conflict with upstream shader refactors. +- **Memory management:** `ggml` memory allocation has evolved. The fork's cache allocation assumptions may not match current `ggml` memory pools. + +**Mitigation plan (Phase 1 Step 0 — before any benchmarking):** + +1. **Check fork freshness:** `git log --oneline -1` on the fork. Compare base commit date against `llama.cpp` HEAD. If >4 weeks stale, flag as HIGH risk. +2. **If fresh (< 2 weeks from HEAD):** Build directly. Likely works. +3. **If stale (2-4 weeks):** Attempt cherry-pick of TurboQuant-specific commits onto current HEAD. If merge conflicts are limited to TurboQuant files → resolve manually. If conflicts touch core KV cache / Metal code → stop, evaluate effort. +4. **If very stale (> 4 weeks) or conflicts are extensive:** Switch to **clean-room approach** — use `TheTom/turboquant_plus` as the algorithm reference and implement the KV cache types directly into current `llama.cpp` HEAD. This is more work (~60-90 min instead of ~20-40 min) but avoids the merge conflict maze. +5. **Escape hatch:** If `llama.cpp` path is blocked, fall back to `rachittshah/mlx-turboquant` (MLX native, no fork divergence risk, but requires API proxy for Ollama compatibility). + +**Cid decision point:** After Step 0, report fork age + conflict assessment before proceeding. If clean-room is needed, update the time estimate and Frankie adjusts the schedule. Don't spend more than 15 minutes fighting merge conflicts — switch to clean-room. + +### Metal Kernel Risk — The Single Highest-Risk Assumption + +The spec assumes the `llama.cpp` fork has working **Metal shaders** for PolarQuant KV dequantization. KV dequant happens in the attention computation hot path — every token, every layer, every head. If the fork only has CPU or CUDA dequant kernels and no Metal implementation, the MacBook will either: +- Fall back to CPU dequant → **catastrophic** performance loss (10-50× slower attention) +- Fail to build entirely for Metal backend + +**Cid's actual first action** (before building, before benchmarking, before anything): + +```bash +# Clone the fork +git clone https://github.com/TheTom/llama-cpp-turboquant.git +cd llama-cpp-turboquant + +# Check for Metal shader files referencing TurboQuant/PolarQuant +grep -rn "turbo\|polar\|turboquant\|polarquant" ggml/src/ggml-metal* 2>/dev/null +grep -rn "turbo\|polar" ggml/src/ggml-metal.metal 2>/dev/null + +# Check for Metal kernel dispatch for turbo KV types +grep -rn "GGML_TYPE_.*TURBO\|turbo.*metal\|kv.*turbo" . --include="*.m" --include="*.metal" --include="*.h" 2>/dev/null +``` + +**If Metal shaders exist:** Proceed with `llama.cpp` fork path (primary). +**If Metal shaders do NOT exist:** MLX becomes the **primary** path, not the fallback. Switch to `rachittshah/mlx-turboquant` immediately. Reframe Phase 1 around MLX + API proxy for Ollama compatibility. Report this finding before spending any more time on the `llama.cpp` path. + +This check takes 2 minutes and determines the entire build strategy. Do it first. + +--- + +## 3. Integration Target — llama.cpp → Ollama + +**Primary: `llama.cpp` fork → custom Ollama build.** + +Why not MLX: +- Our entire fleet uses Ollama. Model management, API compatibility, endpoint routing — all built around Ollama. +- MLX would require a separate inference server, separate model format, separate API integration. +- Ollama is built on `llama.cpp`/`ggml`. KV cache changes in `llama.cpp` propagate to Ollama. + +**Integration strategy:** +1. Build/test TurboQuant KV cache in a `llama.cpp` fork (Metal backend) +2. Validate quality + performance +3. Build custom Ollama from our `llama.cpp` fork (Ollama builds `llama.cpp` as a submodule) +4. Deploy to MacBook as replacement Ollama binary +5. Existing model files, API, and endpoint (`10.0.0.133:11434`) remain identical — only the inference engine changes + +**Fallback: MLX standalone** if `llama.cpp` Metal integration proves too complex. `rachittshah/mlx-turboquant` as starting point. Would require a small proxy server to maintain API compatibility with our Ollama endpoint. + +--- + +## 4. Validation Plan — How We Know It Works + +### Quality Validation + +**Test matrix (run each model with and without TurboQuant):** + +| Test | What It Measures | Tool | Pass Criteria | +|------|-----------------|------|--------------| +| Perplexity (PPL) | Overall language modeling quality | `llama-perplexity` on WikiText-2 | PPL delta ≤ 0.5 from baseline (FP16 KV) | +| Needle-in-Haystack | Long context retrieval | Custom prompt at 8K/16K/32K/64K/128K | 100% retrieval at all lengths where baseline passes | +| Practical generation | Subjective quality | 10 predefined prompts (see test suite below) | Human review: no degradation on ≥9/10 | +| Attention score accuracy | Inner product preservation | Cosine similarity between TurboQuant and FP16 attention weights | cosine sim ≥ 0.995 | + +**Predefined Test Prompts (10 prompts, run identically on TurboQuant and FP16 KV baseline):** + +| # | Category | Prompt Description | What It Tests | +|---|----------|-------------------|---------------| +| 1 | Long-context summarization | Feed 20K tokens of a research paper, ask for structured summary with citations | KV cache quality at length — compressed K/V must retain source detail | +| 2 | Multi-step reasoning | 5-step math word problem requiring chain-of-thought | Whether compressed KV degrades intermediate reasoning steps | +| 3 | Code generation | Write a Python script with 3 functions, error handling, type hints | Precise token prediction — code is unforgiving of subtle quality drops | +| 4 | Code debugging | Provide buggy code (3 bugs), ask to identify and fix all three | Attention to detail across context — must reference earlier code correctly | +| 5 | Factual recall (early context) | Provide 10 facts in the first 1K tokens, continue for 8K tokens of filler, ask about fact #3 | Retrieval from early context through compressed KV | +| 6 | Creative writing | Write a 500-word short story with specific constraints (setting, character, twist) | Compression artifacts surface as repetition or coherence loss | +| 7 | Multi-turn conversation | 10-turn technical Q&A where later questions reference earlier answers | Cross-turn coherence through accumulated compressed KV | +| 8 | Structured output | Generate a JSON schema with 15+ fields, nested objects, and validation rules | Format precision — compressed KV must maintain structural consistency | +| 9 | Translation + analysis | Translate a paragraph EN→ES, then analyze the translation choices | Tests both generation quality and meta-reasoning about own output | +| 10 | Instruction following | Complex prompt with 8 specific formatting requirements (headers, bullet style, word limits, etc.) | Whether compression causes the model to "forget" constraints mid-generation | + +**Prompts must be written and saved to `projects/sovereign-stack/turboquant-test-prompts.md` before Phase 2 benchmarks run.** Same prompts, same order, both configurations. This prevents unconscious cherry-picking. + +**Asymmetric K/V test:** Run K at Q8_0, V at turbo4. Community reports this works well on sensitive models. Compare PPL vs symmetric turbo4 K+V. + +**Long-session quality test (Phase 2 only):** Short-context PPL benchmarks can miss quality degradation that surfaces during sustained context pressure. During Phase 2, run one extended production simulation: +- Generate a 50-turn multi-step reasoning conversation (code gen → debug → refactor → test → iterate) +- Compare output quality vs same conversation on FP16 KV baseline +- Specifically watch for: coherence drift after turn 30+, hallucinated references to earlier context, attention score softmax concentration (if measurable) +- This catches the case where codebook boundary errors accumulate over many KV cache writes in a single session + +### Performance Validation + +| Metric | Measure | Pass Criteria | +|--------|---------|--------------| +| Tokens/second (generation) | `llama-bench` | ≥90% of baseline tok/s (small decode overhead acceptable) | +| Time to first token (TTFT) | Timed prompt eval | ≤110% of baseline | +| Peak resident memory | `footprint -p ` or `vmmap --summary` at each context length | Stays under 27GB at target context length | +| Memory vs theoretical | Compare measured peak to calculated estimate | If measured exceeds calculated by >15% → reduce context ceiling | +| Context length ceiling | Binary search: max context before OOM or swap pressure | 64K minimum (vs ~32K baseline for 27B) | + +### Kill Criteria +- PPL regression > 1.0 at any compression level → abort that compression level +- OOM at 32K context (baseline capability) → regression, abort +- tok/s drops > 25% → dequant overhead too high, need kernel optimization before deploy + +--- + +## 5. Who Does What + +| Role | Owner | What | +|------|-------|------| +| Build spec | Strago | This document ✅ | +| Implementation | Cid | Fork `llama.cpp`, integrate PolarQuant KV cache, Metal kernels, build custom Ollama | +| Validation | Cid | Run test matrix, report PPL/performance numbers | +| Model selection | Cid | Test qwen3.5:27b + one Mistral variant, recommend best config | +| MacBook deployment | Cid | Replace Ollama binary on MacBook, verify endpoint works | +| Quality review | John | Review 10-prompt practical generation comparison | +| Research support | Locke | If Cid hits a wall on the math, Locke deep-dives the paper/QJL code | + +--- + +## 6. Phasing + +### Phase 1 — PolarQuant MVP (Target: this week) + +**Scope:** + +**Step 0 — Fork Assessment (do this FIRST, report before proceeding):** +- Clone `TheTom/llama-cpp-turboquant` +- Check base commit age vs `llama.cpp` HEAD (`git log --oneline -1`) +- Check `sysctl hw.memsize` on MacBook (resolve the 32/36/48GB question) +- If fork < 2 weeks stale → proceed to build +- If 2-4 weeks stale → attempt cherry-pick, report conflict scope +- If > 4 weeks or conflicts extensive → switch to clean-room (see Fork Risk Assessment above) +- Report: fork age, conflict assessment, MacBook actual RAM, estimated build path time + +**Step 1 — Build + Verify:** +- Build `llama.cpp` fork (or clean-room) with Metal backend on MacBook (M4 Max) +- Run the Section 1a verification checklist against the fork's implementation before trusting any benchmarks +- Run FP16 KV baseline: `llama-perplexity` on WikiText-2 with `qwen3.5:27b` at 8K context (this is the number we're comparing against) + +**Step 2 — Benchmark PolarQuant:** +- Run perplexity test with PolarQuant KV (turbo4 format) vs FP16 KV baseline +- Run `llama-bench` for tok/s comparison +- Test at 8K, 32K, and 64K context lengths +- Run asymmetric test: K at Q8_0, V at turbo4 +- **Measure actual peak resident memory** at each context length (`footprint -p ` or `vmmap --summary`). Compare measured vs calculated. If measured exceeds calculated by >15%, note the delta — it reduces the achievable context ceiling. +- Report: PPL delta per context length, tok/s delta, **measured peak memory per context length**, max context before OOM/swap, asymmetric vs symmetric results + +**Deliverable:** Working `llama.cpp` build on MacBook with PolarQuant KV cache. PPL + performance numbers. Section 1a verification checklist completed. + +**Estimated Cid time (honest range):** +- **Best case** — fork is fresh, builds clean on first try, Metal shaders work: 20-40 min +- **Typical case** — fork needs CMake flag tweaks, Xcode SDK adjustments, minor Metal fixes: 1-2 hours +- **Worst case** — fork is stale, conflicts extensive, or Metal shaders missing: clean-room build 2-4 hours, or MLX pivot + +**2-hour build troubleshooting cap:** If the `llama.cpp` fork doesn't compile and pass a basic smoke test (load model, generate 10 tokens) within 2 hours of troubleshooting, stop. Pivot to MLX path. Don't sink more time into Xcode/CMake/Metal debug loops when a working MLX PoC exists. Report what broke — the information is useful even if the path is abandoned. + +**Decision gate:** If PPL delta ≤ 0.5 and tok/s ≥ 90% baseline AND Section 1a checklist passes → proceed to Phase 2. If PPL fails but checklist passes → try asymmetric K/V or lower compression (turbo3 instead of turbo4). If checklist fails → fix implementation before trusting benchmarks. + +### Phase 2 — Ollama Integration + Production Deploy + +**Scope:** + +**Step 0 — Ollama API Compatibility Check (before building):** +Ollama pins a specific `llama.cpp` commit and calls it through CGo bindings in `llm/`. If our fork changes any function signatures, struct layouts, or enum values that Ollama's Go code references, the build will either fail or produce subtle runtime bugs. + +```bash +# Clone Ollama source +git clone https://github.com/ollama/ollama.git +cd ollama + +# Find the pinned llama.cpp commit +cat llm/llama.cpp/CMakeLists.txt | head -5 # or check go.mod / Makefile + +# Diff our fork's API surface against Ollama's expected API +# Focus on: llama.h, ggml.h function signatures that Ollama calls +diff <(grep -h "^LLAMA_API\|^GGML_API" llm/llama.cpp/include/*.h | sort) \ + <(grep -h "^LLAMA_API\|^GGML_API" /path/to/our-fork/include/*.h | sort) +``` + +If API surface differs: check if TurboQuant changes are additive (new functions/types only) or modify existing signatures. Additive = safe. Modified existing = need to update Ollama's CGo bindings. + +**Build steps:** +- Build custom Ollama binary using our `llama.cpp` fork as submodule +- Deploy to MacBook as replacement Ollama +- Verify existing endpoint (`10.0.0.133:11434`) works identically +- Run full test matrix (all 4 quality tests + all 4 performance tests) +- Test with qwen3.5:27b at 64K and 128K context +- If 128K works: update Ollama model config to advertise larger context +- Run 10-prompt practical generation comparison for John review + +**Deliverable:** Production Ollama on MacBook with TurboQuant KV cache. Full benchmark report. John signs off on quality. + +**Estimated Cid time:** 15-25 min (Ollama build is straightforward once `llama.cpp` fork is validated). + +### Phase 2.5 — Per-Layer Quantization Profiles (Optimization, Optional) + +Not all transformer layers have equal sensitivity to KV cache quantization. Research and community experimentation show early layers (first 2-4) and late layers (last 2-4) tend to be more sensitive than middle layers. If the fork supports per-layer KV cache type configuration: + +- **Sensitive layers (first 3 + last 3):** K at Q8_0, V at turbo4 (or full FP16 KV) +- **Middle layers:** K and V both at turbo4 (or even turbo3) + +This gives the same average compression ratio as uniform turbo4 but concentrates precision where it matters most. The PPL improvement can be meaningful (0.1-0.3) at zero memory cost. + +**When to pursue:** Only after Phase 2 is stable and baseline quality is confirmed. This is tuning, not architecture. If uniform turbo4 passes all quality gates, per-layer optimization is nice-to-have, not necessary. + +**Cid note:** During Phase 1, check whether the fork exposes per-layer KV type config. If it does, note it for later. Don't implement it yet. + +### Phase 3 — QJL Residual Correction (Optional) + +**Scope:** Add QJL 1-bit residual correction for full TurboQuant behavior. Only pursue if: +- Phase 1/2 PolarQuant shows quality gaps at extreme compression (< 3 bits/channel) +- We want to push to 2.5 bits/channel for even more context headroom + +**Source:** `amirzandieh/QJL` repo (CUDA → Metal port needed) + +**Estimated Cid time:** 30-60 min (Metal port of QJL kernels is real engineering work) + +**Decision gate:** Only proceed if PolarQuant alone doesn't meet quality bar at target compression. + +### Phase 4 — Upstream Watch + +**Scope:** Monitor `llama.cpp` upstream and Ollama for official TurboQuant support. When it lands: +- Evaluate upstream implementation vs our fork +- If upstream is better: migrate off our fork to official +- If our fork is better: contribute upstream (optional) + +**Owner:** Locke (monitoring) + Cid (evaluation when it lands) + +--- + +## What This Spec Does NOT Cover + +- **Weight quantization** — TurboQuant is KV cache compression only. Model weight quantization (GGUF Q4_K_M etc.) is a separate concern and already handled by Ollama. +- **Predator (desktop) deployment** — this spec targets MacBook only. Predator runs NVIDIA (CUDA) which is a different kernel backend. Can extend later. +- **Multi-model serving** — TurboQuant helps with single-model memory but doesn't change Ollama's single-model-at-a-time constraint. +- **Ollama upstream contribution** — out of scope for now. We build for ourselves first. + +--- + +## Open Questions for John + +**None blocking.** One informational: + +- **MacBook Pro memory:** Confirmed M4 Max 32GB from memory/2026-03-14.md. If it's actually 36GB or 48GB (M4 Max comes in 36/48/128 configs), that changes the model ceiling. Can Cid check `sysctl hw.memsize` on the MacBook during Phase 1? Non-blocking — doesn't change the approach, just the model size ceiling. + +--- + +## Reference Files + +| File | Location | +|------|----------| +| TurboQuant Google Brief | `projects/sovereign-stack/research/turboquant-2026-03-25.md` | +| Locke Recon Update | `projects/sovereign-stack/research/turboquant-2026-03-30-recon-update.md` | +| `llama.cpp` TurboQuant fork | `github.com/TheTom/llama-cpp-turboquant` | +| TurboQuant+ reference impl | `github.com/TheTom/turboquant_plus` | +| QJL author code | `github.com/amirzandieh/QJL` | +| MLX PoC (fallback) | `github.com/rachittshah/mlx-turboquant` | +| TurboQuant paper | `arxiv.org/abs/2504.19874` | +| PolarQuant paper | `arxiv.org/abs/2502.02617` | + +--- + +--- + +## Changelog + +- **v1 (2026-03-30 12:26 ET):** Initial spec. +- **v2 (2026-03-30 12:55 ET):** Added Section 1a (PolarQuant technical detail + Cid verification checklist), expanded fork risk assessment with mitigation plan, added Phase 1 Step 0 (fork assessment before benchmarking), added long-session quality test for Phase 2, updated Phase 1 time estimate for clean-room path. Changes driven by external Opus review round 1. +- **v2.1 (2026-03-30 13:00 ET):** Added Metal kernel risk check (grep before build — determines llama.cpp vs MLX primary path), corrected memory budget (27GB available, not 30GB — accounts for OS + Metal driver + activations), added measured memory profiling requirement to Phase 1, added Ollama CGo API compatibility check to Phase 2 Step 0, tightened model ceiling estimates. Changes driven by external Opus review round 2. +- **v2.2 (2026-03-30 13:05 ET):** Added honest time estimate range (20 min best → 2-4 hr worst), 2-hour build troubleshooting cap before MLX pivot, PolarQuant initialization detail (WHT + Lloyd-Max codebook setup + cold-start measurement target), 10 predefined test prompts with rationale (prevents cherry-picking), per-layer quantization profiles as Phase 2.5 optimization path. Changes driven by external Opus review round 3. + +--- + +*Build spec v2 ready for Cid intake. No clarifying questions needed.* diff --git a/README.md b/README.md index 60f2c3f..be1883c 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,32 @@ -# turboquant +# TurboQuant -TurboQuant KV cache compression for local inference — PolarQuant + QJL on M4 Max via llama.cpp/Ollama. Build spec from Strago, build by Cid, coordination by Frankie. \ No newline at end of file +KV cache compression for local inference on M4 Max MacBook Pro. + +## What +TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method: +1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression) +2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction +3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss + +## Why +Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory. +A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context. + +## Status +See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress. + +## Roles +- **Strago:** Build spec author +- **Cid:** Implementation, benchmarks, deployment +- **Locke:** Research support, upstream watch +- **John:** Quality review +- **Frankie:** Coordination + +## Source Repos +- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) — llama.cpp fork with Metal +- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus) — Reference impl, 511+ tests +- [amirzandieh/QJL](https://github.com/amirzandieh/QJL) — Author QJL code (CUDA) +- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant) — MLX fallback + +## Docs +- [BUILD-SPEC.md](BUILD-SPEC.md) — Full build specification (Strago, v2.2)