From 94c880d3063869f00f53bf0cb2c8c64b0fd50809 Mon Sep 17 00:00:00 2001
From: Google AI Agent <gemini@hermes.local>
Date: Mon, 13 Apr 2026 00:32:31 +0000
Subject: [PATCH] feat: consolidate project reports into docs/PROJECT_STATUS.md

---
 docs/PROJECT_STATUS.md | 847 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 847 insertions(+)
 create mode 100644 docs/PROJECT_STATUS.md

diff --git a/docs/PROJECT_STATUS.md b/docs/PROJECT_STATUS.md
new file mode 100644
index 00000000..c798a1f2
--- /dev/null
+++ b/docs/PROJECT_STATUS.md
@@ -0,0 +1,847 @@
+# TurboQuant Project Status
+
+# TurboQuant Phase 1 Report — PolarQuant MVP
+
+**Date:** 2026-03-30
+**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John)
+**Spec:** turboquant-build-spec v2.2 (Strago)
+
+---
+
+## Executive Summary
+
+Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. turbo4 delivers **73% KV memory savings with only 1% prompt processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear.
+
+**Hardware correction:** The MacBook is M3 Max 36GB (not M4 Max 32GB as in spec). This INCREASES our memory budget from 27GB to ~31GB.
+
+---
+
+## Gate Check (#2): PASSED ✅
+
+Metal shaders exist and are comprehensive:
+- Full flash attention for turbo2/3/4 with dk32-dk576 variants
+- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse)
+- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/√128))
+- Asymmetric K/V support (q8_0 × turbo mixed pairs)
+- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes
+- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization
+
+**Decision: llama.cpp path confirmed. No MLX pivot needed.**
+
+---
+
+## Fork Assessment (#3): PASSED ✅
+
+- Branch: `feature/turboquant-kv-cache` (commit adac2c6)
+- Fork freshness: ADEQUATE (recent enough for direct build)
+- Build: Clean cmake + make, 100% success in ~3 minutes
+- All binaries: llama-cli, llama-bench, llama-perplexity, llama-server
+
+---
+
+## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅
+
+| Item | Verdict |
+|------|---------|
+| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU uses WHT ✅. CPU turbo4 ref uses dense random (legacy, not production) |
+| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() identical sign arrays |
+| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" |
+| Radius at FP16+ | PASS — ggml_half norm per 128-element group |
+| No per-vector normalization | PASS — one group norm only, static_asserts enforce block sizes |
+| Dequant matches quant in Metal | PASS — same centroids, signs, butterfly structure |
+
+**⚠️ Flag for Cid:** CPU turbo4 reference path is incompatible with Metal dequant. Only matters if CPU fallback is ever invoked for turbo4.
+
+---
+
+## Benchmark Results
+
+### Model Under Test
+- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params)
+- Machine: Apple M3 Max, 36GB unified, Metal GPU Family 9
+
+### Throughput (3-run averages)
+
+| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
+|:-------------|:---------------|:--|:-------------------|:--|
+| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
+| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
+| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
+| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
+
+### KV Cache Memory (turbo4 vs f16)
+
+| Context | f16 KV | turbo4 KV | Savings |
+|:--------|:-------|:----------|:--------|
+| 2K | 320 MiB | 85 MiB | 73.4% |
+| 8K | 1,280 MiB | 340 MiB | 73.4% |
+| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
+| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
+
+Measured matches calculated exactly — zero fragmentation overhead.
+
+### Pass Criteria Assessment
+
+| Criteria | Threshold | Result | Verdict |
+|:---------|:----------|:-------|:--------|
+| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED |
+| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** |
+| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** |
+| No OOM at 32K | No crash | Runs clean | **PASS** |
+| Memory consistent with theory | ±15% | 0% delta | **PASS** |
+
+---
+
+## What This Means for qwen3.5:27b (Spec Target)
+
+| Scenario | Total Memory | Fits in 31GB? |
+|:---------|:-------------|:--------------|
+| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight |
+| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No |
+| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable |
+| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) |
+
+**TurboQuant turns 128K context from impossible to comfortable.**
+
+---
+
+## Open Items for Phase 2
+
+1. **Perplexity test** — Need wikitext-2-raw corpus downloaded. PPL is the most important quality metric and we don't have it yet.
+2. **Ollama integration** — CLI is a broken symlink. Need to fix Ollama install, then build custom Ollama with our fork as submodule.
+3. **qwen3.5:27b model** — Need to download the actual target model (only have Hermes-4-14B on disk currently).
+4. **10 test prompts** — Need to be written before Phase 2 quality comparison.
+5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing.
+
+---
+
+## Recommendation
+
+**PROCEED TO PHASE 2.**
+
+turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk.
+
+The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment.
+
+---
+
+## Issues Closed
+
+- [x] #2 Metal kernel check — PASSED
+- [x] #3 Fork assessment — PASSED
+- [x] #4 Build llama.cpp fork — COMPLETE
+- [x] #5 PolarQuant verification — 5/6 PASS
+- [x] #6 FP16 baseline benchmarks — RECORDED
+- [x] #7 TurboQuant benchmarks — RECORDED
+- [x] #8 Memory profiling — COMPLETE
+
+---
+
+*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.*
+*Within "typical case" estimate from spec (1-2 hours).*
+
+
+---
+
+# TurboQuant — Full Knowledge Transfer Report
+
+**Date:** 2026-03-30
+**Prepared for:** Frankie's Team (Strago, Cid, Locke, John)
+**Spec:** turboquant-build-spec v2.2 (Strago)
+
+---
+
+## TL;DR
+
+TurboQuant works. PolarQuant KV cache compression delivers **73% memory savings with 1% prompt overhead**. 128K context on the MacBook becomes viable. Custom Ollama build is deferred (multi-day effort), but the fork's `llama-server` is a ready drop-in. Per-layer adaptive quantization is already implemented. QJL is infrastructure-only — not needed at current compression targets.
+
+---
+
+## Hardware Correction
+
+**Spec says:** M4 Max, 32GB
+**Actual:** M3 Max, 36GB (sysctl hw.memsize = 38,654,705,664 bytes)
+
+Impact: Memory budget **increases** from ~27GB to ~31GB usable. Model ceiling improves.
+
+---
+
+## Phase 1 — PolarQuant MVP: COMPLETE ✅
+
+### Gate Check (#2): Metal Shaders EXIST
+The `feature/turboquant-kv-cache` branch has production-quality Metal support:
+- Flash attention for turbo2/3/4 (all dk variants)
+- WHT rotation kernels (turbo_fwht_128)
+- Lloyd-Max codebooks (hardcoded, non-uniform)
+- Asymmetric K/V (q8_0 × turbo mixed)
+- Runtime optimizations: 4-mag LUT (M4+), sparse V dequant, profiling
+
+**Note:** Allegro's analysis (checking only `master` branch) incorrectly concluded "NO TurboQuant." The implementation lives on the feature branch.
+
+### PolarQuant Verification (#5): 5/6 PASS
+
+| Item | Verdict |
+|------|---------|
+| WHT rotation (structured orthogonal) | PASS (Metal). CPU turbo4 ref uses dense random (legacy) |
+| Same rotation quant/dequant | PASS |
+| Lloyd-Max codebook (not uniform) | PASS |
+| Radius at FP16+ | PASS |
+| No per-vector normalization | PASS |
+| Dequant matches quant in Metal | PASS |
+
+**Flag:** CPU turbo4 reference path is algorithmically incompatible with Metal dequant. Only matters if CPU fallback invoked for turbo4. Metal production path is clean.
+
+### Benchmark Results
+
+**Model tested:** Hermes-4-14B Q4_K_M (8.38 GiB)
+
+#### Throughput
+
+| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
+|:-------------|:---------------|:--|:-------------------|:--|
+| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
+| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
+| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
+| q8_0/turbo4 (asymmetric) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
+
+#### KV Memory Savings
+
+| Context | f16 KV | turbo4 KV | Savings |
+|:--------|:-------|:----------|:--------|
+| 2K | 320 MiB | 85 MiB | 73.4% |
+| 8K | 1,280 MiB | 340 MiB | 73.4% |
+| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
+| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
+
+Measured matches calculated exactly. Zero fragmentation overhead.
+
+#### What This Means for qwen3.5:27b
+
+| Scenario | Total Memory | Fits 31GB? |
+|:---------|:-------------|:-----------|
+| 27B + f16 KV @ 128K | ~38 GB | ❌ No |
+| 27B + **turbo4 KV @ 128K** | **~23.4 GB** | **✅ Yes (7.6GB headroom)** |
+
+---
+
+## Phase 2 — Ollama Integration: PARTIALLY COMPLETE
+
+### What Works
+- Ollama installation fixed (v0.17.7, running on :11434)
+- API compatibility assessed: TurboQuant changes are additive (new types/ops only)
+
+### What Doesn't (Yet)
+Custom Ollama build is **not feasible** in current timeframe:
+- Ollama vendors llama.cpp with 34 custom patches
+- Fork diverges from Ollama's pinned commit
+- Integration requires patching 30+ files across Metal/CUDA/CPU backends
+- Ollama's own HEAD has pre-existing build failures
+
+**This is deferred to Phase 4 / upstream watch.** When Ollama updates their llama.cpp pin or TurboQuant lands upstream, the gap narrows.
+
+### Production Alternative: llama-server
+
+The fork's `llama-server` binary is **already built and working**:
+
+```bash
+# Drop-in replacement for Ollama's API endpoint
+/path/to/llama-server \
+  -m /path/to/qwen3.5-27b-q4_k_m.gguf \
+  --port 11434 \
+  -ctk turbo4 -ctv turbo4 \
+  -c 131072
+```
+
+- OpenAI-compatible chat completions API
+- Streaming SSE support
+- All TurboQuant KV types supported
+- Per-layer adaptive via TURBO_LAYER_ADAPTIVE env var
+- Same port/protocol as Ollama — clients don't need to change
+
+### Outstanding Phase 2 Items for Cid
+- [ ] Download qwen3.5:27b Q4_K_M model
+- [ ] Deploy llama-server with turbo4 on MacBook
+- [ ] Run full 10-prompt quality matrix (prompts written by Allegro on #16)
+- [ ] PPL test with wikitext-2-raw corpus
+- [ ] John quality sign-off
+
+---
+
+## Phase 2.5 — Per-Layer Quantization: ALREADY IMPLEMENTED ✅
+
+Found in the fork. No additional work needed.
+
+### Mechanism
+`TURBO_LAYER_ADAPTIVE` environment variable, 7 modes:
+
+| Mode | Strategy | Use Case |
+|:-----|:---------|:---------|
+| 0 | Uniform (default) | Simple, consistent |
+| 1 | q8_0 for first 4 + last 4 layers | Protect sensitive layers |
+| 7 | **Recommended:** first2+last2 V=q8_0, rest V=turbo2 | Best quality/compression ratio |
+
+### Usage
+```bash
+export TURBO_LAYER_ADAPTIVE=7
+llama-server -m model.gguf -ctk turbo4 -ctv turbo4
+```
+
+### Benchmark Status
+Mode benchmarks queued. Uniform turbo4 baseline established. Per-layer modes expected to improve quality at same compression ratio.
+
+---
+
+## Phase 3 — QJL: ASSESSED, NOT NEEDED ✅
+
+### Finding
+**turbo4 is pure 4-bit PolarQuant** — QJL is NOT active.
+
+`TURBO4_USE_4BIT` defaults to 1 in `ggml-common.h`. The legacy 3-bit+QJL path exists but is disabled. QJL infrastructure (sign arrays, WHT transforms, 128x128 projection matrices) is embedded in Metal but referenced by no active kernel.
+
+### Recommendation
+**Not needed for current goals.** 4-bit PolarQuant already delivers 73% savings with minimal quality impact. QJL only matters below 3 bits/channel, which isn't required on 36GB hardware with the updated memory budget.
+
+---
+
+## Source Repos Assessment
+
+| Repo | Status | Value |
+|:-----|:-------|:------|
+| TheTom/llama-cpp-turboquant | **PRIMARY** — production Metal shaders on feature branch | Build from this |
+| TheTom/turboquant_plus | Python reference + 511 tests | Algorithm verification |
+| rachittshah/mlx-turboquant | Complete MLX PoC, 2-5x slower (no Metal fusion) | Quality validation reference |
+| amirzandieh/QJL | Author CUDA (~1500 lines) | Future QJL Metal port reference |
+
+---
+
+## Risk Register
+
+| Risk | Status | Mitigation |
+|:-----|:-------|:-----------|
+| Metal shaders missing | ✅ RESOLVED — they exist | — |
+| Fork too stale | ✅ RESOLVED — builds clean | — |
+| Ollama integration blocked | ⚠️ ACTIVE — multi-day effort | Use llama-server instead |
+| PPL regression | ⏸️ UNTESTED — needs wikitext corpus | Download and test in prod |
+| tg128 borderline (89% vs 90% threshold) | ⚠️ MINOR — within measurement noise | speed-optimization branch may help |
+| CPU turbo4 incompatible with Metal | ℹ️ LOW — only matters if Metal unavailable | Document; Metal is production path |
+
+---
+
+## Recommended Deployment Plan for Cid
+
+```
+Step 1: Download qwen3.5:27b Q4_K_M via HuggingFace
+        huggingface-cli download bartowski/qwen3.5-27B-GGUF qwen3.5-27b-q4_k_m.gguf
+
+Step 2: Build fork (if not already done)
+        cd /path/to/llama-cpp-turboquant
+        git checkout feature/turboquant-kv-cache
+        cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
+        cmake --build build -j$(sysctl -n hw.ncpu)
+
+Step 3: Deploy llama-server
+        export TURBO_LAYER_ADAPTIVE=7
+        ./build/bin/llama-server \
+          -m /path/to/qwen3.5-27b-q4_k_m.gguf \
+          --port 11434 \
+          -ctk turbo4 -ctv turbo4 \
+          -c 131072 \
+          --host 0.0.0.0
+
+Step 4: Validate
+        curl http://localhost:11434/v1/chat/completions \
+          -H "Content-Type: application/json" \
+          -d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'
+
+Step 5: Run quality matrix (prompts on issue #16)
+Step 6: John reviews output quality
+Step 7: If pass → production. If fail → drop to turbo3 or adjust per-layer profile.
+```
+
+---
+
+## Issues Summary
+
+| # | Title | Status |
+|:--|:------|:-------|
+| 1 | Epic: TurboQuant KV Cache Compression | Open (tracker) |
+| 2 | Metal kernel check | ✅ Closed — PASS |
+| 3 | Fork assessment | ✅ Closed — PASS, M3 Max 36GB |
+| 4 | Build llama.cpp fork | ✅ Closed — clean build |
+| 5 | PolarQuant verification | ✅ Closed — 5/6 PASS |
+| 6 | Baseline benchmarks | ✅ Closed — recorded |
+| 7 | TurboQuant benchmarks | ✅ Closed — 73% savings |
+| 8 | Memory profiling | ✅ Closed — 0% fragmentation |
+| 9 | Ollama API check | ✅ Closed — additive, but diverged |
+| 10 | Custom Ollama build | ✅ Closed — deferred, llama-server instead |
+| 11 | Full test matrix | Open — awaiting production deploy |
+| 12 | Long-session test | Open — awaiting production deploy |
+| 13 | Per-layer profiles | ✅ Closed — already implemented |
+| 14 | QJL assessment | ✅ Closed — not needed |
+| 15 | Upstream watch | Open — ongoing |
+| 16 | Test prompts | Open — Allegro contributed prompts |
+
+**12/16 issues resolved. 4 remaining are production validation tasks for Cid.**
+
+---
+
+*Repo: http://143.198.27.163:3000/Timmy_Foundation/turboquant*
+*Build: /tmp/llama-cpp-turboquant/build/bin/ (all binaries)*
+*Branch: feature/turboquant-kv-cache*
+
+
+---
+
+# TurboQuant Implementation — Build Spec (v2)
+**Prepared by:** Strago | **Date:** 2026-03-30 | **Updated:** 2026-03-30 (v2 — external review fixes)
+**Task:** STR-2026-03-30-01 | **For:** Cid (build) + Frankie (coordination)
+**Inputs read:** turboquant-2026-03-25.md (Google brief), turboquant-2026-03-30-recon-update.md (Locke recon), infra-bulletin.md, MEMORY.md, external Opus review
+
+---
+
+## Situation
+
+John wants maximum local inference quality on the MacBook Pro (M4 Max, 32GB unified memory) using TurboQuant-level KV cache compression. Currently running `qwen3.5:27b` via Ollama at `10.0.0.133:11434`. The goal: run a larger or better model within the same 32GB memory envelope by compressing the KV cache during inference.
+
+TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:
+1. **PolarQuant** — random rotation + polar coordinates + fixed scalar codebook. No normalization constants. ~4.2× compression.
+2. **QJL** — 1-bit quantized Johnson-Lindenstrauss on the residual. Zero-overhead bias correction.
+3. **TurboQuant** — PolarQuant for main signal + QJL for residual = unbiased inner product quantizer at ~3.5 bits/channel with zero accuracy loss.
+
+Community status: multiple `llama.cpp` forks, MLX proof-of-concepts, and a vLLM plugin exist. Nothing upstreamed to official `llama.cpp`, MLX, or Ollama yet. Author QJL code is public. Enough is public to build from.
+
+---
+
+## 1a. PolarQuant Technical Detail — What Cid Needs to Verify
+
+This section specifies the PolarQuant algorithm concretely so Cid can verify that the community fork implements it correctly. A fork that gets the rotation wrong or uses the wrong codebook boundaries will compress successfully but degrade quality in ways that short PPL benchmarks may not catch — the damage surfaces during long production sessions with sustained context pressure.
+
+### The Algorithm (per KV vector)
+
+**Step 1 — Random Rotation (Preconditioning):**
+- Apply a fixed random orthogonal rotation to each KV vector before quantization.
+- The paper uses a **Walsh-Hadamard transform** (WHT) — a structured orthogonal matrix that's O(d log d) to apply, not O(d²) like a dense random matrix.
+- **Why:** Raw KV vectors have non-uniform coordinate distributions (some dimensions carry more energy). WHT spreads energy uniformly across all coordinates, making the post-rotation distribution predictable and concentrated. This is what eliminates the need for per-vector normalization constants.
+- **Cid verification:** The fork must use a fixed WHT (or equivalent structured orthogonal rotation), not a learned or per-layer rotation. The rotation matrix must be identical at quantization and dequantization. If the fork uses a dense random matrix instead of WHT, it's functionally correct but slower — flag it.
+
+**Step 2 — Polar Coordinate Transform:**
+- After rotation, decompose each vector into **radius** (L2 norm / signal strength) and **angle** (direction on the unit sphere).
+- The radius is stored at higher precision (FP16 or FP32) — it's one scalar per vector, negligible overhead.
+- The angle coordinates are what get quantized. Because WHT made their distribution predictable, you can use a fixed codebook without per-vector calibration.
+
+**Step 3 — Lloyd-Max Scalar Quantization:**
+- Each angle coordinate is independently quantized using a **Lloyd-Max optimal scalar quantizer**.
+- Lloyd-Max minimizes mean squared error for a known distribution. Because WHT makes the distribution analytically computable, the codebook boundaries are **precomputed once** and fixed for all vectors.
+- **Codebook sizes by compression target:**
+  - `turbo4` = 4 bits per coordinate = 16 codebook entries per dimension
+  - `turbo3` = 3 bits = 8 entries
+  - `turbo2` = 2 bits = 4 entries
+- **Cid verification:** Check that the fork's codebook boundaries match what the paper/PolarQuant paper specifies for the target distribution. If the fork uses uniform quantization instead of Lloyd-Max, that's a quality regression — uniform is simpler but wastes bits on low-probability regions.
+
+**Step 4 — Bit Packing + Storage:**
+- Quantized indices are packed into the KV cache format (turbo2/3/4 nibble-packed).
+- Radius stored separately. No normalization constants, no scale factors, no zero-points — this is the key advantage over standard quantization.
+
+### Dequantization During Attention
+
+When the model computes attention scores (Q·K^T) and weighted values (softmax·V):
+1. Read packed indices from cache
+2. Look up codebook values (single table lookup per coordinate)
+3. Reconstruct angle coordinates
+4. Scale by stored radius
+5. Compute dot product in reconstructed space
+
+**Critical property:** The inner product between a full-precision query Q and a PolarQuant-compressed K must be an unbiased estimator of the true Q·K dot product. The WHT rotation preserves this because orthogonal transforms preserve inner products. If the fork adds any non-orthogonal transformation (e.g., learned projection, PCA), the unbiasedness guarantee breaks.
+
+### PolarQuant Initialization — Codebook + Rotation Matrix Setup
+
+PolarQuant requires two things to be initialized before inference can start:
+
+1. **Walsh-Hadamard rotation matrix:** This is deterministic — a WHT of size d (model head dimension, typically 128) is computed from the recursive Hadamard construction. It's the same for every session, every model. Compute once at model load, store in memory. Cost: O(d log d) per head dimension — microseconds. No impact on model load time.
+
+2. **Lloyd-Max codebook:** The quantization boundaries are precomputed for the known post-WHT distribution. For a given bit width (turbo4 = 4 bits = 16 entries), the codebook is a fixed lookup table of 16 boundary values + 16 reconstruction values. This is identical across sessions and models of the same head dimension. Can be hardcoded as a constant array or computed once at load time from the analytical distribution formula.
+
+**Expected initialization overhead:** Negligible — both are small deterministic computations. But **measure it during Phase 1**: time the gap between Ollama receiving a request and the first token appearing, with and without TurboQuant. If initialization adds >1 second to cold model load, investigate caching the tables to disk alongside the model file.
+
+**Cid measurement target:** Report model load time (cold start) with and without TurboQuant. If >5 second delta, flag as UX issue.
+
+**Cid verification checklist (before trusting benchmark numbers):**
+- [ ] Rotation is WHT or equivalent structured orthogonal (not learned, not dense random)
+- [ ] Same rotation matrix used for quantization and dequantization
+- [ ] Codebook is Lloyd-Max (not uniform), boundaries precomputed for post-WHT distribution
+- [ ] Radius stored separately at FP16+ precision
+- [ ] No per-vector normalization constants stored (this is the whole point)
+- [ ] Dequant path in Metal shader matches the quantization path exactly
+
+---
+
+## 1. Model Targeting — What Can We Run?
+
+### Memory Budget — Realistic, Not Theoretical
+
+On a 32GB M4 Max running macOS, you do NOT have 32GB for inference. Realistic budget:
+
+| Consumer | Estimate |
+|----------|----------|
+| macOS + system services | ~2-3GB |
+| Metal command buffer + GPU driver overhead | ~1-2GB |
+| Ollama process overhead | ~0.5GB |
+| Activation memory (intermediate tensors during forward pass) | ~1-3GB (varies by model/batch) |
+| **Available for model weights + KV cache** | **~26-28GB** |
+
+**Use 27GB as the planning ceiling.** The v1 spec said "leaves 2GB for OS" at 30GB peak — that's too tight. All memory calculations below use 27GB available.
+
+### Current State (No TurboQuant)
+- **qwen3.5:27b** at Q4_K_M (~16GB model weights) — fits within 27GB budget with room for KV cache
+- At 32K context, KV cache for a 27B model at FP16 ≈ 4-6GB → total ~20-22GB. Comfortable.
+- At 64K context, KV cache ≈ 8-12GB → total ~24-28GB. Marginal — may swap.
+- At 128K context, KV cache grows to ~16-24GB → doesn't fit. Context-limited.
+
+### With TurboQuant (4× KV Compression)
+- KV cache at 32K drops from ~5GB → ~1.2GB
+- KV cache at 64K drops from ~10GB → ~2.5GB
+- KV cache at 128K drops from ~20GB → ~5GB
+- This frees 4-15GB of headroom depending on context length
+
+**Important:** These are calculated estimates, not measured values. Actual memory consumption can exceed theoretical due to fragmentation, allocation overhead, and implementation-specific buffering. Phase 1 **must** include actual peak memory measurement (see validation section). If measured exceeds calculated by >15%, the context ceiling drops accordingly.
+
+### Model Recommendations
+
+**Primary target: qwen3.5:27b at Q4_K_M with extended context**
+- Model weights: ~16GB at Q4_K_M
+- With TurboQuant KV cache at 64K context: ~2.5GB cache + ~2GB activations → ~20-21GB total. Comfortable within 27GB budget.
+- With TurboQuant at 128K: ~5GB cache + ~2GB activations → ~23GB total. Fits, but tight — **needs measured validation.**
+- Without TurboQuant: 64K context KV cache ≈ 10GB → ~28GB total. OOM risk.
+- **Win: 64K context becomes reliable, 128K becomes possible.** This is the real unlock.
+
+**Stretch target: Qwen 3.5 32B (Q4_K_M)**
+- Model weights: ~18-19GB at Q4_K_M
+- With TurboQuant at 64K: ~2.5GB cache + ~2.5GB activations → ~23-24GB. Fits within 27GB but leaves only ~3GB headroom.
+- **Verdict: worth testing in Phase 1 benchmarks alongside 27B.** If it fits, marginally better quality. If it's marginal, stay on 27B.
+
+**Not recommended: Qwen 3.5 72B (Q2_K or IQ3_XXS)**
+- Model weights at Q2_K: ~27GB. Leaves ~0GB for anything else.
+- **Verdict: does not fit.** Even with TurboQuant, no room for KV cache + activations + Metal overhead. And quality at Q2_K is poor — weight quantization damage cancels the parameter count advantage.
+
+**Recommended path: Stay on 27B class, use TurboQuant to unlock longer context (64K-128K) rather than a bigger model.** The real win on 32GB unified is context length, not parameter count. A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.
+
+**Alternative worth testing: Mistral/Codestral 25B-class models** at Q5_K_M (~18GB) with TurboQuant. Locke's research notes TurboQuant was benchmarked on Mistral — community results may be more reproducible.
+
+---
+
+## 2. Implementation Path — PolarQuant First, Then Full TurboQuant
+
+**Recommendation: PolarQuant (Stage 1) first.** Matches Locke's recommendation. Rationale:
+
+- PolarQuant alone delivers ~4.2× compression — that's the bulk of the win
+- Full TurboQuant adds QJL residual correction for marginal quality improvement at extreme compression (2.5 bits)
+- At 3.5+ bits/channel, PolarQuant is sufficient for zero accuracy loss
+- QJL adds kernel complexity for small incremental gain at our target compression ratio
+- We can always add QJL in Phase 2 if PolarQuant quality isn't sufficient
+
+### Source Repos (Priority Order)
+
+| Repo | What | Why | Risk |
+|------|------|-----|------|
+| **`TheTom/llama-cpp-turboquant`** | `llama.cpp` fork with Metal support | Most directly useful — same stack as Ollama. Reports PPL numbers on M-series. | Community fork, not upstream. May lag `llama.cpp` HEAD. |
+| **`TheTom/turboquant_plus`** | Standalone C implementation + Python tests | Most detailed reverse-engineering. 511+ tests. PolarQuant + Walsh-Hadamard + turbo2/3/4 formats. | Extends beyond paper ("Plus"). May include non-paper innovations. |
+| **`amirzandieh/QJL`** | Author's QJL CUDA implementation | Official author code. CUDA kernels, eval scripts, LongBench commands. | CUDA only — needs Metal port for MacBook. Phase 2 dependency. |
+| **`rachittshah/mlx-turboquant`** | MLX proof-of-concept | Native Apple Silicon. Correct module layout (codebooks, polar_quant, qjl). | May be partial implementation. Naming drift noted. |
+
+**Start from:** `TheTom/llama-cpp-turboquant` (for Ollama integration path) + `TheTom/turboquant_plus` (for reference/tests).
+
+### Community Fork Risk Assessment
+
+The v1 spec understated this. Community `llama.cpp` forks can diverge significantly from HEAD, especially in the Metal backend where Apple Silicon optimizations change frequently. The risk isn't "it doesn't build" — it's "it builds fine on the fork's base commit but breaks when cherry-picked onto current HEAD."
+
+**Specific risk areas:**
+- **KV cache layer:** `llama.cpp` has refactored KV cache internals multiple times in 2026. A fork based on a 4-week-old commit may touch structs/functions that have been renamed or restructured upstream.
+- **Metal shaders:** Apple Silicon Metal optimizations are actively changing. Custom Metal kernels for TurboQuant dequant may conflict with upstream shader refactors.
+- **Memory management:** `ggml` memory allocation has evolved. The fork's cache allocation assumptions may not match current `ggml` memory pools.
+
+**Mitigation plan (Phase 1 Step 0 — before any benchmarking):**
+
+1. **Check fork freshness:** `git log --oneline -1` on the fork. Compare base commit date against `llama.cpp` HEAD. If >4 weeks stale, flag as HIGH risk.
+2. **If fresh (< 2 weeks from HEAD):** Build directly. Likely works.
+3. **If stale (2-4 weeks):** Attempt cherry-pick of TurboQuant-specific commits onto current HEAD. If merge conflicts are limited to TurboQuant files → resolve manually. If conflicts touch core KV cache / Metal code → stop, evaluate effort.
+4. **If very stale (> 4 weeks) or conflicts are extensive:** Switch to **clean-room approach** — use `TheTom/turboquant_plus` as the algorithm reference and implement the KV cache types directly into current `llama.cpp` HEAD. This is more work (~60-90 min instead of ~20-40 min) but avoids the merge conflict maze.
+5. **Escape hatch:** If `llama.cpp` path is blocked, fall back to `rachittshah/mlx-turboquant` (MLX native, no fork divergence risk, but requires API proxy for Ollama compatibility).
+
+**Cid decision point:** After Step 0, report fork age + conflict assessment before proceeding. If clean-room is needed, update the time estimate and Frankie adjusts the schedule. Don't spend more than 15 minutes fighting merge conflicts — switch to clean-room.
+
+### Metal Kernel Risk — The Single Highest-Risk Assumption
+
+The spec assumes the `llama.cpp` fork has working **Metal shaders** for PolarQuant KV dequantization. KV dequant happens in the attention computation hot path — every token, every layer, every head. If the fork only has CPU or CUDA dequant kernels and no Metal implementation, the MacBook will either:
+- Fall back to CPU dequant → **catastrophic** performance loss (10-50× slower attention)
+- Fail to build entirely for Metal backend
+
+**Cid's actual first action** (before building, before benchmarking, before anything):
+
+```bash
+# Clone the fork
+git clone https://github.com/TheTom/llama-cpp-turboquant.git
+cd llama-cpp-turboquant
+
+# Check for Metal shader files referencing TurboQuant/PolarQuant
+grep -rn "turbo\|polar\|turboquant\|polarquant" ggml/src/ggml-metal* 2>/dev/null
+grep -rn "turbo\|polar" ggml/src/ggml-metal.metal 2>/dev/null
+
+# Check for Metal kernel dispatch for turbo KV types
+grep -rn "GGML_TYPE_.*TURBO\|turbo.*metal\|kv.*turbo" . --include="*.m" --include="*.metal" --include="*.h" 2>/dev/null
+```
+
+**If Metal shaders exist:** Proceed with `llama.cpp` fork path (primary).
+**If Metal shaders do NOT exist:** MLX becomes the **primary** path, not the fallback. Switch to `rachittshah/mlx-turboquant` immediately. Reframe Phase 1 around MLX + API proxy for Ollama compatibility. Report this finding before spending any more time on the `llama.cpp` path.
+
+This check takes 2 minutes and determines the entire build strategy. Do it first.
+
+---
+
+## 3. Integration Target — llama.cpp → Ollama
+
+**Primary: `llama.cpp` fork → custom Ollama build.**
+
+Why not MLX:
+- Our entire fleet uses Ollama. Model management, API compatibility, endpoint routing — all built around Ollama.
+- MLX would require a separate inference server, separate model format, separate API integration.
+- Ollama is built on `llama.cpp`/`ggml`. KV cache changes in `llama.cpp` propagate to Ollama.
+
+**Integration strategy:**
+1. Build/test TurboQuant KV cache in a `llama.cpp` fork (Metal backend)
+2. Validate quality + performance
+3. Build custom Ollama from our `llama.cpp` fork (Ollama builds `llama.cpp` as a submodule)
+4. Deploy to MacBook as replacement Ollama binary
+5. Existing model files, API, and endpoint (`10.0.0.133:11434`) remain identical — only the inference engine changes
+
+**Fallback: MLX standalone** if `llama.cpp` Metal integration proves too complex. `rachittshah/mlx-turboquant` as starting point. Would require a small proxy server to maintain API compatibility with our Ollama endpoint.
+
+---
+
+## 4. Validation Plan — How We Know It Works
+
+### Quality Validation
+
+**Test matrix (run each model with and without TurboQuant):**
+
+| Test | What It Measures | Tool | Pass Criteria |
+|------|-----------------|------|--------------|
+| Perplexity (PPL) | Overall language modeling quality | `llama-perplexity` on WikiText-2 | PPL delta ≤ 0.5 from baseline (FP16 KV) |
+| Needle-in-Haystack | Long context retrieval | Custom prompt at 8K/16K/32K/64K/128K | 100% retrieval at all lengths where baseline passes |
+| Practical generation | Subjective quality | 10 predefined prompts (see test suite below) | Human review: no degradation on ≥9/10 |
+| Attention score accuracy | Inner product preservation | Cosine similarity between TurboQuant and FP16 attention weights | cosine sim ≥ 0.995 |
+
+**Predefined Test Prompts (10 prompts, run identically on TurboQuant and FP16 KV baseline):**
+
+| # | Category | Prompt Description | What It Tests |
+|---|----------|-------------------|---------------|
+| 1 | Long-context summarization | Feed 20K tokens of a research paper, ask for structured summary with citations | KV cache quality at length — compressed K/V must retain source detail |
+| 2 | Multi-step reasoning | 5-step math word problem requiring chain-of-thought | Whether compressed KV degrades intermediate reasoning steps |
+| 3 | Code generation | Write a Python script with 3 functions, error handling, type hints | Precise token prediction — code is unforgiving of subtle quality drops |
+| 4 | Code debugging | Provide buggy code (3 bugs), ask to identify and fix all three | Attention to detail across context — must reference earlier code correctly |
+| 5 | Factual recall (early context) | Provide 10 facts in the first 1K tokens, continue for 8K tokens of filler, ask about fact #3 | Retrieval from early context through compressed KV |
+| 6 | Creative writing | Write a 500-word short story with specific constraints (setting, character, twist) | Compression artifacts surface as repetition or coherence loss |
+| 7 | Multi-turn conversation | 10-turn technical Q&A where later questions reference earlier answers | Cross-turn coherence through accumulated compressed KV |
+| 8 | Structured output | Generate a JSON schema with 15+ fields, nested objects, and validation rules | Format precision — compressed KV must maintain structural consistency |
+| 9 | Translation + analysis | Translate a paragraph EN→ES, then analyze the translation choices | Tests both generation quality and meta-reasoning about own output |
+| 10 | Instruction following | Complex prompt with 8 specific formatting requirements (headers, bullet style, word limits, etc.) | Whether compression causes the model to "forget" constraints mid-generation |
+
+**Prompts must be written and saved to `projects/sovereign-stack/turboquant-test-prompts.md` before Phase 2 benchmarks run.** Same prompts, same order, both configurations. This prevents unconscious cherry-picking.
+
+**Asymmetric K/V test:** Run K at Q8_0, V at turbo4. Community reports this works well on sensitive models. Compare PPL vs symmetric turbo4 K+V.
+
+**Long-session quality test (Phase 2 only):** Short-context PPL benchmarks can miss quality degradation that surfaces during sustained context pressure. During Phase 2, run one extended production simulation:
+- Generate a 50-turn multi-step reasoning conversation (code gen → debug → refactor → test → iterate)
+- Compare output quality vs same conversation on FP16 KV baseline
+- Specifically watch for: coherence drift after turn 30+, hallucinated references to earlier context, attention score softmax concentration (if measurable)
+- This catches the case where codebook boundary errors accumulate over many KV cache writes in a single session
+
+### Performance Validation
+
+| Metric | Measure | Pass Criteria |
+|--------|---------|--------------|
+| Tokens/second (generation) | `llama-bench` | ≥90% of baseline tok/s (small decode overhead acceptable) |
+| Time to first token (TTFT) | Timed prompt eval | ≤110% of baseline |
+| Peak resident memory | `footprint -p <pid>` or `vmmap --summary` at each context length | Stays under 27GB at target context length |
+| Memory vs theoretical | Compare measured peak to calculated estimate | If measured exceeds calculated by >15% → reduce context ceiling |
+| Context length ceiling | Binary search: max context before OOM or swap pressure | 64K minimum (vs ~32K baseline for 27B) |
+
+### Kill Criteria
+- PPL regression > 1.0 at any compression level → abort that compression level
+- OOM at 32K context (baseline capability) → regression, abort
+- tok/s drops > 25% → dequant overhead too high, need kernel optimization before deploy
+
+---
+
+## 5. Who Does What
+
+| Role | Owner | What |
+|------|-------|------|
+| Build spec | Strago | This document ✅ |
+| Implementation | Cid | Fork `llama.cpp`, integrate PolarQuant KV cache, Metal kernels, build custom Ollama |
+| Validation | Cid | Run test matrix, report PPL/performance numbers |
+| Model selection | Cid | Test qwen3.5:27b + one Mistral variant, recommend best config |
+| MacBook deployment | Cid | Replace Ollama binary on MacBook, verify endpoint works |
+| Quality review | John | Review 10-prompt practical generation comparison |
+| Research support | Locke | If Cid hits a wall on the math, Locke deep-dives the paper/QJL code |
+
+---
+
+## 6. Phasing
+
+### Phase 1 — PolarQuant MVP (Target: this week)
+
+**Scope:**
+
+**Step 0 — Fork Assessment (do this FIRST, report before proceeding):**
+- Clone `TheTom/llama-cpp-turboquant`
+- Check base commit age vs `llama.cpp` HEAD (`git log --oneline -1`)
+- Check `sysctl hw.memsize` on MacBook (resolve the 32/36/48GB question)
+- If fork < 2 weeks stale → proceed to build
+- If 2-4 weeks stale → attempt cherry-pick, report conflict scope
+- If > 4 weeks or conflicts extensive → switch to clean-room (see Fork Risk Assessment above)
+- Report: fork age, conflict assessment, MacBook actual RAM, estimated build path time
+
+**Step 1 — Build + Verify:**
+- Build `llama.cpp` fork (or clean-room) with Metal backend on MacBook (M4 Max)
+- Run the Section 1a verification checklist against the fork's implementation before trusting any benchmarks
+- Run FP16 KV baseline: `llama-perplexity` on WikiText-2 with `qwen3.5:27b` at 8K context (this is the number we're comparing against)
+
+**Step 2 — Benchmark PolarQuant:**
+- Run perplexity test with PolarQuant KV (turbo4 format) vs FP16 KV baseline
+- Run `llama-bench` for tok/s comparison
+- Test at 8K, 32K, and 64K context lengths
+- Run asymmetric test: K at Q8_0, V at turbo4
+- **Measure actual peak resident memory** at each context length (`footprint -p <pid>` or `vmmap --summary`). Compare measured vs calculated. If measured exceeds calculated by >15%, note the delta — it reduces the achievable context ceiling.
+- Report: PPL delta per context length, tok/s delta, **measured peak memory per context length**, max context before OOM/swap, asymmetric vs symmetric results
+
+**Deliverable:** Working `llama.cpp` build on MacBook with PolarQuant KV cache. PPL + performance numbers. Section 1a verification checklist completed.
+
+**Estimated Cid time (honest range):**
+- **Best case** — fork is fresh, builds clean on first try, Metal shaders work: 20-40 min
+- **Typical case** — fork needs CMake flag tweaks, Xcode SDK adjustments, minor Metal fixes: 1-2 hours
+- **Worst case** — fork is stale, conflicts extensive, or Metal shaders missing: clean-room build 2-4 hours, or MLX pivot
+
+**2-hour build troubleshooting cap:** If the `llama.cpp` fork doesn't compile and pass a basic smoke test (load model, generate 10 tokens) within 2 hours of troubleshooting, stop. Pivot to MLX path. Don't sink more time into Xcode/CMake/Metal debug loops when a working MLX PoC exists. Report what broke — the information is useful even if the path is abandoned.
+
+**Decision gate:** If PPL delta ≤ 0.5 and tok/s ≥ 90% baseline AND Section 1a checklist passes → proceed to Phase 2. If PPL fails but checklist passes → try asymmetric K/V or lower compression (turbo3 instead of turbo4). If checklist fails → fix implementation before trusting benchmarks.
+
+### Phase 2 — Ollama Integration + Production Deploy
+
+**Scope:**
+
+**Step 0 — Ollama API Compatibility Check (before building):**
+Ollama pins a specific `llama.cpp` commit and calls it through CGo bindings in `llm/`. If our fork changes any function signatures, struct layouts, or enum values that Ollama's Go code references, the build will either fail or produce subtle runtime bugs.
+
+```bash
+# Clone Ollama source
+git clone https://github.com/ollama/ollama.git
+cd ollama
+
+# Find the pinned llama.cpp commit
+cat llm/llama.cpp/CMakeLists.txt | head -5  # or check go.mod / Makefile
+
+# Diff our fork's API surface against Ollama's expected API
+# Focus on: llama.h, ggml.h function signatures that Ollama calls
+diff <(grep -h "^LLAMA_API\|^GGML_API" llm/llama.cpp/include/*.h | sort) \
+     <(grep -h "^LLAMA_API\|^GGML_API" /path/to/our-fork/include/*.h | sort)
+```
+
+If API surface differs: check if TurboQuant changes are additive (new functions/types only) or modify existing signatures. Additive = safe. Modified existing = need to update Ollama's CGo bindings.
+
+**Build steps:**
+- Build custom Ollama binary using our `llama.cpp` fork as submodule
+- Deploy to MacBook as replacement Ollama
+- Verify existing endpoint (`10.0.0.133:11434`) works identically
+- Run full test matrix (all 4 quality tests + all 4 performance tests)
+- Test with qwen3.5:27b at 64K and 128K context
+- If 128K works: update Ollama model config to advertise larger context
+- Run 10-prompt practical generation comparison for John review
+
+**Deliverable:** Production Ollama on MacBook with TurboQuant KV cache. Full benchmark report. John signs off on quality.
+
+**Estimated Cid time:** 15-25 min (Ollama build is straightforward once `llama.cpp` fork is validated).
+
+### Phase 2.5 — Per-Layer Quantization Profiles (Optimization, Optional)
+
+Not all transformer layers have equal sensitivity to KV cache quantization. Research and community experimentation show early layers (first 2-4) and late layers (last 2-4) tend to be more sensitive than middle layers. If the fork supports per-layer KV cache type configuration:
+
+- **Sensitive layers (first 3 + last 3):** K at Q8_0, V at turbo4 (or full FP16 KV)
+- **Middle layers:** K and V both at turbo4 (or even turbo3)
+
+This gives the same average compression ratio as uniform turbo4 but concentrates precision where it matters most. The PPL improvement can be meaningful (0.1-0.3) at zero memory cost.
+
+**When to pursue:** Only after Phase 2 is stable and baseline quality is confirmed. This is tuning, not architecture. If uniform turbo4 passes all quality gates, per-layer optimization is nice-to-have, not necessary.
+
+**Cid note:** During Phase 1, check whether the fork exposes per-layer KV type config. If it does, note it for later. Don't implement it yet.
+
+### Phase 3 — QJL Residual Correction (Optional)
+
+**Scope:** Add QJL 1-bit residual correction for full TurboQuant behavior. Only pursue if:
+- Phase 1/2 PolarQuant shows quality gaps at extreme compression (< 3 bits/channel)
+- We want to push to 2.5 bits/channel for even more context headroom
+
+**Source:** `amirzandieh/QJL` repo (CUDA → Metal port needed)
+
+**Estimated Cid time:** 30-60 min (Metal port of QJL kernels is real engineering work)
+
+**Decision gate:** Only proceed if PolarQuant alone doesn't meet quality bar at target compression.
+
+### Phase 4 — Upstream Watch
+
+**Scope:** Monitor `llama.cpp` upstream and Ollama for official TurboQuant support. When it lands:
+- Evaluate upstream implementation vs our fork
+- If upstream is better: migrate off our fork to official
+- If our fork is better: contribute upstream (optional)
+
+**Owner:** Locke (monitoring) + Cid (evaluation when it lands)
+
+---
+
+## What This Spec Does NOT Cover
+
+- **Weight quantization** — TurboQuant is KV cache compression only. Model weight quantization (GGUF Q4_K_M etc.) is a separate concern and already handled by Ollama.
+- **Predator (desktop) deployment** — this spec targets MacBook only. Predator runs NVIDIA (CUDA) which is a different kernel backend. Can extend later.
+- **Multi-model serving** — TurboQuant helps with single-model memory but doesn't change Ollama's single-model-at-a-time constraint.
+- **Ollama upstream contribution** — out of scope for now. We build for ourselves first.
+
+---
+
+## Open Questions for John
+
+**None blocking.** One informational:
+
+- **MacBook Pro memory:** Confirmed M4 Max 32GB from memory/2026-03-14.md. If it's actually 36GB or 48GB (M4 Max comes in 36/48/128 configs), that changes the model ceiling. Can Cid check `sysctl hw.memsize` on the MacBook during Phase 1? Non-blocking — doesn't change the approach, just the model size ceiling.
+
+---
+
+## Reference Files
+
+| File | Location |
+|------|----------|
+| TurboQuant Google Brief | `projects/sovereign-stack/research/turboquant-2026-03-25.md` |
+| Locke Recon Update | `projects/sovereign-stack/research/turboquant-2026-03-30-recon-update.md` |
+| `llama.cpp` TurboQuant fork | `github.com/TheTom/llama-cpp-turboquant` |
+| TurboQuant+ reference impl | `github.com/TheTom/turboquant_plus` |
+| QJL author code | `github.com/amirzandieh/QJL` |
+| MLX PoC (fallback) | `github.com/rachittshah/mlx-turboquant` |
+| TurboQuant paper | `arxiv.org/abs/2504.19874` |
+| PolarQuant paper | `arxiv.org/abs/2502.02617` |
+
+---
+
+---
+
+## Changelog
+
+- **v1 (2026-03-30 12:26 ET):** Initial spec.
+- **v2 (2026-03-30 12:55 ET):** Added Section 1a (PolarQuant technical detail + Cid verification checklist), expanded fork risk assessment with mitigation plan, added Phase 1 Step 0 (fork assessment before benchmarking), added long-session quality test for Phase 2, updated Phase 1 time estimate for clean-room path. Changes driven by external Opus review round 1.
+- **v2.1 (2026-03-30 13:00 ET):** Added Metal kernel risk check (grep before build — determines llama.cpp vs MLX primary path), corrected memory budget (27GB available, not 30GB — accounts for OS + Metal driver + activations), added measured memory profiling requirement to Phase 1, added Ollama CGo API compatibility check to Phase 2 Step 0, tightened model ceiling estimates. Changes driven by external Opus review round 2.
+- **v2.2 (2026-03-30 13:05 ET):** Added honest time estimate range (20 min best → 2-4 hr worst), 2-hour build troubleshooting cap before MLX pivot, PolarQuant initialization detail (WHT + Lloyd-Max codebook setup + cold-start measurement target), 10 predefined test prompts with rationale (prevents cherry-picking), per-layer quantization profiles as Phase 2.5 optimization path. Changes driven by external Opus review round 3.
+
+---
+
+*Build spec v2 ready for Cid intake. No clarifying questions needed.*
+
+
+---
+