From d8f5972926ce2552ba7f0e0488952e48a2af42e8 Mon Sep 17 00:00:00 2001
From: Google AI Agent
Date: Mon, 13 Apr 2026 00:32:26 +0000
Subject: [PATCH] fix: move PHASE1-REPORT.md to docs/PROJECT_STATUS.md

---
 PHASE1-REPORT.md | 139 ---------------------------------------------
 1 file changed, 139 deletions(-)
 delete mode 100644 PHASE1-REPORT.md

diff --git a/PHASE1-REPORT.md b/PHASE1-REPORT.md
deleted file mode 100644
index 4d49cf41..00000000
--- a/PHASE1-REPORT.md
+++ /dev/null
@@ -1,139 +0,0 @@
-# TurboQuant Phase 1 Report — PolarQuant MVP
-
-**Date:** 2026-03-30
-**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John)
-**Spec:** turboquant-build-spec v2.2 (Strago)
-
----
-
-## Executive Summary
-
-Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. turbo4 delivers **73% KV memory savings with only 1% prompt-processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear.
-
-**Hardware correction:** The MacBook is an M3 Max with 36GB (not M4 Max 32GB as in the spec). This INCREASES our memory budget from 27GB to ~31GB.
-
----
-
-## Gate Check (#2): PASSED ✅
-
-Metal shaders exist and are comprehensive:
-
-- Full flash attention for turbo2/3/4 with dk32-dk576 variants
-- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse)
-- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/128))
-- Asymmetric K/V support (q8_0 × turbo mixed pairs)
-- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes
-- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization
-
-**Decision: llama.cpp path confirmed.**
-**No MLX pivot needed.**
-
----
-
-## Fork Assessment (#3): PASSED ✅
-
-- Branch: `feature/turboquant-kv-cache` (commit adac2c6)
-- Fork freshness: ADEQUATE (recent enough for a direct build)
-- Build: clean cmake + make, 100% success in ~3 minutes
-- All binaries built: llama-cli, llama-bench, llama-perplexity, llama-server
-
----
-
-## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅
-
-| Item | Verdict |
-|------|---------|
-| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU path uses WHT ✅. CPU turbo4 reference uses a dense random rotation (legacy, not production) |
-| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() use identical sign arrays |
-| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" |
-| Radius at FP16+ | PASS — ggml_half norm per 128-element group |
-| No per-vector normalization | PASS — one group norm only; static_asserts enforce block sizes |
-| Dequant matches quant in Metal | PASS — same centroids, signs, and butterfly structure |
-
-**⚠️ Flag for Cid:** The CPU turbo4 reference path is incompatible with Metal dequant. This only matters if the CPU fallback is ever invoked for turbo4.
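The verification table above describes the shape of the PolarQuant pipeline: a sign-flip WHT rotation shared by quant and dequant, one FP16 radius per 128-element group, and a non-uniform Lloyd-Max codebook. A toy NumPy sketch of such a scheme follows; the sign array, the 8-level codebook values (classic Lloyd-Max centroids for a unit Gaussian), and the function names are illustrative stand-ins, not the fork's actual turbo4 code or centroids:

```python
import numpy as np

GROUP = 128  # elements per quantization group, as in the report

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse)."""
    y = x.astype(np.float64).copy()
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(len(y))

rng = np.random.default_rng(0)
SIGNS = np.where(rng.standard_normal(GROUP) >= 0, 1.0, -1.0)  # illustrative sign array

def rotate_forward(x):   # stand-in for turbo_rotate_forward()
    return fwht(SIGNS * x)

def rotate_inverse(y):   # stand-in for turbo_rotate_inverse(); same sign array
    return SIGNS * fwht(y)

# Toy non-uniform codebook: 8-level Lloyd-Max centroids for N(0, 1),
# rescaled to the ~1/sqrt(128) component scale of a normalized group.
CENTROIDS = np.array([-2.1520, -1.3439, -0.7560, -0.2451,
                       0.2451,  0.7560,  1.3439,  2.1520]) / np.sqrt(GROUP)

def quantize(group):
    radius = np.float16(np.linalg.norm(group))  # one norm per group, kept at FP16
    u = rotate_forward(group) / float(radius)   # the rotation preserves the norm
    codes = np.abs(u[:, None] - CENTROIDS[None, :]).argmin(axis=1)
    return radius, codes.astype(np.uint8)

def dequantize(radius, codes):
    return rotate_inverse(CENTROIDS[codes] * float(radius))
```

The properties the checklist verifies map directly onto this sketch: forward and inverse use the same sign array (so the rotation round-trips exactly), only one norm is stored per 128-element group, and dequant indexes the same centroid table quant used.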
----
-
-## Benchmark Results
-
-### Model Under Test
-
-- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params)
-- Machine: Apple M3 Max, 36GB unified memory, Metal GPU Family 9
-
-### Throughput (3-run averages)
-
-| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
-|:-------------|:---------------|:--|:-------------------|:--|
-| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
-| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
-| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
-| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
-
-### KV Cache Memory (turbo4 vs f16)
-
-| Context | f16 KV | turbo4 KV | Savings |
-|:--------|:-------|:----------|:--------|
-| 2K | 320 MiB | 85 MiB | 73.4% |
-| 8K | 1,280 MiB | 340 MiB | 73.4% |
-| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
-| 64K | 10,240 MiB | 2,720 MiB | 73.4% |
-
-Measured matches calculated exactly — zero fragmentation overhead.
-
-### Pass Criteria Assessment
-
-| Criteria | Threshold | Result | Verdict |
-|:---------|:----------|:-------|:--------|
-| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED |
-| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** |
-| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** |
-| No OOM at 32K | No crash | Runs clean | **PASS** |
-| Memory consistent with theory | ±15% | 0% delta | **PASS** |
-
----
-
-## What This Means for qwen3.5:27b (Spec Target)
-
-| Scenario | Total Memory | Fits in 31GB? |
-|:---------|:-------------|:--------------|
-| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight |
-| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No |
-| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable |
-| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) |
-
-**TurboQuant turns 128K context from impossible to comfortable.**
-
----
-
-## Open Items for Phase 2
-
-1. **Perplexity test** — Need the wikitext-2-raw corpus downloaded.
-   PPL is the most important quality metric and we don't have it yet.
-2. **Ollama integration** — CLI is a broken symlink. Need to fix the Ollama install, then build a custom Ollama with our fork as a submodule.
-3. **qwen3.5:27b model** — Need to download the actual target model (only Hermes-4-14B is on disk currently).
-4. **10 test prompts** — Need to be written before the Phase 2 quality comparison.
-5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing.
-
----
-
-## Recommendation
-
-**PROCEED TO PHASE 2.**
-
-turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, and acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk.
-
-The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment.
-
----
-
-## Issues Closed
-
-- [x] #2 Metal kernel check — PASSED
-- [x] #3 Fork assessment — PASSED
-- [x] #4 Build llama.cpp fork — COMPLETE
-- [x] #5 PolarQuant verification — 5/6 PASS
-- [x] #6 FP16 baseline benchmarks — RECORDED
-- [x] #7 TurboQuant benchmarks — RECORDED
-- [x] #8 Memory profiling — COMPLETE
-
----
-
-*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.*
-*Within the "typical case" estimate from the spec (1-2 hours).*
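The KV-memory figures in the report scale exactly linearly with context, so they can be sanity-checked from a single row. A quick sketch: the per-token KV footprint below is inferred from the 2K row of the Hermes-4-14B table (320 MiB at 2048 tokens), and the turbo4 bits-per-element value is an assumption chosen to reproduce the measured 73.4% savings, not a number taken from the fork's source:

```python
# Per-token KV elements inferred from the report's table:
# 320 MiB of f16 KV at 2048 tokens -> 160 KiB/token -> 81,920 fp16 elements.
KV_ELEMS_PER_TOKEN = 81_920

def kv_mib(ctx_tokens: int, bits_per_elem: float) -> float:
    """KV cache size in MiB for a given context length and storage width."""
    return ctx_tokens * KV_ELEMS_PER_TOKEN * bits_per_elem / 8 / 2**20

F16_BITS = 16.0
TURBO4_BITS = 4.25  # assumption: ~4-bit codes plus per-group radius overhead

for ctx in (2048, 8192, 32768, 65536, 131072):
    print(f"{ctx // 1024:>3}K: f16 {kv_mib(ctx, F16_BITS):8.0f} MiB | "
          f"turbo4 {kv_mib(ctx, TURBO4_BITS):7.0f} MiB")

savings = 1 - TURBO4_BITS / F16_BITS  # 0.734..., matching the table's 73.4%
```

Extrapolated to 128K, this gives roughly 20 GiB of f16 KV versus about 5.3 GiB under turbo4 for the 14B test model; the qwen3.5:27b scenarios in the report come out differently because that model's layer count and KV width differ.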