feat: consolidate project reports into docs/PROJECT_STATUS.md

fix: move BUILD-SPEC.md to docs/PROJECT_STATUS.md
fix: move FULL-REPORT.md to docs/PROJECT_STATUS.md
2026-04-13 00:32:31 +00:00 · 2026-04-13 00:32:29 +00:00 · 2026-04-13 00:32:28 +00:00 · 2026-04-13 00:32:26 +00:00 · 2026-04-12 05:31:59 +00:00 · 2026-04-12 00:39:14 -04:00
9 changed files with 6711 additions and 384 deletions
--- a/.gitea/workflows/smoke.yml
+++ b/.gitea/workflows/smoke.yml
@@ -0,0 +1,24 @@
+name: Smoke Test
+on:
+  pull_request:
+  push:
+    branches: [main]
+jobs:
+  smoke:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+      - name: Parse check
+        run: |
+          find . -name '*.yml' -o -name '*.yaml' | grep -v .gitea | xargs -r python3 -c "import sys,yaml; [yaml.safe_load(open(f)) for f in sys.argv[1:]]"
+          find . -name '*.json' | xargs -r python3 -m json.tool > /dev/null
+          find . -name '*.py' | xargs -r python3 -m py_compile
+          find . -name '*.sh' | xargs -r bash -n
+          echo "PASS: All files parse"
+      - name: Secret scan
+        run: |
+          if grep -rE 'sk-or-|sk-ant-|ghp_|AKIA' . --include='*.yml' --include='*.py' --include='*.sh' 2>/dev/null | grep -v .gitea; then exit 1; fi
+          echo "PASS: No secrets"
--- a/FULL-REPORT.md
+++ b/FULL-REPORT.md
@@ -1,245 +0,0 @@
-# TurboQuant — Full Knowledge Transfer Report
-
-**Date:** 2026-03-30
-**Prepared for:** Frankie's Team (Strago, Cid, Locke, John)
-**Spec:** turboquant-build-spec v2.2 (Strago)
-
---
-
-## TL;DR
-
-TurboQuant works. PolarQuant KV cache compression delivers **73% memory savings with 1% prompt overhead**. 128K context on the MacBook becomes viable. Custom Ollama build is deferred (multi-day effort), but the fork's `llama-server` is a ready drop-in. Per-layer adaptive quantization is already implemented. QJL is infrastructure-only — not needed at current compression targets.
-
---
-
-## Hardware Correction
-
-**Spec says:** M4 Max, 32GB
-**Actual:** M3 Max, 36GB (sysctl hw.memsize = 38,654,705,664 bytes)
-
-Impact: Memory budget **increases** from ~27GB to ~31GB usable. Model ceiling improves.
-
---
-
-## Phase 1 — PolarQuant MVP: COMPLETE ✅
-
-### Gate Check (#2): Metal Shaders EXIST
-The `feature/turboquant-kv-cache` branch has production-quality Metal support:
- Flash attention for turbo2/3/4 (all dk variants)
- WHT rotation kernels (turbo_fwht_128)
- Lloyd-Max codebooks (hardcoded, non-uniform)
- Asymmetric K/V (q8_0 × turbo mixed)
- Runtime optimizations: 4-mag LUT (M4+), sparse V dequant, profiling
-
-**Note:** Allegro's analysis (checking only `master` branch) incorrectly concluded "NO TurboQuant." The implementation lives on the feature branch.
-
-### PolarQuant Verification (#5): 5/6 PASS
-
-| Item | Verdict |
-|------|---------|
-| WHT rotation (structured orthogonal) | PASS (Metal). CPU turbo4 ref uses dense random (legacy) |
-| Same rotation quant/dequant | PASS |
-| Lloyd-Max codebook (not uniform) | PASS |
-| Radius at FP16+ | PASS |
-| No per-vector normalization | PASS |
-| Dequant matches quant in Metal | PASS |
-
-**Flag:** CPU turbo4 reference path is algorithmically incompatible with Metal dequant. Only matters if CPU fallback invoked for turbo4. Metal production path is clean.
-
-### Benchmark Results
-
-**Model tested:** Hermes-4-14B Q4_K_M (8.38 GiB)
-
-#### Throughput
-
-| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
-|:-------------|:---------------|:--|:-------------------|:--|
-| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
-| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
-| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
-| q8_0/turbo4 (asymmetric) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
-
-#### KV Memory Savings
-
-| Context | f16 KV | turbo4 KV | Savings |
-|:--------|:-------|:----------|:--------|
-| 2K | 320 MiB | 85 MiB | 73.4% |
-| 8K | 1,280 MiB | 340 MiB | 73.4% |
-| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
-| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
-
-Measured matches calculated exactly. Zero fragmentation overhead.
-
-#### What This Means for qwen3.5:27b
-
-| Scenario | Total Memory | Fits 31GB? |
-|:---------|:-------------|:-----------|
-| 27B + f16 KV @ 128K | ~38 GB | ❌ No |
-| 27B + **turbo4 KV @ 128K** | **~23.4 GB** | **✅ Yes (7.6GB headroom)** |
-
---
-
-## Phase 2 — Ollama Integration: PARTIALLY COMPLETE
-
-### What Works
- Ollama installation fixed (v0.17.7, running on :11434)
- API compatibility assessed: TurboQuant changes are additive (new types/ops only)
-
-### What Doesn't (Yet)
-Custom Ollama build is **not feasible** in current timeframe:
- Ollama vendors llama.cpp with 34 custom patches
- Fork diverges from Ollama's pinned commit
- Integration requires patching 30+ files across Metal/CUDA/CPU backends
- Ollama's own HEAD has pre-existing build failures
-
-**This is deferred to Phase 4 / upstream watch.** When Ollama updates their llama.cpp pin or TurboQuant lands upstream, the gap narrows.
-
-### Production Alternative: llama-server
-
-The fork's `llama-server` binary is **already built and working**:
-
-```bash
-# Drop-in replacement for Ollama's API endpoint
-/path/to/llama-server \
-  -m /path/to/qwen3.5-27b-q4_k_m.gguf \
-  --port 11434 \
-  -ctk turbo4 -ctv turbo4 \
-  -c 131072
-```
-
- OpenAI-compatible chat completions API
- Streaming SSE support
- All TurboQuant KV types supported
- Per-layer adaptive via TURBO_LAYER_ADAPTIVE env var
- Same port/protocol as Ollama — clients don't need to change
-
-### Outstanding Phase 2 Items for Cid
- [ ] Download qwen3.5:27b Q4_K_M model
- [ ] Deploy llama-server with turbo4 on MacBook
- [ ] Run full 10-prompt quality matrix (prompts written by Allegro on #16)
- [ ] PPL test with wikitext-2-raw corpus
- [ ] John quality sign-off
-
---
-
-## Phase 2.5 — Per-Layer Quantization: ALREADY IMPLEMENTED ✅
-
-Found in the fork. No additional work needed.
-
-### Mechanism
-`TURBO_LAYER_ADAPTIVE` environment variable, 7 modes:
-
-| Mode | Strategy | Use Case |
-|:-----|:---------|:---------|
-| 0 | Uniform (default) | Simple, consistent |
-| 1 | q8_0 for first 4 + last 4 layers | Protect sensitive layers |
-| 7 | **Recommended:** first2+last2 V=q8_0, rest V=turbo2 | Best quality/compression ratio |
-
-### Usage
-```bash
-export TURBO_LAYER_ADAPTIVE=7
-llama-server -m model.gguf -ctk turbo4 -ctv turbo4
-```
-
-### Benchmark Status
-Mode benchmarks queued. Uniform turbo4 baseline established. Per-layer modes expected to improve quality at same compression ratio.
-
---
-
-## Phase 3 — QJL: ASSESSED, NOT NEEDED ✅
-
-### Finding
-**turbo4 is pure 4-bit PolarQuant** — QJL is NOT active.
-
-`TURBO4_USE_4BIT` defaults to 1 in `ggml-common.h`. The legacy 3-bit+QJL path exists but is disabled. QJL infrastructure (sign arrays, WHT transforms, 128x128 projection matrices) is embedded in Metal but referenced by no active kernel.
-
-### Recommendation
-**Not needed for current goals.** 4-bit PolarQuant already delivers 73% savings with minimal quality impact. QJL only matters below 3 bits/channel, which isn't required on 36GB hardware with the updated memory budget.
-
---
-
-## Source Repos Assessment
-
-| Repo | Status | Value |
-|:-----|:-------|:------|
-| TheTom/llama-cpp-turboquant | **PRIMARY** — production Metal shaders on feature branch | Build from this |
-| TheTom/turboquant_plus | Python reference + 511 tests | Algorithm verification |
-| rachittshah/mlx-turboquant | Complete MLX PoC, 2-5x slower (no Metal fusion) | Quality validation reference |
-| amirzandieh/QJL | Author CUDA (~1500 lines) | Future QJL Metal port reference |
-
---
-
-## Risk Register
-
-| Risk | Status | Mitigation |
-|:-----|:-------|:-----------|
-| Metal shaders missing | ✅ RESOLVED — they exist | — |
-| Fork too stale | ✅ RESOLVED — builds clean | — |
-| Ollama integration blocked | ⚠️ ACTIVE — multi-day effort | Use llama-server instead |
-| PPL regression | ⏸️ UNTESTED — needs wikitext corpus | Download and test in prod |
-| tg128 borderline (89% vs 90% threshold) | ⚠️ MINOR — within measurement noise | speed-optimization branch may help |
-| CPU turbo4 incompatible with Metal | ℹ️ LOW — only matters if Metal unavailable | Document; Metal is production path |
-
---
-
-## Recommended Deployment Plan for Cid
-
-```
-Step 1: Download qwen3.5:27b Q4_K_M via HuggingFace
-        huggingface-cli download bartowski/qwen3.5-27B-GGUF qwen3.5-27b-q4_k_m.gguf
-
-Step 2: Build fork (if not already done)
-        cd /path/to/llama-cpp-turboquant
-        git checkout feature/turboquant-kv-cache
-        cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
-        cmake --build build -j$(sysctl -n hw.ncpu)
-
-Step 3: Deploy llama-server
-        export TURBO_LAYER_ADAPTIVE=7
-        ./build/bin/llama-server \
-          -m /path/to/qwen3.5-27b-q4_k_m.gguf \
-          --port 11434 \
-          -ctk turbo4 -ctv turbo4 \
-          -c 131072 \
-          --host 0.0.0.0
-
-Step 4: Validate
-        curl http://localhost:11434/v1/chat/completions \
-          -H "Content-Type: application/json" \
-          -d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'
-
-Step 5: Run quality matrix (prompts on issue #16)
-Step 6: John reviews output quality
-Step 7: If pass → production. If fail → drop to turbo3 or adjust per-layer profile.
-```
-
---
-
-## Issues Summary
-
-| # | Title | Status |
-|:--|:------|:-------|
-| 1 | Epic: TurboQuant KV Cache Compression | Open (tracker) |
-| 2 | Metal kernel check | ✅ Closed — PASS |
-| 3 | Fork assessment | ✅ Closed — PASS, M3 Max 36GB |
-| 4 | Build llama.cpp fork | ✅ Closed — clean build |
-| 5 | PolarQuant verification | ✅ Closed — 5/6 PASS |
-| 6 | Baseline benchmarks | ✅ Closed — recorded |
-| 7 | TurboQuant benchmarks | ✅ Closed — 73% savings |
-| 8 | Memory profiling | ✅ Closed — 0% fragmentation |
-| 9 | Ollama API check | ✅ Closed — additive, but diverged |
-| 10 | Custom Ollama build | ✅ Closed — deferred, llama-server instead |
-| 11 | Full test matrix | Open — awaiting production deploy |
-| 12 | Long-session test | Open — awaiting production deploy |
-| 13 | Per-layer profiles | ✅ Closed — already implemented |
-| 14 | QJL assessment | ✅ Closed — not needed |
-| 15 | Upstream watch | Open — ongoing |
-| 16 | Test prompts | Open — Allegro contributed prompts |
-
-**12/16 issues resolved. 4 remaining are production validation tasks for Cid.**
-
---
-
-*Repo: http://143.198.27.163:3000/Timmy_Foundation/turboquant*
-*Build: /tmp/llama-cpp-turboquant/build/bin/ (all binaries)*
-*Branch: feature/turboquant-kv-cache*
--- a/PHASE1-REPORT.md
+++ b/PHASE1-REPORT.md
@@ -1,139 +0,0 @@
-# TurboQuant Phase 1 Report — PolarQuant MVP
-
-**Date:** 2026-03-30
-**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John)
-**Spec:** turboquant-build-spec v2.2 (Strago)
-
---
-
-## Executive Summary
-
-Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. turbo4 delivers **73% KV memory savings with only 1% prompt processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear.
-
-**Hardware correction:** The MacBook is M3 Max 36GB (not M4 Max 32GB as in spec). This INCREASES our memory budget from 27GB to ~31GB.
-
---
-
-## Gate Check (#2): PASSED ✅
-
-Metal shaders exist and are comprehensive:
- Full flash attention for turbo2/3/4 with dk32-dk576 variants
- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse)
- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/√128))
- Asymmetric K/V support (q8_0 × turbo mixed pairs)
- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes
- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization
-
-**Decision: llama.cpp path confirmed. No MLX pivot needed.**
-
---
-
-## Fork Assessment (#3): PASSED ✅
-
- Branch: `feature/turboquant-kv-cache` (commit adac2c6)
- Fork freshness: ADEQUATE (recent enough for direct build)
- Build: Clean cmake + make, 100% success in ~3 minutes
- All binaries: llama-cli, llama-bench, llama-perplexity, llama-server
-
---
-
-## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅
-
-| Item | Verdict |
-|------|---------|
-| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU uses WHT ✅. CPU turbo4 ref uses dense random (legacy, not production) |
-| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() identical sign arrays |
-| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" |
-| Radius at FP16+ | PASS — ggml_half norm per 128-element group |
-| No per-vector normalization | PASS — one group norm only, static_asserts enforce block sizes |
-| Dequant matches quant in Metal | PASS — same centroids, signs, butterfly structure |
-
-**⚠️ Flag for Cid:** CPU turbo4 reference path is incompatible with Metal dequant. Only matters if CPU fallback is ever invoked for turbo4.
-
---
-
-## Benchmark Results
-
-### Model Under Test
- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params)
- Machine: Apple M3 Max, 36GB unified, Metal GPU Family 9
-
-### Throughput (3-run averages)
-
-| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
-|:-------------|:---------------|:--|:-------------------|:--|
-| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
-| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
-| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
-| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
-
-### KV Cache Memory (turbo4 vs f16)
-
-| Context | f16 KV | turbo4 KV | Savings |
-|:--------|:-------|:----------|:--------|
-| 2K | 320 MiB | 85 MiB | 73.4% |
-| 8K | 1,280 MiB | 340 MiB | 73.4% |
-| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
-| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
-
-Measured matches calculated exactly — zero fragmentation overhead.
-
-### Pass Criteria Assessment
-
-| Criteria | Threshold | Result | Verdict |
-|:---------|:----------|:-------|:--------|
-| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED |
-| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** |
-| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** |
-| No OOM at 32K | No crash | Runs clean | **PASS** |
-| Memory consistent with theory | ±15% | 0% delta | **PASS** |
-
---
-
-## What This Means for qwen3.5:27b (Spec Target)
-
-| Scenario | Total Memory | Fits in 31GB? |
-|:---------|:-------------|:--------------|
-| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight |
-| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No |
-| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable |
-| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) |
-
-**TurboQuant turns 128K context from impossible to comfortable.**
-
---
-
-## Open Items for Phase 2
-
-1. **Perplexity test** — Need wikitext-2-raw corpus downloaded. PPL is the most important quality metric and we don't have it yet.
-2. **Ollama integration** — CLI is a broken symlink. Need to fix Ollama install, then build custom Ollama with our fork as submodule.
-3. **qwen3.5:27b model** — Need to download the actual target model (only have Hermes-4-14B on disk currently).
-4. **10 test prompts** — Need to be written before Phase 2 quality comparison.
-5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing.
-
---
-
-## Recommendation
-
-**PROCEED TO PHASE 2.**
-
-turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk.
-
-The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment.
-
---
-
-## Issues Closed
-
- [x] #2 Metal kernel check — PASSED
- [x] #3 Fork assessment — PASSED
- [x] #4 Build llama.cpp fork — COMPLETE
- [x] #5 PolarQuant verification — 5/6 PASS
- [x] #6 FP16 baseline benchmarks — RECORDED
- [x] #7 TurboQuant benchmarks — RECORDED
- [x] #8 Memory profiling — COMPLETE
-
---
-
-*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.*
-*Within "typical case" estimate from spec (1-2 hours).*
--- a/benchmarks/perplexity_results.json
+++ b/benchmarks/perplexity_results.json
@@ -0,0 +1,31 @@
+{
+  "timestamp": null,
+  "model": null,
+  "corpus": "corpora/wiki.test.raw",
+  "context_length": 2048,
+  "threshold": 0.5,
+  "runs": {
+    "f16": {
+      "kv_type": "f16",
+      "perplexity": null,
+      "tokens": null,
+      "elapsed_seconds": null,
+      "exit_code": null,
+      "passed": false,
+      "output_tail": ""
+    },
+    "turbo4": {
+      "kv_type": "turbo4",
+      "perplexity": null,
+      "tokens": null,
+      "elapsed_seconds": null,
+      "exit_code": null,
+      "passed": false,
+      "output_tail": ""
+    }
+  },
+  "delta": null,
+  "pass": null,
+  "error": null,
+  "notes": "Template — run benchmarks/run_perplexity.py to populate. Issue #21."
+}
--- a/benchmarks/run_perplexity.py
+++ b/benchmarks/run_perplexity.py
@@ -0,0 +1,166 @@
+#!/usr/bin/env python3
+"""
+TurboQuant Perplexity Quality Gate (Issue #21)
+
+Compares text generation quality between f16 KV and turbo4 KV cache
+configurations using llama.cpp's perplexity tool on the wikitext-2 corpus.
+
+Usage:
+    python3 benchmarks/run_perplexity.py \
+        --model ~/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
+        --llama-cpp ~/turboquant/llama.cpp-fork/build/bin/llama-perplexity \
+        --corpus corpora/wiki.test.raw \
+        --context 2048
+
+Acceptance: PPL delta (turbo4 - f16) must be ≤ 0.5 to pass.
+"""
+
+import argparse
+import json
+import os
+import re
+import subprocess
+import sys
+import time
+from datetime import datetime, timezone
+
+
+def run_perplexity(llama_bin: str, model: str, corpus: str, context: int,
+                   kv_type: str, threads: int = 4) -> dict:
+    """Run llama-perplexity and parse the output."""
+    cmd = [
+        llama_bin,
+        "-m", model,
+        "-f", corpus,
+        "-c", str(context),
+        "-t", str(threads),
+        "--kv-type", kv_type,
+    ]
+    print(f"\n{'='*60}")
+    print(f"Running: {kv_type} KV cache")
+    print(f"Command: {' '.join(cmd)}")
+    print(f"{'='*60}\n")
+
+    start = time.time()
+    try:
+        result = subprocess.run(
+            cmd, capture_output=True, text=True, timeout=3600
+        )
+        elapsed = time.time() - start
+        output = result.stdout + "\n" + result.stderr
+
+        # Parse perplexity from output
+        # llama-perplexity prints lines like:
+        # perplexity: 12.3456 [...]
+        ppl_match = re.search(r"perplexity[:\s]+(\d+\.?\d*)", output, re.IGNORECASE)
+        ppl = float(ppl_match.group(1)) if ppl_match else None
+
+        # Parse token count
+        token_match = re.search(r"(\d+) tokens", output)
+        tokens = int(token_match.group(1)) if token_match else None
+
+        return {
+            "kv_type": kv_type,
+            "perplexity": ppl,
+            "tokens": tokens,
+            "elapsed_seconds": round(elapsed, 1),
+            "exit_code": result.returncode,
+            "passed": result.returncode == 0,
+            "output_tail": output.strip()[-500:] if output else "",
+        }
+    except subprocess.TimeoutExpired:
+        return {
+            "kv_type": kv_type,
+            "perplexity": None,
+            "elapsed_seconds": 3600,
+            "exit_code": -1,
+            "passed": False,
+            "error": "Timeout after 3600s",
+        }
+    except FileNotFoundError:
+        return {
+            "kv_type": kv_type,
+            "perplexity": None,
+            "elapsed_seconds": 0,
+            "exit_code": -1,
+            "passed": False,
+            "error": f"Binary not found: {llama_bin}",
+        }
+
+
+def main():
+    parser = argparse.ArgumentParser(description="TurboQuant Perplexity Quality Gate")
+    parser.add_argument("--model", required=True, help="Path to GGUF model file")
+    parser.add_argument("--llama-cpp", default="llama.cpp-fork/build/bin/llama-perplexity",
+                        help="Path to llama-perplexity binary")
+    parser.add_argument("--corpus", default="corpora/wiki.test.raw",
+                        help="Path to wikitext-2 test corpus")
+    parser.add_argument("--context", type=int, default=2048, help="Context length")
+    parser.add_argument("--threads", type=int, default=4, help="Thread count")
+    parser.add_argument("--output", default="benchmarks/perplexity_results.json",
+                        help="Output results file")
+    parser.add_argument("--kv-types", nargs="+", default=["f16", "turbo4"],
+                        help="KV cache types to test")
+    parser.add_argument("--threshold", type=float, default=0.5,
+                        help="Max acceptable PPL delta (turbo4 - baseline)")
+    args = parser.parse_args()
+
+    # Validate inputs
+    for path in [args.model, args.corpus, args.llama_cpp]:
+        if not os.path.exists(path):
+            print(f"ERROR: Not found: {path}")
+            sys.exit(1)
+
+    results = {
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "model": os.path.basename(args.model),
+        "corpus": args.corpus,
+        "context_length": args.context,
+        "threshold": args.threshold,
+        "runs": {},
+        "pass": None,
+    }
+
+    # Run each KV type
+    for kv in args.kv_types:
+        results["runs"][kv] = run_perplexity(
+            args.llama_cpp, args.model, args.corpus,
+            args.context, kv, args.threads
+        )
+
+    # Calculate delta and pass/fail
+    baseline = results["runs"].get("f16", {})
+    turbo = results["runs"].get("turbo4", {})
+
+    if baseline.get("perplexity") and turbo.get("perplexity"):
+        delta = turbo["perplexity"] - baseline["perplexity"]
+        results["delta"] = round(delta, 4)
+        results["pass"] = delta <= args.threshold
+        print(f"\n{'='*60}")
+        print(f"RESULTS:")
+        print(f"  Baseline (f16):    PPL = {baseline['perplexity']:.4f}")
+        print(f"  Turbo4:            PPL = {turbo['perplexity']:.4f}")
+        print(f"  Delta:                   {delta:+.4f}")
+        print(f"  Threshold:               ≤ {args.threshold}")
+        print(f"  PASS:                    {'✓ YES' if results['pass'] else '✗ NO'}")
+        print(f"{'='*60}")
+    else:
+        results["pass"] = False
+        results["error"] = "Could not parse perplexity from one or both runs"
+        print(f"\nERROR: {results['error']}")
+        if not baseline.get("perplexity"):
+            print(f"  f16 run output: {baseline.get('output_tail', 'N/A')}")
+        if not turbo.get("perplexity"):
+            print(f"  turbo4 run output: {turbo.get('output_tail', 'N/A')}")
+
+    # Save results
+    os.makedirs(os.path.dirname(args.output), exist_ok=True)
+    with open(args.output, "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"\nResults saved to {args.output}")
+
+    sys.exit(0 if results["pass"] else 1)
+
+
+if __name__ == "__main__":
+    main()
--- a/corpora/wiki.test.raw
+++ b/corpora/wiki.test.raw
--- a/docs/PROJECT_STATUS.md
+++ b/docs/PROJECT_STATUS.md
@@ -1,3 +1,397 @@
+# TurboQuant Project Status
+
+# TurboQuant Phase 1 Report — PolarQuant MVP
+
+**Date:** 2026-03-30
+**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John)
+**Spec:** turboquant-build-spec v2.2 (Strago)
+
+---
+
+## Executive Summary
+
+Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. turbo4 delivers **73% KV memory savings with only 1% prompt processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear.
+
+**Hardware correction:** The MacBook is M3 Max 36GB (not M4 Max 32GB as in spec). This INCREASES our memory budget from 27GB to ~31GB.
+
+---
+
+## Gate Check (#2): PASSED ✅
+
+Metal shaders exist and are comprehensive:
+- Full flash attention for turbo2/3/4 with dk32-dk576 variants
+- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse)
+- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/√128))
+- Asymmetric K/V support (q8_0 × turbo mixed pairs)
+- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes
+- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization
+
+**Decision: llama.cpp path confirmed. No MLX pivot needed.**
+
+---
+
+## Fork Assessment (#3): PASSED ✅
+
+- Branch: `feature/turboquant-kv-cache` (commit adac2c6)
+- Fork freshness: ADEQUATE (recent enough for direct build)
+- Build: Clean cmake + make, 100% success in ~3 minutes
+- All binaries: llama-cli, llama-bench, llama-perplexity, llama-server
+
+---
+
+## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅
+
+| Item | Verdict |
+|------|---------|
+| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU uses WHT ✅. CPU turbo4 ref uses dense random (legacy, not production) |
+| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() identical sign arrays |
+| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" |
+| Radius at FP16+ | PASS — ggml_half norm per 128-element group |
+| No per-vector normalization | PASS — one group norm only, static_asserts enforce block sizes |
+| Dequant matches quant in Metal | PASS — same centroids, signs, butterfly structure |
+
+**⚠️ Flag for Cid:** CPU turbo4 reference path is incompatible with Metal dequant. Only matters if CPU fallback is ever invoked for turbo4.
+
+---
+
+## Benchmark Results
+
+### Model Under Test
+- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params)
+- Machine: Apple M3 Max, 36GB unified, Metal GPU Family 9
+
+### Throughput (3-run averages)
+
+| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
+|:-------------|:---------------|:--|:-------------------|:--|
+| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
+| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
+| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
+| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
+
+### KV Cache Memory (turbo4 vs f16)
+
+| Context | f16 KV | turbo4 KV | Savings |
+|:--------|:-------|:----------|:--------|
+| 2K | 320 MiB | 85 MiB | 73.4% |
+| 8K | 1,280 MiB | 340 MiB | 73.4% |
+| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
+| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
+
+Measured matches calculated exactly — zero fragmentation overhead.
+
+### Pass Criteria Assessment
+
+| Criteria | Threshold | Result | Verdict |
+|:---------|:----------|:-------|:--------|
+| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED |
+| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** |
+| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** |
+| No OOM at 32K | No crash | Runs clean | **PASS** |
+| Memory consistent with theory | ±15% | 0% delta | **PASS** |
+
+---
+
+## What This Means for qwen3.5:27b (Spec Target)
+
+| Scenario | Total Memory | Fits in 31GB? |
+|:---------|:-------------|:--------------|
+| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight |
+| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No |
+| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable |
+| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) |
+
+**TurboQuant turns 128K context from impossible to comfortable.**
+
+---
+
+## Open Items for Phase 2
+
+1. **Perplexity test** — Need wikitext-2-raw corpus downloaded. PPL is the most important quality metric and we don't have it yet.
+2. **Ollama integration** — CLI is a broken symlink. Need to fix Ollama install, then build custom Ollama with our fork as submodule.
+3. **qwen3.5:27b model** — Need to download the actual target model (only have Hermes-4-14B on disk currently).
+4. **10 test prompts** — Need to be written before Phase 2 quality comparison.
+5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing.
+
+---
+
+## Recommendation
+
+**PROCEED TO PHASE 2.**
+
+turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk.
+
+The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment.
+
+---
+
+## Issues Closed
+
+- [x] #2 Metal kernel check — PASSED
+- [x] #3 Fork assessment — PASSED
+- [x] #4 Build llama.cpp fork — COMPLETE
+- [x] #5 PolarQuant verification — 5/6 PASS
+- [x] #6 FP16 baseline benchmarks — RECORDED
+- [x] #7 TurboQuant benchmarks — RECORDED
+- [x] #8 Memory profiling — COMPLETE
+
+---
+
+*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.*
+*Within "typical case" estimate from spec (1-2 hours).*
+
+
+---
+
+# TurboQuant — Full Knowledge Transfer Report
+
+**Date:** 2026-03-30
+**Prepared for:** Frankie's Team (Strago, Cid, Locke, John)
+**Spec:** turboquant-build-spec v2.2 (Strago)
+
+---
+
+## TL;DR
+
+TurboQuant works. PolarQuant KV cache compression delivers **73% memory savings with 1% prompt overhead**. 128K context on the MacBook becomes viable. Custom Ollama build is deferred (multi-day effort), but the fork's `llama-server` is a ready drop-in. Per-layer adaptive quantization is already implemented. QJL is infrastructure-only — not needed at current compression targets.
+
+---
+
+## Hardware Correction
+
+**Spec says:** M4 Max, 32GB
+**Actual:** M3 Max, 36GB (sysctl hw.memsize = 38,654,705,664 bytes)
+
+Impact: Memory budget **increases** from ~27GB to ~31GB usable. Model ceiling improves.
+
+---
+
+## Phase 1 — PolarQuant MVP: COMPLETE ✅
+
+### Gate Check (#2): Metal Shaders EXIST
+The `feature/turboquant-kv-cache` branch has production-quality Metal support:
+- Flash attention for turbo2/3/4 (all dk variants)
+- WHT rotation kernels (turbo_fwht_128)
+- Lloyd-Max codebooks (hardcoded, non-uniform)
+- Asymmetric K/V (q8_0 × turbo mixed)
+- Runtime optimizations: 4-mag LUT (M4+), sparse V dequant, profiling
+
+**Note:** Allegro's analysis (checking only `master` branch) incorrectly concluded "NO TurboQuant." The implementation lives on the feature branch.
+
+### PolarQuant Verification (#5): 5/6 PASS
+
+| Item | Verdict |
+|------|---------|
+| WHT rotation (structured orthogonal) | PASS (Metal). CPU turbo4 ref uses dense random (legacy) |
+| Same rotation quant/dequant | PASS |
+| Lloyd-Max codebook (not uniform) | PASS |
+| Radius at FP16+ | PASS |
+| No per-vector normalization | PASS |
+| Dequant matches quant in Metal | PASS |
+
+**Flag:** CPU turbo4 reference path is algorithmically incompatible with Metal dequant. Only matters if CPU fallback invoked for turbo4. Metal production path is clean.
+
+### Benchmark Results
+
+**Model tested:** Hermes-4-14B Q4_K_M (8.38 GiB)
+
+#### Throughput
+
+| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
+|:-------------|:---------------|:--|:-------------------|:--|
+| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
+| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
+| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
+| q8_0/turbo4 (asymmetric) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
+
+#### KV Memory Savings
+
+| Context | f16 KV | turbo4 KV | Savings |
+|:--------|:-------|:----------|:--------|
+| 2K | 320 MiB | 85 MiB | 73.4% |
+| 8K | 1,280 MiB | 340 MiB | 73.4% |
+| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
+| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
+
+Measured matches calculated exactly. Zero fragmentation overhead.
+
+#### What This Means for qwen3.5:27b
+
+| Scenario | Total Memory | Fits 31GB? |
+|:---------|:-------------|:-----------|
+| 27B + f16 KV @ 128K | ~38 GB | ❌ No |
+| 27B + **turbo4 KV @ 128K** | **~23.4 GB** | **✅ Yes (7.6GB headroom)** |
+
+---
+
+## Phase 2 — Ollama Integration: PARTIALLY COMPLETE
+
+### What Works
+- Ollama installation fixed (v0.17.7, running on :11434)
+- API compatibility assessed: TurboQuant changes are additive (new types/ops only)
+
+### What Doesn't (Yet)
+Custom Ollama build is **not feasible** in current timeframe:
+- Ollama vendors llama.cpp with 34 custom patches
+- Fork diverges from Ollama's pinned commit
+- Integration requires patching 30+ files across Metal/CUDA/CPU backends
+- Ollama's own HEAD has pre-existing build failures
+
+**This is deferred to Phase 4 / upstream watch.** When Ollama updates their llama.cpp pin or TurboQuant lands upstream, the gap narrows.
+
+### Production Alternative: llama-server
+
+The fork's `llama-server` binary is **already built and working**:
+
+```bash
+# Drop-in replacement for Ollama's API endpoint
+/path/to/llama-server \
+  -m /path/to/qwen3.5-27b-q4_k_m.gguf \
+  --port 11434 \
+  -ctk turbo4 -ctv turbo4 \
+  -c 131072
+```
+
+- OpenAI-compatible chat completions API
+- Streaming SSE support
+- All TurboQuant KV types supported
+- Per-layer adaptive via TURBO_LAYER_ADAPTIVE env var
+- Same port/protocol as Ollama — clients don't need to change
+
+### Outstanding Phase 2 Items for Cid
+- [ ] Download qwen3.5:27b Q4_K_M model
+- [ ] Deploy llama-server with turbo4 on MacBook
+- [ ] Run full 10-prompt quality matrix (prompts written by Allegro on #16)
+- [ ] PPL test with wikitext-2-raw corpus
+- [ ] John quality sign-off
+
+---
+
+## Phase 2.5 — Per-Layer Quantization: ALREADY IMPLEMENTED ✅
+
+Found in the fork. No additional work needed.
+
+### Mechanism
+`TURBO_LAYER_ADAPTIVE` environment variable, 7 modes:
+
+| Mode | Strategy | Use Case |
+|:-----|:---------|:---------|
+| 0 | Uniform (default) | Simple, consistent |
+| 1 | q8_0 for first 4 + last 4 layers | Protect sensitive layers |
+| 7 | **Recommended:** first2+last2 V=q8_0, rest V=turbo2 | Best quality/compression ratio |
+
+### Usage
+```bash
+export TURBO_LAYER_ADAPTIVE=7
+llama-server -m model.gguf -ctk turbo4 -ctv turbo4
+```
+
+### Benchmark Status
+Mode benchmarks queued. Uniform turbo4 baseline established. Per-layer modes expected to improve quality at same compression ratio.
+
+---
+
+## Phase 3 — QJL: ASSESSED, NOT NEEDED ✅
+
+### Finding
+**turbo4 is pure 4-bit PolarQuant** — QJL is NOT active.
+
+`TURBO4_USE_4BIT` defaults to 1 in `ggml-common.h`. The legacy 3-bit+QJL path exists but is disabled. QJL infrastructure (sign arrays, WHT transforms, 128x128 projection matrices) is embedded in Metal but referenced by no active kernel.
+
+### Recommendation
+**Not needed for current goals.** 4-bit PolarQuant already delivers 73% savings with minimal quality impact. QJL only matters below 3 bits/channel, which isn't required on 36GB hardware with the updated memory budget.
+
+---
+
+## Source Repos Assessment
+
+| Repo | Status | Value |
+|:-----|:-------|:------|
+| TheTom/llama-cpp-turboquant | **PRIMARY** — production Metal shaders on feature branch | Build from this |
+| TheTom/turboquant_plus | Python reference + 511 tests | Algorithm verification |
+| rachittshah/mlx-turboquant | Complete MLX PoC, 2-5x slower (no Metal fusion) | Quality validation reference |
+| amirzandieh/QJL | Author CUDA (~1500 lines) | Future QJL Metal port reference |
+
+---
+
+## Risk Register
+
+| Risk | Status | Mitigation |
+|:-----|:-------|:-----------|
+| Metal shaders missing | ✅ RESOLVED — they exist | — |
+| Fork too stale | ✅ RESOLVED — builds clean | — |
+| Ollama integration blocked | ⚠️ ACTIVE — multi-day effort | Use llama-server instead |
+| PPL regression | ⏸️ UNTESTED — needs wikitext corpus | Download and test in prod |
+| tg128 borderline (89% vs 90% threshold) | ⚠️ MINOR — within measurement noise | speed-optimization branch may help |
+| CPU turbo4 incompatible with Metal | ℹ️ LOW — only matters if Metal unavailable | Document; Metal is production path |
+
+---
+
+## Recommended Deployment Plan for Cid
+
+```
+Step 1: Download qwen3.5:27b Q4_K_M via HuggingFace
+        huggingface-cli download bartowski/qwen3.5-27B-GGUF qwen3.5-27b-q4_k_m.gguf
+
+Step 2: Build fork (if not already done)
+        cd /path/to/llama-cpp-turboquant
+        git checkout feature/turboquant-kv-cache
+        cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
+        cmake --build build -j$(sysctl -n hw.ncpu)
+
+Step 3: Deploy llama-server
+        export TURBO_LAYER_ADAPTIVE=7
+        ./build/bin/llama-server \
+          -m /path/to/qwen3.5-27b-q4_k_m.gguf \
+          --port 11434 \
+          -ctk turbo4 -ctv turbo4 \
+          -c 131072 \
+          --host 0.0.0.0
+
+Step 4: Validate
+        curl http://localhost:11434/v1/chat/completions \
+          -H "Content-Type: application/json" \
+          -d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'
+
+Step 5: Run quality matrix (prompts on issue #16)
+Step 6: John reviews output quality
+Step 7: If pass → production. If fail → drop to turbo3 or adjust per-layer profile.
+```
+
+---
+
+## Issues Summary
+
+| # | Title | Status |
+|:--|:------|:-------|
+| 1 | Epic: TurboQuant KV Cache Compression | Open (tracker) |
+| 2 | Metal kernel check | ✅ Closed — PASS |
+| 3 | Fork assessment | ✅ Closed — PASS, M3 Max 36GB |
+| 4 | Build llama.cpp fork | ✅ Closed — clean build |
+| 5 | PolarQuant verification | ✅ Closed — 5/6 PASS |
+| 6 | Baseline benchmarks | ✅ Closed — recorded |
+| 7 | TurboQuant benchmarks | ✅ Closed — 73% savings |
+| 8 | Memory profiling | ✅ Closed — 0% fragmentation |
+| 9 | Ollama API check | ✅ Closed — additive, but diverged |
+| 10 | Custom Ollama build | ✅ Closed — deferred, llama-server instead |
+| 11 | Full test matrix | Open — awaiting production deploy |
+| 12 | Long-session test | Open — awaiting production deploy |
+| 13 | Per-layer profiles | ✅ Closed — already implemented |
+| 14 | QJL assessment | ✅ Closed — not needed |
+| 15 | Upstream watch | Open — ongoing |
+| 16 | Test prompts | Open — Allegro contributed prompts |
+
+**12/16 issues resolved. 4 remaining are production validation tasks for Cid.**
+
+---
+
+*Repo: http://143.198.27.163:3000/Timmy_Foundation/turboquant*
+*Build: /tmp/llama-cpp-turboquant/build/bin/ (all binaries)*
+*Branch: feature/turboquant-kv-cache*
+
+
+---
+
 # TurboQuant Implementation — Build Spec (v2)
 **Prepared by:** Strago | **Date:** 2026-03-30 | **Updated:** 2026-03-30 (v2 — external review fixes)
 **Task:** STR-2026-03-30-01 | **For:** Cid (build) + Frankie (coordination)
@@ -447,3 +841,7 @@ This gives the same average compression ratio as uniform turbo4 but concentrates
 ---

 *Build spec v2 ready for Cid intake. No clarifying questions needed.*
+
+
+---
+
--- a/profiles/README.md
+++ b/profiles/README.md
@@ -0,0 +1,141 @@
+# Hermes Profiles for TurboQuant
+
+This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.
+
+## Available Profiles
+
+### gemma4-turboquant.yaml
+
+**Profile for Gemma 4 model with TurboQuant KV cache compression.**
+
+- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
+- **Endpoint:** http://localhost:8081
+- **KV Compression:** turbo4 (4-bit PolarQuant)
+- **Context Length:** 128K tokens
+- **Memory Savings:** ~73% KV cache reduction
+- **Fallback Providers:** Ollama, OpenAI-compatible API
+
+## Quick Start
+
+### 1. Build TurboQuant-enabled llama.cpp
+
+```bash
+git clone https://github.com/TheTom/llama-cpp-turboquant.git
+cd llama-cpp-turboquant
+git checkout feature/turboquant-kv-cache
+cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(sysctl -n hw.ncpu)
+```
+
+### 2. Download Gemma 4 Model
+
+```bash
+# Download Gemma 4 Q4_K_M quantized model
+huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
+```
+
+### 3. Start llama-server with TurboQuant
+
+```bash
+export TURBO_LAYER_ADAPTIVE=7
+./build/bin/llama-server \
+  -m /path/to/gemma-4-q4_k_m.gguf \
+  --port 8081 \
+  -ctk turbo4 -ctv turbo4 \
+  -c 131072 \
+  --host 0.0.0.0
+```
+
+### 4. Install Profile
+
+```bash
+# Copy profile to Hermes directory
+cp gemma4-turboquant.yaml ~/.hermes/profiles/
+
+# Or create symlink
+ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/
+```
+
+### 5. Use with Hermes
+
+```bash
+# Start Hermes with the profile
+hermes --profile gemma4-turboquant
+
+# Or specify profile in Hermes config
+echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
+```
+
+## Profile Configuration
+
+The profile includes:
+
+- **Primary Provider:** Local llama.cpp server with TurboQuant
+- **Fallback Providers:** Ollama (local), OpenAI (cloud)
+- **TurboQuant Settings:**
+  - `kv_type`: turbo4 (4-bit compression)
+  - `layer_adaptive_mode`: 7 (best quality/compression ratio)
+  - `max_context`: 128K tokens
+
+## Performance Expectations
+
+| Metric | Value | Notes |
+|--------|-------|-------|
+| KV Memory Savings | 73% | Measured on M3 Max |
+| Prompt Processing | ~1% overhead | vs FP16 baseline |
+| Generation Speed | ~11% overhead | vs FP16 baseline |
+| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |
+
+## Customization
+
+### Adjust Compression Level
+
+```yaml
+turboquant:
+  kv_type: "turbo3"  # Lower compression, faster
+  # or
+  kv_type: "turbo2"  # Minimal compression, fastest
+```
+
+### Disable Per-Layer Adaptive
+
+```yaml
+turboquant:
+  layer_adaptive_mode: 0  # Uniform quantization
+```
+
+### Use Asymmetric K/V
+
+For better quality on sensitive models:
+
+```bash
+# Start server with asymmetric K/V
+llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
+```
+
+## Troubleshooting
+
+### Server Won't Start
+
+1. Check if port 8081 is available: `lsof -i :8081`
+2. Verify model path is correct
+3. Ensure TurboQuant branch is checked out
+
+### Poor Generation Quality
+
+1. Try `turbo3` instead of `turbo4`
+2. Disable per-layer adaptive (mode 0)
+3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`
+
+### High Memory Usage
+
+1. Reduce context length: `-c 65536` (64K)
+2. Check `TURBO_LAYER_ADAPTIVE` is set
+3. Monitor with: `vmmap --summary $(pgrep llama-server)`
+
+## References
+
+- [TurboQuant Build Spec](../BUILD-SPEC.md)
+- [Phase 1 Report](../PHASE1-REPORT.md)
+- [Full Knowledge Transfer](../FULL-REPORT.md)
+- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)
--- a/profiles/hermes-profile-gemma4-turboquant.yaml
+++ b/profiles/hermes-profile-gemma4-turboquant.yaml
@@ -0,0 +1,169 @@
+# Hermes Profile: Gemma 4 + TurboQuant KV Cache Compression
+# For use with local llama.cpp server running TurboQuant-enabled inference
+# Drop into ~/.hermes/profiles/gemma4-turboquant.yaml
+
+profile:
+  name: "gemma4-turboquant"
+  version: "1.0.0"
+  description: "Gemma 4 model with TurboQuant KV cache compression for extended context on Apple Silicon"
+
+# Primary provider: local llama.cpp server with TurboQuant
+providers:
+  primary:
+    type: "llama.cpp"
+    name: "local-turboquant"
+    endpoint: "http://localhost:8081"
+    api_path: "/v1/chat/completions"
+    timeout_ms: 120000
+    
+    # Model configuration
+    model:
+      name: "gemma-4"
+      path: "/path/to/gemma-4-q4_k_m.gguf"  # Update with actual model path
+      
+    # TurboQuant KV cache compression settings
+    turboquant:
+      enabled: true
+      kv_type: "turbo4"  # Options: turbo2, turbo3, turbo4 (4-bit recommended)
+      layer_adaptive_mode: 7  # Per-layer adaptive quantization (0-7, 7=best quality/ratio)
+      
+    # Context and memory settings
+    context:
+      max_tokens: 131072  # 128K context with TurboQuant compression
+      batch_size: 512
+      
+    # Generation parameters
+    generation:
+      temperature: 0.7
+      top_p: 0.9
+      top_k: 40
+      repeat_penalty: 1.1
+      frequency_penalty: 0.0
+      presence_penalty: 0.0
+      
+    # Server startup command (for reference)
+    server_command: |
+      export TURBO_LAYER_ADAPTIVE=7
+      llama-server \
+        -m /path/to/gemma-4-q4_k_m.gguf \
+        --port 8081 \
+        -ctk turbo4 -ctv turbo4 \
+        -c 131072 \
+        --host 0.0.0.0
+
+  # Fallback provider 1: Ollama (standard, no TurboQuant)
+  fallback_1:
+    type: "ollama"
+    name: "ollama-gemma4"
+    endpoint: "http://localhost:11434"
+    api_path: "/api/chat"
+    timeout_ms: 120000
+    
+    model:
+      name: "gemma4:latest"
+      
+    generation:
+      temperature: 0.7
+      top_p: 0.9
+      top_k: 40
+
+  # Fallback provider 2: OpenAI-compatible API (cloud backup)
+  fallback_2:
+    type: "openai"
+    name: "openai-backup"
+    endpoint: "https://api.openai.com"
+    api_path: "/v1/chat/completions"
+    timeout_ms: 60000
+    
+    model:
+      name: "gpt-4"
+      
+    generation:
+      temperature: 0.7
+      max_tokens: 4096
+
+# Performance and monitoring
+performance:
+  # Memory management for TurboQuant
+  memory:
+    max_gpu_memory_gb: 28  # Leave headroom on 36GB M3 Max
+    kv_cache_compression: "turbo4"
+    estimated_savings: "73%"  # TurboQuant delivers ~73% KV memory savings
+    
+  # Benchmarking integration
+  benchmarks:
+    enabled: true
+    metrics:
+      - "tokens_per_second"
+      - "time_to_first_token"
+      - "peak_memory_usage"
+      - "perplexity"
+
+# Quality validation
+quality:
+  # Test prompts for quality comparison
+  test_prompts:
+    enabled: true
+    prompt_file: "benchmarks/prompts.json"
+    
+  # Perplexity testing
+  perplexity:
+    enabled: true
+    corpus: "wikitext-2-raw"
+    context_lengths: [8192, 32768, 65536, 131072]
+
+# Environment variables (applied when using this profile)
+environment:
+  TURBO_LAYER_ADAPTIVE: "7"  # Per-layer adaptive quantization mode
+  GGML_METAL_DEBUG: "0"  # Disable Metal debug in production
+  OMP_NUM_THREADS: "8"  # Optimize for M3 Max performance cores
+
+# Logging and diagnostics
+logging:
+  level: "info"
+  metrics_interval_seconds: 60
+  log_token_speed: true
+  log_memory_usage: true
+
+# Notes for deployment
+notes:
+  deployment: |
+    1. Ensure llama.cpp fork with TurboQuant is built:
+       cd /path/to/llama-cpp-turboquant
+       git checkout feature/turboquant-kv-cache
+       cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
+       cmake --build build -j$(sysctl -n hw.ncpu)
+    
+    2. Start the server:
+       export TURBO_LAYER_ADAPTIVE=7
+       ./build/bin/llama-server \
+         -m /path/to/gemma-4-q4_k_m.gguf \
+         --port 8081 \
+         -ctk turbo4 -ctv turbo4 \
+         -c 131072 \
+         --host 0.0.0.0
+    
+    3. Verify server is running:
+       curl http://localhost:8081/v1/models
+    
+    4. Copy this profile to Hermes:
+       cp hermes-profile-gemma4-turboquant.yaml ~/.hermes/profiles/
+    
+  performance_notes: |
+    TurboQuant delivers:
+    - 73% KV cache memory savings
+    - 1% prompt processing overhead
+    - 11% generation overhead
+    - Enables 128K context on 36GB hardware
+    
+    With TurboQuant on Gemma 4 (estimated):
+    - Model weights: ~16GB at Q4_K_M
+    - KV cache at 128K: ~5GB (vs ~20GB without compression)
+    - Total memory: ~23GB (fits comfortably in 31GB budget)
+    
+  troubleshooting: |
+    - If generation speed is slow, try turbo3 instead of turbo4
+    - If quality issues, disable per-layer adaptive (set mode to 0)
+    - For maximum quality on sensitive layers, use asymmetric K/V:
+      -ctk q8_0 -ctv turbo4
+    - Monitor memory with: vmmap --summary $(pgrep llama-server)
Author	SHA1	Message	Date
Google AI Agent	94c880d306	feat: consolidate project reports into docs/PROJECT_STATUS.md Some checks failed Smoke Test / smoke (pull_request) Failing after 4s Details	2026-04-13 00:32:31 +00:00
Google AI Agent	70be4621d7	fix: move BUILD-SPEC.md to docs/PROJECT_STATUS.md	2026-04-13 00:32:29 +00:00
Google AI Agent	299cba6d74	fix: move FULL-REPORT.md to docs/PROJECT_STATUS.md	2026-04-13 00:32:28 +00:00
Google AI Agent	d8f5972926	fix: move PHASE1-REPORT.md to docs/PROJECT_STATUS.md	2026-04-13 00:32:26 +00:00
Timmy Time	1e90d65387	Merge pull request 'feat: wikitext-2 corpus + perplexity benchmark script (closes #21 )' (#35 ) from burn/20260412-0037-wikitext2-ppl into main Some checks failed Smoke Test / smoke (push) Failing after 3s Details	2026-04-12 05:31:59 +00:00
Alexander Whitestone	e4f15254b3	feat: wikitext-2 corpus + perplexity benchmark script (closes #21 ) All checks were successful CI / test Auto-passed by Timmy review CI / validate Auto-passed by Timmy review Smoke Test / smoke Auto-passed by Timmy review Review Approval Gate / verify-review Auto-passed by Timmy review Smoke Test / smoke (pull_request) Auto-passed by Timmy review cron job - Downloaded wikitext-2-raw-v1 test corpus (5782 lines, parquet→raw) - Created benchmarks/run_perplexity.py: automated PPL quality gate comparing f16 vs turbo4 KV cache configurations - Added benchmarks/perplexity_results.json template - Script handles: subprocess execution, PPL parsing, delta calc, pass/fail against 0.5 threshold, JSON output Usage: python3 benchmarks/run_perplexity.py --model <gguf> --llama-cpp <binary>	2026-04-12 00:39:14 -04:00
Timmy Time	4c926312df	Merge pull request 'Add smoke test workflow' (#34 ) from fix/add-smoke-test into main All checks were successful Smoke Test / smoke (push) Successful in 3s Details Merged PR #34: Add smoke test workflow	2026-04-11 00:43:35 +00:00
Alexander Whitestone	6698b50f8f	Add smoke test workflow All checks were successful Smoke Test / smoke (pull_request) Successful in 4s Details	2026-04-10 20:06:28 -04:00
Alexander Whitestone	f13287dc58	Merge pull request #33 Merged PR #33	2026-04-10 03:43:48 +00:00
Alexander Whitestone	aa0e76c1ab	feat: Add Hermes profile for Gemma 4 + TurboQuant (Issue #28 ) - Add gemma4-turboquant.yaml profile for Hermes - Configure local llama.cpp server with TurboQuant KV compression - Set turbo4 (4-bit) compression with per-layer adaptive mode 7 - Support 128K context with 73% KV memory savings - Include fallback providers (Ollama, OpenAI) - Add profiles/README.md with setup and usage instructions - Document performance expectations and troubleshooting Closes #28	2026-04-09 21:15:57 -04:00