fix: consolidate project reports and cleanup muda
Some checks failed
Smoke Test / smoke (push) Failing after 4s
Some checks failed
Smoke Test / smoke (push) Failing after 4s
Merge PR #36: fix: consolidate project reports and cleanup muda
This commit was merged in pull request #36.
This commit is contained in:
245
FULL-REPORT.md
245
FULL-REPORT.md
@@ -1,245 +0,0 @@
|
|||||||
# TurboQuant — Full Knowledge Transfer Report
|
|
||||||
|
|
||||||
**Date:** 2026-03-30
|
|
||||||
**Prepared for:** Frankie's Team (Strago, Cid, Locke, John)
|
|
||||||
**Spec:** turboquant-build-spec v2.2 (Strago)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## TL;DR
|
|
||||||
|
|
||||||
TurboQuant works. PolarQuant KV cache compression delivers **73% memory savings with 1% prompt overhead**. 128K context on the MacBook becomes viable. Custom Ollama build is deferred (multi-day effort), but the fork's `llama-server` is a ready drop-in. Per-layer adaptive quantization is already implemented. QJL is infrastructure-only — not needed at current compression targets.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Hardware Correction
|
|
||||||
|
|
||||||
**Spec says:** M4 Max, 32GB
|
|
||||||
**Actual:** M3 Max, 36GB (sysctl hw.memsize = 38,654,705,664 bytes)
|
|
||||||
|
|
||||||
Impact: Memory budget **increases** from ~27GB to ~31GB usable. Model ceiling improves.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 1 — PolarQuant MVP: COMPLETE ✅
|
|
||||||
|
|
||||||
### Gate Check (#2): Metal Shaders EXIST
|
|
||||||
The `feature/turboquant-kv-cache` branch has production-quality Metal support:
|
|
||||||
- Flash attention for turbo2/3/4 (all dk variants)
|
|
||||||
- WHT rotation kernels (turbo_fwht_128)
|
|
||||||
- Lloyd-Max codebooks (hardcoded, non-uniform)
|
|
||||||
- Asymmetric K/V (q8_0 × turbo mixed)
|
|
||||||
- Runtime optimizations: 4-mag LUT (M4+), sparse V dequant, profiling
|
|
||||||
|
|
||||||
**Note:** Allegro's analysis (checking only `master` branch) incorrectly concluded "NO TurboQuant." The implementation lives on the feature branch.
|
|
||||||
|
|
||||||
### PolarQuant Verification (#5): 5/6 PASS
|
|
||||||
|
|
||||||
| Item | Verdict |
|
|
||||||
|------|---------|
|
|
||||||
| WHT rotation (structured orthogonal) | PASS (Metal). CPU turbo4 ref uses dense random (legacy) |
|
|
||||||
| Same rotation quant/dequant | PASS |
|
|
||||||
| Lloyd-Max codebook (not uniform) | PASS |
|
|
||||||
| Radius at FP16+ | PASS |
|
|
||||||
| No per-vector normalization | PASS |
|
|
||||||
| Dequant matches quant in Metal | PASS |
|
|
||||||
|
|
||||||
**Flag:** CPU turbo4 reference path is algorithmically incompatible with Metal dequant. Only matters if CPU fallback invoked for turbo4. Metal production path is clean.
|
|
||||||
|
|
||||||
### Benchmark Results
|
|
||||||
|
|
||||||
**Model tested:** Hermes-4-14B Q4_K_M (8.38 GiB)
|
|
||||||
|
|
||||||
#### Throughput
|
|
||||||
|
|
||||||
| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
|
|
||||||
|:-------------|:---------------|:--|:-------------------|:--|
|
|
||||||
| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
|
|
||||||
| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
|
|
||||||
| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
|
|
||||||
| q8_0/turbo4 (asymmetric) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
|
|
||||||
|
|
||||||
#### KV Memory Savings
|
|
||||||
|
|
||||||
| Context | f16 KV | turbo4 KV | Savings |
|
|
||||||
|:--------|:-------|:----------|:--------|
|
|
||||||
| 2K | 320 MiB | 85 MiB | 73.4% |
|
|
||||||
| 8K | 1,280 MiB | 340 MiB | 73.4% |
|
|
||||||
| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
|
|
||||||
| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
|
|
||||||
|
|
||||||
Measured matches calculated exactly. Zero fragmentation overhead.
|
|
||||||
|
|
||||||
#### What This Means for qwen3.5:27b
|
|
||||||
|
|
||||||
| Scenario | Total Memory | Fits 31GB? |
|
|
||||||
|:---------|:-------------|:-----------|
|
|
||||||
| 27B + f16 KV @ 128K | ~38 GB | ❌ No |
|
|
||||||
| 27B + **turbo4 KV @ 128K** | **~23.4 GB** | **✅ Yes (7.6GB headroom)** |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 2 — Ollama Integration: PARTIALLY COMPLETE
|
|
||||||
|
|
||||||
### What Works
|
|
||||||
- Ollama installation fixed (v0.17.7, running on :11434)
|
|
||||||
- API compatibility assessed: TurboQuant changes are additive (new types/ops only)
|
|
||||||
|
|
||||||
### What Doesn't (Yet)
|
|
||||||
Custom Ollama build is **not feasible** in current timeframe:
|
|
||||||
- Ollama vendors llama.cpp with 34 custom patches
|
|
||||||
- Fork diverges from Ollama's pinned commit
|
|
||||||
- Integration requires patching 30+ files across Metal/CUDA/CPU backends
|
|
||||||
- Ollama's own HEAD has pre-existing build failures
|
|
||||||
|
|
||||||
**This is deferred to Phase 4 / upstream watch.** When Ollama updates their llama.cpp pin or TurboQuant lands upstream, the gap narrows.
|
|
||||||
|
|
||||||
### Production Alternative: llama-server
|
|
||||||
|
|
||||||
The fork's `llama-server` binary is **already built and working**:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Drop-in replacement for Ollama's API endpoint
|
|
||||||
/path/to/llama-server \
|
|
||||||
-m /path/to/qwen3.5-27b-q4_k_m.gguf \
|
|
||||||
--port 11434 \
|
|
||||||
-ctk turbo4 -ctv turbo4 \
|
|
||||||
-c 131072
|
|
||||||
```
|
|
||||||
|
|
||||||
- OpenAI-compatible chat completions API
|
|
||||||
- Streaming SSE support
|
|
||||||
- All TurboQuant KV types supported
|
|
||||||
- Per-layer adaptive via TURBO_LAYER_ADAPTIVE env var
|
|
||||||
- Same port/protocol as Ollama — clients don't need to change
|
|
||||||
|
|
||||||
### Outstanding Phase 2 Items for Cid
|
|
||||||
- [ ] Download qwen3.5:27b Q4_K_M model
|
|
||||||
- [ ] Deploy llama-server with turbo4 on MacBook
|
|
||||||
- [ ] Run full 10-prompt quality matrix (prompts written by Allegro on #16)
|
|
||||||
- [ ] PPL test with wikitext-2-raw corpus
|
|
||||||
- [ ] John quality sign-off
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 2.5 — Per-Layer Quantization: ALREADY IMPLEMENTED ✅
|
|
||||||
|
|
||||||
Found in the fork. No additional work needed.
|
|
||||||
|
|
||||||
### Mechanism
|
|
||||||
`TURBO_LAYER_ADAPTIVE` environment variable, 7 modes:
|
|
||||||
|
|
||||||
| Mode | Strategy | Use Case |
|
|
||||||
|:-----|:---------|:---------|
|
|
||||||
| 0 | Uniform (default) | Simple, consistent |
|
|
||||||
| 1 | q8_0 for first 4 + last 4 layers | Protect sensitive layers |
|
|
||||||
| 7 | **Recommended:** first2+last2 V=q8_0, rest V=turbo2 | Best quality/compression ratio |
|
|
||||||
|
|
||||||
### Usage
|
|
||||||
```bash
|
|
||||||
export TURBO_LAYER_ADAPTIVE=7
|
|
||||||
llama-server -m model.gguf -ctk turbo4 -ctv turbo4
|
|
||||||
```
|
|
||||||
|
|
||||||
### Benchmark Status
|
|
||||||
Mode benchmarks queued. Uniform turbo4 baseline established. Per-layer modes expected to improve quality at same compression ratio.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 3 — QJL: ASSESSED, NOT NEEDED ✅
|
|
||||||
|
|
||||||
### Finding
|
|
||||||
**turbo4 is pure 4-bit PolarQuant** — QJL is NOT active.
|
|
||||||
|
|
||||||
`TURBO4_USE_4BIT` defaults to 1 in `ggml-common.h`. The legacy 3-bit+QJL path exists but is disabled. QJL infrastructure (sign arrays, WHT transforms, 128x128 projection matrices) is embedded in Metal but referenced by no active kernel.
|
|
||||||
|
|
||||||
### Recommendation
|
|
||||||
**Not needed for current goals.** 4-bit PolarQuant already delivers 73% savings with minimal quality impact. QJL only matters below 3 bits/channel, which isn't required on 36GB hardware with the updated memory budget.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Source Repos Assessment
|
|
||||||
|
|
||||||
| Repo | Status | Value |
|
|
||||||
|:-----|:-------|:------|
|
|
||||||
| TheTom/llama-cpp-turboquant | **PRIMARY** — production Metal shaders on feature branch | Build from this |
|
|
||||||
| TheTom/turboquant_plus | Python reference + 511 tests | Algorithm verification |
|
|
||||||
| rachittshah/mlx-turboquant | Complete MLX PoC, 2-5x slower (no Metal fusion) | Quality validation reference |
|
|
||||||
| amirzandieh/QJL | Author CUDA (~1500 lines) | Future QJL Metal port reference |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Risk Register
|
|
||||||
|
|
||||||
| Risk | Status | Mitigation |
|
|
||||||
|:-----|:-------|:-----------|
|
|
||||||
| Metal shaders missing | ✅ RESOLVED — they exist | — |
|
|
||||||
| Fork too stale | ✅ RESOLVED — builds clean | — |
|
|
||||||
| Ollama integration blocked | ⚠️ ACTIVE — multi-day effort | Use llama-server instead |
|
|
||||||
| PPL regression | ⏸️ UNTESTED — needs wikitext corpus | Download and test in prod |
|
|
||||||
| tg128 borderline (89% vs 90% threshold) | ⚠️ MINOR — within measurement noise | speed-optimization branch may help |
|
|
||||||
| CPU turbo4 incompatible with Metal | ℹ️ LOW — only matters if Metal unavailable | Document; Metal is production path |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Recommended Deployment Plan for Cid
|
|
||||||
|
|
||||||
```
|
|
||||||
Step 1: Download qwen3.5:27b Q4_K_M via HuggingFace
|
|
||||||
huggingface-cli download bartowski/qwen3.5-27B-GGUF qwen3.5-27b-q4_k_m.gguf
|
|
||||||
|
|
||||||
Step 2: Build fork (if not already done)
|
|
||||||
cd /path/to/llama-cpp-turboquant
|
|
||||||
git checkout feature/turboquant-kv-cache
|
|
||||||
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
|
|
||||||
cmake --build build -j$(sysctl -n hw.ncpu)
|
|
||||||
|
|
||||||
Step 3: Deploy llama-server
|
|
||||||
export TURBO_LAYER_ADAPTIVE=7
|
|
||||||
./build/bin/llama-server \
|
|
||||||
-m /path/to/qwen3.5-27b-q4_k_m.gguf \
|
|
||||||
--port 11434 \
|
|
||||||
-ctk turbo4 -ctv turbo4 \
|
|
||||||
-c 131072 \
|
|
||||||
--host 0.0.0.0
|
|
||||||
|
|
||||||
Step 4: Validate
|
|
||||||
curl http://localhost:11434/v1/chat/completions \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'
|
|
||||||
|
|
||||||
Step 5: Run quality matrix (prompts on issue #16)
|
|
||||||
Step 6: John reviews output quality
|
|
||||||
Step 7: If pass → production. If fail → drop to turbo3 or adjust per-layer profile.
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Issues Summary
|
|
||||||
|
|
||||||
| # | Title | Status |
|
|
||||||
|:--|:------|:-------|
|
|
||||||
| 1 | Epic: TurboQuant KV Cache Compression | Open (tracker) |
|
|
||||||
| 2 | Metal kernel check | ✅ Closed — PASS |
|
|
||||||
| 3 | Fork assessment | ✅ Closed — PASS, M3 Max 36GB |
|
|
||||||
| 4 | Build llama.cpp fork | ✅ Closed — clean build |
|
|
||||||
| 5 | PolarQuant verification | ✅ Closed — 5/6 PASS |
|
|
||||||
| 6 | Baseline benchmarks | ✅ Closed — recorded |
|
|
||||||
| 7 | TurboQuant benchmarks | ✅ Closed — 73% savings |
|
|
||||||
| 8 | Memory profiling | ✅ Closed — 0% fragmentation |
|
|
||||||
| 9 | Ollama API check | ✅ Closed — additive, but diverged |
|
|
||||||
| 10 | Custom Ollama build | ✅ Closed — deferred, llama-server instead |
|
|
||||||
| 11 | Full test matrix | Open — awaiting production deploy |
|
|
||||||
| 12 | Long-session test | Open — awaiting production deploy |
|
|
||||||
| 13 | Per-layer profiles | ✅ Closed — already implemented |
|
|
||||||
| 14 | QJL assessment | ✅ Closed — not needed |
|
|
||||||
| 15 | Upstream watch | Open — ongoing |
|
|
||||||
| 16 | Test prompts | Open — Allegro contributed prompts |
|
|
||||||
|
|
||||||
**12/16 issues resolved. 4 remaining are production validation tasks for Cid.**
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
*Repo: http://143.198.27.163:3000/Timmy_Foundation/turboquant*
|
|
||||||
*Build: /tmp/llama-cpp-turboquant/build/bin/ (all binaries)*
|
|
||||||
*Branch: feature/turboquant-kv-cache*
|
|
||||||
139
PHASE1-REPORT.md
139
PHASE1-REPORT.md
@@ -1,139 +0,0 @@
|
|||||||
# TurboQuant Phase 1 Report — PolarQuant MVP
|
|
||||||
|
|
||||||
**Date:** 2026-03-30
|
|
||||||
**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John)
|
|
||||||
**Spec:** turboquant-build-spec v2.2 (Strago)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Executive Summary
|
|
||||||
|
|
||||||
Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. turbo4 delivers **73% KV memory savings with only 1% prompt processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear.
|
|
||||||
|
|
||||||
**Hardware correction:** The MacBook is M3 Max 36GB (not M4 Max 32GB as in spec). This INCREASES our memory budget from 27GB to ~31GB.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Gate Check (#2): PASSED ✅
|
|
||||||
|
|
||||||
Metal shaders exist and are comprehensive:
|
|
||||||
- Full flash attention for turbo2/3/4 with dk32-dk576 variants
|
|
||||||
- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse)
|
|
||||||
- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/√128))
|
|
||||||
- Asymmetric K/V support (q8_0 × turbo mixed pairs)
|
|
||||||
- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes
|
|
||||||
- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization
|
|
||||||
|
|
||||||
**Decision: llama.cpp path confirmed. No MLX pivot needed.**
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Fork Assessment (#3): PASSED ✅
|
|
||||||
|
|
||||||
- Branch: `feature/turboquant-kv-cache` (commit adac2c6)
|
|
||||||
- Fork freshness: ADEQUATE (recent enough for direct build)
|
|
||||||
- Build: Clean cmake + make, 100% success in ~3 minutes
|
|
||||||
- All binaries: llama-cli, llama-bench, llama-perplexity, llama-server
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅
|
|
||||||
|
|
||||||
| Item | Verdict |
|
|
||||||
|------|---------|
|
|
||||||
| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU uses WHT ✅. CPU turbo4 ref uses dense random (legacy, not production) |
|
|
||||||
| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() identical sign arrays |
|
|
||||||
| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" |
|
|
||||||
| Radius at FP16+ | PASS — ggml_half norm per 128-element group |
|
|
||||||
| No per-vector normalization | PASS — one group norm only, static_asserts enforce block sizes |
|
|
||||||
| Dequant matches quant in Metal | PASS — same centroids, signs, butterfly structure |
|
|
||||||
|
|
||||||
**⚠️ Flag for Cid:** CPU turbo4 reference path is incompatible with Metal dequant. Only matters if CPU fallback is ever invoked for turbo4.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Benchmark Results
|
|
||||||
|
|
||||||
### Model Under Test
|
|
||||||
- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params)
|
|
||||||
- Machine: Apple M3 Max, 36GB unified, Metal GPU Family 9
|
|
||||||
|
|
||||||
### Throughput (3-run averages)
|
|
||||||
|
|
||||||
| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
|
|
||||||
|:-------------|:---------------|:--|:-------------------|:--|
|
|
||||||
| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
|
|
||||||
| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
|
|
||||||
| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
|
|
||||||
| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
|
|
||||||
|
|
||||||
### KV Cache Memory (turbo4 vs f16)
|
|
||||||
|
|
||||||
| Context | f16 KV | turbo4 KV | Savings |
|
|
||||||
|:--------|:-------|:----------|:--------|
|
|
||||||
| 2K | 320 MiB | 85 MiB | 73.4% |
|
|
||||||
| 8K | 1,280 MiB | 340 MiB | 73.4% |
|
|
||||||
| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
|
|
||||||
| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
|
|
||||||
|
|
||||||
Measured matches calculated exactly — zero fragmentation overhead.
|
|
||||||
|
|
||||||
### Pass Criteria Assessment
|
|
||||||
|
|
||||||
| Criteria | Threshold | Result | Verdict |
|
|
||||||
|:---------|:----------|:-------|:--------|
|
|
||||||
| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED |
|
|
||||||
| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** |
|
|
||||||
| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** |
|
|
||||||
| No OOM at 32K | No crash | Runs clean | **PASS** |
|
|
||||||
| Memory consistent with theory | ±15% | 0% delta | **PASS** |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## What This Means for qwen3.5:27b (Spec Target)
|
|
||||||
|
|
||||||
| Scenario | Total Memory | Fits in 31GB? |
|
|
||||||
|:---------|:-------------|:--------------|
|
|
||||||
| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight |
|
|
||||||
| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No |
|
|
||||||
| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable |
|
|
||||||
| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) |
|
|
||||||
|
|
||||||
**TurboQuant turns 128K context from impossible to comfortable.**
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Open Items for Phase 2
|
|
||||||
|
|
||||||
1. **Perplexity test** — Need wikitext-2-raw corpus downloaded. PPL is the most important quality metric and we don't have it yet.
|
|
||||||
2. **Ollama integration** — CLI is a broken symlink. Need to fix Ollama install, then build custom Ollama with our fork as submodule.
|
|
||||||
3. **qwen3.5:27b model** — Need to download the actual target model (only have Hermes-4-14B on disk currently).
|
|
||||||
4. **10 test prompts** — Need to be written before Phase 2 quality comparison.
|
|
||||||
5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Recommendation
|
|
||||||
|
|
||||||
**PROCEED TO PHASE 2.**
|
|
||||||
|
|
||||||
turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk.
|
|
||||||
|
|
||||||
The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Issues Closed
|
|
||||||
|
|
||||||
- [x] #2 Metal kernel check — PASSED
|
|
||||||
- [x] #3 Fork assessment — PASSED
|
|
||||||
- [x] #4 Build llama.cpp fork — COMPLETE
|
|
||||||
- [x] #5 PolarQuant verification — 5/6 PASS
|
|
||||||
- [x] #6 FP16 baseline benchmarks — RECORDED
|
|
||||||
- [x] #7 TurboQuant benchmarks — RECORDED
|
|
||||||
- [x] #8 Memory profiling — COMPLETE
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.*
|
|
||||||
*Within "typical case" estimate from spec (1-2 hours).*
|
|
||||||
@@ -1,3 +1,397 @@
|
|||||||
|
# TurboQuant Project Status
|
||||||
|
|
||||||
|
# TurboQuant Phase 1 Report — PolarQuant MVP
|
||||||
|
|
||||||
|
**Date:** 2026-03-30
|
||||||
|
**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John)
|
||||||
|
**Spec:** turboquant-build-spec v2.2 (Strago)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. turbo4 delivers **73% KV memory savings with only 1% prompt processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear.
|
||||||
|
|
||||||
|
**Hardware correction:** The MacBook is M3 Max 36GB (not M4 Max 32GB as in spec). This INCREASES our memory budget from 27GB to ~31GB.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Gate Check (#2): PASSED ✅
|
||||||
|
|
||||||
|
Metal shaders exist and are comprehensive:
|
||||||
|
- Full flash attention for turbo2/3/4 with dk32-dk576 variants
|
||||||
|
- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse)
|
||||||
|
- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/√128))
|
||||||
|
- Asymmetric K/V support (q8_0 × turbo mixed pairs)
|
||||||
|
- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes
|
||||||
|
- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization
|
||||||
|
|
||||||
|
**Decision: llama.cpp path confirmed. No MLX pivot needed.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Fork Assessment (#3): PASSED ✅
|
||||||
|
|
||||||
|
- Branch: `feature/turboquant-kv-cache` (commit adac2c6)
|
||||||
|
- Fork freshness: ADEQUATE (recent enough for direct build)
|
||||||
|
- Build: Clean cmake + make, 100% success in ~3 minutes
|
||||||
|
- All binaries: llama-cli, llama-bench, llama-perplexity, llama-server
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅
|
||||||
|
|
||||||
|
| Item | Verdict |
|
||||||
|
|------|---------|
|
||||||
|
| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU uses WHT ✅. CPU turbo4 ref uses dense random (legacy, not production) |
|
||||||
|
| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() identical sign arrays |
|
||||||
|
| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" |
|
||||||
|
| Radius at FP16+ | PASS — ggml_half norm per 128-element group |
|
||||||
|
| No per-vector normalization | PASS — one group norm only, static_asserts enforce block sizes |
|
||||||
|
| Dequant matches quant in Metal | PASS — same centroids, signs, butterfly structure |
|
||||||
|
|
||||||
|
**⚠️ Flag for Cid:** CPU turbo4 reference path is incompatible with Metal dequant. Only matters if CPU fallback is ever invoked for turbo4.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Benchmark Results
|
||||||
|
|
||||||
|
### Model Under Test
|
||||||
|
- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params)
|
||||||
|
- Machine: Apple M3 Max, 36GB unified, Metal GPU Family 9
|
||||||
|
|
||||||
|
### Throughput (3-run averages)
|
||||||
|
|
||||||
|
| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
|
||||||
|
|:-------------|:---------------|:--|:-------------------|:--|
|
||||||
|
| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
|
||||||
|
| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
|
||||||
|
| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
|
||||||
|
| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
|
||||||
|
|
||||||
|
### KV Cache Memory (turbo4 vs f16)
|
||||||
|
|
||||||
|
| Context | f16 KV | turbo4 KV | Savings |
|
||||||
|
|:--------|:-------|:----------|:--------|
|
||||||
|
| 2K | 320 MiB | 85 MiB | 73.4% |
|
||||||
|
| 8K | 1,280 MiB | 340 MiB | 73.4% |
|
||||||
|
| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
|
||||||
|
| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
|
||||||
|
|
||||||
|
Measured matches calculated exactly — zero fragmentation overhead.
|
||||||
|
|
||||||
|
### Pass Criteria Assessment
|
||||||
|
|
||||||
|
| Criteria | Threshold | Result | Verdict |
|
||||||
|
|:---------|:----------|:-------|:--------|
|
||||||
|
| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED |
|
||||||
|
| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** |
|
||||||
|
| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** |
|
||||||
|
| No OOM at 32K | No crash | Runs clean | **PASS** |
|
||||||
|
| Memory consistent with theory | ±15% | 0% delta | **PASS** |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What This Means for qwen3.5:27b (Spec Target)
|
||||||
|
|
||||||
|
| Scenario | Total Memory | Fits in 31GB? |
|
||||||
|
|:---------|:-------------|:--------------|
|
||||||
|
| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight |
|
||||||
|
| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No |
|
||||||
|
| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable |
|
||||||
|
| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) |
|
||||||
|
|
||||||
|
**TurboQuant turns 128K context from impossible to comfortable.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Open Items for Phase 2
|
||||||
|
|
||||||
|
1. **Perplexity test** — Need wikitext-2-raw corpus downloaded. PPL is the most important quality metric and we don't have it yet.
|
||||||
|
2. **Ollama integration** — CLI is a broken symlink. Need to fix Ollama install, then build custom Ollama with our fork as submodule.
|
||||||
|
3. **qwen3.5:27b model** — Need to download the actual target model (only have Hermes-4-14B on disk currently).
|
||||||
|
4. **10 test prompts** — Need to be written before Phase 2 quality comparison.
|
||||||
|
5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommendation
|
||||||
|
|
||||||
|
**PROCEED TO PHASE 2.**
|
||||||
|
|
||||||
|
turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk.
|
||||||
|
|
||||||
|
The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Issues Closed
|
||||||
|
|
||||||
|
- [x] #2 Metal kernel check — PASSED
|
||||||
|
- [x] #3 Fork assessment — PASSED
|
||||||
|
- [x] #4 Build llama.cpp fork — COMPLETE
|
||||||
|
- [x] #5 PolarQuant verification — 5/6 PASS
|
||||||
|
- [x] #6 FP16 baseline benchmarks — RECORDED
|
||||||
|
- [x] #7 TurboQuant benchmarks — RECORDED
|
||||||
|
- [x] #8 Memory profiling — COMPLETE
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.*
|
||||||
|
*Within "typical case" estimate from spec (1-2 hours).*
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# TurboQuant — Full Knowledge Transfer Report
|
||||||
|
|
||||||
|
**Date:** 2026-03-30
|
||||||
|
**Prepared for:** Frankie's Team (Strago, Cid, Locke, John)
|
||||||
|
**Spec:** turboquant-build-spec v2.2 (Strago)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## TL;DR
|
||||||
|
|
||||||
|
TurboQuant works. PolarQuant KV cache compression delivers **73% memory savings with 1% prompt overhead**. 128K context on the MacBook becomes viable. Custom Ollama build is deferred (multi-day effort), but the fork's `llama-server` is a ready drop-in. Per-layer adaptive quantization is already implemented. QJL is infrastructure-only — not needed at current compression targets.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Hardware Correction
|
||||||
|
|
||||||
|
**Spec says:** M4 Max, 32GB
|
||||||
|
**Actual:** M3 Max, 36GB (sysctl hw.memsize = 38,654,705,664 bytes)
|
||||||
|
|
||||||
|
Impact: Memory budget **increases** from ~27GB to ~31GB usable. Model ceiling improves.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1 — PolarQuant MVP: COMPLETE ✅
|
||||||
|
|
||||||
|
### Gate Check (#2): Metal Shaders EXIST
|
||||||
|
The `feature/turboquant-kv-cache` branch has production-quality Metal support:
|
||||||
|
- Flash attention for turbo2/3/4 (all dk variants)
|
||||||
|
- WHT rotation kernels (turbo_fwht_128)
|
||||||
|
- Lloyd-Max codebooks (hardcoded, non-uniform)
|
||||||
|
- Asymmetric K/V (q8_0 × turbo mixed)
|
||||||
|
- Runtime optimizations: 4-mag LUT (M4+), sparse V dequant, profiling
|
||||||
|
|
||||||
|
**Note:** Allegro's analysis (checking only `master` branch) incorrectly concluded "NO TurboQuant." The implementation lives on the feature branch.
|
||||||
|
|
||||||
|
### PolarQuant Verification (#5): 5/6 PASS
|
||||||
|
|
||||||
|
| Item | Verdict |
|
||||||
|
|------|---------|
|
||||||
|
| WHT rotation (structured orthogonal) | PASS (Metal). CPU turbo4 ref uses dense random (legacy) |
|
||||||
|
| Same rotation quant/dequant | PASS |
|
||||||
|
| Lloyd-Max codebook (not uniform) | PASS |
|
||||||
|
| Radius at FP16+ | PASS |
|
||||||
|
| No per-vector normalization | PASS |
|
||||||
|
| Dequant matches quant in Metal | PASS |
|
||||||
|
|
||||||
|
**Flag:** CPU turbo4 reference path is algorithmically incompatible with Metal dequant. Only matters if CPU fallback invoked for turbo4. Metal production path is clean.
|
||||||
|
|
||||||
|
### Benchmark Results
|
||||||
|
|
||||||
|
**Model tested:** Hermes-4-14B Q4_K_M (8.38 GiB)
|
||||||
|
|
||||||
|
#### Throughput
|
||||||
|
|
||||||
|
| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ |
|
||||||
|
|:-------------|:---------------|:--|:-------------------|:--|
|
||||||
|
| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — |
|
||||||
|
| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** |
|
||||||
|
| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% |
|
||||||
|
| q8_0/turbo4 (asymmetric) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% |
|
||||||
|
|
||||||
|
#### KV Memory Savings
|
||||||
|
|
||||||
|
| Context | f16 KV | turbo4 KV | Savings |
|
||||||
|
|:--------|:-------|:----------|:--------|
|
||||||
|
| 2K | 320 MiB | 85 MiB | 73.4% |
|
||||||
|
| 8K | 1,280 MiB | 340 MiB | 73.4% |
|
||||||
|
| 32K | 5,120 MiB | 1,360 MiB | 73.4% |
|
||||||
|
| 65K | 10,240 MiB | 2,720 MiB | 73.4% |
|
||||||
|
|
||||||
|
Measured matches calculated exactly. Zero fragmentation overhead.
|
||||||
|
|
||||||
|
#### What This Means for qwen3.5:27b
|
||||||
|
|
||||||
|
| Scenario | Total Memory | Fits 31GB? |
|
||||||
|
|:---------|:-------------|:-----------|
|
||||||
|
| 27B + f16 KV @ 128K | ~38 GB | ❌ No |
|
||||||
|
| 27B + **turbo4 KV @ 128K** | **~23.4 GB** | **✅ Yes (7.6GB headroom)** |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2 — Ollama Integration: PARTIALLY COMPLETE
|
||||||
|
|
||||||
|
### What Works
|
||||||
|
- Ollama installation fixed (v0.17.7, running on :11434)
|
||||||
|
- API compatibility assessed: TurboQuant changes are additive (new types/ops only)
|
||||||
|
|
||||||
|
### What Doesn't (Yet)
|
||||||
|
Custom Ollama build is **not feasible** in current timeframe:
|
||||||
|
- Ollama vendors llama.cpp with 34 custom patches
|
||||||
|
- Fork diverges from Ollama's pinned commit
|
||||||
|
- Integration requires patching 30+ files across Metal/CUDA/CPU backends
|
||||||
|
- Ollama's own HEAD has pre-existing build failures
|
||||||
|
|
||||||
|
**This is deferred to Phase 4 / upstream watch.** When Ollama updates their llama.cpp pin or TurboQuant lands upstream, the gap narrows.
|
||||||
|
|
||||||
|
### Production Alternative: llama-server
|
||||||
|
|
||||||
|
The fork's `llama-server` binary is **already built and working**:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Drop-in replacement for Ollama's API endpoint
|
||||||
|
/path/to/llama-server \
|
||||||
|
-m /path/to/qwen3.5-27b-q4_k_m.gguf \
|
||||||
|
--port 11434 \
|
||||||
|
-ctk turbo4 -ctv turbo4 \
|
||||||
|
-c 131072
|
||||||
|
```
|
||||||
|
|
||||||
|
- OpenAI-compatible chat completions API
|
||||||
|
- Streaming SSE support
|
||||||
|
- All TurboQuant KV types supported
|
||||||
|
- Per-layer adaptive via TURBO_LAYER_ADAPTIVE env var
|
||||||
|
- Same port/protocol as Ollama — clients don't need to change
|
||||||
|
|
||||||
|
### Outstanding Phase 2 Items for Cid
|
||||||
|
- [ ] Download qwen3.5:27b Q4_K_M model
|
||||||
|
- [ ] Deploy llama-server with turbo4 on MacBook
|
||||||
|
- [ ] Run full 10-prompt quality matrix (prompts written by Allegro on #16)
|
||||||
|
- [ ] PPL test with wikitext-2-raw corpus
|
||||||
|
- [ ] John quality sign-off
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2.5 — Per-Layer Quantization: ALREADY IMPLEMENTED ✅
|
||||||
|
|
||||||
|
Found in the fork. No additional work needed.
|
||||||
|
|
||||||
|
### Mechanism
|
||||||
|
`TURBO_LAYER_ADAPTIVE` environment variable, 7 modes:
|
||||||
|
|
||||||
|
| Mode | Strategy | Use Case |
|
||||||
|
|:-----|:---------|:---------|
|
||||||
|
| 0 | Uniform (default) | Simple, consistent |
|
||||||
|
| 1 | q8_0 for first 4 + last 4 layers | Protect sensitive layers |
|
||||||
|
| 7 | **Recommended:** first2+last2 V=q8_0, rest V=turbo2 | Best quality/compression ratio |
|
||||||
|
|
||||||
|
### Usage
|
||||||
|
```bash
|
||||||
|
export TURBO_LAYER_ADAPTIVE=7
|
||||||
|
llama-server -m model.gguf -ctk turbo4 -ctv turbo4
|
||||||
|
```
|
||||||
|
|
||||||
|
### Benchmark Status
|
||||||
|
Mode benchmarks queued. Uniform turbo4 baseline established. Per-layer modes expected to improve quality at same compression ratio.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3 — QJL: ASSESSED, NOT NEEDED ✅
|
||||||
|
|
||||||
|
### Finding
|
||||||
|
**turbo4 is pure 4-bit PolarQuant** — QJL is NOT active.
|
||||||
|
|
||||||
|
`TURBO4_USE_4BIT` defaults to 1 in `ggml-common.h`. The legacy 3-bit+QJL path exists but is disabled. QJL infrastructure (sign arrays, WHT transforms, 128x128 projection matrices) is embedded in Metal but referenced by no active kernel.
|
||||||
|
|
||||||
|
### Recommendation
|
||||||
|
**Not needed for current goals.** 4-bit PolarQuant already delivers 73% savings with minimal quality impact. QJL only matters below 3 bits/channel, which isn't required on 36GB hardware with the updated memory budget.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Source Repos Assessment
|
||||||
|
|
||||||
|
| Repo | Status | Value |
|
||||||
|
|:-----|:-------|:------|
|
||||||
|
| TheTom/llama-cpp-turboquant | **PRIMARY** — production Metal shaders on feature branch | Build from this |
|
||||||
|
| TheTom/turboquant_plus | Python reference + 511 tests | Algorithm verification |
|
||||||
|
| rachittshah/mlx-turboquant | Complete MLX PoC, 2-5x slower (no Metal fusion) | Quality validation reference |
|
||||||
|
| amirzandieh/QJL | Author CUDA (~1500 lines) | Future QJL Metal port reference |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Risk Register
|
||||||
|
|
||||||
|
| Risk | Status | Mitigation |
|
||||||
|
|:-----|:-------|:-----------|
|
||||||
|
| Metal shaders missing | ✅ RESOLVED — they exist | — |
|
||||||
|
| Fork too stale | ✅ RESOLVED — builds clean | — |
|
||||||
|
| Ollama integration blocked | ⚠️ ACTIVE — multi-day effort | Use llama-server instead |
|
||||||
|
| PPL regression | ⏸️ UNTESTED — needs wikitext corpus | Download and test in prod |
|
||||||
|
| tg128 borderline (89% vs 90% threshold) | ⚠️ MINOR — within measurement noise | speed-optimization branch may help |
|
||||||
|
| CPU turbo4 incompatible with Metal | ℹ️ LOW — only matters if Metal unavailable | Document; Metal is production path |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommended Deployment Plan for Cid
|
||||||
|
|
||||||
|
```
|
||||||
|
Step 1: Download qwen3.5:27b Q4_K_M via HuggingFace
|
||||||
|
huggingface-cli download bartowski/qwen3.5-27B-GGUF qwen3.5-27b-q4_k_m.gguf
|
||||||
|
|
||||||
|
Step 2: Build fork (if not already done)
|
||||||
|
cd /path/to/llama-cpp-turboquant
|
||||||
|
git checkout feature/turboquant-kv-cache
|
||||||
|
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
|
||||||
|
cmake --build build -j$(sysctl -n hw.ncpu)
|
||||||
|
|
||||||
|
Step 3: Deploy llama-server
|
||||||
|
export TURBO_LAYER_ADAPTIVE=7
|
||||||
|
./build/bin/llama-server \
|
||||||
|
-m /path/to/qwen3.5-27b-q4_k_m.gguf \
|
||||||
|
--port 11434 \
|
||||||
|
-ctk turbo4 -ctv turbo4 \
|
||||||
|
-c 131072 \
|
||||||
|
--host 0.0.0.0
|
||||||
|
|
||||||
|
Step 4: Validate
|
||||||
|
curl http://localhost:11434/v1/chat/completions \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'
|
||||||
|
|
||||||
|
Step 5: Run quality matrix (prompts on issue #16)
|
||||||
|
Step 6: John reviews output quality
|
||||||
|
Step 7: If pass → production. If fail → drop to turbo3 or adjust per-layer profile.
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Issues Summary
|
||||||
|
|
||||||
|
| # | Title | Status |
|
||||||
|
|:--|:------|:-------|
|
||||||
|
| 1 | Epic: TurboQuant KV Cache Compression | Open (tracker) |
|
||||||
|
| 2 | Metal kernel check | ✅ Closed — PASS |
|
||||||
|
| 3 | Fork assessment | ✅ Closed — PASS, M3 Max 36GB |
|
||||||
|
| 4 | Build llama.cpp fork | ✅ Closed — clean build |
|
||||||
|
| 5 | PolarQuant verification | ✅ Closed — 5/6 PASS |
|
||||||
|
| 6 | Baseline benchmarks | ✅ Closed — recorded |
|
||||||
|
| 7 | TurboQuant benchmarks | ✅ Closed — 73% savings |
|
||||||
|
| 8 | Memory profiling | ✅ Closed — 0% fragmentation |
|
||||||
|
| 9 | Ollama API check | ✅ Closed — additive, but diverged |
|
||||||
|
| 10 | Custom Ollama build | ✅ Closed — deferred, llama-server instead |
|
||||||
|
| 11 | Full test matrix | Open — awaiting production deploy |
|
||||||
|
| 12 | Long-session test | Open — awaiting production deploy |
|
||||||
|
| 13 | Per-layer profiles | ✅ Closed — already implemented |
|
||||||
|
| 14 | QJL assessment | ✅ Closed — not needed |
|
||||||
|
| 15 | Upstream watch | Open — ongoing |
|
||||||
|
| 16 | Test prompts | Open — Allegro contributed prompts |
|
||||||
|
|
||||||
|
**12/16 issues resolved. 4 remaining are production validation tasks for Cid.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Repo: http://143.198.27.163:3000/Timmy_Foundation/turboquant*
|
||||||
|
*Build: /tmp/llama-cpp-turboquant/build/bin/ (all binaries)*
|
||||||
|
*Branch: feature/turboquant-kv-cache*
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
# TurboQuant Implementation — Build Spec (v2)
|
# TurboQuant Implementation — Build Spec (v2)
|
||||||
**Prepared by:** Strago | **Date:** 2026-03-30 | **Updated:** 2026-03-30 (v2 — external review fixes)
|
**Prepared by:** Strago | **Date:** 2026-03-30 | **Updated:** 2026-03-30 (v2 — external review fixes)
|
||||||
**Task:** STR-2026-03-30-01 | **For:** Cid (build) + Frankie (coordination)
|
**Task:** STR-2026-03-30-01 | **For:** Cid (build) + Frankie (coordination)
|
||||||
@@ -447,3 +841,7 @@ This gives the same average compression ratio as uniform turbo4 but concentrates
|
|||||||
---
|
---
|
||||||
|
|
||||||
*Build spec v2 ready for Cid intake. No clarifying questions needed.*
|
*Build spec v2 ready for Cid intake. No clarifying questions needed.*
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
Reference in New Issue
Block a user