Timmy_Foundation/turboquant

Google AI Agent 5f9f316f2c Add implementation plan

2026-03-30 21:06:51 +00:00

TurboQuant Implementation Plan — Phase 2

This PR provides the core C++ and Metal implementation for PolarQuant KV cache compression.

Components Added

- llama-turbo.h / llama-turbo.cpp: CPU reference implementation of the PolarQuant algorithm (Walsh-Hadamard transform (WHT) followed by Lloyd-Max quantization).
- ggml-metal-turbo.metal: Metal kernels for GPU-accelerated dequantization and WHT rotation.
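
As a rough illustration of the two stages named above, the following sketch rotates a vector with an orthonormal fast Walsh-Hadamard transform and then maps each rotated value to its nearest codebook entry. It is not the code in llama-turbo.cpp: the function names (`fwht`, `nearest_centroid`, `quantize_rotated`, `dequantize_rotated`) are hypothetical, and the codebook is passed in by the caller rather than being a trained Lloyd-Max table.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so that the
// transform is orthonormal and therefore its own inverse.
// The length must be a non-zero power of two.
inline void fwht(std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;  // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) v *= scale;
}

// Hypothetical scalar quantizer: index of the codebook entry closest to v.
// A Lloyd-Max quantizer would use centroids trained on the rotated data;
// here the codebook is whatever the caller supplies (must be non-empty).
inline std::uint8_t nearest_centroid(float v, const std::vector<float>& codebook) {
    assert(!codebook.empty());
    std::size_t best = 0;
    float best_dist = std::fabs(v - codebook[0]);
    for (std::size_t k = 1; k < codebook.size(); ++k) {
        const float dist = std::fabs(v - codebook[k]);
        if (dist < best_dist) {
            best_dist = dist;
            best = k;
        }
    }
    return static_cast<std::uint8_t>(best);
}

// Rotate a copy of x with the WHT, then quantize each coefficient.
inline std::vector<std::uint8_t> quantize_rotated(std::vector<float> x,
                                                  const std::vector<float>& codebook) {
    fwht(x);
    std::vector<std::uint8_t> idx(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        idx[i] = nearest_centroid(x[i], codebook);
    }
    return idx;
}

// Look up each centroid, then undo the rotation (the orthonormal WHT is self-inverse).
inline std::vector<float> dequantize_rotated(const std::vector<std::uint8_t>& idx,
                                             const std::vector<float>& codebook) {
    std::vector<float> x(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        x[i] = codebook[idx[i]];
    }
    fwht(x);
    return x;
}
```

A 4-bit variant would use a 16-entry codebook. Because the rotation is orthonormal, the reconstruction error in the original domain has the same L2 norm as the quantization error in the rotated domain.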

Integration Steps for llama.cpp

To integrate this into a clean llama.cpp checkout:

1. Add to ggml-metal.metal:
   - Copy the kernels from ggml-metal-turbo.metal into ggml/src/ggml-metal.metal.
   - Register the new kernels in ggml-metal.m.
2. Add to llama.cpp:
   - Include llama-turbo.h in llama.cpp.
   - Add GGML_TYPE_TURBO4 to the ggml_type enum in ggml.h.
   - Update the KV cache allocation logic to support the new type.
3. Update Makefile/CMake:
   - Add llama-turbo.cpp to the build sources.
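
To make the KV cache allocation step more concrete, here is a minimal sketch of how a 4-bit block type could be sized and packed. Everything in it is an assumption for illustration: the plan does not specify the GGML_TYPE_TURBO4 block layout, so `BlockTurbo4`, the block size of 32, the per-block `scale` field, and the helper names are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical block size: number of values that share one block header.
constexpr std::size_t kTurbo4BlockSize = 32;

// Hypothetical block layout (NOT taken from the plan): one float scale
// followed by 32 codebook indices packed two per byte.
struct BlockTurbo4 {
    float scale;
    std::uint8_t qs[kTurbo4BlockSize / 2];
};
static_assert(sizeof(BlockTurbo4) == 20, "layout assumes no padding");

// Bytes needed to store one row of n_elements values.
// n_elements is assumed to be a multiple of the block size.
constexpr std::size_t turbo4_row_size(std::size_t n_elements) {
    return (n_elements / kTurbo4BlockSize) * sizeof(BlockTurbo4);
}

// Store a 4-bit index: even positions use the low nibble, odd the high nibble.
inline void set_index(BlockTurbo4& b, std::size_t i, std::uint8_t idx) {
    assert(i < kTurbo4BlockSize && idx < 16);
    std::uint8_t& byte = b.qs[i / 2];
    if (i % 2 == 0) {
        byte = static_cast<std::uint8_t>((byte & 0xF0) | idx);
    } else {
        byte = static_cast<std::uint8_t>((byte & 0x0F) | (idx << 4));
    }
}

// Read back the 4-bit index stored at position i.
inline std::uint8_t get_index(const BlockTurbo4& b, std::size_t i) {
    assert(i < kTurbo4BlockSize);
    const std::uint8_t byte = b.qs[i / 2];
    return static_cast<std::uint8_t>((i % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
}
```

Under these assumptions, a 128-element row takes 80 bytes, compared with 256 bytes for the same row stored as 16-bit floats. The allocation logic would need an equivalent bytes-per-row rule for whatever layout the real type uses.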

Ollama Integration (The Biggest Challenge)

Ollama builds llama.cpp as a submodule. To use this implementation in Ollama:

1. Custom llama.cpp Submodule:
   - Point Ollama's llm/llama.cpp submodule to our fork containing these changes.
2. Update CGo Bindings:
   - If the llama.h API surface changed, update llm/llama.go to match.
3. Build Ollama:
   - Run go generate ./... and then go build . to produce the custom Ollama binary.

Verification

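Run llama-perplexity with the KV cache type set to turbo4 (written here as --kv-type turbo4; upstream llama.cpp selects the cache type with --cache-type-k and --cache-type-v, so the exact flag depends on the fork) and confirm that perplexity stays close to the uncompressed-cache baseline.
Run llama-bench to compare Metal shader throughput against the same baseline.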