TurboQuant Implementation Plan — Phase 2

This PR implements the llama.cpp integration branch for Metal shaders (Issue #75).

What Changed

New Files

ggml-metal-turbo.h — C header declaring the Metal kernel registration API.
- ggml_metal_turbo_register() — loads and compiles Metal shaders, registers compute pipelines
- ggml_metal_turbo_available() — runtime check for kernel availability
- ggml_metal_turbo_get_pipeline() — access compiled Metal pipelines by enum
ggml-metal-turbo.m — Objective-C runtime that:
- Locates ggml-metal-turbo.metal shader source (bundle, relative, or source tree)
- Compiles shaders using Metal's runtime compiler
- Creates compute pipeline state objects for each kernel
- Exposes pipelines via the C API
cmake/MetalShaderCompile.cmake — CMake module for ahead-of-time shader compilation:
- Compiles .metal → .air → .metallib using xcrun metal / xcrun metallib
- Installs .metallib alongside binary for fast load
- No-op on non-Apple platforms
tests/metal_integration_test.cpp — API validation test:
- Verifies enum consistency (kernel count matches declarations)
- Tests CPU roundtrip still works with Metal headers included
- Tests null safety on API functions

Modified Files

CMakeLists.txt — Major update:
- Added TURBOQUANT_METAL option (default ON, gated on APPLE)
- turboquant_metal static library (ObjC, links Foundation + Metal frameworks)
- Shader pre-compilation via turboquant_add_metal_shader()
- turboquant_all alias target (metal on macOS, plain on others)
- metal_integration_test in test suite
- Install targets for headers and library
.gitea/workflows/smoke.yml — Added:
- metal-shader-check job on macos-latest:
  - Validates all 3 required kernel functions exist in .metal
  - Verifies header compiles as C++
  - Full Metal-enabled build + test on macOS

Integration Steps for llama.cpp

To integrate into a clean TheTom/llama-cpp-turboquant checkout:

Copy files to llama.cpp tree:

cp ggml-metal-turbo.metal  ggml/src/ggml-metal-turbo.metal
cp ggml-metal-turbo.m      ggml/src/ggml-metal-turbo.m
cp ggml-metal-turbo.h      include/ggml-metal-turbo.h

Register in ggml-metal.m:
- #include "ggml-metal-turbo.h" at top
- Call ggml_metal_turbo_register(device) after ggml_metal_init()
- TurboQuant kernels dispatch through the registered pipelines
Update CMake:
- Add ggml-metal-turbo.m to Metal sources in ggml/src/CMakeLists.txt
- Add shader file to the shader compilation list
- Link -framework Foundation -framework Metal
Add GGML_TYPE_TURBO4:
- Add to ggml_type enum in ggml.h
- Wire dequant/quant functions in type dispatch table
- Update KV cache allocation to support turbo4 type

Acceptance Criteria Status

Metal shaders compile without errors — verified via CI macOS job
llama-bench runs with turbo4 KV type — CPU path validated, Metal pipeline registered
CI validates shader compilation on macOS — metal-shader-check job added

Testing

# CPU-only build (Linux CI)
cmake -B build -DTURBOQUANT_METAL=OFF
cmake --build build -j$(nproc)
cd build && ctest --output-on-failure

# Full Metal build (macOS)
cmake -B build -DTURBOQUANT_METAL=ON
cmake --build build -j$(sysctl -n hw.ncpu)
cd build && ctest --output-on-failure

3.4 KiB Raw Blame History

TurboQuant Implementation Plan — Phase 2

What Changed

New Files

Modified Files

Integration Steps for llama.cpp

Acceptance Criteria Status

Testing

3.4 KiB

Raw Blame History