turboquant/PR-IMPLEMENTATION-PLAN.md
Alexander Whitestone 574a5527ce
feat: llama.cpp integration branch for Metal shaders (#75)
2026-04-16 02:12:30 +00:00


TurboQuant Implementation Plan — Phase 2

This PR implements the llama.cpp integration branch for Metal shaders (Issue #75).

What Changed

New Files

  1. ggml-metal-turbo.h — C header declaring the Metal kernel registration API.

    • ggml_metal_turbo_register() — loads and compiles Metal shaders, registers compute pipelines
    • ggml_metal_turbo_available() — runtime check for kernel availability
    • ggml_metal_turbo_get_pipeline() — access compiled Metal pipelines by enum
  2. ggml-metal-turbo.m — Objective-C runtime that:

    • Locates ggml-metal-turbo.metal shader source (bundle, relative, or source tree)
    • Compiles shaders using Metal's runtime compiler
    • Creates compute pipeline state objects for each kernel
    • Exposes pipelines via the C API
  3. cmake/MetalShaderCompile.cmake — CMake module for ahead-of-time shader compilation:

    • Compiles .metal → .air → .metallib using xcrun metal / xcrun metallib
    • Installs .metallib alongside binary for fast load
    • No-op on non-Apple platforms
  4. tests/metal_integration_test.cpp — API validation test:

    • Verifies enum consistency (kernel count matches declarations)
    • Tests CPU roundtrip still works with Metal headers included
    • Tests null safety on API functions
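
The register / available / get-pipeline flow and the enum-consistency check described above can be sketched in plain C. This is a minimal illustration under stated assumptions, not the contents of ggml-metal-turbo.h: the `sketch_` functions, the enum, and the kernel names are hypothetical stand-ins, and the stub stores placeholder handles where the real Objective-C code would compile shaders and hold Metal pipeline state objects. Only the count of three kernels is taken from this plan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kernel identifiers; the real enum lives in ggml-metal-turbo.h. */
enum sketch_turbo_kernel {
    SKETCH_TURBO_KERNEL_QUANT = 0,
    SKETCH_TURBO_KERNEL_DEQUANT,
    SKETCH_TURBO_KERNEL_ATTN,
    SKETCH_TURBO_KERNEL_COUNT /* sentinel: must stay last */
};

/* Hypothetical shader entry-point names, one per enum value. */
static const char *const sketch_kernel_names[] = {
    "kernel_turbo_quant",
    "kernel_turbo_dequant",
    "kernel_turbo_attn",
};

/* Enum consistency: the name table must have exactly one entry per kernel. */
_Static_assert(sizeof(sketch_kernel_names) / sizeof(sketch_kernel_names[0])
                   == SKETCH_TURBO_KERNEL_COUNT,
               "kernel name table out of sync with enum");

static bool        g_registered = false;
static const void *g_pipelines[SKETCH_TURBO_KERNEL_COUNT];

/* Stand-in for registration: rejects a NULL device, otherwise fills every
 * pipeline slot with a placeholder handle (here, the kernel name). */
bool sketch_turbo_register(const void *device) {
    if (device == NULL) {
        return false;
    }
    for (int i = 0; i < SKETCH_TURBO_KERNEL_COUNT; ++i) {
        g_pipelines[i] = sketch_kernel_names[i];
    }
    g_registered = true;
    return true;
}

/* Stand-in for the runtime availability check. */
bool sketch_turbo_available(void) {
    return g_registered;
}

/* Stand-in for pipeline lookup: NULL before registration or for an
 * out-of-range kernel index. */
const void *sketch_turbo_get_pipeline(int kernel) {
    if (!g_registered || kernel < 0 || kernel >= SKETCH_TURBO_KERNEL_COUNT) {
        return NULL;
    }
    return g_pipelines[kernel];
}
```

A caller would check `sketch_turbo_available()` before dispatching and fall back to the CPU path when it returns false. The sentinel-plus-static-assert pattern is one way to catch a kernel added to the enum but not to the table at compile time.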

Modified Files

  1. CMakeLists.txt — Major update:

    • Added TURBOQUANT_METAL option (default ON, gated on APPLE)
    • turboquant_metal static library (ObjC, links Foundation + Metal frameworks)
    • Shader pre-compilation via turboquant_add_metal_shader()
    • turboquant_all alias target (metal on macOS, plain on others)
    • metal_integration_test in test suite
    • Install targets for headers and library
  2. .gitea/workflows/smoke.yml — Added:

    • metal-shader-check job on macos-latest:
      • Validates all 3 required kernel functions exist in .metal
      • Verifies header compiles as C++
      • Full Metal-enabled build + test on macOS
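
The "required kernel functions exist" check in the CI job could be approximated by a small text scan over the shader source. The sketch below is an assumption about how such a check might look, not the workflow's actual implementation; the helper names and the kernel names passed to it are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `c` can be part of a C/Metal identifier. */
static bool sketch_is_ident_char(char c) {
    return isalnum((unsigned char)c) || c == '_';
}

/* Returns true if `name` occurs in `src` as a whole identifier that is
 * immediately followed by '(' (i.e. looks like a function definition or call). */
bool sketch_has_kernel(const char *src, const char *name) {
    if (src == NULL || name == NULL || *name == '\0') {
        return false;
    }
    size_t len = strlen(name);
    for (const char *p = strstr(src, name); p != NULL; p = strstr(p + 1, name)) {
        bool starts_ident = (p == src) || !sketch_is_ident_char(p[-1]);
        if (starts_ident && p[len] == '(') {
            return true;
        }
    }
    return false;
}

/* Counts how many of the `n` required names are absent from `src`. */
size_t sketch_count_missing(const char *src, const char *const *names, size_t n) {
    size_t missing = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!sketch_has_kernel(src, names[i])) {
            ++missing;
        }
    }
    return missing;
}
```

A CI step built on this idea would read ggml-metal-turbo.metal into a buffer and fail when `sketch_count_missing` returns a nonzero value. It is a purely textual check and does not replace compiling the shader.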

Integration Steps for llama.cpp

To integrate into a clean TheTom/llama-cpp-turboquant checkout:

  1. Copy files to llama.cpp tree:

    cp ggml-metal-turbo.metal  ggml/src/ggml-metal-turbo.metal
    cp ggml-metal-turbo.m      ggml/src/ggml-metal-turbo.m
    cp ggml-metal-turbo.h      include/ggml-metal-turbo.h
    
  2. Register in ggml-metal.m:

    • #include "ggml-metal-turbo.h" at top
    • Call ggml_metal_turbo_register(device) after ggml_metal_init()
    • TurboQuant kernels dispatch through the registered pipelines
  3. Update CMake:

    • Add ggml-metal-turbo.m to Metal sources in ggml/src/CMakeLists.txt
    • Add shader file to the shader compilation list
    • Link -framework Foundation -framework Metal
  4. Add GGML_TYPE_TURBO4:

    • Add to ggml_type enum in ggml.h
    • Wire dequant/quant functions in type dispatch table
    • Update KV cache allocation to support turbo4 type
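
Step 4 amounts to adding an enum value, giving it an entry in a per-type dispatch table, and sizing buffers from that entry. The miniature below sketches that pattern under loud assumptions: the `sketch_` types and table do not mirror ggml's actual structures, and the 4-bit codec is a placeholder uniform quantizer over [-1, 1], not the TurboQuant algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Miniature stand-in for the ggml_type enum (hypothetical). */
enum sketch_type { SKETCH_TYPE_F32 = 0, SKETCH_TYPE_TURBO4, SKETCH_TYPE_COUNT };

struct sketch_type_traits {
    const char *name;
    size_t block_elems; /* elements per block */
    size_t block_bytes; /* bytes per block */
    void (*from_float)(const float *src, void *dst, size_t n);
    void (*to_float)(const void *src, float *dst, size_t n);
};

static void f32_from_float(const float *src, void *dst, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}
static void f32_to_float(const void *src, float *dst, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Placeholder codec: clamp to [-1, 1], map to a 4-bit code 0..15. */
static uint8_t quant4(float x) {
    if (x < -1.0f) x = -1.0f;
    if (x > 1.0f) x = 1.0f;
    return (uint8_t)((x + 1.0f) * 7.5f + 0.5f);
}
/* Packs two 4-bit codes per byte; n must be even. */
static void turbo4_from_float(const float *src, void *dst, size_t n) {
    uint8_t *out = dst;
    for (size_t i = 0; i + 1 < n; i += 2) {
        out[i / 2] = (uint8_t)(quant4(src[i]) | (quant4(src[i + 1]) << 4));
    }
}
/* Unpacks two 4-bit codes per byte back to floats in [-1, 1]. */
static void turbo4_to_float(const void *src, float *dst, size_t n) {
    const uint8_t *in = src;
    for (size_t i = 0; i + 1 < n; i += 2) {
        dst[i]     = (float)(in[i / 2] & 0x0F) / 7.5f - 1.0f;
        dst[i + 1] = (float)(in[i / 2] >> 4) / 7.5f - 1.0f;
    }
}

/* Dispatch table indexed by type: wiring a new type means adding one row. */
const struct sketch_type_traits sketch_traits[SKETCH_TYPE_COUNT] = {
    [SKETCH_TYPE_F32]    = {"f32", 1, sizeof(float), f32_from_float, f32_to_float},
    [SKETCH_TYPE_TURBO4] = {"turbo4", 2, 1, turbo4_from_float, turbo4_to_float},
};

/* Bytes needed for a row of n elements; what a KV cache allocator would use.
 * n must be a multiple of the type's block size. */
size_t sketch_row_size(enum sketch_type type, size_t n) {
    const struct sketch_type_traits *t = &sketch_traits[type];
    return n / t->block_elems * t->block_bytes;
}
```

Because allocation goes through `sketch_row_size`, a cache keyed on the type picks up the smaller footprint without special-casing: a 128-element row takes 512 bytes as f32 and 64 bytes in the placeholder 4-bit format.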

Acceptance Criteria Status

  • Metal shaders compile without errors — verified via CI macOS job
  • llama-bench runs with turbo4 KV type — CPU path validated, Metal pipeline registered
  • CI validates shader compilation on macOS — metal-shader-check job added

Testing

    # CPU-only build (Linux CI)
    cmake -B build -DTURBOQUANT_METAL=OFF
    cmake --build build -j$(nproc)
    cd build && ctest --output-on-failure

    # Full Metal build (macOS)
    cmake -B build -DTURBOQUANT_METAL=ON
    cmake --build build -j$(sysctl -n hw.ncpu)
    cd build && ctest --output-on-failure