- Add gemma4-turboquant.yaml profile for Hermes
- Configure local llama.cpp server with TurboQuant KV compression
- Set turbo4 (4-bit) compression with per-layer adaptive mode 7
- Support 128K context with 73% KV memory savings
- Include fallback providers (Ollama, OpenAI)
- Add profiles/README.md with setup and usage instructions
- Document performance expectations and troubleshooting

Closes #28
# Hermes Profiles for TurboQuant
This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.
## Available Profiles

### gemma4-turboquant.yaml
Profile for Gemma 4 model with TurboQuant KV cache compression.
- Primary Provider: Local llama.cpp server with TurboQuant enabled
- Endpoint: http://localhost:8081
- KV Compression: turbo4 (4-bit PolarQuant)
- Context Length: 128K tokens
- Memory Savings: ~73% KV cache reduction
- Fallback Providers: Ollama, OpenAI-compatible API
## Quick Start

### 1. Build TurboQuant-enabled llama.cpp

```bash
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```
### 2. Download Gemma 4 Model

```bash
# Download the Gemma 4 Q4_K_M quantized model
huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
```
### 3. Start llama-server with TurboQuant

```bash
# Enable per-layer adaptive quantization (mode 7)
export TURBO_LAYER_ADAPTIVE=7

./build/bin/llama-server \
  -m /path/to/gemma-4-q4_k_m.gguf \
  --port 8081 \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --host 0.0.0.0
```
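Before pointing Hermes at the server, it can help to wait for it to report ready. llama-server exposes a `GET /health` endpoint while running; the helper below is a generic readiness probe (the function name and retry scheme are ours, not part of llama.cpp):

```shell
# Generic readiness probe; llama-server answers GET /health once it is up.
wait_for_server() {
  # $1 = health URL, $2 = max attempts, one second apart (default 30)
  local url=$1 tries=${2:-30} i
  for i in $(seq 1 "$tries"); do
    curl -sf "$url" >/dev/null && return 0
    sleep 1
  done
  return 1
}

# Example: wait_for_server http://localhost:8081/health && echo "server ready"
```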
### 4. Install Profile

```bash
# Copy the profile into the Hermes directory
cp gemma4-turboquant.yaml ~/.hermes/profiles/

# Or create a symlink instead
ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/
```
### 5. Use with Hermes

```bash
# Start Hermes with the profile
hermes --profile gemma4-turboquant

# Or set it as the default profile in the Hermes config
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
```
## Profile Configuration

The profile includes:

- Primary Provider: Local llama.cpp server with TurboQuant
- Fallback Providers: Ollama (local), OpenAI (cloud)
- TurboQuant Settings:
  - `kv_type: turbo4` (4-bit compression)
  - `layer_adaptive_mode: 7` (best quality/compression ratio)
  - `max_context`: 128K tokens
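As an illustration, the overall shape of the profile might look like the sketch below. The field names here are assumptions chosen to mirror the settings above, not the verbatim contents of `gemma4-turboquant.yaml` or the exact Hermes schema:

```yaml
# Hypothetical sketch only -- field names are illustrative, not the exact schema.
providers:
  - name: llamacpp-turboquant      # primary: local llama.cpp server
    endpoint: http://localhost:8081
    max_context: 131072
    turboquant:
      kv_type: turbo4
      layer_adaptive_mode: 7
  - name: ollama                   # local fallback
    endpoint: http://localhost:11434
  - name: openai                   # cloud fallback
    endpoint: https://api.openai.com/v1
```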
## Performance Expectations
| Metric | Value | Notes |
|---|---|---|
| KV Memory Savings | 73% | Measured on M3 Max |
| Prompt Processing | ~1% overhead | vs FP16 baseline |
| Generation Speed | ~11% overhead | vs FP16 baseline |
| Max Context (36 GB) | 128K | Comfortable, with ~7.6 GB headroom |
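To see where savings of this magnitude come from, here is a back-of-envelope KV cache estimate. The layer/head dimensions below are illustrative assumptions, not Gemma 4's published architecture, and the 73% figure is the measured saving from the table:

```shell
# Illustrative KV cache sizing for a hypothetical 48-layer model with
# 8 KV heads of dimension 128 at 128K context (dims are assumptions).
layers=48 kv_heads=8 head_dim=128 ctx=131072

# FP16 baseline: K and V tensors, 2 bytes per element
fp16_gib=$(awk -v l=$layers -v h=$kv_heads -v d=$head_dim -v c=$ctx \
  'BEGIN { printf "%.1f", 2 * l * h * d * c * 2 / 1024^3 }')

# turbo4 keeps roughly 27% of that (~73% savings, per the table above)
turbo4_gib=$(awk -v g=$fp16_gib 'BEGIN { printf "%.1f", g * 0.27 }')

echo "FP16 KV: ${fp16_gib} GiB -> turbo4: ${turbo4_gib} GiB"
```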
## Customization

### Adjust Compression Level

```yaml
turboquant:
  kv_type: "turbo3"  # Lower compression, faster
  # or
  kv_type: "turbo2"  # Minimal compression, fastest
```
### Disable Per-Layer Adaptive

```yaml
turboquant:
  layer_adaptive_mode: 0  # Uniform quantization
```
### Use Asymmetric K/V

For better quality on sensitive models, keep keys at higher precision than values:

```bash
# Start the server with 8-bit keys and 4-bit values
llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
```
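The memory cost of this quality knob can be roughed out: q8_0 stores about 8.5 bits per element (one byte plus amortized block scales), and if we assume turbo4 lands near 4.5 bits per element once its metadata is included (an assumption, not a measured figure), the average cost per cached element works out as:

```shell
# Rough average bits per KV element. The ~8.5 b/elem for q8_0 follows from its
# 1-byte-per-element plus per-block-scale layout; 4.5 b/elem for turbo4 is an
# assumed figure for illustration.
sym=$(awk 'BEGIN { printf "%.2f", (4.5 + 4.5) / 2 }')   # turbo4 for both K and V
asym=$(awk 'BEGIN { printf "%.2f", (8.5 + 4.5) / 2 }')  # q8_0 keys, turbo4 values
echo "avg bits/elem -- symmetric: $sym, asymmetric: $asym"
```

So the asymmetric setup trades roughly 40% more KV memory for higher-fidelity keys.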
## Troubleshooting

### Server Won't Start

- Check that port 8081 is available: `lsof -i :8081`
- Verify the model path is correct
- Ensure the TurboQuant branch is checked out
### Poor Generation Quality

- Try `turbo3` instead of `turbo4`
- Disable per-layer adaptive quantization (mode 0)
- Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`
### High Memory Usage

- Reduce the context length: `-c 65536` (64K)
- Check that `TURBO_LAYER_ADAPTIVE` is set
- Monitor with: `vmmap --summary $(pgrep llama-server)`