feat: Add Hermes profile for Gemma 4 + TurboQuant (Issue #28) #33
141
profiles/README.md
Normal file
141
profiles/README.md
Normal file
@@ -0,0 +1,141 @@
|
||||
# Hermes Profiles for TurboQuant
|
||||
|
||||
This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.
|
||||
|
||||
## Available Profiles
|
||||
|
||||
### gemma4-turboquant.yaml
|
||||
|
||||
**Profile for Gemma 4 model with TurboQuant KV cache compression.**
|
||||
|
||||
- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
|
||||
- **Endpoint:** http://localhost:8081
|
||||
- **KV Compression:** turbo4 (4-bit PolarQuant)
|
||||
- **Context Length:** 128K tokens
|
||||
- **Memory Savings:** ~73% KV cache reduction
|
||||
- **Fallback Providers:** Ollama, OpenAI-compatible API
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Build TurboQuant-enabled llama.cpp
|
||||
|
||||
```bash
|
||||
git clone https://github.com/TheTom/llama-cpp-turboquant.git
|
||||
cd llama-cpp-turboquant
|
||||
git checkout feature/turboquant-kv-cache
|
||||
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build -j$(sysctl -n hw.ncpu)
|
||||
```
|
||||
|
||||
### 2. Download Gemma 4 Model
|
||||
|
||||
```bash
|
||||
# Download Gemma 4 Q4_K_M quantized model
|
||||
huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
|
||||
```
|
||||
|
||||
### 3. Start llama-server with TurboQuant
|
||||
|
||||
```bash
|
||||
export TURBO_LAYER_ADAPTIVE=7
|
||||
./build/bin/llama-server \
|
||||
-m /path/to/gemma-4-q4_k_m.gguf \
|
||||
--port 8081 \
|
||||
-ctk turbo4 -ctv turbo4 \
|
||||
-c 131072 \
|
||||
--host 0.0.0.0
|
||||
```
|
||||
|
||||
### 4. Install Profile
|
||||
|
||||
```bash
|
||||
# Copy profile to Hermes directory
|
||||
cp gemma4-turboquant.yaml ~/.hermes/profiles/
|
||||
|
||||
# Or create symlink
|
||||
ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/
|
||||
```
|
||||
|
||||
### 5. Use with Hermes
|
||||
|
||||
```bash
|
||||
# Start Hermes with the profile
|
||||
hermes --profile gemma4-turboquant
|
||||
|
||||
# Or specify profile in Hermes config
|
||||
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
|
||||
```
|
||||
|
||||
## Profile Configuration
|
||||
|
||||
The profile includes:
|
||||
|
||||
- **Primary Provider:** Local llama.cpp server with TurboQuant
|
||||
- **Fallback Providers:** Ollama (local), OpenAI (cloud)
|
||||
- **TurboQuant Settings:**
|
||||
- `kv_type`: turbo4 (4-bit compression)
|
||||
- `layer_adaptive_mode`: 7 (best quality/compression ratio)
|
||||
- `max_context`: 128K tokens
|
||||
|
||||
## Performance Expectations
|
||||
|
||||
| Metric | Value | Notes |
|
||||
|--------|-------|-------|
|
||||
| KV Memory Savings | 73% | Measured on M3 Max |
|
||||
| Prompt Processing | ~1% overhead | vs FP16 baseline |
|
||||
| Generation Speed | ~11% overhead | vs FP16 baseline |
|
||||
| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |
|
||||
|
||||
## Customization
|
||||
|
||||
### Adjust Compression Level
|
||||
|
||||
```yaml
|
||||
turboquant:
|
||||
kv_type: "turbo3" # Lower compression, faster
|
||||
# or
|
||||
kv_type: "turbo2" # Minimal compression, fastest
|
||||
```
|
||||
|
||||
### Disable Per-Layer Adaptive
|
||||
|
||||
```yaml
|
||||
turboquant:
|
||||
layer_adaptive_mode: 0 # Uniform quantization
|
||||
```
|
||||
|
||||
### Use Asymmetric K/V
|
||||
|
||||
For better quality on sensitive models:
|
||||
|
||||
```bash
|
||||
# Start server with asymmetric K/V
|
||||
llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Server Won't Start
|
||||
|
||||
1. Check if port 8081 is available: `lsof -i :8081`
|
||||
2. Verify model path is correct
|
||||
3. Ensure TurboQuant branch is checked out
|
||||
|
||||
### Poor Generation Quality
|
||||
|
||||
1. Try `turbo3` instead of `turbo4`
|
||||
2. Disable per-layer adaptive (mode 0)
|
||||
3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`
|
||||
|
||||
### High Memory Usage
|
||||
|
||||
1. Reduce context length: `-c 65536` (64K)
|
||||
2. Check `TURBO_LAYER_ADAPTIVE` is set
|
||||
3. Monitor with: `vmmap --summary $(pgrep llama-server)`
|
||||
|
||||
## References
|
||||
|
||||
- [TurboQuant Build Spec](../BUILD-SPEC.md)
|
||||
- [Phase 1 Report](../PHASE1-REPORT.md)
|
||||
- [Full Knowledge Transfer](../FULL-REPORT.md)
|
||||
- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)
|
||||
169
profiles/hermes-profile-gemma4-turboquant.yaml
Normal file
169
profiles/hermes-profile-gemma4-turboquant.yaml
Normal file
@@ -0,0 +1,169 @@
|
||||
# Hermes Profile: Gemma 4 + TurboQuant KV Cache Compression
|
||||
# For use with local llama.cpp server running TurboQuant-enabled inference
|
||||
# Drop into ~/.hermes/profiles/gemma4-turboquant.yaml
|
||||
|
||||
profile:
|
||||
name: "gemma4-turboquant"
|
||||
version: "1.0.0"
|
||||
description: "Gemma 4 model with TurboQuant KV cache compression for extended context on Apple Silicon"
|
||||
|
||||
# Primary provider: local llama.cpp server with TurboQuant
|
||||
providers:
|
||||
primary:
|
||||
type: "llama.cpp"
|
||||
name: "local-turboquant"
|
||||
endpoint: "http://localhost:8081"
|
||||
api_path: "/v1/chat/completions"
|
||||
timeout_ms: 120000
|
||||
|
||||
# Model configuration
|
||||
model:
|
||||
name: "gemma-4"
|
||||
path: "/path/to/gemma-4-q4_k_m.gguf" # Update with actual model path
|
||||
|
||||
# TurboQuant KV cache compression settings
|
||||
turboquant:
|
||||
enabled: true
|
||||
kv_type: "turbo4" # Options: turbo2, turbo3, turbo4 (4-bit recommended)
|
||||
layer_adaptive_mode: 7 # Per-layer adaptive quantization (0-7, 7=best quality/ratio)
|
||||
|
||||
# Context and memory settings
|
||||
context:
|
||||
max_tokens: 131072 # 128K context with TurboQuant compression
|
||||
batch_size: 512
|
||||
|
||||
# Generation parameters
|
||||
generation:
|
||||
temperature: 0.7
|
||||
top_p: 0.9
|
||||
top_k: 40
|
||||
repeat_penalty: 1.1
|
||||
frequency_penalty: 0.0
|
||||
presence_penalty: 0.0
|
||||
|
||||
# Server startup command (for reference)
|
||||
server_command: |
|
||||
export TURBO_LAYER_ADAPTIVE=7
|
||||
llama-server \
|
||||
-m /path/to/gemma-4-q4_k_m.gguf \
|
||||
--port 8081 \
|
||||
-ctk turbo4 -ctv turbo4 \
|
||||
-c 131072 \
|
||||
--host 0.0.0.0
|
||||
|
||||
# Fallback provider 1: Ollama (standard, no TurboQuant)
|
||||
fallback_1:
|
||||
type: "ollama"
|
||||
name: "ollama-gemma4"
|
||||
endpoint: "http://localhost:11434"
|
||||
api_path: "/api/chat"
|
||||
timeout_ms: 120000
|
||||
|
||||
model:
|
||||
name: "gemma4:latest"
|
||||
|
||||
generation:
|
||||
temperature: 0.7
|
||||
top_p: 0.9
|
||||
top_k: 40
|
||||
|
||||
# Fallback provider 2: OpenAI-compatible API (cloud backup)
|
||||
fallback_2:
|
||||
type: "openai"
|
||||
name: "openai-backup"
|
||||
endpoint: "https://api.openai.com"
|
||||
api_path: "/v1/chat/completions"
|
||||
timeout_ms: 60000
|
||||
|
||||
model:
|
||||
name: "gpt-4"
|
||||
|
||||
generation:
|
||||
temperature: 0.7
|
||||
max_tokens: 4096
|
||||
|
||||
# Performance and monitoring
|
||||
performance:
|
||||
# Memory management for TurboQuant
|
||||
memory:
|
||||
max_gpu_memory_gb: 28 # Leave headroom on 36GB M3 Max
|
||||
kv_cache_compression: "turbo4"
|
||||
estimated_savings: "73%" # TurboQuant delivers ~73% KV memory savings
|
||||
|
||||
# Benchmarking integration
|
||||
benchmarks:
|
||||
enabled: true
|
||||
metrics:
|
||||
- "tokens_per_second"
|
||||
- "time_to_first_token"
|
||||
- "peak_memory_usage"
|
||||
- "perplexity"
|
||||
|
||||
# Quality validation
|
||||
quality:
|
||||
# Test prompts for quality comparison
|
||||
test_prompts:
|
||||
enabled: true
|
||||
prompt_file: "benchmarks/prompts.json"
|
||||
|
||||
# Perplexity testing
|
||||
perplexity:
|
||||
enabled: true
|
||||
corpus: "wikitext-2-raw"
|
||||
context_lengths: [8192, 32768, 65536, 131072]
|
||||
|
||||
# Environment variables (applied when using this profile)
|
||||
environment:
|
||||
TURBO_LAYER_ADAPTIVE: "7" # Per-layer adaptive quantization mode
|
||||
GGML_METAL_DEBUG: "0" # Disable Metal debug in production
|
||||
OMP_NUM_THREADS: "8" # Optimize for M3 Max performance cores
|
||||
|
||||
# Logging and diagnostics
|
||||
logging:
|
||||
level: "info"
|
||||
metrics_interval_seconds: 60
|
||||
log_token_speed: true
|
||||
log_memory_usage: true
|
||||
|
||||
# Notes for deployment
|
||||
notes:
|
||||
deployment: |
|
||||
1. Ensure llama.cpp fork with TurboQuant is built:
|
||||
cd /path/to/llama-cpp-turboquant
|
||||
git checkout feature/turboquant-kv-cache
|
||||
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build -j$(sysctl -n hw.ncpu)
|
||||
|
||||
2. Start the server:
|
||||
export TURBO_LAYER_ADAPTIVE=7
|
||||
./build/bin/llama-server \
|
||||
-m /path/to/gemma-4-q4_k_m.gguf \
|
||||
--port 8081 \
|
||||
-ctk turbo4 -ctv turbo4 \
|
||||
-c 131072 \
|
||||
--host 0.0.0.0
|
||||
|
||||
3. Verify server is running:
|
||||
curl http://localhost:8081/v1/models
|
||||
|
||||
4. Copy this profile to Hermes:
|
||||
cp hermes-profile-gemma4-turboquant.yaml ~/.hermes/profiles/
|
||||
|
||||
performance_notes: |
|
||||
TurboQuant delivers:
|
||||
- 73% KV cache memory savings
|
||||
- 1% prompt processing overhead
|
||||
- 11% generation overhead
|
||||
- Enables 128K context on 36GB hardware
|
||||
|
||||
With TurboQuant on Gemma 4 (estimated):
|
||||
- Model weights: ~16GB at Q4_K_M
|
||||
- KV cache at 128K: ~5GB (vs ~20GB without compression)
|
||||
- Total memory: ~23GB (fits comfortably in 31GB budget)
|
||||
|
||||
troubleshooting: |
|
||||
- If generation speed is slow, try turbo3 instead of turbo4
|
||||
- If quality issues, disable per-layer adaptive (set mode to 0)
|
||||
- For maximum quality on sensitive layers, use asymmetric K/V:
|
||||
-ctk q8_0 -ctv turbo4
|
||||
- Monitor memory with: vmmap --summary $(pgrep llama-server)
|
||||
Reference in New Issue
Block a user