# Hermes Profiles for TurboQuant

This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.

## Available Profiles

### gemma4-turboquant.yaml

**Profile for the Gemma 4 model with TurboQuant KV cache compression.**

- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
- **Endpoint:** http://localhost:8081
- **KV Compression:** turbo4 (4-bit PolarQuant)
- **Context Length:** 128K tokens
- **Memory Savings:** ~73% KV cache reduction
- **Fallback Providers:** Ollama, OpenAI-compatible API

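The ~73% figure can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming illustrative (not confirmed) layer/head dimensions and 4 bits per cached element for turbo4:

```python
# KV cache sizing sketch. The layer/head counts below are ILLUSTRATIVE
# assumptions, not confirmed Gemma 4 dimensions.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    # K and V each store n_layers x n_kv_heads x head_dim x n_ctx elements.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bits_per_elem / 8

dims = dict(n_layers=46, n_kv_heads=8, head_dim=128, n_ctx=131072)
fp16 = kv_cache_bytes(**dims, bits_per_elem=16)
turbo4 = kv_cache_bytes(**dims, bits_per_elem=4)
print(f"raw savings: {1 - turbo4 / fp16:.0%}")  # → raw savings: 75%
```

Raw 4-bit storage gives 75%; a measured ~73% is consistent with that once per-block quantization scales are stored alongside the packed values.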
## Quick Start

### 1. Build TurboQuant-enabled llama.cpp

```bash
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```

### 2. Download Gemma 4 Model

```bash
# Download Gemma 4 Q4_K_M quantized model
huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
```

### 3. Start llama-server with TurboQuant

```bash
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
    -m /path/to/gemma-4-q4_k_m.gguf \
    --port 8081 \
    -ctk turbo4 -ctv turbo4 \
    -c 131072 \
    --host 0.0.0.0
```

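Once the server is up, it can be smoke-tested through its OpenAI-compatible endpoint. A stdlib-only sketch; the `/v1/chat/completions` route follows the standard llama-server API, but treat the exact payload fields as assumptions:

```python
# Minimal smoke test for the TurboQuant-backed server started above.
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8081"):
    # Standard OpenAI-style chat payload; field names are assumptions.
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_request("Say hello.")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```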
### 4. Install Profile

```bash
# Copy profile to Hermes directory
cp gemma4-turboquant.yaml ~/.hermes/profiles/

# Or create symlink
ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/
```

### 5. Use with Hermes

```bash
# Start Hermes with the profile
hermes --profile gemma4-turboquant

# Or specify profile in Hermes config
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
```

## Profile Configuration

The profile includes:

- **Primary Provider:** Local llama.cpp server with TurboQuant
- **Fallback Providers:** Ollama (local), OpenAI (cloud)
- **TurboQuant Settings:**
  - `kv_type`: turbo4 (4-bit compression)
  - `layer_adaptive_mode`: 7 (best quality/compression ratio)
  - `max_context`: 128K tokens

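The settings above can be pictured as plain data with a small validity check. A hedged sketch: key names mirror the bullet list, and the real Hermes YAML schema may differ.

```python
# Profile settings as a dict. Key names are ASSUMPTIONS drawn from the
# bullet list above, not the actual Hermes schema.
profile = {
    "provider": "llamacpp",
    "endpoint": "http://localhost:8081",
    "fallbacks": ["ollama", "openai"],
    "turboquant": {
        "kv_type": "turbo4",
        "layer_adaptive_mode": 7,
        "max_context": 131072,
    },
}

def validate(p):
    # Sanity-check the TurboQuant settings before handing them to a server.
    tq = p["turboquant"]
    assert tq["kv_type"] in {"turbo2", "turbo3", "turbo4"}
    assert isinstance(tq["layer_adaptive_mode"], int)
    assert tq["max_context"] > 0
    return True

print(validate(profile))  # → True
```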
## Performance Expectations

| Metric | Value | Notes |
|--------|-------|-------|
| KV Memory Savings | 73% | Measured on M3 Max |
| Prompt Processing | ~1% overhead | vs FP16 baseline |
| Generation Speed | ~11% overhead | vs FP16 baseline |
| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |

## Customization

### Adjust Compression Level

```yaml
turboquant:
  kv_type: "turbo3"  # Lower compression, faster
  # or
  kv_type: "turbo2"  # Minimal compression, fastest
```

### Disable Per-Layer Adaptive

```yaml
turboquant:
  layer_adaptive_mode: 0  # Uniform quantization
```

### Use Asymmetric K/V

For better quality on sensitive models:

```bash
# Start server with asymmetric K/V
llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
```

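The memory trade-off of mixing precisions is easy to quantify. A sketch, assuming q8_0 stores roughly 8 bits and turbo4 roughly 4 bits per element (ignoring per-block scale overhead):

```python
# Savings vs. a 16-bit FP16 baseline, averaged over the K and V caches.
def kv_savings(k_bits, v_bits, baseline_bits=16.0):
    return 1 - (k_bits + v_bits) / (2 * baseline_bits)

print(f"{kv_savings(8, 4):.1%}")  # q8_0 K / turbo4 V → 62.5%
print(f"{kv_savings(4, 4):.1%}")  # symmetric turbo4  → 75.0%
```

So the asymmetric configuration gives up roughly a third of the symmetric savings in exchange for a full-precision-quantized K cache.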
## Troubleshooting

### Server Won't Start

1. Check if port 8081 is available: `lsof -i :8081`
2. Verify the model path is correct
3. Ensure the TurboQuant branch is checked out

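The port check in step 1 can also be done programmatically. A small stdlib sketch, equivalent in spirit to `lsof -i :8081`:

```python
# Returns True if the port can be bound, i.e. nothing else holds it.
import socket

def port_is_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free(8081))  # True if no server is bound to 8081
```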
### Poor Generation Quality

1. Try `turbo3` instead of `turbo4`
2. Disable per-layer adaptive (mode 0)
3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`

### High Memory Usage

1. Reduce context length: `-c 65536` (64K)
2. Check `TURBO_LAYER_ADAPTIVE` is set
3. Monitor with: `vmmap --summary $(pgrep llama-server)`

## References

- [TurboQuant Build Spec](../BUILD-SPEC.md)
- [Phase 1 Report](../PHASE1-REPORT.md)
- [Full Knowledge Transfer](../FULL-REPORT.md)
- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)