
Hermes Profiles for TurboQuant

This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.

Available Profiles

gemma4-turboquant.yaml

Profile for Gemma 4 model with TurboQuant KV cache compression.

  • Primary Provider: Local llama.cpp server with TurboQuant enabled
  • Endpoint: http://localhost:8081
  • KV Compression: turbo4 (4-bit PolarQuant)
  • Context Length: 128K tokens
  • Memory Savings: ~73% KV cache reduction
  • Fallback Providers: Ollama, OpenAI-compatible API

Quick Start

1. Build TurboQuant-enabled llama.cpp

# Clone the TurboQuant fork and check out the KV-cache branch
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# Build with Metal acceleration (macOS); drop -DGGML_METAL=ON on other platforms
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
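
To confirm the build succeeded, run the binary once; this assumes the fork keeps upstream llama.cpp's --version flag:

# Print build info and exit
./build/bin/llama-server --version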

2. Download Gemma 4 Model

# Download Gemma 4 Q4_K_M quantized model
huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
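
If you want the file in a predictable location rather than the Hugging Face cache, huggingface-cli supports --local-dir (the repo id remains a placeholder here):

# Download into ./models instead of the HF cache
huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf --local-dir models/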

3. Start llama-server with TurboQuant

# Mode 7 = per-layer adaptive quantization (best quality/compression ratio)
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
  -m /path/to/gemma-4-q4_k_m.gguf \
  --port 8081 \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --host 0.0.0.0
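
Once the server reports it is ready, a minimal request exercises TurboQuant end to end; this assumes the fork keeps upstream llama-server's OpenAI-compatible API:

# Minimal chat completion against the local server
curl -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 16}'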

4. Install Profile

# Copy profile to Hermes directory
cp gemma4-turboquant.yaml ~/.hermes/profiles/

# Or create symlink
ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/

5. Use with Hermes

# Start Hermes with the profile
hermes --profile gemma4-turboquant

# Or specify profile in Hermes config
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml

Profile Configuration

The profile includes:

  • Primary Provider: Local llama.cpp server with TurboQuant
  • Fallback Providers: Ollama (local), OpenAI (cloud)
  • TurboQuant Settings:
    • kv_type: turbo4 (4-bit compression)
    • layer_adaptive_mode: 7 (best quality/compression ratio)
    • max_context: 128K tokens
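
For orientation, here is a hedged sketch of the profile's shape. Only kv_type, layer_adaptive_mode, and max_context are taken from this README; the provider block and remaining field names are illustrative assumptions, so treat gemma4-turboquant.yaml itself as authoritative:

# Hypothetical outline of gemma4-turboquant.yaml (field names are illustrative)
providers:
  - name: llama-cpp-turboquant    # primary: local llama.cpp server
    endpoint: http://localhost:8081
  - name: ollama                  # fallback: local Ollama
  - name: openai                  # fallback: OpenAI-compatible cloud API
turboquant:
  kv_type: "turbo4"               # 4-bit PolarQuant compression
  layer_adaptive_mode: 7          # best quality/compression ratio
  max_context: 131072             # 128K tokens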

Performance Expectations

Metric               Value          Notes
KV Memory Savings    73%            Measured on M3 Max
Prompt Processing    ~1% overhead   vs FP16 baseline
Generation Speed     ~11% overhead  vs FP16 baseline
Max Context (36 GB)  128K           Comfortable with 7.6 GB headroom

Customization

Adjust Compression Level

turboquant:
  kv_type: "turbo3"  # Lower compression, faster
  # or
  kv_type: "turbo2"  # Minimal compression, fastest

Disable Per-Layer Adaptive

turboquant:
  layer_adaptive_mode: 0  # Uniform quantization
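
The server-side counterpart is presumably the TURBO_LAYER_ADAPTIVE variable from step 3, given that its value there (7) matches layer_adaptive_mode; this is an inference from the naming, not documented behavior:

# Assumed server-side equivalent: uniform quantization across layers
export TURBO_LAYER_ADAPTIVE=0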

Use Asymmetric K/V

For better output quality on sensitive models, keep keys at higher precision than values:

# Start server with asymmetric K/V
./build/bin/llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072

Troubleshooting

Server Won't Start

  1. Check whether port 8081 is already in use: lsof -i :8081
  2. Verify that the model path is correct
  3. Ensure the feature/turboquant-kv-cache branch is checked out
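
If the process appears to start but requests fail, probing the server directly narrows it down; upstream llama-server exposes a /health endpoint (assuming this fork keeps it):

# Should report an OK status once the model has loaded
curl -s http://localhost:8081/health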

Poor Generation Quality

  1. Try turbo3 instead of turbo4
  2. Disable per-layer adaptive (mode 0)
  3. Use asymmetric K/V: -ctk q8_0 -ctv turbo4

High Memory Usage

  1. Reduce context length: -c 65536 (64K)
  2. Check TURBO_LAYER_ADAPTIVE is set
  3. Monitor with: vmmap --summary $(pgrep llama-server)

References