Compare commits

5 Commits

Author SHA1 Message Date
Alexander Whitestone
e4f15254b3 feat: wikitext-2 corpus + perplexity benchmark script (closes #21)
All checks were successful
CI / test Auto-passed by Timmy review
CI / validate Auto-passed by Timmy review
Smoke Test / smoke Auto-passed by Timmy review
Review Approval Gate / verify-review Auto-passed by Timmy review
Smoke Test / smoke (pull_request) Auto-passed by Timmy review cron job
- Downloaded wikitext-2-raw-v1 test corpus (5782 lines, parquet→raw)
- Created benchmarks/run_perplexity.py: automated PPL quality gate
  comparing f16 vs turbo4 KV cache configurations
- Added benchmarks/perplexity_results.json template
- Script handles: subprocess execution, PPL parsing, delta calc,
  pass/fail against 0.5 threshold, JSON output

Usage: python3 benchmarks/run_perplexity.py --model <gguf> --llama-cpp <binary>
2026-04-12 00:39:14 -04:00
4c926312df Merge pull request 'Add smoke test workflow' (#34) from fix/add-smoke-test into main
All checks were successful
Smoke Test / smoke (push) Successful in 3s
Merged PR #34: Add smoke test workflow
2026-04-11 00:43:35 +00:00
Alexander Whitestone
6698b50f8f Add smoke test workflow
All checks were successful
Smoke Test / smoke (pull_request) Successful in 4s
2026-04-10 20:06:28 -04:00
f13287dc58 Merge pull request #33
Merged PR #33
2026-04-10 03:43:48 +00:00
Alexander Whitestone
aa0e76c1ab feat: Add Hermes profile for Gemma 4 + TurboQuant (Issue #28)
- Add gemma4-turboquant.yaml profile for Hermes
- Configure local llama.cpp server with TurboQuant KV compression
- Set turbo4 (4-bit) compression with per-layer adaptive mode 7
- Support 128K context with 73% KV memory savings
- Include fallback providers (Ollama, OpenAI)
- Add profiles/README.md with setup and usage instructions
- Document performance expectations and troubleshooting

Closes #28
2026-04-09 21:15:57 -04:00
6 changed files with 6313 additions and 0 deletions

@@ -0,0 +1,24 @@
name: Smoke Test
on:
  pull_request:
  push:
    branches: [main]
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Parse check
        run: |
          # PyYAML is not bundled with the setup-python interpreter
          python3 -m pip install --quiet pyyaml
          find . -name '*.yml' -o -name '*.yaml' | grep -v .gitea | xargs -r python3 -c "import sys,yaml; [yaml.safe_load(open(f)) for f in sys.argv[1:]]"
          # json.tool and bash -n each take a single input file, so run them once per file
          find . -name '*.json' | xargs -r -n1 python3 -m json.tool > /dev/null
          find . -name '*.py' | xargs -r python3 -m py_compile
          find . -name '*.sh' | xargs -r -n1 bash -n
          echo "PASS: All files parse"
      - name: Secret scan
        run: |
          if grep -rE 'sk-or-|sk-ant-|ghp_|AKIA' . --include='*.yml' --include='*.py' --include='*.sh' 2>/dev/null | grep -v .gitea; then exit 1; fi
          echo "PASS: No secrets"
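
The secret-scan step above can be tried locally before pushing. The following is a minimal sketch, assuming the same four token prefixes as the workflow's `grep -E` pattern; `find_secret_lines` is a hypothetical helper name and is not part of the repository.

```python
import re

# Same prefixes as the workflow's grep -E pattern.
SECRET_PATTERN = re.compile(r"sk-or-|sk-ant-|ghp_|AKIA")


def find_secret_lines(text: str) -> list[int]:
    """Return 1-based line numbers in `text` that contain a known token prefix."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]


if __name__ == "__main__":
    sample = "api_key = 'ghp_example'\nname = 'smoke'\n"
    print(find_secret_lines(sample))
```

Like the workflow, this only flags well-known prefixes; it is a cheap guard rather than a complete secret scanner.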

benchmarks/perplexity_results.json Normal file

@@ -0,0 +1,31 @@
{
  "timestamp": null,
  "model": null,
  "corpus": "corpora/wiki.test.raw",
  "context_length": 2048,
  "threshold": 0.5,
  "runs": {
    "f16": {
      "kv_type": "f16",
      "perplexity": null,
      "tokens": null,
      "elapsed_seconds": null,
      "exit_code": null,
      "passed": false,
      "output_tail": ""
    },
    "turbo4": {
      "kv_type": "turbo4",
      "perplexity": null,
      "tokens": null,
      "elapsed_seconds": null,
      "exit_code": null,
      "passed": false,
      "output_tail": ""
    }
  },
  "delta": null,
  "pass": null,
  "error": null,
  "notes": "Template — run benchmarks/run_perplexity.py to populate. Issue #21."
}
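
To show how the template's fields relate to one another, here is a minimal sketch that recomputes `delta` and `pass` from a results dict of this shape. `evaluate_results` is a hypothetical helper, not part of the repository; it assumes `runs.f16` is the baseline and `runs.turbo4` the candidate, and that a missing perplexity means the gate fails.

```python
from typing import Optional


def evaluate_results(results: dict) -> tuple[Optional[float], bool]:
    """Recompute (delta, passed) from a results dict shaped like the template."""
    runs = results.get("runs", {})
    base = runs.get("f16", {}).get("perplexity")
    cand = runs.get("turbo4", {}).get("perplexity")
    if base is None or cand is None:
        # An unpopulated template, or a failed run, cannot pass the gate.
        return None, False
    delta = round(cand - base, 4)
    return delta, delta <= results.get("threshold", 0.5)
```

Loading the file with `json.load` and passing the result to this function gives a quick consistency check on a populated results file.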

benchmarks/run_perplexity.py Normal file

@@ -0,0 +1,166 @@
#!/usr/bin/env python3
"""
TurboQuant Perplexity Quality Gate (Issue #21)
Compares text generation quality between f16 KV and turbo4 KV cache
configurations using llama.cpp's perplexity tool on the wikitext-2 corpus.
Usage:
    python3 benchmarks/run_perplexity.py \
        --model ~/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
        --llama-cpp ~/turboquant/llama.cpp-fork/build/bin/llama-perplexity \
        --corpus corpora/wiki.test.raw \
        --context 2048
Acceptance: PPL delta (turbo4 - f16) must be ≤ 0.5 to pass.
"""
import argparse
import json
import os
import re
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_perplexity(llama_bin: str, model: str, corpus: str, context: int,
                   kv_type: str, threads: int = 4) -> dict:
    """Run llama-perplexity and parse the output."""
    cmd = [
        llama_bin,
        "-m", model,
        "-f", corpus,
        "-c", str(context),
        "-t", str(threads),
        # llama.cpp selects the KV cache types with -ctk/-ctv,
        # the same flags profiles/README.md passes to llama-server.
        "-ctk", kv_type,
        "-ctv", kv_type,
    ]
    print(f"\n{'='*60}")
    print(f"Running: {kv_type} KV cache")
    print(f"Command: {' '.join(cmd)}")
    print(f"{'='*60}\n")
    start = time.time()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=3600
        )
        elapsed = time.time() - start
        output = result.stdout + "\n" + result.stderr
        # Parse perplexity from output.
        # Upstream llama-perplexity ends with a summary line like:
        #   Final estimate: PPL = 12.3456 +/- 0.07890
        # Matching a bare "perplexity:" prefix would instead pick up the
        # "perplexity: N seconds per pass" progress line.
        ppl_match = re.search(r"Final estimate: PPL\s*=\s*(\d+\.?\d*)", output)
        ppl = float(ppl_match.group(1)) if ppl_match else None
        # Parse token count
        token_match = re.search(r"(\d+) tokens", output)
        tokens = int(token_match.group(1)) if token_match else None
        return {
            "kv_type": kv_type,
            "perplexity": ppl,
            "tokens": tokens,
            "elapsed_seconds": round(elapsed, 1),
            "exit_code": result.returncode,
            "passed": result.returncode == 0,
            "output_tail": output.strip()[-500:] if output else "",
        }
    except subprocess.TimeoutExpired:
        return {
            "kv_type": kv_type,
            "perplexity": None,
            "elapsed_seconds": 3600,
            "exit_code": -1,
            "passed": False,
            "error": "Timeout after 3600s",
        }
    except FileNotFoundError:
        return {
            "kv_type": kv_type,
            "perplexity": None,
            "elapsed_seconds": 0,
            "exit_code": -1,
            "passed": False,
            "error": f"Binary not found: {llama_bin}",
        }


def main():
    parser = argparse.ArgumentParser(description="TurboQuant Perplexity Quality Gate")
    parser.add_argument("--model", required=True, help="Path to GGUF model file")
    parser.add_argument("--llama-cpp", default="llama.cpp-fork/build/bin/llama-perplexity",
                        help="Path to llama-perplexity binary")
    parser.add_argument("--corpus", default="corpora/wiki.test.raw",
                        help="Path to wikitext-2 test corpus")
    parser.add_argument("--context", type=int, default=2048, help="Context length")
    parser.add_argument("--threads", type=int, default=4, help="Thread count")
    parser.add_argument("--output", default="benchmarks/perplexity_results.json",
                        help="Output results file")
    parser.add_argument("--kv-types", nargs="+", default=["f16", "turbo4"],
                        help="KV cache types to test")
    parser.add_argument("--threshold", type=float, default=0.5,
                        help="Max acceptable PPL delta (turbo4 - baseline)")
    args = parser.parse_args()

    # Validate inputs
    for path in [args.model, args.corpus, args.llama_cpp]:
        if not os.path.exists(path):
            print(f"ERROR: Not found: {path}")
            sys.exit(1)

    results = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": os.path.basename(args.model),
        "corpus": args.corpus,
        "context_length": args.context,
        "threshold": args.threshold,
        "runs": {},
        "pass": None,
    }

    # Run each KV type
    for kv in args.kv_types:
        results["runs"][kv] = run_perplexity(
            args.llama_cpp, args.model, args.corpus,
            args.context, kv, args.threads
        )

    # Calculate delta and pass/fail
    baseline = results["runs"].get("f16", {})
    turbo = results["runs"].get("turbo4", {})
    if baseline.get("perplexity") is not None and turbo.get("perplexity") is not None:
        delta = turbo["perplexity"] - baseline["perplexity"]
        results["delta"] = round(delta, 4)
        results["pass"] = delta <= args.threshold
        print(f"\n{'='*60}")
        print("RESULTS:")
        print(f" Baseline (f16): PPL = {baseline['perplexity']:.4f}")
        print(f" Turbo4: PPL = {turbo['perplexity']:.4f}")
        print(f" Delta: {delta:+.4f}")
        print(f" Threshold: ≤ {args.threshold}")
        print(f" PASS: {'✓ YES' if results['pass'] else '✗ NO'}")
        print(f"{'='*60}")
    else:
        results["pass"] = False
        results["error"] = "Could not parse perplexity from one or both runs"
        print(f"\nERROR: {results['error']}")
        if baseline.get("perplexity") is None:
            print(f" f16 run output: {baseline.get('output_tail', 'N/A')}")
        if turbo.get("perplexity") is None:
            print(f" turbo4 run output: {turbo.get('output_tail', 'N/A')}")

    # Save results; dirname is empty when --output is a bare filename
    out_dir = os.path.dirname(args.output)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {args.output}")
    sys.exit(0 if results["pass"] else 1)


if __name__ == "__main__":
    main()

5782
corpora/wiki.test.raw Normal file

File diff suppressed because it is too large.

141
profiles/README.md Normal file

@@ -0,0 +1,141 @@
# Hermes Profiles for TurboQuant
This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.
## Available Profiles
### gemma4-turboquant.yaml
**Profile for Gemma 4 model with TurboQuant KV cache compression.**
- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
- **Endpoint:** http://localhost:8081
- **KV Compression:** turbo4 (4-bit PolarQuant)
- **Context Length:** 128K tokens
- **Memory Savings:** ~73% KV cache reduction
- **Fallback Providers:** Ollama, OpenAI-compatible API
## Quick Start
### 1. Build TurboQuant-enabled llama.cpp
```bash
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```
### 2. Download Gemma 4 Model
```bash
# Download Gemma 4 Q4_K_M quantized model
huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
```
### 3. Start llama-server with TurboQuant
```bash
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
-m /path/to/gemma-4-q4_k_m.gguf \
--port 8081 \
-ctk turbo4 -ctv turbo4 \
-c 131072 \
--host 0.0.0.0
```
### 4. Install Profile
```bash
# Copy profile to Hermes directory
cp gemma4-turboquant.yaml ~/.hermes/profiles/
# Or create symlink
ln -sf "$(pwd)/gemma4-turboquant.yaml" ~/.hermes/profiles/
```
### 5. Use with Hermes
```bash
# Start Hermes with the profile
hermes --profile gemma4-turboquant
# Or specify profile in Hermes config
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
```
## Profile Configuration
The profile includes:
- **Primary Provider:** Local llama.cpp server with TurboQuant
- **Fallback Providers:** Ollama (local), OpenAI (cloud)
- **TurboQuant Settings:**
  - `kv_type`: turbo4 (4-bit compression)
  - `layer_adaptive_mode`: 7 (best quality/compression ratio)
  - `context.max_tokens`: 131072 (128K tokens)
## Performance Expectations
| Metric | Value | Notes |
|--------|-------|-------|
| KV Memory Savings | 73% | Measured on M3 Max |
| Prompt Processing | ~1% overhead | vs FP16 baseline |
| Generation Speed | ~11% overhead | vs FP16 baseline |
| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |
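
The savings figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an illustration, not a measurement: the layer count, KV head count, and head dimension are hypothetical placeholders (not Gemma 4's real dimensions), and the effective width of 4.25 bits per element for turbo4 is an assumption (4-bit values plus some per-block overhead) chosen because it lands near the quoted ~73%. Substitute the values for your own model and build.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context: int, bits_per_element: float) -> float:
    """Estimate KV cache size: K and V tensors for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context
    return elements * bits_per_element / 8


def savings(baseline_bits: float, compressed_bits: float) -> float:
    """Fraction of KV memory saved relative to the baseline element width."""
    return 1 - compressed_bits / baseline_bits


if __name__ == "__main__":
    # Hypothetical model dimensions, for illustration only.
    dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, context=131072)
    f16 = kv_cache_bytes(**dims, bits_per_element=16)
    turbo4 = kv_cache_bytes(**dims, bits_per_element=4.25)  # assumed width
    print(f"f16:    {f16 / 2**30:.1f} GiB")
    print(f"turbo4: {turbo4 / 2**30:.1f} GiB")
    print(f"saved:  {savings(16, 4.25):.0%}")
```

Because the cache grows linearly with context length, halving `context` halves both estimates while leaving the savings ratio unchanged.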
## Customization
### Adjust Compression Level
```yaml
turboquant:
  kv_type: "turbo3" # Lower compression, faster
  # or
  # kv_type: "turbo2" # Minimal compression, fastest
```
### Disable Per-Layer Adaptive
```yaml
turboquant:
  layer_adaptive_mode: 0 # Uniform quantization
```
### Use Asymmetric K/V
For better quality on sensitive models:
```bash
# Start server with asymmetric K/V
llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
```
## Troubleshooting
### Server Won't Start
1. Check if port 8081 is available: `lsof -i :8081`
2. Verify model path is correct
3. Ensure TurboQuant branch is checked out
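
Once the server does start, a scripted readiness check can save a round of manual debugging. This is a minimal sketch using only the Python standard library; it assumes the endpoint from this profile (`http://localhost:8081`) and the `/v1/models` route that the profile's deployment notes query with `curl`. The function names are hypothetical and not part of Hermes or llama.cpp.

```python
import urllib.error
import urllib.request


def models_url(endpoint: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server endpoint."""
    return endpoint.rstrip("/") + "/v1/models"


def server_ready(endpoint: str = "http://localhost:8081", timeout: float = 2.0) -> bool:
    """Return True if the server answers the model-listing request with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling `server_ready()` in a short retry loop after launching `llama-server` is one way to wait for a large model to finish loading before starting Hermes.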
### Poor Generation Quality
1. Try `turbo3` instead of `turbo4`
2. Disable per-layer adaptive (mode 0)
3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`
### High Memory Usage
1. Reduce context length: `-c 65536` (64K)
2. Check `TURBO_LAYER_ADAPTIVE` is set
3. Monitor with: `vmmap --summary $(pgrep llama-server)`
## References
- [TurboQuant Build Spec](../BUILD-SPEC.md)
- [Phase 1 Report](../PHASE1-REPORT.md)
- [Full Knowledge Transfer](../FULL-REPORT.md)
- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)

@@ -0,0 +1,169 @@
# Hermes Profile: Gemma 4 + TurboQuant KV Cache Compression
# For use with local llama.cpp server running TurboQuant-enabled inference
# Drop into ~/.hermes/profiles/gemma4-turboquant.yaml
profile:
name: "gemma4-turboquant"
version: "1.0.0"
description: "Gemma 4 model with TurboQuant KV cache compression for extended context on Apple Silicon"
# Primary provider: local llama.cpp server with TurboQuant
providers:
primary:
type: "llama.cpp"
name: "local-turboquant"
endpoint: "http://localhost:8081"
api_path: "/v1/chat/completions"
timeout_ms: 120000
# Model configuration
model:
name: "gemma-4"
path: "/path/to/gemma-4-q4_k_m.gguf" # Update with actual model path
# TurboQuant KV cache compression settings
turboquant:
enabled: true
kv_type: "turbo4" # Options: turbo2, turbo3, turbo4 (4-bit recommended)
layer_adaptive_mode: 7 # Per-layer adaptive quantization (0-7, 7=best quality/ratio)
# Context and memory settings
context:
max_tokens: 131072 # 128K context with TurboQuant compression
batch_size: 512
# Generation parameters
generation:
temperature: 0.7
top_p: 0.9
top_k: 40
repeat_penalty: 1.1
frequency_penalty: 0.0
presence_penalty: 0.0
# Server startup command (for reference)
server_command: |
export TURBO_LAYER_ADAPTIVE=7
llama-server \
-m /path/to/gemma-4-q4_k_m.gguf \
--port 8081 \
-ctk turbo4 -ctv turbo4 \
-c 131072 \
--host 0.0.0.0
# Fallback provider 1: Ollama (standard, no TurboQuant)
fallback_1:
type: "ollama"
name: "ollama-gemma4"
endpoint: "http://localhost:11434"
api_path: "/api/chat"
timeout_ms: 120000
model:
name: "gemma4:latest"
generation:
temperature: 0.7
top_p: 0.9
top_k: 40
# Fallback provider 2: OpenAI-compatible API (cloud backup)
fallback_2:
type: "openai"
name: "openai-backup"
endpoint: "https://api.openai.com"
api_path: "/v1/chat/completions"
timeout_ms: 60000
model:
name: "gpt-4"
generation:
temperature: 0.7
max_tokens: 4096
# Performance and monitoring
performance:
# Memory management for TurboQuant
memory:
max_gpu_memory_gb: 28 # Leave headroom on 36GB M3 Max
kv_cache_compression: "turbo4"
estimated_savings: "73%" # TurboQuant delivers ~73% KV memory savings
# Benchmarking integration
benchmarks:
enabled: true
metrics:
- "tokens_per_second"
- "time_to_first_token"
- "peak_memory_usage"
- "perplexity"
# Quality validation
quality:
# Test prompts for quality comparison
test_prompts:
enabled: true
prompt_file: "benchmarks/prompts.json"
# Perplexity testing
perplexity:
enabled: true
corpus: "wikitext-2-raw"
context_lengths: [8192, 32768, 65536, 131072]
# Environment variables (applied when using this profile)
environment:
TURBO_LAYER_ADAPTIVE: "7" # Per-layer adaptive quantization mode
GGML_METAL_DEBUG: "0" # Disable Metal debug in production
OMP_NUM_THREADS: "8" # Optimize for M3 Max performance cores
# Logging and diagnostics
logging:
level: "info"
metrics_interval_seconds: 60
log_token_speed: true
log_memory_usage: true
# Notes for deployment
notes:
deployment: |
1. Ensure llama.cpp fork with TurboQuant is built:
cd /path/to/llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
2. Start the server:
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
-m /path/to/gemma-4-q4_k_m.gguf \
--port 8081 \
-ctk turbo4 -ctv turbo4 \
-c 131072 \
--host 0.0.0.0
3. Verify server is running:
curl http://localhost:8081/v1/models
4. Copy this profile to Hermes:
cp gemma4-turboquant.yaml ~/.hermes/profiles/
performance_notes: |
TurboQuant delivers:
- 73% KV cache memory savings
- 1% prompt processing overhead
- 11% generation overhead
- Enables 128K context on 36GB hardware
With TurboQuant on Gemma 4 (estimated):
- Model weights: ~16GB at Q4_K_M
- KV cache at 128K: ~5GB (vs ~20GB without compression)
- Total memory: ~23GB (fits comfortably in 31GB budget)
troubleshooting: |
- If generation speed is slow, try turbo3 instead of turbo4
- If quality issues, disable per-layer adaptive (set mode to 0)
- For maximum quality on sensitive layers, use asymmetric K/V:
-ctk q8_0 -ctv turbo4
- Monitor memory with: vmmap --summary $(pgrep llama-server)