feat: wikitext-2 corpus + perplexity benchmark script (closes #21)

- Downloaded wikitext-2-raw-v1 test corpus (5782 lines, parquet→raw)
- Created benchmarks/run_perplexity.py: automated PPL quality gate comparing f16 vs turbo4 KV cache configurations
- Added benchmarks/perplexity_results.json template
- Script handles: subprocess execution, PPL parsing, delta calc, pass/fail against 0.5 threshold, JSON output

Usage: python3 benchmarks/run_perplexity.py --model <gguf> --llama-cpp <binary>
6 changed files with 6313 additions and 0 deletions
--- a/.gitea/workflows/smoke.yml
+++ b/.gitea/workflows/smoke.yml
@@ -0,0 +1,24 @@
+name: Smoke Test
+on:
+  pull_request:
+  push:
+    branches: [main]
+jobs:
+  smoke:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+      - name: Parse check
+        run: |
+          find . -name '*.yml' -o -name '*.yaml' | grep -v .gitea | xargs -r python3 -c "import sys,yaml; [yaml.safe_load(open(f)) for f in sys.argv[1:]]"
+          find . -name '*.json' | xargs -r -n1 python3 -m json.tool > /dev/null
+          find . -name '*.py' | xargs -r python3 -m py_compile
+          find . -name '*.sh' | xargs -r -n1 bash -n
+          echo "PASS: All files parse"
+      - name: Secret scan
+        run: |
+          if grep -rE 'sk-or-|sk-ant-|ghp_|AKIA' . --include='*.yml' --include='*.py' --include='*.sh' 2>/dev/null | grep -v .gitea; then exit 1; fi
+          echo "PASS: No secrets"
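Review note on the parse check: `python3 -m json.tool` takes `[infile] [outfile]`, so it has to be run once per file. A minimal Python sketch of the JSON portion follows. The helper name `find_invalid_json` is hypothetical and not part of this change set; it parses every file on its own and never writes anything back.

```python
import json
from pathlib import Path


def find_invalid_json(root: str) -> list[str]:
    """Return sorted paths (relative to root) of *.json files that fail to parse."""
    bad = []
    base = Path(root)
    for path in sorted(base.rglob("*.json")):
        try:
            # Parse each file individually; the file is only read, never rewritten.
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(path.relative_to(base).as_posix())
    return bad
```

A CI step could call `find_invalid_json(".")` and exit non-zero when the returned list is not empty.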
--- a/benchmarks/perplexity_results.json
+++ b/benchmarks/perplexity_results.json
@@ -0,0 +1,31 @@
+{
+  "timestamp": null,
+  "model": null,
+  "corpus": "corpora/wiki.test.raw",
+  "context_length": 2048,
+  "threshold": 0.5,
+  "runs": {
+    "f16": {
+      "kv_type": "f16",
+      "perplexity": null,
+      "tokens": null,
+      "elapsed_seconds": null,
+      "exit_code": null,
+      "passed": false,
+      "output_tail": ""
+    },
+    "turbo4": {
+      "kv_type": "turbo4",
+      "perplexity": null,
+      "tokens": null,
+      "elapsed_seconds": null,
+      "exit_code": null,
+      "passed": false,
+      "output_tail": ""
+    }
+  },
+  "delta": null,
+  "pass": null,
+  "error": null,
+  "notes": "Template — run benchmarks/run_perplexity.py to populate. Issue #21."
+}
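Review note on the template: the error paths in benchmarks/run_perplexity.py (timeout, missing binary) return run records without the `tokens` and `output_tail` keys that this template lists. A minimal sketch of how a consumer could detect and normalize that, assuming the hypothetical helpers `missing_run_keys` and `normalize_run` (neither is in the change set):

```python
# Per-run keys as they appear in benchmarks/perplexity_results.json.
RUN_KEYS = ("kv_type", "perplexity", "tokens", "elapsed_seconds",
            "exit_code", "passed", "output_tail")


def missing_run_keys(run: dict) -> list[str]:
    """List the template keys that a run record does not contain."""
    return [key for key in RUN_KEYS if key not in run]


def normalize_run(run: dict) -> dict:
    """Return a copy of run with absent template keys set to the template defaults."""
    defaults = {key: None for key in RUN_KEYS}
    defaults["passed"] = False
    defaults["output_tail"] = ""
    # Values present in the record (including extras such as "error") win.
    return {**defaults, **run}
```

Normalizing on read keeps downstream tooling from special-casing failed runs, at the cost of hiding which keys the producer omitted.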
--- a/benchmarks/run_perplexity.py
+++ b/benchmarks/run_perplexity.py
@@ -0,0 +1,166 @@
+#!/usr/bin/env python3
+"""
+TurboQuant Perplexity Quality Gate (Issue #21)
+
+Compares text generation quality between f16 KV and turbo4 KV cache
+configurations using llama.cpp's perplexity tool on the wikitext-2 corpus.
+
+Usage:
+    python3 benchmarks/run_perplexity.py \
+        --model ~/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
+        --llama-cpp ~/turboquant/llama.cpp-fork/build/bin/llama-perplexity \
+        --corpus corpora/wiki.test.raw \
+        --context 2048
+
+Acceptance: PPL delta (turbo4 - f16) must be ≤ 0.5 to pass.
+"""
+
+import argparse
+import json
+import os
+import re
+import subprocess
+import sys
+import time
+from datetime import datetime, timezone
+
+
+def run_perplexity(llama_bin: str, model: str, corpus: str, context: int,
+                   kv_type: str, threads: int = 4) -> dict:
+    """Run llama-perplexity and parse the output."""
+    cmd = [
+        llama_bin,
+        "-m", model,
+        "-f", corpus,
+        "-c", str(context),
+        "-t", str(threads),
+        "-ctk", kv_type, "-ctv", kv_type,  # cache-type flags, as used for llama-server in profiles/README.md
+    ]
+    print(f"\n{'='*60}")
+    print(f"Running: {kv_type} KV cache")
+    print(f"Command: {' '.join(cmd)}")
+    print(f"{'='*60}\n")
+
+    start = time.time()
+    try:
+        result = subprocess.run(
+            cmd, capture_output=True, text=True, timeout=3600
+        )
+        elapsed = time.time() - start
+        output = result.stdout + "\n" + result.stderr
+
+        # Parse perplexity from output. This assumes the upstream summary line
+        # "Final estimate: PPL = 12.3456 +/- 0.07890"; a bare "perplexity: <n>"
+        # pattern would also match the "<n> seconds per pass" timing line.
+        ppl_match = re.search(r"Final estimate:\s*PPL\s*=\s*(\d+\.?\d*)", output)
+        ppl = float(ppl_match.group(1)) if ppl_match else None
+
+        # Parse token count
+        token_match = re.search(r"(\d+) tokens", output)
+        tokens = int(token_match.group(1)) if token_match else None
+
+        return {
+            "kv_type": kv_type,
+            "perplexity": ppl,
+            "tokens": tokens,
+            "elapsed_seconds": round(elapsed, 1),
+            "exit_code": result.returncode,
+            "passed": result.returncode == 0,
+            "output_tail": output.strip()[-500:] if output else "",
+        }
+    except subprocess.TimeoutExpired:
+        return {
+            "kv_type": kv_type,
+            "perplexity": None,
+            "elapsed_seconds": 3600,
+            "exit_code": -1,
+            "passed": False,
+            "error": "Timeout after 3600s",
+        }
+    except FileNotFoundError:
+        return {
+            "kv_type": kv_type,
+            "perplexity": None,
+            "elapsed_seconds": 0,
+            "exit_code": -1,
+            "passed": False,
+            "error": f"Binary not found: {llama_bin}",
+        }
+
+
+def main():
+    parser = argparse.ArgumentParser(description="TurboQuant Perplexity Quality Gate")
+    parser.add_argument("--model", required=True, help="Path to GGUF model file")
+    parser.add_argument("--llama-cpp", default="llama.cpp-fork/build/bin/llama-perplexity",
+                        help="Path to llama-perplexity binary")
+    parser.add_argument("--corpus", default="corpora/wiki.test.raw",
+                        help="Path to wikitext-2 test corpus")
+    parser.add_argument("--context", type=int, default=2048, help="Context length")
+    parser.add_argument("--threads", type=int, default=4, help="Thread count")
+    parser.add_argument("--output", default="benchmarks/perplexity_results.json",
+                        help="Output results file")
+    parser.add_argument("--kv-types", nargs="+", default=["f16", "turbo4"],
+                        help="KV cache types to test")
+    parser.add_argument("--threshold", type=float, default=0.5,
+                        help="Max acceptable PPL delta (turbo4 - baseline)")
+    args = parser.parse_args()
+
+    # Validate inputs
+    for path in [args.model, args.corpus, args.llama_cpp]:
+        if not os.path.exists(path):
+            print(f"ERROR: Not found: {path}")
+            sys.exit(1)
+
+    results = {
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "model": os.path.basename(args.model),
+        "corpus": args.corpus,
+        "context_length": args.context,
+        "threshold": args.threshold,
+        "runs": {},
+        "pass": None,
+    }
+
+    # Run each KV type
+    for kv in args.kv_types:
+        results["runs"][kv] = run_perplexity(
+            args.llama_cpp, args.model, args.corpus,
+            args.context, kv, args.threads
+        )
+
+    # Calculate delta and pass/fail
+    baseline = results["runs"].get("f16", {})
+    turbo = results["runs"].get("turbo4", {})
+
+    if baseline.get("perplexity") and turbo.get("perplexity"):
+        delta = turbo["perplexity"] - baseline["perplexity"]
+        results["delta"] = round(delta, 4)
+        results["pass"] = delta <= args.threshold
+        print(f"\n{'='*60}")
+        print("RESULTS:")
+        print(f"  Baseline (f16):    PPL = {baseline['perplexity']:.4f}")
+        print(f"  Turbo4:            PPL = {turbo['perplexity']:.4f}")
+        print(f"  Delta:                   {delta:+.4f}")
+        print(f"  Threshold:               ≤ {args.threshold}")
+        print(f"  PASS:                    {'✓ YES' if results['pass'] else '✗ NO'}")
+        print(f"{'='*60}")
+    else:
+        results["pass"] = False
+        results["error"] = "Could not parse perplexity from one or both runs"
+        print(f"\nERROR: {results['error']}")
+        if not baseline.get("perplexity"):
+            print(f"  f16 run output: {baseline.get('output_tail', 'N/A')}")
+        if not turbo.get("perplexity"):
+            print(f"  turbo4 run output: {turbo.get('output_tail', 'N/A')}")
+
+    # Save results
+    os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
+    with open(args.output, "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"\nResults saved to {args.output}")
+
+    sys.exit(0 if results["pass"] else 1)
+
+
+if __name__ == "__main__":
+    main()
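Review note on the gate logic: the parsing and the delta check can be separated into pure functions that are testable without running llama-perplexity. This is a sketch under assumptions, not the script's implementation. It assumes the tool ends its output with a line of the form `Final estimate: PPL = <value> +/- <error>`, and the names `parse_final_ppl` and `evaluate_gate` are hypothetical.

```python
import re
from typing import Optional

# Assumed summary-line format; adjust if the fork prints something different.
FINAL_PPL = re.compile(r"Final estimate:\s*PPL\s*=\s*(\d+(?:\.\d+)?)")


def parse_final_ppl(output: str) -> Optional[float]:
    """Extract the final perplexity estimate, or None if no summary line exists."""
    match = FINAL_PPL.search(output)
    return float(match.group(1)) if match else None


def evaluate_gate(baseline: Optional[float], candidate: Optional[float],
                  threshold: float = 0.5) -> dict:
    """Compute the PPL delta (candidate - baseline) and compare it to threshold."""
    # Explicit None checks: a missing value is an error, not a falsy number.
    if baseline is None or candidate is None:
        return {"delta": None, "pass": False,
                "error": "Could not parse perplexity from one or both runs"}
    delta = candidate - baseline
    return {"delta": round(delta, 4), "pass": delta <= threshold, "error": None}
```

Anchoring on the summary line means progress or timing lines that merely contain the word "perplexity" are never mistaken for the result.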
--- a/corpora/wiki.test.raw
+++ b/corpora/wiki.test.raw
--- a/profiles/README.md
+++ b/profiles/README.md
@@ -0,0 +1,141 @@
+# Hermes Profiles for TurboQuant
+
+This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.
+
+## Available Profiles
+
+### gemma4-turboquant.yaml
+
+**Profile for Gemma 4 model with TurboQuant KV cache compression.**
+
+- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
+- **Endpoint:** http://localhost:8081
+- **KV Compression:** turbo4 (4-bit PolarQuant)
+- **Context Length:** 128K tokens
+- **Memory Savings:** ~73% KV cache reduction
+- **Fallback Providers:** Ollama, OpenAI-compatible API
+
+## Quick Start
+
+### 1. Build TurboQuant-enabled llama.cpp
+
+```bash
+git clone https://github.com/TheTom/llama-cpp-turboquant.git
+cd llama-cpp-turboquant
+git checkout feature/turboquant-kv-cache
+cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(sysctl -n hw.ncpu)
+```
+
+### 2. Download Gemma 4 Model
+
+```bash
+# Download Gemma 4 Q4_K_M quantized model
+huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
+```
+
+### 3. Start llama-server with TurboQuant
+
+```bash
+export TURBO_LAYER_ADAPTIVE=7
+./build/bin/llama-server \
+  -m /path/to/gemma-4-q4_k_m.gguf \
+  --port 8081 \
+  -ctk turbo4 -ctv turbo4 \
+  -c 131072 \
+  --host 0.0.0.0
+```
+
+### 4. Install Profile
+
+```bash
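+# Copy profile to Hermes directory (installed name matches the profile name)
+cp hermes-profile-gemma4-turboquant.yaml ~/.hermes/profiles/gemma4-turboquant.yaml
+
+# Or create symlink
+ln -sf "$(pwd)/hermes-profile-gemma4-turboquant.yaml" ~/.hermes/profiles/gemma4-turboquant.yaml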
+```
+
+### 5. Use with Hermes
+
+```bash
+# Start Hermes with the profile
+hermes --profile gemma4-turboquant
+
+# Or specify profile in Hermes config
+echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
+```
+
+## Profile Configuration
+
+The profile includes:
+
+- **Primary Provider:** Local llama.cpp server with TurboQuant
+- **Fallback Providers:** Ollama (local), OpenAI (cloud)
+- **TurboQuant Settings:**
+  - `kv_type`: turbo4 (4-bit compression)
+  - `layer_adaptive_mode`: 7 (best quality/compression ratio)
+  - `max_context`: 128K tokens
+
+## Performance Expectations
+
+| Metric | Value | Notes |
+|--------|-------|-------|
+| KV Memory Savings | 73% | Measured on M3 Max |
+| Prompt Processing | ~1% overhead | vs FP16 baseline |
+| Generation Speed | ~11% overhead | vs FP16 baseline |
+| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |
+
+## Customization
+
+### Adjust Compression Level
+
+```yaml
+turboquant:
+  kv_type: "turbo3"  # Lower compression, faster
+  # or (set only one kv_type; duplicate keys are not valid YAML):
+  # kv_type: "turbo2"  # Minimal compression, fastest
+```
+
+### Disable Per-Layer Adaptive
+
+```yaml
+turboquant:
+  layer_adaptive_mode: 0  # Uniform quantization
+```
+
+### Use Asymmetric K/V
+
+For better quality on sensitive models:
+
+```bash
+# Start server with asymmetric K/V
+llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
+```
+
+## Troubleshooting
+
+### Server Won't Start
+
+1. Check if port 8081 is available: `lsof -i :8081`
+2. Verify model path is correct
+3. Ensure TurboQuant branch is checked out
+
+### Poor Generation Quality
+
+1. Try `turbo3` instead of `turbo4`
+2. Disable per-layer adaptive (mode 0)
+3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`
+
+### High Memory Usage
+
+1. Reduce context length: `-c 65536` (64K)
+2. Check `TURBO_LAYER_ADAPTIVE` is set
+3. Monitor with: `vmmap --summary $(pgrep llama-server)`
+
+## References
+
+- [TurboQuant Build Spec](../BUILD-SPEC.md)
+- [Phase 1 Report](../PHASE1-REPORT.md)
+- [Full Knowledge Transfer](../FULL-REPORT.md)
+- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)
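Review note on the README commands: the symmetric and asymmetric server invocations differ only in the cache types passed to `-ctk` and `-ctv`. A minimal sketch that assembles the argument list from the flags shown in the README follows. The function `build_server_command` and its parameter names are hypothetical, and the defaults simply mirror the Quick Start values.

```python
import shlex


def build_server_command(model_path: str,
                         cache_type_k: str = "turbo4",
                         cache_type_v: str = "turbo4",
                         context: int = 131072,
                         port: int = 8081,
                         host: str = "0.0.0.0",
                         binary: str = "./build/bin/llama-server") -> list[str]:
    """Assemble the llama-server argument list shown in the Quick Start."""
    return [
        binary,
        "-m", model_path,
        "--port", str(port),
        # Separate key/value cache types allow the asymmetric K/V setup.
        "-ctk", cache_type_k, "-ctv", cache_type_v,
        "-c", str(context),
        "--host", host,  # 0.0.0.0 binds all interfaces, as in the Quick Start
    ]


def as_shell(cmd: list[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

For example, `as_shell(build_server_command("model.gguf", cache_type_k="q8_0"))` yields the asymmetric variant from the Customization section, with the README's explicit `--host` added.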
--- a/profiles/hermes-profile-gemma4-turboquant.yaml
+++ b/profiles/hermes-profile-gemma4-turboquant.yaml
@@ -0,0 +1,169 @@
+# Hermes Profile: Gemma 4 + TurboQuant KV Cache Compression
+# For use with local llama.cpp server running TurboQuant-enabled inference
+# Drop into ~/.hermes/profiles/gemma4-turboquant.yaml
+
+profile:
+  name: "gemma4-turboquant"
+  version: "1.0.0"
+  description: "Gemma 4 model with TurboQuant KV cache compression for extended context on Apple Silicon"
+
+# Primary provider: local llama.cpp server with TurboQuant
+providers:
+  primary:
+    type: "llama.cpp"
+    name: "local-turboquant"
+    endpoint: "http://localhost:8081"
+    api_path: "/v1/chat/completions"
+    timeout_ms: 120000
+    
+    # Model configuration
+    model:
+      name: "gemma-4"
+      path: "/path/to/gemma-4-q4_k_m.gguf"  # Update with actual model path
+      
+    # TurboQuant KV cache compression settings
+    turboquant:
+      enabled: true
+      kv_type: "turbo4"  # Options: turbo2, turbo3, turbo4 (4-bit recommended)
+      layer_adaptive_mode: 7  # Per-layer adaptive quantization (0-7, 7=best quality/ratio)
+      
+    # Context and memory settings
+    context:
+      max_tokens: 131072  # 128K context with TurboQuant compression
+      batch_size: 512
+      
+    # Generation parameters
+    generation:
+      temperature: 0.7
+      top_p: 0.9
+      top_k: 40
+      repeat_penalty: 1.1
+      frequency_penalty: 0.0
+      presence_penalty: 0.0
+      
+    # Server startup command (for reference)
+    server_command: |
+      export TURBO_LAYER_ADAPTIVE=7
+      llama-server \
+        -m /path/to/gemma-4-q4_k_m.gguf \
+        --port 8081 \
+        -ctk turbo4 -ctv turbo4 \
+        -c 131072 \
+        --host 0.0.0.0
+
+  # Fallback provider 1: Ollama (standard, no TurboQuant)
+  fallback_1:
+    type: "ollama"
+    name: "ollama-gemma4"
+    endpoint: "http://localhost:11434"
+    api_path: "/api/chat"
+    timeout_ms: 120000
+    
+    model:
+      name: "gemma4:latest"
+      
+    generation:
+      temperature: 0.7
+      top_p: 0.9
+      top_k: 40
+
+  # Fallback provider 2: OpenAI-compatible API (cloud backup)
+  fallback_2:
+    type: "openai"
+    name: "openai-backup"
+    endpoint: "https://api.openai.com"
+    api_path: "/v1/chat/completions"
+    timeout_ms: 60000
+    
+    model:
+      name: "gpt-4"
+      
+    generation:
+      temperature: 0.7
+      max_tokens: 4096
+
+# Performance and monitoring
+performance:
+  # Memory management for TurboQuant
+  memory:
+    max_gpu_memory_gb: 28  # Leave headroom on 36GB M3 Max
+    kv_cache_compression: "turbo4"
+    estimated_savings: "73%"  # TurboQuant delivers ~73% KV memory savings
+    
+  # Benchmarking integration
+  benchmarks:
+    enabled: true
+    metrics:
+      - "tokens_per_second"
+      - "time_to_first_token"
+      - "peak_memory_usage"
+      - "perplexity"
+
+# Quality validation
+quality:
+  # Test prompts for quality comparison
+  test_prompts:
+    enabled: true
+    prompt_file: "benchmarks/prompts.json"
+    
+  # Perplexity testing
+  perplexity:
+    enabled: true
+    corpus: "wikitext-2-raw"
+    context_lengths: [8192, 32768, 65536, 131072]
+
+# Environment variables (applied when using this profile)
+environment:
+  TURBO_LAYER_ADAPTIVE: "7"  # Per-layer adaptive quantization mode
+  GGML_METAL_DEBUG: "0"  # Disable Metal debug in production
+  OMP_NUM_THREADS: "8"  # Optimize for M3 Max performance cores
+
+# Logging and diagnostics
+logging:
+  level: "info"
+  metrics_interval_seconds: 60
+  log_token_speed: true
+  log_memory_usage: true
+
+# Notes for deployment
+notes:
+  deployment: |
+    1. Ensure llama.cpp fork with TurboQuant is built:
+       cd /path/to/llama-cpp-turboquant
+       git checkout feature/turboquant-kv-cache
+       cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
+       cmake --build build -j$(sysctl -n hw.ncpu)
+    
+    2. Start the server:
+       export TURBO_LAYER_ADAPTIVE=7
+       ./build/bin/llama-server \
+         -m /path/to/gemma-4-q4_k_m.gguf \
+         --port 8081 \
+         -ctk turbo4 -ctv turbo4 \
+         -c 131072 \
+         --host 0.0.0.0
+    
+    3. Verify server is running:
+       curl http://localhost:8081/v1/models
+    
+    4. Copy this profile to Hermes:
+       cp hermes-profile-gemma4-turboquant.yaml ~/.hermes/profiles/gemma4-turboquant.yaml
+    
+  performance_notes: |
+    TurboQuant delivers:
+    - 73% KV cache memory savings
+    - 1% prompt processing overhead
+    - 11% generation overhead
+    - Enables 128K context on 36GB hardware
+    
+    With TurboQuant on Gemma 4 (estimated):
+    - Model weights: ~16GB at Q4_K_M
+    - KV cache at 128K: ~5GB (vs ~20GB without compression)
+    - Total memory: ~23GB (fits comfortably in 31GB budget)
+    
+  troubleshooting: |
+    - If generation speed is slow, try turbo3 instead of turbo4
+    - If quality issues, disable per-layer adaptive (set mode to 0)
+    - For maximum quality on sensitive layers, use asymmetric K/V:
+      -ctk q8_0 -ctv turbo4
+    - Monitor memory with: vmmap --summary $(pgrep llama-server)
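Review note on the memory estimate in `performance_notes`: the figures can be reproduced from the profile's own inputs (~20GB uncompressed KV cache at 128K, 73% savings, ~16GB weights). This is a minimal arithmetic sketch and the function name `kv_cache_budget` is hypothetical. The profile's ~23GB total is larger than weights plus KV cache, so it presumably includes other buffers that the profile does not itemize (an assumption).

```python
def kv_cache_budget(uncompressed_kv_gb: float, savings: float,
                    weights_gb: float) -> dict:
    """Estimate the compressed KV cache size and the weights + KV total, in GB."""
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be a fraction in [0, 1)")
    compressed = uncompressed_kv_gb * (1.0 - savings)
    return {
        "kv_cache_gb": round(compressed, 1),
        "weights_plus_kv_gb": round(weights_gb + compressed, 1),
    }


# Inputs taken from the profile's performance notes.
estimate = kv_cache_budget(uncompressed_kv_gb=20.0, savings=0.73, weights_gb=16.0)
```

With these inputs the compressed cache comes to about 5.4GB, in line with the profile's "~5GB", and weights plus cache to about 21.4GB.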
Author	SHA1	Message	Date
Alexander Whitestone	e4f15254b3	feat: wikitext-2 corpus + perplexity benchmark script (closes #21)	2026-04-12 00:39:14 -04:00
Timmy Time	4c926312df	Merge pull request 'Add smoke test workflow' (#34) from fix/add-smoke-test into main	2026-04-11 00:43:35 +00:00
Alexander Whitestone	6698b50f8f	Add smoke test workflow	2026-04-10 20:06:28 -04:00
Alexander Whitestone	f13287dc58	Merge pull request #33	2026-04-10 03:43:48 +00:00
Alexander Whitestone	aa0e76c1ab	feat: Add Hermes profile for Gemma 4 + TurboQuant (Issue #28) - Add gemma4-turboquant.yaml profile for Hermes - Configure local llama.cpp server with TurboQuant KV compression - Set turbo4 (4-bit) compression with per-layer adaptive mode 7 - Support 128K context with 73% KV memory savings - Include fallback providers (Ollama, OpenAI) - Add profiles/README.md with setup and usage instructions - Document performance expectations and troubleshooting Closes #28	2026-04-09 21:15:57 -04:00