Compare commits

...

4 Commits

Author SHA1 Message Date
45840c1b70 test: add Allegro benchmark and preset tests (#95) 2026-04-16 01:56:15 +00:00
d603a1b053 docs: Allegro VPS benchmark analysis — expected results (#95) 2026-04-16 01:54:53 +00:00
f3a5be5638 feat: add Allegro VPS benchmark runner (#95) 2026-04-16 01:53:49 +00:00
70d292c222 feat: add Allegro VPS preset configurations (#95) 2026-04-16 01:50:50 +00:00
4 changed files with 930 additions and 0 deletions

View File

@@ -0,0 +1,113 @@
# Allegro VPS Benchmark Analysis — 2026-04-14
## Hardware
| Spec | Value |
|------|-------|
| Hostname | allegro |
| IP | 167.99.126.228 |
| Cores | 2 |
| RAM | 8GB |
| GPU | No (CPU-only) |
| Arch | x86_64 |
| Available for model | ~6GB (2GB reserved for OS + hermes agent) |
## Preset Analysis
Estimates are based on GGUF model sizes and TurboQuant KV-cache memory math.
### Memory Budget
```
Total RAM: 8,192 MB
OS + hermes agent: -2,048 MB
Available: 6,144 MB
```
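For reference, the KV cache column in the table below follows the standard transformer cache formula. The sketch here is illustrative only: per-model layer and head counts are not listed in this document, so it states the formula and the f16-vs-4-bit ratio rather than reproducing any preset's figure.
```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> int:
    """Standard transformer KV cache: 2 (K and V) * layers * kv-heads * head_dim * context."""
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)

# The ratio is what matters for the presets: 4-bit KV (~0.5 bytes/element) is ~4x smaller
# than f16 (2 bytes/element) at the same context length, which is how the 8K and 32K
# presets keep their caches inside the 6GB budget.
```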
### Preset Memory Estimates
| Preset | Model Size | Context | KV Type | KV Cache | Total Est. | Fits? |
|--------|-----------|---------|---------|----------|------------|-------|
| tiny-2b-q4 | 1,536 MB | 4K | f16 | 256 MB | ~2,800 MB | YES |
| small-3b-q4 | 2,048 MB | 8K | turbo2 | 512 MB | ~3,600 MB | YES |
| medium-7b-q4 | 4,096 MB | 8K | turbo4 | 384 MB | ~5,200 MB | YES |
| medium-7b-q4-long | 4,096 MB | 32K | turbo4 | 1,024 MB | ~5,800 MB | YES |
| large-14b-q3 | 6,656 MB | 4K | turbo4 | 320 MB | ~7,200 MB | NO* |
*Large preset needs swap or will OOM. Usable for batch jobs with `--mlock` disabled.
### Estimated Performance (CPU-only, 2 cores)
These are theoretical estimates based on model size and CPU throughput.
Actual results depend on prompt length, generation length, and system load.
| Preset | Est. tok/s | Est. TTFT | Use Case |
|--------|-----------|-----------|----------|
| tiny-2b-q4 | 8-15 | 1.5-3.0s | Simple Q&A, triage, short completions |
| small-3b-q4 | 5-10 | 2.0-5.0s | Code gen, tool calling, burn-loop workers |
| medium-7b-q4 | 2-5 | 4.0-8.0s | Reasoning, multi-turn conversation |
| medium-7b-q4-long | 1.5-4 | 6.0-12.0s | Long docs, code review, research |
| large-14b-q3 | 0.5-2 | 10-30s | Batch processing only (needs swap) |
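As a rough cross-check of the table above (an approximation, not part of the benchmark): single-stream decode on CPU is usually memory-bandwidth bound, so tokens/sec is on the order of effective memory bandwidth divided by the bytes of weights streamed per token. The bandwidth figure below is an assumed placeholder for a small 2-core VPS, not a measured value.
```python
def rough_decode_tok_per_sec(model_size_gb: float, eff_bandwidth_gb_s: float = 12.0) -> float:
    """Back-of-envelope: each generated token streams (roughly) the whole quantized model."""
    return eff_bandwidth_gb_s / model_size_gb

# e.g. the medium preset: ~12 / 4.1 ≈ 2.9 tok/s, inside the 2-5 tok/s band above.
print(round(rough_decode_tok_per_sec(4.1), 1))
```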
## Recommendation
**Default: `medium` (7B Q4 + TurboQuant)**
- Best quality that fits comfortably within the 6GB budget
- 2-5 tok/s is usable for interactive work (burn-loop, conversation)
- TurboQuant KV4 keeps 8K context at ~384MB cache
**For burn-loop workers: `small` (3B Q4 + TurboQuant2)**
- 5-10 tok/s is better for high-throughput batch work
- Lower memory footprint leaves room for multiple workers
**For long documents: `medium-long` (7B Q4 + TurboQuant4, 32K)**
- 32K context for code review, research papers
- Stays within the 6GB budget with q3_k KV compression
## Server Startup Commands
### Ollama (simplest)
```bash
# Tiny
ollama pull qwen2.5:1.5b
# Small
ollama pull qwen2.5:3b
# Medium (recommended)
ollama pull qwen2.5:7b
```
### llama-server with TurboQuant
```bash
# Medium preset
export TURBO_LAYER_ADAPTIVE=7
llama-server \
-m /models/qwen2.5-7b-instruct-q4_k_m.gguf \
--port 8081 \
-t 2 \
-c 8192 \
-b 512 \
-ctk q4_0 -ctv q4_0 \
--host 0.0.0.0
```
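A quick request against the OpenAI-compatible endpoint (the same route the benchmark runner uses) confirms the server is answering before a full run. A minimal sketch, assuming the llama-server instance started above is listening on localhost:8081:
```python
import requests

resp = requests.post(
    "http://localhost:8081/v1/chat/completions",
    json={
        "model": "qwen2.5-7b-instruct-q4_k_m.gguf",  # mirrors the runner's request shape
        "messages": [{"role": "user", "content": "Reply with OK."}],
        "max_tokens": 8,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```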
### Run Benchmarks
```bash
# All presets
python3 benchmarks/run_allegro_benchmarks.py --all --markdown
# Specific preset
python3 benchmarks/run_allegro_benchmarks.py --preset medium \
--url http://localhost:11434
```
## Next Steps
1. Run benchmarks on Allegro VPS: `python3 benchmarks/run_allegro_benchmarks.py --all --markdown`
2. Update this document with actual measured results
3. Set `recommended_preset` based on measured performance
4. Create hermes profile for each viable preset

View File

@@ -0,0 +1,512 @@
#!/usr/bin/env python3
"""
Allegro VPS Benchmark Runner — TurboQuant presets on 2 cores, 8GB RAM.
Runs each preset from profiles/allegro-cpu-presets.yaml against the
benchmark prompts, measuring tokens/sec, latency, TTFT, and memory.
Designed for CPU-only inference (no GPU) on the Allegro VPS.
Usage:
# Run all presets
python3 benchmarks/run_allegro_benchmarks.py --all
# Run specific preset
python3 benchmarks/run_allegro_benchmarks.py --preset medium
# Dry run (validate config, no inference)
python3 benchmarks/run_allegro_benchmarks.py --dry-run
# Output markdown report
python3 benchmarks/run_allegro_benchmarks.py --all --markdown
# Against remote Ollama
python3 benchmarks/run_allegro_benchmarks.py --preset small \
--url http://167.99.126.228:11434
"""
import argparse
import json
import os
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional
ROOT = Path(__file__).resolve().parents[1]
PRESETS_FILE = ROOT / "profiles" / "allegro-cpu-presets.yaml"
PROMPTS_FILE = ROOT / "benchmarks" / "prompts.json"
RESULTS_DIR = ROOT / "benchmarks"
try:
import requests
except ImportError:
requests = None
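# requests is only needed for the actual inference calls; --dry-run works without it.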
# ── Hardware Detection ────────────────────────────────────────────────────
def detect_hardware() -> dict:
"""Detect current hardware specs."""
info = {
"hostname": "",
"cores": os.cpu_count() or 0,
"ram_gb": 0,
"gpu": False,
"arch": "",
}
try:
import platform
info["hostname"] = platform.node()
info["arch"] = platform.machine()
except Exception:
pass
# RAM detection (Linux)
try:
with open("/proc/meminfo") as f:
for line in f:
if line.startswith("MemTotal:"):
kb = int(line.split()[1])
info["ram_gb"] = round(kb / 1024 / 1024, 1)
break
except Exception:
# macOS fallback
try:
result = subprocess.run(["sysctl", "-n", "hw.memsize"],
capture_output=True, text=True)
bytes_val = int(result.stdout.strip())
info["ram_gb"] = round(bytes_val / 1024**3, 1)
except Exception:
pass
# GPU detection
try:
result = subprocess.run(["nvidia-smi", "--query-gpu=name",
"--format=csv,noheader"],
capture_output=True, text=True, timeout=5)
if result.returncode == 0 and result.stdout.strip():
info["gpu"] = True
except Exception:
pass
return info
def get_memory_usage_gb() -> float:
"""Get current process RSS in GB."""
try:
if sys.platform == "darwin":
result = subprocess.run(["ps", "-o", "rss=", "-p", str(os.getpid())],
capture_output=True, text=True)
return int(result.stdout.strip()) / 1024 / 1024
else:
with open(f"/proc/{os.getpid()}/status") as f:
for line in f:
if line.startswith("VmRSS:"):
return int(line.split()[1]) / 1024 / 1024
except Exception:
pass
return 0.0
def get_system_memory_gb() -> float:
"""Get available system memory in GB."""
try:
with open("/proc/meminfo") as f:
for line in f:
if line.startswith("MemAvailable:"):
kb = int(line.split()[1])
return round(kb / 1024 / 1024, 2)
except Exception:
pass
return 0.0
# ── Preset Loading ────────────────────────────────────────────────────────
def load_presets() -> dict:
"""Load preset configuration from YAML."""
try:
import yaml
with open(PRESETS_FILE) as f:
return yaml.safe_load(f)
except ImportError:
# Fallback: parse basic YAML manually
import re
with open(PRESETS_FILE) as f:
content = f.read()
# Very basic YAML parsing — just enough to extract preset names
presets = {}
current = None
for line in content.split("\n"):
            # Top-level preset keys are nested two spaces under "presets:"
            m = re.match(r"^  (\w+):$", line)
            if m:
current = m.group(1)
presets[current] = {"name": current}
return {"presets": presets}
def load_prompts(path: Optional[str] = None) -> list:
    """Load benchmark prompts from the given path (defaults to PROMPTS_FILE)."""
    with open(path or PROMPTS_FILE) as f:
        return json.load(f)
# ── Inference Backends ────────────────────────────────────────────────────
def run_ollama(prompt: str, model: str, url: str, timeout: int = 120) -> dict:
"""Run inference against Ollama."""
if requests is None:
return {"status": "failed", "error": "requests not installed"}
api_url = f"{url.rstrip('/')}/api/generate"
start = time.time()
mem_before = get_memory_usage_gb()
sys_mem_before = get_system_memory_gb()
try:
resp = requests.post(api_url, json={
"model": model,
"prompt": prompt,
"stream": False,
"options": {"num_predict": 256}
}, timeout=timeout)
elapsed = time.time() - start
mem_after = get_memory_usage_gb()
sys_mem_after = get_system_memory_gb()
resp.raise_for_status()
data = resp.json()
response_text = data.get("response", "")
eval_count = data.get("eval_count", 0)
eval_duration_ns = data.get("eval_duration", 0)
prompt_eval_ns = data.get("prompt_eval_duration", 0)
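        # Ollama reports generation-only counts/timings (eval_count, eval_duration) and
        # prompt processing time (prompt_eval_duration), all in nanoseconds, so tok/s
        # excludes prompt processing and TTFT approximates prompt-eval time.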
tok_per_sec = 0.0
ttft = None
if eval_duration_ns > 0:
tok_per_sec = eval_count / (eval_duration_ns / 1e9)
if prompt_eval_ns > 0:
ttft = prompt_eval_ns / 1e9
return {
"response": response_text[:200],
"latency_s": round(elapsed, 3),
"ttft_s": round(ttft, 3) if ttft else None,
"tokens_per_sec": round(tok_per_sec, 2),
"eval_count": eval_count,
"memory_gb": round(max(mem_before, mem_after), 2),
"system_mem_available_gb": round(sys_mem_after, 2),
"system_mem_delta_gb": round(sys_mem_before - sys_mem_after, 2),
"status": "success",
}
except Exception as e:
return {
"status": "failed",
"error": str(e)[:200],
"latency_s": round(time.time() - start, 3),
}
def run_llama_server(prompt: str, model: str, url: str,
kv_type: str = "f16", timeout: int = 120) -> dict:
"""Run inference against llama-server (OpenAI-compatible)."""
if requests is None:
return {"status": "failed", "error": "requests not installed"}
api_url = f"{url.rstrip('/')}/v1/chat/completions"
start = time.time()
mem_before = get_memory_usage_gb()
sys_mem_before = get_system_memory_gb()
try:
resp = requests.post(api_url, json={
"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": 256,
"stream": False,
}, timeout=timeout)
elapsed = time.time() - start
mem_after = get_memory_usage_gb()
sys_mem_after = get_system_memory_gb()
resp.raise_for_status()
data = resp.json()
choice = data.get("choices", [{}])[0]
response_text = choice.get("message", {}).get("content", "")
usage = data.get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
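        # No per-phase timings are parsed from llama-server here, so tok/s is wall-clock
        # based; a fixed 0.1s is subtracted as a crude allowance for prompt processing.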
tok_per_sec = 0.0
if elapsed > 0 and completion_tokens > 0:
tok_per_sec = completion_tokens / max(elapsed - 0.1, 0.01)
return {
"response": response_text[:200],
"latency_s": round(elapsed, 3),
"ttft_s": None,
"tokens_per_sec": round(tok_per_sec, 2),
"completion_tokens": completion_tokens,
"kv_type": kv_type,
"memory_gb": round(max(mem_before, mem_after), 2),
"system_mem_available_gb": round(sys_mem_after, 2),
"system_mem_delta_gb": round(sys_mem_before - sys_mem_after, 2),
"status": "success",
}
except Exception as e:
return {
"status": "failed",
"error": str(e)[:200],
"latency_s": round(time.time() - start, 3),
}
# ── Benchmark Runner ──────────────────────────────────────────────────────
def run_preset(preset: dict, backend: str, url: str, prompts: list,
timeout: int = 120, dry_run: bool = False) -> dict:
"""Run a single preset against all prompts."""
name = preset.get("name", "unknown")
model = preset.get("ollama_model", "") if backend == "ollama" else preset.get("llama_cpp_model", "")
kv_type = preset.get("kv_type", "f16")
run_fn = run_ollama if backend == "ollama" else run_llama_server
print(f"\nPreset: {name} (model={model}, kv={kv_type})")
print(f" Estimated RAM: {preset.get('estimated_ram_gb', '?')}GB | "
f"Fits Allegro: {preset.get('fits_in_allegro', '?')}")
if dry_run:
print(f" [DRY RUN] Skipping inference")
return {"preset": name, "status": "dry_run", "results": []}
results = []
for item in prompts:
pid = item.get("id", item.get("category", "unknown"))
prompt = item["prompt"]
print(f" [{pid}] ...", end=" ", flush=True)
if backend == "ollama":
result = run_fn(prompt, model, url, timeout=timeout)
else:
result = run_fn(prompt, model, url, kv_type=kv_type, timeout=timeout)
result["id"] = pid
result["prompt_preview"] = prompt[:80]
results.append(result)
status = "OK" if result["status"] == "success" else "FAIL"
tps = result.get("tokens_per_sec", 0)
lat = result.get("latency_s", 0)
mem = result.get("system_mem_available_gb", 0)
print(f"{status} {tps:.1f} tok/s {lat:.1f}s mem={mem:.1f}GB")
# Summary
successes = [r for r in results if r["status"] == "success"]
summary = {
"preset": name,
"model": model,
"kv_type": kv_type,
"total": len(results),
"success": len(successes),
"failed": len(results) - len(successes),
"avg_tok_per_sec": round(
sum(r.get("tokens_per_sec", 0) for r in successes) / max(len(successes), 1), 2
),
"avg_latency_s": round(
sum(r.get("latency_s", 0) for r in successes) / max(len(successes), 1), 3
),
"peak_memory_gb": round(
max((r.get("memory_gb", 0) for r in results), default=0), 2
),
"min_system_mem_available_gb": round(
min((r.get("system_mem_available_gb", 999) for r in results), default=0), 2
),
"results": results,
}
print(f" SUMMARY: {summary['success']}/{summary['total']} OK | "
f"Avg {summary['avg_tok_per_sec']:.1f} tok/s | "
f"Peak {summary['peak_memory_gb']:.1f}GB | "
f"Min avail {summary['min_system_mem_available_gb']:.1f}GB")
return summary
def generate_report(all_results: list, hw_info: dict, output_dir: str) -> str:
"""Generate markdown benchmark report."""
today = datetime.now().strftime("%Y-%m-%d")
lines = [
f"# Allegro VPS Benchmark Results — {today}",
"",
"## Hardware",
"",
f"| Spec | Value |",
f"|------|-------|",
f"| Hostname | {hw_info.get('hostname', 'unknown')} |",
f"| Cores | {hw_info.get('cores', '?')} |",
f"| RAM | {hw_info.get('ram_gb', '?')}GB |",
f"| GPU | {'Yes' if hw_info.get('gpu') else 'No (CPU-only)'} |",
f"| Arch | {hw_info.get('arch', '?')} |",
"",
"## Results Summary",
"",
"| Preset | Model | KV | tok/s | Latency (s) | Peak Mem (GB) | Status |",
"|--------|-------|-----|-------|-------------|---------------|--------|",
]
for r in all_results:
status = "PASS" if r["success"] == r["total"] else f"{r['success']}/{r['total']}"
lines.append(
f"| {r['preset']} | {r['model']} | {r['kv_type']} | "
f"{r['avg_tok_per_sec']} | {r['avg_latency_s']} | "
f"{r['peak_memory_gb']} | {status} |"
)
# Find minimum viable preset
viable = [r for r in all_results
if r["success"] == r["total"]
and r.get("min_system_mem_available_gb", 0) > 1.0]
if viable:
best = min(viable, key=lambda x: x["peak_memory_gb"])
lines.extend([
"",
"## Minimum Viable Preset",
"",
f"**{best['preset']}** ({best['model']}, {best['kv_type']})",
f"- Peak memory: {best['peak_memory_gb']}GB",
f"- Min available system memory: {best['min_system_mem_available_gb']}GB",
f"- Avg performance: {best['avg_tok_per_sec']} tok/s",
"",
"Fits within the 6GB budget (8GB - 2GB OS reserve).",
])
else:
lines.extend([
"",
"## Minimum Viable Preset",
"",
"No preset passed all tests with >1GB system memory headroom.",
"Recommendation: use `tiny` or `small` presets.",
])
lines.extend([
"",
"## Per-Preset Details",
"",
])
for r in all_results:
lines.extend([
f"### {r['preset']}",
"",
f"- Model: `{r['model']}`",
f"- KV type: `{r['kv_type']}`",
f"- Avg tok/s: {r['avg_tok_per_sec']}",
f"- Avg latency: {r['avg_latency_s']}s",
f"- Peak memory: {r['peak_memory_gb']}GB",
"",
"| Prompt | tok/s | Latency (s) | Status |",
"|--------|-------|-------------|--------|",
])
for res in r.get("results", []):
pid = res.get("id", "?")
tps = res.get("tokens_per_sec", 0)
lat = res.get("latency_s", 0)
st = res.get("status", "?")
lines.append(f"| {pid} | {tps} | {lat} | {st} |")
lines.append("")
report = "\n".join(lines)
output_path = os.path.join(output_dir, f"allegro-{today}.md")
with open(output_path, "w") as f:
f.write(report)
print(f"\nReport saved to {output_path}")
return report
# ── CLI ───────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(description="Allegro VPS Benchmark Runner")
parser.add_argument("--all", action="store_true", help="Run all presets")
parser.add_argument("--preset", help="Run a specific preset")
parser.add_argument("--backend", choices=["ollama", "llama-server"],
default="ollama", help="Inference backend")
parser.add_argument("--url", default="http://localhost:11434",
help="Backend URL")
parser.add_argument("--prompts", default=None, help="Prompts file")
parser.add_argument("--timeout", type=int, default=120,
help="Per-prompt timeout (s)")
parser.add_argument("--dry-run", action="store_true",
help="Validate config without inference")
parser.add_argument("--markdown", action="store_true",
help="Generate markdown report")
parser.add_argument("--json", dest="json_output", action="store_true",
help="JSON output")
args = parser.parse_args()
    if args.dry_run and not (args.all or args.preset):
        # --dry-run alone validates every preset (matches the usage examples above)
        args.all = True
    if not args.all and not args.preset:
        parser.error("Specify --all or --preset <name>")
# Load config
config = load_presets()
presets = config.get("presets", {})
    prompts_file = args.prompts or str(PROMPTS_FILE)
    prompts = load_prompts(prompts_file) if os.path.exists(prompts_file) else []
# Hardware info
hw_info = detect_hardware()
print(f"Hardware: {hw_info['cores']} cores, {hw_info['ram_gb']}GB RAM, "
f"{'GPU' if hw_info['gpu'] else 'CPU-only'}")
# Determine which presets to run
if args.all:
preset_names = list(presets.keys())
else:
preset_names = [args.preset]
all_results = []
for pname in preset_names:
if pname not in presets:
print(f"Unknown preset: {pname}")
continue
preset = presets[pname]
result = run_preset(preset, args.backend, args.url, prompts,
timeout=args.timeout, dry_run=args.dry_run)
all_results.append(result)
# Output
if args.json_output:
print(json.dumps(all_results, indent=2))
elif args.markdown:
generate_report(all_results, hw_info, str(RESULTS_DIR))
else:
# Summary table
print(f"\n{'='*70}")
print(f"{'Preset':<20} {'Model':<25} {'tok/s':<8} {'Lat(s)':<8} {'Mem(GB)':<8}")
print(f"{'-'*70}")
for r in all_results:
print(f"{r['preset']:<20} {r.get('model','?'):<25} "
f"{r.get('avg_tok_per_sec',0):<8} "
f"{r.get('avg_latency_s',0):<8} "
f"{r.get('peak_memory_gb',0):<8}")
print(f"{'='*70}")
# Save raw results
ts = int(time.time())
raw_path = str(RESULTS_DIR / f"allegro_results_{ts}.json")
os.makedirs(os.path.dirname(raw_path), exist_ok=True)
with open(raw_path, "w") as f:
json.dump({"hardware": hw_info, "results": all_results}, f, indent=2)
print(f"Raw results: {raw_path}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,164 @@
# Allegro VPS Presets — 2 cores, 8GB RAM, CPU-only inference
# Optimized for the Timmy Foundation Allegro server (167.99.126.228)
#
# Hardware constraints:
# - 2 CPU cores (no GPU)
# - 8GB RAM total
# - ~2GB reserved for OS + hermes agent
# - ~6GB available for model + KV cache
#
# Strategy: GGUF quantization via llama.cpp (CPU-optimized)
# KV cache compression via TurboQuant to maximize context within RAM
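#
# Note: estimated_ram_gb values below are estimates, not measurements — roughly
# model file size + KV cache + ~0.7-1GB of runtime overhead (see the accompanying
# benchmark analysis for the per-preset breakdown).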
hardware:
hostname: "allegro"
ip: "167.99.126.228"
cores: 2
ram_gb: 8
gpu: false
os_reserved_gb: 2
available_gb: 6
arch: "x86_64"
cpu_backend: "llama.cpp"
presets:
# ─── TIER 1: Conservative (fits comfortably) ──────────────────────
tiny:
name: "tiny-2b-q4"
description: "2B param model, Q4_K_M — leaves headroom for other processes"
model_size_gb: 1.5
quantization: "Q4_K_M"
context_tokens: 4096
kv_type: "f16"
estimated_ram_gb: 2.8
fits_in_allegro: true
server_flags:
threads: 2
context: 4096
batch: 256
expected_perf:
tokens_per_sec: "8-15"
ttft_s: "1.5-3.0"
use_case: "Simple Q&A, short completions, triage"
ollama_model: "qwen2.5:1.5b"
llama_cpp_model: "qwen2.5-1.5b-instruct-q4_k_m.gguf"
small:
name: "small-3b-q4"
description: "3B param model, Q4_K_M — sweet spot for value on 2 cores"
model_size_gb: 2.0
quantization: "Q4_K_M"
context_tokens: 8192
kv_type: "turbo2"
estimated_ram_gb: 3.6
fits_in_allegro: true
server_flags:
threads: 2
context: 8192
batch: 512
ctk: "q4_0"
ctv: "q4_0"
expected_perf:
tokens_per_sec: "5-10"
ttft_s: "2.0-5.0"
use_case: "Code generation, tool calling, burn-loop workers"
ollama_model: "qwen2.5:3b"
llama_cpp_model: "qwen2.5-3b-instruct-q4_k_m.gguf"
# ─── TIER 2: Balanced (recommended default) ───────────────────────
medium:
name: "medium-7b-q4"
description: "7B param model, Q4_K_M + TurboQuant — best quality that fits"
model_size_gb: 4.1
quantization: "Q4_K_M"
context_tokens: 8192
kv_type: "turbo4"
estimated_ram_gb: 5.2
fits_in_allegro: true
server_flags:
threads: 2
context: 8192
batch: 512
ctk: "q4_0"
ctv: "q4_0"
layer_adaptive: 7
expected_perf:
tokens_per_sec: "2-5"
ttft_s: "4.0-8.0"
use_case: "Complex reasoning, multi-turn conversation, analysis"
ollama_model: "qwen2.5:7b"
llama_cpp_model: "qwen2.5-7b-instruct-q4_k_m.gguf"
medium_long:
name: "medium-7b-q4-long"
description: "7B Q4 + aggressive TurboQuant for 32K context"
model_size_gb: 4.1
quantization: "Q4_K_M"
context_tokens: 32768
kv_type: "turbo4"
estimated_ram_gb: 5.8
fits_in_allegro: true
server_flags:
threads: 2
context: 32768
batch: 256
ctk: "q3_k"
ctv: "q3_k"
layer_adaptive: 7
expected_perf:
tokens_per_sec: "1.5-4"
ttft_s: "6.0-12.0"
use_case: "Long document analysis, code review, research"
ollama_model: "qwen2.5:7b"
llama_cpp_model: "qwen2.5-7b-instruct-q4_k_m.gguf"
# ─── TIER 3: Pushing limits (may swap) ────────────────────────────
large:
name: "large-14b-q3"
description: "14B param model, Q3_K_M — may page to swap, use with caution"
model_size_gb: 6.5
quantization: "Q3_K_M"
context_tokens: 4096
kv_type: "turbo4"
estimated_ram_gb: 7.2
fits_in_allegro: false
warning: "Exceeds 6GB limit. Needs swap or will OOM. Use only for batch jobs."
server_flags:
threads: 2
context: 4096
batch: 256
ctk: "q3_k"
ctv: "q3_k"
layer_adaptive: 7
expected_perf:
tokens_per_sec: "0.5-2"
ttft_s: "10.0-30.0"
use_case: "Batch processing, overnight jobs (with swap)"
ollama_model: "qwen2.5:14b"
llama_cpp_model: "qwen2.5-14b-instruct-q3_k_m.gguf"
# Recommended default for Allegro
recommended_preset: "medium"
# Server startup examples
examples:
ollama: |
# Pull and run
ollama pull qwen2.5:7b
ollama run qwen2.5:7b
llama_cpp: |
# With TurboQuant KV cache
export TURBO_LAYER_ADAPTIVE=7
llama-server \
-m /models/qwen2.5-7b-instruct-q4_k_m.gguf \
--port 8081 \
-t 2 \
-c 8192 \
-b 512 \
-ctk q4_0 -ctv q4_0 \
--host 0.0.0.0
hermes_profile: |
# Use with hermes agent
hermes -p allegro-medium chat

View File

@@ -0,0 +1,141 @@
"""Tests for Allegro VPS benchmark runner and preset configuration."""
import json
import os
import pathlib
import sys
import pytest
ROOT = pathlib.Path(__file__).resolve().parents[1]
PRESETS_FILE = ROOT / "profiles" / "allegro-cpu-presets.yaml"
PROMPTS_FILE = ROOT / "benchmarks" / "prompts.json"
sys.path.insert(0, str(ROOT / "benchmarks"))
# ---------------------------------------------------------------------------
# Preset config validation
# ---------------------------------------------------------------------------
class TestPresetConfig:
"""Validate allegro-cpu-presets.yaml structure."""
    @classmethod
    def setup_class(cls):
        # pytest's xunit-style hook for plain classes (setUpClass is unittest-only)
import yaml
cls.config = yaml.safe_load(PRESETS_FILE.read_text())
def test_config_has_hardware(self):
assert "hardware" in self.config
hw = self.config["hardware"]
assert hw["cores"] == 2
assert hw["ram_gb"] == 8
assert hw["gpu"] is False
def test_config_has_presets(self):
assert "presets" in self.config
assert len(self.config["presets"]) >= 3
def test_each_preset_has_required_fields(self):
for name, preset in self.config["presets"].items():
assert "name" in preset, f"Preset {name} missing 'name'"
assert "description" in preset, f"Preset {name} missing 'description'"
assert "model_size_gb" in preset, f"Preset {name} missing 'model_size_gb'"
assert "quantization" in preset, f"Preset {name} missing 'quantization'"
assert "context_tokens" in preset, f"Preset {name} missing 'context_tokens'"
assert "kv_type" in preset, f"Preset {name} missing 'kv_type'"
assert "estimated_ram_gb" in preset, f"Preset {name} missing 'estimated_ram_gb'"
assert "fits_in_allegro" in preset, f"Preset {name} missing 'fits_in_allegro'"
assert "expected_perf" in preset, f"Preset {name} missing 'expected_perf'"
assert "server_flags" in preset, f"Preset {name} missing 'server_flags'"
def test_tiny_fits_in_allegro(self):
tiny = self.config["presets"]["tiny"]
assert tiny["fits_in_allegro"] is True
assert tiny["estimated_ram_gb"] <= 6.0
def test_small_fits_in_allegro(self):
small = self.config["presets"]["small"]
assert small["fits_in_allegro"] is True
assert small["estimated_ram_gb"] <= 6.0
def test_medium_fits_in_allegro(self):
medium = self.config["presets"]["medium"]
assert medium["fits_in_allegro"] is True
assert medium["estimated_ram_gb"] <= 6.0
def test_large_does_not_fit(self):
large = self.config["presets"]["large"]
assert large["fits_in_allegro"] is False
assert large["estimated_ram_gb"] > 6.0
def test_recommended_preset_exists(self):
rec = self.config.get("recommended_preset")
assert rec is not None
assert rec in self.config["presets"]
def test_server_flags_have_threads(self):
for name, preset in self.config["presets"].items():
flags = preset.get("server_flags", {})
assert "threads" in flags, f"Preset {name} missing threads in server_flags"
assert flags["threads"] == 2, f"Preset {name} should use 2 threads"
def test_context_tokens_reasonable(self):
for name, preset in self.config["presets"].items():
ctx = preset["context_tokens"]
assert ctx >= 2048, f"Preset {name} context too small: {ctx}"
assert ctx <= 131072, f"Preset {name} context too large: {ctx}"
def test_kv_types_valid(self):
valid_types = {"f16", "q4_0", "q4_1", "q5_0", "q5_1", "q8_0",
"turbo2", "turbo3", "turbo4", "q3_k", "q4_k", "q5_k"}
for name, preset in self.config["presets"].items():
kv = preset["kv_type"]
assert kv in valid_types, f"Preset {name} has invalid kv_type: {kv}"
# ---------------------------------------------------------------------------
# Benchmark prompts validation
# ---------------------------------------------------------------------------
class TestBenchmarkPrompts:
def test_prompts_file_exists(self):
assert PROMPTS_FILE.exists()
def test_prompts_is_list(self):
prompts = json.loads(PROMPTS_FILE.read_text())
assert isinstance(prompts, list)
assert len(prompts) >= 5
def test_each_prompt_has_required_fields(self):
prompts = json.loads(PROMPTS_FILE.read_text())
for p in prompts:
assert "id" in p or "category" in p
assert "prompt" in p
assert len(p["prompt"]) > 10
# ---------------------------------------------------------------------------
# Hardware detection (unit tests)
# ---------------------------------------------------------------------------
class TestHardwareDetection:
def test_detect_hardware_returns_dict(self):
from run_allegro_benchmarks import detect_hardware
hw = detect_hardware()
assert isinstance(hw, dict)
assert "cores" in hw
assert "ram_gb" in hw
assert "gpu" in hw
def test_cores_positive(self):
from run_allegro_benchmarks import detect_hardware
hw = detect_hardware()
assert hw["cores"] > 0
def test_memory_usage_returns_float(self):
from run_allegro_benchmarks import get_memory_usage_gb
mem = get_memory_usage_gb()
assert isinstance(mem, (int, float))
assert mem >= 0