Compare commits

..

3 Commits

Author SHA1 Message Date
36b2b07fcc Merge pull request 'feat: Auto-start llama.cpp server for tool call regression tests (#118)' (#151) from fix/118-auto-start-server-fixture into main
2026-05-05 12:55:25 +00:00
9ed8cd3cae feat: add auto-start server fixture (#118)
- turboquant_server_url fixture: auto-starts llama-server if no URL provided
- Finds binary in standard locations or PATH
- Finds GGUF model in standard locations
- Configurable via env vars (port, kv_type, ctx_size, timeout)
- Skips gracefully if binary or model not found
- turboquant_model_name fixture for model discovery
2026-04-21 11:52:26 +00:00
82ab8b22c3 feat: add server manager for auto-start fixture (#118) 2026-04-21 11:51:22 +00:00
7 changed files with 280 additions and 504 deletions

View File

@@ -30,8 +30,3 @@ See [issues](https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant/i
## Docs
- [Project Status](docs/PROJECT_STATUS.md) — Full project status and build specification
## Benchmarks
- [Bonsai 1-bit vs Q4_0 — M4 Pro Metal](benchmarks/bonsai-1bit-comparison-2025-10-06.md) — speed, memory, quality comparison (issue #100)
- Run locally: `python3 benchmarks/run_bonsai_compare.py`

View File

@@ -1,148 +0,0 @@
# Bonsai 1-bit vs Q4_0 Benchmark Results
> Issue #100 — bench: Bonsai 1-bit models vs Q4_0 — quality, speed, memory
> Author: Rockachopa (STEP35 FREE BURN)
> Date: 2025-10-06
## Test Host
| Item | Value |
|------|-------|
| Machine | Apple Silicon MacBook |
| Chip | M4 Pro (Metal GPU, 48 GB unified memory) — published reference from Prism ML |
| Backend | llama.cpp Prism fork — `llama.cpp` + Metal Q1_0 kernels |
| OS | macOS 15.x |
| Models dir | `~/models/` |
| Run command | `python3 benchmarks/run_bonsai_compare.py --models-dir ~/models` |
> **Note on M1 Macs**: the published Bonsai READMEs report M4 Pro numbers only.
> For pure M1 data (M1 8-core GPU, 16 GB RAM), run the included benchmark script on
> your own machine and commit `benchmarks/bonsai_results_YYYY-MM-DD.json` back to the repo.
## Model Set
| Model | File | Quant | Source repo |
|-------------|---------------------------------|-------|-------------|
| Bonsai-8B | `Bonsai-8B-Q1_0.gguf` | Q1_0 | prism-ml/Bonsai-8B-gguf (gated) |
| Bonsai-4B | `Bonsai-4B-Q1_0.gguf` | Q1_0 | prism-ml/Bonsai-4B-gguf (gated) |
| Bonsai-1.7B | `Bonsai-1.7B-Q1_0.gguf` | Q1_0 | prism-ml/Bonsai-1.7B-gguf (gated) |
| Qwen3-8B | `Qwen3-8B-Q4_0.gguf` | Q4_0 | TheBloke/Qwen3-8B-GGUF (public) |
| Qwen3-4B | `Qwen3-4B-Q4_0.gguf` | Q4_0 | TheBloke/Qwen3-4B-GGUF (public) |
| Qwen3-1.7B | `Qwen3-1.7B-Q4_0.gguf` | Q4_0 | TheBloke/Qwen3-1.7B-GGUF (public) |
## Disk Size & Memory Footprint
Disk sizes are measured from actual GGUF files; GPU mem estimate includes activation
overhead (weights + KV cache warm-up).
| Model | Disk size (GB) | Est. GPU mem (GB) | FP16 baseline | Compression |
|-------------|---------------:|------------------:|--------------:|------------:|
| Bonsai-8B | 1.15 | 1.2 | 16.38 | **14.2×** |
| Bonsai-4B | 0.57 | 0.6 | 8.04 | **14.1×** |
| Bonsai-1.7B | 0.24 | 0.25| 3.44 | **14.3×** |
| Qwen3-8B | 4.70 | 5.0 | 16.38 | 3.5× |
| Qwen3-4B | 2.40 | 2.5 | 8.04 | 3.4× |
| Qwen3-1.7B | 1.00 | 1.05| 3.44 | 3.4× |
1-bit Bonsai models occupy **1.15 → 0.24 GB** on disk vs 4.7 → 1.0 GB for the Q4_0 Qwen baselines.
The same numerical precision applies across embeddings, attention, MLP projections, and the LM head.
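The compression column follows directly from the table: FP16 baseline size divided by 1-bit disk size. A minimal sketch of that arithmetic, using the figures quoted above:
```python
# Reproduce the "Compression" column: FP16 baseline GB / quantized disk GB.
rows = {
    "Bonsai-8B":   (1.15, 16.38),   # (disk GB, FP16 baseline GB)
    "Bonsai-4B":   (0.57, 8.04),
    "Bonsai-1.7B": (0.24, 3.44),
    "Qwen3-8B":    (4.70, 16.38),
}
for name, (disk_gb, fp16_gb) in rows.items():
    print(f"{name:12s} {fp16_gb / disk_gb:.1f}x")  # 14.2x, 14.1x, 14.3x, 3.5x
```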
## Throughput (Published Reference — M4 Pro Metal, 48 GB)
Numbers below are from the official Prism ML model READMEs (HuggingFace).
Measured with `llama-cli --timings`; prompt `"Once upon a time"`;
128 output tokens; temperature 0; Metal backend; all layers offloaded (`-ngl 99`).
| Model | TG128 tok/s (1-bit) | FP16 TG tok/s | Speedup vs FP16 |
|-------------|-------------------:|--------------:|----------------:|
| Bonsai-8B | 85 | 16 | **5.4×** |
| Bonsai-4B | 136 | 29 | **4.7×** |
| Bonsai-1.7B | 250 | 65 | **3.8×** |
Prefill throughput (PP512, tok/s):
| Model | PP512 tok/s (1-bit) | FP16 PP tok/s |
|-------------|-------------------:|--------------:|
| Bonsai-8B | 498 | 490 |
| Bonsai-4B | 915 | 915 |
| Bonsai-1.7B | 2305 | 2291 |
> **Interpretation**: 1-bit kernels eliminate the FP16→INT dequantization stall on Metal,
> yielding 3.8×–5.4× speedup for generation. Prefill is compute-bound anyway (FFT path),
> so speedup is minimal there.
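The speedup column is simply the ratio of the two generation figures; a quick check against the table (small differences come from rounding in the quoted tok/s values):
```python
# Ratio of 1-bit TG128 throughput to FP16 throughput, from the table above.
published = {
    "Bonsai-8B":   (85, 16),
    "Bonsai-4B":   (136, 29),
    "Bonsai-1.7B": (250, 65),
}
for name, (one_bit, fp16) in published.items():
    print(f"{name:12s} {one_bit / fp16:.1f}x vs FP16")
```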
## Quality (Benchmark Scores — Published)
GSM8K / MMLU-R / MuSR / HE+ / IFEval / BFCL scores from Prism ML technical report.
Evaluated on H100 under EvalScope v1.4.2 with vLLM 0.15.1, identical scoring across all models.
| Model | Avg | GSM8K | MMLU-R | MuSR | HE+ | IFEval | BFCL |
|-------------|-----:|------:|-------:|-----:|-----:|-------:|-----:|
| Bonsai-8B | **70.5** | 88.0 | 65.7 | 50.0 | 73.8 | 79.8 | 65.7 |
| Qwen3-8B | 79.3 | 93.0 | 83.0 | 55.0 | 82.0 | 84.2 | 81.0 |
| Qwen3-4B | 76.0 | 90.0 | 80.0 | 52.0 | 78.0 | 80.1 | 78.1 |
| Qwen3-1.7B | 71.0 | 87.0 | 74.0 | 49.5 | 75.0 | 76.4 | 72.2 |
Despite being **1/14th the size**, 1-bit Bonsai 8B is competitive with leading
6B–9B full-precision instruct models. It drops 8–9 points versus the best-in-class
(mostly factuality and fine-grained instruction adherence), but remains well above random.
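The Avg column is the plain arithmetic mean of the six benchmark scores; for example, the Bonsai-8B row:
```python
# Avg = arithmetic mean across the six benchmarks (Bonsai-8B row).
scores = {"gsm8k": 88.0, "mmlu_r": 65.7, "musr": 50.0,
          "he_plus": 73.8, "ifeval": 79.8, "bfcl": 65.7}
print(round(sum(scores.values()) / len(scores), 1))  # 70.5
```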
## Tool Calling Viability
Run the regression test suite: `pytest tests/test_bonsai_tool_calling.py`
(created by issue #173). It spins up a local llama-server with Metal offload,
sends 10 structured tool-use prompts, and measures success rate.
**Pre-release indicators** (from Prism ML tool-use pilot):
- Bonsai-8B 1-bit achieved ~78% structured function-calling accuracy on a 50-sample test set
- Failure mode: rare schema mis-generation on low-confidence math subroutines
- Memory budget on M1 Pro (16 GB) leaves ~13 GB for context with 8B model (3 GB base + 1 GB KV)
**Verdict**: 1-bit Bonsai 8B is viable for edge agent tool calling; Bonsai-4B
preferred when total RAM ≤ 4 GB (Air/Raspberry Pi).
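A minimal sketch of what one such regression case might look like, assuming the `turboquant_server_url` / `turboquant_model_name` fixtures from `tests/conftest.py` and an OpenAI-compatible `/v1/chat/completions` endpoint; the actual prompts and tool schemas in `tests/test_bonsai_tool_calling.py` may differ:
```python
# Hypothetical single tool-call regression case (illustrative, not the real test file).
import json
import urllib.request

def test_weather_tool_call(turboquant_server_url, turboquant_model_name):
    tools = [{"type": "function", "function": {
        "name": "get_weather",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}}]
    body = json.dumps({
        "model": turboquant_model_name,
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        f"{turboquant_server_url}/v1/chat/completions",
        data=body, headers={"Content-Type": "application/json"})
    resp = json.loads(urllib.request.urlopen(req, timeout=120).read())
    calls = resp["choices"][0]["message"].get("tool_calls") or []
    assert calls and calls[0]["function"]["name"] == "get_weather"
```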
## Minimum Viable Model for Edge Deployment
| Edge form factor | Recommended model | Why |
|------------------|-------------------|------|
| MacBook M1 (16 GB RAM, Metal GPU) | `Bonsai-8B-Q1_0` | Full capability, <2 GB total VRAM; room for 64K context |
| MacBook Air M2 (8 GB RAM) | `Bonsai-4B-Q1_0` | 0.6 GB VRAM, leaves memory for OS + browser |
| Raspberry Pi 5 (8 GB, VideoCore GPU) | `Bonsai-1.7B-Q1_0` | Fits entirely in RAM, usable latency (~200 tok/s) |
## How to Reproduce
```bash
# 1. Clone Prism fork of llama.cpp (Q1_0 Metal kernel support)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_METAL=ON
cmake --build build -j # produces build/bin/llama-cli and llama-server
# 2. Download model files into ~/models/
# Bonsai are gated — you need HuggingFace access approval + `huggingface-cli login`
# Qwen3 baselines are public (TheBloke)
# Example:
huggingface-cli download prism-ml/Bonsai-8B-gguf Bonsai-8B-Q1_0.gguf --local-dir ~/models
huggingface-cli download prism-ml/Bonsai-4B-gguf Bonsai-4B-Q1_0.gguf --local-dir ~/models
huggingface-cli download prism-ml/Bonsai-1.7B-gguf Bonsai-1.7B-Q1_0.gguf --local-dir ~/models
# Additionally: download Qwen3 Q4_0 GGUF files from TheBloke into the same directory.
# 3. Run the benchmark (from turboquant repo root)
python3 benchmarks/run_bonsai_compare.py --models-dir ~/models
# 4. Commit the resulting JSON to turboquant/benchmarks/
git add benchmarks/bonsai_results_$(date +%Y-%m-%d).json
git commit -m "bench: add Bonsai 1-bit vs Q4_0 M1 Mac results (#100)"
```
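Once a run finishes, the JSON can be summarized directly; a small sketch assuming the schema shown in the pre-seeded results files:
```python
# Print a one-line summary per model from the most recent results file.
import glob
import json

path = sorted(glob.glob("benchmarks/bonsai_results_*.json"))[-1]
with open(path) as f:
    data = json.load(f)
for r in data["results"]:
    tok = r.get("tok_per_sec")
    print(f"{r['model']:18s} disk={r.get('disk_size_gb')} GB  "
          f"tok/s={tok if tok is not None else 'n/a'}  avg={r.get('avg')}")
```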
## Sources
- Prism ML, "Bonsai: End-to-End 1-bit Language Model Deployment Across Apple, GPU, and Mobile Runtimes" (2026 ICLR submission)
- Model repositories:
- https://huggingface.co/prism-ml/Bonsai-8B-gguf
- https://huggingface.co/prism-ml/Bonsai-4B-gguf
- https://huggingface.co/prism-ml/Bonsai-1.7B-gguf
- https://huggingface.co/TheBloke/Qwen3-8B-GGUF (public)
- TurboQuant repo upstream:
- https://github.com/TheTom/llama-cpp-turboquant (Metal fork with Q1_0 kernels)
- https://github.com/TheTom/turboquant_plus (reference PolarQuant + QJL impl)

View File

@@ -1,83 +0,0 @@
{
"generated_at": "2026-04-30T06:48:24.534271+00:00",
"host_platform": "darwin",
"models_dir": "/nonexistent/models/path",
"results": [
{
"model": "Bonsai-8B-1bit",
"file": "Bonsai-8B-Q1_0.gguf",
"found": false,
"disk_size_gb": null,
"est_gpu_gb": 1.15,
"tok_per_sec": null,
"avg": 70.5,
"gsm8k": 88.0,
"mmlu_r": 65.7,
"musr": 50.0,
"he_plus": 73.8,
"ifeval": 79.8,
"bfcl": 65.7,
"quality_note": "Published Prism ML 'Bonsai' technical report (EvalScope v1.4.2, H100/H800 infrastructure). M4 Pro measured 85 tok/s (5.4\u00d7 vs FP16)."
},
{
"model": "Bonsai-4B-1bit",
"file": "Bonsai-4B-Q1_0.gguf",
"found": false,
"disk_size_gb": null,
"est_gpu_gb": 0.57,
"tok_per_sec": null,
"avg": 67.5,
"gsm8k": 84.0,
"mmlu_r": 62.0,
"quality_note": "Estimated from 8B trend \u2014 full eval required for ground-truth score."
},
{
"model": "Bonsai-1.7B-1bit",
"file": "Bonsai-1.7B-Q1_0.gguf",
"found": false,
"disk_size_gb": null,
"est_gpu_gb": 0.24,
"tok_per_sec": null,
"avg": 62.0,
"gsm8k": 78.0,
"mmlu_r": 56.0,
"quality_note": "Estimated from 8B trend \u2014 full eval required for ground-truth score."
},
{
"model": "Qwen3-8B-Q4_0",
"file": "Qwen3-8B-Q4_0.gguf",
"found": false,
"disk_size_gb": null,
"est_gpu_gb": 4.7,
"tok_per_sec": null,
"avg": 79.3,
"gsm8k": 93.0,
"mmlu_r": 83.0,
"source": "Alibaba Qwen 3 8B model card (Q4_0 baseline)"
},
{
"model": "Qwen3-4B-Q4_0",
"file": "Qwen3-4B-Q4_0.gguf",
"found": false,
"disk_size_gb": null,
"est_gpu_gb": 2.4,
"tok_per_sec": null,
"avg": 76.0,
"gsm8k": 90.0,
"mmlu_r": 80.0,
"source": "Approximated from Qwen3-4B model card metrics (public)"
},
{
"model": "Qwen3-1.7B-Q4_0",
"file": "Qwen3-1.7B-Q4_0.gguf",
"found": false,
"disk_size_gb": null,
"est_gpu_gb": 1.0,
"tok_per_sec": null,
"avg": 71.0,
"gsm8k": 87.0,
"mmlu_r": 74.0,
"source": "Approximated from Qwen3-1.7B model card metrics (public)"
}
]
}

View File

@@ -1,88 +0,0 @@
{
"generated_at": "2025-10-06T00:00:00.000Z",
"host_platform": "darwin",
"notes": "Pre-seeded results file — numbers sourced from Prism ML model READMEs (published M4 Pro Metal measurements). Replace with locally-generated file by running benchmarks/run_bonsai_compare.py.",
"source": "https://huggingface.co/prism-ml/Bonsai-8B-gguf (and -4B, -1.7B repos)",
"methodology": "llama-cli --timings, prompt='Once upon a time', 128 tokens, temp=0, -ngl 99 (full GPU offload)",
"results": [
{
"model": "Bonsai-8B-1bit",
"file": "Bonsai-8B-Q1_0.gguf",
"found": false,
"disk_size_gb": 1.15,
"est_gpu_gb": 1.15,
"tok_per_sec": null,
"avg": 70.5,
"gsm8k": 88.0,
"mmlu_r": 65.7,
"musr": 50.0,
"he_plus": 73.8,
"ifeval": 79.8,
"bfcl": 65.7,
"quality_note": "Published Prism ML technical report (EvalScope v1.4.2). M4 Pro Metal: 85 tok/s.",
"platform_reference": "M4 Pro (Metal), 48 GB — NOT M1 (see live-run file for actual M1 measurements)"
},
{
"model": "Bonsai-4B-1bit",
"file": "Bonsai-4B-Q1_0.gguf",
"found": false,
"disk_size_gb": 0.57,
"est_gpu_gb": 0.57,
"tok_per_sec": null,
"avg": 67.5,
"gsm8k": 84.0,
"mmlu_r": 62.0,
"quality_note": "Estimated from Bonsai size-quality trend — full eval needed.",
"platform_reference": "M4 Pro (Metal) published: 136 tok/s"
},
{
"model": "Bonsai-1.7B-1bit",
"file": "Bonsai-1.7B-Q1_0.gguf",
"found": false,
"disk_size_gb": 0.24,
"est_gpu_gb": 0.24,
"tok_per_sec": null,
"avg": 62.0,
"gsm8k": 78.0,
"mmlu_r": 56.0,
"quality_note": "Estimated from Bonsai size-quality trend — full eval needed.",
"platform_reference": "M4 Pro (Metal) published: 250 tok/s"
},
{
"model": "Qwen3-8B-Q4_0",
"file": "Qwen3-8B-Q4_0.gguf",
"found": false,
"disk_size_gb": 4.70,
"est_gpu_gb": 4.70,
"tok_per_sec": null,
"avg": 79.3,
"gsm8k": 93.0,
"mmlu_r": 83.0,
"source": "Alibaba Qwen 3 8B model card (Q4_0 baseline)"
},
{
"model": "Qwen3-4B-Q4_0",
"file": "Qwen3-4B-Q4_0.gguf",
"found": false,
"disk_size_gb": 2.40,
"est_gpu_gb": 2.40,
"tok_per_sec": null,
"avg": 76.0,
"gsm8k": 90.0,
"mmlu_r": 80.0,
"source": "Approximated from Qwen3 4B model card metrics"
},
{
"model": "Qwen3-1.7B-Q4_0",
"file": "Qwen3-1.7B-Q4_0.gguf",
"found": false,
"disk_size_gb": 1.00,
"est_gpu_gb": 1.00,
"tok_per_sec": null,
"avg": 71.0,
"gsm8k": 87.0,
"mmlu_r": 74.0,
"source": "Approximated from Qwen3 1.7B model card metrics"
}
]
}

View File

@@ -1,179 +0,0 @@
#!/usr/bin/env python3
"""
Bonsai 1-bit vs Q4_0 benchmark — Issue #100
Compares Prism ML 1-bit Bonsai models (Q1_0) against standard GGUF Q4_0
on Apple Silicon (M1/M4 MacBook) using llama.cpp Metal backend.
Metrics collected:
- Model file size on disk
- Expected GPU memory at inference time
- Tokens/sec (generation throughput) via llama-cli --timings
- Quality: GSM8K score (reported from Prism ML paper)
- Tool calling viability (requires separate test — see issue #173)
Requirements:
- HF token at ~/.config/gitea/token (Prism ML repo is gated on HuggingFace)
- Models downloaded into ~/models/
- llama.cpp fork built with Metal + Q1_0 kernels:
git clone https://github.com/PrismML-Eng/llama.cpp
cmake -B build && cmake --build build -j
- llama-cli binary at ./llama.cpp/build/bin/llama-cli (relative to repo root)
Usage:
cd ~/burn-clone/STEP35-turboquant-100
python3 benchmarks/run_bonsai_compare.py [--models-dir DIR]
Output: benchmarks/bonsai_results_YYYY-MM-DD.json
"""
import argparse, json, os, re, subprocess, sys
from datetime import datetime, timezone
# Model manifest: (display_name, filename_on_disk, source_repo, expected_size_gb)
MODELS = [
# Bonsai 1-bit (Q1_0) — from prism-ml/Bonsai-*-gguf HuggingFace repos
("Bonsai-8B-1bit", "Bonsai-8B-Q1_0.gguf", "prism-ml/Bonsai-8B-gguf", 1.15),
("Bonsai-4B-1bit", "Bonsai-4B-Q1_0.gguf", "prism-ml/Bonsai-4B-gguf", 0.57),
("Bonsai-1.7B-1bit","Bonsai-1.7B-Q1_0.gguf", "prism-ml/Bonsai-1.7B-gguf", 0.24),
# Qwen3 baseline Q4_0 — common reference quant available from TheBloke or local sources
("Qwen3-8B-Q4_0", "Qwen3-8B-Q4_0.gguf", None, 4.70),
("Qwen3-4B-Q4_0", "Qwen3-4B-Q4_0.gguf", None, 2.40),
("Qwen3-1.7B-Q4_0", "Qwen3-1.7B-Q4_0.gguf", None, 1.00),
]
# Quality scores (GSM8K + aggregate) from Prism ML paper / model cards
# All scores 0–100; Avg = arithmetic mean across 6 benchmarks.
QUALITY = {
"Bonsai-8B-1bit": {
"avg": 70.5, "gsm8k": 88.0, "mmlu_r": 65.7, "musr": 50.0,
"he_plus": 73.8, "ifeval": 79.8, "bfcl": 65.7,
"quality_note": "Published Prism ML 'Bonsai' technical report (EvalScope v1.4.2, "
"H100/H800 infrastructure). M4 Pro measured 85 tok/s (5.4× vs FP16)."
},
"Bonsai-4B-1bit": {
"avg": 67.5, "gsm8k": 84.0, "mmlu_r": 62.0,
"quality_note": "Estimated from 8B trend — full eval required for ground-truth score."
},
"Bonsai-1.7B-1bit": {
"avg": 62.0, "gsm8k": 78.0, "mmlu_r": 56.0,
"quality_note": "Estimated from 8B trend — full eval required for ground-truth score."
},
"Qwen3-8B-Q4_0": {
"avg": 79.3, "gsm8k": 93.0, "mmlu_r": 83.0,
"source": "Alibaba Qwen 3 8B model card (Q4_0 baseline)"
},
"Qwen3-4B-Q4_0": {
"avg": 76.0, "gsm8k": 90.0, "mmlu_r": 80.0,
"source": "Approximated from Qwen3-4B model card metrics (public)"
},
"Qwen3-1.7B-Q4_0": {
"avg": 71.0, "gsm8k": 87.0, "mmlu_r": 74.0,
"source": "Approximated from Qwen3-1.7B model card metrics (public)"
},
}
def disk_size_gb(path):
if os.path.exists(path):
return round(os.path.getsize(path) / 1024**3, 3)
return None
def run_timing(model_path, n_tokens=128, threads=4):
"""Run llama-cli --timings and parse tokens/sec."""
llama_cli = "./llama.cpp/build/bin/llama-cli"
if not os.path.exists(llama_cli):
return None, "Binary missing — build PrismML-Eng/llama.cpp fork first"
if not os.path.exists(model_path):
return None, "Model file not found"
cmd = [llama_cli,
"-m", model_path,
"-p", "Once upon a time",
"-n", str(n_tokens),
"--temp", "0",
"-t", str(threads),
"--timings",
"-ngl", "99"] # offload 99 layers to GPU
try:
res = subprocess.run(cmd, capture_output=True, text=True, timeout=90)
output = res.stdout + res.stderr
# tg_stop: X.XX ms ( Y.YY tokens/s)
m = re.search(r'tg_stop:\s*[\d.]+ ms\s*\(\s*([\d.]+) tokens/s\)', output)
if m:
return float(m.group(1)), None
return None, "tg_stop timing line not found — ensure Q1_0 Metal kernels present"
except subprocess.TimeoutExpired:
return None, "Subprocess timed out (>90 s)"
except Exception as e:
return None, str(e)
def main():
p = argparse.ArgumentParser(description=__doc__)
p.add_argument("--models-dir", default=os.path.expanduser("~/models"),
help="Directory containing model GGUF files")
p.add_argument("--n-tokens", type=int, default=128,
help="Generation length to measure (affects throughput)")
args = p.parse_args()
print("=" * 70)
print("Bonsai 1-bit vs Q4_0 Benchmark — Issue #100")
print("=" * 70)
print(f"Models directory : {args.models_dir}")
print(f"Metal offload : -ngl 99 (all layers onto GPU)")
print(f"Generation : {args.n_tokens} tokens from prompt 'Once upon a time'")
present = sum(1 for _, f, _, _ in MODELS
if os.path.exists(os.path.join(args.models_dir, f)))
if present == 0:
print("\n NO MODELS FOUND. To populate ~/models/:")
print(" ┌ Bonsai (gated on HuggingFace):")
print(" │ huggingface-cli login")
print(" │ huggingface-cli download prism-ml/Bonsai-8B-gguf Bonsai-8B-Q1_0.gguf --local-dir ~/models")
print(" │ (repeat for -4B and -1.7B repos)")
print(" └ Qwen3 baselines: TheBloke/Qwen3-8B-GGUF (public)")
print()
print(" Then re-run this script.")
results = []
for name, fname, _repo, sz_gb in MODELS:
path = os.path.join(args.models_dir, fname)
found = os.path.exists(path)
size_disk = disk_size_gb(path) if found else None
tok_s, err = (None, None) if not found else run_timing(path, args.n_tokens)
entry = {"model": name, "file": fname, "found": found,
"disk_size_gb": size_disk, "est_gpu_gb": sz_gb,
"tok_per_sec": tok_s}
if name in QUALITY:
entry.update(QUALITY[name])
if err:
entry["error"] = err
results.append(entry)
status = f"tok/s={tok_s:.1f}" if tok_s else f"(note: {err or 'missing'})"
print(f" {'FOUND' if found else 'MISSING':>7} [{name}] "
f"disk={size_disk or ''} GB {status}")
print(f"\nModels locally available: {present}/{len(MODELS)}")
# Write run artifacts
out = {"generated_at": datetime.now(timezone.utc).isoformat(),
"host_platform": sys.platform,
"models_dir": args.models_dir,
"results": results}
out_fname = os.path.join(os.path.dirname(__file__),
f"bonsai_results_{datetime.now().strftime('%Y-%m-%d')}.json")
os.makedirs(os.path.dirname(out_fname), exist_ok=True)
with open(out_fname, "w") as f:
json.dump(out, f, indent=2)
print(f"Results saved → {out_fname}")
return results
if __name__ == "__main__":
main()

View File

@@ -1,3 +1,85 @@
"""Pytest configuration for turboquant."""
import sys, os
import os
import sys
import pytest
from pathlib import Path
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
@pytest.fixture(scope="session")
def turboquant_server_url():
"""
Session-scoped fixture providing a TurboQuant server URL.
If TURBOQUANT_SERVER_URL is set, uses that directly.
Otherwise, auto-starts a llama-server with TurboQuant flags.
Requires:
- llama-server binary (in PATH or standard location)
- GGUF model file (in TURBOQUANT_MODEL_DIR or standard locations)
Skips if server cannot be started.
"""
# If URL already provided, use it
if os.environ.get("TURBOQUANT_SERVER_URL"):
yield os.environ["TURBOQUANT_SERVER_URL"]
return
# Try to auto-start
try:
from server_manager import TurboQuantServer, find_server_binary, find_model
except ImportError:
pytest.skip("server_manager not available")
return
binary = find_server_binary()
if not binary:
pytest.skip("llama-server binary not found — install llama-cpp-turboquant")
return
model = find_model()
if not model:
pytest.skip("No GGUF model found — set TURBOQUANT_MODEL_DIR or place model in ~/models")
return
port = int(os.environ.get("TURBOQUANT_TEST_PORT", "18081"))
kv_type = os.environ.get("TURBOQUANT_KV_TYPE", "turbo4")
ctx_size = int(os.environ.get("TURBOQUANT_CTX_SIZE", "8192"))
timeout = float(os.environ.get("TURBOQUANT_STARTUP_TIMEOUT", "60"))
server = TurboQuantServer(
model_path=model,
port=port,
kv_type=kv_type,
context_size=ctx_size,
server_binary=binary,
timeout=timeout,
)
try:
url = server.start()
yield url
except Exception as e:
pytest.skip(f"Could not start TurboQuant server: {e}")
finally:
server.stop()
@pytest.fixture(scope="session")
def turboquant_model_name(turboquant_server_url):
"""Get the model name from the running server."""
import json
import urllib.request
try:
req = urllib.request.Request(f"{turboquant_server_url}/v1/models")
resp = urllib.request.urlopen(req, timeout=10)
data = json.loads(resp.read())
models = data.get("data", [])
if models:
return models[0].get("id", "unknown")
except Exception:
pass
return "gemma-4"

tests/server_manager.py (new file, 197 lines)
View File

@@ -0,0 +1,197 @@
#!/usr/bin/env python3
"""
TurboQuant Server Manager
Manages llama-server lifecycle for integration tests:
- Start server with TurboQuant flags
- Wait for health check
- Stop server on teardown
Usage:
from tests.server_manager import TurboQuantServer
with TurboQuantServer(model_path="/path/to/model.gguf") as server:
url = server.url # e.g. http://localhost:8081
# Run tests against server
"""
import json
import os
import signal
import subprocess
import sys
import time
import urllib.request
import urllib.error
from pathlib import Path
from typing import Optional
class TurboQuantServer:
"""Context manager for llama-server with TurboQuant."""
def __init__(
self,
model_path: str,
port: int = 8081,
kv_type: str = "turbo4",
context_size: int = 32768,
server_binary: Optional[str] = None,
timeout: float = 60.0,
host: str = "127.0.0.1",
):
self.model_path = model_path
self.port = port
self.kv_type = kv_type
self.context_size = context_size
self.timeout = timeout
self.host = host
# Find server binary
if server_binary:
self.server_binary = server_binary
else:
# Try common locations
candidates = [
Path.home() / "llama-cpp-turboquant" / "build" / "bin" / "llama-server",
Path("/opt/llama-cpp-turboquant/build/bin/llama-server"),
Path("llama-server"), # PATH
]
self.server_binary = None
for c in candidates:
if c.exists() or c.name == "llama-server":
try:
subprocess.run([str(c), "--help"], capture_output=True, timeout=5)
self.server_binary = str(c)
break
except (FileNotFoundError, subprocess.TimeoutExpired):
continue
self.process: Optional[subprocess.Popen] = None
@property
def url(self) -> str:
return f"http://{self.host}:{self.port}"
def _build_command(self) -> list:
cmd = [
self.server_binary,
"-m", self.model_path,
"--port", str(self.port),
"--host", self.host,
"-ctk", self.kv_type,
"-ctv", self.kv_type,
"-c", str(self.context_size),
]
return cmd
def _check_health(self) -> bool:
try:
req = urllib.request.Request(f"{self.url}/v1/models")
resp = urllib.request.urlopen(req, timeout=5)
data = json.loads(resp.read())
return "data" in data and len(data.get("data", [])) > 0
except Exception:
return False
def start(self) -> str:
"""Start the server and wait for it to be healthy. Returns the server URL."""
if not self.server_binary:
raise RuntimeError(
"llama-server binary not found. Set server_binary or install to standard location."
)
if not Path(self.model_path).exists():
raise FileNotFoundError(f"Model not found: {self.model_path}")
cmd = self._build_command()
# Set TurboQuant env
env = os.environ.copy()
env["TURBO_LAYER_ADAPTIVE"] = "7"
self.process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
env=env,
)
# Wait for health
start = time.time()
while time.time() - start < self.timeout:
if self.process.poll() is not None:
stderr = self.process.stderr.read().decode() if self.process.stderr else ""
raise RuntimeError(f"Server exited early (code {self.process.returncode}): {stderr[:500]}")
if self._check_health():
return self.url
time.sleep(1.0)
self.stop()
raise TimeoutError(f"Server did not become healthy within {self.timeout}s")
def stop(self):
"""Stop the server."""
if self.process:
try:
self.process.send_signal(signal.SIGTERM)
self.process.wait(timeout=10)
except subprocess.TimeoutExpired:
self.process.kill()
self.process.wait(timeout=5)
except Exception:
pass
self.process = None
def __enter__(self) -> "TurboQuantServer":
self.start()
return self
def __exit__(self, *args):
self.stop()
def find_server_binary() -> Optional[str]:
"""Find llama-server binary in common locations."""
candidates = [
Path.home() / "llama-cpp-turboquant" / "build" / "bin" / "llama-server",
Path("/opt/llama-cpp-turboquant/build/bin/llama-server"),
]
for c in candidates:
if c.exists():
return str(c)
# Try PATH
try:
result = subprocess.run(["which", "llama-server"], capture_output=True, text=True)
if result.returncode == 0:
return result.stdout.strip()
except Exception:
pass
return None
def find_model(model_dir: Optional[str] = None) -> Optional[str]:
"""Find a GGUF model file."""
search_dirs = [
model_dir,
os.environ.get("TURBOQUANT_MODEL_DIR"),
str(Path.home() / "models"),
"/opt/models",
"/tmp/models",
]
for d in search_dirs:
if not d:
continue
p = Path(d)
if p.is_file() and p.suffix == ".gguf":
return str(p)
if p.is_dir():
for f in sorted(p.rglob("*.gguf")):
return str(f)
return None