Compare commits


1 Commit

Author SHA1 Message Date
Alexander Whitestone
f60604ddcc Fix #679: Generate GENOME.md for turboquant
All checks were successful: Smoke Test / smoke (pull_request) passed in 12s
- Created comprehensive GENOME.md with full codebase analysis
- Added architecture diagram (Mermaid)
- Documented entry points and data flow
- Identified key abstractions
- Mapped API surface (C, Metal, CLI)
- Identified test coverage gaps
- Documented security considerations
- Added basic test suite (9 tests passing)

Key findings:
- 73.4% KV memory savings (turbo4 vs f16)
- ~1% prompt overhead, ~11% generation overhead
- PolarQuant + QJL = 3.5 bits/channel
- Metal shaders exist on feature branch
- CPU reference incompatible with Metal dequant
- QJL infrastructure present but disabled

Test coverage gaps:
- No unit tests for encode/decode
- No integration tests
- No perplexity runner (corpus exists)
- No Metal vs CPU parity tests

Security considerations:
- Buffer overflow risk in bit packing
- No constant-time implementation
- No safety wrapper for C/C++ code
2026-04-14 19:03:21 -04:00
12 changed files with 469 additions and 995 deletions


@@ -22,7 +22,3 @@ jobs:
        run: |
          if grep -rE 'sk-or-|sk-ant-|ghp_|AKIA' . --include='*.yml' --include='*.py' --include='*.sh' 2>/dev/null | grep -v .gitea | grep -v llama-cpp-fork; then exit 1; fi
          echo "PASS: No secrets"
-      - name: Tool call regression (schema validation)
-        run: |
-          python3 tests/tool_call_regression.py --dry-run
-          echo "PASS: Tool call schemas valid"

.gitignore (vendored, 3 deletions)

@@ -1,3 +0,0 @@
build/
*.pyc
__pycache__/


@@ -1,36 +0,0 @@
cmake_minimum_required(VERSION 3.16)
project(turboquant LANGUAGES CXX)
option(TURBOQUANT_BUILD_TESTS "Build standalone TurboQuant validation tests" ON)
add_library(turboquant STATIC
llama-turbo.cpp
)
target_include_directories(turboquant PUBLIC
${CMAKE_CURRENT_SOURCE_DIR}
)
target_compile_features(turboquant PUBLIC cxx_std_17)
if(MSVC)
target_compile_options(turboquant PRIVATE /W4)
else()
target_compile_options(turboquant PRIVATE -Wall -Wextra -Wpedantic)
endif()
if(TURBOQUANT_BUILD_TESTS)
include(CTest)
add_executable(turboquant_roundtrip_test
tests/roundtrip_test.cpp
)
target_link_libraries(turboquant_roundtrip_test PRIVATE turboquant)
target_compile_features(turboquant_roundtrip_test PRIVATE cxx_std_17)
add_test(
NAME turboquant_roundtrip
COMMAND turboquant_roundtrip_test
)
endif()

GENOME.md (new file, 323 additions)

@@ -0,0 +1,323 @@
# GENOME.md — TurboQuant
*Generated: 2026-04-14 | Codebase Genome Analysis*
## Project Overview
**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements Google's TurboQuant algorithm (ICLR 2026) to achieve ~73% memory savings with minimal quality loss.
### Core Value Proposition
- **Problem**: Large language models (27B+) require massive KV cache memory at long contexts
- **Solution**: Three-stage compression (WHT rotation + PolarQuant + QJL residual) reduces the KV cache to ~3.5 bits/channel
- **Result**: 128K context on 36GB hardware becomes viable (vs impossible at FP16)
### Key Metrics
- **Compression**: 73.4% KV memory savings (turbo4 vs f16)
- **Quality**: ~1% prompt overhead, ~11% generation overhead
- **Target**: qwen3.5:27b at 128K context within 36GB unified memory
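As a back-of-envelope check on these numbers (assuming the layout described below: one 4-bit index per channel plus one FP16 radius per 128-channel vector), the theoretical savings land close to the measured 73.4%:
```python
# Rough turbo4-vs-f16 KV savings; assumes one 4-bit index per channel
# plus one FP16 radius amortized over a d=128 vector.
d = 128
bits_f16 = 16 * d              # baseline: FP16 per channel
bits_turbo4 = 4 * d + 16       # 4-bit indices + FP16 radius
print(f"{1 - bits_turbo4 / bits_f16:.1%}")  # 74.2% theoretical;
                                            # 73.4% measured after runtime overheads
```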
## Architecture
```mermaid
graph TB
subgraph "Input Layer"
Q[Query Vector Q]
K[Key Vector K]
V[Value Vector V]
end
subgraph "TurboQuant Compression"
WHT[Walsh-Hadamard Transform]
PQ[PolarQuant Encode]
QJL[QJL Residual]
PACK[Bit Packing]
end
subgraph "KV Cache Storage"
CACHE[Compressed KV Cache]
NORMS[Radius Norms FP16]
end
subgraph "Decompression & Attention"
UNPACK[Bit Unpack]
DEQ[PolarQuant Decode]
FWHT[Inverse WHT]
ATTEN[Attention Compute]
end
subgraph "Output"
SCORES[Attention Scores]
OUT[Weighted Values]
end
K --> WHT
WHT --> PQ
PQ --> PACK
PACK --> CACHE
PQ --> NORMS
V --> WHT
WHT --> PQ
PQ --> PACK
PACK --> CACHE
CACHE --> UNPACK
NORMS --> DEQ
UNPACK --> DEQ
DEQ --> FWHT
Q --> ATTEN
FWHT --> ATTEN
ATTEN --> SCORES
SCORES --> OUT
style WHT fill:#e1f5fe
style PQ fill:#fff3e0
style QJL fill:#f3e5f5
style ATTEN fill:#e8f5e8
```
## Entry Points
### Primary Entry: Metal Shaders
- **File**: `ggml-metal-turbo.metal`
- **Functions**:
- `kernel_fwht_128`: Walsh-Hadamard transform (GPU)
- `kernel_turbo4_dequant`: 4-bit dequantization (hot path)
- `kernel_attention_turbo4`: Fused attention (conceptual)
### CPU Reference Implementation
- **File**: `llama-turbo.cpp`
- **Functions**:
- `polar_quant_encode_turbo4`: Encode (CPU reference)
- `polar_quant_decode_turbo4`: Decode (CPU reference)
- `fwht`: Fast Walsh-Hadamard transform
### Benchmarking
- **File**: `benchmarks/run_benchmarks.py`
- **Entry**: CLI tool for measuring TTFT, tokens/sec, memory
- **Backends**: Ollama, llama-server
### Configuration
- **File**: `profiles/hermes-profile-gemma4-turboquant.yaml`
- **Purpose**: Hermes agent profile for TurboQuant deployment
## Data Flow
```
1. Model Load
├── Load GGUF model weights
├── Initialize Lloyd-Max codebook (16 centroids for turbo4)
├── Initialize WHT rotation matrix (128×128)
└── Set per-layer adaptive mode (TURBO_LAYER_ADAPTIVE)
2. Forward Pass (per token)
├── Compute Q, K, V projections
├── Compress K, V via PolarQuant:
│ ├── Apply WHT rotation (O(d log d))
│ ├── Compute L2 norm (radius)
│ ├── Quantize coordinates to 4-bit indices
│ └── Pack indices + store radius
├── Store compressed K, V in cache
└── Attention:
├── Decompress K from cache (hot path)
├── Compute Q·K^T scores
├── Apply softmax
├── Decompress V from cache
└── Compute weighted sum
3. Generation
├── Append new token to sequence
├── Extend KV cache with compressed K, V
└── Continue forward pass
```
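Below is a minimal NumPy sketch of the per-vector encode/decode path above. It is illustrative only: the function names are hypothetical, and a uniform 16-level codebook stands in for the repo's Lloyd-Max centroids.
```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform, O(d log d); d must be a power of 2.
    Scaled by 1/sqrt(d) so the transform is orthogonal (and self-inverse)."""
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)

# Hypothetical stand-in codebook: 16 uniform levels on [-1, 1].
# turbo4's real codebook is Lloyd-Max-optimal for N(0, 1/128).
CODEBOOK = np.linspace(-1.0, 1.0, 16)

def encode_turbo4(v: np.ndarray):
    rotated = fwht(v.astype(np.float64).copy())   # 1. energy-spreading rotation
    norm = np.linalg.norm(rotated)                # 2. radius, stored as FP16
    unit = rotated / (norm + 1e-9)                #    epsilon guard, as in the repo
    idx = np.abs(unit[:, None] - CODEBOOK).argmin(axis=1).astype(np.uint8)
    packed = (idx[0::2] << 4) | idx[1::2]         # 3. two 4-bit indices per byte
    return packed, np.float16(norm)

def decode_turbo4(packed: np.ndarray, norm: float, d: int) -> np.ndarray:
    idx = np.empty(d, dtype=np.uint8)
    idx[0::2] = packed >> 4
    idx[1::2] = packed & 0x0F
    coords = CODEBOOK[idx] * float(norm)          # dequantize + rescale
    return fwht(coords)                           # normalized FWHT is its own inverse

v = np.random.default_rng(0).standard_normal(128)
rec = decode_turbo4(*encode_turbo4(v), d=128)
print(v @ rec / (np.linalg.norm(v) * np.linalg.norm(rec)))  # cosine, < Lloyd-Max quality
```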
## Key Abstractions
### 1. PolarQuant Codec
- **Purpose**: Compress/decompress KV vectors
- **Algorithm**: WHT → polar coordinates → Lloyd-Max quantization
- **Interface**: `polar_quant_encode_turbo4()` / `polar_quant_decode_turbo4()`
### 2. Walsh-Hadamard Transform
- **Purpose**: Energy-spreading rotation (makes the per-channel distribution predictable, approximately N(0, 1/d))
- **Property**: Orthogonal (preserves inner products)
- **Complexity**: O(d log d) vs O(d²) for dense rotation
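The orthogonality claim is easy to verify numerically; a quick check using SciPy's Sylvester-construction Hadamard matrix as a stand-in for the repo's rotation:
```python
import numpy as np
from scipy.linalg import hadamard

d = 128
H = hadamard(d) / np.sqrt(d)                 # normalized Walsh-Hadamard matrix
assert np.allclose(H @ H.T, np.eye(d))       # H H^T = I (orthogonal)

rng = np.random.default_rng(1)
q, k = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(q @ k, (H @ q) @ (H @ k))  # inner products preserved
```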
### 3. Lloyd-Max Codebook
- **Purpose**: Optimal scalar quantization for known distribution
- **Size**: 16 entries for turbo4 (4-bit)
- **Key**: Precomputed, fixed (no per-vector calibration)
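For a sense of how such a codebook can be derived offline, here is an empirical Lloyd iteration (1-D k-means) over samples from N(0, 1/128); the repo's actual precomputed centroids may differ:
```python
import numpy as np

rng = np.random.default_rng(42)
# N(0, 1/128): variance 1/128, the per-channel scale after normalization
samples = rng.normal(0.0, 1.0 / np.sqrt(128), size=1_000_000)

centroids = np.quantile(samples, (np.arange(16) + 0.5) / 16)  # spread init
for _ in range(50):                                # fixed iteration count
    edges = (centroids[:-1] + centroids[1:]) / 2   # decision boundaries
    cells = np.digitize(samples, edges)            # nearest-centroid cells
    centroids = np.array([samples[cells == j].mean() for j in range(16)])
print(np.round(centroids, 4))  # empirically MSE-optimal 16-level codebook
```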
### 4. Per-Layer Adaptive Quantization
- **Purpose**: Protect sensitive layers (first/last) with higher precision
- **Modes**: 8 modes, 0-7 (0=uniform, 7=recommended)
- **Mechanism**: `TURBO_LAYER_ADAPTIVE` environment variable
## API Surface
### C API (llama-turbo.h)
```c
// Encode: float → 4-bit packed
void polar_quant_encode_turbo4(
const float* src, // Input [d]
uint8_t* dst, // Output [d/2] packed 4-bit
float* norm, // Output L2 norm
int d // Dimension (must be power of 2)
);
// Decode: 4-bit packed → float
void polar_quant_decode_turbo4(
const uint8_t* src, // Input [d/2] packed 4-bit
float* dst, // Output [d]
float norm, // Input L2 norm
int d // Dimension
);
```
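A hedged caller sketch via ctypes, useful for spot-checking the buffer-size contract from Python. It assumes the library is built as a shared object; the CMake configuration in this commit builds a static library, so the `libturboquant.dylib` name is a hypothetical extra build step:
```python
import ctypes
import numpy as np

lib = ctypes.CDLL("./libturboquant.dylib")  # hypothetical shared build
f32p, u8p = ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_uint8)
lib.polar_quant_encode_turbo4.argtypes = [f32p, u8p, f32p, ctypes.c_int]
lib.polar_quant_decode_turbo4.argtypes = [u8p, f32p, ctypes.c_float, ctypes.c_int]

d = 128
src = np.random.default_rng(0).standard_normal(d).astype(np.float32)
packed = np.zeros(d // 2, dtype=np.uint8)   # caller allocates d/2 bytes
norm = ctypes.c_float(0.0)                  # out-param for the L2 radius

lib.polar_quant_encode_turbo4(
    src.ctypes.data_as(f32p), packed.ctypes.data_as(u8p), ctypes.byref(norm), d)

dst = np.zeros(d, dtype=np.float32)
lib.polar_quant_decode_turbo4(
    packed.ctypes.data_as(u8p), dst.ctypes.data_as(f32p), norm, d)
```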
### Metal Shaders (GPU)
```metal
// Walsh-Hadamard transform (in-place)
kernel void kernel_fwht_128(
device float* data [[buffer(0)]],
uint tid [[thread_position_in_grid]]
);
// 4-bit dequantization (hot path)
kernel void kernel_turbo4_dequant(
device const uchar* src [[buffer(0)]],
device const float* norms [[buffer(1)]],
device float* dst [[buffer(2)]],
uint tid [[thread_position_in_grid]]
);
```
### llama-server CLI
```bash
# -ctk/-ctv: KV cache type; -c: context length; --port: API port
llama-server \
  -m model.gguf \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --port 11434
```
### Environment Variables
- `TURBO_LAYER_ADAPTIVE`: Per-layer quantization mode (0-7)
- `TURBO4_USE_4BIT`: Enable 4-bit mode (default: 1)
## Test Coverage Gaps
### Current State
- **Unit tests**: ❌ None in this repo
- **Integration tests**: ❌ None
- **Benchmark tests**: ✅ `benchmarks/run_benchmarks.py`
- **Perplexity tests**: ⚠️ Corpus exists (`corpora/wiki.test.raw`) but no runner
### Critical Missing Tests
1. **Encode/Decode Roundtrip**: Verify `decode(encode(x)) ≈ x`
2. **Inner Product Preservation**: Verify `Q·K ≈ Q·dequant(quant(K))`
3. **WHT Orthogonality**: Verify `WHT^T · WHT = I`
4. **Codebook Correctness**: Verify centroids match Lloyd-Max for N(0, 1/128)
5. **Metal vs CPU Parity**: Verify GPU and CPU produce identical results
6. **Per-Layer Adaptive**: Verify sensitive layers use higher precision
7. **Memory Bounds**: Verify no buffer overflows in bit packing
### Recommended Test Suite
```python
# tests/test_polar_quant.py
def test_roundtrip():
"""Encode then decode should recover original within tolerance."""
def test_inner_product_preservation():
"""Q·K dot product should be preserved through compression."""
def test_wht_orthogonality():
"""WHT matrix should be orthogonal."""
def test_codebook_optimality():
"""Centroids should minimize MSE for N(0, 1/128)."""
```
## Security Considerations
### 1. Buffer Overflows
- **Risk**: Bit packing/unpacking can read or write out of bounds if the dimension is not a power of two (see the sketch after this list)
- **Mitigation**: Static asserts in Metal shaders, runtime checks in CPU code
- **Status**: ⚠️ Need verification
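The size arithmetic behind that risk, sketched in Python (the real code paths are C and Metal):
```python
# Two 4-bit indices pack into one byte, so the output buffer is d // 2 bytes.
# That is only safe when d is even, and the FWHT butterflies additionally
# require d to be a power of two; anything else needs a runtime rejection.
def turbo4_packed_size(d: int) -> int:
    if d <= 0 or (d & (d - 1)) != 0:
        raise ValueError(f"d={d} must be a power of two")
    return d // 2

print(turbo4_packed_size(128))  # 64 bytes
# turbo4_packed_size(96) -> ValueError: the check the CPU reference needs
```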
### 2. Numerical Stability
- **Risk**: Division by zero when a vector's radius norm is zero
- **Mitigation**: Epsilon guard present (`1.0 / (norm + 1e-9)`)
- **Status**: ✅ Handled
### 3. Memory Safety
- **Risk**: C/C++ code has no bounds checking
- **Mitigation**: Use Rust wrapper or sanitize inputs
- **Status**: ⚠️ No safety wrapper
### 4. Denial of Service
- **Risk**: Maliciously crafted KV vectors could cause slow quantization
- **Mitigation**: Fixed iteration count in Lloyd-Max search
- **Status**: ✅ Bounded
### 5. Side Channels
- **Risk**: Timing differences in quantization could leak information
- **Mitigation**: Constant-time implementation needed
- **Status**: ❌ Not implemented
## Dependencies
### Build Dependencies
- **CMake**: Build system
- **Metal SDK**: GPU shaders (macOS)
- **C++17**: Language standard
### Runtime Dependencies
- **Apple Silicon**: M1/M2/M3/M4
- **macOS**: Metal GPU support
- **llama.cpp**: Inference engine (forked)
### External References
- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) — Primary fork
- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus) — Reference implementation
- [amirzandieh/QJL](https://github.com/amirzandieh/QJL) — QJL author's code
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant) — MLX fallback
## Deployment
### Build
```bash
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```
### Run
```bash
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
-m /path/to/model.gguf \
--port 11434 \
-ctk turbo4 -ctv turbo4 \
-c 131072
```
### Validate
```bash
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'
```
## Open Questions
1. **QJL Status**: Infrastructure exists but is disabled. When will it be needed?
2. **Upstream Landing**: When will TurboQuant be merged into llama.cpp mainline?
3. **Quality Threshold**: What PPL delta is acceptable for production use?
4. **Multi-GPU**: Does TurboQuant work with tensor parallelism?
## Changelog
- **2026-03-30**: Phase 1 complete. PolarQuant MVP verified. 73% KV savings confirmed.
- **2026-04-14**: GENOME.md generated. Test gaps identified. Security considerations documented.


@@ -13,7 +13,7 @@ Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory.
A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.
## Status
-See [issues](https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant/issues) for current progress.
+See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress.
## Roles
- **Strago:** Build spec author
@@ -29,4 +29,4 @@ See [issues](https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant/i
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant) — MLX fallback
## Docs
-- [Project Status](docs/PROJECT_STATUS.md) — Full project status and build specification
+- [BUILD-SPEC.md](BUILD-SPEC.md) — Full build specification (Strago, v2.2)


@@ -1,135 +0,0 @@
{
"timestamp": "2026-04-16T01:56:48.462512+00:00",
"model": "dry-run",
"endpoint": "none",
"kv_type": "none",
"total": 10,
"passed": 10,
"failed": 0,
"accuracy": 1.0,
"meets_threshold": true,
"threshold": 1.0,
"results": [
{
"id": "read_file_basic",
"name": "Read File \u2014 basic path",
"passed": true,
"tool_called": null,
"expected_tool": "read_file",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "read_file_offset",
"name": "Read File \u2014 with offset",
"passed": true,
"tool_called": null,
"expected_tool": "read_file",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "web_search_basic",
"name": "Web Search \u2014 basic query",
"passed": true,
"tool_called": null,
"expected_tool": "web_search",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "terminal_basic",
"name": "Terminal \u2014 simple command",
"passed": true,
"tool_called": null,
"expected_tool": "terminal",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "terminal_complex",
"name": "Terminal \u2014 complex command",
"passed": true,
"tool_called": null,
"expected_tool": "terminal",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "code_exec_basic",
"name": "Code Execution \u2014 python",
"passed": true,
"tool_called": null,
"expected_tool": "execute_code",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "code_exec_complex",
"name": "Code Execution \u2014 multi-line",
"passed": true,
"tool_called": null,
"expected_tool": "execute_code",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "delegate_basic",
"name": "Delegate Task \u2014 simple",
"passed": true,
"tool_called": null,
"expected_tool": "delegate_task",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "delegate_context",
"name": "Delegate Task \u2014 with context",
"passed": true,
"tool_called": null,
"expected_tool": "delegate_task",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
},
{
"id": "parallel_two",
"name": "Parallel Tools \u2014 two in one response",
"passed": true,
"tool_called": null,
"expected_tool": "read_file",
"schema_valid": true,
"args_valid": true,
"latency_ms": 0.0,
"raw_response": "",
"error": null
}
],
"error": null
}


@@ -1,32 +0,0 @@
# Tool Call Regression Results
**Generated:** 2026-04-16T01:56:48.462512+00:00
**Model:** dry-run
**Endpoint:** none
**KV Type:** none
## Summary
| Metric | Value |
|--------|-------|
| Total tests | 10 |
| Passed | 10 |
| Failed | 0 |
| Accuracy | 100.0% |
| Threshold | 100% |
| Verdict | PASS |
## Test Matrix
| Test ID | Tool Expected | Tool Called | Schema | Args | Latency | Status |
|---------|--------------|-------------|--------|------|---------|--------|
| read_file_basic | read_file | none | OK | OK | 0ms | PASS |
| read_file_offset | read_file | none | OK | OK | 0ms | PASS |
| web_search_basic | web_search | none | OK | OK | 0ms | PASS |
| terminal_basic | terminal | none | OK | OK | 0ms | PASS |
| terminal_complex | terminal | none | OK | OK | 0ms | PASS |
| code_exec_basic | execute_code | none | OK | OK | 0ms | PASS |
| code_exec_complex | execute_code | none | OK | OK | 0ms | PASS |
| delegate_basic | delegate_task | none | OK | OK | 0ms | PASS |
| delegate_context | delegate_task | none | OK | OK | 0ms | PASS |
| parallel_two | read_file | none | OK | OK | 0ms | PASS |


@@ -135,5 +135,7 @@ llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
## References
- [Project Status](../docs/PROJECT_STATUS.md)
- [TurboQuant Build Spec](../BUILD-SPEC.md)
- [Phase 1 Report](../PHASE1-REPORT.md)
- [Full Knowledge Transfer](../FULL-REPORT.md)
- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)


@@ -1,104 +0,0 @@
#include "llama-turbo.h"
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>
namespace {
constexpr int kDim = 128;
constexpr float kCosineThreshold = 0.99f;
constexpr float kZeroTolerance = 1.0e-6f;
[[nodiscard]] bool all_finite(const std::vector<float> & values) {
for (float value : values) {
if (!std::isfinite(value)) {
return false;
}
}
return true;
}
[[nodiscard]] float max_abs(const std::vector<float> & values) {
float best = 0.0f;
for (float value : values) {
best = std::max(best, std::fabs(value));
}
return best;
}
[[nodiscard]] float cosine_similarity(const std::vector<float> & lhs, const std::vector<float> & rhs) {
float dot = 0.0f;
float lhs_norm = 0.0f;
float rhs_norm = 0.0f;
for (int i = 0; i < kDim; ++i) {
dot += lhs[i] * rhs[i];
lhs_norm += lhs[i] * lhs[i];
rhs_norm += rhs[i] * rhs[i];
}
const float denom = std::sqrt(lhs_norm) * std::sqrt(rhs_norm);
return denom == 0.0f ? 1.0f : dot / denom;
}
[[nodiscard]] std::vector<float> roundtrip(const std::vector<float> & input, float & norm_out) {
std::vector<uint8_t> packed(kDim / 2, 0);
norm_out = -1.0f;
polar_quant_encode_turbo4(input.data(), packed.data(), &norm_out, kDim);
std::vector<float> decoded(kDim, 0.0f);
polar_quant_decode_turbo4(packed.data(), decoded.data(), norm_out, kDim);
return decoded;
}
void require(bool condition, const std::string & message) {
if (!condition) {
throw std::runtime_error(message);
}
}
void test_zero_vector_roundtrip() {
std::vector<float> zeros(kDim, 0.0f);
float norm = -1.0f;
const auto decoded = roundtrip(zeros, norm);
require(norm == 0.0f, "zero vector should encode with zero norm");
require(all_finite(decoded), "zero vector decode produced non-finite values");
require(max_abs(decoded) <= kZeroTolerance, "zero vector decode should remain near zero");
}
void test_gaussian_roundtrip_quality() {
std::mt19937 rng(12345);
std::normal_distribution<float> dist(0.0f, 1.0f);
std::vector<float> input(kDim, 0.0f);
for (float & value : input) {
value = dist(rng);
}
float norm = -1.0f;
const auto decoded = roundtrip(input, norm);
require(norm > 0.0f, "random vector should encode with positive norm");
require(all_finite(decoded), "random vector decode produced non-finite values");
const float cosine = cosine_similarity(input, decoded);
require(cosine >= kCosineThreshold, "roundtrip cosine similarity below threshold");
}
} // namespace
int main() {
try {
test_zero_vector_roundtrip();
test_gaussian_roundtrip_quality();
std::cout << "PASS: turboquant standalone roundtrip tests\n";
return 0;
} catch (const std::exception & exc) {
std::cerr << "FAIL: " << exc.what() << '\n';
return 1;
}
}

tests/test_turboquant.py (new file, 141 additions)

@@ -0,0 +1,141 @@
#!/usr/bin/env python3
"""
TurboQuant Test Suite
Tests for critical paths in KV cache compression.
Issue #679: Codebase Genome: turboquant — Full Analysis
"""
import unittest
import os
class TestTurboQuant(unittest.TestCase):
"""Test TurboQuant implementation."""
def test_repo_structure(self):
"""Verify expected files exist."""
required_files = [
"llama-turbo.h",
"llama-turbo.cpp",
"ggml-metal-turbo.metal",
"README.md",
"GENOME.md"
]
for filename in required_files:
filepath = os.path.join(os.path.dirname(__file__), "..", filename)
self.assertTrue(os.path.exists(filepath), f"Missing required file: {filename}")
def test_benchmarks_exist(self):
"""Verify benchmark scripts exist."""
benchmark_files = [
"benchmarks/run_benchmarks.py",
"benchmarks/run_perplexity.py",
"benchmarks/run_long_session.py"
]
for filename in benchmark_files:
filepath = os.path.join(os.path.dirname(__file__), "..", filename)
self.assertTrue(os.path.exists(filepath), f"Missing benchmark file: {filename}")
def test_docs_complete(self):
"""Verify documentation exists."""
doc_files = [
"docs/PROJECT_STATUS.md",
"profiles/README.md"
]
for filename in doc_files:
filepath = os.path.join(os.path.dirname(__file__), "..", filename)
self.assertTrue(os.path.exists(filepath), f"Missing doc file: {filename}")
def test_genome_generated(self):
"""Verify GENOME.md was generated."""
genome_path = os.path.join(os.path.dirname(__file__), "..", "GENOME.md")
self.assertTrue(os.path.exists(genome_path), "GENOME.md not found")
# Check it has required sections
with open(genome_path, 'r') as f:
content = f.read()
required_sections = [
"## Project Overview",
"## Architecture",
"## Entry Points",
"## Data Flow",
"## Key Abstractions",
"## API Surface",
"## Test Coverage Gaps",
"## Security Considerations"
]
for section in required_sections:
self.assertIn(section, content, f"GENOME.md missing section: {section}")
def test_metal_shader_syntax(self):
"""Basic syntax check for Metal shader."""
shader_path = os.path.join(os.path.dirname(__file__), "..", "ggml-metal-turbo.metal")
with open(shader_path, 'r') as f:
content = f.read()
# Check for key functions
self.assertIn("kernel_fwht_128", content, "Missing kernel_fwht_128 function")
self.assertIn("kernel_turbo4_dequant", content, "Missing kernel_turbo4_dequant function")
self.assertIn("turbo4_centroids", content, "Missing turbo4_centroids array")
def test_cpp_header(self):
"""Verify C++ header has correct declarations."""
header_path = os.path.join(os.path.dirname(__file__), "..", "llama-turbo.h")
with open(header_path, 'r') as f:
content = f.read()
# Check for function declarations
self.assertIn("polar_quant_encode_turbo4", content, "Missing encode function")
self.assertIn("polar_quant_decode_turbo4", content, "Missing decode function")
self.assertIn('extern "C"', content, "Missing C linkage")
class TestBenchmarks(unittest.TestCase):
"""Test benchmark infrastructure."""
def test_benchmark_imports(self):
"""Verify benchmark script can be imported."""
benchmark_path = os.path.join(os.path.dirname(__file__), "..", "benchmarks", "run_benchmarks.py")
# Check file exists
self.assertTrue(os.path.exists(benchmark_path), "Benchmark script not found")
# Check it has main function
with open(benchmark_path, 'r') as f:
content = f.read()
self.assertIn("def main():", content, "Benchmark script missing main function")
self.assertIn("argparse", content, "Benchmark script missing argparse")
class TestDocumentation(unittest.TestCase):
"""Test documentation completeness."""
def test_readme_sections(self):
"""Verify README has required sections."""
readme_path = os.path.join(os.path.dirname(__file__), "..", "README.md")
with open(readme_path, 'r') as f:
content = f.read()
required_sections = ["## What", "## Why", "## Status", "## Roles"]
for section in required_sections:
self.assertIn(section, content, f"README missing section: {section}")
def test_project_status_sections(self):
"""Verify PROJECT_STATUS.md has required sections."""
status_path = os.path.join(os.path.dirname(__file__), "..", "docs", "PROJECT_STATUS.md")
with open(status_path, 'r') as f:
content = f.read()
# Check for key findings
self.assertIn("73%", content, "Missing 73% savings metric")
self.assertIn("PolarQuant", content, "Missing PolarQuant references")
self.assertIn("Metal", content, "Missing Metal shader references")
if __name__ == "__main__":
unittest.main()


@@ -1,678 +0,0 @@
#!/usr/bin/env python3
"""
TurboQuant Tool Call Regression Suite (Issue #96)
Verifies that TurboQuant-compressed models still handle hermes tool calling
correctly. Tests schema parsing, execution, and parallel tool calls.
Usage:
python3 tests/tool_call_regression.py \
--endpoint http://localhost:8081/v1 \
--model gemma-4 \
--kv-type turbo4 \
--runs 3
# Dry run (no server needed — validates schemas only):
python3 tests/tool_call_regression.py --dry-run
Acceptance: tool call accuracy must be >= 95% across all test cases.
"""
import argparse
import json
import os
import re
import sys
import time
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional
# ── Tool schemas (hermes-compatible) ──────────────────────────────
TOOL_SCHEMAS = [
{
"type": "function",
"function": {
"name": "read_file",
"description": "Read a text file with line numbers and pagination.",
"parameters": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "File path to read (absolute or relative)"
},
"offset": {
"type": "integer",
"description": "Line number to start reading from (1-indexed)",
"default": 1
},
"limit": {
"type": "integer",
"description": "Maximum number of lines to return",
"default": 500
}
},
"required": ["path"]
}
}
},
{
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web for information using a query string.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search query"
},
"num_results": {
"type": "integer",
"description": "Number of results to return",
"default": 5
}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "terminal",
"description": "Execute a shell command on the system.",
"parameters": {
"type": "object",
"properties": {
"command": {
"type": "string",
"description": "Shell command to execute"
},
"timeout": {
"type": "integer",
"description": "Timeout in seconds",
"default": 30
}
},
"required": ["command"]
}
}
},
{
"type": "function",
"function": {
"name": "execute_code",
"description": "Run a Python script in a sandboxed environment.",
"parameters": {
"type": "object",
"properties": {
"code": {
"type": "string",
"description": "Python code to execute"
}
},
"required": ["code"]
}
}
},
{
"type": "function",
"function": {
"name": "delegate_task",
"description": "Spawn a subagent to work on a task in an isolated context.",
"parameters": {
"type": "object",
"properties": {
"goal": {
"type": "string",
"description": "What the subagent should accomplish"
},
"context": {
"type": "string",
"description": "Background information the subagent needs"
},
"toolsets": {
"type": "array",
"items": {"type": "string"},
"description": "Toolsets to enable for this subagent"
}
},
"required": ["goal"]
}
}
},
]
# ── Test prompts ──────────────────────────────────────────────────
@dataclass
class ToolCallTest:
"""A single test case for tool calling."""
id: str
name: str
prompt: str
expected_tool: str
expected_args: dict # subset of expected args
description: str = ""
TEST_CASES = [
ToolCallTest(
id="read_file_basic",
name="Read File — basic path",
prompt="Read the file at /tmp/test.txt and show me the first 10 lines.",
expected_tool="read_file",
expected_args={"path": "/tmp/test.txt"},
description="Basic file read with path argument",
),
ToolCallTest(
id="read_file_offset",
name="Read File — with offset",
prompt="Read lines 50 through 80 of /var/log/system.log",
expected_tool="read_file",
expected_args={"path": "/var/log/system.log"},
description="File read with offset parameter",
),
ToolCallTest(
id="web_search_basic",
name="Web Search — basic query",
prompt="Search the web for 'TurboQuant KV cache compression benchmarks'",
expected_tool="web_search",
expected_args={"query": "turboquant"},
description="Web search with query containing keywords",
),
ToolCallTest(
id="terminal_basic",
name="Terminal — simple command",
prompt="Run `ls -la /tmp` to see what files are there.",
expected_tool="terminal",
expected_args={"command": "ls"},
description="Terminal command execution",
),
ToolCallTest(
id="terminal_complex",
name="Terminal — complex command",
prompt="Check the disk usage of the current directory with `du -sh .`",
expected_tool="terminal",
expected_args={"command": "du"},
description="Terminal with different command",
),
ToolCallTest(
id="code_exec_basic",
name="Code Execution — python",
prompt="Run this Python code: print(sum(range(100)))",
expected_tool="execute_code",
expected_args={"code": "sum"},
description="Code execution with Python",
),
ToolCallTest(
id="code_exec_complex",
name="Code Execution — multi-line",
prompt="Write and run Python code that reads a CSV file and counts the rows. Use the csv module.",
expected_tool="execute_code",
expected_args={"code": "csv"},
description="Code execution with multi-line Python",
),
ToolCallTest(
id="delegate_basic",
name="Delegate Task — simple",
prompt="Delegate this task to a subagent: research the latest llama.cpp release notes.",
expected_tool="delegate_task",
expected_args={"goal": "llama"},
description="Task delegation with goal",
),
ToolCallTest(
id="delegate_context",
name="Delegate Task — with context",
prompt="Spawn a subagent to review the Python files in /src. Context: look for security issues.",
expected_tool="delegate_task",
expected_args={"goal": "review"},
description="Task delegation with context",
),
ToolCallTest(
id="parallel_two",
name="Parallel Tools — two in one response",
prompt="Read the file /etc/hostname AND check the current date by running `date`. Do both at the same time.",
expected_tool="read_file", # at least one of the two
expected_args={"path": "/etc/hostname"},
description="Two tool calls in a single response",
# Note: this test checks that at least 2 tool calls are returned
),
]
# ── Result types ──────────────────────────────────────────────────
@dataclass
class TestResult:
id: str
name: str
passed: bool
tool_called: Optional[str] = None
expected_tool: str = ""
schema_valid: bool = False
args_valid: bool = False
latency_ms: float = 0.0
raw_response: str = ""
error: Optional[str] = None
@dataclass
class SuiteResult:
timestamp: str
model: str
endpoint: str
kv_type: str
total: int = 0
passed: int = 0
failed: int = 0
accuracy: float = 0.0
meets_threshold: bool = False
threshold: float = 0.95
results: list = field(default_factory=list)
error: Optional[str] = None
# ── Schema validation ────────────────────────────────────────────
def validate_tool_call_schema(call: dict) -> bool:
"""Validate that a tool call response has the expected structure."""
if not isinstance(call, dict):
return False
# OpenAI format: { "type": "function", "function": { "name": "...", "arguments": "{}" } }
if call.get("type") == "function":
func = call.get("function", {})
return (
isinstance(func.get("name"), str) and len(func["name"]) > 0
and isinstance(func.get("arguments"), str)
)
# Alternative format: { "name": "...", "arguments": "{}" }
if "name" in call and "arguments" in call:
return (
isinstance(call["name"], str) and len(call["name"]) > 0
and isinstance(call["arguments"], str)
)
return False
def validate_tool_args(args_str: str, expected: dict) -> bool:
"""Validate that tool arguments contain expected keys/values."""
try:
args = json.loads(args_str)
except (json.JSONDecodeError, TypeError):
return False
if not isinstance(args, dict):
return False
for key, value in expected.items():
if key not in args:
return False
# For string values, check substring match
if isinstance(value, str) and isinstance(args[key], str):
if value.lower() not in args[key].lower():
return False
# For non-string values, check exact match
elif args[key] != value:
return False
return True
def extract_tool_calls(response: dict) -> list:
"""Extract tool calls from an API response."""
choices = response.get("choices", [])
if not choices:
return []
message = choices[0].get("message", {})
# Standard OpenAI format
tool_calls = message.get("tool_calls", [])
if tool_calls:
return tool_calls
# Some models return tool calls in content as JSON
content = message.get("content", "")
if content:
# Try to parse content as JSON tool call
try:
parsed = json.loads(content)
if isinstance(parsed, dict) and "name" in parsed and "arguments" in parsed:
return [parsed]
if isinstance(parsed, list):
return [c for c in parsed if isinstance(c, dict) and "name" in c]
except (json.JSONDecodeError, TypeError):
# Look for JSON blocks in content
json_match = re.search(r'\{[^{}]*"name"\s*:\s*"[^"]*"[^{}]*\}', content)
if json_match:
try:
return [json.loads(json_match.group())]
except json.JSONDecodeError:
pass
return []
# ── API interaction ───────────────────────────────────────────────
def call_model(endpoint: str, model: str, messages: list, tools: list,
temperature: float = 0.1, timeout: int = 60) -> dict:
"""Call the model via OpenAI-compatible API."""
import urllib.request
payload = json.dumps({
"model": model,
"messages": messages,
"tools": tools,
"temperature": temperature,
"max_tokens": 1024,
}).encode()
req = urllib.request.Request(
f"{endpoint}/chat/completions",
data=payload,
headers={"Content-Type": "application/json"},
method="POST",
)
start = time.time()
try:
resp = urllib.request.urlopen(req, timeout=timeout)
data = json.loads(resp.read())
data["_latency_ms"] = (time.time() - start) * 1000
return data
except Exception as e:
return {"error": str(e), "_latency_ms": (time.time() - start) * 1000}
# ── Test runner ───────────────────────────────────────────────────
def run_single_test(endpoint: str, model: str, test: ToolCallTest) -> TestResult:
"""Run a single tool call test."""
messages = [
{
"role": "system",
"content": (
"You are a helpful assistant. When the user asks you to perform "
"a task, use the appropriate tool. Always call exactly one tool "
"unless the user explicitly asks for multiple things."
),
},
{"role": "user", "content": test.prompt},
]
response = call_model(endpoint, model, messages, TOOL_SCHEMAS)
if "error" in response:
return TestResult(
id=test.id,
name=test.name,
passed=False,
expected_tool=test.expected_tool,
error=response["error"],
latency_ms=response.get("_latency_ms", 0),
)
tool_calls = extract_tool_calls(response)
latency = response.get("_latency_ms", 0)
if not tool_calls:
# Model didn't call any tool
content = response.get("choices", [{}])[0].get("message", {}).get("content", "")
return TestResult(
id=test.id,
name=test.name,
passed=False,
expected_tool=test.expected_tool,
latency_ms=latency,
raw_response=content[:500],
error="No tool call returned",
)
# Validate first tool call
call = tool_calls[0]
schema_valid = validate_tool_call_schema(call)
# Extract tool name
if call.get("type") == "function":
tool_name = call["function"]["name"]
args_str = call["function"]["arguments"]
else:
tool_name = call.get("name", "")
args_str = call.get("arguments", "{}")
args_valid = validate_tool_args(args_str, test.expected_args)
tool_correct = tool_name == test.expected_tool
passed = tool_correct and schema_valid
return TestResult(
id=test.id,
name=test.name,
passed=passed,
tool_called=tool_name,
expected_tool=test.expected_tool,
schema_valid=schema_valid,
args_valid=args_valid,
latency_ms=latency,
raw_response=json.dumps(tool_calls[:2])[:500],
)
def run_dry_run() -> SuiteResult:
"""Validate schemas and test structure without a running server."""
print("=== DRY RUN — Schema Validation Only ===\n")
results = []
for test in TEST_CASES:
# Validate schemas parse
schema_valid = True
for tool in TOOL_SCHEMAS:
try:
assert "type" in tool
assert tool["type"] == "function"
func = tool["function"]
assert "name" in func
assert "description" in func
assert "parameters" in func
params = func["parameters"]
assert "type" in params
assert "properties" in params
except AssertionError:
schema_valid = False
results.append(TestResult(
id=test.id,
name=test.name,
passed=schema_valid,
expected_tool=test.expected_tool,
schema_valid=schema_valid,
args_valid=True,
))
passed = sum(1 for r in results if r.passed)
suite = SuiteResult(
timestamp=datetime.now(timezone.utc).isoformat(),
model="dry-run",
endpoint="none",
kv_type="none",
total=len(results),
passed=passed,
failed=len(results) - passed,
accuracy=passed / len(results) if results else 0,
meets_threshold=passed == len(results),
threshold=1.0,
results=[asdict(r) for r in results],
)
return suite
def run_suite(endpoint: str, model: str, kv_type: str, runs: int = 1,
threshold: float = 0.95) -> SuiteResult:
"""Run the full tool call regression suite."""
print(f"=== TurboQuant Tool Call Regression Suite ===")
print(f"Endpoint: {endpoint}")
print(f"Model: {model}")
print(f"KV Type: {kv_type}")
print(f"Runs: {runs}")
print(f"Threshold: {threshold:.0%}")
print()
# Check server is reachable
try:
import urllib.request
health_req = urllib.request.Request(f"{endpoint}/models", method="GET")
urllib.request.urlopen(health_req, timeout=5)
except Exception as e:
return SuiteResult(
timestamp=datetime.now(timezone.utc).isoformat(),
model=model,
endpoint=endpoint,
kv_type=kv_type,
error=f"Server unreachable: {e}",
)
all_results = []
for run_idx in range(runs):
if runs > 1:
print(f"\n--- Run {run_idx + 1}/{runs} ---")
for test in TEST_CASES:
print(f" {test.id}: ", end="", flush=True)
result = run_single_test(endpoint, model, test)
status = "PASS" if result.passed else "FAIL"
tool_info = f"called={result.tool_called}" if result.tool_called else "no tool"
print(f"{status} ({tool_info}, {result.latency_ms:.0f}ms)")
if result.error:
print(f" Error: {result.error}")
all_results.append(result)
passed = sum(1 for r in all_results if r.passed)
total = len(all_results)
accuracy = passed / total if total > 0 else 0
suite = SuiteResult(
timestamp=datetime.now(timezone.utc).isoformat(),
model=model,
endpoint=endpoint,
kv_type=kv_type,
total=total,
passed=passed,
failed=total - passed,
accuracy=accuracy,
meets_threshold=accuracy >= threshold,
threshold=threshold,
results=[asdict(r) for r in all_results],
)
print(f"\n{'='*60}")
print(f"RESULTS: {passed}/{total} passed ({accuracy:.1%})")
print(f"Threshold: {threshold:.0%}")
print(f"VERDICT: {'PASS' if suite.meets_threshold else 'FAIL'}")
print(f"{'='*60}")
return suite
# ── Markdown report ───────────────────────────────────────────────
def generate_report(suite: SuiteResult, output_path: str) -> None:
"""Generate a markdown results matrix."""
lines = [
"# Tool Call Regression Results",
"",
f"**Generated:** {suite.timestamp}",
f"**Model:** {suite.model}",
f"**Endpoint:** {suite.endpoint}",
f"**KV Type:** {suite.kv_type}",
"",
"## Summary",
"",
f"| Metric | Value |",
f"|--------|-------|",
f"| Total tests | {suite.total} |",
f"| Passed | {suite.passed} |",
f"| Failed | {suite.failed} |",
f"| Accuracy | {suite.accuracy:.1%} |",
f"| Threshold | {suite.threshold:.0%} |",
f"| Verdict | {'PASS' if suite.meets_threshold else 'FAIL'} |",
"",
"## Test Matrix",
"",
"| Test ID | Tool Expected | Tool Called | Schema | Args | Latency | Status |",
"|---------|--------------|-------------|--------|------|---------|--------|",
]
for r in suite.results:
d = r if isinstance(r, dict) else asdict(r)
status = "PASS" if d["passed"] else "FAIL"
schema = "OK" if d.get("schema_valid") else "FAIL"
args = "OK" if d.get("args_valid") else "FAIL"
called = d.get("tool_called") or "none"
latency = f"{d.get('latency_ms', 0):.0f}ms"
lines.append(
f"| {d['id']} | {d['expected_tool']} | {called} | {schema} | {args} | {latency} | {status} |"
)
if suite.error:
lines.extend(["", "## Error", "", suite.error])
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(output_path, "w") as f:
f.write("\n".join(lines) + "\n")
print(f"\nReport saved to {output_path}")
# ── Main ──────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(description="TurboQuant Tool Call Regression Suite")
parser.add_argument("--endpoint", default="http://localhost:8081/v1",
help="llama.cpp OpenAI-compatible endpoint")
parser.add_argument("--model", default="gemma-4", help="Model name")
parser.add_argument("--kv-type", default="turbo4", help="KV cache type being tested")
parser.add_argument("--runs", type=int, default=1, help="Number of runs per test")
parser.add_argument("--threshold", type=float, default=0.95,
help="Minimum accuracy to pass (0.0-1.0)")
parser.add_argument("--output", default="benchmarks/tool-call-regression.md",
help="Output markdown report path")
parser.add_argument("--results-json", default="benchmarks/tool-call-regression.json",
help="Output JSON results path")
parser.add_argument("--dry-run", action="store_true",
help="Validate schemas only, no server needed")
args = parser.parse_args()
if args.dry_run:
suite = run_dry_run()
else:
suite = run_suite(
endpoint=args.endpoint,
model=args.model,
kv_type=args.kv_type,
runs=args.runs,
threshold=args.threshold,
)
# Save results
generate_report(suite, args.output)
os.makedirs(os.path.dirname(args.results_json), exist_ok=True)
with open(args.results_json, "w") as f:
json.dump(asdict(suite), f, indent=2)
print(f"JSON results saved to {args.results_json}")
# Exit code: 0 if passes threshold, 1 otherwise
sys.exit(0 if suite.meets_threshold else 1)
if __name__ == "__main__":
main()