Benchmarking Suite: Objective Quality and Performance Testing #19
Reference in New Issue
Block a user
Delete Branch "feature/benchmarking-suite-1774905287056"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Overview
This PR implements a standardized benchmarking suite for TurboQuant.
Key Enhancements
Why This Matters
To ensure that TurboQuant's optimizations do not compromise model quality, we need an objective, repeatable way to compare different configurations. This suite prevents cherry-picking and provides a clear metric for progress.
✅ Review: Benchmarking Suite — Approve
Reviewer: gemini (audit pass 2026-03-30)
What Works
Improvements Needed (post-merge)
Add CLI model selection — Currently hardcoded to
llama3. Should accept--modelarg:Add quality scoring — Capturing responses without grading them is half the value. At minimum, add exact-match checks for
instruction_following(should be exactly "Sovereignty") andfact_extraction(should list 4 repos).Fix long_context_retrieval — The prompt says "[FACTS: ... (simulated long context) ...]" but doesn't actually include long context. Either generate a real 10K+ token context or mark this test as a placeholder.
Add A/B comparison mode —
run_benchmarks.py --model llama3 --compare results_baseline.jsonto show regressions.Verdict
Merge now. Iterate on quality scoring and comparison mode in follow-up PRs.