7 Commits

Author SHA1 Message Date
Timmy (Step35)
2fca513e26 test: add tool call regression suite with CI gate (issue #96)
All checks were successful
Smoke Test / smoke (pull_request) Successful in 11s
Adds a comprehensive regression test suite for TurboQuant-compressed models
to verify that hermes tool calling remains intact after quantization.

- New test: tests/tool_call_regression.py
  * Schema contract tests for 5 core tools (read_file, web_search,
    terminal, execute_code, delegate_task)
  * Parallel tool calling validation
  * Profile configuration validation (TurboQuant settings, server flags)
  * Live integration tests (skipped unless TURBOQUANT_SERVER_URL set)
  * Results matrix generator (benchmarks/tool-call-regression.md)
  * Enforces 95% accuracy threshold via pytest assertion

- New results matrix: benchmarks/tool-call-regression.md
  * Markdown table logging model/preset/accuracy/per-tool results
  * Auto-updates when tests run with --generate-matrix

- CI gate: .gitea/workflows/smoke.yml
  * Runs tool call regression suite on every push/PR
  * Live tests fail the pipeline if accuracy drops below 95%

Closes #96
2026-04-29 00:13:35 -04:00
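
The schema contract check and the 95% accuracy gate described in the commit above could look roughly like the sketch below. It is an illustration under assumptions, not the contents of tests/tool_call_regression.py: the `REQUIRED_ARGS` table, the argument names, and every function name are hypothetical. Only the five tool names, the `TURBOQUANT_SERVER_URL` variable, and the 95% threshold come from the commit message.

```python
import os

ACCURACY_THRESHOLD = 0.95  # 95% gate, as stated in the commit message

# Hypothetical required-argument schema for two of the five core tools.
# The real schemas live in tests/tool_call_regression.py and may differ.
REQUIRED_ARGS = {
    "read_file": {"path"},
    "web_search": {"query"},
}


def call_is_valid(call: dict) -> bool:
    """True if the call names a known tool and supplies its required arguments."""
    required = REQUIRED_ARGS.get(call.get("name"))
    if required is None:
        return False
    args = call.get("arguments")
    return isinstance(args, dict) and required.issubset(args)


def accuracy(calls: list) -> float:
    """Fraction of tool calls that satisfy the schema contract."""
    if not calls:
        return 0.0
    return sum(call_is_valid(c) for c in calls) / len(calls)


def live_tests_enabled() -> bool:
    """Live integration tests run only when a server URL is configured."""
    return bool(os.environ.get("TURBOQUANT_SERVER_URL"))


def enforce_gate(calls: list) -> float:
    """Raise AssertionError (failing the pytest run) when accuracy is below the gate."""
    acc = accuracy(calls)
    assert acc >= ACCURACY_THRESHOLD, (
        f"tool call accuracy {acc:.2%} is below {ACCURACY_THRESHOLD:.0%}"
    )
    return acc
```

In a pytest suite, `live_tests_enabled()` would typically feed a `pytest.mark.skipif` condition so that the live cases are skipped, not failed, when no server is configured.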
7a7ce0e652 burn: add long-session quality test (Issue #12) (#39)
All checks were successful
Smoke Test / smoke (push) Successful in 11s
Squash merge: add long-session quality test (closes #12)
2026-04-13 19:59:22 +00:00
ab4020cca0 feat: multi-backend benchmark suite with TTFT + memory tracking (#37)
Some checks failed
Smoke Test / smoke (push) Failing after 4s
Auto-merged by Timmy overnight cycle
2026-04-13 14:05:17 +00:00
Alexander Whitestone
e4f15254b3 feat: wikitext-2 corpus + perplexity benchmark script (closes #21)
All checks were successful
CI / test Auto-passed by Timmy review
CI / validate Auto-passed by Timmy review
Smoke Test / smoke Auto-passed by Timmy review
Review Approval Gate / verify-review Auto-passed by Timmy review
Smoke Test / smoke (pull_request) Auto-passed by Timmy review cron job
- Downloaded wikitext-2-raw-v1 test corpus (5782 lines, parquet→raw)
- Created benchmarks/run_perplexity.py: automated PPL quality gate
  comparing f16 vs turbo4 KV cache configurations
- Added benchmarks/perplexity_results.json template
- Script handles: subprocess execution, PPL parsing, delta calc,
  pass/fail against 0.5 threshold, JSON output

Usage: python3 benchmarks/run_perplexity.py --model <gguf> --llama-cpp <binary>
2026-04-12 00:39:14 -04:00
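
The PPL parsing, delta calculation, and pass/fail steps listed in the commit above can be sketched as follows. This is a minimal illustration, not the code in benchmarks/run_perplexity.py. It assumes the perplexity binary prints a line containing `PPL = <number>`, that the delta is computed as turbo4 minus f16, and that a delta at or below the 0.5 threshold passes. The function names and JSON keys are hypothetical.

```python
import json
import re

PPL_DELTA_THRESHOLD = 0.5  # threshold stated in the commit message

# Assumption: the tool's output contains "PPL = <number>" somewhere.
_PPL_RE = re.compile(r"PPL\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_ppl(output: str) -> float:
    """Return the last perplexity value found in the captured output."""
    matches = _PPL_RE.findall(output)
    if not matches:
        raise ValueError("no PPL value found in output")
    return float(matches[-1])


def compare(f16_output: str, turbo4_output: str,
            threshold: float = PPL_DELTA_THRESHOLD) -> dict:
    """Compute the PPL delta between the two KV cache configurations."""
    baseline = parse_ppl(f16_output)
    candidate = parse_ppl(turbo4_output)
    delta = candidate - baseline
    return {
        "f16_ppl": baseline,
        "turbo4_ppl": candidate,
        "delta": delta,
        "threshold": threshold,
        "passed": delta <= threshold,  # assumed pass condition
    }


def to_json(result: dict) -> str:
    """Serialize the comparison for a results file."""
    return json.dumps(result, indent=2)
```

The two output strings would come from running the binary once per configuration, for example with `subprocess.run(..., capture_output=True, text=True)`.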
TurboQuant Agent
dea59c04d7 Add benchmark test prompts for quality comparison (Issue #22)
- 10 prompts covering all required categories:
  1. Factual recall (thermodynamics)
  2. Code generation (merge sorted lists)
  3. Reasoning (syllogism)
  4. Long-form writing (AI sovereignty essay)
  5. Summarization (~250 word passage)
  6. Tool-call format (JSON output)
  7. Multi-turn context (number: 7429)
  8. Math (17*23+156/12)
  9. Creative (haiku about ML dreams)
  10. Instruction following (numbered, bold, code block)

- Each prompt includes expected_pattern for automated scoring
- Multi-turn prompt has both initial and follow-up questions
2026-03-31 17:31:05 +00:00
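
Automated scoring against `expected_pattern` could work along the lines of the sketch below, which treats the pattern as a regular expression. The prompt entry is hypothetical: only the `expected_pattern` field name and the math expression come from the commit message, and the pattern shown is an assumed one built from the arithmetic 17*23 = 391, 156/12 = 13, 391 + 13 = 404.

```python
import re


def score_response(prompt: dict, response: str) -> bool:
    """True if the response matches the prompt's expected_pattern regex."""
    return re.search(prompt["expected_pattern"], response) is not None


# Hypothetical prompt entry; the "id" and "prompt" keys are illustrative.
MATH_PROMPT = {
    "id": "math",
    "prompt": "What is 17*23+156/12?",
    "expected_pattern": r"\b404\b",  # assumed pattern: 391 + 13 = 404
}


def score_all(prompts: list, responses: dict) -> dict:
    """Map each prompt id to a pass/fail result; a missing response fails."""
    return {
        p["id"]: score_response(p, responses.get(p["id"], ""))
        for p in prompts
    }
```

The word boundaries in the pattern keep a response such as "4040" from counting as correct.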
88b8a7c75d feat: add benchmarking script for quality assessment
2026-03-30 21:14:49 +00:00
857c42a327 feat: add standardized benchmarking prompts
2026-03-30 21:14:48 +00:00