Alexander Whitestone
|
27ebfa3525
|
Fix #11: Full test matrix — 10 prompts + quality + performance
Smoke Test / smoke (pull_request) Successful in 13s
Test matrix runner (benchmarks/run_test_matrix.py) implementing all
acceptance criteria from #11:
Quality Tests:
- 10 practical prompts with expected-pattern matching
- Perplexity proxy (WikiText-2 chunks)
- Needle-in-Haystack at 8K/16K/32K contexts
- Multi-turn context retention (prompt #7)
Performance Tests:
- tok/s at 4K/8K/16K context
- TTFT proxy measurement
- Peak memory (macOS/Linux)
- Context ceiling binary search
Outputs:
- JSON: reports/test-matrix-YYYY-MM-DD.json
- Markdown: reports/test-matrix-YYYY-MM-DD.md
- Go/No-Go assessment with issue list
Smoke test: 10/10 quality, 3/3 needle-in-haystack on qwen2.5:7b.
Refs: Timmy_Foundation/turboquant#11
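The "context ceiling binary search" item above can be sketched roughly as follows. This is not the code in benchmarks/run_test_matrix.py. The names `find_context_ceiling`, `fits`, and `fake_probe` are hypothetical, and the sketch assumes that once a context length fails, every larger one fails too.

```python
from typing import Callable


def find_context_ceiling(fits: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the largest context length in [lo, hi] for which fits() is True.

    Assumes fits() is monotonic: if one length fails, all larger lengths fail.
    Returns 0 when even `lo` does not fit.
    """
    if not fits(lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid       # mid works, so the ceiling is at least mid
        else:
            hi = mid - 1   # mid fails, so the ceiling is below mid
    return lo


# Stand-in probe (hypothetical): pretend the model fails above 20,000 tokens.
# A real probe would launch the model at that context size and report success.
def fake_probe(ctx_tokens: int) -> bool:
    return ctx_tokens <= 20_000


ceiling = find_context_ceiling(fake_probe, lo=4_096, hi=32_768)
print(ceiling)
```

Each probe in a real run is expensive (a full model load), so the point of the binary search is to need only about log2(hi - lo) probes instead of a linear sweep.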
|
2026-04-14 22:10:39 -04:00 |
|
|
|
7a7ce0e652
|
burn: add long-session quality test (Issue #12) (#39)
Smoke Test / smoke (push) Successful in 11s
Squash merge: add long-session quality test (closes #12)
|
2026-04-13 19:59:22 +00:00 |
|
|
|
ab4020cca0
|
feat: multi-backend benchmark suite with TTFT + memory tracking (#37)
Smoke Test / smoke (push) Failing after 4s
Auto-merged by Timmy overnight cycle
|
2026-04-13 14:05:17 +00:00 |
|
Alexander Whitestone
|
e4f15254b3
|
feat: wikitext-2 corpus + perplexity benchmark script (closes #21)
CI / test Auto-passed by Timmy review
CI / validate Auto-passed by Timmy review
Smoke Test / smoke Auto-passed by Timmy review
Review Approval Gate / verify-review Auto-passed by Timmy review
Smoke Test / smoke (pull_request) Auto-passed by Timmy review cron job
- Downloaded wikitext-2-raw-v1 test corpus (5782 lines, parquet→raw)
- Created benchmarks/run_perplexity.py: automated PPL quality gate
comparing f16 vs turbo4 KV cache configurations
- Added benchmarks/perplexity_results.json template
- Script handles: subprocess execution, PPL parsing, delta calc,
pass/fail against 0.5 threshold, JSON output
Usage: python3 benchmarks/run_perplexity.py --model <gguf> --llama-cpp <binary>
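A minimal sketch of the parse, delta, and pass/fail steps listed above, under stated assumptions: the regular expression, the function names, and the sample output strings are illustrative and not taken from benchmarks/run_perplexity.py. The sketch also assumes the 0.5 threshold is an absolute difference in perplexity points and that only an increase over f16 counts against turbo4.

```python
import json
import re

PPL_THRESHOLD = 0.5  # value from the commit message; interpretation is assumed

# Assumed output shape ("PPL = <number>"); adjust to what the binary prints.
PPL_RE = re.compile(r"PPL\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_ppl(output: str) -> float:
    """Extract the first perplexity value from captured tool output."""
    match = PPL_RE.search(output)
    if match is None:
        raise ValueError("no PPL value found in output")
    return float(match.group(1))


def quality_gate(f16_output: str, turbo4_output: str,
                 threshold: float = PPL_THRESHOLD) -> dict:
    """Compare two runs and decide pass/fail on the perplexity delta."""
    baseline = parse_ppl(f16_output)
    candidate = parse_ppl(turbo4_output)
    delta = candidate - baseline
    return {
        "f16_ppl": baseline,
        "turbo4_ppl": candidate,
        "delta": round(delta, 4),
        "threshold": threshold,
        "passed": delta <= threshold,
    }


# Placeholder strings for illustration only, not measured results.
result = quality_gate("Final estimate: PPL = 6.10 +/- 0.04",
                      "Final estimate: PPL = 6.35 +/- 0.04")
print(json.dumps(result, indent=2))
```

In the actual script the two output strings would come from subprocess runs of the llama.cpp binary, one per KV cache configuration.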
|
2026-04-12 00:39:14 -04:00 |
|
TurboQuant Agent
|
dea59c04d7
|
Add benchmark test prompts for quality comparison (Issue #22)
- 10 prompts covering all required categories:
1. Factual recall (thermodynamics)
2. Code generation (merge sorted lists)
3. Reasoning (syllogism)
4. Long-form writing (AI sovereignty essay)
5. Summarization (~250 word passage)
6. Tool-call format (JSON output)
7. Multi-turn context (number: 7429)
8. Math (17*23+156/12)
9. Creative (haiku about ML dreams)
10. Instruction following (numbered, bold, code block)
- Each prompt includes expected_pattern for automated scoring
- Multi-turn prompt has both initial and follow-up questions
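Automated scoring against `expected_pattern` could look something like the sketch below. Only the `expected_pattern` field name comes from the commit message. The other keys, the regular expressions, and the sample responses are hypothetical. The math pattern uses 17*23 + 156/12 = 391 + 13 = 404.

```python
import re

# Hypothetical prompt entries; the real prompt file may be shaped differently.
PROMPTS = [
    {"id": 7, "category": "multi-turn", "expected_pattern": r"\b7429\b"},
    {"id": 8, "category": "math", "expected_pattern": r"\b404\b"},
]


def score_response(prompt: dict, response: str) -> bool:
    """A response passes if its text matches the prompt's expected pattern."""
    return re.search(prompt["expected_pattern"], response) is not None


def score_all(prompts: list, responses: dict) -> tuple:
    """Return (passed, total); a missing response counts as a failure."""
    passed = sum(score_response(p, responses.get(p["id"], "")) for p in prompts)
    return passed, len(prompts)


# Made-up model responses for illustration; the second one is deliberately wrong.
responses = {
    7: "The number you gave me earlier was 7429.",
    8: "17*23 = 391 and 156/12 = 13, so the total is 403.",
}
passed, total = score_all(PROMPTS, responses)
print(f"{passed}/{total} prompts passed")
```

Pattern matching of this kind checks only that a key token appears in the response, so it suits prompts with a single verifiable answer better than the long-form or creative categories.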
|
2026-03-31 17:31:05 +00:00 |
|
|
|
88b8a7c75d
|
feat: add benchmarking script for quality assessment
|
2026-03-30 21:14:49 +00:00 |
|
|
|
857c42a327
|
feat: add standardized benchmarking prompts
|
2026-03-30 21:14:48 +00:00 |
|