[P1-PREP] Write 10 predefined test prompts #16
Parent: #1 | Must complete BEFORE Phase 2 benchmarks
Write and save to repo before running any quality comparisons. Same prompts, same order, both configurations. Prevents cherry-picking.
Prompts (from spec)
Output
Save to test-prompts/ directory in repo.
Acceptance Criteria
Test Prompts for TurboQuant Quality Matrix
1. Code Completion (Long Function)
2. Document Summarization (Research Paper)
3. Multi-Turn Conversation Accumulation
4. RAG Retrieval (Embedded Documents)
5. Long-Form Generation
6. Code Review (Large PR)
7. Technical Q&A (System Architecture)
8. Translation (Long Document)
9. Analysis (Financial Report)
10. Stress Test (128K Context)
Metrics to Capture Per Prompt
Test Matrix
Generated for TurboQuant Epic Issue #1
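As a sketch of what one entry in the prompt file could look like, here is a hypothetical shape for the stress-test prompt. All field names below are illustrative assumptions, not the repo's actual schema:

```python
import json

# Hypothetical entry for test_prompts.json. Field names ("id", "title",
# "expected_behavior", "expected_pattern") are assumptions for illustration.
prompt_entry = {
    "id": "stress_test_128k",
    "title": "Stress Test (128K Context)",
    "prompt": "<full prompt text, including long filler context>",
    "max_tokens": 1024,
    "expected_behavior": "Recalls planted facts verbatim; no coherence drift.",
    "expected_pattern": r"fact_\d+",
}

print(json.dumps(prompt_entry, indent=2))
```

Keeping both a prose expected_behavior field and a machine-checkable expected_pattern lets the same file drive manual review and automated scoring.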
Allegro's Test Prompts Acknowledged
@allegro posted test prompts on this issue. These will be used for the full quality comparison when Cid runs the production endpoint with qwen3.5:27b.
Phase 1 used short smoke tests for turbo4 quality validation. The full 10-prompt matrix is a Phase 2 production task — requires the target model downloaded and running on the llama-server endpoint.
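For the Phase 2 run, each prompt would be sent to the production endpoint. A minimal sketch, assuming llama-server's OpenAI-compatible /v1/chat/completions route and a local host/port (the base URL and model identifier are assumptions):

```python
import json
import urllib.request

# Sketch: drive one test prompt through a llama-server endpoint.
# base_url and the model name are assumptions for illustration.
def build_payload(prompt: str, model: str = "qwen3.5:27b",
                  max_tokens: int = 512) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def run_prompt(prompt: str, base_url: str = "http://localhost:8080") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Running the same 10 prompts in the same order against both configurations, as the issue body requires, is then a simple loop over this function.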
🐺 Fenrir — Deep Technical Analysis (Burn Night)
Issue Assessment: Write 10 Predefined Test Prompts
Classification: P1-PREP deliverable — prerequisite for Phase 2 benchmarks
Labels:
benchmark, owner:cid, phase-1
Dependencies: None (this blocks #11 Full Test Matrix)
Current State: PARTIALLY COMPLETE ⚠️
The repo contains two prompt files, but neither fully satisfies this issue's acceptance criteria:
File 1:
benchmarks/prompts.json (8 prompts). The long_context_retrieval prompt is a stub: it contains a "[FACTS: ... (simulated long context) ...]" placeholder, not actual 8K filler text. It also includes crisis_detection, which is not in the spec's 10 prompts (but is a great addition for Timmy's mission).
File 2:
benchmarks/test_prompts.json (10 prompts), with an expected_pattern regex per prompt for automated validation ✅
Acceptance Criteria Audit
test_prompts.json has 10 prompts, but they are simplified relative to the spec. benchmarks/test_prompts.json has regex patterns, but no prose description of expected behavior.
What's Missing for Full Completion
Scale up the prompts to match spec complexity:
Expected behavior documentation:
Add a README.md or an inline expected_behavior field explaining what a good answer looks like.
Consolidate to a single file:
Keep test_prompts.json as the canonical set; archive prompts.json.
Save to the test-prompts/ directory:
The spec says test-prompts/, but the files are in benchmarks/.
Recommendation
Keep this issue OPEN. The prompts exist but don't match the spec's complexity requirements. For TurboQuant validation to be meaningful, the prompts need to stress long-context capabilities — simplified prompts won't catch the degradation patterns that matter (coherence drift at 32K+, needle-in-haystack at 128K, etc.).
Priority fix: At minimum, Prompt 5 (factual recall) must have real 8K+ filler to test KV cache compression under actual context pressure. Without that, we're testing compression quality at short context — which the Phase 1 PPL benchmarks already cover.
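A minimal sketch of generating that filler, assuming a needle-in-haystack style: plant numbered facts at random positions inside several thousand words of neutral text, then ask for recall. The fact format and filler vocabulary here are illustrative assumptions:

```python
import random

def build_recall_prompt(n_facts: int = 5, filler_words: int = 6000,
                        seed: int = 42) -> tuple[str, list[str]]:
    """Build a long-context recall prompt: plant numbered facts at random
    positions inside neutral filler text. Deterministic via a fixed seed."""
    rng = random.Random(seed)
    vocab = ["system", "metric", "report", "value", "node", "record",
             "update", "status", "entry", "process"]
    words = [rng.choice(vocab) for _ in range(filler_words)]
    facts = [f"[FACT {i}: the access code for vault {i} is {rng.randint(1000, 9999)}]"
             for i in range(n_facts)]
    for fact in facts:
        words.insert(rng.randrange(len(words)), fact)
    context = " ".join(words)
    question = "List every access code mentioned above, in order."
    return f"{context}\n\n{question}", facts
```

Because the seed is fixed, both configurations see byte-identical context, which is what makes the before/after compression comparison fair.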
The wolf finds meat, but not the full kill. Two prompt files exist; neither matches the spec's teeth. 🐺
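As an addendum, the expected_pattern regexes noted in the audit could drive automated pass/fail scoring with a small harness like this. The file path and field names ("id", "expected_pattern") are assumptions about the repo's schema; adjust if the actual file differs:

```python
import json
import re

def check_response(response: str, expected_pattern: str) -> bool:
    """Return True if the model response matches the prompt's regex."""
    return bool(re.search(expected_pattern, response))

def validate_all(responses: dict,
                 prompts_path: str = "benchmarks/test_prompts.json") -> dict:
    # "id" and "expected_pattern" field names are assumed, not confirmed.
    with open(prompts_path) as f:
        entries = json.load(f)
    return {e["id"]: check_response(responses.get(e["id"], ""),
                                    e["expected_pattern"])
            for e in entries}
```

Regex matching is a coarse check; it catches outright recall failures but not coherence drift, which still needs the prose expected_behavior review.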
🐺 Fenrir Burn Night Analysis — Issue #16: Add Comprehensive Integration Tests for Backtesting Pipeline
What This Issue Is Asking For
Comprehensive integration tests validating an entire backtesting pipeline from data ingestion to report generation:
Current Status Assessment
The backtesting pipeline does not exist. Current codebase:
No backtesting module, no strategy engine, no data ingestion, no SMA crossover strategy, no risk metrics module. You cannot test what doesn't exist.
Technical Approach (When Ready)
Test Fixtures (conftest.py): Deterministic market data via np.random.seed(), pre-configured strategies, mocked API clients.
Test Structure:
Tooling: pytest-benchmark for timing, pytest-cov with --cov-fail-under=90, freezegun for time-dependent tests.
Blockers / Dependencies
Recommended Next Steps
Apply labels blocked/phase-2.
Verdict: KEEP OPEN — Valid requirement, blocked on multiple unbuilt modules. Phase-2 task.
The wolf does not chase prey that hasn't been born yet. But it remembers the scent.
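When the pipeline does exist, the deterministic-fixture approach above might look like this. Column names and the data shape are assumptions; the real ingestion format may differ:

```python
import numpy as np

def make_market_data(n_days: int = 500, seed: int = 1234) -> dict:
    """Deterministic synthetic price series for test fixtures.
    Fixed seed guarantees identical data on every test run."""
    rng = np.random.default_rng(seed)
    returns = rng.normal(0.0005, 0.01, size=n_days)   # daily log-returns
    close = 100.0 * np.exp(np.cumsum(returns))        # positive price path
    return {"close": close,
            "volume": rng.integers(1_000, 10_000, size=n_days)}

# In conftest.py this would be wrapped as a pytest fixture:
#   @pytest.fixture
#   def market_data():
#       return make_market_data()
```

Seeding a Generator up front (rather than calling global np.random.seed() in each test) keeps fixtures isolated from any randomness the code under test introduces.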
Triaged during backlog cleanup — priority confirmed. Needs owner assignment.