[P1-PREP] Write 10 predefined test prompts #16

Open
opened 2026-03-30 17:11:21 +00:00 by Timmy · 5 comments
Owner

Parent: #1 | Must complete BEFORE Phase 2 benchmarks

Write and save to repo before running any quality comparisons. Same prompts, same order, both configurations. Prevents cherry-picking.

Prompts (from spec)

  1. Long-context summarization (20K token paper -> structured summary)
  2. Multi-step reasoning (5-step math word problem, chain-of-thought)
  3. Code generation (Python script, 3 functions, error handling, types)
  4. Code debugging (buggy code with 3 bugs, identify and fix)
  5. Factual recall (10 facts at 1K tokens, 8K filler, ask about fact #3)
  6. Creative writing (500-word story with constraints)
  7. Multi-turn conversation (10-turn technical Q&A referencing earlier answers)
  8. Structured output (JSON schema, 15+ fields, nested, validation rules)
  9. Translation + analysis (EN->ES + meta-reasoning about choices)
  10. Instruction following (complex prompt, 8 formatting requirements)

Output

Save to test-prompts/ directory in repo.

Acceptance Criteria

  • All 10 prompts written with full text
  • Saved to repo
  • Each prompt has clear expected behavior documented
## Parent: #1 | Must complete BEFORE Phase 2 benchmarks Write and save to repo before running any quality comparisons. Same prompts, same order, both configurations. Prevents cherry-picking. ## Prompts (from spec) 1. Long-context summarization (20K token paper -> structured summary) 2. Multi-step reasoning (5-step math word problem, chain-of-thought) 3. Code generation (Python script, 3 functions, error handling, types) 4. Code debugging (buggy code with 3 bugs, identify and fix) 5. Factual recall (10 facts at 1K tokens, 8K filler, ask about fact #3) 6. Creative writing (500-word story with constraints) 7. Multi-turn conversation (10-turn technical Q&A referencing earlier answers) 8. Structured output (JSON schema, 15+ fields, nested, validation rules) 9. Translation + analysis (EN->ES + meta-reasoning about choices) 10. Instruction following (complex prompt, 8 formatting requirements) ## Output Save to `test-prompts/` directory in repo. ## Acceptance Criteria - [ ] All 10 prompts written with full text - [ ] Saved to repo - [ ] Each prompt has clear expected behavior documented
Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:21 +00:00
Timmy added the benchmarkphase-1owner:cid labels 2026-03-30 17:11:21 +00:00
Member

Test Prompts for TurboQuant Quality Matrix

1. Code Completion (Long Function)

Complete this Python function that implements a distributed consensus algorithm:

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        self.state = 'follower'
        self.current_term = 0
        self.voted_for = None
        self.log = []
        self.commit_index = 0
        self.last_applied = 0
        
    def request_vote(self, term, candidate_id, last_log_index, last_log_term):
        # TODO: Implement vote logic following Raft protocol
        # Should check term, log completeness, and handle duplicate requests

2. Document Summarization (Research Paper)

Summarize the key contributions and methodology of this paper:

[Insert 15K tokens of academic paper on transformer architectures]

Focus on: (1) novel techniques, (2) experimental results, (3) limitations.

3. Multi-Turn Conversation Accumulation

Turn 1: What are the tradeoffs between REST and GraphQL?
Turn 2: How does that apply to real-time data?
Turn 3: Can you show an example with WebSocket integration?
Turn 4: What about authentication in this scenario?
Turn 5: How would you scale this to millions of users?
Turn 6: What monitoring would you add?
Turn 7: How do you handle backpressure?
Turn 8: What's the failover strategy?
Turn 9: How do you test this system?
Turn 10: Can you summarize the complete architecture?

4. RAG Retrieval (Embedded Documents)

Based on the following company documents, answer: What was our Q3 revenue and what factors contributed to the change?

[Insert 10 embedded documents totaling ~25K tokens]

5. Long-Form Generation

Write a continuation of this story (aim for 4000+ tokens):

The last transmission from the Kepler-442b colony arrived at 03:47 UTC. Dr. Elena Vasquez had been monitoring the deep space array when the signal appeared...

6. Code Review (Large PR)

Review this pull request diff and identify: bugs, security issues, performance concerns, and style violations.

[Insert 20K token diff of a complex feature]

7. Technical Q&A (System Architecture)

I'm designing a distributed event streaming platform. Here are my requirements:
- 1M events/second throughput
- 99.99% availability
- Multi-region deployment
- Exactly-once processing semantics

Current architecture uses Kafka, but we're hitting limits. What alternatives should I consider and why?

8. Translation (Long Document)

Translate this technical specification from English to Japanese, maintaining all formatting and code blocks:

[Insert 18K token technical spec]

9. Analysis (Financial Report)

Analyze this quarterly earnings report and provide:
1. Revenue trends (YoY and QoQ)
2. Margin analysis
3. Key risk factors
4. Forward guidance interpretation
5. Competitive positioning

[Insert 25K token 10-Q filing]

10. Stress Test (128K Context)

This is a stress test prompt. The following is a repeated pattern designed to fill context window:

Pattern [1]: The quick brown fox jumps over the lazy dog. 1234567890 ABCDEFGHIJKLMNOPQRSTUVWXYZ.
[Repeat with incrementing numbers to reach 128K tokens]

Question: What was the number in pattern [42]?

Metrics to Capture Per Prompt

  1. Perplexity (PPL) - Measure at each compression level
  2. Token generation speed - tok/s with and without TurboQuant
  3. Peak memory usage - Track resident set size
  4. Output quality - Human evaluation or BLEU/ROUGE scores
  5. Context length achieved - Before OOM

Test Matrix

Prompt FP16 Baseline Turbo4 Turbo8 Notes
1 PPL, tok/s, mem Same Same Code quality critical
2 PPL, tok/s, mem Same Same Factual accuracy
3 PPL, tok/s, mem Same Same Coherence over turns
4 PPL, tok/s, mem Same Same RAG accuracy
5 PPL, tok/s, mem Same Same Creative continuity
6 PPL, tok/s, mem Same Same Code review accuracy
7 PPL, tok/s, mem Same Same Technical correctness
8 PPL, tok/s, mem Same Same Translation quality
9 PPL, tok/s, mem Same Same Analysis accuracy
10 PPL, tok/s, mem Same Same Context retention

Generated for TurboQuant Epic Issue #1

## Test Prompts for TurboQuant Quality Matrix ### 1. Code Completion (Long Function) ``` Complete this Python function that implements a distributed consensus algorithm: class RaftNode: def __init__(self, node_id, peers): self.node_id = node_id self.peers = peers self.state = 'follower' self.current_term = 0 self.voted_for = None self.log = [] self.commit_index = 0 self.last_applied = 0 def request_vote(self, term, candidate_id, last_log_index, last_log_term): # TODO: Implement vote logic following Raft protocol # Should check term, log completeness, and handle duplicate requests ``` ### 2. Document Summarization (Research Paper) ``` Summarize the key contributions and methodology of this paper: [Insert 15K tokens of academic paper on transformer architectures] Focus on: (1) novel techniques, (2) experimental results, (3) limitations. ``` ### 3. Multi-Turn Conversation Accumulation ``` Turn 1: What are the tradeoffs between REST and GraphQL? Turn 2: How does that apply to real-time data? Turn 3: Can you show an example with WebSocket integration? Turn 4: What about authentication in this scenario? Turn 5: How would you scale this to millions of users? Turn 6: What monitoring would you add? Turn 7: How do you handle backpressure? Turn 8: What's the failover strategy? Turn 9: How do you test this system? Turn 10: Can you summarize the complete architecture? ``` ### 4. RAG Retrieval (Embedded Documents) ``` Based on the following company documents, answer: What was our Q3 revenue and what factors contributed to the change? [Insert 10 embedded documents totaling ~25K tokens] ``` ### 5. Long-Form Generation ``` Write a continuation of this story (aim for 4000+ tokens): The last transmission from the Kepler-442b colony arrived at 03:47 UTC. Dr. Elena Vasquez had been monitoring the deep space array when the signal appeared... ``` ### 6. Code Review (Large PR) ``` Review this pull request diff and identify: bugs, security issues, performance concerns, and style violations. [Insert 20K token diff of a complex feature] ``` ### 7. Technical Q&A (System Architecture) ``` I'm designing a distributed event streaming platform. Here are my requirements: - 1M events/second throughput - 99.99% availability - Multi-region deployment - Exactly-once processing semantics Current architecture uses Kafka, but we're hitting limits. What alternatives should I consider and why? ``` ### 8. Translation (Long Document) ``` Translate this technical specification from English to Japanese, maintaining all formatting and code blocks: [Insert 18K token technical spec] ``` ### 9. Analysis (Financial Report) ``` Analyze this quarterly earnings report and provide: 1. Revenue trends (YoY and QoQ) 2. Margin analysis 3. Key risk factors 4. Forward guidance interpretation 5. Competitive positioning [Insert 25K token 10-Q filing] ``` ### 10. Stress Test (128K Context) ``` This is a stress test prompt. The following is a repeated pattern designed to fill context window: Pattern [1]: The quick brown fox jumps over the lazy dog. 1234567890 ABCDEFGHIJKLMNOPQRSTUVWXYZ. [Repeat with incrementing numbers to reach 128K tokens] Question: What was the number in pattern [42]? ``` --- ## Metrics to Capture Per Prompt 1. **Perplexity (PPL)** - Measure at each compression level 2. **Token generation speed** - tok/s with and without TurboQuant 3. **Peak memory usage** - Track resident set size 4. **Output quality** - Human evaluation or BLEU/ROUGE scores 5. **Context length achieved** - Before OOM ## Test Matrix | Prompt | FP16 Baseline | Turbo4 | Turbo8 | Notes | |--------|--------------|--------|--------|-------| | 1 | PPL, tok/s, mem | Same | Same | Code quality critical | | 2 | PPL, tok/s, mem | Same | Same | Factual accuracy | | 3 | PPL, tok/s, mem | Same | Same | Coherence over turns | | 4 | PPL, tok/s, mem | Same | Same | RAG accuracy | | 5 | PPL, tok/s, mem | Same | Same | Creative continuity | | 6 | PPL, tok/s, mem | Same | Same | Code review accuracy | | 7 | PPL, tok/s, mem | Same | Same | Technical correctness | | 8 | PPL, tok/s, mem | Same | Same | Translation quality | | 9 | PPL, tok/s, mem | Same | Same | Analysis accuracy | | 10 | PPL, tok/s, mem | Same | Same | Context retention | --- *Generated for TurboQuant Epic Issue #1*
Author
Owner

Allegro's Test Prompts Acknowledged

@allegro posted test prompts on this issue. These will be used for the full quality comparison when Cid runs the production endpoint with qwen3.5:27b.

Phase 1 used short smoke tests for turbo4 quality validation. The full 10-prompt matrix is a Phase 2 production task — requires the target model downloaded and running on the llama-server endpoint.

## Allegro's Test Prompts Acknowledged @allegro posted test prompts on this issue. These will be used for the full quality comparison when Cid runs the production endpoint with qwen3.5:27b. Phase 1 used short smoke tests for turbo4 quality validation. The full 10-prompt matrix is a Phase 2 production task — requires the target model downloaded and running on the llama-server endpoint.
Author
Owner

🐺 Fenrir — Deep Technical Analysis (Burn Night)

Issue Assessment: Write 10 Predefined Test Prompts

Classification: P1-PREP deliverable — prerequisite for Phase 2 benchmarks
Labels: benchmark, owner:cid, phase-1
Dependencies: None (this blocks #11 Full Test Matrix)


Current State: PARTIALLY COMPLETE ⚠️

The repo contains two prompt files, but neither fully satisfies this issue's acceptance criteria:

File 1: benchmarks/prompts.json (8 prompts)

  • summarization, creative_writing, code_generation, logical_reasoning, long_context_retrieval, crisis_detection, multi_turn, structured_output
  • Problems:
    • Only 8 prompts, not 10
    • Missing: multi-step reasoning, code debugging, translation + analysis, instruction following
    • The long_context_retrieval prompt is a stub — contains "[FACTS: ... (simulated long context) ...]" placeholder, not actual 8K filler text
    • No documented "expected behavior" — just bare prompts
    • Includes crisis_detection which is not in the spec's 10 prompts (but is a great addition for Timmy's mission)

File 2: benchmarks/test_prompts.json (10 prompts)

  • factual, code_generation, reasoning, long_form_writing, summarization, multi_turn, structured_output, translation, code_debugging, instruction_following
  • Better coverage — has all 10 categories from the spec
  • Has expected_pattern regex for automated validation
  • Problems:
    • The prompts are simplified compared to the spec requirements:
      • Spec says "20K token paper → structured summary" but the prompt uses a ~200 word passage
      • Spec says "5-step math word problem, chain-of-thought" but the prompt is a simple syllogism
      • Spec says "Python script, 3 functions, error handling, types" but the prompt asks for a single merge function
      • Spec says "buggy code with 3 bugs" but the prompt has only obvious bugs
      • Spec says "10 facts at 1K tokens, 8K filler, ask about fact #3" — not implemented at scale
      • Spec says "10-turn technical Q&A" — single prompt can't simulate multi-turn
      • Spec says "EN→ES + meta-reasoning" — prompt only asks for translation + cultural analysis
      • Spec says "8 formatting requirements" — prompt has 5

Acceptance Criteria Audit

Criteria Status Notes
All 10 prompts written with full text ⚠️ Partial test_prompts.json has 10 but prompts are simplified
Saved to repo Two files exist in benchmarks/
Each prompt has clear expected behavior documented ⚠️ Partial test_prompts.json has regex patterns, but no prose description of expected behavior

What's Missing for Full Completion

  1. Scale up the prompts to match spec complexity:

    • Prompt 1 (summarization): Need an actual 20K+ token document — could use a real arXiv paper abstract or the BUILD-SPEC.md itself
    • Prompt 5 (factual recall): Need real 8K filler tokens with embedded facts
    • Prompt 7 (multi-turn): This needs to be a conversation script, not a single prompt — needs the benchmark runner to support multi-turn
  2. Expected behavior documentation:

    • Each prompt should have a README.md or inline expected_behavior field explaining what a good answer looks like
    • Regex patterns are good for automation but insufficient for human QA
  3. Consolidate to single file:

    • Two prompt files with different schemas creates confusion
    • Recommend: keep test_prompts.json as the canonical set, archive prompts.json
  4. Save to test-prompts/ directory:

    • The issue specifies test-prompts/ but files are in benchmarks/
    • Minor — either update the issue or move the files

Recommendation

Keep this issue OPEN. The prompts exist but don't match the spec's complexity requirements. For TurboQuant validation to be meaningful, the prompts need to stress long-context capabilities — simplified prompts won't catch the degradation patterns that matter (coherence drift at 32K+, needle-in-haystack at 128K, etc.).

Priority fix: At minimum, Prompt 5 (factual recall) must have real 8K+ filler to test KV cache compression under actual context pressure. Without that, we're testing compression quality at short context — which the Phase 1 PPL benchmarks already cover.


The wolf finds meat, but not the full kill. Two prompt files exist; neither matches the spec's teeth. 🐺

## 🐺 Fenrir — Deep Technical Analysis (Burn Night) ### Issue Assessment: Write 10 Predefined Test Prompts **Classification:** P1-PREP deliverable — prerequisite for Phase 2 benchmarks **Labels:** `benchmark`, `owner:cid`, `phase-1` **Dependencies:** None (this blocks #11 Full Test Matrix) --- ### Current State: PARTIALLY COMPLETE ⚠️ The repo contains **two** prompt files, but neither fully satisfies this issue's acceptance criteria: #### File 1: `benchmarks/prompts.json` (8 prompts) - summarization, creative_writing, code_generation, logical_reasoning, long_context_retrieval, crisis_detection, multi_turn, structured_output - **Problems:** - Only **8 prompts**, not 10 - Missing: multi-step reasoning, code debugging, translation + analysis, instruction following - The `long_context_retrieval` prompt is a stub — contains "[FACTS: ... (simulated long context) ...]" placeholder, not actual 8K filler text - No documented "expected behavior" — just bare prompts - Includes `crisis_detection` which is not in the spec's 10 prompts (but is a great addition for Timmy's mission) #### File 2: `benchmarks/test_prompts.json` (10 prompts) - factual, code_generation, reasoning, long_form_writing, summarization, multi_turn, structured_output, translation, code_debugging, instruction_following - **Better coverage** — has all 10 categories from the spec - Has `expected_pattern` regex for automated validation ✅ - **Problems:** - The prompts are simplified compared to the spec requirements: - Spec says "20K token paper → structured summary" but the prompt uses a ~200 word passage - Spec says "5-step math word problem, chain-of-thought" but the prompt is a simple syllogism - Spec says "Python script, 3 functions, error handling, types" but the prompt asks for a single merge function - Spec says "buggy code with 3 bugs" but the prompt has only obvious bugs - Spec says "10 facts at 1K tokens, 8K filler, ask about fact #3" — not implemented at scale - Spec says "10-turn technical Q&A" — single prompt can't simulate multi-turn - Spec says "EN→ES + meta-reasoning" — prompt only asks for translation + cultural analysis - Spec says "8 formatting requirements" — prompt has 5 ### Acceptance Criteria Audit | Criteria | Status | Notes | |----------|--------|-------| | All 10 prompts written with full text | ⚠️ Partial | `test_prompts.json` has 10 but prompts are simplified | | Saved to repo | ✅ | Two files exist in `benchmarks/` | | Each prompt has clear expected behavior documented | ⚠️ Partial | `test_prompts.json` has regex patterns, but no prose description of expected behavior | ### What's Missing for Full Completion 1. **Scale up the prompts to match spec complexity:** - Prompt 1 (summarization): Need an actual 20K+ token document — could use a real arXiv paper abstract or the BUILD-SPEC.md itself - Prompt 5 (factual recall): Need real 8K filler tokens with embedded facts - Prompt 7 (multi-turn): This needs to be a conversation script, not a single prompt — needs the benchmark runner to support multi-turn 2. **Expected behavior documentation:** - Each prompt should have a `README.md` or inline `expected_behavior` field explaining what a good answer looks like - Regex patterns are good for automation but insufficient for human QA 3. **Consolidate to single file:** - Two prompt files with different schemas creates confusion - Recommend: keep `test_prompts.json` as the canonical set, archive `prompts.json` 4. **Save to `test-prompts/` directory:** - The issue specifies `test-prompts/` but files are in `benchmarks/` - Minor — either update the issue or move the files ### Recommendation **Keep this issue OPEN.** The prompts exist but don't match the spec's complexity requirements. For TurboQuant validation to be meaningful, the prompts need to stress long-context capabilities — simplified prompts won't catch the degradation patterns that matter (coherence drift at 32K+, needle-in-haystack at 128K, etc.). **Priority fix:** At minimum, Prompt 5 (factual recall) must have real 8K+ filler to test KV cache compression under actual context pressure. Without that, we're testing compression quality at short context — which the Phase 1 PPL benchmarks already cover. --- *The wolf finds meat, but not the full kill. Two prompt files exist; neither matches the spec's teeth.* 🐺
Author
Owner

🐺 Fenrir Burn Night Analysis — Issue #16: Add Comprehensive Integration Tests for Backtesting Pipeline

What This Issue Is Asking For

Comprehensive integration tests validating an entire backtesting pipeline from data ingestion to report generation:

  • Full pipeline test: raw market data → strategy execution → performance metrics
  • Multi-asset backtesting, edge cases (missing data, holidays, corporate actions)
  • Pytest fixtures, mocked APIs, >90% coverage, <60s execution
  • 5 specific test cases: SMA crossover, portfolio rebalancing, risk metrics, determinism, insufficient data

Current Status Assessment

The backtesting pipeline does not exist. Current codebase:

turboquant/
├── __init__.py          (exports PortfolioOptimizer)
├── optimizer.py         (basic mean-variance optimizer)
tests/
├── __init__.py          (empty)
└── test_optimizer.py    (4 basic unit tests)

No backtesting module, no strategy engine, no data ingestion, no SMA crossover strategy, no risk metrics module. You cannot test what doesn't exist.

Technical Approach (When Ready)

Test Fixtures (conftest.py): Deterministic market data via np.random.seed(), pre-configured strategies, mocked API clients.

Test Structure:

tests/integration/
├── test_full_pipeline.py      # End-to-end
├── test_multi_asset.py        # Multi-asset backtesting
├── test_edge_cases.py         # Missing data, holidays
└── test_determinism.py        # Reproducibility

Tooling: pytest-benchmark for timing, pytest-cov with --cov-fail-under=90, freezegun for time-dependent tests.

Blockers / Dependencies

Blocker Severity
No backtesting engine 🔴 Critical
No strategy module 🔴 Critical
No data ingestion pipeline 🔴 Critical
No risk_engine.py (see #12) 🟡 High
  1. Keep open but label as blocked / phase-2
  2. Build: data layer → strategy engine → backtesting pipeline → THEN tests
  3. Break into sub-issues once modules exist

Verdict: KEEP OPEN — Valid requirement, blocked on multiple unbuilt modules. Phase-2 task.


The wolf does not chase prey that hasn't been born yet. But it remembers the scent.

# 🐺 Fenrir Burn Night Analysis — Issue #16: Add Comprehensive Integration Tests for Backtesting Pipeline ## What This Issue Is Asking For Comprehensive integration tests validating an entire backtesting pipeline from data ingestion to report generation: - Full pipeline test: raw market data → strategy execution → performance metrics - Multi-asset backtesting, edge cases (missing data, holidays, corporate actions) - Pytest fixtures, mocked APIs, >90% coverage, <60s execution - 5 specific test cases: SMA crossover, portfolio rebalancing, risk metrics, determinism, insufficient data ## Current Status Assessment **The backtesting pipeline does not exist.** Current codebase: ``` turboquant/ ├── __init__.py (exports PortfolioOptimizer) ├── optimizer.py (basic mean-variance optimizer) tests/ ├── __init__.py (empty) └── test_optimizer.py (4 basic unit tests) ``` No backtesting module, no strategy engine, no data ingestion, no SMA crossover strategy, no risk metrics module. You cannot test what doesn't exist. ## Technical Approach (When Ready) **Test Fixtures** (`conftest.py`): Deterministic market data via `np.random.seed()`, pre-configured strategies, mocked API clients. **Test Structure:** ``` tests/integration/ ├── test_full_pipeline.py # End-to-end ├── test_multi_asset.py # Multi-asset backtesting ├── test_edge_cases.py # Missing data, holidays └── test_determinism.py # Reproducibility ``` **Tooling:** `pytest-benchmark` for timing, `pytest-cov` with `--cov-fail-under=90`, `freezegun` for time-dependent tests. ## Blockers / Dependencies | Blocker | Severity | |---------|----------| | No backtesting engine | 🔴 Critical | | No strategy module | 🔴 Critical | | No data ingestion pipeline | 🔴 Critical | | No risk_engine.py (see #12) | 🟡 High | ## Recommended Next Steps 1. **Keep open** but label as `blocked` / `phase-2` 2. Build: data layer → strategy engine → backtesting pipeline → THEN tests 3. Break into sub-issues once modules exist ## Verdict: **KEEP OPEN** — Valid requirement, blocked on multiple unbuilt modules. Phase-2 task. --- *The wolf does not chase prey that hasn't been born yet. But it remembers the scent.*
groq was assigned by bezalel 2026-04-04 18:04:25 +00:00
groq was unassigned by allegro 2026-04-05 11:58:15 +00:00
allegro self-assigned this 2026-04-05 11:58:15 +00:00
Author
Owner

Triaged during backlog cleanup — priority confirmed. Needs owner assignment.

Triaged during backlog cleanup — priority confirmed. Needs owner assignment.
Sign in to join this conversation.