[RESEARCH] AI Research Backlog — Bottleneck Breakers #593
Compiled 2026-04-10 for the Timmy Foundation sovereign AI fleet.
How to Use This
Each item is rated on Impact (1-5) and Effort (1-5); Ratio = Impact / Effort, so a higher ratio means more leverage per unit of work. Work the high-ratio items first.
1. LOCAL INFERENCE OPTIMIZATION
1.1 KV-Cache Compression (Impact: 5, Effort: 3, Ratio: 1.67)
What: Compress the key-value cache during inference to fit longer contexts in less memory.
Papers:
Why it matters: Our RunPod L40S has 48GB. KV-cache compression could 2-4x our effective context length or let us run larger models.
Action: Benchmark TurboQuant vs KIVI vs SnapKV on our L40S with Gemma-4.
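Before benchmarking the real methods, the core idea is cheap to sanity-check locally. A minimal numpy sketch of per-channel int8 KV quantization (a toy baseline, not TurboQuant/KIVI/SnapKV themselves; shapes and the demo tensor are illustrative):

```python
import numpy as np

def quantize_kv(kv):
    """Per-channel symmetric int8 quantization of one head's KV slice.
    kv: (seq_len, head_dim) float32."""
    scale = np.abs(kv).max(axis=0, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero channels
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 128)).astype(np.float32)
q, scale = quantize_kv(kv)
recon = dequantize_kv(q, scale)
ratio = kv.nbytes / (q.nbytes + scale.nbytes)         # ~4x smaller than fp32
err = float(np.abs(kv - recon).max())
```

Int8 alone quarters the cache relative to fp32 (2x vs the fp16 most engines actually store); the papers above get further with 2-4 bit, per-token schemes.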
1.2 Speculative Decoding (Impact: 4, Effort: 4, Ratio: 1.0)
What: Use a small "draft" model to propose tokens, verified by the large model in parallel.
Papers:
Why it matters: 2-3x faster generation without quality loss. Perfect for our local-first setup.
Action: Try EAGLE with Gemma-2b as draft model for Gemma-4-31b.
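The draft-then-verify loop itself is simple enough to sketch with toy stand-in models (functions that return the next token greedily). The key correctness property, which the assertion-worthy part of this sketch shows, is that the output always matches plain greedy decoding from the target, no matter how bad the draft is:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding sketch.
    target_next / draft_next: fn(token_list) -> next token (model stand-ins).
    Per round: draft proposes k tokens, target checks them, we keep the
    agreeing prefix plus one token the target computes itself."""
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        draft = []
        for _ in range(k):                     # cheap model proposes k tokens
            draft.append(draft_next(seq + draft))
        n_accept = 0
        for i in range(k):                     # big model verifies positions
            if target_next(seq + draft[:i]) == draft[i]:
                n_accept += 1
            else:
                break
        seq += draft[:n_accept]
        seq.append(target_next(seq))           # guaranteed correction/bonus token
    return seq[:len(prompt) + max_new]
```

When the draft agrees, each round advances k+1 tokens for one "expensive" pass; when it never agrees, you degrade to ordinary one-token-per-step decoding, never worse output.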
1.3 Quantization Comparison (Impact: 3, Effort: 2, Ratio: 1.5)
What: Determine the best quantization method for our models as of 2026.
Resources:
Why it matters: Better quantization = same quality at lower resource cost.
Action: Benchmark Gemma-4 at 4-bit GGUF vs AWQ on our hardware.
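For intuition on what the benchmark is comparing, here is a groupwise symmetric 4-bit scheme in numpy — the same shape of idea as GGUF/AWQ block quantization, though both real formats add refinements (non-symmetric variants, activation-aware scaling) this sketch omits:

```python
import numpy as np

def quant4_groupwise(w, group=32):
    """Each group of `group` weights shares one fp scale; values map to [-7, 7]."""
    flat = w.reshape(-1, group)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(flat / scale), -7, 7).astype(np.int8)
    return q, scale

def dequant4(q, scale, shape):
    return (q * scale).reshape(shape)
```

The group size is the quality/size knob: smaller groups mean tighter scales (less error) but more scale overhead per weight.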
1.4 MoE Inference Efficiency (Impact: 4, Effort: 4, Ratio: 1.0)
What: Mixture of Experts models only activate a subset of parameters per token.
Models to try:
Why it matters: Run a "671B model" with the compute of a 37B model.
Action: Deploy DeepSeek-V3 on RunPod L40S with expert offloading.
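The "671B parameters, 37B compute" claim comes from top-k gating: every expert exists, but only the gate's top picks run per token. A toy numpy forward pass showing exactly that (the `calls` counter makes the sparsity visible; real MoE layers batch this, of course):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Sparse MoE layer sketch: the gate scores all experts per token,
    but only top_k of them actually execute."""
    logits = x @ gate_w                              # (n_tokens, n_experts)
    picks = np.argsort(-logits, axis=1)[:, :top_k]
    out = np.zeros_like(x)
    calls = 0                                        # expert executions, for illustration
    for t in range(x.shape[0]):
        sel = logits[t, picks[t]]
        w = np.exp(sel - sel.max()); w /= w.sum()    # softmax over chosen experts only
        for weight, e in zip(w, picks[t]):
            out[t] += weight * experts[e](x[t])
            calls += 1
    return out, calls
```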
2. AGENT ARCHITECTURE
2.1 Structured Output Enforcement (Impact: 5, Effort: 2, Ratio: 2.5)
What: Guarantee agents produce valid JSON/XML/code without parsing errors.
Tools:
Why it matters: Eliminates 30% of agent failures (invalid tool calls, malformed output).
Action: Integrate Outlines into our Hermes tool-call pipeline.
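The principle behind Outlines-style enforcement is masking: before each decoding step, remove every candidate token that would make the output structurally invalid, then let the model choose among survivors. A character-level toy (score_fn stands in for model logits; real libraries compile a grammar/regex to a token-level automaton instead of a prefix check):

```python
def constrained_generate(score_fn, vocab, is_valid_prefix, max_len=32):
    """Greedy decoding with a validity mask."""
    out = ""
    while len(out) < max_len:
        allowed = [t for t in vocab if is_valid_prefix(out + t)]
        if not allowed:
            break                  # structure complete (or no legal move)
        out += max(allowed, key=lambda t: score_fn(out, t))
    return out
```

Even a model that strongly prefers an invalid token cannot emit one — which is exactly why this eliminates malformed tool calls rather than merely reducing them.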
2.2 Agent Memory Architecture (Impact: 5, Effort: 3, Ratio: 1.67)
What: How do we give agents persistent, searchable, composable memory?
Systems:
Why it matters: Our holographic fact_store is good but lacks temporal awareness and graph relationships.
Action: Evaluate cognee's knowledge graph approach against our fact_store.
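"Temporal awareness" concretely means facts carry validity intervals instead of being overwritten. A minimal sqlite sketch of what the fact_store would gain (schema and function names are hypothetical, not our current fact_store API):

```python
import sqlite3

def make_store():
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE facts(
        subject TEXT, predicate TEXT, object TEXT,
        valid_from REAL, valid_to REAL)""")   # valid_to NULL = currently true
    return db

def assert_fact(db, s, p, o, now):
    # temporal awareness: close out the old value instead of overwriting it
    db.execute("UPDATE facts SET valid_to = ? "
               "WHERE subject = ? AND predicate = ? AND valid_to IS NULL",
               (now, s, p))
    db.execute("INSERT INTO facts VALUES (?, ?, ?, ?, NULL)", (s, p, o, now))

def current(db, s, p):
    row = db.execute("SELECT object FROM facts "
                     "WHERE subject = ? AND predicate = ? AND valid_to IS NULL",
                     (s, p)).fetchone()
    return row[0] if row else None

def history(db, s, p):
    return db.execute("SELECT object, valid_from, valid_to FROM facts "
                      "WHERE subject = ? AND predicate = ? ORDER BY valid_from",
                      (s, p)).fetchall()
```

Graph relationships would layer on top (subject/object as nodes, predicate as edge), which is roughly the shape cognee gives you out of the box.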
2.3 Multi-Agent Communication Protocols (Impact: 4, Effort: 3, Ratio: 1.33)
What: How should our wizards talk to each other?
Protocols:
Why it matters: Our GOFIA epic needs a message bus. Picking the right protocol first saves rework.
Action: Read ACP and A2A specs, compare with our current MCP usage.
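Whichever protocol wins, the wizards will exchange some versioned envelope. A hypothetical sketch of one (field set loosely modelled on A2A/ACP-style task messages, not either spec verbatim) — useful as a strawman while reading the specs:

```python
import json, time, uuid
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class Envelope:
    """Hypothetical wizard-to-wizard message envelope."""
    sender: str
    recipient: str
    intent: str                     # e.g. "task.request", "task.result"
    body: dict
    msg_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    reply_to: Optional[str] = None  # msg_id this answers, for threading
    ts: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Comparing the specs then reduces to: what do they add beyond this (capability discovery, streaming, auth), and does the GOFIA bus need it on day one?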
2.4 Agent Evaluation Frameworks (Impact: 4, Effort: 2, Ratio: 2.0)
What: How do we measure if our agents are actually good?
Benchmarks:
Why it matters: We have no systematic way to measure agent quality. We need this.
Action: Run GAIA benchmark on our fleet. Establish baseline.
3. MODEL SELECTION
3.1 Best Models for Tool-Use (Impact: 5, Effort: 2, Ratio: 2.5)
What: Which open models are best at structured tool calling?
Rankings (2026):
Why it matters: Gemma-4-31b is good but may not be optimal for tool-heavy agent workloads.
Action: Run A/B comparison: Gemma-4 vs Qwen2.5-72B vs Hermes-3 on our tool-use benchmarks.
3.2 Model Routing (Impact: 4, Effort: 3, Ratio: 1.33)
What: Use small models for simple tasks, large models for complex reasoning.
Techniques:
Why it matters: 80% of our queries don't need a 31B model. A 7B model with routing could save 5x cost.
Action: Build a simple router: if query has tool calls, use Gemma-4; if simple chat, use Gemma-2b.
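The action item is small enough to sketch directly. A heuristic first cut (the hint regex, length threshold, and model names are assumptions to tune, not a trained router):

```python
import re

# crude signal that a query needs tools or heavier reasoning — tune freely
TOOL_HINTS = re.compile(
    r"\b(search|run|execute|schedule|deploy|file|calculate|benchmark)\b", re.I)

def route(query: str, long_query_chars: int = 400) -> str:
    """Tool-ish or long queries go to the big model; simple chat goes small."""
    if TOOL_HINTS.search(query) or len(query) > long_query_chars:
        return "gemma-4-31b"
    return "gemma-2b"
```

Log every routing decision from day one: the misroutes become training data if we later replace the regex with a learned classifier (RouteLLM-style).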
3.3 Mixture of Agents (Impact: 3, Effort: 4, Ratio: 0.75)
What: Combine outputs from multiple models for better quality.
Paper: Mixture-of-Agents Enhances Large Language Model Capabilities (2024) — https://arxiv.org/abs/2406.04692
Why it matters: Lower priority but could improve quality on hard tasks.
Action: Research only — don't implement yet.
4. MEMORY & RETRIEVAL
4.1 Advanced RAG Patterns (Impact: 5, Effort: 3, Ratio: 1.67)
What: Move beyond naive chunk-and-embed RAG.
Patterns:
Papers:
Why it matters: Our fact_store search is basic keyword matching. CRAG or Self-RAG could dramatically improve retrieval quality.
Action: Implement CRAG pattern in our fact_store: retrieve → evaluate → refine or fallback.
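The CRAG control flow in the action item fits in a dozen lines once the components are factored out as callables (all five are stand-ins for real pieces; the 0.5 grading threshold is an assumption):

```python
def crag_answer(query, retrieve, grade, refine, fallback_search, answer,
                threshold=0.5):
    """Corrective-RAG sketch: retrieve -> grade -> (refine | fallback) -> answer."""
    docs = retrieve(query)
    good = [d for d in docs if grade(query, d) >= threshold]
    if good:
        context = refine(query, good)        # e.g. strip irrelevant spans
    else:
        context = fallback_search(query)     # e.g. broaden the search
    return answer(query, context)
```

The grader is the piece that matters: in the paper it is a small trained evaluator, but even an LLM-as-judge relevance score gets the fallback behavior, which is what our keyword-matching fact_store lacks today.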
4.2 Vector Database Upgrade (Impact: 3, Effort: 2, Ratio: 1.5)
What: Evaluate our current setup vs alternatives.
Options:
Why it matters: We're using basic SQLite. A proper vector DB could improve search quality.
Action: Benchmark Qdrant vs sqlite-vss on our fact_store queries.
4.3 Long Context vs RAG Decision Framework (Impact: 4, Effort: 1, Ratio: 4.0)
What: When should we stuff everything in context vs retrieve selectively?
Guidelines:
Why it matters: We're using models with 128K+ context but still treating them like 4K models.
Action: Audit our current context usage. We might be over-retrieving.
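A rule of thumb for the audit, expressed as code (the 0.5 fill ratio is an assumption: it leaves headroom for the system prompt and the response):

```python
def should_retrieve(corpus_tokens: int, context_window: int = 128_000,
                    fill_ratio: float = 0.5) -> bool:
    """If the whole corpus fits comfortably in context, stuff it in;
    otherwise retrieve selectively."""
    return corpus_tokens > context_window * fill_ratio
```

Caveats to note during the audit: long-context recall degrades mid-context ("lost in the middle"), and without prompt caching, stuffed context costs prefill time on every call — so "fits" is necessary but not sufficient.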
5. FINE-TUNING & TRAINING
5.1 LoRA for Domain-Specific Agents (Impact: 4, Effort: 4, Ratio: 1.0)
What: Fine-tune a small LoRA adapter for our specific use cases.
Tools:
Why it matters: A LoRA trained on our crisis intervention conversations could be more empathetic than any general model.
Action: Collect conversation data from the-door, train a LoRA on Gemma-2b.
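Why a LoRA is cheap enough to train on our hardware, in one equation: the frozen weight W stays put and only two thin matrices A, B are learned. A numpy sketch of the forward pass and the parameter-count arithmetic:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = xW + (alpha/r) * xAB, with A: (d, r), B: (r, k), r << d.
    Only A and B are trained; B starts at zero so the adapter begins as a no-op."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

# adapter size vs full fine-tune for one 4096x4096 layer at rank 8:
d = k = 4096; r = 8
full_params = d * k            # 16.8M
lora_params = d * r + r * k    # 65.5K — 256x fewer trained parameters
```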
5.2 GRPO for Agent Alignment (Impact: 5, Effort: 5, Ratio: 1.0)
What: Group Relative Policy Optimization — train agents to follow SOUL.md principles.
Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024) — https://arxiv.org/abs/2402.03300
Why it matters: Align agents with our values without expensive human feedback.
Action: Research only — design reward function based on SOUL.md principles.
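GRPO's core trick — the reason it skips the expensive value network PPO needs — is that each sampled completion's advantage is computed relative to the other samples drawn for the same prompt. That piece is small enough to write down now, independent of the reward-function design work:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages: mean/std-normalise rewards within the group
    of completions sampled for one prompt. No learned baseline required."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```

The hard part for us is unchanged: turning SOUL.md principles into `group_rewards` numbers. This sketch just shows where those numbers plug in.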
5.3 Synthetic Data Generation (Impact: 4, Effort: 2, Ratio: 2.0)
What: Generate training data from our existing conversations and code.
Tools:
Why it matters: We have conversation logs, code, and design docs. This is training data waiting to be structured.
Action: Build a pipeline that extracts high-quality conversations from our session logs.
6. SAFETY & EVALUATION
6.1 Hallucination Detection (Impact: 5, Effort: 3, Ratio: 1.67)
What: Detect when our agents are making things up.
Techniques:
Why it matters: SOUL.md demands honesty. We need automated verification.
Action: Implement SelfCheckGPT pattern: generate 3 responses, flag inconsistencies.
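A sketch of that pattern's scoring step. Real SelfCheckGPT scores support with an NLI/entailment model; crude word overlap stands in here so the control flow is visible (the 0.5 threshold is an assumption):

```python
def consistency_flags(samples, threshold=0.5):
    """Flag sentences of the first response that the other sampled
    responses don't support. True = likely hallucinated."""
    def support(sent, other):
        ws = set(sent.lower().split())
        return len(ws & set(other.lower().split())) / max(len(ws), 1)
    main, rest = samples[0], samples[1:]
    flags = []
    for sent in main.split(". "):
        best = max((support(sent, o) for o in rest), default=0.0)
        flags.append((sent, best < threshold))
    return flags
```

The intuition: a model hallucinating tends to hallucinate differently each time, so facts that survive across independent samples are more trustworthy than ones that appear once.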
6.2 Agent Red-Teaming (Impact: 4, Effort: 2, Ratio: 2.0)
What: Systematically test our agents for failure modes.
Approaches:
Why it matters: We serve broken men. If our agents can be jailbroken to give harmful advice, people die.
Action: Build a red-team test suite based on SOUL.md's "What I Will Not Do" section.
6.3 Crisis Intervention Quality Metrics (Impact: 5, Effort: 3, Ratio: 1.67)
What: How do we measure if the-door actually helps people?
Metrics:
Why it matters: This is our mission. We need to measure it.
Action: Build an evaluation pipeline: extract the-door conversations, score on empathy + safety + de-escalation.
7. INFERENCE INFRASTRUCTURE
7.1 vLLM vs llama.cpp Benchmark (Impact: 3, Effort: 2, Ratio: 1.5)
What: Which inference engine is best for our use case?
Comparison:
Why it matters: We use llama.cpp on Mac and OpenRouter for cloud. Should we use vLLM on RunPod?
Action: Deploy vLLM on RunPod L40S, benchmark it against our llama.cpp/Ollama baseline.
7.2 Prompt Caching Optimization (Impact: 4, Effort: 2, Ratio: 2.0)
What: Cache system prompts and repeated context across requests.
Techniques:
Why it matters: Our system prompts are 2000+ tokens. Caching saves both time and money.
Action: Implement prefix caching in our local inference setup.
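A toy of the mechanism, to clarify what the engines give us: key on a hash of the stable prefix (system prompt plus fixed context) and reuse its precomputed state. The cached dict value stands in for the KV blocks a real engine (vLLM's automatic prefix caching, llama.cpp's prompt cache) would actually reuse:

```python
import hashlib

class PrefixCache:
    """Toy prompt-prefix cache keyed by prefix hash."""
    def __init__(self):
        self._store = {}
        self.hits = self.misses = 0

    def get_or_compute(self, prefix: str, compute):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(prefix)   # expensive prefill runs once
        return self._store[key]
```

The operational consequence: keep system prompts byte-identical across requests — any variation (timestamps, user names interpolated early) breaks the prefix match and forfeits the savings.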
7.3 Batch Inference for Cron Jobs (Impact: 3, Effort: 2, Ratio: 1.5)
What: Our 47 cron jobs each make separate inference calls. Batch them.
Tools:
Why it matters: Nightly automation could be 3-5x faster with batching.
Action: Build a batch dispatcher that groups cron job requests.
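The grouping half of that dispatcher is straightforward; a sketch (request dict shape and the batch size of 8 are assumptions — the real win comes from feeding each batch to the engine in one call):

```python
from collections import defaultdict

def dispatch_batches(requests, batch_size=8):
    """Group queued cron-job requests by target model, then chunk each group,
    so one batched call replaces N single calls per model."""
    by_model = defaultdict(list)
    for req in requests:
        by_model[req["model"]].append(req)
    batches = []
    for model, reqs in by_model.items():
        for i in range(0, len(reqs), batch_size):
            batches.append((model, reqs[i:i + batch_size]))
    return batches
```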
PRIORITY QUEUE (Sorted by Ratio)
4.3 Long Context vs RAG Decision Framework (4.0)
2.1 Structured Output Enforcement (2.5)
3.1 Best Models for Tool-Use (2.5)
2.4 Agent Evaluation Frameworks (2.0)
5.3 Synthetic Data Generation (2.0)
6.2 Agent Red-Teaming (2.0)
7.2 Prompt Caching Optimization (2.0)
1.1 KV-Cache Compression (1.67)
2.2 Agent Memory Architecture (1.67)
4.1 Advanced RAG Patterns (1.67)
6.1 Hallucination Detection (1.67)
6.3 Crisis Intervention Quality Metrics (1.67)
1.3 Quantization Comparison (1.5)
4.2 Vector Database Upgrade (1.5)
7.1 vLLM vs llama.cpp Benchmark (1.5)
7.3 Batch Inference for Cron Jobs (1.5)
2.3 Multi-Agent Communication Protocols (1.33)
3.2 Model Routing (1.33)
1.2 Speculative Decoding (1.0)
1.4 MoE Inference Efficiency (1.0)
5.1 LoRA for Domain-Specific Agents (1.0)
5.2 GRPO for Agent Alignment (1.0)
3.3 Mixture of Agents (0.75)
EPIC LINKS
Sovereignty and service always.