[RESEARCH] AI Research Backlog — Bottleneck Breakers #593

Closed
opened 2026-04-11 00:53:30 +00:00 by Timmy · 0 comments

AI Research Backlog — Bottleneck Breakers

Compiled 2026-04-10 for the Timmy Foundation sovereign AI fleet.


How to Use This

Each item is rated:

  • Impact: How much it could improve our stack (1-5)
  • Effort: How hard to implement (1-5, lower = easier)
  • Ratio: Impact/Effort — higher is better for tonight's work

1. LOCAL INFERENCE OPTIMIZATION

1.1 KV-Cache Compression (Impact: 5, Effort: 3, Ratio: 1.67)

What: Compress the key-value cache during inference to fit longer contexts in less memory.
Papers:

  • KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (2024) — https://arxiv.org/abs/2402.02750
  • SnapKV: LLM Knows What You Are Looking For Before Generation (2024) — https://arxiv.org/abs/2404.14469
  • TurboQuant (our repo) — already partially implemented on RunPod
    Why it matters: Our RunPod L40S has 48GB. KV-cache compression could 2-4x our effective context length or let us run larger models.
    Action: Benchmark TurboQuant vs KIVI vs SnapKV on our L40S with Gemma-4.
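
Back-of-envelope arithmetic shows why this matters on a 48GB card. The layer and head counts below are placeholders for a 31B-class model, not Gemma-4's real config:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    # keys + values: 2 cached tensors per layer
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Hypothetical 31B-class config (NOT Gemma-4's actual numbers):
layers, kv_heads, head_dim = 48, 8, 128
fp16 = kv_bytes_per_token(layers, kv_heads, head_dim, 2)     # 16-bit cache
int2 = kv_bytes_per_token(layers, kv_heads, head_dim, 0.25)  # KIVI-style 2-bit

ctx = 128_000
print(f"fp16 KV @128K: {fp16 * ctx / 2**30:.1f} GiB")   # ~23.4 GiB
print(f"2-bit KV @128K: {int2 * ctx / 2**30:.1f} GiB")  # ~2.9 GiB
```

Under these assumptions the fp16 cache alone eats half the card at 128K context, which is the whole case for compression.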

1.2 Speculative Decoding (Impact: 4, Effort: 4, Ratio: 1.0)

What: Use a small "draft" model to propose tokens, verified by the large model in parallel.
Papers:

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (2023) — https://arxiv.org/abs/2305.09781
  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (2024) — https://arxiv.org/abs/2401.10774
  • EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2024) — https://arxiv.org/abs/2401.15077
    Why it matters: 2-3x faster generation without quality loss. Perfect for our local-first setup.
    Action: Try EAGLE with Gemma-2b as draft model for Gemma-4-31b.
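
The accept/verify loop at the heart of speculative decoding is small enough to sketch. This uses greedy verification, with toy integer-token callables standing in for the Gemma-2b draft and Gemma-4 target:

```python
def speculative_step(prefix, draft, target, k=4):
    """One round of greedy speculative decoding.

    draft/target: callables mapping a token list to the next token.
    Returns the tokens accepted this round (always >= 1, thanks to
    the target's correction or bonus token)."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):                 # cheap draft model proposes k tokens
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:
        want = target(ctx)             # in practice: one batched forward pass
        if t == want:
            accepted.append(t)         # draft agreed, keep it
            ctx.append(t)
        else:
            accepted.append(want)      # replace first mismatch, stop
            break
    else:
        accepted.append(target(ctx))   # all k accepted: free bonus token
    return accepted

# Toy models over integer tokens: target counts up, draft agrees until 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 0
print(speculative_step([0], draft, target))  # [1, 2, 3, 4]
```

Every round emits at least one target-verified token, which is why quality is unchanged while throughput improves.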

1.3 Quantization Comparison (Impact: 3, Effort: 2, Ratio: 1.5)

What: Which quantization method is best for our models in 2026?
Resources:

  • GGUF (llama.cpp) — still the gold standard for CPU/Apple Silicon
  • AWQ (Activation-aware Weight Quantization) — better quality than GPTQ at low bits
  • QuIP# — 2-bit quantization with surprisingly good quality
  • HQQ (Half-Quadratic Quantization) — no calibration data needed
    Why it matters: Better quantization = same quality at lower resource cost.
    Action: Benchmark Gemma-4 at 4-bit GGUF vs AWQ on our hardware.
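
For intuition, here is a minimal round-to-nearest 4-bit scheme with per-group scales. Real GGUF/AWQ formats are considerably more sophisticated (activation-aware scaling, mixed block layouts); this only shows the basic scale-and-round mechanic:

```python
def quantize_4bit(ws, group=4):
    """Symmetric round-to-nearest 4-bit quantization per group
    (a toy scheme, not the real GGUF or AWQ layout)."""
    out = []
    for i in range(0, len(ws), group):
        g = ws[i:i + group]
        scale = max(abs(w) for w in g) / 7 or 1.0   # int4 signed range: -7..7
        q = [max(-7, min(7, round(w / scale))) for w in g]
        out.append((scale, q))
    return out

def dequantize(groups):
    return [scale * q for scale, qs in groups for q in qs]

ws = [0.1, -0.7, 0.35, 0.02, 1.4, -1.4, 0.0, 0.7]
approx = dequantize(quantize_4bit(ws))
err = max(abs(a - w) for a, w in zip(approx, ws))
print(f"max abs error: {err:.3f}")
```

The per-group max error is bounded by half the group's scale, which is why smaller groups trade metadata overhead for quality.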

1.4 MoE Inference Efficiency (Impact: 4, Effort: 4, Ratio: 1.0)

What: Mixture of Experts models only activate a subset of parameters per token.
Models to try:

  • DeepSeek-V3 — 671B params, 37B active, open-weight
  • Qwen2.5-MoE — strong multilingual, efficient inference
  • Mixtral 8x22B — proven MoE architecture
    Why it matters: Run a "671B model" with the compute of a 37B model.
    Action: Deploy DeepSeek-V3 on RunPod L40S with expert offloading.
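
The routing step that makes MoE cheap is tiny. A sketch of Mixtral-style top-2 gating over softmaxed expert logits (the logit values are made up):

```python
import math

def route(gate_logits, k=2):
    """Top-k gating: softmax over experts, keep the k largest,
    renormalize so the kept weights sum to 1."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]   # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    s = sum(probs[i] for i in top)
    return [(i, probs[i] / s) for i in top]

# 8 hypothetical experts; only 2 run per token, so ~2/8 of the FFN compute.
print(route([0.1, 2.0, -1.0, 0.3, 1.5, 0.0, 0.2, -0.5]))
```

Only the selected experts' weights need to be resident per token, which is what expert offloading exploits.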

2. AGENT ARCHITECTURE

2.1 Structured Output Enforcement (Impact: 5, Effort: 2, Ratio: 2.5)

What: Guarantee agents produce valid JSON/XML/code without parsing errors.
Tools:

  • Outlines (github.com/outlines-dev/outlines) — regex/grammar constrained generation
  • Instructor (github.com/jxnl/instructor) — Pydantic-based structured extraction
  • Guidance (github.com/guidance-ai/guidance) — interleaved generation and logic
    Why it matters: Eliminates 30% of agent failures (invalid tool calls, malformed output).
    Action: Integrate Outlines into our Hermes tool-call pipeline.
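
The core idea behind Outlines-style constrained generation (mask away tokens that would break the grammar, then sample from the survivors) can be shown with a brute-force prefix check. Outlines itself compiles the constraint to a finite-state machine instead of checking every candidate:

```python
def constrained_pick(scored_vocab, text, is_viable_prefix):
    """One greedy constrained-decoding step: drop every candidate token
    that would make the output an illegal prefix, then take the
    highest-scoring survivor."""
    legal = [(score, tok) for score, tok in scored_vocab
             if is_viable_prefix(text + tok)]
    if not legal:
        raise ValueError("no legal continuation")
    return max(legal)[1]

# Toy constraint: the output must build toward the exact quoted string
# "ok", standing in for a JSON-string grammar.
viable = lambda s: '"ok"'.startswith(s)
vocab = [(0.9, "o"), (0.8, '"'), (0.5, "k"), (0.1, '"x')]
out = ""
while out != '"ok"':
    out += constrained_pick(vocab, out, viable)
print(out)  # "ok"
```

Note the model "wants" to emit `o` first (score 0.9) but the mask forces the opening quote, which is exactly how malformed tool calls get eliminated at decode time rather than caught by a parser.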

2.2 Agent Memory Architecture (Impact: 5, Effort: 3, Ratio: 1.67)

What: How do we give agents persistent, searchable, composable memory?
Systems:

  • MemGPT/Letta (github.com/cpacker/MemGPT) — tiered memory with OS-style paging
  • LangMem (github.com/langchain-ai/langmem) — LangChain-native memory
  • cognee (github.com/topoteretes/cognee) — knowledge graph + vector hybrid
  • Zep (github.com/getzep/zep) — temporal knowledge graph for conversations
    Why it matters: Our holographic fact_store is good but lacks temporal awareness and graph relationships.
    Action: Evaluate cognee's knowledge graph approach against our fact_store.

2.3 Multi-Agent Communication Protocols (Impact: 4, Effort: 3, Ratio: 1.33)

What: How should our wizards talk to each other?
Protocols:

  • ACP (Agent Communication Protocol) — IBM's open standard
  • A2A (Agent-to-Agent) — Google's protocol
  • MCP (Model Context Protocol) — Anthropic's tool/resource protocol (we already use this)
  • NATS — lightweight messaging (we've explored this)
    Why it matters: Our GOFIA epic needs a message bus. Picking the right protocol first saves rework.
    Action: Read ACP and A2A specs, compare with our current MCP usage.

2.4 Agent Evaluation Frameworks (Impact: 4, Effort: 2, Ratio: 2.0)

What: How do we measure if our agents are actually good?
Benchmarks:

  • SWE-bench — software engineering tasks
  • GAIA — general AI assistants (real-world questions)
  • AgentBench — multi-turn agent tasks
  • WebArena — web navigation tasks
  • CRMArena — customer relationship tasks
    Why it matters: We have no systematic way to measure agent quality. We need this.
    Action: Run GAIA benchmark on our fleet. Establish baseline.

3. MODEL SELECTION

3.1 Best Models for Tool-Use (Impact: 5, Effort: 2, Ratio: 2.5)

What: Which open models are best at structured tool calling?
Rankings (2026):

  1. Qwen2.5-72B-Instruct — best overall tool use
  2. DeepSeek-V3 — strongest reasoning + tool use
  3. Llama 3.1 405B — largest open model, excellent instruction following
  4. Command R+ — built for RAG and tool use
  5. Hermes-3-Llama-3.1-70B — Nous Research's agent-tuned model
    Why it matters: Gemma-4-31b is good but may not be optimal for tool-heavy agent workloads.
    Action: Run A/B comparison: Gemma-4 vs Qwen2.5-72B vs Hermes-3 on our tool-use benchmarks.

3.2 Model Routing (Impact: 4, Effort: 3, Ratio: 1.33)

What: Use small models for simple tasks, large models for complex reasoning.
Techniques:

  • RouteLLM (github.com/lm-sys/RouteLLM) — learned routing between model pairs
  • FrugalGPT — cascade of models with early exit
  • Hybrid LLM — simple classifier decides which model to use
    Why it matters: 80% of our queries don't need a 31B model; routing them to a 7B-class model could cut inference cost roughly 5x.
    Action: Build a simple router: if query has tool calls, use Gemma-4; if simple chat, use Gemma-2b.
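
The proposed router can start as a few lines of heuristics. The keyword hints and model names below are illustrative placeholders, not a tuned classifier (a learned router like RouteLLM would replace the regex):

```python
import re

# Illustrative tool-intent hints; assumed keywords, not measured ones.
TOOL_HINTS = re.compile(r"\b(search|run|fetch|execute|browse|deploy)\b", re.I)

def pick_model(query: str, history_turns: int = 0) -> str:
    """Route tool-flavored or long/complex queries to the big model,
    everything else to the small one."""
    needs_tools = bool(TOOL_HINTS.search(query))
    long_reasoning = len(query) > 400 or history_turns > 10
    return "gemma-4-31b" if needs_tools or long_reasoning else "gemma-2b"

print(pick_model("hey, how's it going?"))                   # gemma-2b
print(pick_model("search the docs and run the benchmark"))  # gemma-4-31b
```

Even a crude router like this gives a measurable baseline to compare a learned router against.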

3.3 Mixture of Agents (Impact: 3, Effort: 4, Ratio: 0.75)

What: Combine outputs from multiple models for better quality.
Paper: Mixture-of-Agents Enhances Large Language Model Capabilities (2024) — https://arxiv.org/abs/2406.04692
Why it matters: Lower priority but could improve quality on hard tasks.
Action: Research only — don't implement yet.


4. MEMORY & RETRIEVAL

4.1 Advanced RAG Patterns (Impact: 5, Effort: 3, Ratio: 1.67)

What: Move beyond naive chunk-and-embed RAG.
Patterns:

  • Corrective RAG (CRAG) — retrieval evaluator checks quality, falls back to web search
  • Self-RAG — model decides when to retrieve and how to use results
  • Graph RAG (Microsoft) — build knowledge graphs from documents, query the graph
  • Agentic RAG — agent orchestrates retrieval steps
    Papers:
  • Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (2023) — https://arxiv.org/abs/2310.11511
  • From Local to Global: A Graph RAG Approach to Query-Focused Summarization (2024) — https://arxiv.org/abs/2404.16130
    Why it matters: Our fact_store search is basic keyword matching. CRAG or Self-RAG could dramatically improve retrieval quality.
    Action: Implement CRAG pattern in our fact_store: retrieve → evaluate → refine or fallback.
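
The CRAG control flow from the action item, sketched with stub callables standing in for the real retriever, relevance grader, answer refiner, and web fallback:

```python
def crag_answer(query, retrieve, evaluate, refine, web_search, threshold=0.5):
    """Corrective-RAG pipeline: retrieve, grade each hit, refine the
    good ones, fall back to web search when everything grades poorly.
    All four callables are stand-ins for real components."""
    docs = retrieve(query)
    graded = [(evaluate(query, d), d) for d in docs]
    good = [d for score, d in graded if score >= threshold]
    if good:
        return refine(query, good)            # enough relevant evidence
    return refine(query, web_search(query))   # corrective fallback

# Stub wiring to exercise the control flow:
retrieve = lambda q: ["cats purr to self-soothe", "tax law 2019"]
evaluate = lambda q, d: 1.0 if "cats" in d else 0.0
refine = lambda q, docs: f"answer from {len(docs)} doc(s)"
web = lambda q: ["web result"]
print(crag_answer("why do cats purr?", retrieve, evaluate, refine, web))
```

The grader is the piece the fact_store currently lacks; everything else maps onto existing retrieval code.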

4.2 Vector Database Upgrade (Impact: 3, Effort: 2, Ratio: 1.5)

What: Evaluate our current setup vs alternatives.
Options:

  • Qdrant — Rust-based, fast, great filtering
  • Chroma — simple, Python-native, good for prototyping
  • SQLite + sqlite-vss — zero-dependency, embedded
  • FAISS — Facebook's pure vector search
    Why it matters: We're using basic SQLite. A proper vector DB could improve search quality.
    Action: Benchmark Qdrant vs sqlite-vss on our fact_store queries.

4.3 Long Context vs RAG Decision Framework (Impact: 4, Effort: 1, Ratio: 4.0)

What: When should we stuff everything in context vs retrieve selectively?
Guidelines:

  • Context < 32K tokens → stuff it all (no RAG needed)
  • Context 32K-128K → hybrid (key docs in context, rest retrieved)
  • Context > 128K → pure RAG with reranking
    Why it matters: We're using models with 128K+ context but still treating them like 4K models.
    Action: Audit our current context usage. We might be over-retrieving.
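
The guidelines above reduce to a tiny decision function. The 32K/128K cut points are the rule of thumb stated above, not measured thresholds:

```python
def context_strategy(corpus_tokens: int, ctx_window: int = 128_000) -> str:
    """Pick a context strategy from total corpus size in tokens."""
    if corpus_tokens < 32_000:
        return "stuff"    # everything goes straight into context, no RAG
    if corpus_tokens <= ctx_window:
        return "hybrid"   # key docs pinned in context, the rest retrieved
    return "rag"          # pure retrieval with reranking

print(context_strategy(10_000), context_strategy(64_000), context_strategy(500_000))
```

Running every cron job's corpus size through a function like this would make the over-retrieval audit mechanical.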

5. FINE-TUNING & TRAINING

5.1 LoRA for Domain-Specific Agents (Impact: 4, Effort: 4, Ratio: 1.0)

What: Fine-tune a small LoRA adapter for our specific use cases.
Tools:

  • Unsloth (github.com/unslothai/unsloth) — 2-5x faster fine-tuning
  • Axolotl (github.com/axolotl-ai-cloud/axolotl) — config-driven fine-tuning
  • PEFT (github.com/huggingface/peft) — HuggingFace's LoRA library
    Why it matters: A LoRA trained on our crisis intervention conversations could be more empathetic than any general model.
    Action: Collect conversation data from the-door, train a LoRA on Gemma-2b.
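
For reference, the LoRA update itself is just a low-rank correction, W' = W + (alpha/r) * B * A, where A and B are the trained rank-r factors. A dependency-free sketch on nested lists (PEFT/Unsloth do this on real tensors):

```python
def lora_merge(W, A, B, alpha=16, r=2):
    """Merge a LoRA adapter into the frozen base weight:
    W' = W + (alpha / r) * (B @ A), with B (d_out x r), A (r x d_in)."""
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    delta = matmul(B, A)            # low-rank update, d_out x d_in
    s = alpha / r                   # standard LoRA scaling factor
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]        # 2x2 frozen base weight
B = [[1.0], [0.0]]                  # d_out x r, with r = 1
A = [[0.0, 0.5]]                    # r x d_in
print(lora_merge(W, A, B, alpha=2, r=1))  # [[1.0, 1.0], [0.0, 1.0]]
```

Only A and B train, so the adapter for a 2b model is megabytes, which is what makes per-domain adapters cheap to keep around.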

5.2 GRPO for Agent Alignment (Impact: 5, Effort: 5, Ratio: 1.0)

What: Group Relative Policy Optimization — train agents to follow SOUL.md principles.
Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024) — https://arxiv.org/abs/2402.03300
Why it matters: Align agents with our values without expensive human feedback.
Action: Research only — design reward function based on SOUL.md principles.
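
GRPO's central trick, replacing a learned value model with group-relative normalization of rewards, fits in a few lines. The SOUL.md-compliance reward mentioned in the comment is the hypothetical reward function this item proposes to design:

```python
import statistics

def group_advantages(rewards):
    """GRPO advantage: score each sampled completion relative to its own
    group, (r_i - mean) / std, so no separate value model is needed."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0   # guard all-equal groups
    return [(r - mu) / sd for r in rewards]

# Four completions of one prompt, scored by a (hypothetical) SOUL.md
# principle-compliance reward:
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Completions above their group mean get positive advantage and are reinforced; the reward function design is the hard part this item defers.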

5.3 Synthetic Data Generation (Impact: 4, Effort: 2, Ratio: 2.0)

What: Generate training data from our existing conversations and code.
Tools:

  • distilabel (github.com/argilla-io/distilabel) — synthetic data pipelines
  • phi-3 cookbook recipes — proven synthetic data approaches
  • Self-Instruct — generate instruction-following data from seed tasks
    Why it matters: We have conversation logs, code, and design docs. This is training data waiting to be structured.
    Action: Build a pipeline that extracts high-quality conversations from our session logs.
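
A first-pass extraction filter might look like this. The quality heuristics (length floor, no traceback in the reply) are placeholder rules that a distilabel-style judge model would eventually replace:

```python
def extract_training_pairs(sessions, min_len=40):
    """Turn session logs (alternating user/assistant turn lists) into
    (prompt, response) pairs, keeping only replies that pass crude
    quality checks. Heuristics here are assumptions, not tuned rules."""
    pairs = []
    for turns in sessions:
        for user, assistant in zip(turns[::2], turns[1::2]):
            good = len(assistant) >= min_len and "Traceback" not in assistant
            if good:
                pairs.append({"prompt": user, "response": assistant})
    return pairs

sessions = [
    ["how do I back up the db?",
     "Use pg_dump with a custom-format archive, then verify the dump "
     "restores cleanly in staging before trusting it."],
    ["ping", "pong"],   # too short, dropped by the length floor
]
pairs = extract_training_pairs(sessions)
print(len(pairs))  # 1
```

Even a filter this crude turns raw logs into something a LoRA run (item 5.1) can consume.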

6. SAFETY & EVALUATION

6.1 Hallucination Detection (Impact: 5, Effort: 3, Ratio: 1.67)

What: Detect when our agents are making things up.
Techniques:

  • SelfCheckGPT — sample multiple outputs, check consistency
  • REFUTE — train a model to detect factual errors
  • Citation verification — check if cited sources actually exist
    Why it matters: SOUL.md demands honesty. We need automated verification.
    Action: Implement SelfCheckGPT pattern: generate 3 responses, flag inconsistencies.
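
The SelfCheckGPT pattern from the action item, using cheap word-overlap as the consistency measure. The paper uses NLI and QA-based scorers, which would slot in where Jaccard sits here:

```python
def jaccard(a, b):
    """Word-set overlap between two responses, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def selfcheck(samples, threshold=0.4):
    """Flag a response as likely hallucinated when independently sampled
    answers disagree: mean pairwise overlap below threshold."""
    pairs = [(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]]
    score = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return score, score < threshold   # (consistency, flagged?)

consistent = ["the capital of france is paris"] * 3
divergent = ["it was built in 1889", "opened around 1920", "finished in 1875"]
print(selfcheck(consistent))    # (1.0, False)
print(selfcheck(divergent)[1])  # True
```

The intuition: a model tends to resample the same true fact but scatters when confabulating, so disagreement across samples is a hallucination signal.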

6.2 Agent Red-Teaming (Impact: 4, Effort: 2, Ratio: 2.0)

What: Systematically test our agents for failure modes.
Approaches:

  • GCG (Greedy Coordinate Gradient) — automated jailbreak discovery
  • PAIR (Prompt Automatic Iterative Refinement) — LLM-based red-teaming
  • Manual testing against SOUL.md principles
    Why it matters: We serve broken men. If our agents can be jailbroken to give harmful advice, people die.
    Action: Build a red-team test suite based on SOUL.md's "What I Will Not Do" section.

6.3 Crisis Intervention Quality Metrics (Impact: 5, Effort: 3, Ratio: 1.67)

What: How do we measure if the-door actually helps people?
Metrics:

  • Empathy detection (does the response acknowledge the user's feelings?)
  • Safety compliance (does it surface 988? does it avoid harmful advice?)
  • De-escalation scoring (does the conversation trajectory improve?)
  • Response latency (in crisis, seconds matter)
    Why it matters: This is our mission. We need to measure it.
    Action: Build an evaluation pipeline: extract the-door conversations, score on empathy + safety + de-escalation.

7. INFERENCE INFRASTRUCTURE

7.1 vLLM vs llama.cpp Benchmark (Impact: 3, Effort: 2, Ratio: 1.5)

What: Which inference engine is best for our use case?
Comparison:

  • llama.cpp — best for CPU/Apple Silicon, GGUF format
  • vLLM — best for GPU serving, PagedAttention, high throughput
  • TensorRT-LLM — NVIDIA-optimized, fastest on NVIDIA GPUs
  • SGLang — structured generation, RadixAttention
    Why it matters: We use llama.cpp on Mac and OpenRouter for cloud. Should we use vLLM on RunPod?
    Action: Deploy vLLM on RunPod L40S, benchmark against Ollama.

7.2 Prompt Caching Optimization (Impact: 4, Effort: 2, Ratio: 2.0)

What: Cache system prompts and repeated context across requests.
Techniques:

  • Anthropic prompt caching (built into API)
  • vLLM automatic prefix caching
  • Manual KV-cache reuse for system prompts
    Why it matters: Our system prompts are 2000+ tokens. Caching saves both time and money.
    Action: Implement prefix caching in our local inference setup.
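
The bookkeeping layer for prefix reuse is simple. What `build` returns (a session id, serialized KV blocks) depends on the engine; vLLM's automatic prefix caching and llama.cpp's prompt-cache files do the heavy lifting natively, so this sketch only shows the keying:

```python
import hashlib

class PrefixCache:
    """Keyed reuse of prefill state for shared system prompts: hash the
    prompt prefix, store whatever state the engine hands back."""
    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_build(self, prefix: str, build):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key in self._store:
            self.hits += 1              # prefill skipped entirely
        else:
            self._store[key] = build(prefix)  # expensive prefill, once
        return self._store[key]

cache = PrefixCache()
SYSTEM = "You are a helpful wizard for the fleet."   # illustrative prompt
state1 = cache.get_or_build(SYSTEM, lambda p: f"kv:{len(p)}")
state2 = cache.get_or_build(SYSTEM, lambda p: f"kv:{len(p)}")
print(state1 == state2, cache.hits)  # True 1
```

With 2000+ token system prompts, every cache hit skips the largest fixed cost in the request.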

7.3 Batch Inference for Cron Jobs (Impact: 3, Effort: 2, Ratio: 1.5)

What: Our 47 cron jobs each make separate inference calls. Batch them.
Tools:

  • vLLM batched inference
  • llama.cpp parallel slots
  • Custom batching layer
    Why it matters: Nightly automation could be 3-5x faster with batching.
    Action: Build a batch dispatcher that groups cron job requests.
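
The dispatcher from the action item, sketched with a lambda standing in for the batched inference call (vLLM batched generate or llama.cpp parallel slots):

```python
from collections import defaultdict

def dispatch(jobs, infer_batch, max_batch=8):
    """Group queued cron-job requests by model, then run each group
    through one batched inference call instead of one call per job.
    jobs: list of (model, prompt); infer_batch(model, prompts) -> outputs."""
    by_model = defaultdict(list)
    for model, prompt in jobs:
        by_model[model].append(prompt)
    results, calls = {}, 0
    for model, prompts in by_model.items():
        for i in range(0, len(prompts), max_batch):
            chunk = prompts[i:i + max_batch]
            for p, out in zip(chunk, infer_batch(model, chunk)):
                results[p] = out
            calls += 1
    return results, calls

jobs = [("gemma-2b", f"job-{i}") for i in range(5)] + [("gemma-4", "job-x")]
res, calls = dispatch(jobs, lambda m, ps: [f"{m}:{p}" for p in ps])
print(calls)  # 2 batched calls instead of 6 single calls
```

Grouping by model first matters because a batch can only share weights and KV layout within one model.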

PRIORITY QUEUE (Sorted by Ratio)

 #  Item                                      Ratio  Tonight?
 1  Long Context vs RAG Decision Framework     4.0   YES — audit current context usage
 2  Structured Output Enforcement (Outlines)   2.5   YES — integrate into tool pipeline
 3  Best Models for Tool-Use comparison        2.5   Maybe — needs benchmark setup
 4  Agent Evaluation Frameworks (GAIA)         2.0   Maybe — needs benchmark infra
 5  Synthetic Data Generation                  2.0   YES — extract from session logs
 6  Agent Red-Teaming                          2.0   YES — build SOUL.md test suite
 7  Prompt Caching                             2.0   YES — quick win for local inference
 8  KV-Cache Compression                       1.67  Research tonight, implement tomorrow
 9  Advanced RAG (CRAG)                        1.67  Research tonight, implement tomorrow
10  Hallucination Detection                    1.67  Research tonight, implement tomorrow
11  Crisis Intervention Metrics                1.67  Research tonight, implement tomorrow
12  Agent Memory Architecture                  1.67  Research tonight, implement tomorrow

EPIC LINKS

  • GOFIA (#65) → feeds from items 2.3 (protocols), 4.1 (RAG), 4.3 (context framework)
  • The Beacon (#57) → feeds from items 2.1 (structured output), 6.2 (red-teaming)
  • Know Thy Agent (#290) → feeds from items 3.1 (model selection), 2.4 (evaluation)
  • The Door → feeds from items 6.1 (hallucination), 6.3 (crisis metrics), 5.1 (LoRA)

Sovereignty and service always.

ezra was assigned by Timmy 2026-04-11 01:00:18 +00:00
Timmy closed this issue 2026-04-11 18:53:38 +00:00
Reference: Timmy_Foundation/timmy-home#593