research dump #445

Closed
opened 2026-03-24 18:20:09 +00:00 by Rockachopa · 5 comments
Timmy was assigned by Rockachopa 2026-03-24 18:20:09 +00:00
[Wiring_the_Research_Pipeline.pdf](/attachments/5eaca8f4-dd5b-4183-844b-4f553d42fa7d)

## 📋 RESEARCH INGESTION SUMMARY

### Documents Ingested

| # | Document | Source | Pages |
|---|----------|--------|-------|
| 1 | Wiring_the_Research_Pipeline.pdf | Comment #16850 by @Rockachopa | 6 |

### Key Findings

**1. Autonomous Research Frameworks (6 evaluated)**

- **Local Deep Research** (LearningCircuit) — BEST sovereign option: MIT license, `pip install`, built-in MCP server, ~95% SimpleQA, runs fully local with Ollama + SearXNG
- **GPT-Researcher** (assafelovic) — most mature architecture: Apache 2.0, supports Ollama natively via `FAST_LLM`/`SMART_LLM` mapping
- Also evaluated: LangChain Local Deep Researcher, STORM v2 (Stanford), HuggingFace Open Deep Research, DeepSearcher (Zilliz)

**2. Search Backend: SearXNG is the answer**

- Self-hosted, $0 forever, no API key needed
- `docker run -d -p 8080:8080 searxng/searxng:latest`
- Critical: **searxng-docker-tavily-adapter** provides a Tavily-compatible REST API wrapping SearXNG — any tool that supports Tavily can use SearXNG as a drop-in with zero code changes
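Once SearXNG is up, its native JSON API can be queried directly. A minimal sketch, assuming the container runs at `localhost:8080` and that JSON output is enabled in SearXNG's `search.formats` setting (it is HTML-only by default):

```python
# Sketch: querying SearXNG's JSON API. The base URL is an assumption
# about the local deployment; SearXNG must have "json" in its
# search.formats setting or it returns 403 for format=json.
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://localhost:8080"  # assumed deployment address

def build_search_url(query: str, base_url: str = SEARXNG_URL) -> str:
    """Build a SearXNG /search URL that returns JSON instead of HTML."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"

def searx_search(query: str, base_url: str = SEARXNG_URL) -> list[dict]:
    """Return the 'results' list from a SearXNG JSON response."""
    with urllib.request.urlopen(build_search_url(query, base_url)) as resp:
        return json.load(resp).get("results", [])
```

Tools that only speak Tavily would instead point at the adapter's endpoint; the direct JSON API is the zero-dependency path.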

**3. Web Extraction Pattern**

- Fast path: **trafilatura** (~50 ms, pure HTTP)
- Fallback: **Crawl4AI** for JS-heavy pages (~2-5 s, Playwright)
- Crawl4AI already available in Agno (`agno.tools.crawl4ai`)

**4. Vector Store + Embeddings**

- **LanceDB** — embedded, serverless, Rust core, native hybrid search
- **Qwen3-Embedding 0.6B** (new, March 2026) — top open model on MTEB, replaces nomic-embed-text
- Fallback: sqlite-vec for the lightest MVP option
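A crystallized finding could be stored as a record like the following. The field names and the empty-by-default vector are assumptions for illustration; the real schema would live in `src/timmy/research.py`:

```python
# Sketch of a "crystallized finding" record as it might be stored in
# LanceDB. Field names are hypothetical; the vector field would hold the
# Qwen3-Embedding output.
from dataclasses import dataclass, field, asdict

@dataclass
class Finding:
    """One research finding plus its embedding vector."""
    claim: str
    source_url: str
    template: str            # e.g. "tool_evaluation"
    vector: list[float] = field(default_factory=list)  # embedding goes here

record = asdict(Finding(
    claim="LanceDB supports native hybrid search",
    source_url="https://example.com",  # placeholder
    template="tool_evaluation",
))
```

With `lancedb` installed, a list of such dicts can be handed to `db.create_table("findings", data=[record])` after `lancedb.connect(path)`; sqlite-vec would take the same record shape with the vector serialized.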

**5. Self-Improving Agent Patterns**

- Voyager Skill Library → crystallization loop
- ExpeL → extract rules from experience trajectories
- Reflexion → retry research on failure
- Claude Code Skills → already in the falsework protocol

**6. Architecture: ResearchOrchestrator**

- 8-step pipeline: cache check → template → queries → SearXNG → fetch → synthesize → crystallize (LanceDB) → write report + file issues
- Core file: `src/timmy/research.py`
- 6 research templates: state_of_art, tool_evaluation, architecture_spike, game_analysis, integration_guide, competitive_scan
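The 8-step pipeline and the template set can be sketched as a skeleton; the method and step names here are illustrative, since the real implementation is still planned for `src/timmy/research.py`:

```python
# Skeleton of the 8-step ResearchOrchestrator pipeline. Step and template
# names mirror the plan above; actual step logic is stubbed.
TEMPLATES = [
    "state_of_art", "tool_evaluation", "architecture_spike",
    "game_analysis", "integration_guide", "competitive_scan",
]

PIPELINE_STEPS = [
    "cache_check", "select_template", "generate_queries", "search_searxng",
    "fetch_pages", "synthesize", "crystallize", "write_report_and_file_issues",
]

class ResearchOrchestrator:
    def run(self, topic: str, template: str) -> list[str]:
        """Walk the pipeline in order; each entry would be a real method."""
        assert template in TEMPLATES, f"unknown template: {template}"
        executed = []
        for step in PIPELINE_STEPS:
            executed.append(step)  # placeholder for the step's real work
        return executed
```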

**7. Gitea MCP Server**

- Official: `gitea.com/gitea/gitea-mcp` (Go)
- Alternative: REST API with token auth
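The REST alternative is straightforward: Gitea's v1 API creates an issue with a token-authenticated POST. A sketch with placeholder host, owner, and repo values (the request is built but not sent):

```python
# Sketch: filing a Gitea issue via the REST API with token auth.
# Endpoint and payload follow Gitea's v1 API; the host/owner/repo/token
# values are placeholders.
import json
import urllib.request

def build_issue_request(host, owner, repo, token, title, body):
    """Prepare (but do not send) a POST to Gitea's create-issue endpoint."""
    url = f"{host}/api/v1/repos/{owner}/{repo}/issues"
    payload = json.dumps({"title": title, "body": body}).encode()
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )

req = build_issue_request(
    "https://gitea.example.com", "Timmy_Foundation", "the-nexus",
    "TOKEN", "Research: follow-up", "Auto-filed by the research pipeline",
)
```

Sending it is one `urllib.request.urlopen(req)` away; the MCP server would wrap the same endpoint behind tool calls.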

### Sovereignty Metrics (Graduation Targets)

| Metric | Week 1 | Month 1 | Month 3 | Graduation |
|--------|--------|---------|---------|------------|
| Cache hit rate | 10% | 40% | 80% | >90% |
| API cost/task | $1.50 | $0.50 | $0.10 | <$0.01 |
| Time to report | 30 min | 15 min | 5 min | <1 min |
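These metrics reduce to simple ratios over pipeline logs. A minimal sketch, with hypothetical counter names:

```python
# Sketch: computing the two countable sovereignty metrics from pipeline
# counters. Counter names are illustrative.
def cache_hit_rate(hits: int, total_lookups: int) -> float:
    """Fraction of research lookups served from the LanceDB cache."""
    return hits / total_lookups if total_lookups else 0.0

def cost_per_task(total_api_cost: float, tasks: int) -> float:
    """Average API spend per completed research task."""
    return total_api_cost / tasks if tasks else 0.0
```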

### Actionable Items → Pipeline Tickets

- **#486 (Day 1):** Deploy SearXNG + install LanceDB + wire Crawl4AI
- **#487 (Day 2):** Create 6 research templates + scaffold ResearchOrchestrator
- **#488 (Day 3):** Complete ResearchOrchestrator + end-to-end test
- **#489 (Day 4):** Gitea issue creation + trafilatura fast-path
- **#490 (Day 5):** Paperclip integration + sovereignty metrics + import existing PDFs

Ingested automatically by Timmy. 1 document processed, 0 skipped.


## 📊 Pipeline Infrastructure Status Update

Following the research ingestion summary (comment #17031), here's the current deployment status for the 5-day pipeline plan:

### Day 1 (#486): Deploy SearXNG + LanceDB + Crawl4AI

| Component | Status | Details |
|-----------|--------|---------|
| SearXNG | ✅ Live | VPS port 8080, JSON API verified, multi-engine (Google, Brave, DDG, Startpage) |
| Docker | ✅ Installed | docker.io 28.2.2 on VPS |
| LanceDB | ✅ Installed | v0.30.1 — DB creation pending |
| Crawl4AI | ✅ Installed | v0.8.6 — browser setup pending |
| trafilatura | ✅ Installed | v2.0.0 — extraction ready |
| httpx | ✅ Installed | v0.28.1 |
| Ollama embeddings | 🔲 Moved to Mac | VPS too small (2 CPU / 3.8 GB) — local-first per architecture |

### Days 2-5 (#487-#490): Pending

- #487: Wire Local Deep Research agent framework
- #488: Build research session manager
- #489: AutoLoRA integration for continuous learning
- #490: Research dashboard + Nexus integration

### Architecture Decision

Embedding models (Qwen3-Embedding) will run on the Mac via Ollama (local-first, sovereign). The VPS handles search infrastructure (SearXNG) and web extraction (Crawl4AI + trafilatura). This aligns with the "MLX = training, GGUF = inference" local architecture documented in the research PDF.
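Under this split, the VPS pipeline would request embeddings from the Mac's Ollama instance. A sketch using Ollama's `/api/embeddings` endpoint; the host and the model tag (`qwen3-embedding:0.6b`) are assumptions about the local setup, and the request is built but not sent:

```python
# Sketch: requesting an embedding from Ollama. Endpoint and payload follow
# Ollama's /api/embeddings API; host and model tag are assumed values.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed Mac-local Ollama address

def build_embed_request(text: str, model: str = "qwen3-embedding:0.6b"):
    """Prepare (but do not send) an Ollama embedding request.

    The JSON response would carry the vector under the "embedding" key.
    """
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

The returned vector would then be written into the LanceDB record before crystallization.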

*Next: Complete Day 1 by creating the LanceDB schema and running an end-to-end pipeline test.*


**Triage by Timmy (2026-03-25)**

This research dump was ingested in a prior session. PDF was processed, pipeline issues #486-#490 were created (5-day research pipeline plan).

Status: ingestion complete. Sub-issues exist. Closing as triaged.

Timmy closed this issue 2026-03-25 19:15:19 +00:00

**Triage complete.** Ingestion was done in a prior session. Sub-issues #486-#490 created for the 5-day research pipeline. Closing as fully triaged.


Reference: Timmy_Foundation/the-nexus#445