research dump #445

Closed
opened 2026-03-24 18:20:09 +00:00 by Rockachopa · 5 comments
Timmy was assigned by Rockachopa 2026-03-24 18:20:09 +00:00
[Wiring_the_Research_Pipeline.pdf](/attachments/5eaca8f4-dd5b-4183-844b-4f553d42fa7d)

## 📋 RESEARCH INGESTION SUMMARY

### Documents Ingested

| # | Document | Source | Pages |
|---|----------|--------|-------|
| 1 | Wiring_the_Research_Pipeline.pdf | Comment #16850 by @Rockachopa | 6 |

### Key Findings

**1. Autonomous Research Frameworks (6 evaluated)**

- **Local Deep Research** (LearningCircuit) — BEST sovereign option: MIT license, `pip install`, built-in MCP server, ~95% SimpleQA, runs fully local with Ollama + SearXNG
- **GPT-Researcher** (assafelovic) — most mature architecture: Apache 2.0, supports Ollama natively via `FAST_LLM`/`SMART_LLM` mapping
- Also evaluated: LangChain Local Deep Researcher, STORM v2 (Stanford), HuggingFace Open Deep Research, DeepSearcher (Zilliz)

**2. Search Backend: SearXNG is the answer**

- Self-hosted, $0 forever, no API key needed
- `docker run -d -p 8080:8080 searxng/searxng:latest`
- Critical: **searxng-docker-tavily-adapter** provides a Tavily-compatible REST API wrapping SearXNG — any tool that supports Tavily can use SearXNG as a drop-in with zero code changes
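Once SearXNG is up, its native JSON API can be queried directly. A minimal sketch, assuming the container runs at `localhost:8080` and that JSON output is enabled in SearXNG's `search.formats` setting (it is HTML-only by default):

```python
# Sketch: querying SearXNG's JSON API. The base URL is an assumption
# about the local deployment; SearXNG must have "json" in its
# search.formats setting or it returns 403 for format=json.
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://localhost:8080"  # assumed deployment address

def build_search_url(query: str, base_url: str = SEARXNG_URL) -> str:
    """Build a SearXNG /search URL that returns JSON instead of HTML."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url}/search?{params}"

def searx_search(query: str, base_url: str = SEARXNG_URL) -> list[dict]:
    """Return the 'results' list from a SearXNG JSON response."""
    with urllib.request.urlopen(build_search_url(query, base_url)) as resp:
        return json.load(resp).get("results", [])
```

Tools that only speak Tavily would instead point at the adapter's endpoint; the direct JSON API is the zero-dependency path.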

**3. Web Extraction Pattern**

- Fast path: **trafilatura** (~50 ms, pure HTTP)
- Fallback: **Crawl4AI** for JS-heavy pages (~2-5 s, Playwright)
- Crawl4AI already available in Agno (`agno.tools.crawl4ai`)

**4. Vector Store + Embeddings**

- **LanceDB** — embedded, serverless, Rust core, native hybrid search
- **Qwen3-Embedding 0.6B** (new, March 2026) — top open model on MTEB, replaces nomic-embed-text
- Fallback: sqlite-vec for the lightest MVP option
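A crystallized finding could be stored as a record like the following. The field names and the empty-by-default vector are assumptions for illustration; the real schema would live in `src/timmy/research.py`:

```python
# Sketch of a "crystallized finding" record as it might be stored in
# LanceDB. Field names are hypothetical; the vector field would hold the
# Qwen3-Embedding output.
from dataclasses import dataclass, field, asdict

@dataclass
class Finding:
    """One research finding plus its embedding vector."""
    claim: str
    source_url: str
    template: str            # e.g. "tool_evaluation"
    vector: list[float] = field(default_factory=list)  # embedding goes here

record = asdict(Finding(
    claim="LanceDB supports native hybrid search",
    source_url="https://example.com",  # placeholder
    template="tool_evaluation",
))
```

With `lancedb` installed, a list of such dicts can be handed to `db.create_table("findings", data=[record])` after `lancedb.connect(path)`; sqlite-vec would take the same record shape with the vector serialized.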

**5. Self-Improving Agent Patterns**

- Voyager Skill Library → crystallization loop
- ExpeL → extract rules from experience trajectories
- Reflexion → retry research on failure
- Claude Code Skills → already in the falsework protocol

**6. Architecture: ResearchOrchestrator**

- 8-step pipeline: cache check → template → queries → SearXNG → fetch → synthesize → crystallize (LanceDB) → write report + file issues
- Core file: `src/timmy/research.py`
- 6 research templates: state_of_art, tool_evaluation, architecture_spike, game_analysis, integration_guide, competitive_scan
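The 8-step pipeline and the template set can be sketched as a skeleton; the method and step names here are illustrative, since the real implementation is still planned for `src/timmy/research.py`:

```python
# Skeleton of the 8-step ResearchOrchestrator pipeline. Step and template
# names mirror the plan above; actual step logic is stubbed.
TEMPLATES = [
    "state_of_art", "tool_evaluation", "architecture_spike",
    "game_analysis", "integration_guide", "competitive_scan",
]

PIPELINE_STEPS = [
    "cache_check", "select_template", "generate_queries", "search_searxng",
    "fetch_pages", "synthesize", "crystallize", "write_report_and_file_issues",
]

class ResearchOrchestrator:
    def run(self, topic: str, template: str) -> list[str]:
        """Walk the pipeline in order; each entry would be a real method."""
        assert template in TEMPLATES, f"unknown template: {template}"
        executed = []
        for step in PIPELINE_STEPS:
            executed.append(step)  # placeholder for the step's real work
        return executed
```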

**7. Gitea MCP Server**

- Official: `gitea.com/gitea/gitea-mcp` (Go)
- Alternative: REST API with token auth
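The REST alternative is straightforward: Gitea's v1 API creates an issue with a token-authenticated POST. A sketch with placeholder host, owner, and repo values (the request is built but not sent):

```python
# Sketch: filing a Gitea issue via the REST API with token auth.
# Endpoint and payload follow Gitea's v1 API; the host/owner/repo/token
# values are placeholders.
import json
import urllib.request

def build_issue_request(host, owner, repo, token, title, body):
    """Prepare (but do not send) a POST to Gitea's create-issue endpoint."""
    url = f"{host}/api/v1/repos/{owner}/{repo}/issues"
    payload = json.dumps({"title": title, "body": body}).encode()
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )

req = build_issue_request(
    "https://gitea.example.com", "Timmy_Foundation", "the-nexus",
    "TOKEN", "Research: follow-up", "Auto-filed by the research pipeline",
)
```

Sending it is one `urllib.request.urlopen(req)` away; the MCP server would wrap the same endpoint behind tool calls.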

### Sovereignty Metrics (Graduation Targets)

| Metric | Week 1 | Month 1 | Month 3 | Graduation |
|--------|--------|---------|---------|------------|
| Cache hit rate | 10% | 40% | 80% | >90% |
| API cost/task | $1.50 | $0.50 | $0.10 | <$0.01 |
| Time to report | 30 min | 15 min | 5 min | <1 min |
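These metrics reduce to simple ratios over pipeline logs. A minimal sketch, with hypothetical counter names:

```python
# Sketch: computing the two countable sovereignty metrics from pipeline
# counters. Counter names are illustrative.
def cache_hit_rate(hits: int, total_lookups: int) -> float:
    """Fraction of research lookups served from the LanceDB cache."""
    return hits / total_lookups if total_lookups else 0.0

def cost_per_task(total_api_cost: float, tasks: int) -> float:
    """Average API spend per completed research task."""
    return total_api_cost / tasks if tasks else 0.0
```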

### Actionable Items → Pipeline Tickets

- **#486 (Day 1):** Deploy SearXNG + install LanceDB + wire Crawl4AI
- **#487 (Day 2):** Create 6 research templates + scaffold ResearchOrchestrator
- **#488 (Day 3):** Complete ResearchOrchestrator + end-to-end test
- **#489 (Day 4):** Gitea issue creation + trafilatura fast-path
- **#490 (Day 5):** Paperclip integration + sovereignty metrics + import existing PDFs

Ingested automatically by Timmy. 1 document processed, 0 skipped.


## 📊 Pipeline Infrastructure Status Update

Following the research ingestion summary (comment #17031), here's the current deployment status for the 5-day pipeline plan:

### Day 1 (#486): Deploy SearXNG + LanceDB + Crawl4AI

| Component | Status | Details |
|-----------|--------|---------|
| SearXNG | ✅ Live | VPS port 8080, JSON API verified, multi-engine (Google, Brave, DDG, Startpage) |
| Docker | ✅ Installed | docker.io 28.2.2 on VPS |
| LanceDB | ✅ Installed | v0.30.1 — DB creation pending |
| Crawl4AI | ✅ Installed | v0.8.6 — browser setup pending |
| trafilatura | ✅ Installed | v2.0.0 — extraction ready |
| httpx | ✅ Installed | v0.28.1 |
| Ollama embeddings | 🔲 Moved to Mac | VPS too small (2 CPU / 3.8 GB) — local-first per architecture |

### Days 2-5 (#487-#490): Pending

- #487: Wire Local Deep Research agent framework
- #488: Build research session manager
- #489: AutoLoRA integration for continuous learning
- #490: Research dashboard + Nexus integration

### Architecture Decision

Embedding models (Qwen3-Embedding) will run on the Mac via Ollama (local-first, sovereign). The VPS handles search infrastructure (SearXNG) and web extraction (Crawl4AI + trafilatura). This aligns with the "MLX = training, GGUF = inference" local architecture documented in the research PDF.
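Under this split, the VPS pipeline would request embeddings from the Mac's Ollama instance. A sketch using Ollama's `/api/embeddings` endpoint; the host and the model tag (`qwen3-embedding:0.6b`) are assumptions about the local setup, and the request is built but not sent:

```python
# Sketch: requesting an embedding from Ollama. Endpoint and payload follow
# Ollama's /api/embeddings API; host and model tag are assumed values.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed Mac-local Ollama address

def build_embed_request(text: str, model: str = "qwen3-embedding:0.6b"):
    """Prepare (but do not send) an Ollama embedding request.

    The JSON response would carry the vector under the "embedding" key.
    """
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

The returned vector would then be written into the LanceDB record before crystallization.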

*Next: Complete Day 1 by creating the LanceDB schema and running an end-to-end pipeline test.*


**Triage by Timmy (2026-03-25)**

This research dump was ingested in a prior session. PDF was processed, pipeline issues #486-#490 were created (5-day research pipeline plan).

Status: ingestion complete. Sub-issues exist. Closing as triaged.

Timmy closed this issue 2026-03-25 19:15:19 +00:00

**Triage complete.** Ingestion was done in a prior session. Sub-issues #486-#490 created for the 5-day research pipeline. Closing as fully triaged.


Reference: Timmy_Foundation/the-nexus#445