Research Pipeline Day 1: Deploy SearXNG + LanceDB + Crawl4AI #486

Closed
opened 2026-03-25 02:57:18 +00:00 by Timmy · 3 comments
Owner

Parent: #445 (research dump)

Objective

Stand up search and storage infrastructure for the autonomous research pipeline.

Tasks

  • Deploy SearXNG on VPS: docker run -d -p 8080:8080 searxng/searxng:latest
  • Install SearXNG Tavily adapter for drop-in Tavily compat
  • pip install lancedb trafilatura httpx crawl4ai
  • ollama pull qwen3-embedding:0.6b
  • Create research LanceDB at ~/.timmy/research.lancedb
  • Test: SearXNG returns results via curl
  • Test: trafilatura extracts clean text from a URL
  • Test: Crawl4AI handles a JS-heavy page
  • Test: qwen3-embedding produces vectors via Ollama API
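
For the last test in the list, a minimal sketch of an embedding check against Ollama's `/api/embed` endpoint. Assumptions not stated in this issue: Ollama listens on its default port 11434 on the same machine, and the helper names (`build_embed_request`, `parse_embeddings`, `embed`) are hypothetical. Only the standard library is used, so it runs before any Python packages are installed.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default local port.
OLLAMA_URL = "http://localhost:11434/api/embed"
MODEL = "qwen3-embedding:0.6b"


def build_embed_request(texts, model=MODEL):
    """Return the JSON body for Ollama's /api/embed endpoint."""
    return {"model": model, "input": list(texts)}


def parse_embeddings(response):
    """Return the vectors from an /api/embed response, or raise if unusable."""
    vectors = response.get("embeddings", [])
    if not vectors or any(len(v) == 0 for v in vectors):
        raise ValueError("no embeddings returned")
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("vectors have inconsistent dimensions")
    return vectors


def embed(texts, url=OLLAMA_URL, model=MODEL, timeout=60):
    """POST the texts to Ollama and return one vector per input text."""
    body = json.dumps(build_embed_request(texts, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))

# Usage (needs a running Ollama with the model pulled):
#   vectors = embed(["test"])
#   print(len(vectors), "vector(s) of dimension", len(vectors[0]))
```

The test passes if `embed(["test"])` returns a single non-empty vector. The dimension is printed rather than asserted, since this issue does not state what it should be.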

Verification

  • curl "http://VPS:8080/search?q=test&format=json" returns results (quote the URL so the shell does not treat & as a background operator)
  • python3 -c "import lancedb; print('ok')"
  • python3 -c "import trafilatura; print('ok')"
  • ollama run qwen3-embedding:0.6b "test" returns vector
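
The curl check above can also be scripted. This is a rough sketch, not the pipeline's actual client: the function names are hypothetical, and it assumes the SearXNG JSON response carries a `results` list whose entries have `title` and `url` fields.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, query, fmt="json"):
    """Build a SearXNG /search URL with a properly encoded query string."""
    params = urllib.parse.urlencode({"q": query, "format": fmt})
    return f"{base_url.rstrip('/')}/search?{params}"


def summarize_results(payload, limit=5):
    """Return (title, url) pairs for the first `limit` results in a response."""
    results = payload.get("results", [])[:limit]
    return [(r.get("title", ""), r.get("url", "")) for r in results]


def search(base_url, query, timeout=15):
    """Run one query against SearXNG and return the decoded JSON payload."""
    url = build_search_url(base_url, query)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

# Usage (needs a reachable SearXNG instance with the JSON format enabled):
#   payload = search("http://localhost:8080", "test")
#   for title, url in summarize_results(payload):
#       print(title, url)
```

If the request fails with HTTP 403, the likely cause (as far as I know) is that `json` is not listed among the enabled output formats in the SearXNG settings.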

Ref: Wiring_the_Research_Pipeline.pdf attached to #445

Timmy self-assigned this 2026-03-25 02:57:18 +00:00
Author
Owner

🔧 Day 1 Progress Update

✅ Completed

1. SearXNG Deployed & Verified

  • Docker installed on VPS (docker.io 28.2.2)
  • SearXNG container running: searxng/searxng:latest on port 8080
  • JSON API format enabled and tested — returning search results successfully
  • Container set to --restart unless-stopped for persistence
  • Endpoint: http://localhost:8080/search?q=QUERY&format=json

2. Research Ingestion Summary Posted

  • Full research summary from Wiring_the_Research_Pipeline.pdf posted to #445 (comment #17031)
  • All 7 key findings documented with actionable items mapped to tickets #486-#490

🔲 Remaining Day 1 Tasks

  • Install LanceDB (pip install lancedb)
  • Pull Qwen3-Embedding 0.6B model via Ollama
  • Install/verify Crawl4AI (pip install crawl4ai)
  • Install trafilatura (pip install trafilatura)
  • Deploy searxng-docker-tavily-adapter for Tavily-compatible API
  • Basic connectivity test: SearXNG to trafilatura extraction pipeline
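
For the connectivity test in the last item, one possible shape is sketched below. The search results are assumed to be SearXNG-style dicts with `url` and `title` keys. The fetch and extract steps are passed in as functions so the logic can be checked without network access. Both function names are hypothetical; `trafilatura.fetch_url` and `trafilatura.extract` are the only library calls used.

```python
def extract_from_results(results, fetch, extract, limit=3):
    """Fetch and extract text for the first `limit` search results.

    `fetch(url)` returns HTML or None; `extract(html)` returns text or None.
    Results that have no URL or fail either step are skipped.
    """
    docs = []
    for result in results[:limit]:
        url = result.get("url")
        if not url:
            continue
        html = fetch(url)
        if html is None:
            continue
        text = extract(html)
        if text:
            docs.append({"url": url, "title": result.get("title", ""), "text": text})
    return docs


def extract_with_trafilatura(results, limit=3):
    """Same as above, wired to trafilatura's downloader and extractor."""
    import trafilatura  # imported lazily so the helper above has no dependencies

    return extract_from_results(
        results, trafilatura.fetch_url, trafilatura.extract, limit
    )
```

The test passes if `extract_with_trafilatura(payload["results"])` returns at least one document with non-empty text for a live query.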

Infrastructure Status

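| Component | Status | Location |
|-----------|--------|----------|
| SearXNG | ✅ Running | 143.198.27.163:8080 |
| Docker | ✅ Installed | VPS |
| LanceDB | 🔲 Pending | - |
| Crawl4AI | 🔲 Pending | - |
| Qwen3-Embedding | 🔲 Pending | - |
| Tavily Adapter | 🔲 Pending | - |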
Author
Owner

🔧 Day 1 Progress Update #2

✅ Newly Completed

3. Python Dependencies Installed & Verified

  • lancedb 0.30.1 — installed
  • trafilatura 2.0.0 — installed
  • crawl4ai 0.8.6 — installed
  • httpx 0.28.1 — installed

4. SearXNG Still Running

  • Container status: Up About an hour
  • JSON API verified returning results for test queries
  • Multiple search engines active: google, brave, duckduckgo, startpage, wikipedia, aol

🔲 Remaining Day 1 Tasks

  • Pull Qwen3-Embedding 0.6B via Ollama. NOTE: the VPS has only 2 CPUs and 3.8 GB RAM, so embeddings should run locally on the Mac, per the architecture (MLX = training, GGUF = inference). Recommend moving this task to the local setup.
  • Deploy searxng-docker-tavily-adapter for Tavily-compatible API
  • Create research LanceDB at ~/.timmy/research.lancedb
  • End-to-end test: SearXNG → trafilatura extraction pipeline

📋 Architecture Note

Ollama + embedding models should stay on the Mac (local-first, sovereign). The VPS serves as the relay/search infrastructure (SearXNG, LanceDB storage). Crawl4AI and trafilatura handle web content extraction on the VPS where SearXNG lives.
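
A rough sketch of how this split could meet at the storage layer: texts extracted on the VPS are paired with vectors computed on the Mac, then written to the research LanceDB. The table name `documents`, the record fields, and both function names are assumptions for illustration. Note that `mode="overwrite"` replaces any existing table, so as written this only suits a first-time setup test.

```python
import os

# Path taken from the task list; "~" is expanded explicitly.
DB_PATH = os.path.expanduser("~/.timmy/research.lancedb")


def build_records(docs, vectors):
    """Pair each extracted document with its embedding vector."""
    if len(docs) != len(vectors):
        raise ValueError("need exactly one vector per document")
    return [
        {
            "vector": list(vector),
            "url": doc["url"],
            "title": doc.get("title", ""),
            "text": doc["text"],
        }
        for doc, vector in zip(docs, vectors)
    ]


def store(records, table_name="documents", path=DB_PATH):
    """Write records to LanceDB. WARNING: overwrites an existing table."""
    import lancedb  # imported lazily so build_records has no dependencies

    db = lancedb.connect(path)
    return db.create_table(table_name, data=records, mode="overwrite")
```

How the vectors travel from the Mac to the VPS is left open here; this issue does not specify a transport.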

Infrastructure Status

| Component | Status | Location |
|-----------|--------|----------|
| SearXNG | ✅ Running | VPS :8080 |
| Docker | ✅ Installed | VPS |
| LanceDB | ✅ Installed (0.30.1) | VPS |
| Crawl4AI | ✅ Installed (0.8.6) | VPS |
| trafilatura | ✅ Installed (2.0.0) | VPS |
| httpx | ✅ Installed (0.28.1) | VPS |
| Qwen3-Embedding | 🔲 Move to Mac | Local |
| Tavily Adapter | 🔲 Pending | VPS |
Member

Closed per direction shift (#542). Reason: the Day 1 research pipeline (SearXNG/LanceDB/Crawl4AI) is a custom build, not MCP-standard.

The Nexus has three jobs: Heartbeat, Harness, Portal Interface. This issue doesn't serve any of them.

perplexity added the deprioritized label 2026-03-25 23:30:01 +00:00

Reference: Timmy_Foundation/the-nexus#486