[P0] Implement web_fetch tool (trafilatura) in src/timmy/tools.py #973

Closed
opened 2026-03-22 19:08:52 +00:00 by perplexity · 1 comment
Collaborator

Parent

  • #972 — [GOVERNING] Replacing Claude — Autonomous Research Pipeline Spec

Objective

Add a `web_fetch` tool that downloads a URL, extracts clean text via trafilatura, and returns content truncated to a token budget. This closes the single biggest gap in Timmy's research pipeline.

Scope

  • `pip install trafilatura` (Apache 2.0 license)
  • Implement `web_fetch(url: str, max_tokens: int = 4000) -> str`
  • Use `requests.get()` with a 15s timeout and a `TimmyResearchBot/1.0` user-agent
  • Extract clean text via `trafilatura.extract(resp.text, include_tables=True, include_links=True)`
  • Truncate to `max_tokens * 4` characters (~4 chars per token)
  • Register as an Agno tool in `create_full_toolkit()`
  • Handle errors gracefully (timeout, 404, extraction failure)

Key Design Notes

  • Pure Python, zero cloud dependency, runs locally forever
  • Existing `web_search` returns snippets only — this tool reads full pages
  • Combined with web_search, enables the complete Search→Fetch→Synthesize pipeline

Effort Estimate

2 hours

Acceptance Criteria

  • `web_fetch("https://example.com")` returns clean extracted text
  • Token budget truncation works correctly
  • Registered and callable as Agno tool
  • Error handling for timeouts, bad URLs, empty pages
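One way to exercise the error-handling criteria without touching the network is dependency injection. This stand-alone sketch mirrors — but is not — the proposed implementation; `web_fetch_with` and its error strings are hypothetical:

```python
# Testing sketch only: fetch_fn / extract_fn stand in for requests.get
# and trafilatura.extract so the error paths run offline.
def web_fetch_with(fetch_fn, extract_fn, url, max_tokens=4000):
    try:
        html = fetch_fn(url)          # may raise timeout / HTTP errors
    except Exception as exc:
        return f"Error fetching {url}: {exc}"
    text = extract_fn(html)           # may return None on extraction failure
    if not text:
        return f"Error: no extractable text at {url}"
    return text[: max_tokens * 4]     # token budget, ~4 chars per token


def timed_out(url):
    raise TimeoutError("timed out after 15s")


# A timeout and an empty page both produce error strings, not exceptions:
print(web_fetch_with(timed_out, lambda h: h, "https://example.com"))
print(web_fetch_with(lambda u: "<html></html>", lambda h: None, "https://example.com"))
```

The same pattern with `unittest.mock.patch` on `requests.get` would cover the real function's timeout and 404 branches.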
claude was assigned by Rockachopa 2026-03-22 21:44:43 +00:00
Collaborator

PR created: http://143.198.27.163:3000/rockachopa/Timmy-time-dashboard/pulls/1004

Implemented `web_fetch(url, max_tokens=4000)` tool using trafilatura for clean text extraction. Handles all error cases gracefully (invalid URLs, timeouts, HTTP errors, empty pages, missing packages). Registered in `create_full_toolkit()` with a catalog entry. 11 unit tests, all passing.


Reference: Rockachopa/Timmy-time-dashboard#973