diff --git a/skills/productivity/ocr-and-documents/SKILL.md b/skills/productivity/ocr-and-documents/SKILL.md
index cbbc07aad..2fdf4ea41 100644
--- a/skills/productivity/ocr-and-documents/SKILL.md
+++ b/skills/productivity/ocr-and-documents/SKILL.md
@@ -122,6 +122,43 @@
 web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
 web_search(query="arxiv GRPO reinforcement learning 2026")
 ```
+## Split, Merge & Search
+
+pymupdf handles these natively; use `execute_code` or inline Python:
+
+```python
+# Split: copy the first five pages (0-based indices 0-4) into a new PDF
+import pymupdf
+doc = pymupdf.open("report.pdf")
+new = pymupdf.open()
+new.insert_pdf(doc, from_page=0, to_page=4)
+new.save("pages_1-5.pdf")
+```
+
+```python
+# Merge multiple PDFs, in order
+import pymupdf
+result = pymupdf.open()
+for path in ["a.pdf", "b.pdf", "c.pdf"]:
+    result.insert_pdf(pymupdf.open(path))
+result.save("merged.pdf")
+```
+
+```python
+# Search for text across all pages
+import pymupdf
+doc = pymupdf.open("report.pdf")
+for i, page in enumerate(doc):
+    hits = page.search_for("revenue")  # list of Rects, one per match
+    if hits:
+        print(f"Page {i + 1}: {len(hits)} match(es)")
+        print(page.get_text("text"))
+```
+
+No extra dependencies are needed: pymupdf covers split, merge, search, and text extraction in one package.
+
+---
+
 ## Notes
 
 - `web_extract` is always first choice for URLs