diff --git a/optional-skills/productivity/siyuan/SKILL.md b/optional-skills/productivity/siyuan/SKILL.md new file mode 100644 index 000000000..49c5d6185 --- /dev/null +++ b/optional-skills/productivity/siyuan/SKILL.md @@ -0,0 +1,297 @@ +--- +name: siyuan +description: SiYuan Note API for searching, reading, creating, and managing blocks and documents in a self-hosted knowledge base via curl. +version: 1.0.0 +author: FEUAZUR +license: MIT +metadata: + hermes: + tags: [SiYuan, Notes, Knowledge Base, PKM, API] + related_skills: [obsidian, notion] + homepage: https://github.com/siyuan-note/siyuan +prerequisites: + env_vars: [SIYUAN_TOKEN] + commands: [curl, jq] +required_environment_variables: + - name: SIYUAN_TOKEN + prompt: SiYuan API token + help: "Settings > About in SiYuan desktop app" + - name: SIYUAN_URL + prompt: SiYuan instance URL (default http://127.0.0.1:6806) + required_for: remote instances +--- + +# SiYuan Note API + +Use the [SiYuan](https://github.com/siyuan-note/siyuan) kernel API via curl to search, read, create, update, and delete blocks and documents in a self-hosted knowledge base. No extra tools needed -- just curl and an API token. + +## Prerequisites + +1. Install and run SiYuan (desktop or Docker) +2. Get your API token: **Settings > About > API token** +3. Store it in `~/.hermes/.env`: + ``` + SIYUAN_TOKEN=your_token_here + SIYUAN_URL=http://127.0.0.1:6806 + ``` + `SIYUAN_URL` defaults to `http://127.0.0.1:6806` if not set. + +## API Basics + +All SiYuan API calls are **POST with JSON body**. Every request follows this pattern: + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/..." \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"param": "value"}' +``` + +Responses are JSON with this structure: +```json +{"code": 0, "msg": "", "data": { ... }} +``` +`code: 0` means success. Any other value is an error -- check `msg` for details. 
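That success check can be wrapped in a small helper so every call fails loudly instead of silently processing an error payload. This is a sketch, not part of the SiYuan API: `siyuan_check` is a hypothetical function name, and it only depends on `jq` (already a prerequisite of this skill).

```shell
# Hypothetical helper: takes a raw SiYuan JSON response as its argument,
# prints .data (compact) on success, and on code != 0 prints the error
# message to stderr and returns non-zero.
siyuan_check() {
  local resp=$1
  if [ "$(printf '%s' "$resp" | jq -r '.code')" != "0" ]; then
    echo "SiYuan error: $(printf '%s' "$resp" | jq -r '.msg')" >&2
    return 1
  fi
  printf '%s' "$resp" | jq -c '.data'
}
```

Usage: capture any curl call from this document into a variable and pass it through, e.g. `siyuan_check "$(curl -s -X POST ...)"`.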
+ +**ID format:** SiYuan IDs look like `20210808180117-6v0mkxr` (14-digit timestamp + 7 alphanumeric chars). + +## Quick Reference + +| Operation | Endpoint | +|-----------|----------| +| Full-text search | `/api/search/fullTextSearchBlock` | +| SQL query | `/api/query/sql` | +| Read block | `/api/block/getBlockKramdown` | +| Read children | `/api/block/getChildBlocks` | +| Get path | `/api/filetree/getHPathByID` | +| Get attributes | `/api/attr/getBlockAttrs` | +| List notebooks | `/api/notebook/lsNotebooks` | +| List documents | `/api/filetree/listDocsByPath` | +| Create notebook | `/api/notebook/createNotebook` | +| Create document | `/api/filetree/createDocWithMd` | +| Append block | `/api/block/appendBlock` | +| Update block | `/api/block/updateBlock` | +| Rename document | `/api/filetree/renameDocByID` | +| Set attributes | `/api/attr/setBlockAttrs` | +| Delete block | `/api/block/deleteBlock` | +| Delete document | `/api/filetree/removeDocByID` | +| Export as Markdown | `/api/export/exportMdContent` | + +## Common Operations + +### Search (Full-Text) + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/search/fullTextSearchBlock" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"query": "meeting notes", "page": 0}' | jq '.data.blocks[:5]' +``` + +### Search (SQL) + +Query the blocks database directly. Only SELECT statements are safe. + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/query/sql" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"stmt": "SELECT id, content, type, box FROM blocks WHERE content LIKE '\''%keyword%'\'' AND type='\''p'\'' LIMIT 20"}' | jq '.data' +``` + +Useful columns: `id`, `parent_id`, `root_id`, `box` (notebook ID), `path`, `content`, `type`, `subtype`, `created`, `updated`. + +### Read Block Content + +Returns block content in Kramdown (Markdown-like) format. 
+ +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/getBlockKramdown" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "20210808180117-6v0mkxr"}' | jq '.data.kramdown' +``` + +### Read Child Blocks + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/getChildBlocks" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "20210808180117-6v0mkxr"}' | jq '.data' +``` + +### Get Human-Readable Path + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/getHPathByID" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "20210808180117-6v0mkxr"}' | jq '.data' +``` + +### Get Block Attributes + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/attr/getBlockAttrs" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "20210808180117-6v0mkxr"}' | jq '.data' +``` + +### List Notebooks + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/notebook/lsNotebooks" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{}' | jq '.data.notebooks[] | {id, name, closed}' +``` + +### List Documents in a Notebook + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/listDocsByPath" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"notebook": "NOTEBOOK_ID", "path": "/"}' | jq '.data.files[] | {id, name}' +``` + +### Create a Document + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/createDocWithMd" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "notebook": "NOTEBOOK_ID", + "path": "/Meeting Notes/2026-03-22", + "markdown": "# Meeting Notes\n\n- Discussed project timeline\n- Assigned tasks" + }' | jq '.data' +``` + +### 
Create a Notebook + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/notebook/createNotebook" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"name": "My New Notebook"}' | jq '.data.notebook.id' +``` + +### Append Block to Document + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/appendBlock" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "parentID": "DOCUMENT_OR_BLOCK_ID", + "data": "New paragraph added at the end.", + "dataType": "markdown" + }' | jq '.data' +``` + +Also available: `/api/block/prependBlock` (same params, inserts at the beginning) and `/api/block/insertBlock` (uses `previousID` instead of `parentID` to insert after a specific block). + +### Update Block Content + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/updateBlock" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "id": "BLOCK_ID", + "data": "Updated content here.", + "dataType": "markdown" + }' | jq '.data' +``` + +### Rename a Document + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/renameDocByID" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "DOCUMENT_ID", "title": "New Title"}' +``` + +### Set Block Attributes + +Custom attributes must be prefixed with `custom-`: + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/attr/setBlockAttrs" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "id": "BLOCK_ID", + "attrs": { + "custom-status": "reviewed", + "custom-priority": "high" + } + }' +``` + +### Delete a Block + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/deleteBlock" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "BLOCK_ID"}' +``` + +To delete a whole 
document: use `/api/filetree/removeDocByID` with `{"id": "DOC_ID"}`. +To delete a notebook: use `/api/notebook/removeNotebook` with `{"notebook": "NOTEBOOK_ID"}`. + +### Export Document as Markdown + +```bash +curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/export/exportMdContent" \ + -H "Authorization: Token $SIYUAN_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"id": "DOCUMENT_ID"}' | jq -r '.data.content' +``` + +## Block Types + +Common `type` values in SQL queries: + +| Type | Description | +|------|-------------| +| `d` | Document (root block) | +| `p` | Paragraph | +| `h` | Heading | +| `l` | List | +| `i` | List item | +| `c` | Code block | +| `m` | Math block | +| `t` | Table | +| `b` | Blockquote | +| `s` | Super block | +| `html` | HTML block | + +## Pitfalls + +- **All endpoints are POST** -- even read-only operations. Do not use GET. +- **SQL safety**: only use SELECT queries. INSERT/UPDATE/DELETE/DROP are dangerous and should never be sent. +- **ID validation**: IDs match the pattern `YYYYMMDDHHmmss-xxxxxxx`. Reject anything else. +- **Error responses**: always check `code != 0` in responses before processing `data`. +- **Large documents**: block content and export results can be very large. Use `LIMIT` in SQL and pipe through `jq` to extract only what you need. +- **Notebook IDs**: when working with a specific notebook, get its ID first via `lsNotebooks`. 
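The ID-validation pitfall above can be enforced with a one-line check before interpolating any ID into a request body or SQL statement. A sketch, assuming the documented ID shape (`is_siyuan_id` is a hypothetical helper name, not a SiYuan command):

```shell
# Hypothetical helper: accepts only the documented SiYuan ID shape --
# a 14-digit timestamp, a dash, then 7 alphanumeric characters.
is_siyuan_id() {
  printf '%s' "$1" | grep -Eq '^[0-9]{14}-[0-9a-zA-Z]{7}$'
}
```

Usage: `is_siyuan_id "$id" || { echo "refusing malformed ID" >&2; exit 1; }` before building the JSON body.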
+ +## Alternative: MCP Server + +If you prefer a native integration instead of curl, install the SiYuan MCP server: + +```yaml +# In ~/.hermes/config.yaml under mcp_servers: +mcp_servers: + siyuan: + command: npx + args: ["-y", "@porkll/siyuan-mcp"] + env: + SIYUAN_TOKEN: "your_token" + SIYUAN_URL: "http://127.0.0.1:6806" +``` diff --git a/optional-skills/research/scrapling/SKILL.md b/optional-skills/research/scrapling/SKILL.md new file mode 100644 index 000000000..aaa38c90a --- /dev/null +++ b/optional-skills/research/scrapling/SKILL.md @@ -0,0 +1,335 @@ +--- +name: scrapling +description: Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python. +version: 1.0.0 +author: FEUAZUR +license: MIT +metadata: + hermes: + tags: [Web Scraping, Browser, Cloudflare, Stealth, Crawling, Spider] + related_skills: [duckduckgo-search, domain-intel] + homepage: https://github.com/D4Vinci/Scrapling +prerequisites: + commands: [scrapling, python] +--- + +# Scrapling + +[Scrapling](https://github.com/D4Vinci/Scrapling) is a web scraping framework with anti-bot bypass, stealth browser automation, and a spider framework. It provides three fetching strategies (HTTP, dynamic JS, stealth/Cloudflare) and a full CLI. + +**This skill is for educational and research purposes only.** Users must comply with local/international data scraping laws and respect website Terms of Service. 
+ +## When to Use + +- Scraping static HTML pages (faster than browser tools) +- Scraping JS-rendered pages that need a real browser +- Bypassing Cloudflare Turnstile or bot detection +- Crawling multiple pages with a spider +- When the built-in `web_extract` tool does not return the data you need + +## Installation + +```bash +pip install "scrapling[all]" +scrapling install +``` + +Minimal install (HTTP only, no browser): +```bash +pip install scrapling +``` + +With browser automation only: +```bash +pip install "scrapling[fetchers]" +scrapling install +``` + +## Quick Reference + +| Approach | Class | Use When | +|----------|-------|----------| +| HTTP | `Fetcher` / `FetcherSession` | Static pages, APIs, fast bulk requests | +| Dynamic | `DynamicFetcher` / `DynamicSession` | JS-rendered content, SPAs | +| Stealth | `StealthyFetcher` / `StealthySession` | Cloudflare, anti-bot protected sites | +| Spider | `Spider` | Multi-page crawling with link following | + +## CLI Usage + +### Extract Static Page + +```bash +scrapling extract get 'https://example.com' output.md +``` + +With CSS selector and browser impersonation: + +```bash +scrapling extract get 'https://example.com' output.md \ + --css-selector '.content' \ + --impersonate 'chrome' +``` + +### Extract JS-Rendered Page + +```bash +scrapling extract fetch 'https://example.com' output.md \ + --css-selector '.dynamic-content' \ + --disable-resources \ + --network-idle +``` + +### Extract Cloudflare-Protected Page + +```bash +scrapling extract stealthy-fetch 'https://protected-site.com' output.html \ + --solve-cloudflare \ + --block-webrtc \ + --hide-canvas +``` + +### POST Request + +```bash +scrapling extract post 'https://example.com/api' output.json \ + --json '{"query": "search term"}' +``` + +### Output Formats + +The output format is determined by the file extension: +- `.html` -- raw HTML +- `.md` -- converted to Markdown +- `.txt` -- plain text +- `.json` / `.jsonl` -- JSON + +## Python: HTTP Scraping + 
+### Single Request + +```python +from scrapling.fetchers import Fetcher + +page = Fetcher.get('https://quotes.toscrape.com/') +quotes = page.css('.quote .text::text').getall() +for q in quotes: + print(q) +``` + +### Session (Persistent Cookies) + +```python +from scrapling.fetchers import FetcherSession + +with FetcherSession(impersonate='chrome') as session: + page = session.get('https://example.com/', stealthy_headers=True) + links = page.css('a::attr(href)').getall() + for link in links[:5]: + sub = session.get(link) + print(sub.css('h1::text').get()) +``` + +### POST / PUT / DELETE + +```python +page = Fetcher.post('https://api.example.com/data', json={"key": "value"}) +page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"}) +page = Fetcher.delete('https://api.example.com/item/1') +``` + +### With Proxy + +```python +page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080') +``` + +## Python: Dynamic Pages (JS-Rendered) + +For pages that require JavaScript execution (SPAs, lazy-loaded content): + +```python +from scrapling.fetchers import DynamicFetcher + +page = DynamicFetcher.fetch('https://example.com', headless=True) +data = page.css('.js-loaded-content::text').getall() +``` + +### Wait for Specific Element + +```python +page = DynamicFetcher.fetch( + 'https://example.com', + wait_selector=('.results', 'visible'), + network_idle=True, +) +``` + +### Disable Resources for Speed + +Blocks fonts, images, media, stylesheets (~25% faster): + +```python +from scrapling.fetchers import DynamicSession + +with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session: + page = session.fetch('https://example.com') + items = page.css('.item::text').getall() +``` + +### Custom Page Automation + +```python +from playwright.sync_api import Page +from scrapling.fetchers import DynamicFetcher + +def scroll_and_click(page: Page): + page.mouse.wheel(0, 3000) + page.wait_for_timeout(1000) + 
page.click('button.load-more') + page.wait_for_selector('.extra-results') + +page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click) +results = page.css('.extra-results .item::text').getall() +``` + +## Python: Stealth Mode (Anti-Bot Bypass) + +For Cloudflare-protected or heavily fingerprinted sites: + +```python +from scrapling.fetchers import StealthyFetcher + +page = StealthyFetcher.fetch( + 'https://protected-site.com', + headless=True, + solve_cloudflare=True, + block_webrtc=True, + hide_canvas=True, +) +content = page.css('.protected-content::text').getall() +``` + +### Stealth Session + +```python +from scrapling.fetchers import StealthySession + +with StealthySession(headless=True, solve_cloudflare=True) as session: + page1 = session.fetch('https://protected-site.com/page1') + page2 = session.fetch('https://protected-site.com/page2') +``` + +## Element Selection + +All fetchers return a `Selector` object with these methods: + +### CSS Selectors + +```python +page.css('h1::text').get() # First h1 text +page.css('a::attr(href)').getall() # All link hrefs +page.css('.quote .text::text').getall() # Nested selection +``` + +### XPath + +```python +page.xpath('//div[@class="content"]/text()').getall() +page.xpath('//a/@href').getall() +``` + +### Find Methods + +```python +page.find_all('div', class_='quote') # By tag + attribute +page.find_by_text('Read more', tag='a') # By text content +page.find_by_regex(r'\$\d+\.\d{2}') # By regex pattern +``` + +### Similar Elements + +Find elements with similar structure (useful for product listings, etc.): + +```python +first_product = page.css('.product')[0] +all_similar = first_product.find_similar() +``` + +### Navigation + +```python +el = page.css('.target')[0] +el.parent # Parent element +el.children # Child elements +el.next_sibling # Next sibling +el.prev_sibling # Previous sibling +``` + +## Python: Spider Framework + +For multi-page crawling with link following: + +```python +from 
scrapling.spiders import Spider, Request, Response + +class QuotesSpider(Spider): + name = "quotes" + start_urls = ["https://quotes.toscrape.com/"] + concurrent_requests = 10 + download_delay = 1 + + async def parse(self, response: Response): + for quote in response.css('.quote'): + yield { + "text": quote.css('.text::text').get(), + "author": quote.css('.author::text').get(), + "tags": quote.css('.tag::text').getall(), + } + + next_page = response.css('.next a::attr(href)').get() + if next_page: + yield response.follow(next_page) + +result = QuotesSpider().start() +print(f"Scraped {len(result.items)} quotes") +result.items.to_json("quotes.json") +``` + +### Multi-Session Spider + +Route requests to different fetcher types: + +```python +from scrapling.fetchers import FetcherSession, AsyncStealthySession + +class SmartSpider(Spider): + name = "smart" + start_urls = ["https://example.com/"] + + def configure_sessions(self, manager): + manager.add("fast", FetcherSession(impersonate="chrome")) + manager.add("stealth", AsyncStealthySession(headless=True), lazy=True) + + async def parse(self, response: Response): + for link in response.css('a::attr(href)').getall(): + if "protected" in link: + yield Request(link, sid="stealth") + else: + yield Request(link, sid="fast", callback=self.parse) +``` + +### Pause/Resume Crawling + +```python +spider = QuotesSpider(crawldir="./crawl_checkpoint") +spider.start() # Ctrl+C to pause, re-run to resume from checkpoint +``` + +## Pitfalls + +- **Browser install required**: run `scrapling install` after pip install -- without it, `DynamicFetcher` and `StealthyFetcher` will fail +- **Timeouts**: DynamicFetcher/StealthyFetcher timeout is in **milliseconds** (default 30000), Fetcher timeout is in **seconds** +- **Cloudflare bypass**: `solve_cloudflare=True` adds 5-15 seconds to fetch time -- only enable when needed +- **Resource usage**: StealthyFetcher runs a real browser -- limit concurrent usage +- **Legal**: always check robots.txt 
and website ToS before scraping. This library is for educational and research purposes only. +- **Python version**: requires Python 3.10+
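A common pattern tying the three fetching strategies together is to escalate only when a cheaper fetcher is blocked, since each tier costs more time and resources than the last. The sketch below is an assumption, not part of Scrapling: `fetch_with_escalation` is a hypothetical helper, the fetchers are passed in as plain callables so the tier order stays explicit, and it assumes the returned page objects expose a `.status` attribute as shown in this document's examples.

```python
# Hypothetical escalation helper. With Scrapling this might be wired as:
#   tiers = [Fetcher.get,
#            lambda u: DynamicFetcher.fetch(u, headless=True),
#            lambda u: StealthyFetcher.fetch(u, solve_cloudflare=True)]
def fetch_with_escalation(url, tiers,
                          is_blocked=lambda page: page.status in (403, 429, 503)):
    """Try each fetch callable in order; return the first page not blocked."""
    last = None
    for fetch in tiers:
        try:
            last = fetch(url)
        except Exception:
            continue  # network/browser failure: fall through to the next tier
        if not is_blocked(last):
            return last  # cheapest strategy that worked
    return last  # every tier failed or was blocked; caller decides what to do
```

Keeping the tiers as callables also makes the policy easy to test with stubs, and keeps `solve_cloudflare=True` (and its 5-15 second cost) out of the happy path.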