feat(skills): add SiYuan Note and Scrapling as optional skills (#3742)

Add two new optional skills:

- siyuan (optional-skills/productivity/): SiYuan Note knowledge base
  API skill — search, read, create, and manage blocks/documents in a
  self-hosted SiYuan instance via curl. Requires SIYUAN_TOKEN.

- scrapling (optional-skills/research/): Intelligent web scraping skill
  using the Scrapling library — anti-bot fetching, Cloudflare bypass,
  CSS/XPath selectors, spider framework for multi-page crawling.

Placed in optional-skills/ (not bundled) since both are niche tools
that require external dependencies.

Co-authored-by: FEUAZUR <FEUAZUR@users.noreply.github.com>
This commit is contained in:
Teknium
2026-03-29 09:34:56 -07:00
committed by GitHub
parent aafe37012a
commit 811adca277
2 changed files with 632 additions and 0 deletions

View File

@@ -0,0 +1,297 @@
---
name: siyuan
description: SiYuan Note API for searching, reading, creating, and managing blocks and documents in a self-hosted knowledge base via curl.
version: 1.0.0
author: FEUAZUR
license: MIT
metadata:
hermes:
tags: [SiYuan, Notes, Knowledge Base, PKM, API]
related_skills: [obsidian, notion]
homepage: https://github.com/siyuan-note/siyuan
prerequisites:
env_vars: [SIYUAN_TOKEN]
commands: [curl, jq]
required_environment_variables:
- name: SIYUAN_TOKEN
prompt: SiYuan API token
help: "Settings > About in SiYuan desktop app"
- name: SIYUAN_URL
prompt: SiYuan instance URL (default http://127.0.0.1:6806)
required_for: remote instances
---
# SiYuan Note API
Use the [SiYuan](https://github.com/siyuan-note/siyuan) kernel API via curl to search, read, create, update, and delete blocks and documents in a self-hosted knowledge base. No extra tools needed -- just curl and an API token.
## Prerequisites
1. Install and run SiYuan (desktop or Docker)
2. Get your API token: **Settings > About > API token**
3. Store it in `~/.hermes/.env`:
```
SIYUAN_TOKEN=your_token_here
SIYUAN_URL=http://127.0.0.1:6806
```
`SIYUAN_URL` defaults to `http://127.0.0.1:6806` if not set.
## API Basics
All SiYuan API calls are **POST with JSON body**. Every request follows this pattern:
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/..." \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"param": "value"}'
```
Responses are JSON with this structure:
```json
{"code": 0, "msg": "", "data": { ... }}
```
`code: 0` means success. Any other value is an error -- check `msg` for details.
**ID format:** SiYuan IDs look like `20210808180117-6v0mkxr` (14-digit timestamp + 7 alphanumeric chars).
## Quick Reference
| Operation | Endpoint |
|-----------|----------|
| Full-text search | `/api/search/fullTextSearchBlock` |
| SQL query | `/api/query/sql` |
| Read block | `/api/block/getBlockKramdown` |
| Read children | `/api/block/getChildBlocks` |
| Get path | `/api/filetree/getHPathByID` |
| Get attributes | `/api/attr/getBlockAttrs` |
| List notebooks | `/api/notebook/lsNotebooks` |
| List documents | `/api/filetree/listDocsByPath` |
| Create notebook | `/api/notebook/createNotebook` |
| Create document | `/api/filetree/createDocWithMd` |
| Append block | `/api/block/appendBlock` |
| Update block | `/api/block/updateBlock` |
| Rename document | `/api/filetree/renameDocByID` |
| Set attributes | `/api/attr/setBlockAttrs` |
| Delete block | `/api/block/deleteBlock` |
| Delete document | `/api/filetree/removeDocByID` |
| Export as Markdown | `/api/export/exportMdContent` |
## Common Operations
### Search (Full-Text)
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/search/fullTextSearchBlock" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query": "meeting notes", "page": 0}' | jq '.data.blocks[:5]'
```
### Search (SQL)
Query the blocks database directly. Only SELECT statements are safe.
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/query/sql" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"stmt": "SELECT id, content, type, box FROM blocks WHERE content LIKE '\''%keyword%'\'' AND type='\''p'\'' LIMIT 20"}' | jq '.data'
```
Useful columns: `id`, `parent_id`, `root_id`, `box` (notebook ID), `path`, `content`, `type`, `subtype`, `created`, `updated`.
### Read Block Content
Returns block content in Kramdown (Markdown-like) format.
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/getBlockKramdown" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id": "20210808180117-6v0mkxr"}' | jq '.data.kramdown'
```
### Read Child Blocks
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/getChildBlocks" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id": "20210808180117-6v0mkxr"}' | jq '.data'
```
### Get Human-Readable Path
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/getHPathByID" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id": "20210808180117-6v0mkxr"}' | jq '.data'
```
### Get Block Attributes
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/attr/getBlockAttrs" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id": "20210808180117-6v0mkxr"}' | jq '.data'
```
### List Notebooks
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/notebook/lsNotebooks" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{}' | jq '.data.notebooks[] | {id, name, closed}'
```
### List Documents in a Notebook
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/listDocsByPath" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"notebook": "NOTEBOOK_ID", "path": "/"}' | jq '.data.files[] | {id, name}'
```
### Create a Document
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/createDocWithMd" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"notebook": "NOTEBOOK_ID",
"path": "/Meeting Notes/2026-03-22",
"markdown": "# Meeting Notes\n\n- Discussed project timeline\n- Assigned tasks"
}' | jq '.data'
```
### Create a Notebook
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/notebook/createNotebook" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "My New Notebook"}' | jq '.data.notebook.id'
```
### Append Block to Document
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/appendBlock" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"parentID": "DOCUMENT_OR_BLOCK_ID",
"data": "New paragraph added at the end.",
"dataType": "markdown"
}' | jq '.data'
```
Also available: `/api/block/prependBlock` (same params, inserts at the beginning) and `/api/block/insertBlock` (uses `previousID` instead of `parentID` to insert after a specific block).
### Update Block Content
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/updateBlock" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"id": "BLOCK_ID",
"data": "Updated content here.",
"dataType": "markdown"
}' | jq '.data'
```
### Rename a Document
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/filetree/renameDocByID" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id": "DOCUMENT_ID", "title": "New Title"}'
```
### Set Block Attributes
Custom attributes must be prefixed with `custom-`:
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/attr/setBlockAttrs" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"id": "BLOCK_ID",
"attrs": {
"custom-status": "reviewed",
"custom-priority": "high"
}
}'
```
### Delete a Block
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/block/deleteBlock" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id": "BLOCK_ID"}'
```
To delete a whole document: use `/api/filetree/removeDocByID` with `{"id": "DOC_ID"}`.
To delete a notebook: use `/api/notebook/removeNotebook` with `{"notebook": "NOTEBOOK_ID"}`.
### Export Document as Markdown
```bash
curl -s -X POST "${SIYUAN_URL:-http://127.0.0.1:6806}/api/export/exportMdContent" \
-H "Authorization: Token $SIYUAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id": "DOCUMENT_ID"}' | jq -r '.data.content'
```
## Block Types
Common `type` values in SQL queries:
| Type | Description |
|------|-------------|
| `d` | Document (root block) |
| `p` | Paragraph |
| `h` | Heading |
| `l` | List |
| `i` | List item |
| `c` | Code block |
| `m` | Math block |
| `t` | Table |
| `b` | Blockquote |
| `s` | Super block |
| `html` | HTML block |
## Pitfalls
- **All endpoints are POST** -- even read-only operations. Do not use GET.
- **SQL safety**: only use SELECT queries. INSERT/UPDATE/DELETE/DROP are dangerous and should never be sent.
- **ID validation**: IDs match the pattern `YYYYMMDDHHmmss-xxxxxxx`. Reject anything else.
- **Error responses**: always check `code != 0` in responses before processing `data`.
- **Large documents**: block content and export results can be very large. Use `LIMIT` in SQL and pipe through `jq` to extract only what you need.
- **Notebook IDs**: when working with a specific notebook, get its ID first via `lsNotebooks`.
## Alternative: MCP Server
If you prefer a native integration instead of curl, install the SiYuan MCP server:
```yaml
# In ~/.hermes/config.yaml under mcp_servers:
mcp_servers:
siyuan:
command: npx
args: ["-y", "@porkll/siyuan-mcp"]
env:
SIYUAN_TOKEN: "your_token"
SIYUAN_URL: "http://127.0.0.1:6806"
```

View File

@@ -0,0 +1,335 @@
---
name: scrapling
description: Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python.
version: 1.0.0
author: FEUAZUR
license: MIT
metadata:
hermes:
tags: [Web Scraping, Browser, Cloudflare, Stealth, Crawling, Spider]
related_skills: [duckduckgo-search, domain-intel]
homepage: https://github.com/D4Vinci/Scrapling
prerequisites:
commands: [scrapling, python]
---
# Scrapling
[Scrapling](https://github.com/D4Vinci/Scrapling) is a web scraping framework with anti-bot bypass, stealth browser automation, and a spider framework. It provides three fetching strategies (HTTP, dynamic JS, stealth/Cloudflare) and a full CLI.
**This skill is for educational and research purposes only.** Users must comply with local/international data scraping laws and respect website Terms of Service.
## When to Use
- Scraping static HTML pages (faster than browser tools)
- Scraping JS-rendered pages that need a real browser
- Bypassing Cloudflare Turnstile or bot detection
- Crawling multiple pages with a spider
- When the built-in `web_extract` tool does not return the data you need
## Installation
```bash
pip install "scrapling[all]"
scrapling install
```
Minimal install (HTTP only, no browser):
```bash
pip install scrapling
```
With browser automation only:
```bash
pip install "scrapling[fetchers]"
scrapling install
```
## Quick Reference
| Approach | Class | Use When |
|----------|-------|----------|
| HTTP | `Fetcher` / `FetcherSession` | Static pages, APIs, fast bulk requests |
| Dynamic | `DynamicFetcher` / `DynamicSession` | JS-rendered content, SPAs |
| Stealth | `StealthyFetcher` / `StealthySession` | Cloudflare, anti-bot protected sites |
| Spider | `Spider` | Multi-page crawling with link following |
## CLI Usage
### Extract Static Page
```bash
scrapling extract get 'https://example.com' output.md
```
With CSS selector and browser impersonation:
```bash
scrapling extract get 'https://example.com' output.md \
--css-selector '.content' \
--impersonate 'chrome'
```
### Extract JS-Rendered Page
```bash
scrapling extract fetch 'https://example.com' output.md \
--css-selector '.dynamic-content' \
--disable-resources \
--network-idle
```
### Extract Cloudflare-Protected Page
```bash
scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
--solve-cloudflare \
--block-webrtc \
--hide-canvas
```
### POST Request
```bash
scrapling extract post 'https://example.com/api' output.json \
--json '{"query": "search term"}'
```
### Output Formats
The output format is determined by the file extension:
- `.html` -- raw HTML
- `.md` -- converted to Markdown
- `.txt` -- plain text
- `.json` / `.jsonl` -- JSON
## Python: HTTP Scraping
### Single Request
```python
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
for q in quotes:
print(q)
```
### Session (Persistent Cookies)
```python
from scrapling.fetchers import FetcherSession
with FetcherSession(impersonate='chrome') as session:
page = session.get('https://example.com/', stealthy_headers=True)
links = page.css('a::attr(href)').getall()
for link in links[:5]:
sub = session.get(link)
print(sub.css('h1::text').get())
```
### POST / PUT / DELETE
```python
page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
page = Fetcher.delete('https://api.example.com/item/1')
```
### With Proxy
```python
page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')
```
## Python: Dynamic Pages (JS-Rendered)
For pages that require JavaScript execution (SPAs, lazy-loaded content):
```python
from scrapling.fetchers import DynamicFetcher
page = DynamicFetcher.fetch('https://example.com', headless=True)
data = page.css('.js-loaded-content::text').getall()
```
### Wait for Specific Element
```python
page = DynamicFetcher.fetch(
'https://example.com',
wait_selector=('.results', 'visible'),
network_idle=True,
)
```
### Disable Resources for Speed
Blocks fonts, images, media, stylesheets (~25% faster):
```python
from scrapling.fetchers import DynamicSession
with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
page = session.fetch('https://example.com')
items = page.css('.item::text').getall()
```
### Custom Page Automation
```python
from playwright.sync_api import Page
from scrapling.fetchers import DynamicFetcher
def scroll_and_click(page: Page):
page.mouse.wheel(0, 3000)
page.wait_for_timeout(1000)
page.click('button.load-more')
page.wait_for_selector('.extra-results')
page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
results = page.css('.extra-results .item::text').getall()
```
## Python: Stealth Mode (Anti-Bot Bypass)
For Cloudflare-protected or heavily fingerprinted sites:
```python
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch(
'https://protected-site.com',
headless=True,
solve_cloudflare=True,
block_webrtc=True,
hide_canvas=True,
)
content = page.css('.protected-content::text').getall()
```
### Stealth Session
```python
from scrapling.fetchers import StealthySession
with StealthySession(headless=True, solve_cloudflare=True) as session:
page1 = session.fetch('https://protected-site.com/page1')
page2 = session.fetch('https://protected-site.com/page2')
```
## Element Selection
All fetchers return a `Selector` object with these methods:
### CSS Selectors
```python
page.css('h1::text').get() # First h1 text
page.css('a::attr(href)').getall() # All link hrefs
page.css('.quote .text::text').getall() # Nested selection
```
### XPath
```python
page.xpath('//div[@class="content"]/text()').getall()
page.xpath('//a/@href').getall()
```
### Find Methods
```python
page.find_all('div', class_='quote') # By tag + attribute
page.find_by_text('Read more', tag='a') # By text content
page.find_by_regex(r'\$\d+\.\d{2}') # By regex pattern
```
### Similar Elements
Find elements with similar structure (useful for product listings, etc.):
```python
first_product = page.css('.product')[0]
all_similar = first_product.find_similar()
```
### Navigation
```python
el = page.css('.target')[0]
el.parent # Parent element
el.children # Child elements
el.next_sibling # Next sibling
el.prev_sibling # Previous sibling
```
## Python: Spider Framework
For multi-page crawling with link following:
```python
from scrapling.spiders import Spider, Request, Response
class QuotesSpider(Spider):
name = "quotes"
start_urls = ["https://quotes.toscrape.com/"]
concurrent_requests = 10
download_delay = 1
async def parse(self, response: Response):
for quote in response.css('.quote'):
yield {
"text": quote.css('.text::text').get(),
"author": quote.css('.author::text').get(),
"tags": quote.css('.tag::text').getall(),
}
next_page = response.css('.next a::attr(href)').get()
if next_page:
yield response.follow(next_page)
result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
### Multi-Session Spider
Route requests to different fetcher types:
```python
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class SmartSpider(Spider):
name = "smart"
start_urls = ["https://example.com/"]
def configure_sessions(self, manager):
manager.add("fast", FetcherSession(impersonate="chrome"))
manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
async def parse(self, response: Response):
for link in response.css('a::attr(href)').getall():
if "protected" in link:
yield Request(link, sid="stealth")
else:
yield Request(link, sid="fast", callback=self.parse)
```
### Pause/Resume Crawling
```python
spider = QuotesSpider(crawldir="./crawl_checkpoint")
spider.start() # Ctrl+C to pause, re-run to resume from checkpoint
```
## Pitfalls
- **Browser install required**: run `scrapling install` after pip install -- without it, `DynamicFetcher` and `StealthyFetcher` will fail
- **Timeouts**: DynamicFetcher/StealthyFetcher timeout is in **milliseconds** (default 30000), Fetcher timeout is in **seconds**
- **Cloudflare bypass**: `solve_cloudflare=True` adds 5-15 seconds to fetch time -- only enable when needed
- **Resource usage**: StealthyFetcher runs a real browser -- limit concurrent usage
- **Legal**: always check robots.txt and website ToS before scraping. This library is for educational and research purposes
- **Python version**: requires Python 3.10+