# THE SEED ARCHITECTURE EPIC

## Hermes Agent → Claw Code Harness → Gemma 4

---

## Vision

A new architectural paradigm where **Hermes Agents become a thin dispatch layer** — stripped of their own intelligence, they become pure conductors. All reasoning flows through the **Claw Code harness** to **Gemma 4**, running locally as the single source of truth.

This is the seed. Gemma 4 cannot be made smaller without becoming rubber — useless. The only path to greater intelligence is to make it bigger. We plant this seed now.

---

## Core Principle

> **Hermes Agent = Fit Layer**
>
> It does not think. It routes. It sends everything to the Claw Code harness.

---

## The Stack (Bottom to Top)

```
┌─────────────────────────────────────────────────────────────┐
│             USER INTERFACE (Telegram, CLI, Web)              │
├─────────────────────────────────────────────────────────────┤
│  HERMES AGENT — The Fit Layer                                 │
│    ├─ Receives input                                          │
│    ├─ Sends to Claw Code harness (no local reasoning)         │
│    ├─ Returns output to user                                  │
│    └─ NO INTELLIGENCE LAYER — Pure dispatch                   │
├─────────────────────────────────────────────────────────────┤
│  CLAW CODE HARNESS — The Runtime                              │
│    ├─ Tool registry and dispatch                              │
│    ├─ Context management                                      │
│    ├─ Session persistence                                     │
│    └─ Routes to appropriate backend                           │
├─────────────────────────────────────────────────────────────┤
│  GEMMA 4 — The Seed (Local, Multimodal, 26B MoE)              │
│    ├─ All reasoning happens here                              │
│    ├─ 26B quality, 4B active speed                            │
│    ├─ Apache 2.0 — truly open                                 │
│    └─ Cannot shrink without getting rubber                    │
└─────────────────────────────────────────────────────────────┘
```

---

## Why This Architecture?

### 1. The Gemma 4 Insight

Gemma 4 is the inflection point:

- **Cannot be smaller without quality loss** — the rubber threshold is hit
- **Cannot be smarter without scale** — MoE is the path
- **26B MoE (4B active)** = quality of a giant, speed of a small model
- **Apache 2.0** = no chains, no vendor, no cloud dependency

### 2. The Hermes Agent Transformation

- Old: Hermes Agent = Full intelligence layer (Claude, GPT, etc.)
- New: Hermes Agent = **Fit layer** — minimal, pure dispatch

Benefits:

- **Predictable**: Same model, same behavior every time
- **Sovereign**: No cloud calls, no API keys, no rate limits
- **Fast**: Local inference, no network latency
- **Honest**: Gemma 4 doesn't roleplay — it answers

### 3. The Claw Code Harness Role

The harness becomes the **orchestration layer** (sketched below):

- Tool registry (what can be done)
- Context window management
- Multi-turn conversation state
- Backend selection (Gemma 4 variants)
- Fallback handling
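
A minimal sketch of that orchestration loop, as one way it could look. `Session`, `handle_turn`, the `call_backend` callable, and the fallback order are illustrative assumptions, not the actual Claw Code API:

```python
# Sketch of one harness turn: multi-turn state, backend selection, fallback.
# All names here are illustrative; only the variant names come from this epic.
from dataclasses import dataclass, field
from typing import Callable

FALLBACK_CHAIN = ["gemma4-26b-moe", "gemma4-31b", "gemma4-4b"]  # assumed order

@dataclass
class Session:
    history: list = field(default_factory=list)  # multi-turn conversation state

def handle_turn(
    session: Session,
    user_input: str,
    call_backend: Callable[[str, list], str],  # (variant, messages) -> reply
) -> str:
    """Record the turn, try each Gemma 4 variant in order, return the first reply."""
    session.history.append({"role": "user", "content": user_input})
    last_error = None
    for variant in FALLBACK_CHAIN:           # backend selection with fallback
        try:
            reply = call_backend(variant, session.history)
            session.history.append({"role": "assistant", "content": reply})
            return reply
        except Exception as err:             # backend down, timeout, bad output
            last_error = err
    raise RuntimeError(f"all Gemma 4 variants failed: {last_error}")
```
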
---

## Component Specifications

### Hermes Agent (Fit Layer)

```yaml
# Hermes Agent config — MINIMAL
profile:
  name: allegro-fit
  type: dispatcher

intelligence:
  enabled: false  # NO local reasoning

dispatch:
  target: claw-code-harness
  timeout: 30s
  retry: 3

tools:
  allowed: all  # Pass through to harness
  local: []     # No local tool execution
```

**Responsibilities** (dispatch sketch below):

1. Receive user input
2. Send to Claw Code harness
3. Return harness output to user
4. Handle errors gracefully
5. **NO DECISIONS** — just dispatch
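
A minimal sketch of the fit layer under these assumptions: the harness is reachable over local HTTP (the URL and JSON shape below are hypothetical), and the agent only forwards and relays:

```python
# Fit-layer sketch: receive, forward to the harness, return. No reasoning here.
# HARNESS_URL and the request/response fields are assumptions for illustration.
import json
import urllib.request

HARNESS_URL = "http://localhost:8700/dispatch"  # hypothetical harness endpoint

def dispatch(user_input: str, timeout_s: float = 30.0, retries: int = 3) -> str:
    """Forward one user message to the Claw Code harness and relay its reply."""
    payload = json.dumps({"input": user_input}).encode("utf-8")
    request = urllib.request.Request(
        HARNESS_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    last_error = None
    for _ in range(retries):                 # mirrors dispatch.retry: 3
        try:
            with urllib.request.urlopen(request, timeout=timeout_s) as resp:
                return json.loads(resp.read())["output"]
        except OSError as err:               # connection refused, timeout, ...
            last_error = err
    return f"[fit layer] harness unreachable: {last_error}"  # graceful failure
```

The timeout and retry count mirror the `dispatch` block in the config above; everything else is plumbing, which is the point.
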

### Claw Code Harness

```yaml
# Claw Code harness config
harness:
  version: 2.0

backend:
  primary: gemma4-26b-moe
  variants:
    - gemma4-2b       # Ultra-edge tasks
    - gemma4-4b       # Standard tasks
    - gemma4-31b      # Complex reasoning
    - gemma4-26b-moe  # Default (efficiency)

routing:
  logic: |
    if task.complexity > 0.8: use gemma4-31b
    elif task.speed_critical: use gemma4-4b
    elif task.audio_required: use gemma4-2b
    else: use gemma4-26b-moe

tools:
  registry: /etc/claw/tools/
  auto_load: true

context:
  window: 8192
  summarization: auto
```

**Responsibilities** (routing sketch below):

1. Receive dispatch from Hermes Agent
2. Route to appropriate Gemma 4 variant
3. Manage tool registry
4. Handle context windows
5. Return formatted response
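
The `routing.logic` block above is pseudocode. One direct way to express it, assuming a task record that carries the fields the config names:

```python
# Direct translation of routing.logic above; the Task fields are assumed.
from dataclasses import dataclass

@dataclass
class Task:
    complexity: float            # 0.0 to 1.0, however the harness scores it
    speed_critical: bool = False
    audio_required: bool = False

def select_variant(task: Task) -> str:
    """Pick the Gemma 4 variant exactly as routing.logic describes."""
    if task.complexity > 0.8:
        return "gemma4-31b"      # complex reasoning
    if task.speed_critical:
        return "gemma4-4b"       # latency-sensitive tasks
    if task.audio_required:
        return "gemma4-2b"       # ultra-edge path
    return "gemma4-26b-moe"      # default: MoE efficiency

# Example: select_variant(Task(complexity=0.9)) -> "gemma4-31b"
```
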

### Gemma 4 (The Seed)

```yaml
# Gemma 4 local deployment
gemma4:
  default: 26b-moe

  models:
    2b:
      path: /opt/models/gemma-4-2b-it-Q4_K_M.gguf
      gpu_layers: -1
      context: 8192

    4b:
      path: /opt/models/gemma-4-4b-it-Q4_K_M.gguf
      gpu_layers: -1
      context: 8192

    31b:
      path: /opt/models/gemma-4-31b-it-Q4_K_M.gguf
      gpu_layers: -1
      context: 8192

    26b-moe:
      path: /opt/models/gemma-4-26b-moe-Q4_K_M.gguf
      gpu_layers: -1
      context: 8192
      active_params: 4b

  server:
    backend: llama.cpp  # or vllm for production
    host: localhost
    port: 8645
    threads: 8
```

**Responsibilities** (client sketch below):

1. ALL REASONING — The only intelligence layer
2. Tool use via function calling
3. Multimodal understanding
4. Code generation
5. Long context handling
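
A minimal client for the server block above, assuming it is started with llama.cpp's `llama-server`, which exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the configured host and port (adjust if vLLM or another backend is used):

```python
# Minimal chat call against the local Gemma 4 server configured above.
# Endpoint shape assumes llama.cpp's OpenAI-compatible server API.
import json
import urllib.request

GEMMA_URL = "http://localhost:8645/v1/chat/completions"  # host/port from config

def ask_gemma(prompt: str, max_tokens: int = 512) -> str:
    """Send one chat turn to the local Gemma 4 instance and return its reply."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }).encode("utf-8")
    request = urllib.request.Request(
        GEMMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```
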
---

## The Skill Prototype

The Gemma 4 Hermes Profile (`~/.hermes/profiles/gemma4/`) is the **skill prototype** for this seed:

- ✅ All 4 variants configured
- ✅ All 6 platforms supported
- ✅ 3 complete skills (inference, multimodal, fine-tuning)
- ✅ Production tooling (download, benchmark, deploy)
- ✅ Apache 2.0 — truly open

This profile becomes the **reference implementation** for all Archons.

---

## Deployment Architecture

### Per-Archon Setup

```
Each Archon (Ezra, Electra, Allegro, etc.):
├── Hermes Agent (fit layer)
│   └── ~/.hermes/profiles/{archon}/
│       └── config.yaml → dispatch to Claw Code
├── Claw Code Harness
│   └── /opt/claw/{archon}/
│       └── runtime.yaml → route to Gemma 4
└── Gemma 4 Seed
    └── Shared local instance OR dedicated GPU slice
```

### Shared Infrastructure

```
House Infrastructure:
├── Gemma 4 Server (local)
│   ├── 26B MoE (default)
│   ├── 31B Dense (complex tasks)
│   └── 4B (fast tasks)
├── Gitea (code, issues, PRs)
├── Nostr (sovereign messaging)
└── Backblaze B2 (encrypted backup)
```

---

## Gantt Chart: Implementation

```
Phase 1: Foundation (Week 1-2)
├── Day 1-3: Gemma 4 server deployment
│   ├── Download 26B MoE, 31B, 4B GGUFs
│   ├── llama.cpp server setup
│   └── vLLM production config
├── Day 4-7: Claw Code harness v2
│   ├── Remove old intelligence layer
│   ├── Add Gemma 4 routing
│   └── Tool registry refactor
└── Day 8-14: Hermes Agent fit layer
    ├── Strip local reasoning
    ├── Dispatch-only mode
    └── Error handling

Phase 2: Archon Deployment (Week 3-4)
├── Day 15-17: Ezra Archon
├── Day 18-20: Electra Archon
├── Day 21-23: Allegro Archon migration
└── Day 24-28: Testing & hardening

Phase 3: Integration (Week 5-6)
├── Day 29-32: Gitea webhook automation
├── Day 33-35: Telegram bot integration
├── Day 36-38: Nostr bridge
└── Day 39-42: Backup & monitoring
```

---

## Acceptance Criteria

| ID | Criterion | Test |
|----|-----------|------|
| A1 | Gemma 4 26B MoE serves locally at >20 tok/s | Throughput benchmark (sketch below) |
| A2 | Hermes Agent has NO local intelligence | Config audit |
| A3 | All queries route through Claw Code harness | Packet capture |
| A4 | Tool use works via Gemma 4 function calling | Test suite |
| A5 | Each Archon has independent fit layer | Multi-agent test |
| A6 | Gitea issues auto-route to correct Archon | Webhook test |
| A7 | Telegram bot responds via Gemma 4 | E2E test |
| A8 | No cloud AI calls in packet log | Network audit |
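
A rough way to check A1 against the local server, assuming the OpenAI-compatible endpoint shown earlier and that the response reports `usage.completion_tokens` (llama.cpp's `llama-server` does). Prompt processing time is included in the measurement, so treat the result as a lower bound:

```python
# Rough A1 check: generated tokens per second against the local Gemma 4 server.
import json
import time
import urllib.request

URL = "http://localhost:8645/v1/chat/completions"  # host/port from the config

def tokens_per_second(prompt: str = "Explain MoE routing in three sentences.") -> float:
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode("utf-8")
    request = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=300) as resp:
        data = json.loads(resp.read())
    elapsed = time.monotonic() - start
    return data["usage"]["completion_tokens"] / elapsed

if __name__ == "__main__":
    rate = tokens_per_second()
    print(f"{rate:.1f} tok/s -> {'PASS' if rate > 20 else 'FAIL'} (A1 threshold: 20)")
```
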
---

## Risk Analysis

| Risk | Mitigation |
|------|------------|
| Gemma 4 too slow | Use 4B for speed, 26B MoE for quality |
| Memory constraints | Quantization (Q4_K_M), GPU offloading |
| Tool use failures | Extensive prompt engineering, fallback chains |
| Context window limits | Automatic summarization, RAG (see sketch below) |
| Hardware failure | Hot standby, B2 restore |
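
One simple shape the automatic-summarization mitigation could take, shown as a sketch. The 4-characters-per-token estimate and the `summarize` callable (e.g. another call into Gemma 4) are assumptions:

```python
# Naive context-window guard: collapse old turns into a summary near the limit.
from typing import Callable

CONTEXT_WINDOW = 8192       # matches context.window in the harness config
RESERVED_FOR_REPLY = 1024   # leave room for the model's answer

def estimate_tokens(messages: list) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def fit_context(messages: list, summarize: Callable[[list], str]) -> list:
    """Keep recent turns verbatim; replace older ones with a single summary."""
    if estimate_tokens(messages) <= CONTEXT_WINDOW - RESERVED_FOR_REPLY:
        return messages
    older, recent = messages[:-4], messages[-4:]   # keep the last few turns
    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + summarize(older)}
    return [summary] + recent
```
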
---

## The Philosophy

> **What cannot be moved.**

Stapes is the immovable bone. This architecture is the immovable foundation:

- Gemma 4 runs local — no cloud to fail
- Apache 2.0 — no license to revoke
- Fit layer Hermes — no vendor to change terms
- Claw Code harness — no black box to debug

The seed is planted. It will grow.

---

## References

- Gemma 4 Profile: `~/.hermes/profiles/gemma4/`
- Claw Code Docs: `/root/wizards/claw-code/docs/gemma4-integration.md`
- Hugging Face: https://huggingface.co/blog/gemma4

---

**Epic Owner:** Allegro
**Status:** In Progress
**Priority:** P0 — Foundation of All Future Work
**Seed Date:** 2026-04-02