Performance Optimization Guide

Maximize llama.cpp inference speed and efficiency.

CPU Optimization

Thread tuning

# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8

# For AMD Ryzen 9 7950X (16 cores, 32 threads)
-t 16  # Best: physical cores

# Avoid hyperthreading (slower for matrix ops)
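
As a rough sketch, you can derive the physical-core count at runtime instead of hard-coding it. This assumes Linux with nproc and lscpu available:

# Physical cores = logical CPUs / threads per core (Linux only)
THREADS_PER_CORE=$(lscpu | awk '/^Thread\(s\) per core:/ {print $4}')
PHYS_CORES=$(( $(nproc) / THREADS_PER_CORE ))
./llama-cli -m model.gguf -t "$PHYS_CORES"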

BLAS acceleration

# Build with OpenBLAS for faster matrix ops
make LLAMA_OPENBLAS=1

# BLAS gives a 2-3× speedup on prompt processing;
# token-by-token generation is largely unaffected
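
Note that newer llama.cpp trees build with CMake rather than make, and the option names have shifted across versions, so treat this as a sketch for recent checkouts:

# CMake build with OpenBLAS (option names vary by version)
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release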

GPU Offloading

Layer offloading

# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35

# Offload all layers
./llama-cli -m model.gguf -ngl 999

# Find optimal value:
# Start with -ngl 999
# If OOM, reduce by 5 until fits
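
The search can be scripted. A minimal sketch, assuming llama-cli exits with a non-zero status when it runs out of VRAM:

# Step -ngl down until the model loads and generates without OOM
for ngl in 999 60 55 50 45 40 35 30; do
  if ./llama-cli -m model.gguf -ngl "$ngl" -p "test" -n 8 >/dev/null 2>&1; then
    echo "fits at -ngl $ngl"
    break
  fi
done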

Memory usage

# Check VRAM usage
nvidia-smi dmon

# Reduce context if needed
./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K
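
For a continuously updating readout of just the memory numbers, nvidia-smi's query mode works as well:

# Poll VRAM usage once per second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1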

Batch Processing

# Increase logical batch size for prompt-processing throughput
./llama-cli -m model.gguf -b 1024  # default: 512

# Physical batch size (tokens processed per GPU pass)
-ub 128  # long form: --ubatch-size; lower it to reduce VRAM use
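
To measure the effect rather than guess, llama-bench (bundled with llama.cpp) can sweep batch sizes in one run. The comma-separated values below are one plausible sweep, not a recommendation:

# Compare prompt throughput across logical batch sizes
./llama-bench -m model.gguf -p 512 -n 0 -b 256,512,1024,2048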

Context Management

# Small context (512 tokens, the default in older builds)
-c 512

# Longer context (slower, more memory)
-c 4096

# Very long context (if model supports)
-c 32768
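
Context length drives KV-cache memory roughly linearly: bytes ≈ 2 (K and V) × layers × context × embedding dim × bytes per element, for models without grouped-query attention. A back-of-the-envelope check for Llama 2-7B (32 layers, embedding dim 4096) with an fp16 cache:

# ≈ 2 GiB of KV cache at -c 4096; halving -c halves this
echo "$(( 2 * 32 * 4096 * 4096 * 2 / 1024 / 1024 )) MiB"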

Benchmarks

CPU Performance (Llama 2-7B Q4_K_M)

| Setup           | Speed    | Notes              |
|-----------------|----------|--------------------|
| Apple M3 Max    | 50 tok/s | Metal acceleration |
| AMD 7950X (16c) | 35 tok/s | OpenBLAS           |
| Intel i9-13900K | 30 tok/s | AVX2               |

GPU Offloading (RTX 4090)

| Layers on GPU | Speed     | VRAM  |
|---------------|-----------|-------|
| 0 (CPU only)  | 30 tok/s  | 0 GB  |
| 20 (hybrid)   | 80 tok/s  | 8 GB  |
| 35 (all)      | 120 tok/s | 12 GB |
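
Figures like these are easiest to reproduce with llama-bench, which accepts comma-separated values for most parameters. The model filename here is illustrative:

# Sweep GPU layer counts in one run
./llama-bench -m llama-2-7b.Q4_K_M.gguf -ngl 0,20,35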