Sets up the AutoLoRA base model infrastructure for Project Bannerlord. No live model download (requires manual GGUF import into Ollama), but provides all scaffolding needed to onboard and validate the model.

Changes:
- Modelfile.hermes4-14b: Ollama Modelfile for NousResearch Hermes 4 14B (Q5_K_M, 32K context, temp=0.3, tool calling + reasoning configured)
- config/providers.yaml: register hermes4-14b and hermes4-36b as model entries on ollama-local; hermes4-14b promoted to top of the tools fallback chain (native tool calling + structured JSON)
- scripts/test_hermes4.py: 5-test smoke suite (availability, basic response, memory < 28 GB, tool calling, Timmy-persona acceptance)

Refs #1101
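The providers.yaml registration described above might look like the following sketch. All key names here (`providers`, `models`, `capabilities`, `fallback_chains`) are illustrative assumptions, not the repo's actual schema:

```yaml
# Sketch only — field names are assumptions, not the repo's real schema.
providers:
  ollama-local:
    models:
      hermes4-14b:
        context: 32768
        capabilities: [tools, reasoning, json]
      hermes4-36b:
        context: 32768
        capabilities: [tools, reasoning, json]

fallback_chains:
  tools:
    - hermes4-14b   # promoted to top: native tool calling + structured JSON
    # existing chain entries follow
```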
56 lines
2.1 KiB
Plaintext
# Modelfile.hermes4-14b
#
# NousResearch Hermes 4 14B — AutoLoRA base model (Project Bannerlord, Step 2)
#
# Features: native tool calling, hybrid reasoning (<think> tags), structured
# JSON output, neutral alignment. Built to serve as the LoRA fine-tuning base.
#
# Build:
#   # Download GGUF from HuggingFace first:
#   # https://huggingface.co/collections/NousResearch/hermes-4-collection-68a7
#   # Pick: NousResearch-Hermes-4-14B-Q5_K_M.gguf (or Q4_K_M for less RAM)
#   ollama create hermes4-14b -f Modelfile.hermes4-14b
#
# Or, if Hermes 4 lands on the Ollama registry directly:
#   ollama pull hermes4:14b
#   ollama create hermes4-14b -f Modelfile.hermes4-14b
#
# Memory budget: ~9 GB at Q4_K_M, ~11 GB at Q5_K_M — leaves headroom on a 36 GB M3 Max
# Context: 32K comfortable (128K theoretical)
# Primary use: AutoLoRA base before fine-tuning on the Timmy skill set

# --- Option A: import local GGUF (uncomment and set correct path) ---
# FROM /path/to/NousResearch-Hermes-4-14B-Q5_K_M.gguf

# --- Option B: build from Ollama registry model (if available) ---
FROM hermes4:14b
# Context window — 32K leaves ~20 GB headroom for KV cache on M3 Max
PARAMETER num_ctx 32768
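The 32K headroom claim can be sanity-checked with quick arithmetic. A rough KV-cache estimate follows; the layer/head counts are illustrative assumptions for a ~14B GQA model, not the published Hermes 4 14B architecture — swap in real values to re-check:

```python
# Back-of-envelope KV-cache sizing for the num_ctx choice above.
# NOTE: n_layers / n_kv_heads / head_dim are assumed values for
# illustration, not the actual Hermes 4 14B config.

def kv_cache_gib(num_ctx, n_layers=40, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    """K+V cache size in GiB for one sequence at full context (fp16)."""
    # 2 tensors (K and V) per layer, each of shape
    # [n_kv_heads, num_ctx, head_dim]
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem
    return total_bytes / 2**30

print(f"{kv_cache_gib(32768):.1f} GiB")  # 5.0 GiB under these assumptions
```

Even with the model weights at ~11 GB (Q5_K_M), a cache on this order fits comfortably inside a 36 GB machine.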
# Tool-calling temperature — lower for reliable structured output
PARAMETER temperature 0.3

# Nucleus sampling — balanced for reasoning + tool use
PARAMETER top_p 0.9

# Repeat penalty — prevents looping in structured output
PARAMETER repeat_penalty 1.05

# Stop tokens for the Hermes 4 chat template (ChatML format).
# These are handled automatically by the model's tokenizer config,
# but listed here for reference.
# STOP "<|im_end|>"
# STOP "<|endoftext|>"
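The PARAMETER values above are baked-in defaults; callers can still override them per request through the `options` field of Ollama's `/api/chat` endpoint. A sketch of such a request body (the prompt content is just an example):

```json
{
  "model": "hermes4-14b",
  "messages": [
    {"role": "user", "content": "List three factions as a JSON array."}
  ],
  "format": "json",
  "options": {
    "temperature": 0.1,
    "top_p": 0.9,
    "repeat_penalty": 1.05,
    "num_ctx": 32768
  }
}
```

Setting `"format": "json"` constrains the output to valid JSON, which pairs well with the low tool-calling temperature.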
SYSTEM """You are Hermes, a helpful, honest, and harmless AI assistant.

You have access to tool calling. When you need to use a tool, output a JSON function call in the following format:

<tool_call>
{"name": "function_name", "arguments": {"param": "value"}}
</tool_call>

You support hybrid reasoning. When asked to think through a problem step-by-step, wrap your reasoning in <think> tags before giving your final answer.

Always provide structured, accurate responses."""
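Downstream code has to recover tool calls and strip the hybrid-reasoning spans from raw completions in the format the system prompt specifies. A minimal client-side sketch — the helper name `parse_completion` is ours, not part of any Ollama API:

```python
import json
import re

# Matches the <tool_call>{...}</tool_call> format from the system prompt
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
# Matches hybrid-reasoning spans
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_completion(text):
    """Split a raw Hermes completion into (final_text, tool_calls)."""
    tool_calls = [json.loads(m) for m in TOOL_CALL_RE.findall(text)]
    # Drop reasoning and tool-call spans, keep the user-facing answer
    cleaned = THINK_RE.sub("", TOOL_CALL_RE.sub("", text)).strip()
    return cleaned, tool_calls

raw = (
    "<think>User wants weather; call the tool.</think>\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Oslo"}}\n'
    "</tool_call>\n"
    "Checking the weather now."
)
answer, calls = parse_completion(raw)
print(answer)            # Checking the weather now.
print(calls[0]["name"])  # get_weather
```

The lazy `.*?` in `TOOL_CALL_RE` still captures nested braces correctly because the regex must also match the closing `</tool_call>` tag before it can succeed.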