Compare commits

...

5 Commits

23 changed files with 3465 additions and 21 deletions

View File

@@ -0,0 +1,230 @@
# Bannerlord Windows VM Setup Guide
**Issue:** #1098
**Parent Epic:** #1091 (Project Bannerlord)
**Date:** 2026-03-23
**Status:** Reference
---
## Overview
This document covers provisioning the Windows VM that hosts Bannerlord + GABS mod,
verifying the GABS TCP JSON-RPC server, and confirming connectivity from Hermes.
Architecture reminder:
```
Timmy (Qwen3 on Ollama, Hermes M3 Max)
  → GABS TCP/JSON-RPC (port 4825)
    → Bannerlord.GABS C# mod
      → Game API + Harmony
        → Bannerlord (Windows VM)
```
---
## 1. Provision Windows VM
### Minimum Spec
| Resource | Minimum | Recommended |
|----------|---------|-------------|
| CPU | 4 cores | 8 cores |
| RAM | 16 GB | 32 GB |
| Disk | 100 GB SSD | 150 GB SSD |
| OS | Windows Server 2022 / Windows 11 | Windows 11 |
| Network | Private VLAN to Hermes | Private VLAN to Hermes |
### Hetzner (preferred)
```bash
# Hetzner Cloud CLI — create CX41 (4 vCPU, 16 GB RAM, 160 GB SSD)
hcloud server create \
--name bannerlord-vm \
--type cx41 \
--image windows-server-2022 \
--location nbg1 \
--ssh-key your-key
```
### DigitalOcean alternative
```
Droplet: General Purpose 4 vCPU / 16 GB / 100 GB SSD
Image: Windows Server 2022
Region: Same region as Hermes
```
### Post-provision
1. Enable RDP (port 3389) for initial setup only — close after configuration
2. Open port 4825 TCP inbound from Hermes IP only
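3. Keep Windows Firewall enabled and add an allow rule for port 4825 scoped to the Hermes IP:
```powershell
# <HERMES_IP> is a placeholder for the address Hermes connects from
New-NetFirewallRule -DisplayName "GABS TCP" -Direction Inbound `
-Protocol TCP -LocalPort 4825 -RemoteAddress "<HERMES_IP>" -Action Allow
```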
---
## 2. Install Steam + Bannerlord
### Steam installation
1. Download Steam installer from store.steampowered.com
2. Install silently:
```powershell
.\SteamSetup.exe /S
```
3. Log in with a dedicated Steam account (not personal)
### Bannerlord installation
```powershell
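# Install Bannerlord (App ID: 261550) via SteamCMD (a separate download from the Steam client).
# Placeholders are quoted because a bare < or > is a parse error in PowerShell.
steamcmd +login "<user>" "<pass>" +app_update 261550 validate +quit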
```
### Pin game version
GABS requires a specific Bannerlord version. To pin and prevent auto-updates:
1. Right-click Bannerlord in Steam → Properties → Updates
2. Set "Automatic Updates" to "Only update this game when I launch it"
3. Record the current version in `docs/research/bannerlord-vm-setup.md` after installation
```powershell
# Check installed version
Get-Content "C:\Program Files (x86)\Steam\steamapps\appmanifest_261550.acf" |
Select-String "buildid"
```
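The same check can be scripted so the build ID lands in the reproducibility record without manual copying. The following is a minimal sketch, assuming the manifest is Valve KeyValues text containing a quoted `"buildid"` key followed by a quoted numeric value; `read_buildid` and `read_buildid_from_file` are hypothetical helper names, not part of this change set.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumption: the manifest holds a line such as  "buildid"    "12345678"
_BUILDID_RE = re.compile(r'"buildid"\s+"(\d+)"')


def read_buildid(manifest_text: str) -> str | None:
    """Return the build ID found in the manifest text, or None if absent."""
    match = _BUILDID_RE.search(manifest_text)
    return match.group(1) if match else None


def read_buildid_from_file(path: Path) -> str | None:
    """Read an appmanifest file from disk and extract its build ID."""
    return read_buildid(path.read_text(encoding="utf-8", errors="replace"))
```

Returning `None` instead of raising lets a caller distinguish "manifest present but no build ID" from a missing file, which surfaces as the usual `FileNotFoundError`.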
---
## 3. Install GABS Mod
### Source
- NexusMods: https://www.nexusmods.com/mountandblade2bannerlord/mods/10419
- GitHub: https://github.com/BUTR/Bannerlord.GABS
- AGENTS.md: https://github.com/BUTR/Bannerlord.GABS/blob/master/AGENTS.md
### Installation via Vortex (NexusMods)
1. Install Vortex Mod Manager
2. Download GABS mod package from NexusMods
3. Install via Vortex — it handles the Modules/ directory layout automatically
4. Enable in the mod list and set load order after Harmony
### Manual installation
```powershell
# Copy mod to Bannerlord Modules directory
$BannerlordPath = "C:\Program Files (x86)\Steam\steamapps\common\Mount & Blade II Bannerlord"
Copy-Item -Recurse ".\Bannerlord.GABS" "$BannerlordPath\Modules\Bannerlord.GABS"
```
### Required dependencies
- **Harmony** (BUTR.Harmony) — must load before GABS
- **ButterLib** — utility library
Install via the same method as GABS.
### GABS configuration
GABS TCP server listens on `0.0.0.0:4825` by default. To confirm or override:
```
%APPDATA%\Mount and Blade II Bannerlord\Configs\Bannerlord.GABS\settings.json
```
Expected defaults:
```json
{
"ServerHost": "0.0.0.0",
"ServerPort": 4825,
"LogLevel": "Information"
}
```
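Before launching the game it can help to sanity-check the settings file. The sketch below assumes the file is plain JSON with the `ServerHost` and `ServerPort` keys shown above; `check_gabs_settings` is a hypothetical helper, and the loopback check reflects the requirement that Hermes connects from another machine.

```python
from __future__ import annotations

import json


def check_gabs_settings(raw_json: str, expected_port: int = 4825) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    try:
        settings = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"settings.json is not valid JSON: {exc}"]
    if not isinstance(settings, dict):
        return ["settings.json must contain a JSON object"]

    problems: list[str] = []
    host = settings.get("ServerHost")
    if not host:
        problems.append("ServerHost is missing or empty")
    elif host in ("127.0.0.1", "localhost", "::1"):
        # A loopback bind would make the server unreachable from Hermes.
        problems.append(f"ServerHost {host!r} is loopback-only")
    port = settings.get("ServerPort")
    if port != expected_port:
        problems.append(f"ServerPort is {port!r}, expected {expected_port}")
    return problems
```

An empty return value means the file matches the expected defaults closely enough for a remote client to connect.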
---
## 4. Verify GABS TCP Server
### Start Bannerlord with GABS
Launch Bannerlord with the mod enabled. GABS starts its TCP server during game
initialisation. Watch the game log for:
```
[GABS] TCP server listening on 0.0.0.0:4825
```
Log location:
```
%APPDATA%\Mount and Blade II Bannerlord\logs\rgl_log_*.txt
```
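Scanning the log by hand gets tedious across restarts. As a rough sketch, assuming the listening message has exactly the format shown above (other GABS builds may log it differently), a small parser can pull out the bound host and port; `find_gabs_listener` is a hypothetical name.

```python
from __future__ import annotations

import re
from typing import Iterable

# Assumption: the message format matches the example line in this guide.
_LISTEN_RE = re.compile(r"\[GABS\] TCP server listening on (\S+):(\d+)")


def find_gabs_listener(lines: Iterable[str]) -> tuple[str, int] | None:
    """Return (host, port) from the first matching log line, else None."""
    for line in lines:
        match = _LISTEN_RE.search(line)
        if match:
            return match.group(1), int(match.group(2))
    return None
```

Because it accepts any iterable of strings, an open file handle on the newest `rgl_log_*.txt` can be passed in directly.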
### Local connectivity check (on VM)
```powershell
# Verify port is listening
netstat -an | findstr 4825
# Quick TCP probe
Test-NetConnection -ComputerName localhost -Port 4825
```
### Send a test JSON-RPC call
```powershell
$msg = '{"jsonrpc":"2.0","method":"ping","id":1}'
$client = New-Object System.Net.Sockets.TcpClient("localhost", 4825)
$stream = $client.GetStream()
$writer = New-Object System.IO.StreamWriter($stream)
$writer.AutoFlush = $true
$writer.WriteLine($msg)
$reader = New-Object System.IO.StreamReader($stream)
$response = $reader.ReadLine()
Write-Host "Response: $response"
$client.Close()
```
Expected response shape:
```json
{"jsonrpc":"2.0","result":{"status":"ok"},"id":1}
```
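For machines without PowerShell, the same probe can be sketched in Python. It assumes newline-delimited JSON-RPC 2.0 framing, which is what the connectivity script in this change set uses; `gabs_ping` is a hypothetical helper name and the `ping` method is taken from the example above.

```python
from __future__ import annotations

import json
import socket


def gabs_ping(host: str, port: int = 4825, timeout: float = 5.0) -> dict:
    """Send one newline-terminated JSON-RPC 2.0 'ping' and return the reply."""
    request = {"jsonrpc": "2.0", "method": "ping", "id": 1}
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((json.dumps(request) + "\n").encode("utf-8"))
        buf = b""
        # Read until the first newline, which marks the end of one response.
        while b"\n" not in buf:
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed before a full line arrived")
            buf += chunk
    return json.loads(buf.split(b"\n", 1)[0].decode("utf-8"))
```

Calling `gabs_ping("localhost")` on the VM should return a dict shaped like the expected response; a timeout or `ConnectionError` points back to the firewall and mod checks earlier in this guide.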
---
## 5. Test Connectivity from Hermes
Use `scripts/test_gabs_connectivity.py` (checked in with this issue):
```bash
# From Hermes (M3 Max)
python scripts/test_gabs_connectivity.py --host <VM_IP> --port 4825
```
The script tests:
1. TCP socket connection
2. JSON-RPC ping round-trip
3. `get_game_state` call
4. Response latency (target < 100 ms on LAN)
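The latency step boils down to timing several round trips and comparing the average with the target. A minimal sketch of that measurement follows, assuming any zero-argument callable stands in for one round trip; `measure_latency_ms` and `within_target` are hypothetical names, not functions from the script.

```python
from __future__ import annotations

import time
from typing import Callable


def measure_latency_ms(call: Callable[[], object], iterations: int = 5) -> dict[str, float]:
    """Time `call` several times and return avg/min/max in milliseconds."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    samples: list[float] = []
    for _ in range(iterations):
        start = time.monotonic()
        call()
        samples.append((time.monotonic() - start) * 1000.0)
    return {
        "avg": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
    }


def within_target(stats: dict[str, float], target_ms: float = 100.0) -> bool:
    """True when the average round trip is at or below the target."""
    return stats["avg"] <= target_ms
```

`time.monotonic()` is used instead of wall-clock time so that clock adjustments during the run cannot skew the samples.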
---
## 6. Firewall / Network Summary
| Source | Destination | Port | Protocol | Purpose |
|--------|-------------|------|----------|---------|
| Hermes (local) | Bannerlord VM | 4825 | TCP | GABS JSON-RPC |
| Admin workstation | Bannerlord VM | 3389 | TCP | RDP setup (disable after) |
---
## 7. Reproducibility Checklist
After completing setup, record:
- [ ] VM provider + region + instance type
- [ ] Windows version + build number
- [ ] Steam account used (non-personal, credentials in secrets manager)
- [ ] Bannerlord App version (buildid from appmanifest)
- [ ] GABS version (from NexusMods or GitHub release tag)
- [ ] Harmony version
- [ ] ButterLib version
- [ ] GABS settings.json contents
- [ ] VM IP address (update Timmy config)
- [ ] Connectivity test output from `test_gabs_connectivity.py`
---
## References
- GABS GitHub: https://github.com/BUTR/Bannerlord.GABS
- GABS AGENTS.md: https://github.com/BUTR/Bannerlord.GABS/blob/master/AGENTS.md
- NexusMods page: https://www.nexusmods.com/mountandblade2bannerlord/mods/10419
- Parent Epic: #1091
- Connectivity test script: `scripts/test_gabs_connectivity.py`

View File

@@ -0,0 +1,333 @@
#!/usr/bin/env python3
"""Export Timmy session logs as LoRA training data (ChatML JSONL).
Reads session JSONL files written by ``SessionLogger`` and converts them into
conversation pairs suitable for fine-tuning with ``mlx_lm.lora``.
Output format — one JSON object per line::
{"messages": [
{"role": "system", "content": "<Timmy system prompt>"},
{"role": "user", "content": "<user turn>"},
{"role": "assistant", "content": "<timmy response, with tool calls embedded>"}
]}
Tool calls that appear between a user turn and the next assistant message are
embedded in the assistant content using the Hermes 4 ``<tool_call>`` XML format
so the fine-tuned model learns both when to call tools and what JSON to emit.
Usage::
# Export all session logs (default paths)
python scripts/export_trajectories.py
# Custom source / destination
python scripts/export_trajectories.py \\
--logs-dir ~/custom-logs \\
--output ~/timmy-training-data.jsonl \\
--min-turns 2 \\
--verbose
Epic: #1091 Project Bannerlord — AutoLoRA Sovereignty Loop (Step 3 of 7)
Refs: #1103
"""
from __future__ import annotations
import argparse
import json
import logging
import sys
from pathlib import Path
from typing import Any
logger = logging.getLogger(__name__)
# ── Constants ─────────────────────────────────────────────────────────────────
TIMMY_SYSTEM_PROMPT = (
"You are Timmy, Alexander's personal AI agent running on a local Mac. "
"You are concise, direct, and action-oriented. "
"You have access to a broad set of tools — use them proactively. "
"When you need to call a tool, output it in this format:\n"
"<tool_call>\n"
'{"name": "function_name", "arguments": {"param": "value"}}\n'
"</tool_call>\n\n"
"Always provide structured, accurate responses."
)
# ── Entry grouping ─────────────────────────────────────────────────────────────
def _load_entries(logs_dir: Path) -> list[dict[str, Any]]:
    """Load all session log entries, ordered by sorted log file name."""
    entries: list[dict[str, Any]] = []
    log_files = sorted(logs_dir.glob("session_*.jsonl"))
    for log_file in log_files:
        try:
            with open(log_file, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        entries.append(json.loads(line))
                    except json.JSONDecodeError:
                        logger.warning("Skipping malformed line in %s", log_file.name)
        except OSError as exc:
            logger.warning("Cannot read %s: %s", log_file, exc)
    return entries


def _format_tool_call(entry: dict[str, Any]) -> str:
    """Render a tool_call entry as a Hermes 4 <tool_call> XML block."""
    payload = {"name": entry.get("tool", "unknown"), "arguments": entry.get("args", {})}
    return f"<tool_call>\n{json.dumps(payload)}\n</tool_call>"


def _format_tool_result(entry: dict[str, Any]) -> str:
    """Render a tool result observation."""
    # Serialise the whole payload with json.dumps so that tool names containing
    # quotes or backslashes cannot produce invalid JSON.
    payload = {"name": entry.get("tool", "unknown"), "result": entry.get("result", "")}
    return f"<tool_response>\n{json.dumps(payload)}\n</tool_response>"
def _group_into_turns(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group raw session entries into (user_text, assistant_parts) turn pairs.

    Returns a list of dicts with keys:
        ``user`` - user message content
        ``assistant`` - assembled assistant content (responses + tool calls)
    """
    turns: list[dict[str, Any]] = []
    pending_user: str | None = None
    assistant_parts: list[str] = []

    for entry in entries:
        etype = entry.get("type", "")
        role = entry.get("role", "")

        if etype == "message" and role == "user":
            # Flush any open turn
            if pending_user is not None and assistant_parts:
                turns.append(
                    {
                        "user": pending_user,
                        "assistant": "\n".join(assistant_parts).strip(),
                    }
                )
            elif pending_user is not None:
                # User message with no assistant response: discard
                pass
            pending_user = entry.get("content", "").strip()
            assistant_parts = []

        elif etype == "message" and role == "timmy":
            if pending_user is not None:
                content = entry.get("content", "").strip()
                if content:
                    assistant_parts.append(content)

        elif etype == "tool_call":
            if pending_user is not None:
                assistant_parts.append(_format_tool_call(entry))
                # Also append tool result as context so model learns the full loop
                if entry.get("result"):
                    assistant_parts.append(_format_tool_result(entry))

        # decision / error entries are skipped: they are metadata, not conversation

    # Flush final open turn
    if pending_user is not None and assistant_parts:
        turns.append(
            {
                "user": pending_user,
                "assistant": "\n".join(assistant_parts).strip(),
            }
        )
    return turns
# ── Conversion ────────────────────────────────────────────────────────────────
def turns_to_training_examples(
    turns: list[dict[str, Any]],
    system_prompt: str = TIMMY_SYSTEM_PROMPT,
    min_assistant_len: int = 10,
) -> list[dict[str, Any]]:
    """Convert grouped turns into mlx-lm training examples.

    Each example has a ``messages`` list in ChatML order:
    ``[system, user, assistant]``.

    Args:
        turns: Output of ``_group_into_turns``.
        system_prompt: System prompt prepended to every example.
        min_assistant_len: Skip examples where the assistant turn is shorter
            than this many characters (filters out empty/trivial turns).

    Returns:
        List of training example dicts.
    """
    examples: list[dict[str, Any]] = []
    for turn in turns:
        assistant_text = turn.get("assistant", "").strip()
        user_text = turn.get("user", "").strip()
        if not user_text or len(assistant_text) < min_assistant_len:
            continue
        examples.append(
            {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_text},
                    {"role": "assistant", "content": assistant_text},
                ]
            }
        )
    return examples


def export_training_data(
    logs_dir: Path,
    output_path: Path,
    min_turns: int = 1,
    min_assistant_len: int = 10,
    verbose: bool = False,
) -> int:
    """Full export pipeline: load → group → convert → write.

    Args:
        logs_dir: Directory containing ``session_*.jsonl`` files.
        output_path: Destination ``.jsonl`` file for training data.
        min_turns: Accepted for CLI compatibility; currently informational
            and not applied as a filter.
        min_assistant_len: Minimum assistant response length to include.
        verbose: Print progress to stdout.

    Returns:
        Number of training examples written.
    """
    if verbose:
        print(f"Loading session logs from: {logs_dir}")
    entries = _load_entries(logs_dir)
    if verbose:
        print(f" Loaded {len(entries)} raw entries")

    turns = _group_into_turns(entries)
    if verbose:
        print(f" Grouped into {len(turns)} conversation turns")

    examples = turns_to_training_examples(
        turns, min_assistant_len=min_assistant_len
    )
    if verbose:
        print(f" Generated {len(examples)} training examples")

    if not examples:
        print("WARNING: No training examples generated. Check that session logs exist.")
        return 0

    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Explicit UTF-8 keeps the output identical across platforms.
    with open(output_path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    if verbose:
        print(f" Wrote {len(examples)} examples → {output_path}")
    return len(examples)
# ── CLI ───────────────────────────────────────────────────────────────────────
def _default_logs_dir() -> Path:
    """Return default logs directory (repo root / logs)."""
    # Walk up from this script to find repo root (contains pyproject.toml)
    candidate = Path(__file__).resolve().parent
    for _ in range(5):
        candidate = candidate.parent
        if (candidate / "pyproject.toml").exists():
            return candidate / "logs"
    return Path.home() / "logs"


def _default_output_path() -> Path:
    return Path.home() / "timmy-training-data.jsonl"


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(
        description="Export Timmy session logs as LoRA training data (ChatML JSONL)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    parser.add_argument(
        "--logs-dir",
        type=Path,
        default=_default_logs_dir(),
        help="Directory containing session_*.jsonl files (default: <repo>/logs)",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=_default_output_path(),
        help="Output JSONL path (default: ~/timmy-training-data.jsonl)",
    )
    parser.add_argument(
        "--min-turns",
        type=int,
        default=1,
        help="Minimum turns to process (informational, default: 1)",
    )
    parser.add_argument(
        "--min-assistant-len",
        type=int,
        default=10,
        help="Minimum assistant response length in chars (default: 10)",
    )
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        help="Print progress information",
    )
    args = parser.parse_args(argv)

    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.WARNING,
        format="%(levelname)s: %(message)s",
    )

    if not args.logs_dir.exists():
        print(f"ERROR: Logs directory not found: {args.logs_dir}")
        print("Run the Timmy dashboard first to generate session logs.")
        return 1

    count = export_training_data(
        logs_dir=args.logs_dir,
        output_path=args.output,
        min_turns=args.min_turns,
        min_assistant_len=args.min_assistant_len,
        verbose=args.verbose,
    )

    if count > 0:
        print(f"Exported {count} training examples to: {args.output}")
        print()
        print("Next steps:")
        print(" mkdir -p ~/timmy-lora-training")
        print(f" cp {args.output} ~/timmy-lora-training/train.jsonl")
        print(" python scripts/lora_finetune.py --data ~/timmy-lora-training")
    else:
        print("No training examples exported.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())

399
scripts/lora_finetune.py Normal file
View File

@@ -0,0 +1,399 @@
#!/usr/bin/env python3
"""LoRA fine-tuning launcher for Hermes 4 on Timmy trajectory data.
Wraps ``mlx_lm.lora`` with project-specific defaults and pre-flight checks.
Requires Apple Silicon (M-series) and the ``mlx-lm`` package.
Usage::
# Minimal — uses defaults (expects data in ~/timmy-lora-training/)
python scripts/lora_finetune.py
# Custom model path and data
python scripts/lora_finetune.py \\
--model /path/to/hermes4-mlx \\
--data ~/timmy-lora-training \\
--iters 500 \\
--adapter-path ~/timmy-lora-adapter
# Dry run (print command, don't execute)
python scripts/lora_finetune.py --dry-run
# After training, test with the adapter
python scripts/lora_finetune.py --test \\
--prompt "List the open PRs on the Timmy Time Dashboard repo"
# Fuse adapter into base model for Ollama import
python scripts/lora_finetune.py --fuse \\
--save-path ~/timmy-fused-model
Typical workflow::
# 1. Export trajectories
python scripts/export_trajectories.py --verbose
# 2. Prepare training dir
mkdir -p ~/timmy-lora-training
cp ~/timmy-training-data.jsonl ~/timmy-lora-training/train.jsonl
# 3. Fine-tune
python scripts/lora_finetune.py --verbose
# 4. Test
python scripts/lora_finetune.py --test
# 5. Fuse + import to Ollama
python scripts/lora_finetune.py --fuse
ollama create timmy-hermes4 -f Modelfile.timmy-hermes4
Epic: #1091 Project Bannerlord — AutoLoRA Sovereignty Loop (Step 4 of 7)
Refs: #1103
"""
from __future__ import annotations
import argparse
import platform
import shutil
import subprocess
import sys
from pathlib import Path
# ── Defaults ──────────────────────────────────────────────────────────────────
DEFAULT_DATA_DIR = Path.home() / "timmy-lora-training"
DEFAULT_ADAPTER_PATH = Path.home() / "timmy-lora-adapter"
DEFAULT_FUSED_PATH = Path.home() / "timmy-fused-model"
# mlx-lm model path — local HuggingFace checkout of Hermes 4 in MLX format.
# Set MLX_HERMES4_PATH env var or pass --model to override.
DEFAULT_MODEL_PATH_ENV = "MLX_HERMES4_PATH"
# Training hyperparameters (conservative for 36 GB M3 Max)
DEFAULT_BATCH_SIZE = 1
DEFAULT_LORA_LAYERS = 16
DEFAULT_ITERS = 1000
DEFAULT_LEARNING_RATE = 1e-5
# Test prompt used after training
DEFAULT_TEST_PROMPT = (
"List the open PRs on the Timmy Time Dashboard repo and triage them by priority."
)
# ── Pre-flight checks ─────────────────────────────────────────────────────────
def _check_apple_silicon() -> bool:
    """Return True if running on Apple Silicon."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def _check_mlx_lm() -> bool:
    """Return True if mlx-lm is installed and mlx_lm.lora is runnable."""
    return shutil.which("mlx_lm.lora") is not None or _can_import("mlx_lm")


def _can_import(module: str) -> bool:
    try:
        import importlib

        importlib.import_module(module)
        return True
    except ImportError:
        return False


def _resolve_model_path(model_arg: str | None) -> str | None:
    """Resolve model path from arg or environment variable."""
    if model_arg:
        return model_arg
    import os

    env_path = os.environ.get(DEFAULT_MODEL_PATH_ENV)
    if env_path:
        return env_path
    return None


def _preflight(model_path: str | None, data_dir: Path, verbose: bool) -> list[str]:
    """Run pre-flight checks and return a list of warnings (empty = all OK)."""
    warnings: list[str] = []

    if not _check_apple_silicon():
        warnings.append(
            "Not running on Apple Silicon. mlx-lm requires an M-series Mac.\n"
            " Alternative: use Unsloth on Google Colab / RunPod / Modal."
        )

    if not _check_mlx_lm():
        warnings.append(
            "mlx-lm not found. Install with:\n pip install mlx-lm"
        )

    if model_path is None:
        warnings.append(
            f"No model path specified. Set {DEFAULT_MODEL_PATH_ENV} or pass --model.\n"
            " Download Hermes 4 in MLX format from HuggingFace:\n"
            " https://huggingface.co/collections/NousResearch/hermes-4-collection-68a7\n"
            " or convert the Hugging Face checkpoint:\n"
            " mlx_lm.convert --hf-path NousResearch/Hermes-4-14B --mlx-path ~/hermes4-mlx"
        )
    elif not Path(model_path).exists():
        warnings.append(f"Model path does not exist: {model_path}")

    train_file = data_dir / "train.jsonl"
    if not train_file.exists():
        warnings.append(
            f"Training data not found: {train_file}\n"
            " Generate it with:\n"
            " python scripts/export_trajectories.py --verbose\n"
            f" mkdir -p {data_dir}\n"
            f" cp ~/timmy-training-data.jsonl {train_file}"
        )

    if verbose and not warnings:
        print("Pre-flight checks: all OK")
    return warnings
# ── Command builders ──────────────────────────────────────────────────────────
def _build_train_cmd(
    model_path: str,
    data_dir: Path,
    adapter_path: Path,
    batch_size: int,
    lora_layers: int,
    iters: int,
    learning_rate: float,
) -> list[str]:
    return [
        sys.executable, "-m", "mlx_lm.lora",
        "--model", model_path,
        "--train",
        "--data", str(data_dir),
        "--batch-size", str(batch_size),
        "--lora-layers", str(lora_layers),
        "--iters", str(iters),
        "--learning-rate", str(learning_rate),
        "--adapter-path", str(adapter_path),
    ]


def _build_test_cmd(
    model_path: str,
    adapter_path: Path,
    prompt: str,
) -> list[str]:
    return [
        sys.executable, "-m", "mlx_lm.generate",
        "--model", model_path,
        "--adapter-path", str(adapter_path),
        "--prompt", prompt,
        "--max-tokens", "512",
    ]


def _build_fuse_cmd(
    model_path: str,
    adapter_path: Path,
    save_path: Path,
) -> list[str]:
    return [
        sys.executable, "-m", "mlx_lm.fuse",
        "--model", model_path,
        "--adapter-path", str(adapter_path),
        "--save-path", str(save_path),
    ]


# ── Runner ─────────────────────────────────────────────────────────────────────
def _run(cmd: list[str], dry_run: bool, verbose: bool) -> int:
    """Print and optionally execute a command."""
    print("\nCommand:")
    print(" " + " \\\n ".join(cmd))
    if dry_run:
        print("\n(dry-run — not executing)")
        return 0
    print()
    result = subprocess.run(cmd)
    return result.returncode
# ── Main ──────────────────────────────────────────────────────────────────────
def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(
        description="LoRA fine-tuning launcher for Hermes 4 (AutoLoRA Step 4)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    # Mode flags: --test and --fuse are mutually exclusive; neither means "train"
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--test",
        action="store_true",
        help="Run inference test with trained adapter instead of training",
    )
    mode.add_argument(
        "--fuse",
        action="store_true",
        help="Fuse adapter into base model (for Ollama import)",
    )

    # Paths
    parser.add_argument(
        "--model",
        default=None,
        help=f"Path to local MLX model (or set {DEFAULT_MODEL_PATH_ENV} env var)",
    )
    parser.add_argument(
        "--data",
        type=Path,
        default=DEFAULT_DATA_DIR,
        help=f"Training data directory (default: {DEFAULT_DATA_DIR})",
    )
    parser.add_argument(
        "--adapter-path",
        type=Path,
        default=DEFAULT_ADAPTER_PATH,
        help=f"LoRA adapter output path (default: {DEFAULT_ADAPTER_PATH})",
    )
    parser.add_argument(
        "--save-path",
        type=Path,
        default=DEFAULT_FUSED_PATH,
        help=f"Fused model output path (default: {DEFAULT_FUSED_PATH})",
    )

    # Hyperparameters
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        help=f"Training batch size (default: {DEFAULT_BATCH_SIZE}; keep at 1 if memory is tight)",
    )
    parser.add_argument(
        "--lora-layers",
        type=int,
        default=DEFAULT_LORA_LAYERS,
        help=f"Number of LoRA layers (default: {DEFAULT_LORA_LAYERS}; reduce if OOM)",
    )
    parser.add_argument(
        "--iters",
        type=int,
        default=DEFAULT_ITERS,
        help=f"Training iterations (default: {DEFAULT_ITERS})",
    )
    parser.add_argument(
        "--learning-rate",
        type=float,
        default=DEFAULT_LEARNING_RATE,
        help=f"Learning rate (default: {DEFAULT_LEARNING_RATE})",
    )

    # Misc
    parser.add_argument(
        "--prompt",
        default=DEFAULT_TEST_PROMPT,
        help="Prompt for --test mode",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Print command without executing",
    )
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        help="Print extra progress information",
    )
    parser.add_argument(
        "--skip-preflight",
        action="store_true",
        help="Skip pre-flight checks (useful in CI)",
    )

    args = parser.parse_args(argv)
    model_path = _resolve_model_path(args.model)

    # ── Pre-flight ──────────────────────────────────────────────────────────
    if not args.skip_preflight:
        warnings = _preflight(model_path, args.data, args.verbose)
        if warnings:
            for w in warnings:
                print(f"WARNING: {w}\n")
            if not args.dry_run:
                print("Aborting due to pre-flight warnings. Use --dry-run to see commands anyway.")
                return 1

    if model_path is None:
        # Allow dry-run without a model for documentation purposes
        model_path = "<path-to-hermes4-mlx>"

    # ── Mode dispatch ────────────────────────────────────────────────────────
    if args.test:
        print(f"Testing fine-tuned model with adapter: {args.adapter_path}")
        cmd = _build_test_cmd(model_path, args.adapter_path, args.prompt)
        return _run(cmd, args.dry_run, args.verbose)

    if args.fuse:
        print(f"Fusing adapter {args.adapter_path} into base model → {args.save_path}")
        cmd = _build_fuse_cmd(model_path, args.adapter_path, args.save_path)
        rc = _run(cmd, args.dry_run, args.verbose)
        if rc == 0 and not args.dry_run:
            print(
                f"\nFused model saved to: {args.save_path}\n"
                "To import into Ollama:\n"
                " ollama create timmy-hermes4 -f Modelfile.hermes4-14b\n"
                " (edit Modelfile to point FROM to the fused GGUF path)"
            )
        return rc

    # Default: train
    print("Starting LoRA fine-tuning")
    print(f" Model: {model_path}")
    print(f" Data: {args.data}")
    print(f" Adapter path: {args.adapter_path}")
    print(f" Iterations: {args.iters}")
    print(f" Batch size: {args.batch_size}")
    print(f" LoRA layers: {args.lora_layers}")
    print(f" Learning rate:{args.learning_rate}")
    print()
    print("Estimated time: 2-8 hours on M3 Max (depends on dataset size).")
    print("If OOM: reduce --lora-layers to 8 and keep --batch-size at 1.")

    cmd = _build_train_cmd(
        model_path=model_path,
        data_dir=args.data,
        adapter_path=args.adapter_path,
        batch_size=args.batch_size,
        lora_layers=args.lora_layers,
        iters=args.iters,
        learning_rate=args.learning_rate,
    )
    rc = _run(cmd, args.dry_run, args.verbose)
    if rc == 0 and not args.dry_run:
        print(
            f"\nTraining complete! Adapter saved to: {args.adapter_path}\n"
            "Test with:\n"
            " python scripts/lora_finetune.py --test\n"
            "Then fuse + import to Ollama:\n"
            " python scripts/lora_finetune.py --fuse"
        )
    return rc


if __name__ == "__main__":
    sys.exit(main())

View File

@@ -0,0 +1,244 @@
#!/usr/bin/env python3
"""GABS TCP connectivity and JSON-RPC smoke test.
Tests connectivity from Hermes to the Bannerlord.GABS TCP server running on the
Windows VM. Covers:
1. TCP socket connection (port 4825 reachable)
2. JSON-RPC ping round-trip
3. get_game_state call (game must be running)
4. Latency — target < 100 ms on LAN
Usage:
python scripts/test_gabs_connectivity.py --host 10.0.0.50
python scripts/test_gabs_connectivity.py --host 10.0.0.50 --port 4825 --timeout 5
Refs: #1098 (Bannerlord Infra — Windows VM Setup + GABS Mod Installation)
Epic: #1091 (Project Bannerlord)
"""
from __future__ import annotations
import argparse
import json
import socket
import sys
import time
from typing import Any
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 4825
DEFAULT_TIMEOUT = 5 # seconds
LATENCY_TARGET_MS = 100.0
# ── Low-level TCP helpers ─────────────────────────────────────────────────────
def _tcp_connect(host: str, port: int, timeout: float) -> socket.socket:
    """Open a TCP connection and return the socket. Raises on failure."""
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.settimeout(timeout)
    return sock


def _send_recv(sock: socket.socket, payload: dict[str, Any]) -> dict[str, Any]:
    """Send a newline-delimited JSON-RPC request and return the parsed response."""
    raw = json.dumps(payload) + "\n"
    sock.sendall(raw.encode())
    buf = b""
    while b"\n" not in buf:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("Connection closed before response received")
        buf += chunk
    line = buf.split(b"\n", 1)[0]
    return json.loads(line.decode())


def _rpc(sock: socket.socket, method: str, params: dict | None = None, req_id: int = 1) -> dict[str, Any]:
    """Build and send a JSON-RPC 2.0 request, return the response dict."""
    payload: dict[str, Any] = {
        "jsonrpc": "2.0",
        "method": method,
        "id": req_id,
    }
    if params:
        payload["params"] = params
    return _send_recv(sock, payload)
# ── Test cases ────────────────────────────────────────────────────────────────
def test_tcp_connection(host: str, port: int, timeout: float) -> tuple[bool, socket.socket | None]:
    """PASS: TCP connection to host:port succeeds."""
    print(f"\n[1/4] TCP connection → {host}:{port}")
    try:
        t0 = time.monotonic()
        sock = _tcp_connect(host, port, timeout)
        elapsed_ms = (time.monotonic() - t0) * 1000
        print(f" ✓ Connected ({elapsed_ms:.1f} ms)")
        return True, sock
    except OSError as exc:
        print(f" ✗ Connection failed: {exc}")
        print(" Checklist:")
        print(" - Is Bannerlord running with GABS mod enabled?")
        print(f" - Is port {port} open in Windows Firewall?")
        print(f" - Is the VM IP correct? (got: {host})")
        return False, None


def test_ping(sock: socket.socket) -> bool:
    """PASS: JSON-RPC ping returns a 2.0 response."""
    print("\n[2/4] JSON-RPC ping")
    try:
        t0 = time.monotonic()
        resp = _rpc(sock, "ping", req_id=1)
        elapsed_ms = (time.monotonic() - t0) * 1000
        if resp.get("jsonrpc") == "2.0" and "error" not in resp:
            print(f" ✓ Ping OK ({elapsed_ms:.1f} ms): {json.dumps(resp)}")
            return True
        print(f" ✗ Unexpected response ({elapsed_ms:.1f} ms): {json.dumps(resp)}")
        return False
    except Exception as exc:
        print(f" ✗ Ping failed: {exc}")
        return False


def test_game_state(sock: socket.socket) -> bool:
    """PASS: get_game_state returns a result (game must be in a campaign)."""
    print("\n[3/4] get_game_state call")
    try:
        t0 = time.monotonic()
        resp = _rpc(sock, "get_game_state", req_id=2)
        elapsed_ms = (time.monotonic() - t0) * 1000
        if "error" in resp:
            code = resp["error"].get("code", "?")
            msg = resp["error"].get("message", "")
            if code == -32601:
                # Method not found: GABS version may not expose this method
                print(f" ~ Method not available ({elapsed_ms:.1f} ms): {msg}")
                print(" This is acceptable if game is not yet in a campaign.")
                return True
            print(f" ✗ RPC error ({elapsed_ms:.1f} ms) [{code}]: {msg}")
            return False
        result = resp.get("result", {})
        print(f" ✓ Game state received ({elapsed_ms:.1f} ms):")
        if isinstance(result, dict):
            for k, v in result.items():
                print(f" {k}: {v}")
        else:
            # A non-object result is still a valid JSON-RPC reply; print it as-is
            # instead of failing on the missing .items() method.
            print(f" {result}")
        return True
    except Exception as exc:
        print(f" ✗ get_game_state failed: {exc}")
        return False


def test_latency(host: str, port: int, timeout: float, iterations: int = 5) -> bool:
    """PASS: Average round-trip latency is under LATENCY_TARGET_MS."""
    print(f"\n[4/4] Latency test ({iterations} pings, target < {LATENCY_TARGET_MS:.0f} ms)")
    try:
        times: list[float] = []
        for i in range(iterations):
            sock = _tcp_connect(host, port, timeout)
            try:
                t0 = time.monotonic()
                _rpc(sock, "ping", req_id=i + 10)
                times.append((time.monotonic() - t0) * 1000)
            finally:
                sock.close()
        avg_ms = sum(times) / len(times)
        min_ms = min(times)
        max_ms = max(times)
        print(f" avg={avg_ms:.1f} ms min={min_ms:.1f} ms max={max_ms:.1f} ms")
        if avg_ms <= LATENCY_TARGET_MS:
            print(f" ✓ Latency within target ({avg_ms:.1f} ms ≤ {LATENCY_TARGET_MS:.0f} ms)")
            return True
        print(
            f" ✗ Latency too high ({avg_ms:.1f} ms > {LATENCY_TARGET_MS:.0f} ms)\n"
            " Check network path between Hermes and the VM."
        )
        return False
    except Exception as exc:
        print(f" ✗ Latency test failed: {exc}")
        return False
# ── Main ──────────────────────────────────────────────────────────────────────
def main() -> int:
parser = argparse.ArgumentParser(description="GABS TCP connectivity smoke test")
parser.add_argument(
"--host",
default=DEFAULT_HOST,
help=f"Bannerlord VM IP or hostname (default: {DEFAULT_HOST})",
)
parser.add_argument(
"--port",
type=int,
default=DEFAULT_PORT,
help=f"GABS TCP port (default: {DEFAULT_PORT})",
)
parser.add_argument(
"--timeout",
type=float,
default=DEFAULT_TIMEOUT,
help=f"Socket timeout in seconds (default: {DEFAULT_TIMEOUT})",
)
args = parser.parse_args()
print("=" * 60)
print(f"GABS Connectivity Test Suite")
print(f"Target: {args.host}:{args.port}")
print(f"Timeout: {args.timeout}s")
print("=" * 60)
results: dict[str, bool] = {}
# Test 1: TCP connection (gate — skip remaining if unreachable)
ok, sock = test_tcp_connection(args.host, args.port, args.timeout)
results["tcp_connection"] = ok
if not ok:
_print_summary(results)
return 1
# Tests 23 reuse the same socket
try:
results["ping"] = test_ping(sock)
results["game_state"] = test_game_state(sock)
finally:
sock.close()
# Test 4: latency uses fresh connections
results["latency"] = test_latency(args.host, args.port, args.timeout)
return _print_summary(results)
def _print_summary(results: dict[str, bool]) -> int:
passed = sum(results.values())
total = len(results)
print("\n" + "=" * 60)
print(f"Results: {passed}/{total} passed")
print("=" * 60)
for name, ok in results.items():
icon = "" if ok else ""
print(f" {icon} {name}")
if passed == total:
print("\n✓ GABS connectivity verified. Timmy can reach the game.")
print(" Next step: run benchmark level 0 (JSON compliance check).")
elif not results.get("tcp_connection"):
print("\n✗ TCP connection failed. VM/firewall setup incomplete.")
print(" See docs/research/bannerlord-vm-setup.md for checklist.")
else:
print("\n~ Partial pass — review failures above.")
return 0 if passed == total else 1
if __name__ == "__main__":
sys.exit(main())
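
The latency check above reports average, minimum, and maximum round-trip times against a target. As a rough illustration of that measurement pattern, here is a minimal sketch that times only the TCP connect step over fresh connections. The helper names `measure_connect_latency_ms` and `summarize_latency` are hypothetical and not part of the script in this diff, which may time a full JSON-RPC round trip instead.

```python
import socket
import time


def measure_connect_latency_ms(
    host: str, port: int, samples: int = 5, timeout: float = 2.0
) -> list[float]:
    """Time `samples` fresh TCP connects and return each duration in ms.

    Hypothetical helper: it measures connection setup only, not a
    request/response exchange with the GABS server.
    """
    times: list[float] = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        times.append((time.perf_counter() - start) * 1000.0)
    return times


def summarize_latency(times: list[float]) -> dict[str, float]:
    """Reduce raw samples to the three figures the smoke test prints."""
    return {
        "avg_ms": sum(times) / len(times),
        "min_ms": min(times),
        "max_ms": max(times),
    }
```

Using `time.perf_counter()` rather than `time.time()` gives a monotonic clock, so the samples are not affected by wall-clock adjustments on the host running the test.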


@@ -196,7 +196,7 @@ async def get_evening_ritual_form(request: Request, db: Session = Depends(get_db
if not journal_entry:
raise HTTPException(status_code=404, detail="No journal entry for today")
return templates.TemplateResponse(
"calm/evening_ritual_form.html", {"request": request, "journal_entry": journal_entry}
request, "calm/evening_ritual_form.html", {"journal_entry": journal_entry}
)
@@ -257,8 +257,9 @@ async def create_new_task(
# After creating a new task, we might need to re-evaluate NOW/NEXT/LATER, but for simplicity
# and given the spec, new tasks go to LATER. Promotion happens on completion/deferral.
return templates.TemplateResponse(
request,
"calm/partials/later_count.html",
{"request": request, "later_tasks_count": len(get_later_tasks(db))},
{"later_tasks_count": len(get_later_tasks(db))},
)
@@ -287,9 +288,9 @@ async def start_task(
promote_tasks(db)
return templates.TemplateResponse(
request,
"calm/partials/now_next_later.html",
{
"request": request,
"now_task": get_now_task(db),
"next_task": get_next_task(db),
"later_tasks_count": len(get_later_tasks(db)),
@@ -316,9 +317,9 @@ async def complete_task(
promote_tasks(db)
return templates.TemplateResponse(
request,
"calm/partials/now_next_later.html",
{
"request": request,
"now_task": get_now_task(db),
"next_task": get_next_task(db),
"later_tasks_count": len(get_later_tasks(db)),
@@ -345,9 +346,9 @@ async def defer_task(
promote_tasks(db)
return templates.TemplateResponse(
request,
"calm/partials/now_next_later.html",
{
"request": request,
"now_task": get_now_task(db),
"next_task": get_next_task(db),
"later_tasks_count": len(get_later_tasks(db)),
@@ -360,8 +361,7 @@ async def get_later_tasks_list(request: Request, db: Session = Depends(get_db)):
"""Render the expandable list of LATER tasks."""
later_tasks = get_later_tasks(db)
return templates.TemplateResponse(
"calm/partials/later_tasks_list.html",
{"request": request, "later_tasks": later_tasks},
request, "calm/partials/later_tasks_list.html", {"later_tasks": later_tasks}
)
@@ -404,9 +404,9 @@ async def reorder_tasks(
# Re-render the relevant parts of the UI
return templates.TemplateResponse(
request,
"calm/partials/now_next_later.html",
{
"request": request,
"now_task": get_now_task(db),
"next_task": get_next_task(db),
"later_tasks_count": len(get_later_tasks(db)),


@@ -40,9 +40,9 @@ async def tools_page(request: Request):
total_calls = 0
return templates.TemplateResponse(
request,
"tools.html",
{
"request": request,
"available_tools": available_tools,
"agent_tools": agent_tools,
"total_calls": total_calls,


@@ -16,6 +16,8 @@ from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from src.config import settings
logger = logging.getLogger(__name__)
@@ -102,7 +104,7 @@ class EventBus:
self._persistence_db_path.parent.mkdir(parents=True, exist_ok=True)
with closing(sqlite3.connect(str(self._persistence_db_path))) as conn:
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute(f"PRAGMA busy_timeout={settings.db_busy_timeout_ms}")
conn.executescript(_EVENTS_SCHEMA)
conn.commit()
@@ -114,7 +116,7 @@ class EventBus:
return
with closing(sqlite3.connect(str(self._persistence_db_path))) as conn:
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA busy_timeout=5000")
conn.execute(f"PRAGMA busy_timeout={settings.db_busy_timeout_ms}")
yield conn
def _persist_event(self, event: Event) -> None:
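
The hunks above replace the hard-coded `busy_timeout=5000` with a value read from `settings.db_busy_timeout_ms`. A minimal, self-contained sketch of that pattern follows. The `Settings` dataclass and `open_connection` helper are stand-ins invented for illustration; the real `src.config.settings` object is not shown in this diff and may look different.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the project's settings object."""

    db_busy_timeout_ms: int = 5000


def open_connection(path: str, settings: Settings) -> sqlite3.Connection:
    """Open a SQLite connection and apply the configured busy timeout.

    The value is interpolated into the PRAGMA text, so it is coerced to
    int first to guarantee that only a number reaches the statement.
    """
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(f"PRAGMA busy_timeout={int(settings.db_busy_timeout_ms)}")
    return conn
```

Reading the pragma back with `conn.execute("PRAGMA busy_timeout").fetchone()[0]` returns the value in effect for that connection, which is the same check the connection pool test in this change set performs.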


@@ -18,6 +18,8 @@ from datetime import UTC, datetime
from enum import StrEnum
from pathlib import Path
from src.config import settings
logger = logging.getLogger(__name__)
DB_PATH = Path("data/swarm.db")
@@ -68,7 +70,7 @@ def _get_conn() -> Generator[sqlite3.Connection, None, None]:
with closing(sqlite3.connect(str(DB_PATH))) as conn:
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute(f"PRAGMA busy_timeout={settings.db_busy_timeout_ms}")
conn.execute("""
CREATE TABLE IF NOT EXISTS custom_models (
name TEXT PRIMARY KEY,


@@ -22,6 +22,8 @@ from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from src.config import settings
logger = logging.getLogger(__name__)
DB_PATH = Path("data/spark.db")
@@ -47,7 +49,7 @@ def _get_conn() -> Generator[sqlite3.Connection, None, None]:
with closing(sqlite3.connect(str(DB_PATH))) as conn:
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute(f"PRAGMA busy_timeout={settings.db_busy_timeout_ms}")
conn.execute("""
CREATE TABLE IF NOT EXISTS spark_predictions (
id TEXT PRIMARY KEY,


@@ -19,6 +19,8 @@ from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from src.config import settings
logger = logging.getLogger(__name__)
DB_PATH = Path("data/spark.db")
@@ -63,7 +65,7 @@ def _get_conn() -> Generator[sqlite3.Connection, None, None]:
with closing(sqlite3.connect(str(DB_PATH))) as conn:
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute(f"PRAGMA busy_timeout={settings.db_busy_timeout_ms}")
conn.execute("""
CREATE TABLE IF NOT EXISTS spark_events (
id TEXT PRIMARY KEY,


@@ -13,8 +13,8 @@ from dataclasses import dataclass
import httpx
from config import settings
from timmy.research_tools import get_llm_client, google_web_search
from timmy.research_triage import triage_research_report
from timmy.research_tools import google_web_search, get_llm_client
logger = logging.getLogger(__name__)


@@ -6,7 +6,6 @@ import logging
import os
from typing import Any
from config import settings
from serpapi import GoogleSearch
logger = logging.getLogger(__name__)


@@ -6,8 +6,8 @@ import time
from pathlib import Path
import pytest
from infrastructure.db_pool import ConnectionPool
from src.config import settings
from src.infrastructure.db_pool import ConnectionPool
class TestConnectionPoolInit:
@@ -330,9 +330,9 @@ class TestPragmaApplication:
"""busy_timeout pragma set on a pooled connection persists."""
pool = ConnectionPool(tmp_path / "test.db")
conn = pool.get_connection()
conn.execute("PRAGMA busy_timeout=5000")
conn.execute(f"PRAGMA busy_timeout={settings.db_busy_timeout_ms}")
timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
assert timeout == 5000
assert timeout == settings.db_busy_timeout_ms
pool.close_connection()
def test_pragmas_apply_per_connection(self, tmp_path):


@@ -0,0 +1,285 @@
"""Unit tests for scripts/export_trajectories.py.
Tests trajectory conversion logic — no I/O, no Ollama, no mlx.
"""
from __future__ import annotations
import json
from pathlib import Path
import pytest
import scripts.export_trajectories as et
# ── Fixtures ──────────────────────────────────────────────────────────────────
@pytest.fixture()
def simple_session(tmp_path: Path) -> Path:
"""Write a minimal session JSONL file and return the logs dir."""
logs_dir = tmp_path / "logs"
logs_dir.mkdir()
entries = [
{"type": "message", "role": "user", "content": "What time is it?", "timestamp": "2026-03-01T10:00:00"},
{"type": "message", "role": "timmy", "content": "It is 10:00 AM.", "timestamp": "2026-03-01T10:00:01"},
{"type": "message", "role": "user", "content": "Thanks!", "timestamp": "2026-03-01T10:00:05"},
{"type": "message", "role": "timmy", "content": "You're welcome!", "timestamp": "2026-03-01T10:00:06"},
]
session_file = logs_dir / "session_2026-03-01.jsonl"
session_file.write_text("\n".join(json.dumps(e) for e in entries) + "\n")
return logs_dir
@pytest.fixture()
def tool_call_session(tmp_path: Path) -> Path:
"""Write a session JSONL with tool calls."""
logs_dir = tmp_path / "logs"
logs_dir.mkdir()
entries = [
{"type": "message", "role": "user", "content": "Read CLAUDE.md", "timestamp": "2026-03-01T10:00:00"},
{
"type": "tool_call",
"tool": "read_file",
"args": {"path": "CLAUDE.md"},
"result": "# CLAUDE.md content here",
"timestamp": "2026-03-01T10:00:01",
},
{"type": "message", "role": "timmy", "content": "Here is the content.", "timestamp": "2026-03-01T10:00:02"},
]
session_file = logs_dir / "session_2026-03-01.jsonl"
session_file.write_text("\n".join(json.dumps(e) for e in entries) + "\n")
return logs_dir
# ── _load_entries ─────────────────────────────────────────────────────────────
@pytest.mark.unit
def test_load_entries_returns_all(simple_session: Path) -> None:
entries = et._load_entries(simple_session)
assert len(entries) == 4
@pytest.mark.unit
def test_load_entries_skips_malformed(tmp_path: Path) -> None:
logs_dir = tmp_path / "logs"
logs_dir.mkdir()
session = logs_dir / "session_2026-03-01.jsonl"
session.write_text(
'{"type": "message", "role": "user", "content": "hi"}\n'
"NOT_JSON\n"
'{"type": "message", "role": "timmy", "content": "hello"}\n'
)
entries = et._load_entries(logs_dir)
assert len(entries) == 2 # malformed line skipped
@pytest.mark.unit
def test_load_entries_empty_dir(tmp_path: Path) -> None:
logs_dir = tmp_path / "logs"
logs_dir.mkdir()
entries = et._load_entries(logs_dir)
assert entries == []
@pytest.mark.unit
def test_load_entries_multiple_files(tmp_path: Path) -> None:
logs_dir = tmp_path / "logs"
logs_dir.mkdir()
for day in ("2026-03-01", "2026-03-02"):
entry = {"type": "message", "role": "user", "content": f"day {day}"}
(logs_dir / f"session_{day}.jsonl").write_text(json.dumps(entry) + "\n")
entries = et._load_entries(logs_dir)
assert len(entries) == 2
# ── _format_tool_call ─────────────────────────────────────────────────────────
@pytest.mark.unit
def test_format_tool_call_structure() -> None:
entry = {
"type": "tool_call",
"tool": "read_file",
"args": {"path": "/tmp/foo.txt"},
"result": "file contents",
}
result = et._format_tool_call(entry)
assert result.startswith("<tool_call>")
assert result.endswith("</tool_call>")
payload = json.loads(result.split("\n")[1])
assert payload["name"] == "read_file"
assert payload["arguments"]["path"] == "/tmp/foo.txt"
@pytest.mark.unit
def test_format_tool_call_missing_tool() -> None:
entry = {"type": "tool_call", "args": {}}
result = et._format_tool_call(entry)
assert "unknown" in result
# ── _group_into_turns ─────────────────────────────────────────────────────────
@pytest.mark.unit
def test_group_basic_conversation() -> None:
entries = [
{"type": "message", "role": "user", "content": "hello"},
{"type": "message", "role": "timmy", "content": "hi there"},
{"type": "message", "role": "user", "content": "bye"},
{"type": "message", "role": "timmy", "content": "goodbye"},
]
turns = et._group_into_turns(entries)
assert len(turns) == 2
assert turns[0]["user"] == "hello"
assert turns[0]["assistant"] == "hi there"
assert turns[1]["user"] == "bye"
assert turns[1]["assistant"] == "goodbye"
@pytest.mark.unit
def test_group_with_tool_call() -> None:
entries = [
{"type": "message", "role": "user", "content": "check the file"},
{"type": "tool_call", "tool": "read_file", "args": {"path": "x"}, "result": "content"},
{"type": "message", "role": "timmy", "content": "Done."},
]
turns = et._group_into_turns(entries)
assert len(turns) == 1
assert "<tool_call>" in turns[0]["assistant"]
assert "Done." in turns[0]["assistant"]
@pytest.mark.unit
def test_group_skips_user_without_response() -> None:
"""User message with no timmy response should not create a turn."""
entries = [
{"type": "message", "role": "user", "content": "hello"},
# No timmy response
{"type": "message", "role": "user", "content": "are you there?"},
{"type": "message", "role": "timmy", "content": "Yes!"},
]
turns = et._group_into_turns(entries)
assert len(turns) == 1
assert turns[0]["user"] == "are you there?"
@pytest.mark.unit
def test_group_ignores_errors_and_decisions() -> None:
entries = [
{"type": "message", "role": "user", "content": "hello"},
{"type": "error", "error": "something failed"},
{"type": "decision", "decision": "retry"},
{"type": "message", "role": "timmy", "content": "Got it."},
]
turns = et._group_into_turns(entries)
assert len(turns) == 1
assert "error" not in turns[0]["assistant"]
assert "retry" not in turns[0]["assistant"]
@pytest.mark.unit
def test_group_empty_entries() -> None:
assert et._group_into_turns([]) == []
# ── turns_to_training_examples ────────────────────────────────────────────────
@pytest.mark.unit
def test_training_examples_structure() -> None:
turns = [{"user": "hello", "assistant": "hi there, how can I help?"}]
examples = et.turns_to_training_examples(turns)
assert len(examples) == 1
msgs = examples[0]["messages"]
assert msgs[0]["role"] == "system"
assert msgs[1]["role"] == "user"
assert msgs[1]["content"] == "hello"
assert msgs[2]["role"] == "assistant"
assert msgs[2]["content"] == "hi there, how can I help?"
@pytest.mark.unit
def test_training_examples_filters_short_responses() -> None:
turns = [
{"user": "hello", "assistant": "ok"}, # too short
{"user": "hello", "assistant": "This is a longer response that passes."},
]
examples = et.turns_to_training_examples(turns, min_assistant_len=10)
assert len(examples) == 1
assert examples[0]["messages"][2]["content"] == "This is a longer response that passes."
@pytest.mark.unit
def test_training_examples_filters_empty_user() -> None:
turns = [{"user": "", "assistant": "some response here"}]
examples = et.turns_to_training_examples(turns)
assert len(examples) == 0
@pytest.mark.unit
def test_training_examples_uses_custom_system_prompt() -> None:
turns = [{"user": "hi", "assistant": "hello there!"}]
examples = et.turns_to_training_examples(turns, system_prompt="Custom prompt.")
assert examples[0]["messages"][0]["content"] == "Custom prompt."
# ── export_training_data (integration-style, uses tmp_path) ──────────────────
@pytest.mark.unit
def test_export_training_data_writes_jsonl(simple_session: Path, tmp_path: Path) -> None:
output = tmp_path / "train.jsonl"
count = et.export_training_data(logs_dir=simple_session, output_path=output)
assert count == 2
assert output.exists()
lines = [
json.loads(line) for line in output.read_text().splitlines() if line.strip()
]
assert len(lines) == 2
for line in lines:
assert "messages" in line
roles = [m["role"] for m in line["messages"]]
assert roles == ["system", "user", "assistant"]
@pytest.mark.unit
def test_export_training_data_with_tool_calls(tool_call_session: Path, tmp_path: Path) -> None:
output = tmp_path / "train.jsonl"
count = et.export_training_data(logs_dir=tool_call_session, output_path=output)
assert count == 1
line = json.loads(output.read_text().strip())
assistant_content = line["messages"][2]["content"]
assert "<tool_call>" in assistant_content
assert "read_file" in assistant_content
@pytest.mark.unit
def test_export_training_data_returns_zero_for_empty_logs(tmp_path: Path) -> None:
logs_dir = tmp_path / "logs"
logs_dir.mkdir()
output = tmp_path / "train.jsonl"
count = et.export_training_data(logs_dir=logs_dir, output_path=output)
assert count == 0
assert not output.exists()
# ── CLI ───────────────────────────────────────────────────────────────────────
@pytest.mark.unit
def test_cli_missing_logs_dir(tmp_path: Path) -> None:
rc = et.main(["--logs-dir", str(tmp_path / "nonexistent"), "--output", str(tmp_path / "out.jsonl")])
assert rc == 1
@pytest.mark.unit
def test_cli_exports_and_returns_zero(simple_session: Path, tmp_path: Path) -> None:
output = tmp_path / "out.jsonl"
rc = et.main([
"--logs-dir", str(simple_session),
"--output", str(output),
])
assert rc == 0
assert output.exists()
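
The tests above pin down the expected behaviour of turn grouping: a user message opens a turn, tool calls and `timmy` messages accumulate into the assistant side, error and decision entries are ignored, and a user message that never receives a response is dropped. The sketch below is one way to satisfy those expectations. It is not the implementation in `scripts/export_trajectories.py`, which this diff does not show; the function names and the newline join are assumptions.

```python
from __future__ import annotations

import json
from typing import Any


def format_tool_call(entry: dict[str, Any]) -> str:
    """Render a tool_call entry as a tagged block with a JSON payload line."""
    payload = {
        "name": entry.get("tool", "unknown"),
        "arguments": entry.get("args", {}),
    }
    return "<tool_call>\n" + json.dumps(payload) + "\n</tool_call>"


def group_into_turns(entries: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Pair each user message with the assistant output that follows it."""
    turns: list[dict[str, str]] = []
    user_text: str | None = None
    assistant_parts: list[str] = []

    for entry in entries:
        kind = entry.get("type")
        role = entry.get("role")
        if kind == "message" and role == "user":
            # A new user message closes the previous turn, if it got a reply.
            if user_text is not None and assistant_parts:
                turns.append(
                    {"user": user_text, "assistant": "\n".join(assistant_parts)}
                )
            user_text = entry.get("content", "")
            assistant_parts = []
        elif user_text is None:
            continue  # nothing to attach assistant output to yet
        elif kind == "message" and role == "timmy":
            assistant_parts.append(entry.get("content", ""))
        elif kind == "tool_call":
            assistant_parts.append(format_tool_call(entry))
        # Any other entry type (error, decision, ...) is ignored.

    if user_text is not None and assistant_parts:
        turns.append({"user": user_text, "assistant": "\n".join(assistant_parts)})
    return turns
```

Flushing a turn only when the next user message arrives (or the input ends) lets several tool calls and assistant messages collapse into a single assistant string without any lookahead.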


@@ -0,0 +1,546 @@
"""Unit tests for the AutoLoRA continuous improvement loop.
Covers trajectory extraction, quality filtering, dataset management,
and the retrain orchestrator.
Refs: #1105
"""
from __future__ import annotations
import json
from datetime import UTC, datetime, timedelta
from pathlib import Path
from timmy_automations.retrain.quality_filter import QualityFilter, TrajectoryQuality
from timmy_automations.retrain.retrain import RetrainOrchestrator
from timmy_automations.retrain.training_dataset import TrainingDataset
from timmy_automations.retrain.training_log import CycleMetrics, TrainingLog
from timmy_automations.retrain.trajectory_exporter import Trajectory, TrajectoryExporter
# ── Fixtures ─────────────────────────────────────────────────────────────────
def _ts(offset_minutes: int = 0) -> str:
"""Return an ISO timestamp offset from now."""
return (datetime.now(tz=UTC) + timedelta(minutes=offset_minutes)).isoformat()
def _make_session_log(entries: list[dict], date_str: str, tmp_path: Path) -> Path:
"""Write session JSONL entries to a temp log file."""
log_dir = tmp_path / "logs"
log_dir.mkdir(parents=True, exist_ok=True)
log_file = log_dir / f"session_{date_str}.jsonl"
with open(log_file, "w") as f:
for entry in entries:
f.write(json.dumps(entry) + "\n")
return log_file
def _user_msg(content: str, offset: int = 0) -> dict:
return {"type": "message", "role": "user", "content": content, "timestamp": _ts(offset)}
def _timmy_msg(content: str, confidence: float | None = None, offset: int = 0) -> dict:
entry = {"type": "message", "role": "timmy", "content": content, "timestamp": _ts(offset)}
if confidence is not None:
entry["confidence"] = confidence
return entry
def _tool_call(tool: str = "bash", result: str = "ok", offset: int = 0) -> dict:
return {
"type": "tool_call",
"tool": tool,
"args": {},
"result": result,
"timestamp": _ts(offset),
}
def _error_entry(msg: str = "Something failed", offset: int = 0) -> dict:
return {"type": "error", "error": msg, "timestamp": _ts(offset)}
def _decision_entry(decision: str = "Use approach A", offset: int = 0) -> dict:
return {"type": "decision", "decision": decision, "timestamp": _ts(offset)}
# ── Trajectory dataclass tests ────────────────────────────────────────────────
class TestTrajectory:
def test_message_count(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg("hi"), _timmy_msg("hello")],
)
assert t.message_count == 2
def test_tool_call_count(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
tool_calls=[_tool_call(), _tool_call()],
)
assert t.tool_call_count == 2
def test_has_successful_tool_call_when_no_errors(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
tool_calls=[_tool_call()],
errors=[],
)
assert t.has_successful_tool_call is True
def test_has_successful_tool_call_false_when_errors(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
tool_calls=[_tool_call()],
errors=[_error_entry()],
)
assert t.has_successful_tool_call is False
def test_is_multi_step(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg("do it"), _timmy_msg("done")],
tool_calls=[_tool_call()],
)
assert t.is_multi_step is True
def test_is_not_multi_step_single_message(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_timmy_msg("hello")],
tool_calls=[],
)
assert t.is_multi_step is False
def test_to_chat_format_ordering(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg("question", offset=0), _timmy_msg("answer", offset=2)],
tool_calls=[_tool_call(offset=1)],
)
chat = t.to_chat_format()
roles = [m["role"] for m in chat]
assert "user" in roles
assert "assistant" in roles
def test_to_chat_format_empty_content_skipped(self):
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg(""), _timmy_msg("response")],
)
chat = t.to_chat_format()
# Empty user message should be skipped
assert all(m["content"] for m in chat)
# ── TrajectoryExporter tests ──────────────────────────────────────────────────
class TestTrajectoryExporter:
def test_export_empty_logs_dir(self, tmp_path):
(tmp_path / "logs").mkdir()
exporter = TrajectoryExporter(logs_dir=tmp_path / "logs", repo_root=tmp_path)
result = exporter.export_week(weeks_ago=0)
assert result == []
def test_export_reads_session_files(self, tmp_path):
# Write a session file for this week
today = datetime.now(tz=UTC)
date_str = today.strftime("%Y-%m-%d")
entries = [
_user_msg("tell me about Python"),
_timmy_msg("Python is great"),
]
_make_session_log(entries, date_str, tmp_path)
exporter = TrajectoryExporter(logs_dir=tmp_path / "logs", repo_root=tmp_path)
result = exporter.export_week(weeks_ago=0)
assert len(result) >= 1
def test_export_skips_old_sessions(self, tmp_path):
# Write a session file for 3 weeks ago
three_weeks_ago = datetime.now(tz=UTC) - timedelta(weeks=3)
date_str = three_weeks_ago.strftime("%Y-%m-%d")
entries = [_user_msg("old message"), _timmy_msg("old response")]
_make_session_log(entries, date_str, tmp_path)
exporter = TrajectoryExporter(logs_dir=tmp_path / "logs", repo_root=tmp_path)
# Request current week — should not include 3-week-old data
result = exporter.export_week(weeks_ago=0)
assert result == []
def test_export_segments_by_gap(self, tmp_path):
today = datetime.now(tz=UTC)
date_str = today.strftime("%Y-%m-%d")
# Two conversations separated by a 12-minute gap
t1 = (today - timedelta(minutes=15)).isoformat()
t2 = (today - timedelta(minutes=14)).isoformat()
t3 = (today - timedelta(minutes=2)).isoformat()
t4 = (today - timedelta(minutes=1)).isoformat()
entries = [
{"type": "message", "role": "user", "content": "first q", "timestamp": t1},
{"type": "message", "role": "timmy", "content": "first a", "timestamp": t2},
{"type": "message", "role": "user", "content": "second q", "timestamp": t3},
{"type": "message", "role": "timmy", "content": "second a", "timestamp": t4},
]
_make_session_log(entries, date_str, tmp_path)
exporter = TrajectoryExporter(logs_dir=tmp_path / "logs", repo_root=tmp_path)
result = exporter.export_week(weeks_ago=0)
# Should have at least 1 trajectory (may be 1 or 2 depending on segmentation)
assert len(result) >= 1
def test_handles_malformed_log_file(self, tmp_path):
log_dir = tmp_path / "logs"
log_dir.mkdir()
today = datetime.now(tz=UTC).strftime("%Y-%m-%d")
(log_dir / f"session_{today}.jsonl").write_text("not json\n{}\n")
exporter = TrajectoryExporter(logs_dir=log_dir, repo_root=tmp_path)
# Should not raise, just return empty or partial results
result = exporter.export_week(weeks_ago=0)
assert isinstance(result, list)
# ── QualityFilter tests ───────────────────────────────────────────────────────
class TestQualityFilter:
def _make_high_quality(self) -> Trajectory:
return Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg("do task"), _timmy_msg("done", confidence=0.9)],
tool_calls=[_tool_call(), _tool_call()],
errors=[],
decisions=[_decision_entry()],
)
def _make_medium_quality(self) -> Trajectory:
return Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg("hello"), _timmy_msg("hi")],
tool_calls=[],
errors=[],
)
def _make_low_quality(self) -> Trajectory:
return Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_timmy_msg("oops")], # No user message
errors=[_error_entry()],
)
def test_high_quality_classification(self):
qf = QualityFilter()
result = qf.assess(self._make_high_quality())
assert result.quality == TrajectoryQuality.HIGH
assert result.score >= 4.0
assert result.is_trainable
def test_medium_quality_classification(self):
qf = QualityFilter()
result = qf.assess(self._make_medium_quality())
assert result.quality == TrajectoryQuality.MEDIUM
assert result.is_trainable
def test_low_quality_no_user_message(self):
qf = QualityFilter()
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_timmy_msg("random")],
)
result = qf.assess(t)
assert result.quality == TrajectoryQuality.LOW
assert not result.is_trainable
def test_error_penalizes_score(self):
qf = QualityFilter()
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg("go"), _timmy_msg("fail")],
tool_calls=[_tool_call()],
errors=[_error_entry(), _error_entry()],
)
result = qf.assess(t)
assert result.score < qf.assess(self._make_high_quality()).score
def test_low_confidence_penalizes_score(self):
qf = QualityFilter()
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(),
ended_at=_ts(),
messages=[_user_msg("q"), _timmy_msg("a", confidence=0.2)],
)
result = qf.assess(t)
assert result.score < 1.0
def test_filter_returns_stats(self):
qf = QualityFilter()
trajectories = [
self._make_high_quality(),
self._make_medium_quality(),
self._make_low_quality(),
]
trainable, stats = qf.filter(trajectories)
assert stats["total"] == 3
assert stats["accepted"] == len(trainable)
assert stats["high"] + stats["medium"] + stats["low"] == 3
def test_filter_empty_list(self):
qf = QualityFilter()
trainable, stats = qf.filter([])
assert trainable == []
assert stats["total"] == 0
assert stats["accepted"] == 0
# ── TrainingDataset tests ─────────────────────────────────────────────────────
class TestTrainingDataset:
def _make_result(self, quality=TrajectoryQuality.HIGH, score=5.0) -> object:
from timmy_automations.retrain.quality_filter import QualityResult
t = Trajectory(
session_date="2026-03-17",
started_at=_ts(-5),
ended_at=_ts(),
messages=[_user_msg("do it"), _timmy_msg("done")],
tool_calls=[_tool_call()],
)
return QualityResult(trajectory=t, quality=quality, score=score, reasons=[])
def test_count_empty_dataset(self, tmp_path):
ds = TrainingDataset(
dataset_path=".loop/retrain/training_data.jsonl",
repo_root=tmp_path,
)
assert ds.count() == 0
def test_append_adds_examples(self, tmp_path):
ds = TrainingDataset(repo_root=tmp_path)
result = ds.append([self._make_result()], "2026-W12")
assert result.new_examples == 1
assert result.total_examples == 1
assert ds.count() == 1
def test_append_idempotent(self, tmp_path):
ds = TrainingDataset(repo_root=tmp_path)
r = self._make_result()
ds.append([r], "2026-W12")
result2 = ds.append([r], "2026-W12")
# Same trajectory shouldn't be added twice
assert result2.new_examples == 0
assert ds.count() == 1
def test_append_different_weeks(self, tmp_path):
ds = TrainingDataset(repo_root=tmp_path)
r1 = self._make_result()
ds.append([r1], "2026-W11")
ds.append([r1], "2026-W12")
# Different week tags = different records
assert ds.count() == 2
def test_dataset_file_is_valid_jsonl(self, tmp_path):
ds = TrainingDataset(repo_root=tmp_path)
ds.append([self._make_result()], "2026-W12")
with open(ds.dataset_path) as f:
lines = [line.strip() for line in f if line.strip()]
assert len(lines) == 1
record = json.loads(lines[0])
assert "messages" in record
assert "week" in record
assert "quality" in record
def test_index_updated_after_append(self, tmp_path):
ds = TrainingDataset(repo_root=tmp_path)
ds.append([self._make_result()], "2026-W12")
index_path = tmp_path / ".loop" / "retrain" / "dataset_index.json"
assert index_path.exists()
index = json.loads(index_path.read_text())
assert index["total_examples"] == 1
assert "2026-W12" in index["weeks"]
# ── TrainingLog tests ─────────────────────────────────────────────────────────
class TestTrainingLog:
def _make_metrics(self, iteration: int = 1) -> CycleMetrics:
return CycleMetrics(
iteration=iteration,
week="2026-W12",
ran_at=datetime.now(tz=UTC).isoformat(),
trajectories_total=10,
trajectories_high=5,
trajectories_medium=3,
trajectories_low=2,
trajectories_accepted=8,
examples_added=5,
dataset_total=5,
train_status="completed",
train_loss=1.2345,
train_duration_seconds=120.5,
adapter_path=".loop/retrain/adapters/iter_0001/adapters.npz",
model_name="hermes4-14b-ft-0001",
notes="First fine-tune cycle complete",
)
def test_next_iteration_starts_at_1(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
assert log.next_iteration() == 1
def test_next_iteration_increments(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
log.record(self._make_metrics(iteration=1))
assert log.next_iteration() == 2
def test_record_creates_log_file(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
log.record(self._make_metrics())
assert log.log_path.exists()
def test_load_all_returns_records(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
log.record(self._make_metrics(iteration=1))
log.record(self._make_metrics(iteration=2))
entries = log.load_all()
assert len(entries) == 2
assert entries[0]["iteration"] == 1
def test_latest_returns_last_entry(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
log.record(self._make_metrics(iteration=1))
log.record(self._make_metrics(iteration=2))
latest = log.latest()
assert latest is not None
assert latest["iteration"] == 2
def test_latest_returns_none_when_empty(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
assert log.latest() is None
def test_summary_markdown_written(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
log.record(self._make_metrics())
summary_path = tmp_path / ".loop" / "retrain" / "training_log.md"
assert summary_path.exists()
content = summary_path.read_text()
assert "AutoLoRA Training Log" in content
assert "2026-W12" in content
assert "completed" in content
def test_skill_accuracy_in_summary(self, tmp_path):
log = TrainingLog(repo_root=tmp_path)
m = self._make_metrics()
m.skill_accuracy = {"tool_calling": 0.85, "reasoning": 0.72}
log.record(m)
content = (tmp_path / ".loop" / "retrain" / "training_log.md").read_text()
assert "tool_calling" in content
assert "reasoning" in content
# ── RetrainOrchestrator integration tests ─────────────────────────────────────
class TestRetrainOrchestrator:
def test_run_dry_run_no_data(self, tmp_path):
"""Dry run with no session logs should complete without errors."""
(tmp_path / "logs").mkdir(parents=True)
orc = RetrainOrchestrator(repo_root=tmp_path, dry_run=True)
result = orc.run(weeks_ago=0)
assert result.train_status in ("skipped",)
assert result.examples_added == 0
assert result.iteration == 1
def test_run_creates_log_entry(self, tmp_path):
(tmp_path / "logs").mkdir(parents=True)
orc = RetrainOrchestrator(repo_root=tmp_path, dry_run=True)
orc.run(weeks_ago=0)
log = TrainingLog(repo_root=tmp_path)
entries = log.load_all()
assert len(entries) == 1
def test_run_with_session_data(self, tmp_path):
"""Run with actual session data — should export, filter, and log."""
today = datetime.now(tz=UTC)
date_str = today.strftime("%Y-%m-%d")
entries = [
_user_msg("deploy the service", offset=-10),
_tool_call("bash", "deployed successfully", offset=-9),
_tool_call("bash", "health check ok", offset=-8),
_timmy_msg("Service deployed and healthy", confidence=0.92, offset=-7),
_user_msg("run the tests", offset=-6),
_tool_call("bash", "All tests passed", offset=-5),
_timmy_msg("All 42 tests passed", confidence=0.95, offset=-4),
]
_make_session_log(entries, date_str, tmp_path)
orc = RetrainOrchestrator(repo_root=tmp_path, dry_run=True)
result = orc.run(weeks_ago=0)
assert result.trajectories_exported >= 1
assert result.iteration == 1
# In dry_run mode, fine-tune is skipped but trajectories should be processed
assert result.train_status == "skipped"
def test_iteration_increments_on_second_run(self, tmp_path):
(tmp_path / "logs").mkdir(parents=True)
orc = RetrainOrchestrator(repo_root=tmp_path, dry_run=True)
r1 = orc.run(weeks_ago=0)
r2 = orc.run(weeks_ago=0)
assert r2.iteration == r1.iteration + 1
def test_automations_json_has_retrain_entry(self):
"""Verify the retrain automation is registered in automations.json."""
config_path = _REPO_ROOT / "timmy_automations" / "config" / "automations.json"
assert config_path.exists()
manifest = json.loads(config_path.read_text())
ids = [a["id"] for a in manifest.get("automations", [])]
assert "retrain" in ids
def test_retrain_automation_config(self):
"""Verify retrain automation has correct schedule and config."""
config_path = _REPO_ROOT / "timmy_automations" / "config" / "automations.json"
manifest = json.loads(config_path.read_text())
retrain = next(a for a in manifest["automations"] if a["id"] == "retrain")
assert retrain["schedule"] == "weekly_sunday"
assert retrain["trigger"] == "scheduled"
assert retrain["config"]["base_model"] == "hermes4-14b"
assert retrain["config"]["weeks_ago"] == 1
_REPO_ROOT = Path(__file__).resolve().parent.parent.parent

View File

@@ -4,7 +4,7 @@
"_health_snapshot": {
"note": "Quick health check before coding — CI, P0/P1 issues, flakiness"
},
"last_updated": "2026-03-21",
"last_updated": "2026-03-23",
"automations": [
{
"id": "cycle_retro",
@@ -268,6 +268,36 @@
"ci_timeout_seconds": 5
},
"outputs": []
},
{
"id": "retrain",
"name": "AutoLoRA Continuous Improvement Loop",
"description": "Weekly sovereignty loop — exports trajectories, filters quality, appends to training dataset, triggers LoRA fine-tune, loads new adapter, and logs iteration metrics",
"script": "timmy_automations/retrain/retrain.py",
"category": "autolora",
"enabled": true,
"trigger": "scheduled",
"schedule": "weekly_sunday",
"executable": "python3",
"epic": "#1091",
"pipeline": "AutoLoRA Sovereignty Loop (Step 6 of 7)",
"config": {
"weeks_ago": 1,
"base_model": "hermes4-14b",
"dry_run": false,
"logs_dir": "logs",
"dataset_path": ".loop/retrain/training_data.jsonl",
"adapter_dir": ".loop/retrain/adapters",
"training_log_path": ".loop/retrain/training_log.jsonl",
"training_summary_path": ".loop/retrain/training_log.md"
},
"outputs": [
".loop/retrain/training_data.jsonl",
".loop/retrain/dataset_index.json",
".loop/retrain/training_log.jsonl",
".loop/retrain/training_log.md",
".loop/retrain/adapters/"
]
}
]
}

View File

@@ -0,0 +1,26 @@
"""AutoLoRA continuous improvement loop — sovereignty engine for Timmy.
Implements the weekly retrain cycle:
Work → Record trajectories → Export weekly → Filter quality
→ LoRA fine-tune → Load adapter → Model improves → Repeat
Epic: #1091 — Project Bannerlord
Pipeline: AutoLoRA Sovereignty Loop (Step 6 of 7)
Refs: #1105
"""
from timmy_automations.retrain.quality_filter import QualityFilter, TrajectoryQuality
from timmy_automations.retrain.retrain import RetrainOrchestrator, RetrainResult
from timmy_automations.retrain.training_dataset import TrainingDataset
from timmy_automations.retrain.training_log import TrainingLog
from timmy_automations.retrain.trajectory_exporter import TrajectoryExporter
__all__ = [
"QualityFilter",
"RetrainOrchestrator",
"RetrainResult",
"TrainingDataset",
"TrainingLog",
"TrajectoryExporter",
"TrajectoryQuality",
]

View File

@@ -0,0 +1,262 @@
"""LoRA trainer — triggers fine-tune job and loads the resulting adapter.
Supports two backends:
1. mlx-lm (default, Apple Silicon) — `mlx_lm.lora` CLI
2. Ollama create (adapter packaging into a new Ollama model)
Graceful degradation: if neither backend is available, logs a warning
and returns a skipped result — the rest of the loop continues.
Refs: #1105
"""
from __future__ import annotations
import json
import logging
import os
import shutil
import subprocess
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
logger = logging.getLogger(__name__)
_DEFAULT_BASE_MODEL = "hermes4-14b"
_DEFAULT_ADAPTER_DIR = ".loop/retrain/adapters"
_MLX_LM_BIN = "mlx_lm.lora"
_OLLAMA_BIN = "ollama"
@dataclass
class TrainResult:
"""Result of a LoRA fine-tune run."""
status: str # "completed" | "skipped" | "failed"
adapter_path: str | None
model_name: str | None
iteration: int
duration_seconds: float
message: str
train_loss: float | None = None
class LoRATrainer:
"""Orchestrates LoRA fine-tuning and adapter loading.
Workflow:
1. Run mlx_lm.lora fine-tune on the training dataset
2. Save the resulting adapter to .loop/retrain/adapters/<iteration>/
3. Create (or update) an Ollama model that uses the new adapter
"""
def __init__(
self,
base_model: str = _DEFAULT_BASE_MODEL,
adapter_dir: str | Path | None = None,
repo_root: str | Path | None = None,
dry_run: bool = False,
):
if repo_root is None:
repo_root = Path(__file__).resolve().parent.parent.parent
self._repo_root = Path(repo_root)
self._base_model = base_model
self._adapter_dir = self._repo_root / (adapter_dir or _DEFAULT_ADAPTER_DIR)
self._adapter_dir.mkdir(parents=True, exist_ok=True)
self._dry_run = dry_run
def train(self, dataset_path: Path, iteration: int) -> TrainResult:
"""Run LoRA fine-tuning on the dataset.
Args:
dataset_path: Path to the JSONL training dataset.
iteration: Current fine-tune iteration number (used for naming).
Returns:
TrainResult with status, adapter path, and metrics.
"""
started = datetime.now(tz=UTC)
if not dataset_path.exists() or dataset_path.stat().st_size == 0:
return TrainResult(
status="skipped",
adapter_path=None,
model_name=None,
iteration=iteration,
duration_seconds=0.0,
message="Training dataset is empty — skipping fine-tune",
)
if self._dry_run:
logger.info("[dry-run] Would fine-tune %s on %s", self._base_model, dataset_path)
adapter_path = self._adapter_dir / f"iter_{iteration:04d}" / "adapters.npz"
return TrainResult(
status="skipped",
adapter_path=str(adapter_path),
model_name=f"{self._base_model}-ft-{iteration:04d}",
iteration=iteration,
duration_seconds=0.0,
message="dry-run mode — no training performed",
)
# Determine which backend is available
if shutil.which(_MLX_LM_BIN):
return self._train_mlx(dataset_path, iteration, started)
else:
logger.warning(
"%s not found — skipping LoRA fine-tune (install mlx-lm to enable)",
_MLX_LM_BIN,
)
return TrainResult(
status="skipped",
adapter_path=None,
model_name=None,
iteration=iteration,
duration_seconds=0.0,
message=(
f"{_MLX_LM_BIN} not available. "
"Install mlx-lm on Apple Silicon to enable LoRA fine-tuning."
),
)
def _train_mlx(
self, dataset_path: Path, iteration: int, started: datetime
) -> TrainResult:
"""Run mlx_lm.lora fine-tune."""
adapter_out = self._adapter_dir / f"iter_{iteration:04d}"
adapter_out.mkdir(parents=True, exist_ok=True)
cmd = [
_MLX_LM_BIN,
"--model", self._base_model,
"--data", str(dataset_path),
"--adapter-path", str(adapter_out),
"--train",
"--iters", "100",
"--batch-size", "1",
"--learning-rate", "1e-5",
]
logger.info("Starting mlx-lm LoRA fine-tune: iteration %d", iteration)
logger.info("Command: %s", " ".join(cmd))
try:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=3600, # 1 hour max
env={**os.environ, "PYTHONUNBUFFERED": "1"},
)
except subprocess.TimeoutExpired:
duration = (datetime.now(tz=UTC) - started).total_seconds()
return TrainResult(
status="failed",
adapter_path=None,
model_name=None,
iteration=iteration,
duration_seconds=duration,
message="Fine-tune timed out after 1 hour",
)
except Exception as exc:
duration = (datetime.now(tz=UTC) - started).total_seconds()
return TrainResult(
status="failed",
adapter_path=None,
model_name=None,
iteration=iteration,
duration_seconds=duration,
message=f"Fine-tune subprocess error: {exc}",
)
duration = (datetime.now(tz=UTC) - started).total_seconds()
if result.returncode != 0:
logger.error("mlx-lm fine-tune failed: %s", result.stderr[:500])
return TrainResult(
status="failed",
adapter_path=None,
model_name=None,
iteration=iteration,
duration_seconds=duration,
message=f"mlx_lm.lora exited {result.returncode}: {result.stderr[:300]}",
)
# Parse final train loss from stdout if available
train_loss = _parse_train_loss(result.stdout)
adapter_file = adapter_out / "adapters.npz"
model_name = f"{self._base_model}-ft-{iteration:04d}"
# Attempt to register with Ollama
ollama_ok = self._register_ollama_adapter(adapter_out, model_name)
if not ollama_ok:
logger.warning("Ollama adapter registration failed — adapter saved locally")
logger.info(
"Fine-tune complete: iteration=%d loss=%.4f duration=%.1fs adapter=%s",
iteration,
train_loss or 0.0,
duration,
adapter_file,
)
return TrainResult(
status="completed",
adapter_path=str(adapter_file),
model_name=model_name,
iteration=iteration,
duration_seconds=duration,
message=f"LoRA fine-tune completed successfully in {duration:.0f}s",
train_loss=train_loss,
)
def _register_ollama_adapter(self, adapter_dir: Path, model_name: str) -> bool:
"""Create an Ollama model entry for the new adapter.
Writes a minimal Modelfile and runs `ollama create`.
"""
if not shutil.which(_OLLAMA_BIN):
logger.debug("Ollama not found — skipping adapter registration")
return False
modelfile_content = (
f"FROM {self._base_model}\n"
f"ADAPTER {adapter_dir}\n"
)
modelfile_path = adapter_dir / "Modelfile"
try:
modelfile_path.write_text(modelfile_content)
result = subprocess.run(
[_OLLAMA_BIN, "create", model_name, "-f", str(modelfile_path)],
capture_output=True,
text=True,
timeout=300,
)
if result.returncode == 0:
logger.info("Ollama model registered: %s", model_name)
return True
else:
logger.warning("ollama create failed: %s", result.stderr[:200])
return False
except Exception as exc:
logger.warning("Ollama adapter registration error: %s", exc)
return False
def _parse_train_loss(stdout: str) -> float | None:
"""Extract the final training loss from mlx-lm stdout."""
loss: float | None = None
for line in stdout.splitlines():
line_lower = line.lower()
if "train loss" in line_lower or "loss:" in line_lower:
parts = line.split()
for i, part in enumerate(parts):
if "loss" in part.lower() and i + 1 < len(parts):
try:
loss = float(parts[i + 1].strip(",:"))
except ValueError:
pass
return loss
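
To illustrate the loss parsing above, the following sketch runs the same token-scanning logic over a few made-up log lines. The log format is an assumption for illustration only, not captured mlx-lm output, and `parse_train_loss` is a standalone copy of the helper so the snippet runs on its own.

```python
from typing import Optional


def parse_train_loss(stdout: str) -> Optional[float]:
    """Return the last loss value found on a 'train loss' or 'loss:' line."""
    loss: Optional[float] = None
    for line in stdout.splitlines():
        lowered = line.lower()
        if "train loss" in lowered or "loss:" in lowered:
            parts = line.split()
            for i, part in enumerate(parts):
                # The token right after a "loss" token is taken as the value.
                if "loss" in part.lower() and i + 1 < len(parts):
                    try:
                        loss = float(parts[i + 1].strip(",:"))
                    except ValueError:
                        pass
    return loss


# Hypothetical log lines, assumed for illustration only.
SAMPLE_STDOUT = """\
Iter 10: Train loss 2.431, It/sec 0.52
Iter 50: Val loss 2.105, Val took 3.1s
Iter 100: Train loss 1.987, It/sec 0.49
"""

final_loss = parse_train_loss(SAMPLE_STDOUT)
no_loss = parse_train_loss("nothing to see here")
```

Because later matches overwrite earlier ones, the value returned is the last training loss reported. In this sample the validation line is skipped, since it contains neither "train loss" nor "loss:".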

View File

@@ -0,0 +1,172 @@
"""Quality filter — keeps only high-value trajectories for LoRA training.
Criteria for a high-quality training example:
1. Tool calls succeeded (tool calls present, no error entries)
2. Multi-step tasks completed (≥2 messages + ≥1 tool call)
3. No low-confidence signals (confidence < 0.5 on any Timmy message)
4. Minimum meaningful exchange (≥1 user message + ≥1 Timmy message)
Refs: #1105
"""
from __future__ import annotations
import logging
from dataclasses import dataclass
from enum import StrEnum
from timmy_automations.retrain.trajectory_exporter import Trajectory
logger = logging.getLogger(__name__)
_MIN_CONFIDENCE = 0.5
class TrajectoryQuality(StrEnum):
"""Quality classification for a trajectory."""
HIGH = "high" # Multi-step + tool success — ideal training data
MEDIUM = "medium" # Single exchange, no errors — acceptable
LOW = "low" # Error-prone or trivial — skip
@dataclass
class QualityResult:
"""Result of quality assessment for a single trajectory."""
trajectory: Trajectory
quality: TrajectoryQuality
score: float
reasons: list[str]
@property
def is_trainable(self) -> bool:
return self.quality in (TrajectoryQuality.HIGH, TrajectoryQuality.MEDIUM)
class QualityFilter:
"""Filters trajectories to keep only those worth training on.
Scoring:
- +1 pt: base score for any valid clean exchange (no errors)
- +3 pts: multi-step task (≥2 messages + ≥1 tool call)
- +2 pts: tool calls present and no errors
- +1 pt: decision recorded (deliberate choice made)
- -2 pts: any error entry
- -1 pt: any low-confidence response (confidence < 0.5)
HIGH ≥ 4, MEDIUM 1-3, LOW ≤ 0
"""
def __init__(self, min_confidence: float = _MIN_CONFIDENCE):
self._min_confidence = min_confidence
def assess(self, trajectory: Trajectory) -> QualityResult:
"""Score and classify a single trajectory."""
score = 0.0
reasons: list[str] = []
# Minimum viable exchange check
user_msgs = [m for m in trajectory.messages if m.get("role") == "user"]
timmy_msgs = [m for m in trajectory.messages if m.get("role") == "timmy"]
if not user_msgs or not timmy_msgs:
return QualityResult(
trajectory=trajectory,
quality=TrajectoryQuality.LOW,
score=0.0,
reasons=["Missing user or assistant messages — not a valid exchange"],
)
# Multi-step bonus
if trajectory.is_multi_step:
score += 3.0
reasons.append(
f"Multi-step task: {trajectory.message_count} messages, "
f"{trajectory.tool_call_count} tool calls"
)
# Base score for any clean exchange (user + timmy, no tool call required)
if trajectory.error_count == 0:
score += 1.0
reasons.append("Clean exchange (no errors)")
# Tool call quality
if trajectory.tool_call_count > 0:
if trajectory.error_count == 0:
score += 2.0
reasons.append(
f"All {trajectory.tool_call_count} tool call(s) succeeded"
)
else:
score -= 2.0
reasons.append(
f"{trajectory.error_count} error(s) during {trajectory.tool_call_count} tool call(s)"
)
elif trajectory.error_count > 0:
score -= 2.0
reasons.append(f"{trajectory.error_count} error(s) with no tool calls")
# Decision bonus
if trajectory.decisions:
score += 1.0
reasons.append(f"Decisions recorded: {len(trajectory.decisions)}")
# Confidence penalty
low_conf = [
m
for m in timmy_msgs
if m.get("confidence") is not None
and m["confidence"] < self._min_confidence
]
if low_conf:
score -= len(low_conf)
reasons.append(
f"{len(low_conf)} low-confidence response(s) (threshold={self._min_confidence})"
)
# Classify
if score >= 4.0:
quality = TrajectoryQuality.HIGH
elif score >= 1.0:
quality = TrajectoryQuality.MEDIUM
else:
quality = TrajectoryQuality.LOW
return QualityResult(
trajectory=trajectory,
quality=quality,
score=score,
reasons=reasons,
)
def filter(
self, trajectories: list[Trajectory]
) -> tuple[list[QualityResult], dict[str, int]]:
"""Assess all trajectories and return trainable ones with stats.
Returns:
(trainable_results, stats_dict) where stats_dict has keys
'total', 'high', 'medium', 'low', 'accepted'.
"""
results = [self.assess(t) for t in trajectories]
trainable = [r for r in results if r.is_trainable]
stats = {
"total": len(results),
"high": sum(1 for r in results if r.quality == TrajectoryQuality.HIGH),
"medium": sum(1 for r in results if r.quality == TrajectoryQuality.MEDIUM),
"low": sum(1 for r in results if r.quality == TrajectoryQuality.LOW),
"accepted": len(trainable),
}
logger.info(
"Quality filter: %d/%d accepted (high=%d medium=%d low=%d)",
stats["accepted"],
stats["total"],
stats["high"],
stats["medium"],
stats["low"],
)
return trainable, stats
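
The point rules in `QualityFilter.assess` can be checked by hand with a small standalone sketch. `score_trajectory` is a hypothetical helper that takes plain counts instead of a `Trajectory`, and it assumes the minimum-exchange check (at least one user and one Timmy message) has already passed.

```python
def score_trajectory(
    message_count: int,
    tool_call_count: int,
    error_count: int,
    decision_count: int = 0,
    low_confidence_count: int = 0,
) -> tuple[float, str]:
    """Apply the scoring rules to plain counts and return (score, label)."""
    score = 0.0
    if message_count >= 2 and tool_call_count >= 1:
        score += 3.0  # multi-step bonus
    if error_count == 0:
        score += 1.0  # base score for a clean exchange
    if tool_call_count > 0:
        score += 2.0 if error_count == 0 else -2.0
    elif error_count > 0:
        score -= 2.0  # errors without any tool call
    if decision_count:
        score += 1.0
    score -= low_confidence_count  # one point per low-confidence reply
    if score >= 4.0:
        label = "high"
    elif score >= 1.0:
        label = "medium"
    else:
        label = "low"
    return score, label


clean_tool_run = score_trajectory(message_count=2, tool_call_count=2, error_count=0)
plain_chat = score_trajectory(message_count=2, tool_call_count=0, error_count=0)
failed_tool_run = score_trajectory(message_count=3, tool_call_count=1, error_count=1)
```

Under these rules a clean two-message tool run scores 6 (high) and a plain exchange scores 1 (medium). A multi-step run with one error also scores 1, because the +3 multi-step bonus outweighs the -2 error penalty, so it would still be accepted as medium. This may be worth reviewing if error-containing trajectories are meant to be excluded.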

View File

@@ -0,0 +1,292 @@
#!/usr/bin/env python3
"""AutoLoRA continuous improvement loop — the sovereignty retrain script.
Implements the weekly retrain cycle end-to-end:
Work → Record trajectories → Export weekly → Filter quality
→ LoRA fine-tune → Load adapter → Model improves → Repeat forever
Run:
python3 timmy_automations/retrain/retrain.py
python3 timmy_automations/retrain/retrain.py --dry-run
python3 timmy_automations/retrain/retrain.py --weeks-ago 1
Epic: #1091 — Project Bannerlord
Pipeline: AutoLoRA Sovereignty Loop (Step 6 of 7)
Refs: #1105
"""
from __future__ import annotations
import argparse
import json
import logging
import sys
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
# Allow running directly from repo root
_REPO_ROOT = Path(__file__).resolve().parent.parent.parent
if str(_REPO_ROOT) not in sys.path:
sys.path.insert(0, str(_REPO_ROOT))
from timmy_automations.retrain.lora_trainer import LoRATrainer
from timmy_automations.retrain.quality_filter import QualityFilter
from timmy_automations.retrain.training_dataset import TrainingDataset
from timmy_automations.retrain.training_log import CycleMetrics, TrainingLog
from timmy_automations.retrain.trajectory_exporter import TrajectoryExporter
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)-8s %(name)s: %(message)s",
datefmt="%Y-%m-%dT%H:%M:%S",
)
logger = logging.getLogger("retrain")
@dataclass
class RetrainResult:
"""Result of a complete retrain cycle."""
iteration: int
week: str
trajectories_exported: int
trajectories_accepted: int
examples_added: int
dataset_total: int
train_status: str
adapter_path: str | None
model_name: str | None
train_loss: float | None
duration_seconds: float
notes: str
class RetrainOrchestrator:
"""Orchestrates the complete AutoLoRA continuous improvement loop.
Step 1: Export this week's conversation trajectories from session logs
Step 2: Filter for high-quality exchanges
Step 3: Append to the training dataset
Step 4: Trigger LoRA fine-tune
Step 5: Load the new adapter (via Ollama)
Step 6: Log iteration, loss, skill accuracy
"""
def __init__(
self,
base_model: str = "hermes4-14b",
repo_root: str | Path | None = None,
dry_run: bool = False,
):
if repo_root is None:
repo_root = _REPO_ROOT
self._repo_root = Path(repo_root)
self._dry_run = dry_run
self.exporter = TrajectoryExporter(repo_root=self._repo_root)
self.quality_filter = QualityFilter()
self.dataset = TrainingDataset(repo_root=self._repo_root)
self.trainer = LoRATrainer(
base_model=base_model,
repo_root=self._repo_root,
dry_run=dry_run,
)
self.log = TrainingLog(repo_root=self._repo_root)
def run(self, weeks_ago: int = 1) -> RetrainResult:
"""Execute one complete retrain cycle.
Args:
weeks_ago: Which week to process. 0 = current week (partial),
1 = last week (default, Sunday night run), etc.
Returns:
RetrainResult with full cycle summary.
"""
started = datetime.now(tz=UTC)
iteration = self.log.next_iteration()
# Determine ISO week tag
from datetime import timedelta
now = datetime.now(tz=UTC)
target_date = now - timedelta(weeks=weeks_ago)
week_tag = f"{target_date.isocalendar().year}-W{target_date.isocalendar().week:02d}"
logger.info(
"=== AutoLoRA Retrain Cycle %d | Week: %s | dry_run=%s ===",
iteration,
week_tag,
self._dry_run,
)
# Step 1: Export trajectories
logger.info("Step 1: Exporting trajectories for %s...", week_tag)
trajectories = self.exporter.export_week(weeks_ago=weeks_ago)
logger.info("Exported %d raw trajectories", len(trajectories))
# Step 2: Quality filter
logger.info("Step 2: Applying quality filter...")
trainable, filter_stats = self.quality_filter.filter(trajectories)
logger.info(
"Quality filter: %d/%d accepted (high=%d medium=%d low=%d)",
filter_stats["accepted"],
filter_stats["total"],
filter_stats["high"],
filter_stats["medium"],
filter_stats["low"],
)
# Step 3: Append to dataset
logger.info("Step 3: Appending to training dataset...")
append_result = self.dataset.append(trainable, week_tag)
logger.info(
"Dataset: +%d new examples (%d total)",
append_result.new_examples,
append_result.total_examples,
)
# Step 4: LoRA fine-tune
logger.info("Step 4: Triggering LoRA fine-tune (iteration=%d)...", iteration)
train_result = self.trainer.train(
dataset_path=self.dataset.dataset_path,
iteration=iteration,
)
logger.info(
"Train result: status=%s loss=%s duration=%.1fs",
train_result.status,
train_result.train_loss,
train_result.duration_seconds,
)
# Step 5 & 6: Log cycle
duration = (datetime.now(tz=UTC) - started).total_seconds()
metrics = CycleMetrics(
iteration=iteration,
week=week_tag,
ran_at=started.isoformat(),
trajectories_total=filter_stats["total"],
trajectories_high=filter_stats["high"],
trajectories_medium=filter_stats["medium"],
trajectories_low=filter_stats["low"],
trajectories_accepted=filter_stats["accepted"],
examples_added=append_result.new_examples,
dataset_total=append_result.total_examples,
train_status=train_result.status,
train_loss=train_result.train_loss,
train_duration_seconds=train_result.duration_seconds,
adapter_path=train_result.adapter_path,
model_name=train_result.model_name,
notes=train_result.message,
)
self.log.record(metrics)
result = RetrainResult(
iteration=iteration,
week=week_tag,
trajectories_exported=len(trajectories),
trajectories_accepted=filter_stats["accepted"],
examples_added=append_result.new_examples,
dataset_total=append_result.total_examples,
train_status=train_result.status,
adapter_path=train_result.adapter_path,
model_name=train_result.model_name,
train_loss=train_result.train_loss,
duration_seconds=duration,
notes=train_result.message,
)
logger.info(
"=== Cycle %d complete: status=%s examples_added=%d total=%.1fs ===",
iteration,
train_result.status,
append_result.new_examples,
duration,
)
return result
def _print_result(result: RetrainResult, as_json: bool = False) -> None:
"""Print cycle result to stdout."""
if as_json:
print(
json.dumps(
{
"iteration": result.iteration,
"week": result.week,
"trajectories_exported": result.trajectories_exported,
"trajectories_accepted": result.trajectories_accepted,
"examples_added": result.examples_added,
"dataset_total": result.dataset_total,
"train_status": result.train_status,
"adapter_path": result.adapter_path,
"model_name": result.model_name,
"train_loss": result.train_loss,
"duration_seconds": result.duration_seconds,
"notes": result.notes,
},
indent=2,
)
)
return
print(f"\n{'='*60}")
print(f" AutoLoRA Retrain — Cycle {result.iteration}")
print(f" Week: {result.week}")
print(f"{'='*60}")
print(f" Trajectories: {result.trajectories_exported} exported, {result.trajectories_accepted} accepted")
print(f" Dataset: +{result.examples_added} examples ({result.dataset_total} total)")
print(f" Fine-tune: {result.train_status}")
if result.train_loss is not None:
print(f" Train loss: {result.train_loss:.4f}")
if result.model_name:
print(f" New model: {result.model_name}")
if result.adapter_path:
print(f" Adapter: {result.adapter_path}")
print(f" Duration: {result.duration_seconds:.1f}s")
print(f" Notes: {result.notes}")
print(f"{'='*60}\n")
def main() -> int:
parser = argparse.ArgumentParser(
description="AutoLoRA continuous improvement loop — sovereignty engine for Timmy"
)
parser.add_argument(
"--weeks-ago",
type=int,
default=1,
help="Which week to process: 0=current (partial), 1=last week (default)",
)
parser.add_argument(
"--base-model",
default="hermes4-14b",
help="Ollama base model name (default: hermes4-14b)",
)
parser.add_argument(
"--dry-run",
action="store_true",
help="Export and filter trajectories but skip actual fine-tuning",
)
parser.add_argument(
"--json",
action="store_true",
dest="as_json",
help="Output result as JSON",
)
args = parser.parse_args()
orchestrator = RetrainOrchestrator(
base_model=args.base_model,
dry_run=args.dry_run,
)
result = orchestrator.run(weeks_ago=args.weeks_ago)
_print_result(result, as_json=args.as_json)
# Exit 0 even on skipped/failed training — the loop must continue
return 0
if __name__ == "__main__":
sys.exit(main())
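
An ISO week number belongs to an ISO year, which can differ from the calendar year in the days around New Year. The sketch below builds a tag in the same `YYYY-Www` shape as the `week_tag` in `run()` from the ISO year and week pair. `iso_week_tag` is a hypothetical helper for illustration, not part of the module.

```python
from datetime import date, timedelta


def iso_week_tag(day: date, weeks_ago: int = 0) -> str:
    """Return the ISO week tag for the week `weeks_ago` weeks before `day`."""
    target = day - timedelta(weeks=weeks_ago)
    iso = target.isocalendar()  # named tuple: year, week, weekday (Python 3.9+)
    return f"{iso.year}-W{iso.week:02d}"


# Monday 2024-12-30 already belongs to ISO week 1 of 2025.
boundary_tag = iso_week_tag(date(2024, 12, 30))
previous_tag = iso_week_tag(date(2024, 12, 30), weeks_ago=1)
```

Here `boundary_tag` is `"2025-W01"` and `previous_tag` is `"2024-W52"`. Pairing the week number with the calendar year instead would label the first date `2024-W01`, which collides with the real first week of 2024.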

View File

@@ -0,0 +1,180 @@
"""Training dataset manager — appends filtered trajectories to a JSONL training file.
Maintains a growing dataset of high-quality conversation examples in the
chat-format expected by mlx-lm / HuggingFace fine-tuning pipelines.
Output format (one JSON object per line):
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Refs: #1105
"""
from __future__ import annotations
import json
import logging
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from timmy_automations.retrain.quality_filter import QualityResult
logger = logging.getLogger(__name__)
_DEFAULT_DATASET_PATH = ".loop/retrain/training_data.jsonl"
_DEFAULT_INDEX_PATH = ".loop/retrain/dataset_index.json"
@dataclass
class AppendResult:
"""Result of appending trajectories to the training dataset."""
new_examples: int
total_examples: int
dataset_path: str
week_tag: str
class TrainingDataset:
"""Manages the LoRA training dataset file.
Each entry is a chat-format example:
{"messages": [...], "week": "2026-W12", "quality": "high", "added_at": "..."}
"""
def __init__(
self,
dataset_path: str | Path | None = None,
index_path: str | Path | None = None,
repo_root: str | Path | None = None,
):
if repo_root is None:
repo_root = Path(__file__).resolve().parent.parent.parent
self._repo_root = Path(repo_root)
self._dataset_path = self._repo_root / (
dataset_path or _DEFAULT_DATASET_PATH
)
self._index_path = self._repo_root / (
index_path or _DEFAULT_INDEX_PATH
)
self._dataset_path.parent.mkdir(parents=True, exist_ok=True)
@property
def dataset_path(self) -> Path:
return self._dataset_path
def count(self) -> int:
"""Return the number of examples currently in the dataset."""
if not self._dataset_path.exists():
return 0
count = 0
with open(self._dataset_path) as f:
for line in f:
if line.strip():
count += 1
return count
def append(
self, quality_results: list[QualityResult], week_tag: str
) -> AppendResult:
"""Append high-quality trajectories to the training dataset.
Deduplicates by (week_tag, session_date, started_at) so re-running
the export for the same week is idempotent.
Args:
quality_results: Filtered, trainable quality results.
week_tag: ISO week string e.g. "2026-W12".
Returns:
AppendResult with counts.
"""
existing_keys = self._load_existing_keys()
new_count = 0
added_at = datetime.now(tz=UTC).isoformat()
with open(self._dataset_path, "a") as f:
for result in quality_results:
traj = result.trajectory
dedup_key = (
f"{week_tag}|{traj.session_date}|{traj.started_at}"
)
if dedup_key in existing_keys:
logger.debug("Skipping duplicate trajectory: %s", dedup_key)
continue
chat_messages = traj.to_chat_format()
if len(chat_messages) < 2:
logger.debug(
"Skipping trajectory with %d chat messages (need ≥2)",
len(chat_messages),
)
continue
record = {
"messages": chat_messages,
"week": week_tag,
"quality": result.quality.value,
"score": result.score,
"session_date": traj.session_date,
"started_at": traj.started_at,
"tool_calls": traj.tool_call_count,
"added_at": added_at,
}
f.write(json.dumps(record) + "\n")
existing_keys.add(dedup_key)
new_count += 1
total = self.count()
self._update_index(week_tag, new_count, total)
logger.info(
"Dataset: appended %d new examples (total=%d)", new_count, total
)
return AppendResult(
new_examples=new_count,
total_examples=total,
dataset_path=str(self._dataset_path),
week_tag=week_tag,
)
def _load_existing_keys(self) -> set[str]:
"""Load deduplication keys from the existing dataset."""
keys: set[str] = set()
if not self._dataset_path.exists():
return keys
with open(self._dataset_path) as f:
for line in f:
line = line.strip()
if not line:
continue
try:
record = json.loads(line)
week = record.get("week", "")
session_date = record.get("session_date", "")
started_at = record.get("started_at", "")
keys.add(f"{week}|{session_date}|{started_at}")
except json.JSONDecodeError:
continue
return keys
def _update_index(self, week_tag: str, new_count: int, total: int) -> None:
"""Update the dataset index JSON with latest run metadata."""
index: dict = {}
if self._index_path.exists():
try:
index = json.loads(self._index_path.read_text())
except (json.JSONDecodeError, OSError):
index = {}
index.setdefault("weeks", {})
index["weeks"][week_tag] = {
"examples_added": new_count,
"updated_at": datetime.now(tz=UTC).isoformat(),
}
index["total_examples"] = total
index["last_updated"] = datetime.now(tz=UTC).isoformat()
self._index_path.write_text(json.dumps(index, indent=2))
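
The idempotence claim in `append` rests on the `week|session_date|started_at` key. The following standalone sketch shows the same idea against a temporary JSONL file. `load_keys` and `append_unique` are hypothetical stand-ins for `_load_existing_keys` and `append`, and the records are invented sample data.

```python
import json
import tempfile
from pathlib import Path


def load_keys(path: Path) -> set[str]:
    """Collect dedup keys from an existing JSONL file, skipping bad lines."""
    keys: set[str] = set()
    if not path.exists():
        return keys
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            keys.add(f"{rec.get('week', '')}|{rec.get('session_date', '')}|{rec.get('started_at', '')}")
    return keys


def append_unique(path: Path, records: list[dict]) -> int:
    """Append only records whose key is not already present; return the count added."""
    seen = load_keys(path)
    added = 0
    with open(path, "a") as f:
        for rec in records:
            key = f"{rec['week']}|{rec['session_date']}|{rec['started_at']}"
            if key in seen:
                continue
            f.write(json.dumps(rec) + "\n")
            seen.add(key)
            added += 1
    return added


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "training_data.jsonl"
    batch = [
        {"week": "2026-W12", "session_date": "2026-03-20",
         "started_at": "2026-03-20T10:00:00+00:00", "messages": []},
        {"week": "2026-W12", "session_date": "2026-03-20",
         "started_at": "2026-03-20T11:30:00+00:00", "messages": []},
    ]
    first_run = append_unique(dataset, batch)   # both records are new
    second_run = append_unique(dataset, batch)  # same batch again, nothing added
    line_count = len(dataset.read_text().splitlines())
```

Note that the key includes the week tag, so the same session exported under two different week tags would be stored twice.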

View File

@@ -0,0 +1,183 @@
"""Training log — records each fine-tune cycle with metrics and skill deltas.
Writes to .loop/retrain/training_log.jsonl (one entry per cycle) and
maintains a human-readable .loop/retrain/training_log.md summary.
Each log entry captures:
- Iteration count
- Week processed
- Quality filter stats
- Examples added to dataset
- LoRA train result (loss, duration, adapter path)
- Skill accuracy deltas (from smoke tests)
Refs: #1105
"""
from __future__ import annotations
import json
import logging
from dataclasses import asdict, dataclass, field
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
logger = logging.getLogger(__name__)
_DEFAULT_LOG_PATH = ".loop/retrain/training_log.jsonl"
_DEFAULT_SUMMARY_PATH = ".loop/retrain/training_log.md"
@dataclass
class CycleMetrics:
"""Metrics for a single retrain cycle."""
iteration: int
week: str
ran_at: str
# Quality filter
trajectories_total: int = 0
trajectories_high: int = 0
trajectories_medium: int = 0
trajectories_low: int = 0
trajectories_accepted: int = 0
# Dataset
examples_added: int = 0
dataset_total: int = 0
# Training
train_status: str = "skipped"
train_loss: float | None = None
train_duration_seconds: float = 0.0
adapter_path: str | None = None
model_name: str | None = None
# Skill accuracy (optional, from smoke tests)
skill_accuracy: dict[str, float] = field(default_factory=dict)
skill_delta: dict[str, float] = field(default_factory=dict)
# Human-readable summary
notes: str = ""
class TrainingLog:
"""Persistent log of all retrain cycles."""
def __init__(
self,
log_path: str | Path | None = None,
summary_path: str | Path | None = None,
repo_root: str | Path | None = None,
):
if repo_root is None:
repo_root = Path(__file__).resolve().parent.parent.parent
self._repo_root = Path(repo_root)
self._log_path = self._repo_root / (log_path or _DEFAULT_LOG_PATH)
self._summary_path = self._repo_root / (summary_path or _DEFAULT_SUMMARY_PATH)
self._log_path.parent.mkdir(parents=True, exist_ok=True)
@property
def log_path(self) -> Path:
return self._log_path
def next_iteration(self) -> int:
"""Return the next iteration number (1-indexed)."""
entries = self.load_all()
if not entries:
return 1
return max(e.get("iteration", 0) for e in entries) + 1
def record(self, metrics: CycleMetrics) -> None:
"""Append a cycle metrics record to the log."""
entry = asdict(metrics)
with open(self._log_path, "a") as f:
f.write(json.dumps(entry) + "\n")
self._update_summary(metrics)
logger.info(
"Training log: iteration=%d week=%s status=%s examples_added=%d",
metrics.iteration,
metrics.week,
metrics.train_status,
metrics.examples_added,
)
def load_all(self) -> list[dict[str, Any]]:
"""Load all cycle records from the log."""
if not self._log_path.exists():
return []
entries: list[dict[str, Any]] = []
with open(self._log_path) as f:
for line in f:
line = line.strip()
if not line:
continue
try:
entries.append(json.loads(line))
except json.JSONDecodeError:
logger.debug("Skipping malformed log entry")
return entries
def latest(self) -> dict[str, Any] | None:
"""Return the most recent cycle record."""
entries = self.load_all()
return entries[-1] if entries else None
def _update_summary(self, metrics: CycleMetrics) -> None:
"""Rewrite the markdown summary with all cycles."""
all_entries = self.load_all()
lines = [
"# AutoLoRA Training Log\n",
f"*Updated: {datetime.now(tz=UTC).isoformat()}*\n",
f"*Total iterations: {len(all_entries)}*\n",
"",
"## Cycles\n",
"| # | Week | Status | Loss | Examples | Duration |",
"|---|------|--------|------|----------|----------|",
]
for entry in reversed(all_entries[-20:]): # Last 20 cycles
loss = f"{entry['train_loss']:.4f}" if entry.get("train_loss") is not None else ""
lines.append(
f"| {entry.get('iteration', '?')} "
f"| {entry.get('week', '?')} "
f"| {entry.get('train_status', '?')} "
f"| {loss} "
f"| +{entry.get('examples_added', 0)} ({entry.get('dataset_total', 0)} total) "
f"| {entry.get('train_duration_seconds', 0.0):.0f}s |"
)
lines.append("")
lines.append("## Skill Accuracy Over Time\n")
# Collect all unique skills
all_skills: set[str] = set()
for entry in all_entries:
all_skills.update(entry.get("skill_accuracy", {}).keys())
if all_skills:
skill_header = "| # | Week | " + " | ".join(sorted(all_skills)) + " |"
skill_sep = "|---|------|" + "|".join("---" for _ in all_skills) + "|"
lines.extend([skill_header, skill_sep])
for entry in reversed(all_entries[-10:]):
acc = entry.get("skill_accuracy", {})
row = f"| {entry.get('iteration', '?')} | {entry.get('week', '?')} | "
row += " | ".join(
f"{acc.get(s, 0.0):.0%}" if s in acc else ""
for s in sorted(all_skills)
)
row += " |"
lines.append(row)
else:
lines.append("*No skill accuracy data yet — run smoke tests after fine-tuning.*")
lines.append("")
if metrics.notes:
lines.append(f"## Latest Notes\n\n{metrics.notes}\n")
self._summary_path.write_text("\n".join(lines))
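
Iteration numbering is derived from the log itself rather than from a separate counter. A minimal sketch of that rule, operating on log text instead of a file, follows. The function below is a hypothetical standalone version of `TrainingLog.next_iteration` combined with the tolerant parsing in `load_all`.

```python
import json


def next_iteration(log_text: str) -> int:
    """Return 1 for an empty log, otherwise the highest iteration plus one."""
    iterations = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines are ignored, as in load_all()
        iterations.append(entry.get("iteration", 0))
    return max(iterations, default=0) + 1


empty_log = next_iteration("")
after_two = next_iteration('{"iteration": 1}\n{not json}\n{"iteration": 2}\n')
```

Taking the maximum rather than the line count keeps numbering monotonic even when a malformed entry is skipped.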

View File

@@ -0,0 +1,255 @@
"""Trajectory exporter — reads session JSONL logs and extracts conversation trajectories.
A trajectory is a coherent sequence of messages + tool calls that form
a single task attempt. Each trajectory becomes one training example.
Refs: #1105
"""
from __future__ import annotations
import json
import logging
from dataclasses import dataclass, field
from datetime import UTC, datetime, timedelta
from pathlib import Path
from typing import Any
logger = logging.getLogger(__name__)
_LOGS_DIR_DEFAULT = "logs"
_SESSION_GLOB = "session_*.jsonl"
@dataclass
class Trajectory:
    """A single conversation trajectory extracted from session logs."""

    session_date: str
    started_at: str
    ended_at: str
    messages: list[dict[str, Any]] = field(default_factory=list)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    errors: list[dict[str, Any]] = field(default_factory=list)
    decisions: list[dict[str, Any]] = field(default_factory=list)

    @property
    def message_count(self) -> int:
        return len(self.messages)

    @property
    def tool_call_count(self) -> int:
        return len(self.tool_calls)

    @property
    def error_count(self) -> int:
        return len(self.errors)

    @property
    def has_successful_tool_call(self) -> bool:
        """True if the trajectory has at least one tool call and no error entries.

        This is a coarse check: errors are not matched to individual tool
        calls, so a single error anywhere in the trajectory makes this False.
        """
        return self.tool_call_count > 0 and self.error_count == 0

    @property
    def is_multi_step(self) -> bool:
        """True if the trajectory has at least two messages and at least one tool call."""
        return self.message_count >= 2 and self.tool_call_count >= 1

    def to_chat_format(self) -> list[dict[str, str]]:
        """Convert trajectory to chat-format messages for training.

        Merges messages, tool calls, and decisions in timestamp order. Tool
        calls and decisions are emitted as assistant turns prefixed with
        ``[tool:<name>]`` and ``[decided]`` respectively. Error entries are
        not included.
        """
        chat: list[dict[str, str]] = []
        # Merge all entries by timestamp and emit in order.
        # Entries without a timestamp sort first (empty-string key).
        all_entries = sorted(
            self.messages + self.tool_calls + self.decisions,
            key=lambda e: e.get("timestamp", ""),
        )
        for entry in all_entries:
            etype = entry.get("type")
            if etype == "message":
                # Any non-user role (e.g. "timmy") maps to "assistant".
                role = "user" if entry.get("role") == "user" else "assistant"
                content = entry.get("content", "")
                if content:
                    chat.append({"role": role, "content": content})
            elif etype == "tool_call":
                tool = entry.get("tool", "unknown")
                result = entry.get("result", "")
                chat.append(
                    {
                        "role": "assistant",
                        "content": f"[tool:{tool}] {result}",
                    }
                )
            elif etype == "decision":
                decision = entry.get("decision", "")
                if decision:
                    chat.append({"role": "assistant", "content": f"[decided] {decision}"})
        return chat
class TrajectoryExporter:
    """Reads session JSONL logs and returns Trajectory objects for a given week."""

    def __init__(self, logs_dir: str | Path | None = None, repo_root: str | Path | None = None):
        if repo_root is None:
            repo_root = Path(__file__).resolve().parent.parent.parent
        self._repo_root = Path(repo_root)
        if logs_dir is None:
            self._logs_dir = self._repo_root / _LOGS_DIR_DEFAULT
        else:
            self._logs_dir = Path(logs_dir)

    def export_week(self, weeks_ago: int = 0) -> list[Trajectory]:
        """Export all trajectories from the specified week.

        Args:
            weeks_ago: 0 = current week, 1 = last week, etc.

        Returns:
            List of Trajectory objects extracted from session logs.
        """
        now = datetime.now(tz=UTC)
        # Week boundaries: Monday 00:00 UTC (inclusive) to the next Monday (exclusive)
        days_since_monday = now.weekday()
        week_start = (now - timedelta(days=days_since_monday + 7 * weeks_ago)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        week_end = week_start + timedelta(days=7)
        logger.info(
            "Exporting trajectories for week %s to %s (end exclusive)",
            week_start.date().isoformat(),
            week_end.date().isoformat(),
        )
        trajectories: list[Trajectory] = []
        log_files = sorted(self._logs_dir.glob(_SESSION_GLOB))
        for log_file in log_files:
            # Parse date from filename: session_YYYY-MM-DD.jsonl
            try:
                date_str = log_file.stem.removeprefix("session_")
                file_date = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=UTC)
            except ValueError:
                logger.debug("Skipping non-date session file: %s", log_file.name)
                continue
            if not (week_start <= file_date < week_end):
                continue
            file_trajectories = self._extract_from_file(log_file)
            trajectories.extend(file_trajectories)
            logger.info(
                "Extracted %d trajectories from %s", len(file_trajectories), log_file.name
            )
        logger.info("Total trajectories exported: %d", len(trajectories))
        return trajectories
    def _extract_from_file(self, log_file: Path) -> list[Trajectory]:
        """Parse a single session JSONL file into trajectories.

        Reads one JSON object per line, skipping blank and malformed lines,
        then groups the entries into trajectories at conversation boundaries
        (a gap of inactivity, or a new user turn after a Timmy reply).
        """
        entries: list[dict[str, Any]] = []
        try:
            with open(log_file, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        entries.append(json.loads(line))
                    except json.JSONDecodeError:
                        logger.debug("Skipping malformed JSON line in %s", log_file.name)
        except OSError as exc:
            logger.warning("Could not read %s: %s", log_file, exc)
            return []
        if not entries:
            return []
        date_str = log_file.stem.removeprefix("session_")
        return self._segment_trajectories(entries, date_str)
    def _segment_trajectories(
        self, entries: list[dict[str, Any]], session_date: str
    ) -> list[Trajectory]:
        """Split a flat list of session entries into discrete trajectories.

        Segmentation rule: start a new trajectory when:
        - A user message follows a Timmy message (new conversation turn)
        - More than 5 minutes have elapsed between entries

        Segments that contain no messages are dropped. This produces training
        examples that are coherent task attempts.
        """
        if not entries:
            return []
        trajectories: list[Trajectory] = []
        current_entries: list[dict[str, Any]] = []
        prev_ts: datetime | None = None
        _SEGMENT_GAP_MINUTES = 5

        def _flush() -> None:
            # Reads whatever list current_entries is bound to at call time.
            if current_entries:
                traj = _build_trajectory(current_entries, session_date)
                if traj.message_count > 0:
                    trajectories.append(traj)

        for entry in entries:
            ts_raw = entry.get("timestamp", "")
            try:
                ts = datetime.fromisoformat(ts_raw.replace("Z", "+00:00"))
            except (ValueError, AttributeError):
                # Missing, malformed, or non-string timestamp: skip the gap check.
                ts = None
            # Time-gap segmentation
            if ts and prev_ts and (ts - prev_ts).total_seconds() > _SEGMENT_GAP_MINUTES * 60:
                _flush()
                current_entries = []
            # New-turn segmentation: user message after assistant turn
            etype = entry.get("type")
            erole = entry.get("role")
            if etype == "message" and erole == "user" and current_entries:
                # Split only if the most recent message in the current segment
                # came from Timmy; tool calls, errors, and decisions are skipped.
                for prev in reversed(current_entries):
                    if prev.get("type") == "message":
                        if prev.get("role") == "timmy":
                            _flush()
                            current_entries = []
                        break
            current_entries.append(entry)
            if ts:
                prev_ts = ts
        _flush()
        return trajectories
def _build_trajectory(entries: list[dict[str, Any]], session_date: str) -> Trajectory:
    """Build a Trajectory from a flat list of entries."""
    messages = [e for e in entries if e.get("type") == "message"]
    tool_calls = [e for e in entries if e.get("type") == "tool_call"]
    errors = [e for e in entries if e.get("type") == "error"]
    decisions = [e for e in entries if e.get("type") == "decision"]
    timestamps = [e.get("timestamp", "") for e in entries if e.get("timestamp")]
    started_at = min(timestamps) if timestamps else ""
    ended_at = max(timestamps) if timestamps else ""
    return Trajectory(
        session_date=session_date,
        started_at=started_at,
        ended_at=ended_at,
        messages=messages,
        tool_calls=tool_calls,
        errors=errors,
        decisions=decisions,
    )
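
Review note: the segmentation rule in `_segment_trajectories` (split on a gap of more than 5 minutes, or on a user message that follows a Timmy message) can be checked on a small hand-written log. The sketch below is a standalone approximation, not the code in this change: `parse_ts`, `segment`, and the sample entries are hypothetical, and unlike the real method it keeps segments that contain no messages.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Any

GAP = timedelta(minutes=5)  # same 5-minute threshold as the docstring above


def parse_ts(raw: Any) -> datetime | None:
    """Parse an ISO-8601 timestamp, tolerating a trailing 'Z'; None if unparseable."""
    try:
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        return None


def segment(entries: list[dict[str, Any]]) -> list[list[dict[str, Any]]]:
    """Split entries on a long gap or on a user message that follows a Timmy message."""
    segments: list[list[dict[str, Any]]] = []
    current: list[dict[str, Any]] = []
    prev_ts: datetime | None = None
    for entry in entries:
        ts = parse_ts(entry.get("timestamp"))
        gap = ts is not None and prev_ts is not None and ts - prev_ts > GAP
        # Most recent message in the current segment, ignoring tool calls etc.
        last_msg = next((e for e in reversed(current) if e.get("type") == "message"), None)
        new_turn = (
            entry.get("type") == "message"
            and entry.get("role") == "user"
            and last_msg is not None
            and last_msg.get("role") == "timmy"
        )
        if current and (gap or new_turn):
            segments.append(current)
            current = []
        current.append(entry)
        if ts is not None:
            prev_ts = ts
    if current:
        segments.append(current)
    return segments


# Made-up log: one completed exchange, a follow-up, then a message after a long pause.
log = [
    {"type": "message", "role": "user", "content": "status?", "timestamp": "2026-03-23T10:00:00Z"},
    {"type": "tool_call", "tool": "ping", "result": "ok", "timestamp": "2026-03-23T10:00:05Z"},
    {"type": "message", "role": "timmy", "content": "all good", "timestamp": "2026-03-23T10:00:10Z"},
    {"type": "message", "role": "user", "content": "next task", "timestamp": "2026-03-23T10:00:20Z"},
    {"type": "message", "role": "user", "content": "still there?", "timestamp": "2026-03-23T10:10:00Z"},
]
sizes = [len(s) for s in segment(log)]
```

Here the fourth entry opens a new segment because it is a user message after a Timmy reply, and the fifth opens another because more than 5 minutes passed. A fixture like this could serve as a unit test for the real method if the author wants regression coverage.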