Timmy-time-dashboard/docs/research/autoresearch-h1-baseline.md
Claude (Opus 4.6) f62220eb61
[claude] Autoresearch H1: Apple Silicon support + M3 Max baseline doc (#905) (#1252)
2026-03-23 23:38:38 +00:00
Autoresearch H1 — M3 Max Baseline

Status: Baseline established (Issue #905)
Hardware: Apple M3 Max · 36 GB unified memory
Date: 2026-03-23
Refs: #905 · #904 (parent) · #881 (M3 Max compute) · #903 (MLX benchmark)


Setup

Prerequisites

# Install MLX (Apple Silicon — faster than llama.cpp in the #903 benchmark)
pip install mlx mlx-lm

# Install project deps
tox -e dev  # or: pip install -e '.[dev]'

Clone & prepare

prepare_experiment in src/timmy/autoresearch.py handles the clone. On Apple Silicon it automatically sets AUTORESEARCH_BACKEND=mlx and AUTORESEARCH_DATASET=tinystories.

from timmy.autoresearch import prepare_experiment
status = prepare_experiment("data/experiments", dataset="tinystories", backend="auto")
print(status)

Or via the dashboard: POST /experiments/start (requires AUTORESEARCH_ENABLED=true).
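For reference, a minimal sketch of how `backend="auto"` could resolve per platform. The real logic lives in `src/timmy/autoresearch.py`; the function name here is hypothetical.

```python
import platform

def resolve_backend(requested: str = "auto") -> str:
    """Illustrative only: map 'auto' to 'mlx' on Apple Silicon, else 'cpu'."""
    if requested != "auto":
        return requested  # explicit choice wins
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"      # native Metal path
    return "cpu"          # llama.cpp / CPU fallback
```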

Configuration (.env / environment)

AUTORESEARCH_ENABLED=true
AUTORESEARCH_DATASET=tinystories   # lower-entropy dataset, faster iteration on Mac
AUTORESEARCH_BACKEND=auto          # resolves to "mlx" on Apple Silicon
AUTORESEARCH_TIME_BUDGET=300       # 5-minute wall-clock budget per experiment
AUTORESEARCH_MAX_ITERATIONS=100
AUTORESEARCH_METRIC=val_bpb
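These variables could be loaded into a typed config along the lines of the sketch below; the dataclass and loader are illustrative, not the project's actual code.

```python
import os
from dataclasses import dataclass

@dataclass
class AutoresearchConfig:
    enabled: bool
    dataset: str
    backend: str
    time_budget: int      # seconds of wall-clock per experiment
    max_iterations: int
    metric: str

def load_config(env=os.environ) -> AutoresearchConfig:
    """Read AUTORESEARCH_* variables with the defaults documented above."""
    return AutoresearchConfig(
        enabled=env.get("AUTORESEARCH_ENABLED", "false").lower() == "true",
        dataset=env.get("AUTORESEARCH_DATASET", "tinystories"),
        backend=env.get("AUTORESEARCH_BACKEND", "auto"),
        time_budget=int(env.get("AUTORESEARCH_TIME_BUDGET", "300")),
        max_iterations=int(env.get("AUTORESEARCH_MAX_ITERATIONS", "100")),
        metric=env.get("AUTORESEARCH_METRIC", "val_bpb"),
    )
```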

Why TinyStories?

Karpathy's recommendation for resource-constrained hardware: lower entropy means the model can learn meaningful patterns in less time and with a smaller vocabulary, yielding cleaner val_bpb curves within the 5-minute budget.
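For context on the metric: `val_bpb` is validation bits per byte. Assuming the trainer reports mean cross-entropy in nats per token, the conversion is sketched below (the helper name is illustrative).

```python
import math

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_utf8_bytes: int) -> float:
    """Convert mean per-token NLL (nats) to bits per byte of raw text."""
    total_bits = mean_nll_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_utf8_bytes
```

Normalising by raw UTF-8 bytes rather than tokens makes runs with different tokenizers directly comparable.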


M3 Max Hardware Profile

| Spec | Value |
| --- | --- |
| Chip | Apple M3 Max |
| CPU cores | 16 (12P + 4E) |
| GPU cores | 40 |
| Unified RAM | 36 GB |
| Memory bandwidth | 400 GB/s |
| MLX support | Yes (confirmed #903) |

MLX utilises the unified memory architecture — model weights, activations, and training data all share the same physical pool, eliminating PCIe transfers. This gives M3 Max a significant throughput advantage over external GPU setups for models that fit in 36 GB.
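As a rough illustration of what "fits in 36 GB" means, a back-of-envelope check on weight size (the helper, headroom fraction, and fp16 assumption are all illustrative, not project code):

```python
def fits_in_unified_memory(params_billions: float,
                           bytes_per_param: int = 2,   # fp16/bf16 weights
                           ram_gb: float = 36.0,
                           headroom: float = 0.75) -> bool:
    """Rough check: weights should fit within a fraction of unified RAM,
    leaving room for activations, optimizer state, and the OS."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params x bytes / 1e9
    return weights_gb <= ram_gb * headroom
```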


Community Reference Data

| Hardware | Experiments | Succeeded | Failed | Outcome |
| --- | --- | --- | --- | --- |
| Mac Mini M4 | 35 | 7 | 28 | Model improved by simplifying |
| Shopify (overnight) | ~50 | n/a | n/a | 19% quality gain; smaller model beat 2× baseline |
| SkyPilot (16× GPU, 8 h) | ~910 | n/a | n/a | 2.87% improvement |
| Karpathy (H100, 2 days) | ~700 | 20+ | n/a | 11% training speedup |

Mac Mini M4 failure rate: 80% (28/35). Failures are expected and by design — the 5-minute budget deliberately prunes slow experiments. The 20% success rate still yielded an improved model.


Baseline Results (M3 Max)

Fill in after running: timmy learn --target <module> --metric val_bpb --budget 5 --max-experiments 50

| Run | Date | Experiments | Succeeded | val_bpb (start) | val_bpb (end) | Δ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | | | | | | |

Throughput estimate

Based on the M3 Max hardware profile and the Mac Mini M4 community data, expected throughput is roughly 8–14 experiments/hour with the 5-minute budget and TinyStories dataset. The M3 Max has 4× the GPU core count and more than 3× the memory bandwidth of the base M4, so throughput should at least match the Mac Mini M4 reference.
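The estimate above can be reproduced with simple arithmetic; the per-experiment overhead value is an assumption, not a measured figure.

```python
def experiments_per_hour(budget_s: int = 300,
                         overhead_s: int = 60,
                         parallelism: int = 1) -> float:
    """Upper-bound throughput: wall-clock budget plus per-experiment
    setup/teardown overhead, times the number of parallel workers."""
    return 3600.0 * parallelism / (budget_s + overhead_s)
```

With the 5-minute budget, overhead between 0 and 150 s gives roughly 8–12 sequential experiments/hour; light parallelism pushes the upper end higher.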


Apple Silicon Compatibility Notes

  • Install: pip install mlx mlx-lm
  • AUTORESEARCH_BACKEND=auto resolves to mlx on arm64 macOS
  • Pros: unified memory, no PCIe overhead, native Metal backend
  • Cons: MLX op coverage is a subset of PyTorch; some custom CUDA kernels won't port

llama.cpp path (fallback)

  • Use when MLX op support is insufficient
  • Set AUTORESEARCH_BACKEND=cpu to force CPU mode
  • Slower throughput but broader op compatibility

Known issues

  • subprocess.TimeoutExpired is the normal termination path — autoresearch treats timeout as a completed-but-pruned experiment, not a failure
  • Large batch sizes may trigger OOM if other processes hold unified memory; set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable the MPS high-watermark

Next Steps (H2)

See #904 Horizon 2 for the meta-autoresearch plan: expand experiment units from code changes → system configuration changes (prompts, tools, memory strategies).