# Autoresearch H1 — M3 Max Baseline

- Status: Baseline established (Issue #905)
- Hardware: Apple M3 Max · 36 GB unified memory
- Date: 2026-03-23
- Refs: #905 · #904 (parent) · #881 (M3 Max compute) · #903 (MLX benchmark)
## Setup

### Prerequisites
```bash
# Install MLX (Apple Silicon; faster than llama.cpp per the #903 benchmark)
pip install mlx mlx-lm

# Install project deps
tox -e dev  # or: pip install -e '.[dev]'
```
### Clone & prepare
`prepare_experiment` in `src/timmy/autoresearch.py` handles the clone. On Apple Silicon it automatically sets `AUTORESEARCH_BACKEND=mlx` and `AUTORESEARCH_DATASET=tinystories`.
```python
from timmy.autoresearch import prepare_experiment

status = prepare_experiment("data/experiments", dataset="tinystories", backend="auto")
print(status)
```
Or via the dashboard: `POST /experiments/start` (requires `AUTORESEARCH_ENABLED=true`).
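For scripting against the dashboard, the request can be assembled with the standard library. A minimal sketch, assuming the dashboard listens on `localhost:8000` and accepts a JSON body with `dataset` and `backend` fields (the payload shape and port are assumptions; only the `POST /experiments/start` route is documented):

```python
import json
import urllib.request


def build_start_request(base_url: str, dataset: str = "tinystories",
                        backend: str = "auto") -> urllib.request.Request:
    """Build (but do not send) the POST request for /experiments/start."""
    payload = json.dumps({"dataset": dataset, "backend": backend}).encode()
    return urllib.request.Request(
        f"{base_url}/experiments/start",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_start_request("http://localhost:8000")
# urllib.request.urlopen(req) would send it once the dashboard is running
```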
## Configuration (.env / environment)
```bash
AUTORESEARCH_ENABLED=true
AUTORESEARCH_DATASET=tinystories   # lower-entropy dataset, faster iteration on Mac
AUTORESEARCH_BACKEND=auto          # resolves to "mlx" on Apple Silicon
AUTORESEARCH_TIME_BUDGET=300       # 5-minute wall-clock budget per experiment
AUTORESEARCH_MAX_ITERATIONS=100
AUTORESEARCH_METRIC=val_bpb
```
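A hedged sketch of how these settings might be read at startup. The helper below is illustrative, not the project's actual loader, but the variable names and defaults mirror the block above:

```python
import os


def load_autoresearch_config(env=os.environ) -> dict:
    """Read AUTORESEARCH_* settings, falling back to the documented defaults."""
    return {
        "enabled": env.get("AUTORESEARCH_ENABLED", "false").lower() == "true",
        "dataset": env.get("AUTORESEARCH_DATASET", "tinystories"),
        "backend": env.get("AUTORESEARCH_BACKEND", "auto"),
        "time_budget": int(env.get("AUTORESEARCH_TIME_BUDGET", "300")),
        "max_iterations": int(env.get("AUTORESEARCH_MAX_ITERATIONS", "100")),
        "metric": env.get("AUTORESEARCH_METRIC", "val_bpb"),
    }


cfg = load_autoresearch_config({"AUTORESEARCH_ENABLED": "true"})
```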
## Why TinyStories?
Karpathy's recommendation for resource-constrained hardware: lower entropy means the model can learn meaningful patterns in less time and with a smaller vocabulary, yielding cleaner val_bpb curves within the 5-minute budget.
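For reference, `val_bpb` (bits per byte) can be derived from a mean cross-entropy loss expressed in nats per token. The conversion itself is standard; the token and byte counts below are made-up example numbers:

```python
import math


def val_bpb(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# e.g. a loss of ln(4) nats/token (= 2 bits/token) at 4 bytes/token gives 0.5 bpb
bpb = val_bpb(math.log(4), n_tokens=1000, n_bytes=4000)
```

Lower-entropy text like TinyStories drives this number down faster, which is what makes the curves readable within a 5-minute budget.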
## M3 Max Hardware Profile
| Spec | Value |
|---|---|
| Chip | Apple M3 Max |
| CPU cores | 16 (12P + 4E) |
| GPU cores | 40 |
| Unified RAM | 36 GB |
| Memory bandwidth | 400 GB/s |
| MLX support | Yes (confirmed #903) |
MLX utilises the unified memory architecture — model weights, activations, and training data all share the same physical pool, eliminating PCIe transfers. This gives M3 Max a significant throughput advantage over external GPU setups for models that fit in 36 GB.
## Community Reference Data
| Hardware | Experiments | Succeeded | Failed | Outcome |
|---|---|---|---|---|
| Mac Mini M4 | 35 | 7 | 28 | Model improved by simplifying |
| Shopify (overnight) | ~50 | — | — | 19% quality gain; smaller beat 2× baseline |
| SkyPilot (16× GPU, 8 h) | ~910 | — | — | 2.87% improvement |
| Karpathy (H100, 2 days) | ~700 | 20+ | — | 11% training speedup |
Mac Mini M4 failure rate: 80% (28/35). Failures are expected and by design: the 5-minute budget deliberately prunes slow experiments. The 20% success rate still yielded an improved model.
## Baseline Results (M3 Max)
Fill in after running:
```bash
timmy learn --target <module> --metric val_bpb --budget 5 --max-experiments 50
```
| Run | Date | Experiments | Succeeded | val_bpb (start) | val_bpb (end) | Δ |
|---|---|---|---|---|---|---|
| 1 | — | — | — | — | — | — |
### Throughput estimate
Based on the M3 Max hardware profile and the Mac Mini M4 community data, expected throughput is 8–14 experiments/hour with the 5-minute budget and TinyStories dataset. The M3 Max has more GPU cores and higher memory bandwidth than the Mac Mini's M4, so it should match or exceed the M4's per-experiment throughput.
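A back-of-envelope check on that range, assuming a fixed per-experiment setup/teardown overhead (the 60-second figure is an assumption; note that runs pruned well before the full budget can push throughput above the serial ceiling):

```python
def experiments_per_hour(budget_s: float, overhead_s: float) -> float:
    """Serial throughput given a per-experiment time budget plus fixed overhead."""
    return 3600.0 / (budget_s + overhead_s)


ceiling = experiments_per_hour(300, 0)    # theoretical serial ceiling: 12.0/hour
typical = experiments_per_hour(300, 60)   # with ~1 min setup/teardown: 10.0/hour
```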
## Apple Silicon Compatibility Notes

### MLX path (recommended)

- Install: `pip install mlx mlx-lm`
- `AUTORESEARCH_BACKEND=auto` resolves to `mlx` on arm64 macOS
- Pros: unified memory, no PCIe overhead, native Metal backend
- Cons: MLX op coverage is a subset of PyTorch; some custom CUDA kernels won't port
### llama.cpp path (fallback)

- Use when MLX op support is insufficient
- Set `AUTORESEARCH_BACKEND=cpu` to force CPU mode
- Slower throughput but broader op compatibility
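The `backend="auto"` resolution behind these two paths can be sketched as follows; this is a minimal illustration, and the real resolver in `src/timmy/autoresearch.py` may differ:

```python
import platform
import sys


def resolve_backend(requested: str = "auto") -> str:
    """Pick "mlx" on arm64 macOS, otherwise fall back to "cpu"."""
    if requested != "auto":
        return requested  # explicit override wins
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "mlx"
    return "cpu"
```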
### Known issues

- `subprocess.TimeoutExpired` is the normal termination path: autoresearch treats a timeout as a completed-but-pruned experiment, not a failure
- Large batch sizes may trigger OOM if other processes hold unified memory; set `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` to disable the MPS high-watermark limit
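The timeout-as-pruning behaviour can be sketched with `subprocess`; the command and status labels here are illustrative assumptions, not the project's actual runner:

```python
import subprocess


def run_with_budget(cmd: list[str], budget_s: float) -> str:
    """Run an experiment command, killing it at the budget and marking it pruned."""
    try:
        proc = subprocess.run(cmd, timeout=budget_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "pruned"  # normal termination path, not an error
    return "succeeded" if proc.returncode == 0 else "failed"


status = run_with_budget(["sleep", "5"], budget_s=0.2)  # -> "pruned"
```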
## Next Steps (H2)
See #904 Horizon 2 for the meta-autoresearch plan: expand experiment units from code changes → system configuration changes (prompts, tools, memory strategies).