# Autoresearch H1 — M3 Max Baseline

**Status:** Baseline established (Issue #905)
**Hardware:** Apple M3 Max · 36 GB unified memory
**Date:** 2026-03-23
**Refs:** #905 · #904 (parent) · #881 (M3 Max compute) · #903 (MLX benchmark)

---

## Setup

### Prerequisites

```bash
# Install MLX (Apple Silicon — benchmarked faster than llama.cpp per #903)
pip install mlx mlx-lm

# Install project deps
tox -e dev  # or: pip install -e '.[dev]'
```
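
To confirm the install actually picked up the Metal backend before launching experiments, a quick check from a Python shell is enough. This is only a sanity check, not part of the project tooling:

```python
import mlx.core as mx

# A small matmul forces MLX to initialise the Metal backend.
a = mx.ones((1024, 1024))
mx.eval(a @ a)

print(mx.default_device())  # expect Device(gpu, 0) on Apple Silicon
```
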

### Clone & prepare

`prepare_experiment` in `src/timmy/autoresearch.py` handles the clone.
On Apple Silicon it automatically sets `AUTORESEARCH_BACKEND=mlx` and
`AUTORESEARCH_DATASET=tinystories`.

```python
from timmy.autoresearch import prepare_experiment

status = prepare_experiment("data/experiments", dataset="tinystories", backend="auto")
print(status)
```

Or via the dashboard: `POST /experiments/start` (requires `AUTORESEARCH_ENABLED=true`).
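
For reference, a minimal client call against that endpoint might look like the sketch below. The base URL is a placeholder and the endpoint is assumed to take no request body; adjust both to your deployment:

```python
import requests

# Placeholder URL: point this at wherever the dashboard is actually served.
DASHBOARD_URL = "http://localhost:8000"

resp = requests.post(f"{DASHBOARD_URL}/experiments/start", timeout=10)
resp.raise_for_status()
print(resp.json())  # assumed to echo the experiment status back
```
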
### Configuration (`.env` / environment)

```
AUTORESEARCH_ENABLED=true
AUTORESEARCH_DATASET=tinystories   # lower-entropy dataset, faster iteration on Mac
AUTORESEARCH_BACKEND=auto          # resolves to "mlx" on Apple Silicon
AUTORESEARCH_TIME_BUDGET=300       # 5-minute wall-clock budget per experiment
AUTORESEARCH_MAX_ITERATIONS=100
AUTORESEARCH_METRIC=val_bpb
```
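
The `auto` resolution lives in `src/timmy/autoresearch.py`; the sketch below only restates the documented behaviour (resolve to `mlx` on arm64 macOS, otherwise fall back to `cpu`) and is not the project's actual code:

```python
import os
import platform
import sys

def resolve_backend() -> str:
    """Illustrative restatement of how AUTORESEARCH_BACKEND=auto is resolved."""
    backend = os.environ.get("AUTORESEARCH_BACKEND", "auto")
    if backend != "auto":
        return backend
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "mlx"   # Apple Silicon: use the MLX path
    return "cpu"       # documented fallback (llama.cpp path)

print(resolve_backend())  # "mlx" on an M-series Mac
```
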

### Why TinyStories?

Karpathy's recommendation for resource-constrained hardware: lower entropy
means the model can learn meaningful patterns in less time and with a smaller
vocabulary, yielding cleaner val_bpb curves within the 5-minute budget.
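
This document does not spell out how `val_bpb` is computed; a common definition, and the assumption here, is validation cross-entropy converted to bits and normalised by the raw byte count of the validation text, which makes runs with different tokenizers comparable:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Sum of validation token losses (nats) converted to bits, per byte of raw text."""
    return total_loss_nats / math.log(2) / total_bytes

# e.g. 1.2 nats/token over 250k validation tokens spanning ~1 MB of raw text
print(bits_per_byte(1.2 * 250_000, 1_000_000))  # ~0.43 bpb
```
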

---

## M3 Max Hardware Profile

| Spec | Value |
|------|-------|
| Chip | Apple M3 Max |
| CPU cores | 16 (12P + 4E) |
| GPU cores | 40 |
| Unified RAM | 36 GB |
| Memory bandwidth | 400 GB/s |
| MLX support | Yes (confirmed #903) |

MLX utilises the unified memory architecture — model weights, activations, and
training data all share the same physical pool, eliminating PCIe transfers.
This gives the M3 Max a significant throughput advantage over external GPU setups
for models that fit in 36 GB.
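
Concretely, MLX code never stages tensors onto a device the way CUDA PyTorch does; arrays are allocated once in unified memory and the GPU operates on them directly. A minimal illustration:

```python
import mlx.core as mx

x = mx.random.normal((4096, 4096))  # allocated directly in unified memory
w = mx.random.normal((4096, 4096))

y = x @ w   # runs on the GPU; no .to("cuda") / host-to-device copy step
mx.eval(y)  # MLX is lazy, so force the computation here
print(y.shape)  # result lives in the same unified memory pool
```
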

---

## Community Reference Data

| Hardware | Experiments | Succeeded | Failed | Outcome |
|----------|-------------|-----------|--------|---------|
| Mac Mini M4 | 35 | 7 | 28 | Model improved by simplifying |
| Shopify (overnight) | ~50 | — | — | 19% quality gain; smaller model beat 2× baseline |
| SkyPilot (16× GPU, 8 h) | ~910 | — | — | 2.87% improvement |
| Karpathy (H100, 2 days) | ~700 | 20+ | — | 11% training speedup |

**Mac Mini M4 failure rate: 80% (28/35).** Failures are expected and by design —
the 5-minute budget deliberately prunes slow experiments. The 20% success rate
still yielded an improved model.

---

## Baseline Results (M3 Max)

> Fill in after running: `timmy learn --target <module> --metric val_bpb --budget 5 --max-experiments 50`

| Run | Date | Experiments | Succeeded | val_bpb (start) | val_bpb (end) | Δ |
|-----|------|-------------|-----------|-----------------|---------------|---|
| 1 | — | — | — | — | — | — |

### Throughput estimate

Based on the M3 Max hardware profile and Mac Mini M4 community data, expected
throughput is **8–14 experiments/hour** with the 5-minute budget and TinyStories
dataset. The M3 Max has substantially more GPU cores and higher memory bandwidth
than the Mac Mini's M4, so per-experiment throughput should be at least comparable.
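
The 8–14/hour range brackets the serial ceiling implied by the budget: a single worker that always burns the full 300 s can do at most 12 experiments per hour, per-run setup/teardown overhead pushes that down, and runs that fail or finish early push it up. Rough numbers (the 60 s overhead is an assumption, not a measurement):

```python
budget_s = 300   # AUTORESEARCH_TIME_BUDGET
overhead_s = 60  # assumed per-run setup/teardown cost

print(3600 / budget_s)                 # 12.0 experiments/hour with zero overhead
print(3600 / (budget_s + overhead_s))  # 10.0 experiments/hour with overhead
```
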

---

## Apple Silicon Compatibility Notes

### MLX path (recommended)

- Install: `pip install mlx mlx-lm`
- `AUTORESEARCH_BACKEND=auto` resolves to `mlx` on arm64 macOS
- Pros: unified memory, no PCIe overhead, native Metal backend
- Cons: MLX op coverage is a subset of PyTorch's; some custom CUDA kernels won't port

### llama.cpp path (fallback)

- Use when MLX op support is insufficient
- Set `AUTORESEARCH_BACKEND=cpu` to force CPU mode
- Slower throughput but broader op compatibility

### Known issues

- `subprocess.TimeoutExpired` is the normal termination path — autoresearch
  treats timeout as a completed-but-pruned experiment, not a failure (see the
  sketch after this list)
- Large batch sizes may trigger OOM if other processes hold unified memory;
  set `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` to disable the MPS high-watermark limit
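
A minimal sketch of the timeout-as-pruning behaviour from the first bullet, using only the standard library. The command, entry point, and return labels are hypothetical, not the autoresearch runner's actual interface:

```python
import subprocess

def run_experiment(cmd: list[str], budget_s: int = 300) -> str:
    """Run one experiment under the wall-clock budget."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=budget_s)
        return "succeeded" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # Hitting the budget is the expected pruning path, not an error.
        return "pruned"

print(run_experiment(["python", "train.py"]))  # "train.py" is a placeholder
```
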
---

## Next Steps (H2)

See #904 Horizon 2 for the meta-autoresearch plan: expand experiment units from
code changes → system configuration changes (prompts, tools, memory strategies).